* [PATCH] xfs: optimise CIL insertion during transaction commit [RFC]
@ 2013-07-01  5:44 Dave Chinner
  2013-07-04  2:09 ` Mark Tinguely
  2013-07-08 12:44   ` Dave Chinner
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-01  5:44 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Note: This is an RFC right now - it'll need to be broken up into
several patches for final submission.

The CIL insertion during transaction commit currently does multiple
passes across the transaction objects and requires multiple memory
allocations per object that is to be inserted into the CIL. It is
quite inefficient, and as such xfs_log_commit_cil() and its
children show up quite highly in profiles under metadata
modification intensive workloads.

The current insertion tries to minimise the number of times the
xc_cil_lock is grabbed and the hold times via a couple of methods:

	1. an initial loop across the transaction items outside the
	lock to allocate log vectors, buffers and copy the data into
	them.
	2. a second pass across the log vectors that then inserts
	them into the CIL, modifies the CIL state and frees the old
	vectors.

This is somewhat inefficient. While it minimises the number of lock
grabs, we still free objects with the spinlock held, so the hold
time is much higher than it needs to be.
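
Roughly, the current commit path looks like this (heavily condensed
from the existing xlog_cil_prepare_log_vecs() and
xlog_cil_insert_items(); declarations and error handling omitted):

	/* pass 1: no locks held - allocate and format each dirty item */
	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
		niovecs = IOP_SIZE(lidp->lid_item);	/* vector count only */
		new_lv = kmem_zalloc(sizeof(*new_lv) +
				niovecs * sizeof(struct xfs_log_iovec),
				KM_SLEEP|KM_NOFS);	/* allocation one */
		new_lv->lv_item = lidp->lid_item;
		new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
		IOP_FORMAT(new_lv->lv_item, new_lv->lv_iovecp);
		/* sum the iovec lengths into len, then... */
		new_lv->lv_buf = kmem_alloc(len, KM_SLEEP|KM_NOFS);
							/* allocation two */
		/* ...and copy each iovec's data into lv_buf */
	}

	/* pass 2: insert, account and free, all under the CIL lock */
	spin_lock(&cil->xc_cil_lock);
	for (lv = log_vector; lv; lv = next) {
		next = lv->lv_next;
		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
		/* ... which does kmem_free(old->lv_buf) and kmem_free(old)
		 * while xc_cil_lock is still held */
	}
	spin_unlock(&cil->xc_cil_lock);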

Optimisations that can be made:

	1. change IOP_SIZE() to return the size of the metadata to
	be logged along with the number of vectors required. This
	means we can allocate the buffer to hold the metadata as
	part of the allocation for the log vector array. One
	allocation instead of two (see the sketch after this list).

	2. Store the size of the allocated region for the log vector
	in the log vector itself. This allows us to determine if we
	can just reuse the existing log vector for the new
	transaction, avoiding all memory allocation for that item.

	3. when setting up a new log vector, we don't need to
	hold the CIL lock while determining the difference in its
	size when compared to the old one. Hence we can do all the
	accounting, swapping of the log vectors and freeing the old
	log vectors outside the xc_cil_lock.

	4. We can do all the CIL insertion of items and busy
	extents and the accounting under a single grab of the
	xc_cil_lock. This minimises the number of times we need to
	grab it and hence contention points are kept to a minimum.

	5. the xc_cil_lock is used for serialising both the CIL and
	the push/commit ordering. Separate the two functions into
	different locks so push/commit ordering does not affect CIL
	insertion.

	6. the xc_cil_lock shares cachelines with other locks.
	Separate them onto different cachelines.
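
As a rough illustration of points 1 and 2, the insertion path now
does something like this for each dirty log item (condensed from the
patch below; declarations and the ordered-item case omitted):

	/* get both the vector count and the data size up front */
	lip->li_ops->iop_size(lip, &niovecs, &nbytes);

	buf_size = sizeof(struct xfs_log_vec) + nbytes +
			niovecs * sizeof(struct xfs_log_iovec);

	if (lip->li_lv && buf_size <= lip->li_lv->lv_size) {
		/* fits in the existing log vector - no allocation at all */
		xlog_cil_lv_item_format(lip, lip->li_lv);
	} else {
		/* one allocation carries the lv, the iovec array and the
		 * formatted data */
		lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS);
		lv->lv_item = lip;
		lv->lv_size = buf_size;
		lv->lv_niovecs = niovecs;
		lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
		lv->lv_buf = (char *)lv + buf_size - nbytes;
		lv->lv_buf_len = nbytes;
		xlog_cil_lv_item_format(lip, lv);
	}
	/* the old lv, if any, is swapped out and freed outside xc_cil_lock */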

The result is that my standard fsmark benchmark (8-way, 50m files)
on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
gives the following results with an xfs-oss tree. No CRCs:

                vanilla         patched         Difference
create  (time)  483s            435s            -10.0%  (faster)
        (rate)  109k+/-6k       122k+/-7k       +11.9%  (faster)

walk            339s            335s            (noise)
     (sys cpu)  1134s           1135s           (noise)

unlink          692s            645s             -6.8%  (faster)

So it's significantly faster than the current code, and lock_stat
reports lower contention on the xc_cil_lock, too. So, big win here.

With CRCs:

                vanilla         patched         Difference
create  (time)  510s            460s             -9.8%  (faster)
        (rate)  105k+/-5.4k     117k+/-5k       +11.4%  (faster)

walk            494s            486s            (noise)
     (sys cpu)  1324s           1290s           (noise)

unlink          959s            889s             -7.3%  (faster)

Gains are of the same order, with walk and unlink still affected by
VFS LRU lock contention. IOWs, with these changes, filesystems with
CRCs enabled will still be faster than the old non-CRC kernels...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c     |   52 ++++---
 fs/xfs/xfs_dquot_item.c   |   22 +--
 fs/xfs/xfs_extfree_item.c |   50 ++++---
 fs/xfs/xfs_icreate_item.c |    9 +-
 fs/xfs/xfs_inode_item.c   |   52 ++++---
 fs/xfs/xfs_log.h          |    1 +
 fs/xfs/xfs_log_cil.c      |  366 +++++++++++++++++++++++----------------------
 fs/xfs/xfs_log_priv.h     |    7 +-
 fs/xfs/xfs_trans.h        |    4 +-
 9 files changed, 311 insertions(+), 252 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index bfc4e0c..9358504 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -39,6 +39,14 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
 
 STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
 
+static inline int
+xfs_buf_log_format_size(
+	struct xfs_buf_log_format *blfp)
+{
+	return offsetof(struct xfs_buf_log_format, blf_data_map) +
+			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
+}
+
 /*
  * This returns the number of log iovecs needed to log the
  * given buf log item.
@@ -49,25 +57,27 @@ STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
  *
  * If the XFS_BLI_STALE flag has been set, then log nothing.
  */
-STATIC uint
+STATIC void
 xfs_buf_item_size_segment(
 	struct xfs_buf_log_item	*bip,
-	struct xfs_buf_log_format *blfp)
+	struct xfs_buf_log_format *blfp,
+	int			*nvecs,
+	int			*nbytes)
 {
 	struct xfs_buf		*bp = bip->bli_buf;
-	uint			nvecs;
 	int			next_bit;
 	int			last_bit;
 
 	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
 	if (last_bit == -1)
-		return 0;
+		return;
 
 	/*
 	 * initial count for a dirty buffer is 2 vectors - the format structure
 	 * and the first dirty region.
 	 */
-	nvecs = 2;
+	*nvecs += 2;
+	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
 
 	while (last_bit != -1) {
 		/*
@@ -87,18 +97,17 @@ xfs_buf_item_size_segment(
 			break;
 		} else if (next_bit != last_bit + 1) {
 			last_bit = next_bit;
-			nvecs++;
+			(*nvecs)++;
 		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
 			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
 			    XFS_BLF_CHUNK)) {
 			last_bit = next_bit;
-			nvecs++;
+			(*nvecs)++;
 		} else {
 			last_bit++;
 		}
+		*nbytes += XFS_BLF_CHUNK;
 	}
-
-	return nvecs;
 }
 
 /*
@@ -118,12 +127,13 @@ xfs_buf_item_size_segment(
  * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
  * format structures.
  */
-STATIC uint
+STATIC void
 xfs_buf_item_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
-	uint			nvecs;
 	int			i;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
@@ -135,7 +145,11 @@ xfs_buf_item_size(
 		 */
 		trace_xfs_buf_item_size_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
-		return bip->bli_format_count;
+		*nvecs += bip->bli_format_count;
+		for (i = 0; i < bip->bli_format_count; i++) {
+			*nbytes += xfs_buf_log_format_size(&bip->bli_formats[i]);
+		}
+		return;
 	}
 
 	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
@@ -147,7 +161,8 @@ xfs_buf_item_size(
 		 * commit, so no vectors are used at all.
 		 */
 		trace_xfs_buf_item_size_ordered(bip);
-		return XFS_LOG_VEC_ORDERED;
+		*nvecs = XFS_LOG_VEC_ORDERED;
+		return;
 	}
 
 	/*
@@ -159,13 +174,11 @@ xfs_buf_item_size(
 	 * count for the extra buf log format structure that will need to be
 	 * written.
 	 */
-	nvecs = 0;
 	for (i = 0; i < bip->bli_format_count; i++) {
-		nvecs += xfs_buf_item_size_segment(bip, &bip->bli_formats[i]);
+		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
+					  nvecs, nbytes);
 	}
-
 	trace_xfs_buf_item_size(bip);
-	return nvecs;
 }
 
 static struct xfs_log_iovec *
@@ -192,8 +205,7 @@ xfs_buf_item_format_segment(
 	 * the actual size of the dirty bitmap rather than the size of the in
 	 * memory structure.
 	 */
-	base_size = offsetof(struct xfs_buf_log_format, blf_data_map) +
-			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
+	base_size = xfs_buf_log_format_size(blfp);
 
 	nvecs = 0;
 	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index f07a436..60c6e1f 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -44,14 +44,15 @@ static inline struct xfs_dq_logitem *DQUOT_ITEM(struct xfs_log_item *lip)
 /*
  * returns the number of iovecs needed to log the given dquot item.
  */
-STATIC uint
+STATIC void
 xfs_qm_dquot_logitem_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
-	/*
-	 * we need only two iovecs, one for the format, one for the real thing
-	 */
-	return 2;
+	*nvecs += 2;
+	*nbytes += sizeof(struct xfs_dq_logformat) +
+		   sizeof(struct xfs_disk_dquot);
 }
 
 /*
@@ -286,11 +287,14 @@ static inline struct xfs_qoff_logitem *QOFF_ITEM(struct xfs_log_item *lip)
  * We only need 1 iovec for an quotaoff item.  It just logs the
  * quotaoff_log_format structure.
  */
-STATIC uint
+STATIC void
 xfs_qm_qoff_logitem_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
-	return 1;
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_qoff_logitem);
 }
 
 /*
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 452920a..dc53e8f 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -73,11 +73,22 @@ __xfs_efi_release(
  * We only need 1 iovec for an efi item.  It just logs the efi_log_format
  * structure.
  */
-STATIC uint
+static inline int
+xfs_efi_item_sizeof(
+	struct xfs_efi_log_item *efip)
+{
+	return sizeof(struct xfs_efi_log_format) +
+	       (efip->efi_format.efi_nextents - 1) * sizeof(xfs_extent_t);
+}
+
+STATIC void
 xfs_efi_item_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
-	return 1;
+	*nvecs += 1;
+	*nbytes += xfs_efi_item_sizeof(EFI_ITEM(lip));
 }
 
 /*
@@ -93,21 +104,17 @@ xfs_efi_item_format(
 	struct xfs_log_iovec	*log_vector)
 {
 	struct xfs_efi_log_item	*efip = EFI_ITEM(lip);
-	uint			size;
 
 	ASSERT(atomic_read(&efip->efi_next_extent) ==
 				efip->efi_format.efi_nextents);
 
 	efip->efi_format.efi_type = XFS_LI_EFI;
-
-	size = sizeof(xfs_efi_log_format_t);
-	size += (efip->efi_format.efi_nextents - 1) * sizeof(xfs_extent_t);
 	efip->efi_format.efi_size = 1;
 
 	log_vector->i_addr = &efip->efi_format;
-	log_vector->i_len = size;
+	log_vector->i_len = xfs_efi_item_sizeof(efip);
 	log_vector->i_type = XLOG_REG_TYPE_EFI_FORMAT;
-	ASSERT(size >= sizeof(xfs_efi_log_format_t));
+	ASSERT(log_vector->i_len >= sizeof(xfs_efi_log_format_t));
 }
 
 
@@ -333,11 +340,22 @@ xfs_efd_item_free(struct xfs_efd_log_item *efdp)
  * We only need 1 iovec for an efd item.  It just logs the efd_log_format
  * structure.
  */
-STATIC uint
+static inline int
+xfs_efd_item_sizeof(
+	struct xfs_efd_log_item *efdp)
+{
+	return sizeof(xfs_efd_log_format_t) +
+	       (efdp->efd_format.efd_nextents - 1) * sizeof(xfs_extent_t);
+}
+
+STATIC void
 xfs_efd_item_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
-	return 1;
+	*nvecs += 1;
+	*nbytes += xfs_efd_item_sizeof(EFD_ITEM(lip));
 }
 
 /*
@@ -353,20 +371,16 @@ xfs_efd_item_format(
 	struct xfs_log_iovec	*log_vector)
 {
 	struct xfs_efd_log_item	*efdp = EFD_ITEM(lip);
-	uint			size;
 
 	ASSERT(efdp->efd_next_extent == efdp->efd_format.efd_nextents);
 
 	efdp->efd_format.efd_type = XFS_LI_EFD;
-
-	size = sizeof(xfs_efd_log_format_t);
-	size += (efdp->efd_format.efd_nextents - 1) * sizeof(xfs_extent_t);
 	efdp->efd_format.efd_size = 1;
 
 	log_vector->i_addr = &efdp->efd_format;
-	log_vector->i_len = size;
+	log_vector->i_len = xfs_efd_item_sizeof(efdp);
 	log_vector->i_type = XLOG_REG_TYPE_EFD_FORMAT;
-	ASSERT(size >= sizeof(xfs_efd_log_format_t));
+	ASSERT(log_vector->i_len >= sizeof(xfs_efd_log_format_t));
 }
 
 /*
diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c
index 441a78a..5a5a593 100644
--- a/fs/xfs/xfs_icreate_item.c
+++ b/fs/xfs/xfs_icreate_item.c
@@ -40,11 +40,14 @@ static inline struct xfs_icreate_item *ICR_ITEM(struct xfs_log_item *lip)
  *
  * We only need one iovec for the icreate log structure.
  */
-STATIC uint
+STATIC void
 xfs_icreate_item_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
-	return 1;
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_icreate_log);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f76ff52..366b3bd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -47,32 +47,44 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
  * inode core, and possibly one for the inode data/extents/b-tree root
  * and one for the inode attribute data/extents/b-tree root.
  */
-STATIC uint
+STATIC void
 xfs_inode_item_size(
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
 {
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
-	uint			nvecs = 2;
+
+	*nvecs += 2;
+	*nbytes += sizeof(struct xfs_inode_log_format) +
+		   xfs_icdinode_size(ip->i_d.di_version);
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
 		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
-		    ip->i_df.if_bytes > 0)
-			nvecs++;
+		    ip->i_df.if_bytes > 0) {
+			/* worst case, doesn't subtract delalloc extents */
+			*nbytes += ip->i_df.if_bytes;
+			*nvecs += 1;
+		}
 		break;
 
 	case XFS_DINODE_FMT_BTREE:
 		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
-		    ip->i_df.if_broot_bytes > 0)
-			nvecs++;
+		    ip->i_df.if_broot_bytes > 0) {
+			*nbytes += ip->i_df.if_broot_bytes;
+			*nvecs += 1;
+		}
 		break;
 
 	case XFS_DINODE_FMT_LOCAL:
 		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
-		    ip->i_df.if_bytes > 0)
-			nvecs++;
+		    ip->i_df.if_bytes > 0) {
+			*nbytes += roundup(ip->i_df.if_bytes, 4);
+			*nvecs += 1;
+		}
 		break;
 
 	case XFS_DINODE_FMT_DEV:
@@ -85,7 +97,7 @@ xfs_inode_item_size(
 	}
 
 	if (!XFS_IFORK_Q(ip))
-		return nvecs;
+		return;
 
 
 	/*
@@ -95,28 +107,32 @@ xfs_inode_item_size(
 	case XFS_DINODE_FMT_EXTENTS:
 		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
-		    ip->i_afp->if_bytes > 0)
-			nvecs++;
+		    ip->i_afp->if_bytes > 0) {
+			*nbytes += ip->i_afp->if_bytes;
+			*nvecs += 1;
+		}
 		break;
 
 	case XFS_DINODE_FMT_BTREE:
 		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
-		    ip->i_afp->if_broot_bytes > 0)
-			nvecs++;
+		    ip->i_afp->if_broot_bytes > 0) {
+			*nbytes += ip->i_afp->if_broot_bytes;
+			*nvecs += 1;
+		}
 		break;
 
 	case XFS_DINODE_FMT_LOCAL:
 		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
-		    ip->i_afp->if_bytes > 0)
-			nvecs++;
+		    ip->i_afp->if_bytes > 0) {
+			*nbytes += roundup(ip->i_afp->if_bytes, 4);
+			*nvecs += 1;
+		}
 		break;
 
 	default:
 		ASSERT(0);
 		break;
 	}
-
-	return nvecs;
 }
 
 /*
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index e63d9e1..1c45848 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -27,6 +27,7 @@ struct xfs_log_vec {
 	struct xfs_log_item	*lv_item;	/* owner */
 	char			*lv_buf;	/* formatted buffer */
 	int			lv_buf_len;	/* size of formatted buffer */
+	int			lv_size;	/* size of allocated lv */
 };
 
 #define XFS_LOG_VEC_ORDERED	(-1)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 02b9cf3..4425406 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -77,7 +77,85 @@ xlog_cil_init_post_recovery(
 	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
 	log->l_cilp->xc_ctx->sequence = 1;
 	log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle,
-								log->l_curr_block);
+							  log->l_curr_block);
+}
+
+/*
+ * Prepare the log item for insertion into the CIL. Calculate the difference in
+ * log space and vectors it will consume, and if it is a new item pin it as
+ * well.
+ */
+STATIC void
+xfs_cil_prepare_item(
+	struct xlog		*log,
+	struct xfs_log_vec	*lv,
+	struct xfs_log_vec	*old_lv,
+	int			*len,
+	int			*diff_iovecs)
+{
+	if (old_lv) {
+		/* existing lv on log item, space used is a delta */
+		ASSERT(old_lv->lv_buf_len != lv->lv_buf_len ||
+		       old_lv->lv_niovecs != lv->lv_niovecs);
+		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
+
+		*len += lv->lv_buf_len - old_lv->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs - old_lv->lv_niovecs;
+	} else {
+		/* new lv, must pin the log item */
+		ASSERT(!lv->lv_item->li_lv);
+
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
+			*len += lv->lv_buf_len;
+			*diff_iovecs += lv->lv_niovecs;
+		}
+		IOP_PIN(lv->lv_item);
+
+	}
+
+	/* attach new log vector to log item */
+	lv->lv_item->li_lv = lv;
+
+	/*
+	 * If this is the first time the item is being committed to the
+	 * CIL, store the sequence number on the log item so we can
+	 * tell in future commits whether this is the first checkpoint
+	 * the item is being committed into.
+	 */
+	if (!lv->lv_item->li_seq)
+		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
+}
+
+STATIC void
+xlog_cil_lv_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	int	index;
+	char	*ptr;
+
+	/* format new vectors into array */
+	lip->li_ops->iop_format(lip, lv->lv_iovecp);
+
+	/* copy data into existing array */
+	ptr = lv->lv_buf;
+	for (index = 0; index < lv->lv_niovecs; index++) {
+		struct xfs_log_iovec *vec = &lv->lv_iovecp[index];
+
+		//printk("\tptr 0x%p, vlen 0x%x\n", ptr, vec->i_len);
+		memcpy(ptr, vec->i_addr, vec->i_len);
+		vec->i_addr = ptr;
+		ptr += vec->i_len;
+	}
+
+	/*
+	 * some size calculations for log vectors over-estimate, so we'll see
+	 * this as not using the entire buffer here. Hence just check for
+	 * overruns on debug kernels.
+	 */
+	//printk("ptr 0x%p, lvb 0x%p lvbe 0x%p\n",
+		//ptr, lv->lv_buf, lv->lv_buf + lv->lv_buf_len);
+	ASSERT(ptr > lv->lv_buf && ptr <= lv->lv_buf + lv->lv_buf_len);
 }
 
 /*
@@ -106,37 +184,37 @@ xlog_cil_init_post_recovery(
  * format the regions into the iclog as though they are being formatted
  * directly out of the objects themselves.
  */
-static struct xfs_log_vec *
-xlog_cil_prepare_log_vecs(
-	struct xfs_trans	*tp)
+void
+xlog_cil_insert_format_items(
+	struct xlog		*log,
+	struct xfs_trans	*tp,
+	int			*diff_len,
+	int			*diff_iovecs)
 {
 	struct xfs_log_item_desc *lidp;
-	struct xfs_log_vec	*lv = NULL;
-	struct xfs_log_vec	*ret_lv = NULL;
-
 
 	/* Bail out if we didn't find a log item.  */
 	if (list_empty(&tp->t_items)) {
 		ASSERT(0);
-		return NULL;
+		return;
 	}
 
+	/* walk each transaction item */
 	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
-		struct xfs_log_vec *new_lv;
-		void	*ptr;
-		int	index;
-		int	len = 0;
-		uint	niovecs;
-		bool	ordered = false;
+		struct xfs_log_item	*lip = lidp->lid_item;
+		struct xfs_log_vec	*lv;
+		struct xfs_log_vec	*old_lv;
+		int			niovecs = 0;
+		int			nbytes = 0;
+		int			buf_size;
+		bool			ordered = false;
 
 		/* Skip items which aren't dirty in this transaction. */
 		if (!(lidp->lid_flags & XFS_LID_DIRTY))
 			continue;
 
-		/* Skip items that do not have any vectors for writing */
-		niovecs = IOP_SIZE(lidp->lid_item);
-		if (!niovecs)
-			continue;
+		/* get number of vecs and size of data to be stored */
+		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
 
 		/*
 		 * Ordered items need to be tracked but we do not wish to write
@@ -146,111 +224,63 @@ xlog_cil_prepare_log_vecs(
 		if (niovecs == XFS_LOG_VEC_ORDERED) {
 			ordered = true;
 			niovecs = 0;
+			nbytes = 0;
 		}
 
-		new_lv = kmem_zalloc(sizeof(*new_lv) +
-				niovecs * sizeof(struct xfs_log_iovec),
-				KM_SLEEP|KM_NOFS);
-
-		new_lv->lv_item = lidp->lid_item;
-		new_lv->lv_niovecs = niovecs;
-		if (ordered) {
-			/* track as an ordered logvec */
-			new_lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
-			goto next;
-		}
-
-		/* The allocated iovec region lies beyond the log vector. */
-		new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
-
-		/* build the vector array and calculate it's length */
-		IOP_FORMAT(new_lv->lv_item, new_lv->lv_iovecp);
-		for (index = 0; index < new_lv->lv_niovecs; index++)
-			len += new_lv->lv_iovecp[index].i_len;
-
-		new_lv->lv_buf_len = len;
-		new_lv->lv_buf = kmem_alloc(new_lv->lv_buf_len,
-				KM_SLEEP|KM_NOFS);
-		ptr = new_lv->lv_buf;
-
-		for (index = 0; index < new_lv->lv_niovecs; index++) {
-			struct xfs_log_iovec *vec = &new_lv->lv_iovecp[index];
-
-			memcpy(ptr, vec->i_addr, vec->i_len);
-			vec->i_addr = ptr;
-			ptr += vec->i_len;
-		}
-		ASSERT(ptr == new_lv->lv_buf + new_lv->lv_buf_len);
-
-next:
-		if (!ret_lv)
-			ret_lv = new_lv;
-		else
-			lv->lv_next = new_lv;
-		lv = new_lv;
-	}
-
-	return ret_lv;
-}
-
-/*
- * Prepare the log item for insertion into the CIL. Calculate the difference in
- * log space and vectors it will consume, and if it is a new item pin it as
- * well.
- */
-STATIC void
-xfs_cil_prepare_item(
-	struct xlog		*log,
-	struct xfs_log_vec	*lv,
-	int			*len,
-	int			*diff_iovecs)
-{
-	struct xfs_log_vec	*old = lv->lv_item->li_lv;
+		/* calc buffer size */
+		buf_size = sizeof(struct xfs_log_vec) + nbytes +
+				niovecs * sizeof(struct xfs_log_iovec);
 
-	if (old) {
-		/* existing lv on log item, space used is a delta */
-		ASSERT((old->lv_buf && old->lv_buf_len && old->lv_niovecs) ||
-			old->lv_buf_len == XFS_LOG_VEC_ORDERED);
+		/* compare to existing item size */
+		if (lip->li_lv && buf_size <= lip->li_lv->lv_size) {
+			/* same or smaller, optimise common overwrite case */
 
-		/*
-		 * If the new item is ordered, keep the old one that is already
-		 * tracking dirty or ordered regions
-		 */
-		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
-			ASSERT(!lv->lv_buf);
-			kmem_free(lv);
-			return;
+			/* format log item vectors into existing lv */
+			if (!ordered)
+				xlog_cil_lv_item_format(lip, lip->li_lv);
+			continue;
 		}
 
-		*len += lv->lv_buf_len - old->lv_buf_len;
-		*diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
-		kmem_free(old->lv_buf);
-		kmem_free(old);
-	} else {
-		/* new lv, must pin the log item */
-		ASSERT(!lv->lv_item->li_lv);
+		/* different */
 
-		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
-			*len += lv->lv_buf_len;
-			*diff_iovecs += lv->lv_niovecs;
+		/* allocate new data chunk */
+		lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS);
+		lv->lv_item = lip;
+		lv->lv_size = buf_size;
+		lv->lv_niovecs = niovecs;
+		if (ordered) {
+			/* track as an ordered logvec */
+			ASSERT(lip->li_lv == NULL);
+			lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
+			goto insert;
 		}
-		IOP_PIN(lv->lv_item);
 
+		/* The allocated iovec region lies beyond the log vector. */
+		lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
+
+		/* the allocated data region lies beyond the iovec region */
+		lv->lv_buf = (char *)lv + buf_size - nbytes;
+		lv->lv_buf_len = nbytes;
+
+		//printk("lv 0x%p, ivp 0x%p, buf 0x%p\n",
+			//lv, lv->lv_iovecp, lv->lv_buf);
+		//printk("size 0x%x, blen 0x%x, nvecs 0x%x\n",
+			//lv->lv_size, lv->lv_buf_len, lv->lv_niovecs);
+
+		/* format item in new lv */
+		xlog_cil_lv_item_format(lip, lv);
+insert:
+		/* update new CIL space usage */
+		old_lv = lip->li_lv;
+		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
+
+		/* free old lv */
+		kmem_free(old_lv);
 	}
 
-	/* attach new log vector to log item */
-	lv->lv_item->li_lv = lv;
-
-	/*
-	 * If this is the first time the item is being committed to the
-	 * CIL, store the sequence number on the log item so we can
-	 * tell in future commits whether this is the first checkpoint
-	 * the item is being committed into.
-	 */
-	if (!lv->lv_item->li_seq)
-		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
 }
 
+
 /*
  * Insert the log items into the CIL and calculate the difference in space
  * consumed by the item. Add the space to the checkpoint ticket and calculate
@@ -261,53 +291,49 @@ xfs_cil_prepare_item(
 static void
 xlog_cil_insert_items(
 	struct xlog		*log,
-	struct xfs_log_vec	*log_vector,
+	struct xfs_trans	*tp,
 	struct xlog_ticket	*ticket)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
-	struct xfs_log_vec	*lv;
+	struct xfs_log_item_desc *lidp;
 	int			len = 0;
 	int			diff_iovecs = 0;
 	int			iclog_space;
 
-	ASSERT(log_vector);
+	ASSERT(tp);
 
 	/*
-	 * Do all the accounting aggregation and switching of log vectors
-	 * around in a separate loop to the insertion of items into the CIL.
-	 * Then we can do a separate loop to update the CIL within a single
-	 * lock/unlock pair. This reduces the number of round trips on the CIL
-	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
-	 * hold time for the transaction commit.
-	 *
-	 * If this is the first time the item is being placed into the CIL in
-	 * this context, pin it so it can't be written to disk until the CIL is
-	 * flushed to the iclog and the iclog written to disk.
-	 *
 	 * We can do this safely because the context can't checkpoint until we
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
+	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
+
+
+	/*
+	 * Now (re-)position everything modified at the tail of the CIL.
+	 * We do this here so we only need to take the CIL lock once during
+	 * the transaction commit.
+	 */
 	spin_lock(&cil->xc_cil_lock);
-	for (lv = log_vector; lv; ) {
-		struct xfs_log_vec *next = lv->lv_next;
+	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
+		struct xfs_log_item	*lip = lidp->lid_item;
 
-		ASSERT(lv->lv_item->li_lv || list_empty(&lv->lv_item->li_cil));
-		lv->lv_next = NULL;
+		/* Skip items which aren't dirty in this transaction. */
+		if (!(lidp->lid_flags & XFS_LID_DIRTY))
+			continue;
 
-		/*
-		 * xfs_cil_prepare_item() may free the lv, so move the item on
-		 * the CIL first.
-		 */
-		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
-		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
-		lv = next;
+		list_move_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
 	/* account for space used by new iovec headers  */
 	len += diff_iovecs * sizeof(xlog_op_header_t);
 	ctx->nvecs += diff_iovecs;
 
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+
 	/*
 	 * Now transfer enough transaction reservation to the context ticket
 	 * for the checkpoint. The context ticket is special - the unit
@@ -350,7 +376,6 @@ xlog_cil_free_logvec(
 
 	for (lv = log_vector; lv; ) {
 		struct xfs_log_vec *next = lv->lv_next;
-		kmem_free(lv->lv_buf);
 		kmem_free(lv);
 		lv = next;
 	}
@@ -376,9 +401,9 @@ xlog_cil_committed(
 	xfs_extent_busy_clear(mp, &ctx->busy_extents,
 			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
 
-	spin_lock(&ctx->cil->xc_cil_lock);
+	spin_lock(&ctx->cil->xc_push_lock);
 	list_del(&ctx->committing);
-	spin_unlock(&ctx->cil->xc_cil_lock);
+	spin_unlock(&ctx->cil->xc_push_lock);
 
 	xlog_cil_free_logvec(ctx->lv_chain);
 
@@ -433,7 +458,7 @@ xlog_cil_push(
 	down_write(&cil->xc_ctx_lock);
 	ctx = cil->xc_ctx;
 
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
 
@@ -444,10 +469,10 @@ xlog_cil_push(
 	 */
 	if (list_empty(&cil->xc_cil)) {
 		cil->xc_push_seq = 0;
-		spin_unlock(&cil->xc_cil_lock);
+		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 
 
 	/* check for a previously pushed seqeunce */
@@ -515,9 +540,9 @@ xlog_cil_push(
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
 	 */
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	list_add(&ctx->committing, &cil->xc_committing);
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
 	/*
@@ -552,7 +577,7 @@ xlog_cil_push(
 	 * order the commit records so replay will get them in the right order.
 	 */
 restart:
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(new_ctx, &cil->xc_committing, committing) {
 		/*
 		 * Higher sequences will wait for this one so skip them.
@@ -565,11 +590,11 @@ restart:
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
-			xlog_wait(&cil->xc_commit_wait, &cil->xc_cil_lock);
+			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
 	}
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 
 	/* xfs_log_done always frees the ticket on error. */
 	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, 0);
@@ -588,10 +613,10 @@ restart:
 	 * callbacks to the iclog we can assign the commit LSN to the context
 	 * and wake up anyone who is waiting for the commit to complete.
 	 */
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	ctx->commit_lsn = commit_lsn;
 	wake_up_all(&cil->xc_commit_wait);
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 
 	/* release the hounds! */
 	return xfs_log_release_iclog(log->l_mp, commit_iclog);
@@ -644,12 +669,12 @@ xlog_cil_push_background(
 	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
 		return;
 
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	}
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 
 }
 
@@ -672,14 +697,14 @@ xlog_cil_push_foreground(
 	 * If the CIL is empty or we've already pushed the sequence then
 	 * there's no work we need to do.
 	 */
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
-		spin_unlock(&cil->xc_cil_lock);
+		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
 
 	cil->xc_push_seq = push_seq;
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 
 	/* do the push now */
 	xlog_cil_push(log);
@@ -706,41 +731,23 @@ xfs_log_commit_cil(
 	int			flags)
 {
 	struct xlog		*log = mp->m_log;
+	struct xfs_cil		*cil = log->l_cilp;
 	int			log_flags = 0;
-	struct xfs_log_vec	*log_vector;
 
 	if (flags & XFS_TRANS_RELEASE_LOG_RES)
 		log_flags = XFS_LOG_REL_PERM_RESERV;
 
-	/*
-	 * Do all the hard work of formatting items (including memory
-	 * allocation) outside the CIL context lock. This prevents stalling CIL
-	 * pushes when we are low on memory and a transaction commit spends a
-	 * lot of time in memory reclaim.
-	 */
-	log_vector = xlog_cil_prepare_log_vecs(tp);
-	if (!log_vector)
-		return ENOMEM;
-
 	/* lock out background commit */
-	down_read(&log->l_cilp->xc_ctx_lock);
+	down_read(&cil->xc_ctx_lock);
 	if (commit_lsn)
-		*commit_lsn = log->l_cilp->xc_ctx->sequence;
+		*commit_lsn = cil->xc_ctx->sequence;
 
 	/* xlog_cil_insert_items() destroys log_vector list */
-	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
+	xlog_cil_insert_items(log, tp, tp->t_ticket);
 
 	/* check we didn't blow the reservation */
 	if (tp->t_ticket->t_curr_res < 0)
-		xlog_print_tic_res(log->l_mp, tp->t_ticket);
-
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy)) {
-		spin_lock(&log->l_cilp->xc_cil_lock);
-		list_splice_init(&tp->t_busy,
-					&log->l_cilp->xc_ctx->busy_extents);
-		spin_unlock(&log->l_cilp->xc_cil_lock);
-	}
+		xlog_print_tic_res(mp, tp->t_ticket);
 
 	tp->t_commit_lsn = *commit_lsn;
 	xfs_log_done(mp, tp->t_ticket, NULL, log_flags);
@@ -800,7 +807,7 @@ xlog_cil_force_lsn(
 	 * on commits for those as well.
 	 */
 restart:
-	spin_lock(&cil->xc_cil_lock);
+	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(ctx, &cil->xc_committing, committing) {
 		if (ctx->sequence > sequence)
 			continue;
@@ -809,7 +816,7 @@ restart:
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
-			xlog_wait(&cil->xc_commit_wait, &cil->xc_cil_lock);
+			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
 		if (ctx->sequence != sequence)
@@ -817,7 +824,7 @@ restart:
 		/* found it! */
 		commit_lsn = ctx->commit_lsn;
 	}
-	spin_unlock(&cil->xc_cil_lock);
+	spin_unlock(&cil->xc_push_lock);
 	return commit_lsn;
 }
 
@@ -875,6 +882,7 @@ xlog_cil_init(
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
+	spin_lock_init(&cil->xc_push_lock);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index edd0964..56654931 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -278,13 +278,16 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
-	struct xfs_cil_ctx	*xc_ctx;
+
+	struct xfs_cil_ctx	*xc_ctx ____cacheline_aligned_in_smp;
 	struct rw_semaphore	xc_ctx_lock;
+
+	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
+	xfs_lsn_t		xc_push_seq;
 	xfs_lsn_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
-	xfs_lsn_t		xc_push_seq;
 };
 
 /*
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8e67284..157e9c4 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -140,7 +140,7 @@ typedef struct xfs_log_item {
 	{ XFS_LI_ABORTED,	"ABORTED" }
 
 struct xfs_item_ops {
-	uint (*iop_size)(xfs_log_item_t *);
+	void (*iop_size)(xfs_log_item_t *, int *, int *);
 	void (*iop_format)(xfs_log_item_t *, struct xfs_log_iovec *);
 	void (*iop_pin)(xfs_log_item_t *);
 	void (*iop_unpin)(xfs_log_item_t *, int remove);
@@ -150,8 +150,6 @@ struct xfs_item_ops {
 	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
 };
 
-#define IOP_SIZE(ip)		(*(ip)->li_ops->iop_size)(ip)
-#define IOP_FORMAT(ip,vp)	(*(ip)->li_ops->iop_format)(ip, vp)
 #define IOP_PIN(ip)		(*(ip)->li_ops->iop_pin)(ip)
 #define IOP_UNPIN(ip, remove)	(*(ip)->li_ops->iop_unpin)(ip, remove)
 #define IOP_PUSH(ip, list)	(*(ip)->li_ops->iop_push)(ip, list)
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC]
  2013-07-01  5:44 [PATCH] xfs: optimise CIL insertion during transaction commit [RFC] Dave Chinner
@ 2013-07-04  2:09 ` Mark Tinguely
  2013-07-08 12:44   ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Mark Tinguely @ 2013-07-04  2:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 07/01/13 00:44, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Note: This is an RFC right now - it'll need to be broken up into
> several patches for final submission.
>
> The CIL insertion during transaction commit currently does multiple
> passes across the transaction objects and requires multiple memory
> allocations per object that is to be inserted into the CIL. It is
> quite inefficient, and as such xfs_log_commit_cil() and its
> children show up quite highly in profiles under metadata
> modification intensive workloads.
>
> The current insertion tries to minimise the number of times the
> xc_cil_lock is grabbed and the hold times via a couple of methods:
>
> 	1. an initial loop across the transaction items outside the
> 	lock to allocate log vectors, buffers and copy the data into
> 	them.
> 	2. a second pass across the log vectors that then inserts
> 	them into the CIL, modifies the CIL state and frees the old
> 	vectors.
>
> This is somewhat inefficient. While it minimises lock grabs, the
> hold time is still quite high because we are freeing objects with
> the spinlock held and so the hold times are much higher than they
> need to be.
>
> Optimisations that can be made:
>
> 	1. change IOP_SIZE() to return the size of the metadata to
> 	be logged with the number of vectors required. This means we
> 	can allocate the buffer to hold the metadata as part of the
> 	allocation for the log vector array. One allocation instead
> 	of two.
>
> 	2. Store the size of the allocated region for the log vector
> 	in the log vector itself. This allows us to determine if we
> 	can just reuse the existing log vector for the new
> 	transactions. Avoids all memory allocation for that item.
>
> 	3. when setting up a new log vector, we don't need to
> 	hold the CIL lock while determining the difference in its
> 	size when compared to the old one. Hence we can do all the
> 	accounting, swapping of the log vectors and freeing the old
> 	log vectors outside the xc_cil_lock.
>
> 	4. We can do all the CIL insertion of items and busy
> 	extents and the accounting under a single grab of the
> 	xc_cil_lock. This minimises the number of times we need to
> 	grab it and hence contention points are kept to a minimum
>
> 	5. the xc_cil_lock is used for serialising both the CIL and
> 	the push/commit ordering. Separate the two functions into
> 	different locks so push/commit ordering does not affect CIL
> 	insertion
>
> 	6. the xc_cil_lock shares cachelines with other locks.
> 	Separate them onto different cachelines.
>
> The result is that my standard fsmark benchmark (8-way, 50m files)
> on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
> gives the following results with a xfs-oss tree. No CRCs:
>
>                  vanilla         patched         Difference
> create  (time)  483s            435s            -10.0%  (faster)
>          (rate)  109k+/6k        122k+/-7k       +11.9%  (faster)
>
> walk            339s            335s            (noise)
>       (sys cpu)  1134s           1135s           (noise)
>
> unlink          692s            645s             -6.8%  (faster)
>
> So it's significantly faster than the current code, and lock_stat
> reports lower contention on the xc_cil_lock, too. So, big win here.
>
> With CRCs:
>
>                  vanilla         patched         Difference
> create  (time)  510s            460s             -9.8%  (faster)
>          (rate)  105k+/5.4k      117k+/-5k       +11.4%  (faster)
>
> walk            494s            486s            (noise)
>       (sys cpu)  1324s           1290s           (noise)
>
> unlink          959s            889s             -7.3%  (faster)
>
> Gains are of the same order, with walk and unlink still affected by
> VFS LRU lock contention. IOWs, with these changes, filesystems with
> CRCs enabled will still be faster than the old non-CRC kernels...
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>

Dave,

The idea to separate the transaction commits from the CIL push looks good.

I am still concerned that the comparison of xc_ctx->space_used
against XLOG_CIL_SPACE_LIMIT(log) does not account for the "stolen"
ticket log space. Maybe the future Liu/Chinner minimum log space
series is the best place to address this and my other log space
concerns.
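
For reference, the comparison I'm referring to is the background push
trigger in xlog_cil_push_background(), as it stands in this patch:

	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
		return;

	spin_lock(&cil->xc_push_lock);
	if (cil->xc_push_seq < cil->xc_current_sequence) {
		cil->xc_push_seq = cil->xc_current_sequence;
		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
	}
	spin_unlock(&cil->xc_push_lock);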

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-01  5:44 [PATCH] xfs: optimise CIL insertion during transaction commit [RFC] Dave Chinner
@ 2013-07-08 12:44   ` Dave Chinner
  2013-07-08 12:44   ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-08 12:44 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel

[cc fsdevel because after all the XFS stuff I did some testing on
mmotm w.r.t per-node LRU lock contention avoidance, and also some
scalability tests against ext4 and btrfs for comparison on some new
hardware. That bit ain't pretty. ]

On Mon, Jul 01, 2013 at 03:44:36PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Note: This is an RFC right now - it'll need to be broken up into
> several patches for final submission.
> 
> The CIL insertion during transaction commit currently does multiple
> passes across the transaction objects and requires multiple memory
> allocations per object that is to be inserted into the CIL. It is
> quite inefficient, and as such xfs_log_commit_cil() and its
> children show up quite highly in profiles under metadata
> modification intensive workloads.
> 
> The current insertion tries to minimise the number of times the
> xc_cil_lock is grabbed and the hold times via a couple of methods:
> 
> 	1. an initial loop across the transaction items outside the
> 	lock to allocate log vectors, buffers and copy the data into
> 	them.
> 	2. a second pass across the log vectors that then inserts
> 	them into the CIL, modifies the CIL state and frees the old
> 	vectors.
> 
> This is somewhat inefficient. While it minimises lock grabs, the
> hold time is still quite high because we are freeing objects with
> the spinlock held and so the hold times are much higher than they
> need to be.
> 
> Optimisations that can be made:
.....
> 
> The result is that my standard fsmark benchmark (8-way, 50m files)
> on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
> gives the following results with a xfs-oss tree. No CRCs:
> 
>                 vanilla         patched         Difference
> create  (time)  483s            435s            -10.0%  (faster)
>         (rate)  109k+/6k        122k+/-7k       +11.9%  (faster)
> 
> walk            339s            335s            (noise)
>      (sys cpu)  1134s           1135s           (noise)
> 
> unlink          692s            645s             -6.8%  (faster)
> 
> So it's significantly faster than the current code, and lock_stat
> reports lower contention on the xc_cil_lock, too. So, big win here.
> 
> With CRCs:
> 
>                 vanilla         patched         Difference
> create  (time)  510s            460s             -9.8%  (faster)
>         (rate)  105k+/5.4k      117k+/-5k       +11.4%  (faster)
> 
> walk            494s            486s            (noise)
>      (sys cpu)  1324s           1290s           (noise)
> 
> unlink          959s            889s             -7.3%  (faster)
> 
> Gains are of the same order, with walk and unlink still affected by
> VFS LRU lock contention. IOWs, with these changes, filesystems with
> CRCs enabled will still be faster than the old non-CRC kernels...

FWIW, I have new hardware here that I'll be using for benchmarking
like this, so here's a quick baseline comparison using the same
8p/4GB RAM VM (just migrated across) and same SSD based storage
(physically moved) and 100TB filesystem. The disks are behind a
faster RAID controller w/ 1GB of BBWC, so random read and write IOPS
are higher, and hence traversal times will be lower due to the lower
IO latency.

Create times
		  wall time(s)		     rate (files/s)
		vanilla	 patched   diff	   vanilla  patched    diff
Old system	  483	  435	 -10.0%	   109k+-6k 122k+-7k +11.9%
New system	  378	  342	  -9.5%	   143k+-9k 158k+-8k +10.5%
diff		-21.7%	-21.4%		    +31.2%   +29.5%

Walk times
		  wall time(s)
		vanilla	 patched   diff
Old system	  339	  335	 (noise)
New system	  194	  197	 (noise)
diff		-42.7%	-41.2%

Unlink times
		  wall time(s)
		vanilla	 patched   diff
Old system	  692	  645	  -7.3%
New system	  457	  405	 -11.4%
diff		-34.0%  -37.2%

So, overall, the new system is 20-40% faster than the old one on a
comparative test, but I have a few more cores and a lot more memory
to play with, so a 16-way test on the same machine with the VM
expanded to 16p/16GB RAM, 4 fake numa nodes follows:

New system, patched kernel:

Threads	    create		walk		unlink
	time(s)	 rate		time(s)		time(s)
8	  342	158k+-8k	  197		  405
16	  222	266k+-32k	  170		  295
diff	-35.1%	 +68.4%		-13.7%		-27.2%

Create rates are much more variable because the memory reclaim
behaviour appears to be very harsh, pulling 4-6 million inodes out
of memory every 10s or so and thrashing on the LRU locks, and then
doing nothing until another large step occurs.

Walk rates improve, but not much because of lock contention. I added
8 cpu cores to the workload, and I'm burning at least 4 of those
cores on the inode LRU lock.

-  30.61%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 65.33% _raw_spin_lock
         + 88.19% inode_add_lru
         + 7.31% dentry_lru_del
         + 1.07% shrink_dentry_list
         + 0.90% dput
         + 0.83% inode_sb_list_add
         + 0.59% evict
      + 27.79% do_raw_spin_lock
      + 4.03% do_raw_spin_trylock
      + 2.85% _raw_spin_trylock

The current mmotm (and hence probably 3.11) has the new per-node LRU
code in it, so this variance and contention should go away very
soon.

Unlinks go lots faster because they don't cause inode LRU lock
contention, but we are still a long way from linear scalability
from 8- to 16-way.

FWIW, the mmotm kernel (which has a fair bit of debug enabled, so
not quite comparative) doesn't have any LRU lock contention to speak
of. For create:

-   7.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.98% _raw_spin_lock
         + 97.55% xfs_log_commit_cil
         + 0.93% __d_instantiate
         + 0.58% inode_sb_list_add
      - 29.02% do_raw_spin_lock
         - _raw_spin_lock
            + 41.14% xfs_log_commit_cil
            + 8.29% _xfs_buf_find
            + 8.00% xfs_iflush_cluster

And the walk:

-  26.37%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 49.10% _raw_spin_lock
         - 50.65% evict
              dispose_list
              prune_icache_sb
              super_cache_scan
            + shrink_slab
         - 26.99% list_lru_add
            + 89.01% inode_add_lru
            + 10.95% dput
         + 7.03% __remove_inode_hash
      - 40.65% do_raw_spin_lock
         - _raw_spin_lock
            - 41.96% evict
                 dispose_list
                 prune_icache_sb
                 super_cache_scan
               + shrink_slab
            - 13.55% list_lru_add
                 84.33% inode_add_lru
                    iput
                    d_kill
                    shrink_dentry_list
                    prune_dcache_sb
                    super_cache_scan
                    shrink_slab
                 15.01% dput
                 0.66% xfs_buf_rele
            + 10.10% __remove_inode_hash                                                                                                                               
                    system_call_fastpath

There's quite a different pattern of contention - it has moved
inward to evict which implies the inode_sb_list_lock is the next
obvious point of contention. I have patches in the works for that.
Also, the inode_hash_lock is causing some contention, even though we
fake inode hashing. I have a patch to fix that for XFS as well.

I also note an interesting behaviour of the per-node inode LRUs -
the contention is coming from the dentry shrinker on one node
freeing inodes allocated on a different node during reclaim. There's
scope for improvement there.

But here's the interesting part:

Kernel	    create		walk		unlink
	time(s)	 rate		time(s)		time(s)
3.10-cil  222	266k+-32k	  170		  295
mmotm	  251	222k+-16k	  128		  356

Even with all the debug enabled, the overall walk time dropped by
25% to 128s. So performance in this workload has substantially
improved because of the per-node LRUs and variability is also down
as well, as predicted. Once I add all the tweaks I have in the
3.10-cil tree to mmotm, I expect significant improvements to create
and unlink performance as well...

So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
3.10-cil kernel I've been testing XFS on):

	    create		 walk		unlink
	 time(s)   rate		time(s)		time(s)
xfs	  222	266k+-32k	  170		  295
ext4	  978	 54k+- 2k	  325		 2053
btrfs	 1223	 47k+- 8k	  366		12000(*)

(*) Estimate extrapolated from the first 4.8 million inodes taking
18.5 minutes to remove.

Basically, neither btrfs nor ext4 has any concurrency scaling to
demonstrate, and unlinks on btrfs are just plain woeful.

ext4 create rate is limited by the extent cache LRU locking:

-  41.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.67% _raw_spin_lock
         - 99.60% ext4_es_lru_add
            + 99.63% ext4_es_lookup_extent
      - 39.15% do_raw_spin_lock
         - _raw_spin_lock
            + 95.38% ext4_es_lru_add
              0.51% insert_inode_locked
                 __ext4_new_inode
-   16.20%  [kernel]  [k] native_read_tsc
   - native_read_tsc
      - 60.91% delay_tsc
           __delay
           do_raw_spin_lock
         + _raw_spin_lock
      - 39.09% __delay
           do_raw_spin_lock
         + _raw_spin_lock

Ext4 unlink is serialised on orphan list processing:

-  12.67%  [kernel]  [k] __mutex_unlock_slowpath
   - __mutex_unlock_slowpath
      - 99.95% mutex_unlock
         + 54.37% ext4_orphan_del
         + 43.26% ext4_orphan_add
+   5.33%  [kernel]  [k] __mutex_lock_slowpath


btrfs create has tree lock problems:

-  21.68%  [kernel]  [k] __write_lock_failed
   - __write_lock_failed
      - 99.93% do_raw_write_lock
         - _raw_write_lock
            - 79.04% btrfs_try_tree_write_lock
               - btrfs_search_slot
                  - 97.48% btrfs_insert_empty_items
                       99.82% btrfs_new_inode
                  + 2.52% btrfs_lookup_inode
            - 20.37% btrfs_tree_lock
               - 99.38% btrfs_search_slot
                    99.92% btrfs_insert_empty_items
                 0.52% btrfs_lock_root_node
                    btrfs_search_slot
                    btrfs_insert_empty_items
-  21.24%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 61.22% prepare_to_wait
         + 61.52% btrfs_tree_lock
         + 32.31% btrfs_tree_read_lock
           6.17% reserve_metadata_bytes
              btrfs_block_rsv_add

btrfs walk phase hammers the inode_hash_lock:

-  18.45%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 47.38% _raw_spin_lock
         + 42.99% iget5_locked
         + 15.17% __remove_inode_hash
         + 13.77% btrfs_get_delayed_node
         + 11.27% inode_tree_add
         + 9.32% btrfs_destroy_inode
.....
      - 46.77% do_raw_spin_lock
         - _raw_spin_lock
            + 30.51% iget5_locked
            + 11.40% __remove_inode_hash
            + 11.38% btrfs_get_delayed_node
            + 9.45% inode_tree_add
            + 7.28% btrfs_destroy_inode
.....

I have an RCU inode hash lookup patch floating around somewhere if
someone wants it...

And, well, the less said about btrfs unlinks the better:

+  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
+  33.18%  [kernel]  [k] __write_lock_failed
+  17.96%  [kernel]  [k] __read_lock_failed
+   1.35%  [kernel]  [k] _raw_spin_unlock_irq
+   0.82%  [kernel]  [k] __do_softirq
+   0.53%  [kernel]  [k] btrfs_tree_lock
+   0.41%  [kernel]  [k] btrfs_tree_read_lock
+   0.41%  [kernel]  [k] do_raw_read_lock
+   0.39%  [kernel]  [k] do_raw_write_lock
+   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
+   0.37%  [kernel]  [k] free_extent_buffer
+   0.36%  [kernel]  [k] btrfs_tree_read_unlock
+   0.32%  [kernel]  [k] do_raw_write_unlock

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
@ 2013-07-08 12:44   ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-08 12:44 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel

[cc fsdevel because after all the XFS stuff I did a some testing on
mmotm w.r.t per-node LRU lock contention avoidance, and also some
scalability tests against ext4 and btrfs for comparison on some new
hardware. That bit ain't pretty. ]

On Mon, Jul 01, 2013 at 03:44:36PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Note: This is an RFC right now - it'll need to be broken up into
> several patches for final submission.
> 
> The CIL insertion during transaction commit currently does multiple
> passes across the transaction objects and requires multiple memory
> allocations per object that is to be inserted into the CIL. It is
> quite inefficient, and as such xfs_log_commit_cil() and it's
> children show up quite highly in profiles under metadata
> modification intensive workloads.
> 
> The current insertion tries to minimise ithe number of times the
> xc_cil_lock is grabbed and the hold times via a couple of methods:
> 
> 	1. an initial loop across the transaction items outside the
> 	lock to allocate log vectors, buffers and copy the data into
> 	them.
> 	2. a second pass across the log vectors that then inserts
> 	them into the CIL, modifies the CIL state and frees the old
> 	vectors.
> 
> This is somewhat inefficient. While it minimises lock grabs, the
> hold time is still quite high because we are freeing objects with
> the spinlock held and so the hold times are much higher than they
> need to be.
> 
> Optimisations that can be made:
.....
> 
> The result is that my standard fsmark benchmark (8-way, 50m files)
> on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
> gives the following results with a xfs-oss tree. No CRCs:
> 
>                 vanilla         patched         Difference
> create  (time)  483s            435s            -10.0%  (faster)
>         (rate)  109k+/6k        122k+/-7k       +11.9%  (faster)
> 
> walk            339s            335s            (noise)
>      (sys cpu)  1134s           1135s           (noise)
> 
> unlink          692s            645s             -6.8%  (faster)
> 
> So it's significantly faster than the current code, and lock_stat
> reports lower contention on the xc_cil_lock, too. So, big win here.
> 
> With CRCs:
> 
>                 vanilla         patched         Difference
> create  (time)  510s            460s             -9.8%  (faster)
>         (rate)  105k+/5.4k      117k+/-5k       +11.4%  (faster)
> 
> walk            494s            486s            (noise)
>      (sys cpu)  1324s           1290s           (noise)
> 
> unlink          959s            889s             -7.3%  (faster)
> 
> Gains are of the same order, with walk and unlink still affected by
> VFS LRU lock contention. IOWs, with this changes, filesystems with
> CRCs enabled will still be faster than the old non-CRC kernels...

FWIW, I have new hardware here that I'll be using for benchmarking
like this, so here's a quick baseline comparison using the same
8p/4GB RAM VM (just migrated across) and same SSD based storage
(physically moved) and 100TB filesystem. The disks are behind a
faster RAID controller w/ 1GB of BBWC, so random read and write IOPS
are higher and hence traversal times will due to lower IO latency.

Create times
		  wall time(s)		     rate (files/s)
		vanilla	 patched   diff	   vanilla  patched    diff
Old system	  483	  435	 -10.0%	   109k+-6k 122k+-7k +11.9%
New system	  378	  342	  -9.5%	   143k+-9k 158k+-8k +10.5%
diff		-21.7%	-21.4%		    +31.2%   +29.5%

Walk times
		  wall time(s)
		vanilla	 patched   diff
Old system	  339	  335	 (noise)
New system	  194	  197	 (noise)
diff		-42.7%	-41.2%

Unlink times
		  wall time(s)
		vanilla	 patched   diff
Old system	  692	  645	  -7.3%
New system	  457	  405	 -11.4%
diff		-34.0%  -37.2%

So, overall, the new system is 20-40% faster than the old one on a
comparitive test. but I have a few more cores and a lot more memory
to play with, so a 16-way test on the same machine with the VM
expanded to 16p/16GB RAM, 4 fake numa nodes follows:

New system, patched kernel:

Threads	    create		walk		unlink
	time(s)	 rate		time(s)		time(s)
8	  342	158k+-8k	  197		  405
16	  222	266k+-32k	  170		  295
diff	-35.1%	 +68.4%		-13.7%		-27.2%

Create rates are much more variable because the memory reclaim
behaviour appears to be very harsh, pulling 4-6 million inodes out
of memory every 10s or so and thrashing on the LRU locks, and then
doing nothing until another large step occurs.

Walk rates improve, but not much because of lock contention. I added
8 cpu cores to the workload, and I'm burning at least 4 of those
cores on the inode LRU lock.

-  30.61%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 65.33% _raw_spin_lock
         + 88.19% inode_add_lru
         + 7.31% dentry_lru_del
         + 1.07% shrink_dentry_list
         + 0.90% dput
         + 0.83% inode_sb_list_add
         + 0.59% evict
      + 27.79% do_raw_spin_lock
      + 4.03% do_raw_spin_trylock
      + 2.85% _raw_spin_trylock

The current mmotm (and hence probably 3.11) has the new per-node LRU
code in it, so this variance and contention should go away very
soon.

Unlinks go lots faster because they don't cause inode LRU lock
contention, but we are still a long way from linear scalability
from 8- to 16-way.

FWIW, the mmotm kernel (which has a fair bit of debug enabled, so
not quite comparative) doesn't have any LRU lock contention to speak
of. For create:

-   7.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.98% _raw_spin_lock
         + 97.55% xfs_log_commit_cil
         + 0.93% __d_instantiate
         + 0.58% inode_sb_list_add
      - 29.02% do_raw_spin_lock
         - _raw_spin_lock
            + 41.14% xfs_log_commit_cil
            + 8.29% _xfs_buf_find
            + 8.00% xfs_iflush_cluster

And the walk:

-  26.37%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 49.10% _raw_spin_lock
         - 50.65% evict
              dispose_list
              prune_icache_sb
              super_cache_scan
            + shrink_slab
         - 26.99% list_lru_add
            + 89.01% inode_add_lru
            + 10.95% dput
         + 7.03% __remove_inode_hash
      - 40.65% do_raw_spin_lock
         - _raw_spin_lock
            - 41.96% evict
                 dispose_list
                 prune_icache_sb
                 super_cache_scan
               + shrink_slab
            - 13.55% list_lru_add
                 84.33% inode_add_lru
                    iput
                    d_kill
                    shrink_dentry_list
                    prune_dcache_sb
                    super_cache_scan
                    shrink_slab
                 15.01% dput
                 0.66% xfs_buf_rele
            + 10.10% __remove_inode_hash
                    system_call_fastpath

There's quite a different pattern of contention - it has moved
inward to evict which implies the inode_sb_list_lock is the next
obvious point of contention. I have patches in the works for that.
Also, the inode_hash_lock is causing some contention, even though we
fake inode hashing. I have a patch to fix that for XFS as well.

I also note an interesting behaviour of the per-node inode LRUs -
the contention is coming from the dentry shrinker on one node
freeing inodes allocated on a different node during reclaim. There's
scope for improvement there.

But here's the interesting part:

Kernel	    create		walk		unlink
	time(s)	 rate		time(s)		time(s)
3.10-cil  222	266k+-32k	  170		  295
mmotm	  251	222k+-16k	  128		  356

Even with all the debug enabled, the overall walk time dropped by
25% to 128s. So performance in this workload has substantially
improved because of the per-node LRUs and variability is also down
as well, as predicted. Once I add all the tweaks I have in the
3.10-cil tree to mmotm, I expect significant improvements to create
and unlink performance as well...

So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
3.10-cil kernel I've been testing XFS on):

	    create		 walk		unlink
	 time(s)   rate		time(s)		time(s)
xfs	  222	266k+-32k	  170		  295
ext4	  978	 54k+- 2k	  325		 2053
btrfs	 1223	 47k+- 8k	  366		12000(*)

(*) Estimate based on a removal rate of 18.5 minutes for the first
4.8 million inodes.
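Back of the envelope, assuming the unlink phase has to remove the
same ~50 million inodes the create phase wrote: 4.8M inodes in 18.5
minutes is roughly 4,300 unlinks/s, and 50M / 4,300 per second comes
to about 11,600s - hence the ~12,000s figure above.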

Basically, neither btrfs nor ext4 has any concurrency scaling to
demonstrate, and unlinks on btrfs are just plain woeful.

ext4 create rate is limited by the extent cache LRU locking:

-  41.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.67% _raw_spin_lock
         - 99.60% ext4_es_lru_add
            + 99.63% ext4_es_lookup_extent
      - 39.15% do_raw_spin_lock
         - _raw_spin_lock
            + 95.38% ext4_es_lru_add
              0.51% insert_inode_locked
                 __ext4_new_inode
-   16.20%  [kernel]  [k] native_read_tsc
   - native_read_tsc
      - 60.91% delay_tsc
           __delay
           do_raw_spin_lock
         + _raw_spin_lock
      - 39.09% __delay
           do_raw_spin_lock
         + _raw_spin_lock

Ext4 unlink is serialised on orphan list processing:

-  12.67%  [kernel]  [k] __mutex_unlock_slowpath
   - __mutex_unlock_slowpath
      - 99.95% mutex_unlock
         + 54.37% ext4_orphan_del
         + 43.26% ext4_orphan_add
+   5.33%  [kernel]  [k] __mutex_lock_slowpath


btrfs create has tree lock problems:

-  21.68%  [kernel]  [k] __write_lock_failed
   - __write_lock_failed
      - 99.93% do_raw_write_lock
         - _raw_write_lock
            - 79.04% btrfs_try_tree_write_lock
               - btrfs_search_slot
                  - 97.48% btrfs_insert_empty_items
                       99.82% btrfs_new_inode
                  + 2.52% btrfs_lookup_inode
            - 20.37% btrfs_tree_lock
               - 99.38% btrfs_search_slot
                    99.92% btrfs_insert_empty_items
                 0.52% btrfs_lock_root_node
                    btrfs_search_slot
                    btrfs_insert_empty_items
-  21.24%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 61.22% prepare_to_wait
         + 61.52% btrfs_tree_lock
         + 32.31% btrfs_tree_read_lock
           6.17% reserve_metadata_bytes
              btrfs_block_rsv_add

btrfs walk phase hammers the inode_hash_lock:

-  18.45%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 47.38% _raw_spin_lock
         + 42.99% iget5_locked
         + 15.17% __remove_inode_hash
         + 13.77% btrfs_get_delayed_node
         + 11.27% inode_tree_add
         + 9.32% btrfs_destroy_inode
.....
      - 46.77% do_raw_spin_lock
         - _raw_spin_lock
            + 30.51% iget5_locked
            + 11.40% __remove_inode_hash
            + 11.38% btrfs_get_delayed_node
            + 9.45% inode_tree_add
            + 7.28% btrfs_destroy_inode
.....

I have a RCU inode hash lookup patch floating around somewhere if
someone wants it...

And, well, the less said about btrfs unlinks the better:

+  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
+  33.18%  [kernel]  [k] __write_lock_failed
+  17.96%  [kernel]  [k] __read_lock_failed
+   1.35%  [kernel]  [k] _raw_spin_unlock_irq
+   0.82%  [kernel]  [k] __do_softirq
+   0.53%  [kernel]  [k] btrfs_tree_lock
+   0.41%  [kernel]  [k] btrfs_tree_read_lock
+   0.41%  [kernel]  [k] do_raw_read_lock
+   0.39%  [kernel]  [k] do_raw_write_lock
+   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
+   0.37%  [kernel]  [k] free_extent_buffer
+   0.36%  [kernel]  [k] btrfs_tree_read_unlock
+   0.32%  [kernel]  [k] do_raw_write_unlock

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 12:44   ` Dave Chinner
  (?)
@ 2013-07-08 13:59   ` Jan Kara
  2013-07-08 15:22       ` Marco Stornelli
  -1 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2013-07-08 13:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Mon 08-07-13 22:44:53, Dave Chinner wrote:
<snipped some nice XFS results ;)>
> So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
> 
> 	    create		 walk		unlink
> 	 time(s)   rate		time(s)		time(s)
> xfs	  222	266k+-32k	  170		  295
> ext4	  978	 54k+- 2k	  325		 2053
> btrfs	 1223	 47k+- 8k	  366		12000(*)
> 
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.
> 
> Basically, neither btrfs or ext4 have any concurrency scaling to
> demonstrate, and unlinks on btrfs a just plain woeful.
  Thanks for posting the numbers. There isn't anyone seriously testing ext4
SMP scalability AFAIK so it's not surprising it sucks.
 
> ext4 create rate is limited by the extent cache LRU locking:
> 
> -  41.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 60.67% _raw_spin_lock
>          - 99.60% ext4_es_lru_add
>             + 99.63% ext4_es_lookup_extent
  At least this should improve with the patches in 3.11-rc1.

>       - 39.15% do_raw_spin_lock
>          - _raw_spin_lock
>             + 95.38% ext4_es_lru_add
>               0.51% insert_inode_locked
>                  __ext4_new_inode
> -   16.20%  [kernel]  [k] native_read_tsc
>    - native_read_tsc
>       - 60.91% delay_tsc
>            __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>       - 39.09% __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
> 
> Ext4 unlink is serialised on orphan list processing:
> 
> -  12.67%  [kernel]  [k] __mutex_unlock_slowpath
>    - __mutex_unlock_slowpath
>       - 99.95% mutex_unlock
>          + 54.37% ext4_orphan_del
>          + 43.26% ext4_orphan_add
> +   5.33%  [kernel]  [k] __mutex_lock_slowpath
  ext4 can do better here, I'm sure. The current solution is pretty
simplistic. At the least we could use a spinlock for the in-memory orphan
list and atomic ops for the on-disk one (as it's only a singly linked
list). But I'm not sure I'll find time to look into this in the
foreseeable future...
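
A minimal userspace sketch of that idea (illustrative only - this is not
ext4 code and all the names are made up): the in-memory orphan list is a
doubly linked list manipulated under a short spinlock hold, while
additions to the singly linked "on-disk" chain are done with an atomic
compare-and-swap on the head pointer. Removing an entry from the middle
of the on-disk chain is the hard part and isn't shown - that would still
need some serialisation. Builds with gcc -std=c11 -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct orphan {
        struct orphan *next_ondisk;     /* singly linked "on-disk" chain */
        struct orphan *prev, *next;     /* doubly linked in-memory list */
        int ino;
};

static pthread_spinlock_t mem_lock;
static struct orphan mem_head = { .prev = &mem_head, .next = &mem_head };
static _Atomic(struct orphan *) ondisk_head;

static void orphan_add(struct orphan *o)
{
        struct orphan *old = atomic_load(&ondisk_head);

        /* lock-free prepend to the on-disk chain */
        do {
                o->next_ondisk = old;
        } while (!atomic_compare_exchange_weak(&ondisk_head, &old, o));

        /* short spinlock hold for the in-memory list */
        pthread_spin_lock(&mem_lock);
        o->next = mem_head.next;
        o->prev = &mem_head;
        mem_head.next->prev = o;
        mem_head.next = o;
        pthread_spin_unlock(&mem_lock);
}

int main(void)
{
        pthread_spin_init(&mem_lock, PTHREAD_PROCESS_PRIVATE);
        for (int i = 1; i <= 4; i++) {
                struct orphan *o = calloc(1, sizeof(*o));
                o->ino = i;
                orphan_add(o);
        }
        for (struct orphan *o = atomic_load(&ondisk_head); o; o = o->next_ondisk)
                printf("on-disk orphan chain: inode %d\n", o->ino);
        return 0;
}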

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 13:59   ` Jan Kara
@ 2013-07-08 15:22       ` Marco Stornelli
  0 siblings, 0 replies; 23+ messages in thread
From: Marco Stornelli @ 2013-07-08 15:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, xfs, linux-fsdevel

Il 08/07/2013 15:59, Jan Kara ha scritto:
> On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> <snipped some nice XFS results ;)>
>> So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
>> 3.10-cil kernel I've been testing XFS on):
>>
>> 	    create		 walk		unlink
>> 	 time(s)   rate		time(s)		time(s)
>> xfs	  222	266k+-32k	  170		  295
>> ext4	  978	 54k+- 2k	  325		 2053
>> btrfs	 1223	 47k+- 8k	  366		12000(*)
>>
>> (*) Estimate based on a removal rate of 18.5 minutes for the first
>> 4.8 million inodes.
>>
>> Basically, neither btrfs or ext4 have any concurrency scaling to
>> demonstrate, and unlinks on btrfs a just plain woeful.
>    Thanks for posting the numbers. There isn't anyone seriously testing ext4
> SMP scalability AFAIK so it's not surprising it sucks.

Funny, if I remember correctly the Google guys switched Android from
yaffs2 to ext4 due to its superiority on SMP :)

Marco

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 15:22       ` Marco Stornelli
  (?)
@ 2013-07-08 15:38       ` Jan Kara
  2013-07-09  0:15           ` Dave Chinner
  -1 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2013-07-08 15:38 UTC (permalink / raw)
  To: Marco Stornelli; +Cc: linux-fsdevel, Jan Kara, xfs

On Mon 08-07-13 17:22:43, Marco Stornelli wrote:
> Il 08/07/2013 15:59, Jan Kara ha scritto:
> >On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> ><snipped some nice XFS results ;)>
> >>So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> >>3.10-cil kernel I've been testing XFS on):
> >>
> >>	    create		 walk		unlink
> >>	 time(s)   rate		time(s)		time(s)
> >>xfs	  222	266k+-32k	  170		  295
> >>ext4	  978	 54k+- 2k	  325		 2053
> >>btrfs	 1223	 47k+- 8k	  366		12000(*)
> >>
> >>(*) Estimate based on a removal rate of 18.5 minutes for the first
> >>4.8 million inodes.
> >>
> >>Basically, neither btrfs or ext4 have any concurrency scaling to
> >>demonstrate, and unlinks on btrfs a just plain woeful.
> >   Thanks for posting the numbers. There isn't anyone seriously testing ext4
> >SMP scalability AFAIK so it's not surprising it sucks.
> 
> Funny, if I well remember Google guys switched android from yaffs2
> to ext4 due to its superiority on SMP :)
  Well, there's SMP and SMP. Ext4 is perfectly OK for the desktop kind of
SMP - that's what lots of people use. Heavy IO load with 16 CPUs on
enterprise grade storage, where CPU (and not IO) bottlenecks are actually
visible, is not so easily available, and so we don't have serious
performance work in that direction...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 15:38       ` Jan Kara
@ 2013-07-09  0:15           ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-09  0:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Marco Stornelli, xfs, linux-fsdevel

On Mon, Jul 08, 2013 at 05:38:07PM +0200, Jan Kara wrote:
> On Mon 08-07-13 17:22:43, Marco Stornelli wrote:
> > Il 08/07/2013 15:59, Jan Kara ha scritto:
> > >On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> > ><snipped some nice XFS results ;)>
> > >>So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> > >>3.10-cil kernel I've been testing XFS on):
> > >>
> > >>	    create		 walk		unlink
> > >>	 time(s)   rate		time(s)		time(s)
> > >>xfs	  222	266k+-32k	  170		  295
> > >>ext4	  978	 54k+- 2k	  325		 2053
> > >>btrfs	 1223	 47k+- 8k	  366		12000(*)
> > >>
> > >>(*) Estimate based on a removal rate of 18.5 minutes for the first
> > >>4.8 million inodes.
> > >>
> > >>Basically, neither btrfs or ext4 have any concurrency scaling to
> > >>demonstrate, and unlinks on btrfs a just plain woeful.
> > >   Thanks for posting the numbers. There isn't anyone seriously testing ext4
> > >SMP scalability AFAIK so it's not surprising it sucks.

It's worse than that - nobody picked up on review that taking a
global lock on every extent lookup might be a scalability issue?
Scalability is not an afterthought anymore - new filesystem and
kernel features need to be designed from the ground up with this in
mind. We're living in a world where even phones have 4 CPU cores....

> > Funny, if I well remember Google guys switched android from yaffs2
> > to ext4 due to its superiority on SMP :)
>   Well, there's SMP and SMP. Ext4 is perfectly OK for desktop kind of SMP -

Barely. It tops out in parallelism at between 2-4 threads depending
on the metadata operations being done.

> that's what lots of people use. When we speak of heavy IO load with 16 CPUs
> on enterprise grade storage so that CPU (and not IO) bottlenecks are actually
> visible, that's not so easily available and so we don't have serious
> performance work in that direction...

I'm not testing with "enterprise grade" storage. The filesystem I'm
testing on is hosted on less than $300 of SSDs.  The "enterprise"
RAID controller they sit behind is actually an IOPS bottleneck, not
an improvement.

My 2.5 year old desktop has a pair of cheap, no-name SandForce SSDs
in RAID0 and they can do at least 2x the read and write IOPS of the
new hardware I just tested. And yes, I run XFS on my desktop.

And then there's my 3 month old laptop, which has a recent SATA SSD
in it. It also has 8 threads, but twice the memory and about 1.5x
the IOPS and bandwidth of my desktop machine.

The bottlenecks showing up in ext4 and btrfs don't magically show up
at 16 threads - they are present and reproducible at 2-4 threads.
Indeed, I didn't bother testing at 32 threads - even though my new
server can do that - because that will just hammer the same
bottlenecks even harder.  Fundamentally, I'm not testing anything
you can't test on a $2000 desktop PC....

FWIW, the SSDs are making ext4 and btrfs look good in these
workloads. XFS is creating >250k files/s doing about 1500 IOPS. ext4
is making 50k files/s at 23,000 IOPS. btrfs has peaks every 30s of
over 30,000 IOPS. Which filesystem is going to scale better on
desktops with spinning rust?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 12:44   ` Dave Chinner
@ 2013-07-09  0:43     ` Zheng Liu
  -1 siblings, 0 replies; 23+ messages in thread
From: Zheng Liu @ 2013-07-09  0:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

Hi Dave,

On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
[...]
> So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
> 
> 	    create		 walk		unlink
> 	 time(s)   rate		time(s)		time(s)
> xfs	  222	266k+-32k	  170		  295
> ext4	  978	 54k+- 2k	  325		 2053
> btrfs	 1223	 47k+- 8k	  366		12000(*)
> 
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.
> 
> Basically, neither btrfs or ext4 have any concurrency scaling to
> demonstrate, and unlinks on btrfs a just plain woeful.
> 
> ext4 create rate is limited by the extent cache LRU locking:

I have a patch to fix this problem and the patch has been applied into
3.11-rc1.  The patch is (d3922a77):
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time

I would really appreciate it if you could try your testing again against
this patch.  I just want to make sure that this problem has been fixed.
At least in my own testing it looks fine.

Thanks,
                                                - Zheng

> 
> -  41.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 60.67% _raw_spin_lock
>          - 99.60% ext4_es_lru_add
>             + 99.63% ext4_es_lookup_extent
>       - 39.15% do_raw_spin_lock
>          - _raw_spin_lock
>             + 95.38% ext4_es_lru_add
>               0.51% insert_inode_locked
>                  __ext4_new_inode
> -   16.20%  [kernel]  [k] native_read_tsc
>    - native_read_tsc
>       - 60.91% delay_tsc
>            __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>       - 39.09% __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
> 
> Ext4 unlink is serialised on orphan list processing:
> 
> -  12.67%  [kernel]  [k] __mutex_unlock_slowpath
>    - __mutex_unlock_slowpath
>       - 99.95% mutex_unlock
>          + 54.37% ext4_orphan_del
>          + 43.26% ext4_orphan_add
> +   5.33%  [kernel]  [k] __mutex_lock_slowpath
> 
> 
> btrfs create has tree lock problems:
> 
> -  21.68%  [kernel]  [k] __write_lock_failed
>    - __write_lock_failed
>       - 99.93% do_raw_write_lock
>          - _raw_write_lock
>             - 79.04% btrfs_try_tree_write_lock
>                - btrfs_search_slot
>                   - 97.48% btrfs_insert_empty_items
>                        99.82% btrfs_new_inode
>                   + 2.52% btrfs_lookup_inode
>             - 20.37% btrfs_tree_lock
>                - 99.38% btrfs_search_slot
>                     99.92% btrfs_insert_empty_items
>                  0.52% btrfs_lock_root_node
>                     btrfs_search_slot
>                     btrfs_insert_empty_items
> -  21.24%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 61.22% prepare_to_wait
>          + 61.52% btrfs_tree_lock
>          + 32.31% btrfs_tree_read_lock
>            6.17% reserve_metadata_bytes
>               btrfs_block_rsv_add
> 
> btrfs walk phase hammers the inode_hash_lock:
> 
> -  18.45%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 47.38% _raw_spin_lock
>          + 42.99% iget5_locked
>          + 15.17% __remove_inode_hash
>          + 13.77% btrfs_get_delayed_node
>          + 11.27% inode_tree_add
>          + 9.32% btrfs_destroy_inode
> .....
>       - 46.77% do_raw_spin_lock
>          - _raw_spin_lock
>             + 30.51% iget5_locked
>             + 11.40% __remove_inode_hash
>             + 11.38% btrfs_get_delayed_node
>             + 9.45% inode_tree_add
>             + 7.28% btrfs_destroy_inode
> .....
> 
> I have a RCU inode hash lookup patch floating around somewhere if
> someone wants it...
> 
> And, well, the less said about btrfs unlinks the better:
> 
> +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +  33.18%  [kernel]  [k] __write_lock_failed
> +  17.96%  [kernel]  [k] __read_lock_failed
> +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.82%  [kernel]  [k] __do_softirq
> +   0.53%  [kernel]  [k] btrfs_tree_lock
> +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> +   0.41%  [kernel]  [k] do_raw_read_lock
> +   0.39%  [kernel]  [k] do_raw_write_lock
> +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> +   0.37%  [kernel]  [k] free_extent_buffer
> +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> +   0.32%  [kernel]  [k] do_raw_write_unlock
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 15:22       ` Marco Stornelli
@ 2013-07-09  0:56         ` Theodore Ts'o
  -1 siblings, 0 replies; 23+ messages in thread
From: Theodore Ts'o @ 2013-07-09  0:56 UTC (permalink / raw)
  To: Marco Stornelli; +Cc: Jan Kara, Dave Chinner, xfs, linux-fsdevel

On Mon, Jul 08, 2013 at 05:22:43PM +0200, Marco Stornelli wrote:
> 
> Funny, if I well remember Google guys switched android from yaffs2
> to ext4 due to its superiority on SMP :)

The bigger reason was that raw NAND flash doesn't really make
sense any more; especially as the feature size of flash cells has
shrunk and with the introduction of MLC and TLC, you really need to
use hardware assist to make flash sufficiently reliable.  Modern flash
storage uses dynamic adjustment of voltage levels as the flash cells
age, and error correcting codes to compensate for flash reliability
challenges.  This means accessing flash using eMMC, SATA, SAS, etc.,
and that rules out YAFFS2.

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 12:44   ` Dave Chinner
@ 2013-07-09  1:15     ` Chris Mason
  -1 siblings, 0 replies; 23+ messages in thread
From: Chris Mason @ 2013-07-09  1:15 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel

Quoting Dave Chinner (2013-07-08 08:44:53)
> [cc fsdevel because after all the XFS stuff I did a some testing on
> mmotm w.r.t per-node LRU lock contention avoidance, and also some
> scalability tests against ext4 and btrfs for comparison on some new
> hardware. That bit ain't pretty. ]
> 
> And, well, the less said about btrfs unlinks the better:
> 
> +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +  33.18%  [kernel]  [k] __write_lock_failed
> +  17.96%  [kernel]  [k] __read_lock_failed
> +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.82%  [kernel]  [k] __do_softirq
> +   0.53%  [kernel]  [k] btrfs_tree_lock
> +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> +   0.41%  [kernel]  [k] do_raw_read_lock
> +   0.39%  [kernel]  [k] do_raw_write_lock
> +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> +   0.37%  [kernel]  [k] free_extent_buffer
> +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> +   0.32%  [kernel]  [k] do_raw_write_unlock
> 

Hi Dave,

Thanks for doing these runs.  At least on Btrfs the best way to resolve
the tree locking today is to break things up into more subvolumes.  I've
got another run at the root lock contention in the queue after I get
the skiplists in place in a few other parts of the Btrfs code.

-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-09  0:43     ` Zheng Liu
@ 2013-07-09  1:23       ` Dave Chinner
  -1 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-09  1:23 UTC (permalink / raw)
  To: xfs, linux-fsdevel

On Tue, Jul 09, 2013 at 08:43:32AM +0800, Zheng Liu wrote:
> Hi Dave,
> 
> On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
> [...]
> > So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> > 3.10-cil kernel I've been testing XFS on):
> > 
> > 	    create		 walk		unlink
> > 	 time(s)   rate		time(s)		time(s)
> > xfs	  222	266k+-32k	  170		  295
> > ext4	  978	 54k+- 2k	  325		 2053
> > btrfs	 1223	 47k+- 8k	  366		12000(*)
> > 
> > (*) Estimate based on a removal rate of 18.5 minutes for the first
> > 4.8 million inodes.
> > 
> > Basically, neither btrfs or ext4 have any concurrency scaling to
> > demonstrate, and unlinks on btrfs a just plain woeful.
> > 
> > ext4 create rate is limited by the extent cache LRU locking:
> 
> I have a patch to fix this problem and the patch has been applied into
> 3.11-rc1.  The patch is (d3922a77):
>   ext4: improve extent cache shrink mechanism to avoid to burn CPU time
> 
> I do really appreicate that if you could try your testing again against
> this patch.  I just want to make sure that this problem has been fixed.
> At least in my own testing it looks fine.

I'll redo them when 3.11-rc1 comes around. I'll let you know how
much better it is, and where the next ring of the onion lies.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-09  1:15     ` Chris Mason
  (?)
@ 2013-07-09  1:26     ` Dave Chinner
  2013-07-09  1:54         ` Chris Mason
  -1 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2013-07-09  1:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, xfs

On Mon, Jul 08, 2013 at 09:15:33PM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-07-08 08:44:53)
> > [cc fsdevel because after all the XFS stuff I did a some testing on
> > mmotm w.r.t per-node LRU lock contention avoidance, and also some
> > scalability tests against ext4 and btrfs for comparison on some new
> > hardware. That bit ain't pretty. ]
> > 
> > And, well, the less said about btrfs unlinks the better:
> > 
> > +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > +  33.18%  [kernel]  [k] __write_lock_failed
> > +  17.96%  [kernel]  [k] __read_lock_failed
> > +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> > +   0.82%  [kernel]  [k] __do_softirq
> > +   0.53%  [kernel]  [k] btrfs_tree_lock
> > +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> > +   0.41%  [kernel]  [k] do_raw_read_lock
> > +   0.39%  [kernel]  [k] do_raw_write_lock
> > +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> > +   0.37%  [kernel]  [k] free_extent_buffer
> > +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> > +   0.32%  [kernel]  [k] do_raw_write_unlock
> > 
> 
> Hi Dave,
> 
> Thanks for doing these runs.  At least on Btrfs the best way to resolve
> the tree locking today is to break things up into more subvolumes.

Sure, but you can't do that for most workloads. Only on specialised
workloads (e.g. hashed directory tree based object stores) is this
really a viable option....

> I've
> got another run at the root lock contention in the queue after I get
> the skiplists in place in a few other parts of the Btrfs code.

It will be interesting to see how these new structures play out ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BULK] Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-09  1:26     ` Dave Chinner
@ 2013-07-09  1:54         ` Chris Mason
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Mason @ 2013-07-09  1:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

Quoting Dave Chinner (2013-07-08 21:26:14)
> On Mon, Jul 08, 2013 at 09:15:33PM -0400, Chris Mason wrote:
> > Quoting Dave Chinner (2013-07-08 08:44:53)
> > > [cc fsdevel because after all the XFS stuff I did a some testing on
> > > mmotm w.r.t per-node LRU lock contention avoidance, and also some
> > > scalability tests against ext4 and btrfs for comparison on some new
> > > hardware. That bit ain't pretty. ]
> > > 
> > > And, well, the less said about btrfs unlinks the better:
> > > 
> > > +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > +  33.18%  [kernel]  [k] __write_lock_failed
> > > +  17.96%  [kernel]  [k] __read_lock_failed
> > > +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > +   0.82%  [kernel]  [k] __do_softirq
> > > +   0.53%  [kernel]  [k] btrfs_tree_lock
> > > +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> > > +   0.41%  [kernel]  [k] do_raw_read_lock
> > > +   0.39%  [kernel]  [k] do_raw_write_lock
> > > +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> > > +   0.37%  [kernel]  [k] free_extent_buffer
> > > +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> > > +   0.32%  [kernel]  [k] do_raw_write_unlock
> > > 
> > 
> > Hi Dave,
> > 
> > Thanks for doing these runs.  At least on Btrfs the best way to resolve
> > the tree locking today is to break things up into more subvolumes.
> 
> Sure, but you can't do that most workloads. Only on specialised
> workloads (e.g. hashed directory tree based object stores) is this
> really a viable option....

Yes and no.  It makes a huge difference even when you have 8 procs all
working on the same 8 subvolumes.  It's not perfect but it's all I
have ;)

> 
> > I've
> > got another run at the root lock contention in the queue after I get
> > the skiplists in place in a few other parts of the Btrfs code.
> 
> It will be interesting to see how these new structures play out ;)

The skiplists don't translate well to the tree roots, so I'll probably
have to do something different there.  But I'll get the onion peeled one
way or another.

-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  2013-07-08 12:44   ` Dave Chinner
@ 2013-07-09  8:26     ` Dave Chinner
  -1 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2013-07-09  8:26 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel

On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
> [cc fsdevel because after all the XFS stuff I did a some testing on
> mmotm w.r.t per-node LRU lock contention avoidance, and also some
> scalability tests against ext4 and btrfs for comparison on some new
> hardware. That bit ain't pretty. ]

A quick follow-up on mmotm:

> FWIW, the mmotm kernel (which has a fair bit of debug enabled, so
> not quite comparitive) doesn't have any LRU lock contention to speak
> of. For create:
> 
> -   7.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 70.98% _raw_spin_lock
>          + 97.55% xfs_log_commit_cil
>          + 0.93% __d_instantiate
>          + 0.58% inode_sb_list_add
>       - 29.02% do_raw_spin_lock
>          - _raw_spin_lock
>             + 41.14% xfs_log_commit_cil
>             + 8.29% _xfs_buf_find
>             + 8.00% xfs_iflush_cluster

So I just ported all my prototype sync and inode_sb_list_lock
changes across to mmotm, as well as the XFS CIL optimisations.

-   2.33%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.14% do_raw_spin_lock
         - _raw_spin_lock
            + 16.91% _xfs_buf_find
            + 15.20% list_lru_add
            + 12.83% xfs_log_commit_cil
            + 11.18% d_alloc
            + 7.43% dput
            + 4.56% __d_instantiate
....

Most of the spinlock contention has gone away.
 

> And the walk:
> 
> -  26.37%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 49.10% _raw_spin_lock
>          - 50.65% evict
...
>          - 26.99% list_lru_add
>             + 89.01% inode_add_lru
>             + 10.95% dput
>          + 7.03% __remove_inode_hash
>       - 40.65% do_raw_spin_lock
>          - _raw_spin_lock
>             - 41.96% evict
....
>             - 13.55% list_lru_add
>                  84.33% inode_add_lru
....
>             + 10.10% __remove_inode_hash                                                                                                                               
>                     system_call_fastpath

-  15.44%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 46.59% _raw_spin_lock
         + 69.40% list_lru_add
           17.65% list_lru_del
           5.70% list_lru_count_node
           2.44% shrink_dentry_list
              prune_dcache_sb
              super_cache_scan
              shrink_slab
           0.86% __page_check_address
      - 33.06% do_raw_spin_lock
         - _raw_spin_lock
            + 36.96% list_lru_add
            + 11.98% list_lru_del
            + 6.68% shrink_dentry_list
            + 6.43% d_alloc
            + 4.79% _xfs_buf_find
.....
      + 11.48% do_raw_spin_trylock
      + 8.87% _raw_spin_trylock

So now we see that CPU wasted on contention is down by 40%.
Observation shows that most of the list_lru_add/list_lru_del
contention occurs when reclaim is running - before memory filled
up, the lookup rate was on the high side of 600,000 inodes/s, but
fell back to about 425,000/s once reclaim started working.

> 
> There's quite a different pattern of contention - it has moved
> inward to evict which implies the inode_sb_list_lock is the next
> obvious point of contention. I have patches in the works for that.
> Also, the inode_hash_lock is causing some contention, even though we
> fake inode hashing. I have a patch to fix that for XFS as well.
> 
> I also note an interesting behaviour of the per-node inode LRUs -
> the contention is coming from the dentry shrinker on one node
> freeing inodes allocated on a different node during reclaim. There's
> scope for improvement there.
> 
> But here' the interesting part:
> 
> Kernel	    create		walk		unlink
> 	time(s)	 rate		time(s)		time(s)
> 3.10-cil  222	266k+-32k	  170		  295
> mmotm	  251	222k+-16k	  128		  356

mmotm-cil  225  258k+-26k	  122		  296

So even with all the debug on, the mmotm kernel with most of the
mods I was running in 3.10-cil, plus the s_inodes ->list_lru
conversion, gets the same throughput for create and unlink and has
much better walk times.

> Even with all the debug enabled, the overall walk time dropped by
> 25% to 128s. So performance in this workload has substantially
> improved because of the per-node LRUs and variability is also down
> as well, as predicted. Once I add all the tweaks I have in the
> 3.10-cil tree to mmotm, I expect significant improvements to create
> and unlink performance as well...
> 
> So, lets look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
> 
> 	    create		 walk		unlink
> 	 time(s)   rate		time(s)		time(s)
> xfs	  222	266k+-32k	  170		  295
> ext4	  978	 54k+- 2k	  325		 2053
> btrfs	 1223	 47k+- 8k	  366		12000(*)
> 
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.

So, let's run these again on my current mmotm tree - it has the ext4
extent tree fixes in it and my rcu inode hash lookup patch...

	    create		 walk		unlink
	 time(s)   rate		time(s)		time(s)
xfs	  225	258k+-26k	  122		  296
ext4	  456	118k+- 4k	  128		 1632
btrfs	 1122	 51k+- 3k	  281		 3200(*)

(*) about 4.7 million inodes removed in 5 minutes.

ext4 is a lot healthier: create speed doubles from the extent cache
lock contention fixes, and the walk time halves due to the rcu inode
cache lookup. That said, it is still burning a huge amount of CPU on
the inode_hash_lock adding and removing inodes. Unlink perf is a bit
faster, but still slow.  So, yeah, things will get better in the
not-too-distant future...

And for btrfs? Well, create is a tiny bit faster, the walk is 20%
faster thanks to the rcu hash lookups, and unlinks are markedly
faster (3x). Still not fast enough for me to hang around waiting for
them to complete, though.

FWIW, while the results are a lot better for ext4, let me just point
out how hard it is driving the storage to get that performance:

load	|    create  |	    walk    |		unlink
IO type	|    write   |	    read    |	   read	    |	   write
	| IOPS	 BW  |	 IOPS	 BW |	IOPS	BW  |	 IOPS	 BW
--------+------------+--------------+---------------+--------------
xfs	|  900	200  |	18000	140 |	7500	50  |	  400	 50
ext4	|23000	390  |	55000	200 |	2000	10  |	13000	160
btrfs(*)|peaky	 75  |	26000	100 |	decay	10  |	peaky peaky

ext4 is hammering the SSDs far harder than XFS, both in terms of
IOPS and bandwidth. You do not want to run ext4 on your SSD if you
have a metadata intensive workload as it will age the SSD much, much
faster than XFS with that sort of write behaviour.

(*) the btrfs create IO pattern is 5s peaks of write IOPS every 30s.
The baseline is about 500 IOPS, but the peaks reach upwards of
30,000 write IOPS. Unlink does this as well.  There are also short
bursts of 2-3000 read IOPS just before the write IOPS bursts in the
create workload. For the unlink, it starts off with about 10,000
read IOPS, and goes quickly into exponential decay down to about
2000 read IOPS in 90s.  Then it hits some trigger and the cycle
starts again. The trigger appears to coincide with 1-2 million
dentries being reclaimed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-07-09  8:26 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-01  5:44 [PATCH] xfs: optimise CIL insertion during transaction commit [RFC] Dave Chinner
2013-07-04  2:09 ` Mark Tinguely
2013-07-08 12:44 ` Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC]) Dave Chinner
2013-07-08 12:44   ` Dave Chinner
2013-07-08 13:59   ` Jan Kara
2013-07-08 15:22     ` Marco Stornelli
2013-07-08 15:22       ` Marco Stornelli
2013-07-08 15:38       ` Jan Kara
2013-07-09  0:15         ` Dave Chinner
2013-07-09  0:15           ` Dave Chinner
2013-07-09  0:56       ` Theodore Ts'o
2013-07-09  0:56         ` Theodore Ts'o
2013-07-09  0:43   ` Zheng Liu
2013-07-09  0:43     ` Zheng Liu
2013-07-09  1:23     ` Dave Chinner
2013-07-09  1:23       ` Dave Chinner
2013-07-09  1:15   ` Chris Mason
2013-07-09  1:15     ` Chris Mason
2013-07-09  1:26     ` Dave Chinner
2013-07-09  1:54       ` [BULK] " Chris Mason
2013-07-09  1:54         ` Chris Mason
2013-07-09  8:26   ` Dave Chinner
2013-07-09  8:26     ` Dave Chinner
