* [PATCH 00/45 v3] xfs: consolidated log and optimisation changes
@ 2021-03-05  5:10 Dave Chinner
  2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
                   ` (44 more replies)
  0 siblings, 45 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:10 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is a consolidated patchset of all the outstanding patches I've
sent out recently. Previous versions and sub-series descriptions
can be found here:

https://lore.kernel.org/linux-xfs/20210226050158.GW4662@dread.disaster.area/T/#mf72a4f6acc05d117adec3fea5f6fee83432ecfd4
https://lore.kernel.org/linux-xfs/20210223033442.3267258-1-david@fromorbit.com/T/#m80819d7f7940e5f216f3219baf4f98ed35643d13
https://lore.kernel.org/linux-xfs/20210223053212.3287398-1-david@fromorbit.com/T/#mad305a62ab3532493c83b4c615f21fbaf9a87ae0
https://lore.kernel.org/linux-xfs/20210223044636.3280862-1-david@fromorbit.com/T/#m791941d515fd437bf07aa42d35df5ddb124cb80f
https://lore.kernel.org/linux-xfs/20210224063459.3436852-1-david@fromorbit.com/T/#mcb037e1495e6bb4f3289e52d581e53efe2d29765
https://lore.kernel.org/linux-xfs/20210223054748.3292734-1-david@fromorbit.com/T/#md88d8f88657f1e33008f864677921f54b4d64a2f
https://lore.kernel.org/linux-xfs/20210225033725.3558450-1-david@fromorbit.com/T/#mb6a91780514d5abe4d8b04ed1d78a8f8d681f101

The changes are largely just a rebase onto a current TOT kernel, bug
fixes and modifications from review comments. These are itemised in
the change log below. It runs through the fstests auto group fine on
multiple machines here, and performance test numbers are largely
unchanged from previous versions.

I've added a couple of new patches to the end of the series, the
most notable being an update to the delayed logging design document
to include descriptions of transaction types, log space accounting
and how we use relogging to ensure rolling transactions do not
deadlock on log space.
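
For reference, the rolling transaction pattern that document now
describes follows the shape sketched below. This is a simplified
illustration rather than code from this series; the loop condition and
the modification step are placeholders:

	struct xfs_trans	*tp;
	int			error;

	/* permanent reservation: it can be regranted on each roll */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, 0);

	while (!done) {
		/* make a bounded set of modifications and log them... */

		/*
		 * Commit this step and regrant the log reservation. The
		 * joined inode is relogged, moving it forward in the log
		 * so the tail can keep moving and the loop cannot
		 * deadlock on its own log space.
		 */
		error = xfs_trans_roll_inode(&tp, ip);
		if (error)
			goto out_cancel;
	}

	error = xfs_trans_commit(tp);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return error;

out_cancel:
	xfs_trans_cancel(tp);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return error;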

Cheers,

Dave.

Version 3:
- rebase onto 5.12-rc1+
- aggregate many small dependent patchsets into one large one.
- simplify xlog_wait_on_iclog_lsn() back to just a call to xlog_wait_on_iclog()
- remove xfs_blkdev_issue_flush() instead of moving and renaming it.
- pass bio to xfs_flush_bdev_async() so it doesn't need allocation.
- skip cache flush in xfs_flush_bdev_async() if the underlying queue does not
  require it.
- fixed whitespace in xfs_flush_bdev_async()
- remove the implicit external log's data device cache flush code and replace it
  with an explicit flush in the unmount record write so that it works the same
  as the new CIL checkpoint cache pre-flush mechanism. This mechanism now
  guarantees metadata vs journal ordering for both internal and external logs.
- updated various commit messages
- fixed incorrect/unintended changes to xfs_log_force() behaviour
- typedef uint64_t xfs_csn_t; and conversion.
- removed stray trace_printk()s that were used for debugging.
- fixed minor formatting details.
- uninlined xlog_prepare_iovec()
- fixed up "lv chain vector and size calculation" commit message to reflect we
  are only calculating and passing in the vector byte count.
- reworked the loop in xlog_write_single() based on Christoph's suggestion. Much
  cleaner!
- added patch to pass log ticket down to xlog_sync() so that it accounts the
  roundoff to the log ticket rather than directly modifying grant heads. Grant
  heads are hot, so every little bit helps.
- added patch to update delayed logging design doc with background material on
  how transactions and log space accounting works in XFS.

Version 2:
- fix ticket reservation roundoff to include 2 roundoffs
- removed stale copied comment from roundoff initialisation.
- clarified "separation" to mean "separation for ordering purposes" in commit
  message.
- added comment that newly activated, clean, empty iclogs have an LSN of 0 so are
  captured by the "iclog lsn < start_lsn" case that avoids needing to wait
  before releasing the commit iclog to be written.
- added async cache flush infrastructure
- convert CIL checkpoint push work to issue an unconditional metadata device
  cache flush rather than asking the first iclog write to issue it via
  REQ_PREFLUSH.
- cleaned up xlog_write() to remove a redundant parameter and prepare the logic
  for setting flags on the iclog based on the type of operation whose data is
  being written to the log.
- added XLOG_ICL_NEED_FUA flag to complement the NEED_FLUSH flag, allowing
  callers to issue explicit flushes and clear the NEED_FLUSH flag before the
  iclog is written without dropping the REQ_FUA requirement on the iclog write.
- added CIL commit-in-start-iclog optimisation that clears the NEED_FLUSH flag
  to avoid an unnecessary cache flush when issuing the iclog.
- fixed typo in CIL throttle bugfix comment.
- fixed trailing whitespace in commit message.




* [PATCH 01/45] xfs: initialise attr fork on inode create
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
@ 2021-03-05  5:10 ` Dave Chinner
  2021-03-08 22:20   ` Darrick J. Wong
  2021-03-16  8:35   ` Christoph Hellwig
  2021-03-05  5:11 ` [PATCH 02/45] xfs: log stripe roundoff is a property of the log Dave Chinner
                   ` (43 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.

This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.

This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:

 18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
  3.57%  [kernel]                [k] do_raw_spin_lock
  2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
  2.48%  [kernel]                [k] memcpy
  2.34%  [kernel]                [k] xfs_log_commit_cil

The typical profile from running fsmark on a selinux enabled
filesystem shows this overhead added to the create path:

  - 15.30% xfs_init_security
     - 15.23% security_inode_init_security
	- 13.05% xfs_initxattrs
	   - 12.94% xfs_attr_set
	      - 6.75% xfs_bmap_add_attrfork
		 - 5.51% xfs_trans_commit
		    - 5.48% __xfs_trans_commit
		       - 5.35% xfs_log_commit_cil
			  - 3.86% _raw_spin_lock
			     - do_raw_spin_lock
				  __pv_queued_spin_lock_slowpath
		 - 0.70% xfs_trans_alloc
		      0.52% xfs_trans_reserve
	      - 5.41% xfs_attr_set_args
		 - 5.39% xfs_attr_set_shortform.constprop.0
		    - 4.46% xfs_trans_commit
		       - 4.46% __xfs_trans_commit
			  - 4.33% xfs_log_commit_cil
			     - 2.74% _raw_spin_lock
				- do_raw_spin_lock
				     __pv_queued_spin_lock_slowpath
			       0.60% xfs_inode_item_format
		      0.90% xfs_attr_try_sf_addname
	- 1.99% selinux_inode_init_security
	   - 1.02% security_sid_to_context_force
	      - 1.00% security_sid_to_context_core
		 - 0.92% sidtab_entry_to_string
		    - 0.90% sidtab_sid2str_get
			 0.59% sidtab_sid2str_put.part.0
	   - 0.82% selinux_determine_inode_label
	      - 0.77% security_transition_sid
		   0.70% security_compute_sid.part.0

And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.
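
Concretely, the "couple of lines" in question are the ones the diff
below adds to xfs_init_new_inode():

	ip->i_d.di_forkoff = xfs_default_attroffset(ip) >> 3;
	ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);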

So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:

     - 8.99% xfs_init_security
         - 9.00% security_inode_init_security
            - 6.43% xfs_initxattrs
               - 6.37% xfs_attr_set
                  - 5.45% xfs_attr_set_args
                     - 5.42% xfs_attr_set_shortform.constprop.0
                        - 4.51% xfs_trans_commit
                           - 4.54% __xfs_trans_commit
                              - 4.59% xfs_log_commit_cil
                                 - 2.67% _raw_spin_lock
                                    - 3.28% do_raw_spin_lock
                                         3.08% __pv_queued_spin_lock_slowpath
                                   0.66% xfs_inode_item_format
                        - 0.90% xfs_attr_try_sf_addname
                  - 0.60% xfs_trans_alloc
            - 2.35% selinux_inode_init_security
               - 1.25% security_sid_to_context_force
                  - 1.21% security_sid_to_context_core
                     - 1.19% sidtab_entry_to_string
                        - 1.20% sidtab_sid2str_get
                           - 0.86% sidtab_sid2str_put.part.0
                              - 0.62% _raw_spin_lock_irqsave
                                 - 0.77% do_raw_spin_lock
                                      __pv_queued_spin_lock_slowpath
               - 0.84% selinux_determine_inode_label
                  - 0.83% security_transition_sid
                       0.86% security_compute_sid.part.0

Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |  9 ++++-----
 fs/xfs/libxfs/xfs_inode_fork.c | 20 +++++++++++++++-----
 fs/xfs/libxfs/xfs_inode_fork.h |  2 ++
 fs/xfs/xfs_inode.c             | 24 +++++++++++++++++++++---
 fs/xfs/xfs_inode.h             |  6 ++++--
 fs/xfs/xfs_iops.c              | 34 +++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_qm.c                |  2 +-
 fs/xfs/xfs_symlink.c           |  2 +-
 8 files changed, 81 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index e0905ad171f0..5574d345d066 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1027,7 +1027,9 @@ xfs_bmap_add_attrfork_local(
 	return -EFSCORRUPTED;
 }
 
-/* Set an inode attr fork off based on the format */
+/*
+ * Set an inode attr fork offset based on the format of the data fork.
+ */
 int
 xfs_bmap_set_attrforkoff(
 	struct xfs_inode	*ip,
@@ -1092,10 +1094,7 @@ xfs_bmap_add_attrfork(
 		goto trans_cancel;
 	ASSERT(ip->i_afp == NULL);
 
-	ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone,
-				      GFP_KERNEL | __GFP_NOFAIL);
-
-	ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
+	ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
 	ip->i_afp->if_flags = XFS_IFEXTENTS;
 	logflags = 0;
 	switch (ip->i_df.if_format) {
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index e080d7e07643..c606c1a77e5a 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -282,6 +282,19 @@ xfs_dfork_attr_shortform_size(
 	return be16_to_cpu(atp->hdr.totsize);
 }
 
+struct xfs_ifork *
+xfs_ifork_alloc(
+	enum xfs_dinode_fmt	format,
+	xfs_extnum_t		nextents)
+{
+	struct xfs_ifork	*ifp;
+
+	ifp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL);
+	ifp->if_format = format;
+	ifp->if_nextents = nextents;
+	return ifp;
+}
+
 int
 xfs_iformat_attr_fork(
 	struct xfs_inode	*ip,
@@ -293,11 +306,8 @@ xfs_iformat_attr_fork(
 	 * Initialize the extent count early, as the per-format routines may
 	 * depend on it.
 	 */
-	ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL);
-	ip->i_afp->if_format = dip->di_aformat;
-	if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */
-		ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
-	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents);
+	ip->i_afp = xfs_ifork_alloc(dip->di_aformat,
+				be16_to_cpu(dip->di_anextents));
 
 	switch (ip->i_afp->if_format) {
 	case XFS_DINODE_FMT_LOCAL:
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 9e2137cd7372..a0717ab0e5c5 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -141,6 +141,8 @@ static inline int8_t xfs_ifork_format(struct xfs_ifork *ifp)
 	return ifp->if_format;
 }
 
+struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format,
+				xfs_extnum_t nextents);
 struct xfs_ifork *xfs_iext_state_to_fork(struct xfs_inode *ip, int state);
 
 int		xfs_iformat_data_fork(struct xfs_inode *, struct xfs_dinode *);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 46a861d55e48..bed2beb169e4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -774,6 +774,7 @@ xfs_init_new_inode(
 	xfs_nlink_t		nlink,
 	dev_t			rdev,
 	prid_t			prid,
+	bool			init_xattrs,
 	struct xfs_inode	**ipp)
 {
 	struct inode		*dir = pip ? VFS_I(pip) : NULL;
@@ -877,6 +878,20 @@ xfs_init_new_inode(
 		ASSERT(0);
 	}
 
+	/*
+	 * If we need to create attributes immediately after allocating the
+	 * inode, initialise an empty attribute fork right now. We use the
+	 * default fork offset for attributes here as we don't know exactly what
+	 * size or how many attributes we might be adding. We can do this
+	 * safely here because we know the data fork is completely empty and
+	 * this saves us from needing to run a separate transaction to set the
+	 * fork offset in the immediate future.
+	 */
+	if (init_xattrs) {
+		ip->i_d.di_forkoff = xfs_default_attroffset(ip) >> 3;
+		ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
+	}
+
 	/*
 	 * Log the new values stuffed into the inode.
 	 */
@@ -910,6 +925,7 @@ xfs_dir_ialloc(
 	xfs_nlink_t		nlink,
 	dev_t			rdev,
 	prid_t			prid,
+	bool			init_xattrs,
 	struct xfs_inode	**ipp)
 {
 	struct xfs_buf		*agibp;
@@ -937,7 +953,7 @@ xfs_dir_ialloc(
 	ASSERT(ino != NULLFSINO);
 
 	return xfs_init_new_inode(mnt_userns, *tpp, dp, ino, mode, nlink, rdev,
-				  prid, ipp);
+				  prid, init_xattrs, ipp);
 }
 
 /*
@@ -982,6 +998,7 @@ xfs_create(
 	struct xfs_name		*name,
 	umode_t			mode,
 	dev_t			rdev,
+	bool			init_xattrs,
 	xfs_inode_t		**ipp)
 {
 	int			is_dir = S_ISDIR(mode);
@@ -1052,7 +1069,7 @@ xfs_create(
 	 * pointing to itself.
 	 */
 	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, is_dir ? 2 : 1, rdev,
-			       prid, &ip);
+			       prid, init_xattrs, &ip);
 	if (error)
 		goto out_trans_cancel;
 
@@ -1171,7 +1188,8 @@ xfs_create_tmpfile(
 	if (error)
 		goto out_release_dquots;
 
-	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid, &ip);
+	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid,
+				false, &ip);
 	if (error)
 		goto out_trans_cancel;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 88ee4c3930ae..a2cacdb76d55 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -371,7 +371,8 @@ int		xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
 			   struct xfs_inode **ipp, struct xfs_name *ci_name);
 int		xfs_create(struct user_namespace *mnt_userns,
 			   struct xfs_inode *dp, struct xfs_name *name,
-			   umode_t mode, dev_t rdev, struct xfs_inode **ipp);
+			   umode_t mode, dev_t rdev, bool need_xattr,
+			   struct xfs_inode **ipp);
 int		xfs_create_tmpfile(struct user_namespace *mnt_userns,
 			   struct xfs_inode *dp, umode_t mode,
 			   struct xfs_inode **ipp);
@@ -413,7 +414,8 @@ xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
 int		xfs_dir_ialloc(struct user_namespace *mnt_userns,
 			       struct xfs_trans **tpp, struct xfs_inode *dp,
 			       umode_t mode, xfs_nlink_t nlink, dev_t dev,
-			       prid_t prid, struct xfs_inode **ipp);
+			       prid_t prid, bool need_xattr,
+			       struct xfs_inode **ipp);
 
 static inline int
 xfs_itruncate_extents(
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66ebccb5a6ff..a9d466b78646 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -126,6 +126,37 @@ xfs_cleanup_inode(
 	xfs_remove(XFS_I(dir), &teardown, XFS_I(inode));
 }
 
+/*
+ * Check to see if we are likely to need an extended attribute to be added to
+ * the inode we are about to allocate. This allows the attribute fork to be
+ * created during the inode allocation, reducing the number of transactions we
+ * need to do in this fast path.
+ *
+ * The security checks are optimistic, but not guaranteed. The two LSMs that
+ * require xattrs to be added here (selinux and smack) are also the only two
+ * LSMs that add a sb->s_security structure to the superblock. Hence if security
+ * is enabled and sb->s_security is set, we have a pretty good idea that we are
+ * going to be asked to add a security xattr immediately after allocating the
+ * xfs inode and instantiating the VFS inode.
+ */
+static inline bool
+xfs_create_need_xattr(
+	struct inode	*dir,
+	struct posix_acl *default_acl,
+	struct posix_acl *acl)
+{
+	if (acl)
+		return true;
+	if (default_acl)
+		return true;
+	if (!IS_ENABLED(CONFIG_SECURITY))
+		return false;
+	if (dir->i_sb->s_security)
+		return true;
+	return false;
+}
+
+
 STATIC int
 xfs_generic_create(
 	struct user_namespace	*mnt_userns,
@@ -163,7 +194,8 @@ xfs_generic_create(
 
 	if (!tmpfile) {
 		error = xfs_create(mnt_userns, XFS_I(dir), &name, mode, rdev,
-				   &ip);
+				xfs_create_need_xattr(dir, default_acl, acl),
+				&ip);
 	} else {
 		error = xfs_create_tmpfile(mnt_userns, XFS_I(dir), mode, &ip);
 	}
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index bfa4164990b1..6fde318b9fed 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -788,7 +788,7 @@ xfs_qm_qino_alloc(
 
 	if (need_alloc) {
 		error = xfs_dir_ialloc(&init_user_ns, &tp, NULL, S_IFREG, 1, 0,
-				       0, ipp);
+				       0, false, ipp);
 		if (error) {
 			xfs_trans_cancel(tp);
 			return error;
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 1379013d74b8..162cf69bd982 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -223,7 +223,7 @@ xfs_symlink(
 	 * Allocate an inode for the symlink.
 	 */
 	error = xfs_dir_ialloc(mnt_userns, &tp, dp, S_IFLNK | (mode & ~S_IFMT),
-			       1, 0, prid, &ip);
+			       1, 0, prid, false, &ip);
 	if (error)
 		goto out_trans_cancel;
 
-- 
2.28.0



* [PATCH 02/45] xfs: log stripe roundoff is a property of the log
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
  2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
                   ` (42 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We don't need to look at the xfs_mount and superblock every time we
need to do an iclog roundoff calculation. The property is fixed for
the life of the log, so store the roundoff in the log at mount time
and use that everywhere.

On a debug build:

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
  27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_log_format.h |  3 --
 fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h          |  2 ++
 3 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 8bd00da6d2a4..16587219549c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
 #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
 #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
 #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
-#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
-                                 (log)->l_mp->m_sb.sb_logsunit)
-#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
 
 #define XLOG_HEADER_SIZE	512
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 06041834daa3..fa284f26d10e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1399,6 +1399,11 @@ xlog_alloc_log(
 	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
 	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
 
+	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
+		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
+	else
+		log->l_iclog_roundoff = BBSIZE;
+
 	xlog_grant_head_init(&log->l_reserve_head);
 	xlog_grant_head_init(&log->l_write_head);
 
@@ -1852,29 +1857,15 @@ xlog_calc_iclog_size(
 	uint32_t		*roundoff)
 {
 	uint32_t		count_init, count;
-	bool			use_lsunit;
-
-	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-			log->l_mp->m_sb.sb_logsunit > 1;
 
 	/* Add for LR header */
 	count_init = log->l_iclog_hsize + iclog->ic_offset;
+	count = roundup(count_init, log->l_iclog_roundoff);
 
-	/* Round out the log write size */
-	if (use_lsunit) {
-		/* we have a v2 stripe unit to use */
-		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
-	} else {
-		count = BBTOB(BTOBB(count_init));
-	}
-
-	ASSERT(count >= count_init);
 	*roundoff = count - count_init;
 
-	if (use_lsunit)
-		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
-	else
-		ASSERT(*roundoff < BBTOB(1));
+	ASSERT(count >= count_init);
+	ASSERT(*roundoff < log->l_iclog_roundoff);
 	return count;
 }
 
@@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs(
 	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
 
 	/* Round up to next log-sunit */
-	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-	    log->l_mp->m_sb.sb_logsunit > 1) {
-		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
-		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
+	if (log->l_iclog_roundoff > BBSIZE) {
+		log->l_curr_block = roundup(log->l_curr_block,
+						BTOBB(log->l_iclog_roundoff));
 	}
 
 	if (log->l_curr_block >= log->l_logBBsize) {
@@ -3404,12 +3394,11 @@ xfs_log_ticket_get(
  * Figure out the total log space unit (in bytes) that would be
  * required for a log ticket.
  */
-int
-xfs_log_calc_unit_res(
-	struct xfs_mount	*mp,
+static int
+xlog_calc_unit_res(
+	struct xlog		*log,
 	int			unit_bytes)
 {
-	struct xlog		*log = mp->m_log;
 	int			iclog_space;
 	uint			num_headers;
 
@@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res(
 	/* for commit-rec LR header - note: padding will subsume the ophdr */
 	unit_bytes += log->l_iclog_hsize;
 
-	/* for roundoff padding for transaction data and one for commit record */
-	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
-		/* log su roundoff */
-		unit_bytes += 2 * mp->m_sb.sb_logsunit;
-	} else {
-		/* BB roundoff */
-		unit_bytes += 2 * BBSIZE;
-        }
+	/* roundoff padding for transaction data and one for commit record */
+	unit_bytes += 2 * log->l_iclog_roundoff;
 
 	return unit_bytes;
 }
 
+int
+xfs_log_calc_unit_res(
+	struct xfs_mount	*mp,
+	int			unit_bytes)
+{
+	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+}
+
 /*
  * Allocate and initialise a new log ticket.
  */
@@ -3513,7 +3504,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1c6fdbf3d506..037950cf1061 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -436,6 +436,8 @@ struct xlog {
 #endif
 	/* log recovery lsn tracking (for buffer submission */
 	xfs_lsn_t		l_recovery_lsn;
+
+	uint32_t		l_iclog_roundoff;/* padding roundoff */
 };
 
 #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
-- 
2.28.0



* [PATCH 03/45] xfs: separate CIL commit record IO
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
  2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
  2021-03-05  5:11 ` [PATCH 02/45] xfs: log stripe roundoff is a property of the log Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08  8:34   ` Chandan Babu R
                     ` (2 more replies)
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
                   ` (41 subsequent siblings)
  44 siblings, 3 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To allow for iclog IO device cache flush behaviour to be optimised,
we first need to separate out the commit record iclog IO from the
rest of the checkpoint so we can wait for the checkpoint IO to
complete before we issue the commit record.

This separation is only necessary if the commit record is being
written into a different iclog to the start of the checkpoint, as the
upcoming cache flushing changes require completion ordering against
the other iclogs submitted by the checkpoint.

If the entire checkpoint and commit is in the one iclog, then they
are both covered by the one set of cache flush primitives on the
iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete
so they are ordered correctly and made stable by the REQ_PREFLUSH
that the commit record iclog IO issues. This guarantees that if a
reader sees the commit record in the journal, they will also see the
entire checkpoint that commit record closes off.

This also provides the guarantee that when the commit record IO
completes, we can safely unpin all the log items in the checkpoint
so they can be written back because the entire checkpoint is stable
in the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 8 +++++---
 fs/xfs/xfs_log_cil.c  | 9 +++++++++
 fs/xfs/xfs_log_priv.h | 2 ++
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fa284f26d10e..317c466232d4 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -784,10 +784,12 @@ xfs_log_mount_cancel(
 }
 
 /*
- * Wait for the iclog to be written disk, or return an error if the log has been
- * shut down.
+ * Wait for the iclog and all prior iclogs to be written to disk as required by the
+ * log force state machine. Waiting on ic_force_wait ensures iclog completions
+ * have been ordered and callbacks run before we are woken here, hence
+ * guaranteeing that all the iclogs up to this one are on stable storage.
  */
-static int
+int
 xlog_wait_on_iclog(
 	struct xlog_in_core	*iclog)
 		__releases(iclog->ic_log->l_icloglock)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..1e5fd6f268c2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -870,6 +870,15 @@ xlog_cil_push_work(
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
+	/*
+	 * If the checkpoint spans multiple iclogs, wait for all previous
+	 * iclogs to complete before we submit the commit_iclog.
+	 */
+	if (ctx->start_lsn != commit_lsn) {
+		spin_lock(&log->l_icloglock);
+		xlog_wait_on_iclog(commit_iclog->ic_prev);
+	}
+
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
 	return;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 037950cf1061..ee7786b33da9 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -584,6 +584,8 @@ xlog_wait(
 	remove_wait_queue(wq, &wait);
 }
 
+int xlog_wait_on_iclog(struct xlog_in_core *iclog);
+
 /*
  * The LSN is valid so long as it is behind the current LSN. If it isn't, this
  * means that the next log record that includes this metadata could have a
-- 
2.28.0



* [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (2 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08  9:31   ` Chandan Babu R
                     ` (3 more replies)
  2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
                   ` (40 subsequent siblings)
  44 siblings, 4 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's a one line wrapper around blkdev_issue_flush(). Just replace it
with direct calls to blkdev_issue_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c   | 2 +-
 fs/xfs/xfs_file.c  | 6 +++---
 fs/xfs/xfs_log.c   | 2 +-
 fs/xfs/xfs_super.c | 7 -------
 fs/xfs/xfs_super.h | 1 -
 5 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 37a1d12762d8..7043546a04b8 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1958,7 +1958,7 @@ xfs_free_buftarg(
 	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
-	xfs_blkdev_issue_flush(btp);
+	blkdev_issue_flush(btp->bt_bdev);
 
 	kmem_free(btp);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..24c7f45fc4eb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -197,9 +197,9 @@ xfs_file_fsync(
 	 * inode size in case of an extending write.
 	 */
 	if (XFS_IS_REALTIME_INODE(ip))
-		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
+		blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
 
 	/*
 	 * Any inode that has dirty modifications in the log is pinned.  The
@@ -219,7 +219,7 @@ xfs_file_fsync(
 	 */
 	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
 	    mp->m_logdev_targp == mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
 
 	return error;
 }
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 317c466232d4..fee76c485727 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1962,7 +1962,7 @@ xlog_sync(
 	 * layer state machine for preflushes.
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
-		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
+		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 		need_flush = false;
 	}
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e5e0713bebcd..ca2cb0448b5e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -339,13 +339,6 @@ xfs_blkdev_put(
 		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 }
 
-void
-xfs_blkdev_issue_flush(
-	xfs_buftarg_t		*buftarg)
-{
-	blkdev_issue_flush(buftarg->bt_bdev);
-}
-
 STATIC void
 xfs_close_devices(
 	struct xfs_mount	*mp)
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index 1ca484b8357f..79cb2dece811 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -88,7 +88,6 @@ struct block_device;
 
 extern void xfs_quiesce_attr(struct xfs_mount *mp);
 extern void xfs_flush_inodes(struct xfs_mount *mp);
-extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
 extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
 					   xfs_agnumber_t agcount);
 
-- 
2.28.0



* [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (3 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08  9:48   ` Chandan Babu R
                     ` (2 more replies)
  2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
                   ` (39 subsequent siblings)
  44 siblings, 3 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The new checkpoint cache flush mechanism requires us to issue an
unconditional cache flush before we start a new checkpoint. We don't
want to block for this if we can help it, and we have a fair chunk
of CPU work to do between starting the checkpoint and issuing the
first journal IO.

Hence it makes sense to amortise the latency cost of the cache flush
by issuing it asynchronously and then waiting for it only when we
need to issue the first IO in the transaction.

To do this, we need async cache flush primitives to submit the cache
flush bio and to wait on it. The block layer has no such primitives
for filesystems, so roll our own for the moment.
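
The intended usage pattern, which patch 6 in this series adopts in
xlog_cil_push_work(), is roughly:

	struct bio		bio;
	DECLARE_COMPLETION_ONSTACK(bdev_flush);

	/* start the flush; this call does not block */
	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
				&bdev_flush);

	/* ... a fair chunk of CPU work: format the checkpoint ... */

	/* block only when the first journal IO needs the flush completed */
	wait_for_completion(&bdev_flush);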

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_linux.h  |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index 17f36db2f792..668f8bd27b4a 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return bio_max_segs(howmany(count, PAGE_SIZE));
 }
 
+void
+xfs_flush_bdev_async_endio(
+	struct bio	*bio)
+{
+	if (bio->bi_private)
+		complete(bio->bi_private);
+}
+
+/*
+ * Submit a request for an async cache flush to run. If the request queue does
+ * not require flush operations, just skip it altogether. If the caller needsi
+ * to wait for the flush completion at a later point in time, they must supply a
+ * valid completion. This will be signalled when the flush completes.  The
+ * caller never sees the bio that is issued here.
+ */
+void
+xfs_flush_bdev_async(
+	struct bio		*bio,
+	struct block_device	*bdev,
+	struct completion	*done)
+{
+	struct request_queue	*q = bdev->bd_disk->queue;
+
+	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
+		complete(done);
+		return;
+	}
+
+	bio_init(bio, NULL, 0);
+	bio_set_dev(bio, bdev);
+	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
+	bio->bi_private = done;
+	bio->bi_end_io = xfs_flush_bdev_async_endio;
+
+	submit_bio(bio);
+}
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index af6be9b9ccdf..953d98bc4832 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
+void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
+		struct completion *done);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
-- 
2.28.0



* [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (4 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-15 14:43   ` Brian Foster
  2021-03-16  8:47   ` Christoph Hellwig
  2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
                   ` (38 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

These rules apply to the atomic transactions recorded in the
journal, not to the journal IO itself. Hence we need to ensure
metadata is stable before we start writing a new transaction to the
journal (guarantee #1), and we need to ensure the entire transaction
is stable in the journal before we start metadata writeback
(guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
being added to every iclog IO. This causes the journal IO to issue a
cache flush and wait for it to complete before issuing the write IO
to the journal. Hence all completed metadata IO is guaranteed to be
stable before the journal overwrites the old metadata.

However, for long running CIL checkpoints that might do a thousand
journal IOs, we don't need every single one of these iclog IOs to
issue a cache flush - the cache flush done before the first iclog is
submitted is sufficient to cover the entire range in the log that
the checkpoint will overwrite because the CIL space reservation
guarantees the tail of the log (completed metadata) is already
beyond the range of the checkpoint write.

Hence we only need a full cache flush between closing off the CIL
checkpoint context (i.e. when the push switches it out) and issuing
the first journal IO. Rather than plumbing this through to the
journal IO, we can start this cache flush the moment the CIL context
is owned exclusively by the push worker. The cache flush can be in
progress while we process the CIL ready for writing, hence
reducing the latency of the initial iclog write. This is especially
true for large checkpoints, where we might have to process hundreds
of thousands of log vectors before we issue the first iclog write.
In these cases, it is likely the cache flush has already been
completed by the time we have built the CIL log vector chain.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log_cil.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1e5fd6f268c2..b4cdb8b6c4c3 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -656,6 +656,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	struct bio		bio;
+	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -719,10 +721,25 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * The CIL is stable at this point - nothing new will be added to it
+	 * because we hold the flush lock exclusively. Hence we can now issue
+	 * a cache flush to ensure all the completed metadata in the journal we
+	 * are about to overwrite is on stable storage.
+	 *
+	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
+	 * cache flushes to provide this ordering guarantee, and hence for CIL
+	 * checkpoints that require hundreds or thousands of log writes no
+	 * longer need to issue device cache flushes to provide metadata
+	 * writeback ordering.
+	 */
+	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
+				&bdev_flush);
+
+	/*
+	 * Pull all the log vectors off the items in the CIL, and remove the
+	 * items from the CIL. We don't need the CIL lock here because it's only
+	 * needed on the transaction commit side which is currently locked out
+	 * by the flush lock.
 	 */
 	lv = NULL;
 	num_iovecs = 0;
@@ -806,6 +823,12 @@ xlog_cil_push_work(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
+	/*
+	 * Before we format and submit the first iclog, we have to ensure that
+	 * the metadata writeback ordering cache flush is complete.
+	 */
+	wait_for_completion(&bdev_flush);
+
 	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
 	if (error)
 		goto out_abort_free_ticket;
-- 
2.28.0



* [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write()
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (5 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-15 14:45   ` Brian Foster
  2021-03-16 14:15   ` Christoph Hellwig
  2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
                   ` (37 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The CIL push is the only call to xlog_write that sets this variable
to true. The other callers don't need a start rec, and they tell
xlog_write what to do by passing the type of ophdr they need written
in the flags field. The need_start_rec parameter essentially tells
xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
so get rid of the variable to do this and pass XLOG_START_TRANS as
the flag value into xlog_write() from the CIL push.

$ size fs/xfs/xfs_log.o*
  text	   data	    bss	    dec	    hex	filename
 27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
 27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
 fs/xfs/xfs_log_cil.c  |  3 ++-
 fs/xfs/xfs_log_priv.h |  3 +--
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fee76c485727..364694a83de6 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -818,9 +818,7 @@ xlog_wait_on_iclog(
 static int
 xlog_write_unmount_record(
 	struct xlog		*log,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*lsn,
-	uint			flags)
+	struct xlog_ticket	*ticket)
 {
 	struct xfs_unmount_log_format ulf = {
 		.magic = XLOG_UNMOUNT_TYPE,
@@ -837,7 +835,7 @@ xlog_write_unmount_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
-	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -851,15 +849,13 @@ xlog_unmount_write(
 	struct xfs_mount	*mp = log->l_mp;
 	struct xlog_in_core	*iclog;
 	struct xlog_ticket	*tic = NULL;
-	xfs_lsn_t		lsn;
-	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
 
 	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
 	if (error)
 		goto out_err;
 
-	error = xlog_write_unmount_record(log, tic, &lsn, flags);
+	error = xlog_write_unmount_record(log, tic);
 	/*
 	 * At this point, we're umounting anyway, so there's no point in
 	 * transitioning log state to IOERROR. Just continue...
@@ -1551,8 +1547,7 @@ xlog_commit_record(
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
-			   false);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2149,13 +2144,16 @@ static int
 xlog_write_calc_vec_length(
 	struct xlog_ticket	*ticket,
 	struct xfs_log_vec	*log_vector,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = need_start_rec ? 1 : 0;
+	int			headers = 0;
 	int			len = 0;
 	int			i;
 
+	if (optype & XLOG_START_TRANS)
+		headers++;
+
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2375,8 +2373,7 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			flags,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
@@ -2404,8 +2401,9 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec);
-	*start_lsn = 0;
+	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
+	if (start_lsn)
+		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 		void		*ptr;
 		int		log_offset;
@@ -2419,7 +2417,7 @@ xlog_write(
 		ptr = iclog->ic_datap + log_offset;
 
 		/* start_lsn is the first lsn written to. That's all we need. */
-		if (!*start_lsn)
+		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
 		/*
@@ -2432,6 +2430,7 @@ xlog_write(
 			int			copy_len;
 			int			copy_off;
 			bool			ordered = false;
+			bool			wrote_start_rec = false;
 
 			/* ordered log vectors have no regions to write */
 			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
@@ -2449,13 +2448,15 @@ xlog_write(
 			 * write a start record. Only do this for the first
 			 * iclog we write to.
 			 */
-			if (need_start_rec) {
+			if (optype & XLOG_START_TRANS) {
 				xlog_write_start_rec(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 						sizeof(struct xlog_op_header));
+				optype &= ~XLOG_START_TRANS;
+				wrote_start_rec = true;
 			}
 
-			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
+			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
 			if (!ophdr)
 				return -EIO;
 
@@ -2486,14 +2487,13 @@ xlog_write(
 			}
 			copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
-			if (need_start_rec) {
+			if (wrote_start_rec) {
 				copy_len += sizeof(struct xlog_op_header);
 				record_cnt++;
-				need_start_rec = false;
 			}
 			data_cnt += contwr ? copy_len : 0;
 
-			error = xlog_write_copy_finish(log, iclog, flags,
+			error = xlog_write_copy_finish(log, iclog, optype,
 						       &record_cnt, &data_cnt,
 						       &partial_copy,
 						       &partial_copy_len,
@@ -2537,7 +2537,7 @@ xlog_write(
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
 	if (commit_iclog) {
-		ASSERT(flags & XLOG_COMMIT_TRANS);
+		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
 		error = xlog_state_release_iclog(log, iclog);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b4cdb8b6c4c3..c04d5d37a3a2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -829,7 +829,8 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
+	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
+				XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ee7786b33da9..56e1942c47df 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint flags,
-		bool need_start_rec);
+		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
-- 
2.28.0



* [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (6 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08 10:49   ` Chandan Babu R
  2021-03-08 12:25   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
                   ` (36 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
causes the journal IO to issue a cache flush and wait for it to
complete before issuing the write IO to the journal. Hence all
completed metadata IO is guaranteed to be stable before the journal
overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which
ensures the journal writes do not complete until they are on stable
storage. Hence by the time the last journal IO in a checkpoint
completes, we know that the entire checkpoint is on stable storage
and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS
way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
("Add support for drive write cache flushing") in the xfs-archive
tree.

A lot has changed since then, most notably we now use delayed
logging to checkpoint the filesystem to the journal rather than
write each individual transaction to the journal. Cache flushes on
journal IO are necessary when individual transactions are wholly
contained within a single iclog. However, CIL checkpoints are single
transactions that typically span hundreds to thousands of individual
journal writes, and so the requirements for device cache flushing
have changed.

That is, the ordering rules I state above apply to ordering of
atomic transactions recorded in the journal, not to the journal IO
itself. Hence we need to ensure metadata is stable before we start
writing a new transaction to the journal (guarantee #1), and we need
to ensure the entire transaction is stable in the journal before we
start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a
new journal transaction to provide #1, and it is not on any other
journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts
writing to the log, so we no longer need the iclog IO to issue a
REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
to xlog_write(), we no longer need to mark the first iclog in
the log write with REQ_PREFLUSH for this case. As an added bonus,
this ordering mechanism works for both internal and external logs,
meaning we can remove the explicit data device cache flushes from
the iclog write code when using external logs.

Given the new ordering semantics of commit records for the CIL, we
need iclogs containing commit records to issue a REQ_PREFLUSH. We
also require unmount records to do this. Hence for both
XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them
immediately on stable storage, so we want to also mark the iclogs
that contain these records to be marked REQ_FUA. That means if a
record is split across multiple iclogs, they are all marked REQ_FUA
and not just the last one so that when the transaction is completed
all the parts of the record are on stable storage.

And for external logs, unmount records need a pre-write data device
cache flush similar to the CIL checkpoint cache pre-flush as the
internal iclog write code does not do this implicitly anymore.

As an optimisation, when the commit record lands in the same iclog
as the journal transaction starts, we don't need to wait for
anything and can simply use REQ_FUA to provide guarantee #2.  This
means that for fsync() heavy workloads, the cache flush behaviour is
completely unchanged and there is no degradation in performance as a
result of optimising the multi-IO transaction case.
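
The commit-in-start-iclog optimisation mentioned in the v2 changelog,
which clears the NEED_FLUSH flag to avoid an unnecessary cache flush,
conceptually reduces to this sketch in the CIL push code:

	/*
	 * The commit record is in the same iclog as the start of the
	 * checkpoint, so there are no prior iclogs to order against and
	 * the pre-flush can be dropped.
	 */
	if (ctx->start_lsn == commit_lsn)
		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;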

The most notable sign that there is less IO latency on my test
machine (nvme SSDs) is that the "noiclogs" rate has dropped
substantially. This metric indicates that the CIL push is blocking
in xlog_get_iclog_space() waiting for iclog IO completion to occur.
With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
is blocking waiting for log IO. With the changes in this patch, this
drops to 1 noiclog event for every 100 iclog writes. Hence it is
clear that log IO is completing much faster than it was previously,
but it is also clear that for large iclog sizes, this isn't the
performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial
difference. With the cache flush modifications, the journal is now
running at over 4000 write IOPS, and the journal throughput is
largely identical to the 256kB iclogs and the noiclog event rate
stays low at about 1:50 iclog writes. The existing code tops out at
about 2500 IOPS as the number of cache flushes dominate performance
and latency. The noiclog event rate is about 1:4, and the
performance variance is quite large as the journal throughput can
fall to less than half the peak sustained rate when the cache flush
rate prevents metadata writeback from keeping up and the log runs
out of space and throttles reservations.

As a result:

	logbsize	fsmark create rate	rm -rf
before	32kb		152851+/-5.3e+04	5m28s
patched	32kb		221533+/-1.1e+04	5m24s

before	256kb		220239+/-6.2e+03	4m58s
patched	256kb		228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the
differences are largely noise. This workload is largely metadata
read IO latency bound and the changes to the journal cache flushing
don't really make any noticeable difference to behaviour apart from
a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 53 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_cil.c  |  7 +++++-
 fs/xfs/xfs_log_priv.h |  4 ++++
 3 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 364694a83de6..ed44d67d7099 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -835,6 +835,14 @@ xlog_write_unmount_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
+
+	/*
+	 * For external log devices, we need to flush the data device cache
+	 * first to ensure all metadata writeback is on stable storage before we
+	 * stamp the tail LSN into the unmount record.
+	 */
+	if (log->l_targ != log->l_mp->m_ddev_targp)
+		blkdev_issue_flush(log->l_targ->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
@@ -1753,8 +1761,7 @@ xlog_write_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
 	uint64_t		bno,
-	unsigned int		count,
-	bool			need_flush)
+	unsigned int		count)
 {
 	ASSERT(bno < log->l_logBBsize);
 
@@ -1792,10 +1799,12 @@ xlog_write_iclog(
 	 * writeback throttle from throttling log writes behind background
 	 * metadata writeback and causing priority inversions.
 	 */
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
-				REQ_IDLE | REQ_FUA;
-	if (need_flush)
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
+		iclog->ic_bio.bi_opf |= REQ_FUA;
+	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -1898,7 +1907,6 @@ xlog_sync(
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
 	uint64_t		bno;
 	unsigned int		size;
-	bool			need_flush = true, split = false;
 
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
 
@@ -1923,10 +1931,8 @@ xlog_sync(
 	bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn));
 
 	/* Do we need to split this write into 2 parts? */
-	if (bno + BTOBB(count) > log->l_logBBsize) {
+	if (bno + BTOBB(count) > log->l_logBBsize)
 		xlog_split_iclog(log, &iclog->ic_header, bno, count);
-		split = true;
-	}
 
 	/* calculcate the checksum */
 	iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header,
@@ -1947,22 +1953,8 @@ xlog_sync(
 			 be64_to_cpu(iclog->ic_header.h_lsn));
 	}
 #endif
-
-	/*
-	 * Flush the data device before flushing the log to make sure all meta
-	 * data written back from the AIL actually made it to disk before
-	 * stamping the new log tail LSN into the log buffer.  For an external
-	 * log we need to issue the flush explicitly, and unfortunately
-	 * synchronously here; for an internal log we can simply use the block
-	 * layer state machine for preflushes.
-	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
-		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
-		need_flush = false;
-	}
-
 	xlog_verify_iclog(log, iclog, count);
-	xlog_write_iclog(log, iclog, bno, count, need_flush);
+	xlog_write_iclog(log, iclog, bno, count);
 }
 
 /*
@@ -2416,10 +2408,21 @@ xlog_write(
 		ASSERT(log_offset <= iclog->ic_size - 1);
 		ptr = iclog->ic_datap + log_offset;
 
-		/* start_lsn is the first lsn written to. That's all we need. */
+		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
+		/*
+		 * iclogs containing commit records or unmount records need
+		 * to issue ordering cache flushes and commit immediately
+		 * to stable storage to guarantee journal vs metadata ordering
+		 * is correctly maintained in the storage media.
+		 */
+		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
+			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
+						XLOG_ICL_NEED_FUA);
+		}
+
 		/*
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index c04d5d37a3a2..263c8d907221 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -896,11 +896,16 @@ xlog_cil_push_work(
 
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog.
+	 * iclogs to complete before we submit the commit_iclog. If it is in the
+	 * same iclog as the start of the checkpoint, then we can skip the iclog
+	 * cache flush because there are no other iclogs we need to order
+	 * against.
 	 */
 	if (ctx->start_lsn != commit_lsn) {
 		spin_lock(&log->l_icloglock);
 		xlog_wait_on_iclog(commit_iclog->ic_prev);
+	} else {
+		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 	}
 
 	/* release the hounds! */
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 56e1942c47df..0552e96d2b64 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -133,6 +133,9 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
@@ -201,6 +204,7 @@ typedef struct xlog_in_core {
 	u32			ic_size;
 	u32			ic_offset;
 	enum xlog_iclog_state	ic_state;
+	unsigned int		ic_flags;
 	char			*ic_datap;	/* pointer to iclog data */
 
 	/* Callback structures need their own cacheline */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (7 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 10/45] xfs: reduce buffer log item shadow allocations Dave Chinner
                   ` (35 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A hang with tasks stuck on the CIL hard throttle was reported and
largely diagnosed by Donald Buczek, who discovered that it was a
result of the CIL context space usage decrementing in committed
transactions once the hard throttle limit had been hit and processes
were already blocked.  This resulted in the CIL push not waking up
those waiters because the CIL context was no longer over the hard
throttle limit.

The surprising aspect of this was the CIL space usage going
backwards regularly enough to trigger this situation. Assumptions
had been made in design that the relogging process would only
increase the size of the objects in the CIL, and so that space would
only increase.

This change fixes the issue, and the commit message documents the
result of an audit of the triggers that can cause the CIL space to
go backwards, how large the backwards steps tend to be, the
frequency with which they occur, and what the impact on the CIL
accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only
do so if the log item is already logged to the CIL and contains a
space reservation for its entire logged state. This is tracked by
the shadow buffer state on the log item. If the item is not
previously logged in the CIL it has no shadow buffer nor log vector,
and hence the entire size of the logged item copied to the log
vector is accounted to the CIL space usage. i.e.  it will always go
up in this case.

If the item has a log vector (i.e. already in the CIL) and the size
decreases, then the existing log vector will be overwritten and the
space usage will go down. This is the only condition where the space
usage reduces, and it can only occur when an item is already tracked
in the CIL. Hence we are safe from CIL space usage underruns as a
result of log items decreasing in size when they are relogged.
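
The accounting consequence is easy to see in a small sketch
(stand-in names; the real delta accounting lives in
xlog_cil_insert_items() and friends):

#include <stdio.h>

/* The CIL charges the delta between the previously logged size of an
 * item and its newly formatted size, so a relog that shrinks the item
 * yields a negative delta. */
static void cil_account_item(long *space_used, long old_size, long new_size)
{
	*space_used += new_size - old_size;
}

int main(void)
{
	long space_used = 0;

	cil_account_item(&space_used, 0, 384);	  /* first log: +384 */
	cil_account_item(&space_used, 384, 128);  /* smaller relog: -256 */
	printf("space_used = %ld\n", space_used); /* 128: it went backwards */
	return 0;
}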

Typically this reduction in CIL usage occurs from metadata blocks
being freed, such as when a btree block merge occurs or a directory
entry/xattr entry is removed and the da-tree is reduced in size.
This generally results in a reduction in size of around a single
block in the CIL, but also tends to increase the number of log
vectors because the parent and sibling nodes in the tree need to be
updated when a btree block is removed. If a multi-level merge
occurs, then we see a reduction in size of 2+ blocks, but again the
log vector count goes up.

The other vector is inode fork size changes, which only log the
current size of the fork and ignore the previously logged size when
the fork is relogged. Hence if we are removing items from the inode
fork (dir/xattr removal in shortform, extent record removal in
extent form, etc), the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a
fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
an intent actually creates a new intent log item and doesn't relog
the old item at all.) Hence the only two vectors for CIL context
size reduction are relogging inode forks and marking buffers active
in the CIL as stale.

Long story short: the majority of the code does the right thing and
handles the reduction in log item size correctly, and only the CIL
hard throttle implementation is problematic and needs fixing. This
patch makes that fix, as well as adds comments in the log item code
that result in items shrinking in size when they are relogged as a
clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it
goes further to ensure that once the throttle is activated, it
captures all tasks until the CIL push issues a wakeup, regardless of
whether the CIL space used has gone back under the throttle
threshold.

This ensures that we prevent tasks reducing the CIL slightly under
the throttle threshold and then making more changes that push it
well over the throttle limit. This is achieved by checking if the
throttle wait queue is already active as a condition of throttling.
Hence once we start throttling, we continue to apply the throttle
until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and
checks as they are all done under the ctx->xc_push_lock. Hence the
waitqueue has external serialisation and we can safely peek inside
the wait queue without holding the internal waitqueue locks.
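
Condensed down, the new throttle condition looks like this sketch
(stand-in names; in the kernel the check is made under
ctx->xc_push_lock as described above):

#include <stdbool.h>

/*
 * Throttle if we are over the blocking limit, OR if anyone is already
 * waiting. The second clause keeps the throttle "sticky" until the
 * push work issues its wakeup, even if space usage dips back under
 * the limit in the meantime.
 */
static bool cil_should_throttle(unsigned long space_used,
				unsigned long blocking_limit,
				bool push_waiters_queued)
{
	return space_used >= blocking_limit || push_waiters_queued;
}

int main(void)
{
	/* engages over the limit; stays engaged while waiters exist */
	return !(cil_should_throttle(100, 50, false) &&
		 cil_should_throttle(10, 50, true));
}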

Many thanks to Donald for his diagnostic and analysis work to
isolate the cause of this hang.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c   | 37 ++++++++++++++++++-------------------
 fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
 fs/xfs/xfs_log_cil.c    | 22 +++++++++++++++++-----
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index dc0be2a639cc..17960b1ce5ef 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,14 +56,12 @@ xfs_buf_log_format_size(
 }
 
 /*
- * This returns the number of log iovecs needed to log the
- * given buf log item.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item segment.
  *
- * It calculates this as 1 iovec for the buf log format structure
- * and 1 for each stretch of non-contiguous chunks to be logged.
- * Contiguous chunks are logged in a single iovec.
- *
- * If the XFS_BLI_STALE flag has been set, then log nothing.
+ * It calculates this as 1 iovec for the buf log format structure and 1 for each
+ * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
+ * in a single iovec.
  */
 STATIC void
 xfs_buf_item_size_segment(
@@ -119,11 +117,8 @@ xfs_buf_item_size_segment(
 }
 
 /*
- * This returns the number of log iovecs needed to log the given buf log item.
- *
- * It calculates this as 1 iovec for the buf log format structure and 1 for each
- * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
- * in a single iovec.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item.
  *
  * Discontiguous buffers need a format structure per region that is being
  * logged. This makes the changes in the buffer appear to log recovery as though
@@ -133,7 +128,11 @@ xfs_buf_item_size_segment(
  * what ends up on disk.
  *
  * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
- * format structures.
+ * format structures. If the item has previously been logged and has dirty
+ * regions, we do not relog them in stale buffers. This has the effect of
+ * reducing the size of the relogged item by the amount of dirty data tracked
+ * by the log item. This can result in the committing transaction reducing the
+ * amount of space being consumed by the CIL.
  */
 STATIC void
 xfs_buf_item_size(
@@ -147,9 +146,9 @@ xfs_buf_item_size(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
-		 * The buffer is stale, so all we need to log
-		 * is the buf log format structure with the
-		 * cancel flag in it.
+		 * The buffer is stale, so all we need to log is the buf log
+		 * format structure with the cancel flag in it as we are never
+		 * going to replay the changes tracked in the log item.
 		 */
 		trace_xfs_buf_item_size_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
@@ -164,9 +163,9 @@ xfs_buf_item_size(
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
-		 * The buffer has been logged just to order it.
-		 * It is not being included in the transaction
-		 * commit, so no vectors are used at all.
+		 * The buffer has been logged just to order it. It is not being
+		 * included in the transaction commit, so no vectors are used at
+		 * all.
 		 */
 		trace_xfs_buf_item_size_ordered(bip);
 		*nvecs = XFS_LOG_VEC_ORDERED;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 17e20a6d8b4e..6ff91e5bf3cd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+/*
+ * The logged size of an inode fork is always the current size of the inode
+ * fork. This means that when an inode fork is relogged, the size of the logged
+ * region is determined by the current state, not the combination of the
+ * previously logged state + the current state. This is different relogging
+ * behaviour to most other log items which will retain the size of the
+ * previously logged changes when smaller regions are relogged.
+ *
+ * Hence operations that remove data from the inode fork (e.g. shortform
+ * dir/attr remove, extent form extent removal, etc), the size of the relogged
+ * inode gets -smaller- rather than stays the same size as the previously logged
+ * size and this can result in the committing transaction reducing the amount of
+ * space being consumed by the CIL.
+ */
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 263c8d907221..2f0adc35d8ec 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,9 +670,14 @@ xlog_cil_push_work(
 	ASSERT(push_seq <= ctx->sequence);
 
 	/*
-	 * Wake up any background push waiters now this context is being pushed.
+	 * As we are about to switch to a new, empty CIL context, we no longer
+	 * need to throttle tasks on CIL space overruns. Wake any waiters that
+	 * the hard push throttle may have caught so they can start committing
+	 * to the new context. The ctx->xc_push_lock provides the serialisation
+	 * necessary for safely using the lockless waitqueue_active() check in
+	 * this context.
 	 */
-	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+	if (waitqueue_active(&cil->xc_push_wait))
 		wake_up_all(&cil->xc_push_wait);
 
 	/*
@@ -945,7 +950,7 @@ xlog_cil_push_background(
 	ASSERT(!list_empty(&cil->xc_cil));
 
 	/*
-	 * don't do a background push if we haven't used up all the
+	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
 	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
@@ -969,9 +974,16 @@ xlog_cil_push_background(
 
 	/*
 	 * If we are well over the space limit, throttle the work that is being
-	 * done until the push work on this context has begun.
+	 * done until the push work on this context has begun. Enforce the hard
+	 * throttle on all transaction commits once it has been activated, even
+	 * if the committing transactions have resulted in the space usage
+	 * dipping back down under the hard limit.
+	 *
+	 * The ctx->xc_push_lock provides the serialisation necessary for safely
+	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
 		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 10/45] xfs: reduce buffer log item shadow allocations
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (8 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-15 14:52   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
                   ` (34 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we modify btrees repeatedly, we regularly increase the size of
the logged region by a single chunk at a time (per transaction
commit). This results in the CIL formatting code having to
reallocate the log vector buffer every time the buffer dirty region
grows. Hence over a typical 4kB btree buffer, we might grow the log
vector 4096/128 = 32x over a short period where we repeatedly add
or remove records to/from the buffer over a series of running
transactions. This means we are doing 32 memory allocations and frees
over this time during a performance critical path in the journal.

The amount of space tracked in the CIL for the object is calculated
during the ->iop_format() call for the buffer log item, but the
buffer memory allocated for it is calculated by the ->iop_size()
call. The size callout determines the size of the buffer, the format
call determines the space used in the buffer.

Hence we can oversize the buffer space required in the size
calculation without impacting the amount of space used and accounted
to the CIL for the changes being logged. This allows us to reduce
the number of allocations by rounding up the buffer size to allow
for future growth. This can save a substantial amount of CPU time in
this path:

-   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
   - 44.49% xfs_log_commit_cil
      - 30.78% _raw_spin_lock
         - 30.75% do_raw_spin_lock
              30.27% __pv_queued_spin_lock_slowpath

(oh, ouch!)
....
      - 1.05% kmem_alloc_large
         - 1.02% kmem_alloc
              0.94% __kmalloc

The allocation overhead here is what this patch is aimed at. After:

      - 0.76% kmem_alloc_large
         - 0.75% kmem_alloc
              0.70% __kmalloc

The size of 512 bytes is based on the bitmap chunk size being 128
bytes and the observation that random directory entry updates
almost never require more than 3-4 128 byte regions to be logged
in the directory block.

The other observation is for per-ag btrees. When we are inserting
into a new btree block, we'll pack it from the front. Hence the
first few records land in the first 128 bytes so we log only 128
bytes, the next 8-16 records land in the second region so now we log
256 bytes. And so on.  If we are doing random updates, we only
reallocate once for every four random 128 byte regions that are
dirtied instead of for every single one.

Any larger than 512 bytes and I noticed an increase in memory
footprint in my scalability workloads. Any less than this and I
didn't really see any significant benefit to CPU usage.
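
The effect of the rounding is easy to demonstrate with a standalone
sketch (round_up() open-coded here; 128 byte chunks as noted above):

#include <stdio.h>

#define XFS_BLF_CHUNK	128
#define ROUNDING	512

#define round_up(x, y)	((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	int chunks, allocs = 0, buf_size = 0;

	/* dirty one more 128 byte chunk per transaction commit */
	for (chunks = 1; chunks <= 32; chunks++) {
		int need = round_up(chunks * XFS_BLF_CHUNK, ROUNDING);

		if (need > buf_size) {	/* shadow buffer too small */
			buf_size = need;
			allocs++;
		}
	}
	printf("%d allocations\n", allocs);	/* 8 instead of 32 */
	return 0;
}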

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 17960b1ce5ef..0628a65d9c55 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -142,6 +142,7 @@ xfs_buf_item_size(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	int			i;
+	int			bytes;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -173,7 +174,7 @@ xfs_buf_item_size(
 	}
 
 	/*
-	 * the vector count is based on the number of buffer vectors we have
+	 * The vector count is based on the number of buffer vectors we have
 	 * dirty bits in. This will only be greater than one when we have a
 	 * compound buffer with more than one segment dirty. Hence for compound
 	 * buffers we need to track which segment the dirty bits correspond to,
@@ -181,10 +182,18 @@ xfs_buf_item_size(
 	 * count for the extra buf log format structure that will need to be
 	 * written.
 	 */
+	bytes = 0;
 	for (i = 0; i < bip->bli_format_count; i++) {
 		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
-					  nvecs, nbytes);
+					  nvecs, &bytes);
 	}
+
+	/*
+	 * Round up the buffer size required to minimise the number of memory
+	 * allocations that need to be done as this item grows when relogged by
+	 * repeated modifications.
+	 */
+	*nbytes = round_up(bytes, 512);
 	trace_xfs_buf_item_size(bip);
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (9 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 10/45] xfs: reduce buffer log item shadow allocations Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
                   ` (33 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Otherwise it doesn't correctly calculate the number of vectors
in a logged buffer that has a contiguous map that gets split into
multiple regions because the range spans discontiguous memory.

This has probably never been hit in practice - we don't log
contiguous ranges on unmapped buffers (inode clusters).
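
To see why the offset matters, here's a tiny sketch of the chunk
address calculation (XFS_BLF_SHIFT is the 128 byte chunk shift,
BBTOB() converts basic blocks to bytes; illustrative values only):

#include <stdio.h>

#define XFS_BLF_SHIFT	7		/* 128 byte chunks */
#define BBTOB(bb)	((bb) << 9)	/* basic blocks to bytes */

/* Byte offset into the buffer of logged chunk 'bit' within a segment
 * that starts 'seg_offset' bytes into the buffer. */
static unsigned int chunk_offset(unsigned int seg_offset, int bit)
{
	return seg_offset + (bit << XFS_BLF_SHIFT);
}

int main(void)
{
	/* two-segment buffer, first segment 8 basic blocks (4kB) */
	unsigned int seg1 = BBTOB(8);

	/* without seg_offset, bit 0 of segment 1 aliases bit 0 of
	 * segment 0 and straddle detection compares the wrong bytes */
	printf("seg0 bit0 -> %u, seg1 bit0 -> %u\n",
	       chunk_offset(0, 0), chunk_offset(seg1, 0));
	return 0;
}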

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0628a65d9c55..91dc7d8c9739 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -55,6 +55,18 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
+static inline bool
+xfs_buf_item_straddle(
+	struct xfs_buf		*bp,
+	uint			offset,
+	int			next_bit,
+	int			last_bit)
+{
+	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
+		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
+		 XFS_BLF_CHUNK);
+}
+
 /*
  * Return the number of log iovecs and space needed to log the given buf log
  * item segment.
@@ -67,6 +79,7 @@ STATIC void
 xfs_buf_item_size_segment(
 	struct xfs_buf_log_item		*bip,
 	struct xfs_buf_log_format	*blfp,
+	uint				offset,
 	int				*nvecs,
 	int				*nbytes)
 {
@@ -101,12 +114,8 @@ xfs_buf_item_size_segment(
 		 */
 		if (next_bit == -1) {
 			break;
-		} else if (next_bit != last_bit + 1) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
-			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
-			    XFS_BLF_CHUNK)) {
+		} else if (next_bit != last_bit + 1 ||
+		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
 			last_bit = next_bit;
 			(*nvecs)++;
 		} else {
@@ -141,8 +150,10 @@ xfs_buf_item_size(
 	int			*nbytes)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	struct xfs_buf		*bp = bip->bli_buf;
 	int			i;
 	int			bytes;
+	uint			offset = 0;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -184,8 +195,9 @@ xfs_buf_item_size(
 	 */
 	bytes = 0;
 	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
+		xfs_buf_item_size_segment(bip, &bip->bli_formats[i], offset,
 					  nvecs, &bytes);
+		offset += BBTOB(bp->b_maps[i].bm_len);
 	}
 
 	/*
@@ -212,18 +224,6 @@ xfs_buf_item_copy_iovec(
 			nbits * XFS_BLF_CHUNK);
 }
 
-static inline bool
-xfs_buf_item_straddle(
-	struct xfs_buf		*bp,
-	uint			offset,
-	int			next_bit,
-	int			last_bit)
-{
-	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
-		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
-		 XFS_BLF_CHUNK);
-}
-
 static void
 xfs_buf_item_format_segment(
 	struct xfs_buf_log_item	*bip,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (10 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
                   ` (32 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We process the buf_log_item bitmap one set bit at a time with
xfs_next_bit() so we can detect if a region crosses a memcpy
discontinuity in the buffer data address. This has massive overhead
on large buffers (e.g. 64k directory blocks) because we do a lot of
unnecessary checks and xfs_buf_offset() calls.

For example, a 16-way concurrent create workload on a debug kernel
running CPU bound has this at the top of the profile at ~120k
creates/s on a 64kB directory block size:

  20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
   7.10%  [kernel]  [k] memcpy
   6.22%  [kernel]  [k] xfs_next_bit
   3.55%  [kernel]  [k] xfs_buf_offset
   3.53%  [kernel]  [k] xfs_buf_item_format
   3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.04%  [kernel]  [k] do_raw_spin_lock
   2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
   2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.36%  [kernel]  [k] xfs_log_commit_cil

(debug checks hurt large blocks)

The only buffers with discontinuities in the data address are
unmapped buffers, and they are only used for inode cluster buffers
and only for logging unlinked pointers. IOWs, it is -rare- that we
even need to detect a discontinuity in the buffer item formatting
code.

Optimise all this by using xfs_contig_bits() to find the size of
the contiguous regions, then test for a discontinuity inside it. If
we find one, do the slow "bit at a time" method we do now. If we
don't, then just copy the entire contiguous range in one go.
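
The structure of the resulting fast path is roughly this sketch
(userspace stand-ins for xfs_next_bit()/xfs_contig_bits(); the slow
per-bit fallback is omitted):

#include <stdio.h>

/* stand-in: index of the next set bit at or after 'start', or -1 */
static int next_bit(const unsigned char *map, int nbits, int start)
{
	for (; start < nbits; start++)
		if (map[start / 8] & (1 << (start % 8)))
			return start;
	return -1;
}

/* stand-in: length of the run of set bits starting at 'start' */
static int contig_bits(const unsigned char *map, int nbits, int start)
{
	int n = 0;

	while (start + n < nbits &&
	       (map[(start + n) / 8] & (1 << ((start + n) % 8))))
		n++;
	return n;
}

int main(void)
{
	unsigned char map[4] = { 0x0f, 0, 0xff, 0 };	/* bits 0-3, 16-23 */
	int nbits = 32, runs = 0;
	int first = next_bit(map, nbits, 0);

	while (first != -1) {
		int n = contig_bits(map, nbits, first);

		runs++;		/* one iovec per contiguous dirty run */
		/* the bit after the run is known clear, so skip it */
		first = next_bit(map, nbits, first + n + 1);
	}
	printf("%d iovecs\n", runs);	/* 2 runs, no per-bit scanning */
	return 0;
}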

Profile now looks like:

  25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
   9.25%  [kernel]  [k] memcpy
   5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   2.84%  [kernel]  [k] do_raw_spin_lock
   2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.88%  [kernel]  [k] xfs_buf_find
   1.53%  [kernel]  [k] memmove
   1.47%  [kernel]  [k] xfs_log_commit_cil
....
   0.34%  [kernel]  [k] xfs_buf_item_format
....
   0.21%  [kernel]  [k] xfs_buf_offset
....
   0.16%  [kernel]  [k] xfs_contig_bits
....
   0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0

So the bit scanning overhead for the dirty region tracking of the
buffer log items is basically gone. Debug overhead hurts even more
now...

Perf comparison

		dir block	 creates		unlink
		size (kb)	time	rate		time

Original	 4		4m08s	220k		 5m13s
Original	64		7m21s	115k		13m25s
Patched		 4		3m59s	230k		 5m03s
Patched		64		6m23s	143k		12m33s

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 91dc7d8c9739..14d1fefcbf4c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -59,12 +59,18 @@ static inline bool
 xfs_buf_item_straddle(
 	struct xfs_buf		*bp,
 	uint			offset,
-	int			next_bit,
-	int			last_bit)
+	int			first_bit,
+	int			nbits)
 {
-	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
-		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
-		 XFS_BLF_CHUNK);
+	void			*first, *last;
+
+	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
+	last = xfs_buf_offset(bp,
+			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
+
+	if (last - first != nbits * XFS_BLF_CHUNK)
+		return true;
+	return false;
 }
 
 /*
@@ -84,20 +90,51 @@ xfs_buf_item_size_segment(
 	int				*nbytes)
 {
 	struct xfs_buf			*bp = bip->bli_buf;
+	int				first_bit;
+	int				nbits;
 	int				next_bit;
 	int				last_bit;
 
-	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (last_bit == -1)
+	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
+	if (first_bit == -1)
 		return;
 
-	/*
-	 * initial count for a dirty buffer is 2 vectors - the format structure
-	 * and the first dirty region.
-	 */
-	*nvecs += 2;
-	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
+	(*nvecs)++;
+	*nbytes += xfs_buf_log_format_size(blfp);
+
+	do {
+		nbits = xfs_contig_bits(blfp->blf_data_map,
+					blfp->blf_map_size, first_bit);
+		ASSERT(nbits > 0);
+
+		/*
+		 * Straddling a page is rare because we don't log contiguous
+		 * chunks of unmapped buffers anywhere.
+		 */
+		if (nbits > 1 &&
+		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
+			goto slow_scan;
+
+		(*nvecs)++;
+		*nbytes += nbits * XFS_BLF_CHUNK;
+
+		/*
+		 * This takes the bit number to start looking from and
+		 * returns the next set bit from there.  It returns -1
+		 * if there are no more bits set or the start bit is
+		 * beyond the end of the bitmap.
+		 */
+		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
+					(uint)first_bit + nbits + 1);
+	} while (first_bit != -1);
 
+	return;
+
+slow_scan:
+	/* Count the first bit we jumped out of the above loop from */
+	(*nvecs)++;
+	*nbytes += XFS_BLF_CHUNK;
+	last_bit = first_bit;
 	while (last_bit != -1) {
 		/*
 		 * This takes the bit number to start looking from and
@@ -115,11 +152,14 @@ xfs_buf_item_size_segment(
 		if (next_bit == -1) {
 			break;
 		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
+		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
 			last_bit = next_bit;
+			first_bit = next_bit;
 			(*nvecs)++;
+			nbits = 1;
 		} else {
 			last_bit++;
+			nbits++;
 		}
 		*nbytes += XFS_BLF_CHUNK;
 	}
@@ -276,6 +316,38 @@ xfs_buf_item_format_segment(
 	/*
 	 * Fill in an iovec for each set of contiguous chunks.
 	 */
+	do {
+		ASSERT(first_bit >= 0);
+		nbits = xfs_contig_bits(blfp->blf_data_map,
+					blfp->blf_map_size, first_bit);
+		ASSERT(nbits > 0);
+
+		/*
+		 * Straddling a page is rare because we don't log contiguous
+		 * chunks of unmapped buffers anywhere.
+		 */
+		if (nbits > 1 &&
+		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
+			goto slow_scan;
+
+		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
+					first_bit, nbits);
+		blfp->blf_size++;
+
+		/*
+		 * This takes the bit number to start looking from and
+		 * returns the next set bit from there.  It returns -1
+		 * if there are no more bits set or the start bit is
+		 * beyond the end of the bitmap.
+		 */
+		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
+					(uint)first_bit + nbits + 1);
+	} while (first_bit != -1);
+
+	return;
+
+slow_scan:
+	ASSERT(bp->b_addr == NULL);
 	last_bit = first_bit;
 	nbits = 1;
 	for (;;) {
@@ -300,7 +372,7 @@ xfs_buf_item_format_segment(
 			blfp->blf_size++;
 			break;
 		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
+		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
 			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
 						first_bit, nbits);
 			blfp->blf_size++;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (11 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08 22:53   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Dave Chinner
                   ` (31 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This lead me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem with this is that ip->i_itemp->ili_last_lsn does not store
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.
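
Reduced to its essence, the flow is: the caller holds a checkpoint
sequence, the CIL maps that to the commit record's LSN, and only the
LSN means anything to the iclog layer. A toy sketch of the
distinction (the typedefs match the patch; the lookup body is a
made-up stand-in):

#include <stdint.h>
#include <stdio.h>

typedef int64_t	xfs_lsn_t;	/* physical log address: cycle:block */
typedef int64_t	xfs_csn_t;	/* logical CIL checkpoint counter */

/* toy stand-in for the CIL's sequence -> commit record LSN lookup */
static xfs_lsn_t commit_lsn_of(xfs_csn_t seq)
{
	return (1LL << 32) | (seq * 64);	/* cycle 1, block seq*64 */
}

int main(void)
{
	xfs_csn_t seq = 5;	/* what ili_commit_seq now stores */
	xfs_lsn_t lsn = commit_lsn_of(seq);

	printf("seq %lld -> lsn %#llx\n",
	       (long long)seq, (unsigned long long)lsn);
	return 0;
}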

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_types.h |  1 +
 fs/xfs/xfs_buf_item.c     |  2 +-
 fs/xfs/xfs_dquot_item.c   |  2 +-
 fs/xfs/xfs_file.c         | 14 +++++++-------
 fs/xfs/xfs_inode.c        | 10 +++++-----
 fs/xfs/xfs_inode_item.c   |  4 ++--
 fs/xfs/xfs_inode_item.h   |  2 +-
 fs/xfs/xfs_log.c          | 27 ++++++++++++++-------------
 fs/xfs/xfs_log.h          |  4 +---
 fs/xfs/xfs_log_cil.c      | 22 +++++++++-------------
 fs/xfs/xfs_log_priv.h     | 15 +++++++--------
 fs/xfs/xfs_trans.c        |  6 +++---
 fs/xfs/xfs_trans.h        |  4 ++--
 13 files changed, 54 insertions(+), 59 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 064bd6e8c922..0870ef6f933d 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -21,6 +21,7 @@ typedef int32_t		xfs_suminfo_t;	/* type of bitmap summary info */
 typedef uint32_t	xfs_rtword_t;	/* word type for bitmap manipulations */
 
 typedef int64_t		xfs_lsn_t;	/* log sequence number */
+typedef int64_t		xfs_csn_t;	/* CIL sequence number */
 
 typedef uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
 typedef uint32_t	xfs_dahash_t;	/* dir/attr hash value */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 14d1fefcbf4c..1cb087b320b1 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -713,7 +713,7 @@ xfs_buf_item_release(
 STATIC void
 xfs_buf_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
 	return xfs_buf_item_release(lip);
 }
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 8c1fdf37ee8f..8ed47b739b6c 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
 STATIC void
 xfs_qm_dquot_logitem_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
 	return xfs_qm_dquot_logitem_release(lip);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 24c7f45fc4eb..ac3120dfe477 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -119,8 +119,8 @@ xfs_dir_fsync(
 	return xfs_log_force_inode(ip);
 }
 
-static xfs_lsn_t
-xfs_fsync_lsn(
+static xfs_csn_t
+xfs_fsync_seq(
 	struct xfs_inode	*ip,
 	bool			datasync)
 {
@@ -128,7 +128,7 @@ xfs_fsync_lsn(
 		return 0;
 	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		return 0;
-	return ip->i_itemp->ili_last_lsn;
+	return ip->i_itemp->ili_commit_seq;
 }
 
 /*
@@ -151,12 +151,12 @@ xfs_fsync_flush_log(
 	int			*log_flushed)
 {
 	int			error = 0;
-	xfs_lsn_t		lsn;
+	xfs_csn_t		seq;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	lsn = xfs_fsync_lsn(ip, datasync);
-	if (lsn) {
-		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
+	seq = xfs_fsync_seq(ip, datasync);
+	if (seq) {
+		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
 					  log_flushed);
 
 		spin_lock(&ip->i_itemp->ili_lock);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bed2beb169e4..1c2ef1f1859a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2644,7 +2644,7 @@ xfs_iunpin(
 	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
 
 	/* Give the log a push to start the unpinning I/O */
-	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
+	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
 
 }
 
@@ -3652,16 +3652,16 @@ int
 xfs_log_force_inode(
 	struct xfs_inode	*ip)
 {
-	xfs_lsn_t		lsn = 0;
+	xfs_csn_t		seq = 0;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip))
-		lsn = ip->i_itemp->ili_last_lsn;
+		seq = ip->i_itemp->ili_commit_seq;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
-	if (!lsn)
+	if (!seq)
 		return 0;
-	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
+	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ff91e5bf3cd..3aba4559469f 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -617,9 +617,9 @@ xfs_inode_item_committed(
 STATIC void
 xfs_inode_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
-	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
+	INODE_ITEM(lip)->ili_commit_seq = seq;
 	return xfs_inode_item_release(lip);
 }
 
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4b926e32831c..403b45ab9aa2 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -33,7 +33,7 @@ struct xfs_inode_log_item {
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
+	xfs_csn_t		ili_commit_seq;	   /* last transaction commit */
 };
 
 static inline int xfs_inode_clean(struct xfs_inode *ip)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index ed44d67d7099..145db0f88060 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3273,14 +3273,13 @@ xfs_log_force(
 }
 
 static int
-__xfs_log_force_lsn(
-	struct xfs_mount	*mp,
+xlog_force_lsn(
+	struct xlog		*log,
 	xfs_lsn_t		lsn,
 	uint			flags,
 	int			*log_flushed,
 	bool			already_slept)
 {
-	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
 
 	spin_lock(&log->l_icloglock);
@@ -3313,8 +3312,6 @@ __xfs_log_force_lsn(
 		if (!already_slept &&
 		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
 		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
-			XFS_STATS_INC(mp, xs_log_force_sleep);
-
 			xlog_wait(&iclog->ic_prev->ic_write_wait,
 					&log->l_icloglock);
 			return -EAGAIN;
@@ -3352,25 +3349,29 @@ __xfs_log_force_lsn(
  * to disk, that thread will wake up all threads waiting on the queue.
  */
 int
-xfs_log_force_lsn(
+xfs_log_force_seq(
 	struct xfs_mount	*mp,
-	xfs_lsn_t		lsn,
+	xfs_csn_t		seq,
 	uint			flags,
 	int			*log_flushed)
 {
+	struct xlog		*log = mp->m_log;
+	xfs_lsn_t		lsn;
 	int			ret;
-	ASSERT(lsn != 0);
+	ASSERT(seq != 0);
 
 	XFS_STATS_INC(mp, xs_log_force);
-	trace_xfs_log_force(mp, lsn, _RET_IP_);
+	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
+	lsn = xlog_cil_force_seq(log, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
-	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
-	if (ret == -EAGAIN)
-		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
+	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
+	if (ret == -EAGAIN) {
+		XFS_STATS_INC(mp, xs_log_force_sleep);
+		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
+	}
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..ba96f4ad9576 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -106,7 +106,7 @@ struct xfs_item_ops;
 struct xfs_trans;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
-int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
+int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
 		int *log_forced);
 int	  xfs_log_mount(struct xfs_mount	*mp,
 			struct xfs_buftarg	*log_target,
@@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
-				xfs_lsn_t *commit_lsn, bool regrant);
 void	xlog_cil_process_committed(struct list_head *list);
 bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2f0adc35d8ec..44bb7cc17541 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -794,7 +794,7 @@ xlog_cil_push_work(
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
 	 *
-	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
 	 * structure atomically with the addition of this sequence to the
 	 * committing list. This also ensures that we can do unlocked checks
 	 * against the current sequence in log forces without risking
@@ -1058,16 +1058,14 @@ xlog_cil_empty(
  * allowed again.
  */
 void
-xfs_log_commit_cil(
-	struct xfs_mount	*mp,
+xlog_cil_commit(
+	struct xlog		*log,
 	struct xfs_trans	*tp,
-	xfs_lsn_t		*commit_lsn,
+	xfs_csn_t		*commit_seq,
 	bool			regrant)
 {
-	struct xlog		*log = mp->m_log;
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
-	xfs_lsn_t		xc_commit_lsn;
 
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
@@ -1081,10 +1079,6 @@ xfs_log_commit_cil(
 
 	xlog_cil_insert_items(log, tp);
 
-	xc_commit_lsn = cil->xc_ctx->sequence;
-	if (commit_lsn)
-		*commit_lsn = xc_commit_lsn;
-
 	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
 	else
@@ -1107,8 +1101,10 @@ xfs_log_commit_cil(
 	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
 		xfs_trans_del_item(lip);
 		if (lip->li_ops->iop_committing)
-			lip->li_ops->iop_committing(lip, xc_commit_lsn);
+			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
 	}
+	if (commit_seq)
+		*commit_seq = cil->xc_ctx->sequence;
 
 	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
 	xlog_cil_push_background(log);
@@ -1125,9 +1121,9 @@ xfs_log_commit_cil(
  * iclog flush is necessary following this call.
  */
 xfs_lsn_t
-xlog_cil_force_lsn(
+xlog_cil_force_seq(
 	struct xlog	*log,
-	xfs_lsn_t	sequence)
+	xfs_csn_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0552e96d2b64..31ce2ce21e27 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -234,7 +234,7 @@ struct xfs_cil;
 
 struct xfs_cil_ctx {
 	struct xfs_cil		*cil;
-	xfs_lsn_t		sequence;	/* chkpt sequence # */
+	xfs_csn_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
@@ -272,10 +272,10 @@ struct xfs_cil {
 	struct xfs_cil_ctx	*xc_ctx;
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
-	xfs_lsn_t		xc_push_seq;
+	xfs_csn_t		xc_push_seq;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
-	xfs_lsn_t		xc_current_sequence;
+	xfs_csn_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
@@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
 void	xlog_cil_init_post_recovery(struct xlog *log);
 void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
+void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
+			xfs_csn_t *commit_seq, bool regrant);
 
 /*
  * CIL force routines
  */
-xfs_lsn_t
-xlog_cil_force_lsn(
-	struct xlog *log,
-	xfs_lsn_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index b22a09e9daee..21ac7c048380 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -851,7 +851,7 @@ __xfs_trans_commit(
 	bool			regrant)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_lsn_t		commit_lsn = -1;
+	xfs_csn_t		commit_seq = 0;
 	int			error = 0;
 	int			sync = tp->t_flags & XFS_TRANS_SYNC;
 
@@ -893,7 +893,7 @@ __xfs_trans_commit(
 		xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
-	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
+	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
 
 	xfs_trans_free(tp);
 
@@ -902,7 +902,7 @@ __xfs_trans_commit(
 	 * log out now and wait for it.
 	 */
 	if (sync) {
-		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
+		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
 		XFS_STATS_INC(mp, xs_trans_sync);
 	} else {
 		XFS_STATS_INC(mp, xs_trans_async);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 9dd745cf77c9..6276c7d251e6 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -43,7 +43,7 @@ struct xfs_log_item {
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
-	xfs_lsn_t			li_seq;		/* CIL commit seq */
+	xfs_csn_t			li_seq;		/* CIL commit seq */
 };
 
 /*
@@ -69,7 +69,7 @@ struct xfs_item_ops {
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
 	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
-	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
+	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	int (*iop_recover)(struct xfs_log_item *lip,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (12 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08 23:45   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 15/45] xfs: CIL work is serialised, not pipelined Dave Chinner
                   ` (30 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from the AIL when the
checkpoint that marks the items stale is committed to the journal.
This results in relatively few items in the AIL, but those that
remain are often pinned, as the directories that items are being
removed from are still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and the delay between last/first is the sleep
time + the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions whose stale items never reach the AIL
but still consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.
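
As a sketch of the intended AIL-side call (a fragment with assumed
semantics, not the actual diff below: passing flags without
XFS_LOG_SYNC means an async push, and a zero sequence means "current
sequence"):

/*
 * Kick the CIL: push everything currently committed to the CIL and
 * have the commit record iclog submitted, but do not wait for the
 * IO. The AIL keeps pushing while the checkpoint IO is in flight.
 */
static void ail_kick_cil(struct xlog *log)
{
	xlog_cil_force_seq(log, 0 /* !XFS_LOG_SYNC */, 0 /* current */);
}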

One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c       | 59 ++++++++++++++++--------------------------
 fs/xfs/xfs_log.h       |  2 +-
 fs/xfs/xfs_log_cil.c   | 58 +++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h  | 10 +++++--
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 +++++---
 8 files changed, 90 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 145db0f88060..f54d48f4584e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -50,11 +50,6 @@ xlog_state_get_iclog_space(
 	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
-xlog_state_switch_iclogs(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	int			eventual_size);
-STATIC void
 xlog_grant_push_ail(
 	struct xlog		*log,
 	int			need_bytes);
@@ -511,7 +506,7 @@ __xlog_state_release_iclog(
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
  */
-static int
+int
 xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
@@ -531,23 +526,6 @@ xlog_state_release_iclog(
 	return 0;
 }
 
-void
-xfs_log_release_iclog(
-	struct xlog_in_core	*iclog)
-{
-	struct xlog		*log = iclog->ic_log;
-	bool			sync = false;
-
-	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
-		if (iclog->ic_state != XLOG_STATE_IOERROR)
-			sync = __xlog_state_release_iclog(log, iclog);
-		spin_unlock(&log->l_icloglock);
-	}
-
-	if (sync)
-		xlog_sync(log, iclog);
-}
-
 /*
  * Mount a log filesystem
  *
@@ -3125,7 +3103,7 @@ xfs_log_ticket_ungrant(
  * This routine will mark the current iclog in the ring as WANT_SYNC and move
  * the current iclog pointer to the next iclog in the ring.
  */
-STATIC void
+void
 xlog_state_switch_iclogs(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
@@ -3272,6 +3250,20 @@ xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3335,18 +3327,13 @@ xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
- *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * Force the log to a specific checkpoint sequence.
  *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage. If the caller needs
+ * a synchronous log force, we will wait on the iclog with the LSN returned by
+ * xlog_cil_force_seq() to be completed.
  */
 int
 xfs_log_force_seq(
@@ -3363,7 +3350,7 @@ xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, XFS_LOG_SYNC, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index ba96f4ad9576..1bd080ce3a95 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@ struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
@@ -117,7 +118,6 @@ void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
 void	  xfs_log_space_wake(struct xfs_mount *mp);
-void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
 int	  xfs_log_reserve(struct xfs_mount *mp,
 			  int		   length,
 			  int		   count,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 44bb7cc17541..b101c25cc9a9 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -658,6 +658,7 @@ xlog_cil_push_work(
 	xfs_lsn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
+	bool			commit_iclog_sync = false;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -668,6 +669,8 @@ xlog_cil_push_work(
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
+	commit_iclog_sync = cil->xc_push_async;
+	cil->xc_push_async = false;
 
 	/*
 	 * As we are about to switch to a new, empty CIL context, we no longer
@@ -914,7 +917,11 @@ xlog_cil_push_work(
 	}
 
 	/* release the hounds! */
-	xfs_log_release_iclog(commit_iclog);
+	spin_lock(&log->l_icloglock);
+	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, commit_iclog, 0);
+	xlog_state_release_iclog(log, commit_iclog);
+	spin_unlock(&log->l_icloglock);
 	return;
 
 out_skip:
@@ -997,13 +1004,26 @@ xlog_cil_push_background(
 /*
  * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
  * number that is passed. When it returns, the work will be queued for
- * @push_seq, but it won't be completed. The caller is expected to do any
- * waiting for push_seq to complete if it is required.
+ * @push_seq, but it won't be completed.
+ *
+ * If the caller is performing a synchronous force, we will flush the workqueue
+ * to get previously queued work moving to minimise the time the caller spends
+ * waiting for all outstanding pushes to complete. The caller is expected to do
+ * the required waiting for push_seq to complete.
+ *
+ * If the caller is performing an async push, we need to ensure that the
+ * checkpoint is fully flushed out of the iclogs when we finish the push. If we
+ * don't do this, then the commit record may remain sitting in memory in an
+ * ACTIVE iclog. This then requires another full log force to push to disk,
+ * which defeats the purpose of having an async, non-blocking CIL force
+ * mechanism. Hence in this case we need to pass a flag to the push work to
+ * indicate it needs to flush the commit record itself.
  */
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		sync)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1013,7 +1033,8 @@ xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (sync)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1026,6 +1047,8 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
+	if (!sync)
+		cil->xc_push_async = true;
 	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1113,16 +1136,22 @@ xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in
+ * the flags, wait for it to complete; otherwise just return NULLCOMMITLSN to
+ * indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if a iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	xfs_csn_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
@@ -1131,13 +1160,19 @@ xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
+
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -1161,6 +1196,7 @@ xlog_cil_force_seq(
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
+			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
 			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 31ce2ce21e27..a4e46258b2aa 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -273,6 +273,7 @@ struct xfs_cil {
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
 	xfs_csn_t		xc_push_seq;
+	bool			xc_push_async;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
@@ -487,6 +488,10 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
+void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
+		int eventual_size);
+int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
+
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
@@ -558,12 +563,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				xfs_csn_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 21ac7c048380..52f3fdf1e0de 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@ xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@ xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);
-- 
2.28.0



* [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (13 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08 23:14   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 16/45] xfs: type verification is expensive Dave Chinner
                   ` (29 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.

The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL; once the context has been switched, the rest of the
context push can run asynchronously with all other context pushes.

Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, we can run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU-bound bottleneck but
not enough to create new contention points or unbounded concurrency
issues.

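A hedged sketch of the pattern, with stand-in names around the real
workqueue API: the work item moves into the per-checkpoint context,
and max_active on the workqueue provides the concurrency cap:

#include <linux/workqueue.h>

/* one work item per checkpoint context instead of one per CIL */
struct push_ctx {
	struct work_struct	push_work;
};

static void push_fn(struct work_struct *work)
{
	struct push_ctx	*ctx = container_of(work, struct push_ctx,
					    push_work);

	/* ... format and write out this checkpoint context ... */
	(void)ctx;
}

static struct workqueue_struct	*push_wq;

static int push_example(struct push_ctx *ctx)
{
	/* max_active = 4 bounds how many pushes run at once */
	push_wq = alloc_workqueue("push-example",
			WQ_UNBOUND | WQ_MEM_RECLAIM, 4);
	if (!push_wq)
		return -ENOMEM;
	INIT_WORK(&ctx->push_work, push_fn);
	queue_work(push_wq, &ctx->push_work);
	return 0;
}
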
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h |  2 +-
 fs/xfs/xfs_super.c    |  2 +-
 3 files changed, 44 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b101c25cc9a9..dfc9ef692a80 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
 	return tic;
 }
 
+/*
+ * Unavoidable forward declaration - xlog_cil_push_work() calls
+ * xlog_cil_ctx_alloc() itself.
+ */
+static void xlog_cil_push_work(struct work_struct *work);
+
+static struct xfs_cil_ctx *
+xlog_cil_ctx_alloc(void)
+{
+	struct xfs_cil_ctx	*ctx;
+
+	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
+	INIT_LIST_HEAD(&ctx->committing);
+	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
+	return ctx;
+}
+
+static void
+xlog_cil_ctx_switch(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx)
+{
+	ctx->sequence = ++cil->xc_current_sequence;
+	ctx->cil = cil;
+	cil->xc_ctx = ctx;
+}
+
 /*
  * After the first stage of log recovery is done, we know where the head and
  * tail of the log are. We need this log initialisation done before we can
@@ -641,11 +669,11 @@ static void
 xlog_cil_push_work(
 	struct work_struct	*work)
 {
-	struct xfs_cil		*cil =
-		container_of(work, struct xfs_cil, xc_push_work);
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, push_work);
+	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
 	struct xfs_log_vec	*lv;
-	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
@@ -660,11 +688,10 @@ xlog_cil_push_work(
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			commit_iclog_sync = false;
 
-	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
+	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
 	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;
 
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
@@ -696,7 +723,7 @@ xlog_cil_push_work(
 
 
 	/* check for a previously pushed sequence */
-	if (push_seq < cil->xc_ctx->sequence) {
+	if (push_seq < ctx->sequence) {
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
@@ -767,19 +794,7 @@ xlog_cil_push_work(
 	}
 
 	/*
-	 * initialise the new context and attach it to the CIL. Then attach
-	 * the current context to the CIL committing list so it can be found
-	 * during log forces to extract the commit lsn of the sequence that
-	 * needs to be forced.
-	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * The switch is now done, so we can drop the context lock and move out
+	 * Switch the contexts so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
 	 * that the commit records are correctly ordered in the log to ensure
@@ -804,7 +819,7 @@ xlog_cil_push_work(
 	 * dereferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_switch(cil, new_ctx);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
@@ -968,7 +983,7 @@ xlog_cil_push_background(
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
-		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	}
 
 	/*
@@ -1034,7 +1049,7 @@ xlog_cil_push_now(
 
 	/* start on any pending background push to minimise wait time on it */
 	if (sync)
-		flush_work(&cil->xc_push_work);
+		flush_workqueue(log->l_mp->m_cil_workqueue);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1049,7 +1064,7 @@ xlog_cil_push_now(
 	cil->xc_push_seq = push_seq;
 	if (!sync)
 		cil->xc_push_async = true;
-	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -1286,13 +1301,6 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
-	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
-
-	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1300,16 +1308,12 @@ xlog_cil_init(
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
-
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
-
 	cil->xc_log = log;
 	log->l_cilp = cil;
+
+	ctx = xlog_cil_ctx_alloc();
+	xlog_cil_ctx_switch(cil, ctx);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a4e46258b2aa..bb5fa6b71114 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@ struct xfs_cil_ctx {
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
+	struct work_struct	push_work;
 };
 
 /*
@@ -277,7 +278,6 @@ struct xfs_cil {
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
-	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ca2cb0448b5e..962f03a541e7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -502,7 +502,7 @@ xfs_init_mount_workqueues(
 
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
-			0, mp->m_super->s_id);
+			4, mp->m_super->s_id);
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
 
-- 
2.28.0



* [PATCH 16/45] xfs: type verification is expensive
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (14 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 15/45] xfs: CIL work is serialised, not pipelined Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check Dave Chinner
                   ` (28 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

From a concurrent rm -rf workload:

  41.04%  [kernel]  [k] xfs_dir3_leaf_check_int
   9.85%  [kernel]  [k] __xfs_dir3_data_check
   5.60%  [kernel]  [k] xfs_verify_ino
   5.32%  [kernel]  [k] xfs_agino_range
   4.21%  [kernel]  [k] memcpy
   3.06%  [kernel]  [k] xfs_errortag_test
   2.57%  [kernel]  [k] xfs_dir_ino_validate
   1.66%  [kernel]  [k] xfs_dir2_data_get_ftype
   1.17%  [kernel]  [k] do_raw_spin_lock
   1.11%  [kernel]  [k] xfs_verify_dir_ino
   0.84%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   0.83%  [kernel]  [k] xfs_buf_find
   0.64%  [kernel]  [k] xfs_log_commit_cil

There's an awful lot of overhead in just range checking inode
numbers in that profile, yet each inode number check is not a lot
of code. In total, a bit over 14.5% of the CPU time is spent
validating inode numbers.

The problem is that these are deeply nested global scope functions,
so the overhead here is all in function call marshalling.

   text	   data	    bss	    dec	    hex	filename
   2077	      0	      0	   2077	    81d fs/xfs/libxfs/xfs_types.o.orig
   2197	      0	      0	   2197	    895	fs/xfs/libxfs/xfs_types.o

There's a small increase in binary size by inlining all the local
nested calls in the verifier functions, but the same workload now
profiles as:

  40.69%  [kernel]  [k] xfs_dir3_leaf_check_int
  10.52%  [kernel]  [k] __xfs_dir3_data_check
   6.68%  [kernel]  [k] xfs_verify_dir_ino
   4.22%  [kernel]  [k] xfs_errortag_test
   4.15%  [kernel]  [k] memcpy
   3.53%  [kernel]  [k] xfs_dir_ino_validate
   1.87%  [kernel]  [k] xfs_dir2_data_get_ftype
   1.37%  [kernel]  [k] do_raw_spin_lock
   0.98%  [kernel]  [k] xfs_buf_find
   0.94%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   0.73%  [kernel]  [k] xfs_log_commit_cil

Now we only spend just over 10% of the time validating inode numbers
for the same workload. Hence a few "inline" keywords are good enough
to reduce the validation overhead by 30%...

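A minimal sketch of the trick with invented names: under the
gnu89/gnu_inline semantics the kernel builds with, "inline" on an
extern definition still emits the out-of-line symbol, but lets the
compiler flatten the nested calls within the same file:

#include <stdbool.h>

struct mount_info {
	unsigned long long	first_ino, last_ino;
};

/* declared extern in a shared header, so it keeps external linkage */
bool verify_ino(const struct mount_info *mp, unsigned long long ino);

/*
 * "inline" on the extern definition keeps the out-of-line copy for
 * other translation units, but calls made from this file can be
 * flattened into the caller.
 */
inline bool verify_ino(const struct mount_info *mp, unsigned long long ino)
{
	return ino >= mp->first_ino && ino <= mp->last_ino;
}

bool verify_dir_ino(const struct mount_info *mp, unsigned long long ino)
{
	/* flattened: no argument marshalling, no call/return */
	return verify_ino(mp, ino);
}
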
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_types.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index b254fbeaaa50..04801362e1a7 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -13,7 +13,7 @@
 #include "xfs_mount.h"
 
 /* Find the size of the AG, in blocks. */
-xfs_agblock_t
+inline xfs_agblock_t
 xfs_ag_block_count(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
@@ -29,7 +29,7 @@ xfs_ag_block_count(
  * Verify that an AG block number pointer neither points outside the AG
  * nor points at static metadata.
  */
-bool
+inline bool
 xfs_verify_agbno(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -49,7 +49,7 @@ xfs_verify_agbno(
  * Verify that an FS block number pointer neither points outside the
  * filesystem nor points at static AG metadata.
  */
-bool
+inline bool
 xfs_verify_fsbno(
 	struct xfs_mount	*mp,
 	xfs_fsblock_t		fsbno)
@@ -85,7 +85,7 @@ xfs_verify_fsbext(
 }
 
 /* Calculate the first and last possible inode number in an AG. */
-void
+inline void
 xfs_agino_range(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -116,7 +116,7 @@ xfs_agino_range(
  * Verify that an AG inode number pointer neither points outside the AG
  * nor points at static metadata.
  */
-bool
+inline bool
 xfs_verify_agino(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -146,7 +146,7 @@ xfs_verify_agino_or_null(
  * Verify that an FS inode number pointer neither points outside the
  * filesystem nor points at static AG metadata.
  */
-bool
+inline bool
 xfs_verify_ino(
 	struct xfs_mount	*mp,
 	xfs_ino_t		ino)
@@ -162,7 +162,7 @@ xfs_verify_ino(
 }
 
 /* Is this an internal inode number? */
-bool
+inline bool
 xfs_internal_inum(
 	struct xfs_mount	*mp,
 	xfs_ino_t		ino)
@@ -190,7 +190,7 @@ xfs_verify_dir_ino(
  * Verify that an realtime block number pointer doesn't point off the
  * end of the realtime device.
  */
-bool
+inline bool
 xfs_verify_rtbno(
 	struct xfs_mount	*mp,
 	xfs_rtblock_t		rtbno)
@@ -215,7 +215,7 @@ xfs_verify_rtext(
 }
 
 /* Calculate the range of valid icount values. */
-void
+inline void
 xfs_icount_range(
 	struct xfs_mount	*mp,
 	unsigned long long	*min,
-- 
2.28.0



* [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (15 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 16/45] xfs: type verification is expensive Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks Dave Chinner
                   ` (27 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We call xfs_dir_ino_validate() for every dir entry in a directory
when doing validity checking of the directory. It calls
xfs_verify_dir_ino() then emits a corruption report if bad or does
error injection if good. It is extremely costly:

  43.27%  [kernel]  [k] xfs_dir3_leaf_check_int
  10.28%  [kernel]  [k] __xfs_dir3_data_check
   6.61%  [kernel]  [k] xfs_verify_dir_ino
   4.16%  [kernel]  [k] xfs_errortag_test
   4.00%  [kernel]  [k] memcpy
   3.48%  [kernel]  [k] xfs_dir_ino_validate

7% of the CPU usage in this directory traversal workload is
xfs_dir_ino_validate() doing absolutely nothing.

We don't need error injection to simulate bad inode numbers in the
directory structure because we can do that by fuzzing the structure
on disk.

And we don't need a corruption report, because the
__xfs_dir3_data_check() will emit one if the inode number is bad.

So just call xfs_verify_dir_ino() directly here, and get rid of all
this unnecessary overhead:

  40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
  10.98%  [kernel]  [k] __xfs_dir3_data_check
   8.10%  [kernel]  [k] xfs_verify_dir_ino
   4.42%  [kernel]  [k] memcpy
   2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
   1.52%  [kernel]  [k] do_raw_spin_lock

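As a toy model of the two per-entry paths (all names hypothetical),
the old path drags a corruption report and an error injection hook
into the hot loop, while the new one is the bare predicate:

#include <stdbool.h>
#include <stdint.h>

static bool verify_dir_ino(uint64_t ino)	/* range predicate */
{
	return ino != 0;			/* stand-in check */
}

static bool errortag_test(void)			/* injection hook */
{
	return false;
}

/* old per-entry path: predicate + report + injection machinery */
static int dir_ino_validate(uint64_t ino)
{
	if (!verify_dir_ino(ino))
		return -1;		/* would emit a corruption report */
	if (errortag_test())
		return -1;		/* injected failure */
	return 0;
}

/* new per-entry path: the caller reports, on-disk fuzzing covers
 * the error injection case */
static bool dir_entry_ok(uint64_t ino)
{
	return verify_dir_ino(ino);
}
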
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_dir2_data.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index 375b3edb2ad2..e67fa086f2c1 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -218,7 +218,7 @@ __xfs_dir3_data_check(
 		 */
 		if (dep->namelen == 0)
 			return __this_address;
-		if (xfs_dir_ino_validate(mp, be64_to_cpu(dep->inumber)))
+		if (!xfs_verify_dir_ino(mp, be64_to_cpu(dep->inumber)))
 			return __this_address;
 		if (offset + xfs_dir2_data_entsize(mp, dep->namelen) > end)
 			return __this_address;
-- 
2.28.0



* [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (16 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
                   ` (26 subsequent siblings)
  44 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

On debug kernels, we call xfs_dir3_leaf_check_int() multiple times
on every directory modification. The robust hash ordering checks it
does on every entry in the leaf on every call results in a massive
CPU overhead which slows down debug kernels by a large amount.

We use xfs_dir3_leaf_check_int() for the verifiers as well, so we
can't just gut the function to reduce overhead. What we can do,
however, is reduce the work it does when it is called from the
debug interfaces, keeping just the high level checks in place and
leaving the robust validation to the verifiers. This means the debug
checks will catch gross errors, but subtle bugs might not be caught
until a verifier is run.

It is easy enough to restore the existing debug behaviour if the
developer needs it (just change a call parameter in the debug code),
but otherwise the overhead makes testing large directory block sizes
on debug kernels very slow.

Profile at an unlink rate of ~80k file/s on a 64k block size
filesystem before the patch:

  40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
  10.98%  [kernel]  [k] __xfs_dir3_data_check
   8.10%  [kernel]  [k] xfs_verify_dir_ino
   4.42%  [kernel]  [k] memcpy
   2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
   1.52%  [kernel]  [k] do_raw_spin_lock

Profile after, at an unlink rate of ~125k files/s (+50% improvement)
has largely dropped the leaf verification debug overhead out of the
profile.

  16.53%  [kernel]  [k] __xfs_dir3_data_check
  12.53%  [kernel]  [k] xfs_verify_dir_ino
   7.97%  [kernel]  [k] memcpy
   3.36%  [kernel]  [k] xfs_dir2_data_get_ftype
   2.86%  [kernel]  [k] __pv_queued_spin_lock_slowpath

Create shows a similar change in profile and a +25% improvement in
performance.

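The shape of the change is just a gating parameter on the shared
check function. A hedged sketch with invented types (the real
function is xfs_dir3_leaf_check_int()):

#include <stdbool.h>
#include <stddef.h>

struct leaf_ent { unsigned int hash; };
struct leaf { int count, max_count; struct leaf_ent ents[64]; };

/* NULL means "no fault found"; otherwise a description of the fault */
static const char *
leaf_check(const struct leaf *lp, bool expensive_checking)
{
	if (lp->count > lp->max_count)
		return "bad entry count";	/* cheap, always run */

	if (!expensive_checking)
		return NULL;			/* debug callers stop here */

	/* O(entries) hash-order scan, left to the verifiers */
	for (int i = 1; i < lp->count; i++)
		if (lp->ents[i].hash < lp->ents[i - 1].hash)
			return "hash out of order";
	return NULL;
}

/* debug path: leaf_check(lp, false); verifier: leaf_check(lp, true) */
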
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_dir2_leaf.c | 10 +++++++---
 fs/xfs/libxfs/xfs_dir2_node.c |  2 +-
 fs/xfs/libxfs/xfs_dir2_priv.h |  3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index 95d2a3f92d75..ccd8d0aa62b8 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -113,7 +113,7 @@ xfs_dir3_leaf1_check(
 	} else if (leafhdr.magic != XFS_DIR2_LEAF1_MAGIC)
 		return __this_address;
 
-	return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf);
+	return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf, false);
 }
 
 static inline void
@@ -139,7 +139,8 @@ xfs_failaddr_t
 xfs_dir3_leaf_check_int(
 	struct xfs_mount		*mp,
 	struct xfs_dir3_icleaf_hdr	*hdr,
-	struct xfs_dir2_leaf		*leaf)
+	struct xfs_dir2_leaf		*leaf,
+	bool				expensive_checking)
 {
 	struct xfs_da_geometry		*geo = mp->m_dir_geo;
 	xfs_dir2_leaf_tail_t		*ltp;
@@ -162,6 +163,9 @@ xfs_dir3_leaf_check_int(
 	    (char *)&hdr->ents[hdr->count] > (char *)xfs_dir2_leaf_bests_p(ltp))
 		return __this_address;
 
+	if (!expensive_checking)
+		return NULL;
+
 	/* Check hash value order, count stale entries.  */
 	for (i = stale = 0; i < hdr->count; i++) {
 		if (i + 1 < hdr->count) {
@@ -195,7 +199,7 @@ xfs_dir3_leaf_verify(
 		return fa;
 
 	xfs_dir2_leaf_hdr_from_disk(mp, &leafhdr, bp->b_addr);
-	return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr);
+	return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr, true);
 }
 
 static void
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 5d51265d29d6..80a64117b460 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -73,7 +73,7 @@ xfs_dir3_leafn_check(
 	} else if (leafhdr.magic != XFS_DIR2_LEAFN_MAGIC)
 		return __this_address;
 
-	return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf);
+	return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf, false);
 }
 
 static inline void
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 44c6a77cba05..94943ce49cab 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -127,7 +127,8 @@ xfs_dir3_leaf_find_entry(struct xfs_dir3_icleaf_hdr *leafhdr,
 extern int xfs_dir2_node_to_leaf(struct xfs_da_state *state);
 
 extern xfs_failaddr_t xfs_dir3_leaf_check_int(struct xfs_mount *mp,
-		struct xfs_dir3_icleaf_hdr *hdr, struct xfs_dir2_leaf *leaf);
+		struct xfs_dir3_icleaf_hdr *hdr, struct xfs_dir2_leaf *leaf,
+		bool expensive_checks);
 
 /* xfs_dir2_node.c */
 void xfs_dir2_free_hdr_from_disk(struct xfs_mount *mp,
-- 
2.28.0



* [PATCH 19/45] xfs: factor out the CIL transaction header building
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (17 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-08 23:47   ` Darrick J. Wong
  2021-03-16 14:50   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
                   ` (25 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It is static code deep in the middle of the CIL push logic. Factor
it out into a helper so that it is clear and easy to modify
separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log_cil.c | 71 +++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index dfc9ef692a80..b515002e7959 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -651,6 +651,41 @@ xlog_cil_process_committed(
 	}
 }
 
+struct xlog_cil_trans_hdr {
+	struct xfs_trans_header	thdr;
+	struct xfs_log_iovec	lhdr;
+};
+
+/*
+ * Build a checkpoint transaction header to begin the journal transaction.  We
+ * need to account for the space used by the transaction header here as it is
+ * not accounted for in xlog_write().
+ */
+static void
+xlog_cil_build_trans_hdr(
+	struct xfs_cil_ctx	*ctx,
+	struct xlog_cil_trans_hdr *hdr,
+	struct xfs_log_vec	*lvhdr,
+	int			num_iovecs)
+{
+	struct xlog_ticket	*tic = ctx->ticket;
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
+	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
+	hdr->thdr.th_tid = tic->t_tid;
+	hdr->thdr.th_num_items = num_iovecs;
+	hdr->lhdr.i_addr = &hdr->thdr;
+	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
+	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
+	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t);
+
+	lvhdr->lv_niovecs = 1;
+	lvhdr->lv_iovecp = &hdr->lhdr;
+	lvhdr->lv_next = ctx->lv_chain;
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -676,11 +711,9 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
-	struct xlog_ticket	*tic;
 	int			num_iovecs;
 	int			error = 0;
-	struct xfs_trans_header thdr;
-	struct xfs_log_iovec	lhdr;
+	struct xlog_cil_trans_hdr thdr;
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
@@ -827,24 +860,8 @@ xlog_cil_push_work(
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
 	 * transaction header here as it is not accounted for in xlog_write().
-	 *
-	 * The LSN we need to pass to the log items on transaction commit is
-	 * the LSN reported by the first log vector write. If we use the commit
-	 * record lsn then we can move the tail beyond the grant write head.
 	 */
-	tic = ctx->ticket;
-	thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
-	thdr.th_type = XFS_TRANS_CHECKPOINT;
-	thdr.th_tid = tic->t_tid;
-	thdr.th_num_items = num_iovecs;
-	lhdr.i_addr = &thdr;
-	lhdr.i_len = sizeof(xfs_trans_header_t);
-	lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
-	tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t);
-
-	lvhdr.lv_niovecs = 1;
-	lvhdr.lv_iovecp = &lhdr;
-	lvhdr.lv_next = ctx->lv_chain;
+	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -852,7 +869,13 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
+	/*
+	 * The LSN we need to pass to the log items on transaction commit is the
+	 * LSN reported by the first log vector write, not the commit lsn. If we
+	 * use the commit record lsn then we can move the tail beyond the grant
+	 * write head.
+	 */
+	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
 				XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
@@ -891,11 +914,11 @@ xlog_cil_push_work(
 	}
 	spin_unlock(&cil->xc_push_lock);
 
-	error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);
+	error = xlog_commit_record(log, ctx->ticket, &commit_iclog, &commit_lsn);
 	if (error)
 		goto out_abort_free_ticket;
 
-	xfs_log_ticket_ungrant(log, tic);
+	xfs_log_ticket_ungrant(log, ctx->ticket);
 
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
@@ -946,7 +969,7 @@ xlog_cil_push_work(
 	return;
 
 out_abort_free_ticket:
-	xfs_log_ticket_ungrant(log, tic);
+	xfs_log_ticket_ungrant(log, ctx->ticket);
 out_abort:
 	ASSERT(XLOG_FORCED_SHUTDOWN(log));
 	xlog_cil_committed(ctx);
-- 
2.28.0



* [PATCH 20/45] xfs: only CIL pushes require a start record
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (18 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  0:07   ` Darrick J. Wong
  2021-03-16 14:51   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Dave Chinner
                   ` (24 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Move the one-off start record writing in xlog_write() out into
the static header that the CIL push builds to write into the log
initially. This simplifies the xlog_write() logic a lot.

pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
32 bit aligned and packed for copying the log op and transaction
headers directly into the log as a single log region copy.

struct xlog_cil_trans_hdr {
	struct xlog_op_header      oph[2];               /*     0    24 */
	struct xfs_trans_header    thdr;                 /*    24    16 */
	struct xfs_log_iovec       lhdr;                 /*    40    16 */

	/* size: 56, cachelines: 1, members: 3 */
	/* last cacheline: 56 bytes */
};

A wart is needed to handle the fact that the length of the region the
opheader points to doesn't include the opheader length. Hence if
we embed the opheader, we have to subtract the opheader length from
the length written into the opheader by the generic copying code.
This will eventually go away when everything is converted to
embedded opheaders.

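For readers who want the layout claim checked mechanically, here is a
hedged standalone sketch: stand-in structures sized like the ones
above (not the real XFS headers), plus static assertions mirroring
the pahole output:

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct op_header {			/* 12 bytes, like xlog_op_header */
	uint32_t	tid;
	uint32_t	len;
	uint8_t		clientid;
	uint8_t		flags;
	uint16_t	res2;
};

struct trans_header {			/* 16 bytes */
	uint32_t	magic, type, tid, num_items;
};

struct log_iovec {			/* 16 bytes on x86-64 */
	void		*addr;
	uint32_t	len;
	uint32_t	type;
};

struct cil_trans_hdr {
	struct op_header	oph[2];
	struct trans_header	thdr;
	struct log_iovec	lhdr;
};

static_assert(offsetof(struct cil_trans_hdr, thdr) == 24, "ophs packed");
static_assert(offsetof(struct cil_trans_hdr, lhdr) == 40, "no padding");
static_assert(sizeof(struct cil_trans_hdr) == 56, "matches pahole");
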
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c     | 90 ++++++++++++++++++++++----------------------
 fs/xfs/xfs_log_cil.c | 44 ++++++++++++++++++----
 2 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f54d48f4584e..b2f9fb1b4fed 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2106,9 +2106,9 @@ xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector.  We may need a start
- * record, and each region gets its own struct xlog_op_header and may need to be
- * double word aligned.
+ * Calculate the potential space needed by the log vector. If this is a start
+ * transaction, the caller has already accounted for both opheaders in the start
+ * transaction, so we don't need to account for them here.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2121,9 +2121,6 @@ xlog_write_calc_vec_length(
 	int			len = 0;
 	int			i;
 
-	if (optype & XLOG_START_TRANS)
-		headers++;
-
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2139,24 +2136,20 @@ xlog_write_calc_vec_length(
 		}
 	}
 
+	/* Don't account for regions with embedded ophdrs */
+	if (optype && headers > 0) {
+		if (optype & XLOG_START_TRANS) {
+			ASSERT(headers >= 2);
+			headers -= 2;
+		}
+	}
+
 	ticket->t_res_num_ophdrs += headers;
 	len += headers * sizeof(struct xlog_op_header);
 
 	return len;
 }
 
-static void
-xlog_write_start_rec(
-	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket)
-{
-	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
-	ophdr->oh_clientid = ticket->t_clientid;
-	ophdr->oh_len = 0;
-	ophdr->oh_flags = XLOG_START_TRANS;
-	ophdr->oh_res2 = 0;
-}
-
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
 	struct xlog		*log,
@@ -2361,9 +2354,11 @@ xlog_write(
 	 * If this is a commit or unmount transaction, we don't need a start
 	 * record to be written.  We do, however, have to account for the
 	 * commit or unmount header that gets written. Hence we always have
-	 * to account for an extra xlog_op_header here.
+	 * to account for an extra xlog_op_header here for commit and unmount
+	 * records.
 	 */
-	ticket->t_curr_res -= sizeof(struct xlog_op_header);
+	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
+		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
 		     "ctx ticket reservation ran out. Need to up reservation");
@@ -2411,7 +2406,7 @@ xlog_write(
 			int			copy_len;
 			int			copy_off;
 			bool			ordered = false;
-			bool			wrote_start_rec = false;
+			bool			added_ophdr = false;
 
 			/* ordered log vectors have no regions to write */
 			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
@@ -2425,25 +2420,24 @@ xlog_write(
 			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
 			/*
-			 * Before we start formatting log vectors, we need to
-			 * write a start record. Only do this for the first
-			 * iclog we write to.
+			 * The XLOG_START_TRANS has embedded ophdrs for the
+			 * start record and transaction header. They will always
+			 * be the first two regions in the lv chain.
 			 */
 			if (optype & XLOG_START_TRANS) {
-				xlog_write_start_rec(ptr, ticket);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-						sizeof(struct xlog_op_header));
-				optype &= ~XLOG_START_TRANS;
-				wrote_start_rec = true;
-			}
-
-			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
-			if (!ophdr)
-				return -EIO;
+				ophdr = reg->i_addr;
+				if (index)
+					optype &= ~XLOG_START_TRANS;
+			} else {
+				ophdr = xlog_write_setup_ophdr(log, ptr,
+							ticket, optype);
+				if (!ophdr)
+					return -EIO;
 
-			xlog_write_adv_cnt(&ptr, &len, &log_offset,
+				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
-
+				added_ophdr = true;
+			}
 			len += xlog_write_setup_copy(ticket, ophdr,
 						     iclog->ic_size-log_offset,
 						     reg->i_len,
@@ -2452,13 +2446,22 @@ xlog_write(
 						     &partial_copy_len);
 			xlog_verify_dest_ptr(log, ptr);
 
+
+			/*
+			 * Wart: need to update length in embedded ophdr not
+			 * to include its own length.
+			 */
+			if (!added_ophdr) {
+				ophdr->oh_len = cpu_to_be32(copy_len -
+						sizeof(struct xlog_op_header));
+			}
 			/*
 			 * Copy region.
 			 *
-			 * Unmount records just log an opheader, so can have
-			 * empty payloads with no data region to copy. Hence we
-			 * only copy the payload if the vector says it has data
-			 * to copy.
+			 * Commit and unmount records just log an opheader, so
+			 * we can have empty payloads with no data region to
+			 * copy.  Hence we only copy the payload if the vector
+			 * says it has data to copy.
 			 */
 			ASSERT(copy_len >= 0);
 			if (copy_len > 0) {
@@ -2466,12 +2469,9 @@ xlog_write(
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 						   copy_len);
 			}
-			copy_len += sizeof(struct xlog_op_header);
-			record_cnt++;
-			if (wrote_start_rec) {
+			if (added_ophdr)
 				copy_len += sizeof(struct xlog_op_header);
-				record_cnt++;
-			}
+			record_cnt++;
 			data_cnt += contwr ? copy_len : 0;
 
 			error = xlog_write_copy_finish(log, iclog, optype,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b515002e7959..e9da074ecd69 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -652,14 +652,22 @@ xlog_cil_process_committed(
 }
 
 struct xlog_cil_trans_hdr {
+	struct xlog_op_header	oph[2];
 	struct xfs_trans_header	thdr;
-	struct xfs_log_iovec	lhdr;
+	struct xfs_log_iovec	lhdr[2];
 };
 
 /*
  * Build a checkpoint transaction header to begin the journal transaction.  We
  * need to account for the space used by the transaction header here as it is
  * not accounted for in xlog_write().
+ *
+ * This is the only place we write a transaction header, so we also build the
+ * log opheaders that indicate the start of a log transaction and wrap the
+ * transaction header. We keep the start record in its own log vector rather
+ * than compacting them into a single region as this ends up making the logic
+ * in xlog_write() for handling empty opheaders for start, commit and unmount
+ * records much simpler.
  */
 static void
 xlog_cil_build_trans_hdr(
@@ -669,20 +677,40 @@ xlog_cil_build_trans_hdr(
 	int			num_iovecs)
 {
 	struct xlog_ticket	*tic = ctx->ticket;
+	uint32_t		tid = cpu_to_be32(tic->t_tid);
 
 	memset(hdr, 0, sizeof(*hdr));
 
+	/* Log start record */
+	hdr->oph[0].oh_tid = tid;
+	hdr->oph[0].oh_clientid = XFS_TRANSACTION;
+	hdr->oph[0].oh_flags = XLOG_START_TRANS;
+
+	/* log iovec region pointer */
+	hdr->lhdr[0].i_addr = &hdr->oph[0];
+	hdr->lhdr[0].i_len = sizeof(struct xlog_op_header);
+	hdr->lhdr[0].i_type = XLOG_REG_TYPE_LRHEADER;
+
+	/* log opheader */
+	hdr->oph[1].oh_tid = tid;
+	hdr->oph[1].oh_clientid = XFS_TRANSACTION;
+
+	/* transaction header */
 	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
 	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
-	hdr->thdr.th_tid = tic->t_tid;
+	hdr->thdr.th_tid = tid;
 	hdr->thdr.th_num_items = num_iovecs;
-	hdr->lhdr.i_addr = &hdr->thdr;
-	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
-	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
-	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t);
 
-	lvhdr->lv_niovecs = 1;
-	lvhdr->lv_iovecp = &hdr->lhdr;
+	/* log iovec region pointer */
+	hdr->lhdr[1].i_addr = &hdr->oph[1];
+	hdr->lhdr[1].i_len = sizeof(struct xlog_op_header) +
+				sizeof(struct xfs_trans_header);
+	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
+
+	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
+
+	lvhdr->lv_niovecs = 2;
+	lvhdr->lv_iovecp = &hdr->lhdr[0];
 	lvhdr->lv_next = ctx->lv_chain;
 }
 
-- 
2.28.0



* [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (19 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  0:15   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Dave Chinner
                   ` (23 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Remove another case where xlog_write() has to prepend an opheader to
a log transaction. The unmount record + ophdr is smaller than the
minimum amount of space guaranteed to be free in an iclog (2 *
sizeof(ophdr)) and so we don't have to care about an unmount record
being split across 2 iclogs.

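The claim that the record can't be split is simple arithmetic. A
compile-time sketch, with both sizes stated as assumptions (a 12-byte
opheader and an 8-byte unmount format record) rather than quoted from
the patch:

#include <assert.h>

#define OPHDR_SIZE	12	/* assumed sizeof(xlog_op_header) */
#define ULF_SIZE	8	/* assumed sizeof(unmount format rec) */

/* an iclog always has at least 2 * sizeof(ophdr) bytes free... */
#define ICLOG_MIN_FREE	(2 * OPHDR_SIZE)

/* ...so the embedded ophdr + unmount record can never be split */
static_assert(OPHDR_SIZE + ULF_SIZE <= ICLOG_MIN_FREE,
	      "unmount record fits in the guaranteed iclog space");
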
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b2f9fb1b4fed..94711b9ff007 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -798,12 +798,22 @@ xlog_write_unmount_record(
 	struct xlog		*log,
 	struct xlog_ticket	*ticket)
 {
-	struct xfs_unmount_log_format ulf = {
-		.magic = XLOG_UNMOUNT_TYPE,
+	struct  {
+		struct xlog_op_header ophdr;
+		struct xfs_unmount_log_format ulf;
+	} unmount_rec = {
+		.ophdr = {
+			.oh_clientid = XFS_LOG,
+			.oh_tid = cpu_to_be32(ticket->t_tid),
+			.oh_flags = XLOG_UNMOUNT_TRANS,
+		},
+		.ulf = {
+			.magic = XLOG_UNMOUNT_TYPE,
+		},
 	};
 	struct xfs_log_iovec reg = {
-		.i_addr = &ulf,
-		.i_len = sizeof(ulf),
+		.i_addr = &unmount_rec,
+		.i_len = sizeof(unmount_rec),
 		.i_type = XLOG_REG_TYPE_UNMOUNT,
 	};
 	struct xfs_log_vec vec = {
@@ -812,7 +822,7 @@ xlog_write_unmount_record(
 	};
 
 	/* account for space used by record data */
-	ticket->t_curr_res -= sizeof(ulf);
+	ticket->t_curr_res -= sizeof(unmount_rec);
 
 	/*
 	 * For external log devices, we need to flush the data device cache
@@ -2138,6 +2148,8 @@ xlog_write_calc_vec_length(
 
 	/* Don't account for regions with embedded ophdrs */
 	if (optype && headers > 0) {
+		if (optype & XLOG_UNMOUNT_TRANS)
+			headers--;
 		if (optype & XLOG_START_TRANS) {
 			ASSERT(headers >= 2);
 			headers -= 2;
@@ -2352,12 +2364,11 @@ xlog_write(
 
 	/*
 	 * If this is a commit or unmount transaction, we don't need a start
-	 * record to be written.  We do, however, have to account for the
-	 * commit or unmount header that gets written. Hence we always have
-	 * to account for an extra xlog_op_header here for commit and unmount
-	 * records.
+	 * record to be written.  We do, however, have to account for the commit
+	 * header that gets written. Hence we always have to account for an
+	 * extra xlog_op_header here for commit records.
 	 */
-	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
+	if (optype & XLOG_COMMIT_TRANS)
 		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
@@ -2428,6 +2439,8 @@ xlog_write(
 				ophdr = reg->i_addr;
 				if (index)
 					optype &= ~XLOG_START_TRANS;
+			} else if (optype & XLOG_UNMOUNT_TRANS) {
+				ophdr = reg->i_addr;
 			} else {
 				ophdr = xlog_write_setup_ophdr(log, ptr,
 							ticket, optype);
@@ -2458,7 +2471,7 @@ xlog_write(
 			/*
 			 * Copy region.
 			 *
-			 * Commit and unmount records just log an opheader, so
+			 * Commit records just log an opheader, so
 			 * we can have empty payloads with no data region to
 			 * copy.  Hence we only copy the payload if the vector
 			 * says it has data to copy.
-- 
2.28.0



* [PATCH 22/45] xfs: embed the xlog_op_header in the commit record
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (20 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  0:17   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
                   ` (22 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

Remove the final case where xlog_write() has to prepend an opheader
to a log transaction. Similar to the start record, the commit record
is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can
just make this the payload for the region being passed to
xlog_write() and remove the special handling in xlog_write() for
the commit record.

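In miniature, the pattern looks like this; the types and the flag
value are illustrative stand-ins, not the XFS definitions. The commit
record's only payload is its own opheader, so the log region simply
points at it:

#include <stdint.h>

#define COMMIT_TRANS	0x2	/* stand-in for XLOG_COMMIT_TRANS */

struct op_header {
	uint32_t	tid;
	uint32_t	len;
	uint8_t		clientid;
	uint8_t		flags;
	uint16_t	res2;
};

struct log_region {
	const void	*addr;
	uint32_t	len;
};

static void build_commit_region(uint32_t tid, struct op_header *oph,
				struct log_region *reg)
{
	*oph = (struct op_header){
		.tid	= tid,
		.flags	= COMMIT_TRANS,
	};
	reg->addr = oph;		/* the header is the payload */
	reg->len  = sizeof(*oph);	/* empty data region follows */
}
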
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 94711b9ff007..c2e69a1f5cad 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1529,9 +1529,14 @@ xlog_commit_record(
 	struct xlog_in_core	**iclog,
 	xfs_lsn_t		*lsn)
 {
+	struct xlog_op_header	ophdr = {
+		.oh_clientid = XFS_TRANSACTION,
+		.oh_tid = cpu_to_be32(ticket->t_tid),
+		.oh_flags = XLOG_COMMIT_TRANS,
+	};
 	struct xfs_log_iovec reg = {
-		.i_addr = NULL,
-		.i_len = 0,
+		.i_addr = &ophdr,
+		.i_len = sizeof(struct xlog_op_header),
 		.i_type = XLOG_REG_TYPE_COMMIT,
 	};
 	struct xfs_log_vec vec = {
@@ -1543,6 +1548,8 @@ xlog_commit_record(
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
+	/* account for space used by record data */
+	ticket->t_curr_res -= reg.i_len;
 	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -2148,11 +2155,10 @@ xlog_write_calc_vec_length(
 
 	/* Don't account for regions with embedded ophdrs */
 	if (optype && headers > 0) {
-		if (optype & XLOG_UNMOUNT_TRANS)
-			headers--;
+		headers--;
 		if (optype & XLOG_START_TRANS) {
-			ASSERT(headers >= 2);
-			headers -= 2;
+			ASSERT(headers >= 1);
+			headers--;
 		}
 	}
 
@@ -2362,14 +2368,6 @@ xlog_write(
 	int			data_cnt = 0;
 	int			error = 0;
 
-	/*
-	 * If this is a commit or unmount transaction, we don't need a start
-	 * record to be written.  We do, however, have to account for the commit
-	 * header that gets written. Hence we always have to account for an
-	 * extra xlog_op_header here for commit records.
-	 */
-	if (optype & XLOG_COMMIT_TRANS)
-		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
 		     "ctx ticket reservation ran out. Need to up reservation");
@@ -2433,14 +2431,13 @@ xlog_write(
 			/*
 			 * The XLOG_START_TRANS has embedded ophdrs for the
 			 * start record and transaction header. They will always
-			 * be the first two regions in the lv chain.
+			 * be the first two regions in the lv chain. Commit and
+			 * unmount records also have embedded ophdrs.
 			 */
-			if (optype & XLOG_START_TRANS) {
+			if (optype) {
 				ophdr = reg->i_addr;
 				if (index)
 					optype &= ~XLOG_START_TRANS;
-			} else if (optype & XLOG_UNMOUNT_TRANS) {
-				ophdr = reg->i_addr;
 			} else {
 				ophdr = xlog_write_setup_ophdr(log, ptr,
 							ticket, optype);
-- 
2.28.0



* [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (21 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  0:21   ` Darrick J. Wong
  2021-03-16 14:51   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
                   ` (21 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently set the log ticket client ID when we reserve a
transaction. This client ID is only ever written to the log by
CIL checkpoint or unmount records, and so anything using a high
level transaction allocated through xfs_trans_alloc() does not need
a log ticket client ID to be set.

For the CIL checkpoint, the client ID written to the journal is
always XFS_TRANSACTION; for the unmount record it is always
XFS_LOG; nothing else writes to the log. All of these operations
tell xlog_write() exactly what they need to write to the log (the
optype) and build their own opheaders for start, commit and unmount
records. Hence we no longer need to set the client id in either the
log ticket or the xfs_trans.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c      | 47 ++++++++-----------------------------------
 fs/xfs/xfs_log.h      | 16 ++++++---------
 fs/xfs/xfs_log_cil.c  |  2 +-
 fs/xfs/xfs_log_priv.h | 10 ++-------
 fs/xfs/xfs_trans.c    |  6 ++----
 5 files changed, 19 insertions(+), 62 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c2e69a1f5cad..429cb1e7cc67 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -431,10 +431,9 @@ xfs_log_regrant(
 int
 xfs_log_reserve(
 	struct xfs_mount	*mp,
-	int		 	unit_bytes,
-	int		 	cnt,
+	int			unit_bytes,
+	int			cnt,
 	struct xlog_ticket	**ticp,
-	uint8_t		 	client,
 	bool			permanent)
 {
 	struct xlog		*log = mp->m_log;
@@ -442,15 +441,13 @@ xfs_log_reserve(
 	int			need_bytes;
 	int			error = 0;
 
-	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
-
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
 	XFS_STATS_INC(mp, xs_try_logspace);
 
 	ASSERT(*ticp == NULL);
-	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent);
+	tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
@@ -847,7 +844,7 @@ xlog_unmount_write(
 	struct xlog_ticket	*tic = NULL;
 	int			error;
 
-	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
+	error = xfs_log_reserve(mp, 600, 1, &tic, 0);
 	if (error)
 		goto out_err;
 
@@ -2170,35 +2167,13 @@ xlog_write_calc_vec_length(
 
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
-	struct xlog		*log,
 	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket,
-	uint			flags)
+	struct xlog_ticket	*ticket)
 {
 	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-	ophdr->oh_clientid = ticket->t_clientid;
+	ophdr->oh_clientid = XFS_TRANSACTION;
 	ophdr->oh_res2 = 0;
-
-	/* are we copying a commit or unmount record? */
-	ophdr->oh_flags = flags;
-
-	/*
-	 * We've seen logs corrupted with bad transaction client ids.  This
-	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
-	 * and shut down the filesystem.
-	 */
-	switch (ophdr->oh_clientid)  {
-	case XFS_TRANSACTION:
-	case XFS_VOLUME:
-	case XFS_LOG:
-		break;
-	default:
-		xfs_warn(log->l_mp,
-			"Bad XFS transaction clientid 0x%x in ticket "PTR_FMT,
-			ophdr->oh_clientid, ticket);
-		return NULL;
-	}
-
+	ophdr->oh_flags = 0;
 	return ophdr;
 }
 
@@ -2439,11 +2414,7 @@ xlog_write(
 				if (index)
 					optype &= ~XLOG_START_TRANS;
 			} else {
-				ophdr = xlog_write_setup_ophdr(log, ptr,
-							ticket, optype);
-				if (!ophdr)
-					return -EIO;
-
+				ophdr = xlog_write_setup_ophdr(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
 				added_ophdr = true;
@@ -3499,7 +3470,6 @@ xlog_ticket_alloc(
 	struct xlog		*log,
 	int			unit_bytes,
 	int			cnt,
-	char			client,
 	bool			permanent)
 {
 	struct xlog_ticket	*tic;
@@ -3517,7 +3487,6 @@ xlog_ticket_alloc(
 	tic->t_cnt		= cnt;
 	tic->t_ocnt		= cnt;
 	tic->t_tid		= prandom_u32();
-	tic->t_clientid		= client;
 	if (permanent)
 		tic->t_flags |= XLOG_TIC_PERM_RESERV;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1bd080ce3a95..c0c3141944ea 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -117,16 +117,12 @@ int	  xfs_log_mount_finish(struct xfs_mount *mp);
 void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
-void	  xfs_log_space_wake(struct xfs_mount *mp);
-int	  xfs_log_reserve(struct xfs_mount *mp,
-			  int		   length,
-			  int		   count,
-			  struct xlog_ticket **ticket,
-			  uint8_t		   clientid,
-			  bool		   permanent);
-int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
-void      xfs_log_unmount(struct xfs_mount *mp);
-int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
+void	xfs_log_space_wake(struct xfs_mount *mp);
+int	xfs_log_reserve(struct xfs_mount *mp, int length, int count,
+			struct xlog_ticket **ticket, bool permanent);
+int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
+void	xfs_log_unmount(struct xfs_mount *mp);
+int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
 bool	xfs_log_writable(struct xfs_mount *mp);
 
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index e9da074ecd69..0c81c13e2cf6 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
 {
 	struct xlog_ticket *tic;
 
-	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0);
+	tic = xlog_ticket_alloc(log, 0, 1, 0);
 
 	/*
 	 * set the current reservation to zero so we know to steal the basic
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index bb5fa6b71114..7f601c1c9f45 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -158,7 +158,6 @@ typedef struct xlog_ticket {
 	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
 	char		   t_ocnt;	 /* original count		 : 1  */
 	char		   t_cnt;	 /* current count		 : 1  */
-	char		   t_clientid;	 /* who does this belong to;	 : 1  */
 	char		   t_flags;	 /* properties of reservation	 : 1  */
 
         /* reservation array fields */
@@ -465,13 +464,8 @@ extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
 			    char *dp, int size);
 
 extern kmem_zone_t *xfs_log_ticket_zone;
-struct xlog_ticket *
-xlog_ticket_alloc(
-	struct xlog	*log,
-	int		unit_bytes,
-	int		count,
-	char		client,
-	bool		permanent);
+struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes,
+		int count, bool permanent);
 
 static inline void
 xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 52f3fdf1e0de..83c2b7f22eb7 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -194,11 +194,9 @@ xfs_trans_reserve(
 			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
 			error = xfs_log_regrant(mp, tp->t_ticket);
 		} else {
-			error = xfs_log_reserve(mp,
-						resp->tr_logres,
+			error = xfs_log_reserve(mp, resp->tr_logres,
 						resp->tr_logcount,
-						&tp->t_ticket, XFS_TRANSACTION,
-						permanent);
+						&tp->t_ticket, permanent);
 		}
 
 		if (error)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 24/45] xfs: move log iovec alignment to preparation function
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (22 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:14   ` Darrick J. Wong
  2021-03-16 14:51   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
                   ` (20 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To include log op headers directly into the log iovec regions that
the ophdrs wrap, we need to move the buffer alignment code from
xlog_finish_iovec() to xlog_prepare_iovec(). This is because the
xlog_op_header is only 12 bytes long, and we need the buffer that
the caller formats their data into to be 8 byte aligned.

Hence once we start prepending the ophdr in xlog_prepare_iovec(), we
are going to need to manage the padding directly to ensure that the
buffer pointer returned is correctly aligned.
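
As a worked example (a sketch only; the two region sizes below are
illustrative, not from the patch): with the rounding done in
xlog_prepare_iovec(), consecutive regions of 52 and 16 bytes lay out
as:

	/*
	 * region 0: lv_buf_len = 0, already aligned
	 *           i_addr = lv_buf + 0; finish sets lv_buf_len = 52
	 * region 1: lv_buf_len rounded up 52 -> 56
	 *           i_addr = lv_buf + 56, which is 8 byte aligned
	 *
	 * The 4 padding bytes are never written to the log; only
	 * lv_bytes (52 + 16 = 68) is accounted to the CIL ticket.
	 */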

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.h | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index c0c3141944ea..1ca4f2edbdaf 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,6 +21,16 @@ struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
+/*
+ * We need to make sure the buffer pointer returned is naturally aligned for the
+ * biggest basic data type we put into it. We have already accounted for this
+ * padding when sizing the buffer.
+ *
+ * However, this padding does not get written into the log, and hence we have to
+ * track the space used by the log vectors separately to prevent log space hangs
+ * due to inaccurate accounting (i.e. a leak) of the used log space through the
+ * CIL context ticket.
+ */
 static inline void *
 xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		uint type)
@@ -34,6 +44,9 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		vec = &lv->lv_iovecp[0];
 	}
 
+	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
+		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
+
 	vec->i_type = type;
 	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
 
@@ -43,20 +56,10 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 	return vec->i_addr;
 }
 
-/*
- * We need to make sure the next buffer is naturally aligned for the biggest
- * basic data type we put into it.  We already accounted for this padding when
- * sizing the buffer.
- *
- * However, this padding does not get written into the log, and hence we have to
- * track the space used by the log vectors separately to prevent log space hangs
- * due to inaccurate accounting (i.e. a leak) of the used log space through the
- * CIL context ticket.
- */
 static inline void
 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
 {
-	lv->lv_buf_len += round_up(len, sizeof(uint64_t));
+	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (23 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:21   ` Darrick J. Wong
  2021-03-16 14:53   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
                   ` (19 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Current xlog_write() adds op headers to the log manually for every
log item region that is in the vector passed to it. While
xlog_write() needs to stamp the transaction ID into the ophdr, we
already know its length, flags, clientid, etc. at CIL commit time.

This means the only time that xlog_write() really needs to format
and reserve space for a new ophdr is when a region is split across
two iclogs. Adding the opheader and accounting for it as part of the
normal formatted item region means we simplify the accounting
of space used by a transaction and we don't have to special case
reserving space for the ophdrs in xlog_write(). It also means
we can largely initialise the ophdr in transaction commit instead
of xlog_write, making the xlog_write formatting inner loop much
tighter.

xlog_prepare_iovec() is now too large to stay as an inline function,
so we move it out of line and into xfs_log.c.

Object sizes:
text	   data	    bss	    dec	    hex	filename
1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after

So the code is roughly 2.5kB smaller with xlog_prepare_iovec() now
out of line, even though it grew in size itself.
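
For illustration, a minimal sketch of what an item format routine now
sees (the payload variables are made up for the example; only the two
helpers are real):

	struct xfs_log_iovec	*vecp = NULL;
	void			*buf;

	/* returns a pointer *past* the ophdr it has initialised */
	buf = xlog_prepare_iovec(lv, &vecp, XLOG_REG_TYPE_ICORE);
	memcpy(buf, payload, payload_len);
	/* takes the payload length; the ophdr bytes are added inside */
	xlog_finish_iovec(lv, vecp, payload_len);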

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_log.h     |  42 +++-------------
 fs/xfs/xfs_log_cil.c |  25 +++++-----
 3 files changed, 99 insertions(+), 83 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 429cb1e7cc67..98de45be80c0 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -89,6 +89,62 @@ xlog_iclogs_empty(
 static int
 xfs_log_cover(struct xfs_mount *);
 
+/*
+ * We need to make sure the buffer pointer returned is naturally aligned for the
+ * biggest basic data type we put into it. We have already accounted for this
+ * padding when sizing the buffer.
+ *
+ * However, this padding does not get written into the log, and hence we have to
+ * track the space used by the log vectors separately to prevent log space hangs
+ * due to inaccurate accounting (i.e. a leak) of the used log space through the
+ * CIL context ticket.
+ *
+ * We also add space for the xlog_op_header that describes this region in the
+ * log. This prepends the data region we return to the caller to copy their data
+ * into, so do all the static initialisation of the ophdr now. Because the ophdr
+ * is not 8 byte aligned, we have to be careful to ensure that we align the
+ * start of the buffer such that the region we return to the caller is 8 byte
+ * aligned and packed against the tail of the ophdr.
+ */
+void *
+xlog_prepare_iovec(
+	struct xfs_log_vec	*lv,
+	struct xfs_log_iovec	**vecp,
+	uint			type)
+{
+	struct xfs_log_iovec	*vec = *vecp;
+	struct xlog_op_header	*oph;
+	uint32_t		len;
+	void			*buf;
+
+	if (vec) {
+		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
+		vec++;
+	} else {
+		vec = &lv->lv_iovecp[0];
+	}
+
+	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
+	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
+		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
+					sizeof(struct xlog_op_header);
+	}
+
+	vec->i_type = type;
+	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
+
+	oph = vec->i_addr;
+	oph->oh_clientid = XFS_TRANSACTION;
+	oph->oh_res2 = 0;
+	oph->oh_flags = 0;
+
+	buf = vec->i_addr + sizeof(struct xlog_op_header);
+	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
+
+	*vecp = vec;
+	return buf;
+}
+
 static void
 xlog_grant_sub_space(
 	struct xlog		*log,
@@ -2120,9 +2176,9 @@ xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector. If this is a start
- * transaction, the caller has already accounted for both opheaders in the start
- * transaction, so we don't need to account for them here.
+ * Calculate the potential space needed by the log vector. All regions contain
+ * their own opheaders and they are accounted for in region space so we don't
+ * need to add them to the vector length here.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
 			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
 		}
 	}
-
-	/* Don't account for regions with embedded ophdrs */
-	if (optype && headers > 0) {
-		headers--;
-		if (optype & XLOG_START_TRANS) {
-			ASSERT(headers >= 1);
-			headers--;
-		}
-	}
-
 	ticket->t_res_num_ophdrs += headers;
-	len += headers * sizeof(struct xlog_op_header);
 
 	return len;
 }
@@ -2170,7 +2215,6 @@ xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
 	struct xlog_ticket	*ticket)
 {
-	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
 	ophdr->oh_clientid = XFS_TRANSACTION;
 	ophdr->oh_res2 = 0;
 	ophdr->oh_flags = 0;
@@ -2404,21 +2448,25 @@ xlog_write(
 			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
 			/*
-			 * The XLOG_START_TRANS has embedded ophdrs for the
-			 * start record and transaction header. They will always
-			 * be the first two regions in the lv chain. Commit and
-			 * unmount records also have embedded ophdrs.
+			 * Regions always have their ophdr at the start of the
+			 * region, except for:
+			 * - a transaction start which has a start record ophdr
+			 *   before the first region ophdr; and
+			 * - the previous region didn't fully fit into an iclog
+			 *   so needs a continuation ophdr to prepend the region
+			 *   in this new iclog.
 			 */
-			if (optype) {
-				ophdr = reg->i_addr;
-				if (index)
-					optype &= ~XLOG_START_TRANS;
-			} else {
+			ophdr = reg->i_addr;
+			if (optype && index) {
+				optype &= ~XLOG_START_TRANS;
+			} else if (partial_copy) {
 				ophdr = xlog_write_setup_ophdr(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
 				added_ophdr = true;
 			}
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+
 			len += xlog_write_setup_copy(ticket, ophdr,
 						     iclog->ic_size-log_offset,
 						     reg->i_len,
@@ -2436,20 +2484,11 @@ xlog_write(
 				ophdr->oh_len = cpu_to_be32(copy_len -
 						sizeof(struct xlog_op_header));
 			}
-			/*
-			 * Copy region.
-			 *
-			 * Commit records just log an opheader, so
-			 * we can have empty payloads with no data region to
-			 * copy.  Hence we only copy the payload if the vector
-			 * says it has data to copy.
-			 */
-			ASSERT(copy_len >= 0);
-			if (copy_len > 0) {
-				memcpy(ptr, reg->i_addr + copy_off, copy_len);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-						   copy_len);
-			}
+
+			ASSERT(copy_len > 0);
+			memcpy(ptr, reg->i_addr + copy_off, copy_len);
+			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
+
 			if (added_ophdr)
 				copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1ca4f2edbdaf..af54ea3f8c90 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,44 +21,18 @@ struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
-/*
- * We need to make sure the buffer pointer returned is naturally aligned for the
- * biggest basic data type we put into it. We have already accounted for this
- * padding when sizing the buffer.
- *
- * However, this padding does not get written into the log, and hence we have to
- * track the space used by the log vectors separately to prevent log space hangs
- * due to inaccurate accounting (i.e. a leak) of the used log space through the
- * CIL context ticket.
- */
-static inline void *
-xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
-		uint type)
-{
-	struct xfs_log_iovec *vec = *vecp;
-
-	if (vec) {
-		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
-		vec++;
-	} else {
-		vec = &lv->lv_iovecp[0];
-	}
-
-	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
-		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
-
-	vec->i_type = type;
-	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
-
-	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));
-
-	*vecp = vec;
-	return vec->i_addr;
-}
+void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
+		uint type);
 
 static inline void
 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
 {
+	struct xlog_op_header	*oph = vec->i_addr;
+
+	/* opheader tracks payload length, logvec tracks region length */
+	oph->oh_len = cpu_to_be32(len);
+
+	len += sizeof(struct xlog_op_header);
 	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 0c81c13e2cf6..7a5e6bdb7876 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
 		}
 
 		/*
-		 * We 64-bit align the length of each iovec so that the start
-		 * of the next one is naturally aligned.  We'll need to
-		 * account for that slack space here. Then round nbytes up
-		 * to 64-bit alignment so that the initial buffer alignment is
-		 * easy to calculate and verify.
+		 * We 64-bit align the length of each iovec so that the start of
+		 * the next one is naturally aligned.  We'll need to account for
+		 * that slack space here.
+		 *
+		 * We also add the xlog_op_header to each region when
+		 * formatting, but that's not accounted to the size of the item
+		 * at this point. Hence we'll need an additional number of bytes
+		 * for each vector to hold an opheader.
+		 *
+		 * Then round nbytes up to 64-bit alignment so that the initial
+		 * buffer alignment is easy to calculate and verify.
 		 */
-		nbytes += niovecs * sizeof(uint64_t);
+		nbytes += niovecs *
+			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
 		nbytes = round_up(nbytes, sizeof(uint64_t));
 
 		/*
@@ -433,11 +440,6 @@ xlog_cil_insert_items(
 
 	spin_lock(&cil->xc_cil_lock);
 
-	/* account for space used by new iovec headers  */
-	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
-	len += iovhdr_res;
-	ctx->nvecs += diff_iovecs;
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
@@ -469,6 +471,7 @@ xlog_cil_insert_items(
 	}
 	tp->t_ticket->t_curr_res -= len;
 	ctx->space_used += len;
+	ctx->nvecs += diff_iovecs;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 26/45] xfs: log ticket region debug is largely useless
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (24 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:31   ` Darrick J. Wong
  2021-03-16 14:55   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
                   ` (18 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xlog_tic_add_region() is used to trace the regions being added to a
log ticket to provide information in the situation where a ticket
reservation overrun occurs. The information gathered is stored in
the ticket, and dumped if xlog_print_tic_res() is called.

For a front end struct xfs_trans overrun, the ticket only contains
reservation tracking information - the ticket is never handed to the
log so has no regions attached to it. The overrun debug information in this
case comes from xlog_print_trans(), which walks the items attached
to the transaction and dumps their attached formatted log vectors
directly. It also dumps the ticket state, but that only contains
reservation accounting and nothing else. Hence xlog_print_tic_res()
never dumps region or overrun information from this path.

xlog_tic_add_region() is actually called from xlog_write(), which
means it is being used to track the regions seen in a
CIL checkpoint log vector chain. In looking at CIL behaviour
recently, I've seen 32MB checkpoints regularly exceed 250,000
regions in the LV chain. The log ticket debug code can track *15*
regions. IOWs, if there is a ticket overrun in the CIL code, the
ticket region tracking code is going to be completely useless for
determining what went wrong. The only thing it can tell us is how
much of an overrun occurred, and we really don't need extra debug
information in the log ticket to tell us that.

Indeed, the main place we call xlog_tic_add_region() is also adding
up the number of regions and the space used so that xlog_write()
knows how much will be written to the log. This is exactly the same
information that the log ticket is storing once we take away the useless
region tracking array. Hence xlog_tic_add_region() is not useful,
but can be called 250,000 times per CIL push...

Just strip all that debug "information" out of the log ticket
and only have it report reservation space information when an
overrun occurs. This also reduces the size of a log ticket down by
about 150 bytes...
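
After the diet, the ticket is reduced to the reservation fields alone;
condensed from the xfs_log_priv.h hunk below (elided fields marked):

	typedef struct xlog_ticket {
		struct list_head   t_queue;	 /* reserve/write queue */
		struct task_struct *t_task;	 /* task that owns this ticket */
		...
		int		   t_unit_res;	 /* unit reservation in bytes */
		char		   t_ocnt;	 /* original count */
		char		   t_cnt;	 /* current count */
		char		   t_flags;	 /* properties of reservation */
	} xlog_ticket_t;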

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c      | 107 +++---------------------------------------
 fs/xfs/xfs_log_priv.h |  17 -------
 2 files changed, 6 insertions(+), 118 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 98de45be80c0..412b167d8d0e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -377,30 +377,6 @@ xlog_grant_head_check(
 	return error;
 }
 
-static void
-xlog_tic_reset_res(xlog_ticket_t *tic)
-{
-	tic->t_res_num = 0;
-	tic->t_res_arr_sum = 0;
-	tic->t_res_num_ophdrs = 0;
-}
-
-static void
-xlog_tic_add_region(xlog_ticket_t *tic, uint len, uint type)
-{
-	if (tic->t_res_num == XLOG_TIC_LEN_MAX) {
-		/* add to overflow and start again */
-		tic->t_res_o_flow += tic->t_res_arr_sum;
-		tic->t_res_num = 0;
-		tic->t_res_arr_sum = 0;
-	}
-
-	tic->t_res_arr[tic->t_res_num].r_len = len;
-	tic->t_res_arr[tic->t_res_num].r_type = type;
-	tic->t_res_arr_sum += len;
-	tic->t_res_num++;
-}
-
 bool
 xfs_log_writable(
 	struct xfs_mount	*mp)
@@ -448,8 +424,6 @@ xfs_log_regrant(
 	xlog_grant_push_ail(log, tic->t_unit_res);
 
 	tic->t_curr_res = tic->t_unit_res;
-	xlog_tic_reset_res(tic);
-
 	if (tic->t_cnt > 0)
 		return 0;
 
@@ -2066,63 +2040,11 @@ xlog_print_tic_res(
 	struct xfs_mount	*mp,
 	struct xlog_ticket	*ticket)
 {
-	uint i;
-	uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t);
-
-	/* match with XLOG_REG_TYPE_* in xfs_log.h */
-#define REG_TYPE_STR(type, str)	[XLOG_REG_TYPE_##type] = str
-	static char *res_type_str[] = {
-	    REG_TYPE_STR(BFORMAT, "bformat"),
-	    REG_TYPE_STR(BCHUNK, "bchunk"),
-	    REG_TYPE_STR(EFI_FORMAT, "efi_format"),
-	    REG_TYPE_STR(EFD_FORMAT, "efd_format"),
-	    REG_TYPE_STR(IFORMAT, "iformat"),
-	    REG_TYPE_STR(ICORE, "icore"),
-	    REG_TYPE_STR(IEXT, "iext"),
-	    REG_TYPE_STR(IBROOT, "ibroot"),
-	    REG_TYPE_STR(ILOCAL, "ilocal"),
-	    REG_TYPE_STR(IATTR_EXT, "iattr_ext"),
-	    REG_TYPE_STR(IATTR_BROOT, "iattr_broot"),
-	    REG_TYPE_STR(IATTR_LOCAL, "iattr_local"),
-	    REG_TYPE_STR(QFORMAT, "qformat"),
-	    REG_TYPE_STR(DQUOT, "dquot"),
-	    REG_TYPE_STR(QUOTAOFF, "quotaoff"),
-	    REG_TYPE_STR(LRHEADER, "LR header"),
-	    REG_TYPE_STR(UNMOUNT, "unmount"),
-	    REG_TYPE_STR(COMMIT, "commit"),
-	    REG_TYPE_STR(TRANSHDR, "trans header"),
-	    REG_TYPE_STR(ICREATE, "inode create"),
-	    REG_TYPE_STR(RUI_FORMAT, "rui_format"),
-	    REG_TYPE_STR(RUD_FORMAT, "rud_format"),
-	    REG_TYPE_STR(CUI_FORMAT, "cui_format"),
-	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
-	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
-	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
-	};
-	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
-#undef REG_TYPE_STR
-
 	xfs_warn(mp, "ticket reservation summary:");
-	xfs_warn(mp, "  unit res    = %d bytes",
-		 ticket->t_unit_res);
-	xfs_warn(mp, "  current res = %d bytes",
-		 ticket->t_curr_res);
-	xfs_warn(mp, "  total reg   = %u bytes (o/flow = %u bytes)",
-		 ticket->t_res_arr_sum, ticket->t_res_o_flow);
-	xfs_warn(mp, "  ophdrs      = %u (ophdr space = %u bytes)",
-		 ticket->t_res_num_ophdrs, ophdr_spc);
-	xfs_warn(mp, "  ophdr + reg = %u bytes",
-		 ticket->t_res_arr_sum + ticket->t_res_o_flow + ophdr_spc);
-	xfs_warn(mp, "  num regions = %u",
-		 ticket->t_res_num);
-
-	for (i = 0; i < ticket->t_res_num; i++) {
-		uint r_type = ticket->t_res_arr[i].r_type;
-		xfs_warn(mp, "region[%u]: %s - %u bytes", i,
-			    ((r_type <= 0 || r_type > XLOG_REG_TYPE_MAX) ?
-			    "bad-rtype" : res_type_str[r_type]),
-			    ticket->t_res_arr[i].r_len);
-	}
+	xfs_warn(mp, "  unit res    = %d bytes", ticket->t_unit_res);
+	xfs_warn(mp, "  current res = %d bytes", ticket->t_curr_res);
+	xfs_warn(mp, "  original count  = %d", ticket->t_ocnt);
+	xfs_warn(mp, "  remaining count = %d", ticket->t_cnt);
 }
 
 /*
@@ -2187,7 +2109,6 @@ xlog_write_calc_vec_length(
 	uint			optype)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = 0;
 	int			len = 0;
 	int			i;
 
@@ -2196,17 +2117,9 @@ xlog_write_calc_vec_length(
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
 			continue;
 
-		headers += lv->lv_niovecs;
-
-		for (i = 0; i < lv->lv_niovecs; i++) {
-			struct xfs_log_iovec	*vecp = &lv->lv_iovecp[i];
-
-			len += vecp->i_len;
-			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
-		}
+		for (i = 0; i < lv->lv_niovecs; i++)
+			len += lv->lv_iovecp[i].i_len;
 	}
-	ticket->t_res_num_ophdrs += headers;
-
 	return len;
 }
 
@@ -2265,7 +2178,6 @@ xlog_write_setup_copy(
 
 	/* account for new log op header */
 	ticket->t_curr_res -= sizeof(struct xlog_op_header);
-	ticket->t_res_num_ophdrs++;
 
 	return sizeof(struct xlog_op_header);
 }
@@ -2973,9 +2885,6 @@ xlog_state_get_iclog_space(
 	 */
 	if (log_offset == 0) {
 		ticket->t_curr_res -= log->l_iclog_hsize;
-		xlog_tic_add_region(ticket,
-				    log->l_iclog_hsize,
-				    XLOG_REG_TYPE_LRHEADER);
 		head->h_cycle = cpu_to_be32(log->l_curr_cycle);
 		head->h_lsn = cpu_to_be64(
 			xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block));
@@ -3055,7 +2964,6 @@ xfs_log_ticket_regrant(
 	xlog_grant_sub_space(log, &log->l_write_head.grant,
 					ticket->t_curr_res);
 	ticket->t_curr_res = ticket->t_unit_res;
-	xlog_tic_reset_res(ticket);
 
 	trace_xfs_log_ticket_regrant_sub(log, ticket);
 
@@ -3066,7 +2974,6 @@ xfs_log_ticket_regrant(
 		trace_xfs_log_ticket_regrant_exit(log, ticket);
 
 		ticket->t_curr_res = ticket->t_unit_res;
-		xlog_tic_reset_res(ticket);
 	}
 
 	xfs_log_ticket_put(ticket);
@@ -3529,8 +3436,6 @@ xlog_ticket_alloc(
 	if (permanent)
 		tic->t_flags |= XLOG_TIC_PERM_RESERV;
 
-	xlog_tic_reset_res(tic);
-
 	return tic;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 7f601c1c9f45..8ee6a5f74396 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -139,16 +139,6 @@ enum xlog_iclog_state {
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
-/*
- * Reservation region
- * As would be stored in xfs_log_iovec but without the i_addr which
- * we don't care about.
- */
-typedef struct xlog_res {
-	uint	r_len;	/* region length		:4 */
-	uint	r_type;	/* region's transaction type	:4 */
-} xlog_res_t;
-
 typedef struct xlog_ticket {
 	struct list_head   t_queue;	 /* reserve/write queue */
 	struct task_struct *t_task;	 /* task that owns this ticket */
@@ -159,13 +149,6 @@ typedef struct xlog_ticket {
 	char		   t_ocnt;	 /* original count		 : 1  */
 	char		   t_cnt;	 /* current count		 : 1  */
 	char		   t_flags;	 /* properties of reservation	 : 1  */
-
-        /* reservation array fields */
-	uint		   t_res_num;                    /* num in array : 4 */
-	uint		   t_res_num_ophdrs;		 /* num op hdrs  : 4 */
-	uint		   t_res_arr_sum;		 /* array sum    : 4 */
-	uint		   t_res_o_flow;		 /* sum overflow : 4 */
-	xlog_res_t	   t_res_arr[XLOG_TIC_LEN_MAX];  /* array of res : 8 * 15 */ 
 } xlog_ticket_t;
 
 /*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 27/45] xfs: pass lv chain length into xlog_write()
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (25 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:36   ` Darrick J. Wong
  2021-03-16 18:38   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
                   ` (17 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The caller of xlog_write() usually has a close accounting of the
aggregated vector length contained in the log vector chain passed to
xlog_write(). There is no need to iterate the chain to calculate the
length of the data in xlog_write_calc_vec_length() if the caller is
already iterating that chain to build it.

Passing in the vector length avoids doing an extra chain iteration,
which can be a significant amount of work given that large CIL
commits can have hundreds of thousands of vectors attached to the
chain.
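
The length now falls out of the CIL push loop that is already walking
the chain to detach log vectors from their items; condensed from the
xlog_cil_push_work() hunk below:

	while (!list_empty(&cil->xc_cil)) {
		...
		num_iovecs += lv->lv_niovecs;

		/* we don't write ordered log vectors */
		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
			num_bytes += lv->lv_bytes;
	}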

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
 fs/xfs/xfs_log_cil.c  | 18 +++++++++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 3 files changed, 20 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 412b167d8d0e..22f97914ab99 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -858,7 +858,8 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
+				reg.i_len);
 }
 
 /*
@@ -1577,7 +1578,8 @@ xlog_commit_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
+				reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2097,32 +2099,6 @@ xlog_print_trans(
 	}
 }
 
-/*
- * Calculate the potential space needed by the log vector. All regions contain
- * their own opheaders and they are accounted for in region space so we don't
- * need to add them to the vector length here.
- */
-static int
-xlog_write_calc_vec_length(
-	struct xlog_ticket	*ticket,
-	struct xfs_log_vec	*log_vector,
-	uint			optype)
-{
-	struct xfs_log_vec	*lv;
-	int			len = 0;
-	int			i;
-
-	for (lv = log_vector; lv; lv = lv->lv_next) {
-		/* we don't write ordered log vectors */
-		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
-			continue;
-
-		for (i = 0; i < lv->lv_niovecs; i++)
-			len += lv->lv_iovecp[i].i_len;
-	}
-	return len;
-}
-
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
@@ -2285,13 +2261,13 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			optype)
+	uint			optype,
+	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
 	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
 	int			index = 0;
-	int			len;
 	int			partial_copy = 0;
 	int			partial_copy_len = 0;
 	int			contwr = 0;
@@ -2306,7 +2282,6 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
 	if (start_lsn)
 		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7a5e6bdb7876..34abc3bae587 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
 				sizeof(struct xfs_trans_header);
 	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
 
-	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
-
 	lvhdr->lv_niovecs = 2;
 	lvhdr->lv_iovecp = &hdr->lhdr[0];
+	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
 	lvhdr->lv_next = ctx->lv_chain;
+
+	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
 /*
@@ -742,7 +743,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
-	int			num_iovecs;
+	int			num_iovecs = 0;
+	int			num_bytes = 0;
 	int			error = 0;
 	struct xlog_cil_trans_hdr thdr;
 	struct xfs_log_vec	lvhdr = { NULL };
@@ -841,7 +843,6 @@ xlog_cil_push_work(
 	 * by the flush lock.
 	 */
 	lv = NULL;
-	num_iovecs = 0;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
 
@@ -855,6 +856,10 @@ xlog_cil_push_work(
 		lv = item->li_lv;
 		item->li_lv = NULL;
 		num_iovecs += lv->lv_niovecs;
+
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
+			num_bytes += lv->lv_bytes;
 	}
 
 	/*
@@ -893,6 +898,9 @@ xlog_cil_push_work(
 	 * transaction header here as it is not accounted for in xlog_write().
 	 */
 	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
+	num_iovecs += lvhdr.lv_niovecs;
+	num_bytes += lvhdr.lv_bytes;
+
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -907,7 +915,7 @@ xlog_cil_push_work(
 	 * write head.
 	 */
 	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS);
+				XLOG_START_TRANS, num_bytes);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 8ee6a5f74396..003c11653955 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -462,7 +462,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint optype);
+		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 28/45] xfs: introduce xlog_write_single()
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (26 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:39   ` Darrick J. Wong
  2021-03-16 18:39   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 29/45] xfs: introduce xlog_write_partial() Dave Chinner
                   ` (16 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Introduce an optimised version of xlog_write() that is used when the
entire write will fit in a single iclog. This greatly simplifies the
implementation of writing a log vector chain into an iclog, and sets
the groundwork for a much more understandable xlog_write()
implementation.
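
The fast path gate in xlog_write() is simply "no continuation write
pending and still at the head of the chain"; condensed from the hunk
below:

	/* If this is a single iclog write, go fast... */
	if (!contwr && lv == log_vector) {
		record_cnt = xlog_write_single(lv, ticket, iclog,
					log_offset, len);
		len = 0;
		data_cnt = 0;
		break;
	}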

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 22f97914ab99..590c1e6db475 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2214,6 +2214,52 @@ xlog_write_copy_finish(
 	return error;
 }
 
+/*
+ * Write log vectors into a single iclog which is guaranteed by the caller
+ * to have enough space to write the entire log vector into. Return the number
+ * of log vectors written into the iclog.
+ */
+static int
+xlog_write_single(
+	struct xfs_log_vec	*log_vector,
+	struct xlog_ticket	*ticket,
+	struct xlog_in_core	*iclog,
+	uint32_t		log_offset,
+	uint32_t		len)
+{
+	struct xfs_log_vec	*lv = log_vector;
+	void			*ptr;
+	int			index = 0;
+	int			record_cnt = 0;
+
+	ASSERT(log_offset + len <= iclog->ic_size);
+
+	ptr = iclog->ic_datap + log_offset;
+	for (lv = log_vector; lv; lv = lv->lv_next) {
+		/*
+		 * Ordered log vectors have no regions to write so this
+		 * loop will naturally skip them.
+		 */
+		for (index = 0; index < lv->lv_niovecs; index++) {
+			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
+			struct xlog_op_header	*ophdr = reg->i_addr;
+
+			ASSERT(reg->i_len % sizeof(int32_t) == 0);
+			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+			ophdr->oh_len = cpu_to_be32(reg->i_len -
+						sizeof(struct xlog_op_header));
+			memcpy(ptr, reg->i_addr, reg->i_len);
+			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
+			record_cnt++;
+		}
+	}
+	ASSERT(len == 0);
+	return record_cnt;
+}
+
+
 /*
  * Write some region out to in-core log
  *
@@ -2294,7 +2340,6 @@ xlog_write(
 			return error;
 
 		ASSERT(log_offset <= iclog->ic_size - 1);
-		ptr = iclog->ic_datap + log_offset;
 
 		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
@@ -2311,10 +2356,20 @@ xlog_write(
 						XLOG_ICL_NEED_FUA);
 		}
 
+		/* If this is a single iclog write, go fast... */
+		if (!contwr && lv == log_vector) {
+			record_cnt = xlog_write_single(lv, ticket, iclog,
+						log_offset, len);
+			len = 0;
+			data_cnt = 0;
+			break;
+		}
+
 		/*
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
 		 */
+		ptr = iclog->ic_datap + log_offset;
 		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 			struct xfs_log_iovec	*reg;
 			struct xlog_op_header	*ophdr;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 29/45] xfs: introduce xlog_write_partial()
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (27 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  2:59   ` Darrick J. Wong
  2021-03-18 13:22   ` Brian Foster
  2021-03-05  5:11 ` [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Dave Chinner
                   ` (15 subsequent siblings)
  44 siblings, 2 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Handle writing of a logvec chain into an iclog that doesn't have
enough space to fit it all. The iclog has already been changed to
WANT_SYNC by xlog_state_get_iclog_space(), so the entire remaining
space in the iclog is exclusively owned by this logvec chain.

The difference between the single and partial cases is that
we end up with partial iovec writes in the iclog and have to split
log vec regions across two iclogs. The state handling for this is
currently awful and so we're building up the pieces needed to
handle this more cleanly one at a time.
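
As a worked example (a sketch; the flag names are the existing
XLOG_*_TRANS values), a single region that spans three iclogs ends up
with the following opheaders:

	/*
	 * iclog 0: embedded ophdr, oh_flags = XLOG_CONTINUE_TRANS
	 * iclog 1: new ophdr,      oh_flags = XLOG_WAS_CONT_TRANS |
	 *                                     XLOG_CONTINUE_TRANS
	 * iclog 2: new ophdr,      oh_flags = XLOG_WAS_CONT_TRANS |
	 *                                     XLOG_END_TRANS
	 *
	 * Only the first ophdr is embedded in the region by
	 * xlog_prepare_iovec(); the other two are built on the fly by
	 * xlog_write_partial() and consume extra ticket reservation.
	 */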

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
 1 file changed, 251 insertions(+), 274 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 590c1e6db475..10916b99bf0f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2099,166 +2099,250 @@ xlog_print_trans(
 	}
 }
 
-static xlog_op_header_t *
-xlog_write_setup_ophdr(
-	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket)
-{
-	ophdr->oh_clientid = XFS_TRANSACTION;
-	ophdr->oh_res2 = 0;
-	ophdr->oh_flags = 0;
-	return ophdr;
-}
-
 /*
- * Set up the parameters of the region copy into the log. This has
- * to handle region write split across multiple log buffers - this
- * state is kept external to this function so that this code can
- * be written in an obvious, self documenting manner.
+ * Write whole log vectors into a single iclog which is guaranteed to have
+ * either sufficient space for the entire log vector chain to be written or
+ * exclusive access to the remaining space in the iclog.
+ *
+ * Return the number of iovecs and data written into the iclog, as well as
+ * a pointer to the logvec that doesn't fit in the iclog (or NULL if we hit
+ * the end of the chain).
  */
-static int
-xlog_write_setup_copy(
+static struct xfs_log_vec *
+xlog_write_single(
+	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_op_header	*ophdr,
-	int			space_available,
-	int			space_required,
-	int			*copy_off,
-	int			*copy_len,
-	int			*last_was_partial_copy,
-	int			*bytes_consumed)
-{
-	int			still_to_copy;
-
-	still_to_copy = space_required - *bytes_consumed;
-	*copy_off = *bytes_consumed;
-
-	if (still_to_copy <= space_available) {
-		/* write of region completes here */
-		*copy_len = still_to_copy;
-		ophdr->oh_len = cpu_to_be32(*copy_len);
-		if (*last_was_partial_copy)
-			ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS);
-		*last_was_partial_copy = 0;
-		*bytes_consumed = 0;
-		return 0;
-	}
-
-	/* partial write of region, needs extra log op header reservation */
-	*copy_len = space_available;
-	ophdr->oh_len = cpu_to_be32(*copy_len);
-	ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
-	if (*last_was_partial_copy)
-		ophdr->oh_flags |= XLOG_WAS_CONT_TRANS;
-	*bytes_consumed += *copy_len;
-	(*last_was_partial_copy)++;
-
-	/* account for new log op header */
-	ticket->t_curr_res -= sizeof(struct xlog_op_header);
-
-	return sizeof(struct xlog_op_header);
-}
-
-static int
-xlog_write_copy_finish(
-	struct xlog		*log,
 	struct xlog_in_core	*iclog,
-	uint			flags,
-	int			*record_cnt,
-	int			*data_cnt,
-	int			*partial_copy,
-	int			*partial_copy_len,
-	int			log_offset,
-	struct xlog_in_core	**commit_iclog)
+	uint32_t		*log_offset,
+	uint32_t		*len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt)
 {
-	int			error;
+	struct xfs_log_vec	*lv = log_vector;
+	void			*ptr;
+	int			index;
 
-	if (*partial_copy) {
+	ASSERT(*log_offset + *len <= iclog->ic_size ||
+		iclog->ic_state == XLOG_STATE_WANT_SYNC);
+
+	ptr = iclog->ic_datap + *log_offset;
+	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/*
-		 * This iclog has already been marked WANT_SYNC by
-		 * xlog_state_get_iclog_space.
+		 * If the entire log vec does not fit in the iclog, punt it to
+		 * the partial copy loop which can handle this case.
 		 */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
-		goto release_iclog;
-	}
+		if (lv->lv_niovecs &&
+		    lv->lv_bytes > iclog->ic_size - *log_offset)
+			break;
 
-	*partial_copy = 0;
-	*partial_copy_len = 0;
+		/*
+		 * Ordered log vectors have no regions to write so this
+		 * loop will naturally skip them.
+		 */
+		for (index = 0; index < lv->lv_niovecs; index++) {
+			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
+			struct xlog_op_header	*ophdr = reg->i_addr;
 
-	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
-		/* no more space in this iclog - push it. */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
+			ASSERT(reg->i_len % sizeof(int32_t) == 0);
+			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
-		if (iclog->ic_state == XLOG_STATE_ACTIVE)
-			xlog_state_switch_iclogs(log, iclog, 0);
-		else
-			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-			       iclog->ic_state == XLOG_STATE_IOERROR);
-		if (!commit_iclog)
-			goto release_iclog;
-		spin_unlock(&log->l_icloglock);
-		ASSERT(flags & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+			ophdr->oh_len = cpu_to_be32(reg->i_len -
+						sizeof(struct xlog_op_header));
+			memcpy(ptr, reg->i_addr, reg->i_len);
+			xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len);
+			(*record_cnt)++;
+			*data_cnt += reg->i_len;
+		}
 	}
+	ASSERT(*len == 0 || lv);
+	return lv;
+}
 
-	return 0;
+static int
+xlog_write_get_more_iclog_space(
+	struct xlog		*log,
+	struct xlog_ticket	*ticket,
+	struct xlog_in_core	**iclogp,
+	uint32_t		*log_offset,
+	uint32_t		len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt,
+	int			*contwr)
+{
+	struct xlog_in_core	*iclog = *iclogp;
+	int			error;
 
-release_iclog:
+	spin_lock(&log->l_icloglock);
+	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
+	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+	       iclog->ic_state == XLOG_STATE_IOERROR);
 	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
-	return error;
+	if (error)
+		return error;
+
+	error = xlog_state_get_iclog_space(log, len, &iclog,
+				ticket, contwr, log_offset);
+	if (error)
+		return error;
+	*record_cnt = 0;
+	*data_cnt = 0;
+	*iclogp = iclog;
+	return 0;
 }
 
 /*
- * Write log vectors into a single iclog which is guaranteed by the caller
- * to have enough space to write the entire log vector into. Return the number
- * of log vectors written into the iclog.
+ * Write log vectors into a single iclog which is smaller than the current chain
+ * length. We write until we cannot fit a full record into the remaining space
+ * and then stop. We return the log vector that is to be written that cannot
+ * wholly fit in the iclog.
  */
-static int
-xlog_write_single(
+static struct xfs_log_vec *
+xlog_write_partial(
+	struct xlog		*log,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_in_core	*iclog,
-	uint32_t		log_offset,
-	uint32_t		len)
+	struct xlog_in_core	**iclogp,
+	uint32_t		*log_offset,
+	uint32_t		*len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt,
+	int			*contwr)
 {
+	struct xlog_in_core	*iclog = *iclogp;
 	struct xfs_log_vec	*lv = log_vector;
+	struct xfs_log_iovec	*reg;
+	struct xlog_op_header	*ophdr;
 	void			*ptr;
 	int			index = 0;
-	int			record_cnt = 0;
+	uint32_t		rlen;
+	int			error;
 
-	ASSERT(log_offset + len <= iclog->ic_size);
+	/* walk the logvec, copying until we run out of space in the iclog */
+	ptr = iclog->ic_datap + *log_offset;
+	for (index = 0; index < lv->lv_niovecs; index++) {
+		uint32_t	reg_offset = 0;
+
+		reg = &lv->lv_iovecp[index];
+		ASSERT(reg->i_len % sizeof(int32_t) == 0);
 
-	ptr = iclog->ic_datap + log_offset;
-	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/*
-		 * Ordered log vectors have no regions to write so this
-		 * loop will naturally skip them.
+		 * The first region of a continuation must have a non-zero
+		 * length otherwise log recovery will just skip over it and
+		 * start recovering from the next opheader it finds. Because we
+		 * mark the next opheader as a continuation, recovery will then
+		 * incorrectly add the continuation to the previous region and
+		 * that breaks stuff.
+		 *
+		 * Hence if there isn't space for region data after the
+		 * opheader, then we need to start afresh with a new iclog.
 		 */
-		for (index = 0; index < lv->lv_niovecs; index++) {
-			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
-			struct xlog_op_header	*ophdr = reg->i_addr;
+		if (iclog->ic_size - *log_offset <=
+					sizeof(struct xlog_op_header)) {
+			error = xlog_write_get_more_iclog_space(log, ticket,
+					&iclog, log_offset, *len, record_cnt,
+					data_cnt, contwr);
+			if (error)
+				return ERR_PTR(error);
+			ptr = iclog->ic_datap + *log_offset;
+		}
 
-			ASSERT(reg->i_len % sizeof(int32_t) == 0);
-			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+		ophdr = reg->i_addr;
+		rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset);
+
+		ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+		ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header));
+		if (rlen != reg->i_len)
+			ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
 
+		ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+		xlog_verify_dest_ptr(log, ptr);
+		memcpy(ptr, reg->i_addr, rlen);
+		xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
+		(*record_cnt)++;
+		*data_cnt += rlen;
+
+		if (rlen == reg->i_len)
+			continue;
+
+		/*
+		 * We now have a partially written iovec, but it can span
+		 * multiple iclogs so we loop here. First we release the iclog
+		 * we currently have, then we get a new iclog and add a new
+		 * opheader. Then we continue copying from where we were until
+		 * we either complete the iovec or fill the iclog. If we
+		 * complete the iovec, then we increment the index and go right
+		 * back to the top of the outer loop. if we fill the iclog, we
+		 * run the inner loop again.
+		 *
+		 * This is complicated by the tail of a region using all the
+		 * space in an iclog and hence requiring us to release the iclog
+		 * and get a new one before returning to the outer loop. We must
+		 * always guarantee that we exit this inner loop with at least
+		 * space for log transaction opheaders left in the current
+		 * iclog, hence we cannot just terminate the loop at the end
+		 * of the continuation. So we loop while there is no
+		 * space left in the current iclog, and check for the end of the
+		 * continuation after getting a new iclog.
+		 */
+		do {
+			/*
+			 * Account for the continuation opheader before we get
+			 * a new iclog. This is necessary so that we reserve
+			 * space in the iclog for it.
+			 */
+			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {
+				*len += sizeof(struct xlog_op_header);
+				ticket->t_curr_res -= sizeof(struct xlog_op_header);
+			}
+			error = xlog_write_get_more_iclog_space(log, ticket,
+					&iclog, log_offset, *len, record_cnt,
+					data_cnt, contwr);
+			if (error)
+				return ERR_PTR(error);
+			ptr = iclog->ic_datap + *log_offset;
+
+			ophdr = ptr;
 			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-			ophdr->oh_len = cpu_to_be32(reg->i_len -
+			ophdr->oh_clientid = XFS_TRANSACTION;
+			ophdr->oh_res2 = 0;
+			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
+
+			xlog_write_adv_cnt(&ptr, len, log_offset,
 						sizeof(struct xlog_op_header));
-			memcpy(ptr, reg->i_addr, reg->i_len);
-			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
-			record_cnt++;
-		}
+			*data_cnt += sizeof(struct xlog_op_header);
+
+			/*
+			 * If rlen fits in the iclog, then end the region
+			 * continuation. Otherwise we're going around again.
+			 */
+			reg_offset += rlen;
+			rlen = reg->i_len - reg_offset;
+			if (rlen <= iclog->ic_size - *log_offset)
+				ophdr->oh_flags |= XLOG_END_TRANS;
+			else
+				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
+
+			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
+			ophdr->oh_len = cpu_to_be32(rlen);
+
+			xlog_verify_dest_ptr(log, ptr);
+			memcpy(ptr, reg->i_addr + reg_offset, rlen);
+			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
+			(*record_cnt)++;
+			*data_cnt += rlen;
+
+		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
 	}
-	ASSERT(len == 0);
-	return record_cnt;
-}
 
+	/*
+	 * No more iovecs remain in this logvec so return the next log vec to
+	 * the caller so it can go back to fast path copying.
+	 */
+	*iclogp = iclog;
+	return lv->lv_next;
+}
 
 /*
  * Write some region out to in-core log
@@ -2312,14 +2396,11 @@ xlog_write(
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
-	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
-	int			index = 0;
-	int			partial_copy = 0;
-	int			partial_copy_len = 0;
 	int			contwr = 0;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
+	int			log_offset;
 
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
@@ -2328,157 +2409,52 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	if (start_lsn)
-		*start_lsn = 0;
-	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
-		void		*ptr;
-		int		log_offset;
-
-		error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
-						   &contwr, &log_offset);
-		if (error)
-			return error;
-
-		ASSERT(log_offset <= iclog->ic_size - 1);
+	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
+					   &contwr, &log_offset);
+	if (error)
+		return error;
 
-		/* Start_lsn is the first lsn written to. */
-		if (start_lsn && !*start_lsn)
-			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+	/* start_lsn is the LSN of the first iclog written to. */
+	if (start_lsn)
+		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
-		/*
-		 * iclogs containing commit records or unmount records need
-		 * to issue ordering cache flushes and commit immediately
-		 * to stable storage to guarantee journal vs metadata ordering
-		 * is correctly maintained in the storage media.
-		 */
-		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
-			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
-						XLOG_ICL_NEED_FUA);
-		}
+	/*
+	 * iclogs containing commit records or unmount records need
+	 * to issue ordering cache flushes and commit immediately
+	 * to stable storage to guarantee journal vs metadata ordering
+	 * is correctly maintained in the storage media. This will always
+	 * fit in the iclog we have already been passed.
+	 */
+	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
+		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
+		ASSERT(!contwr);
+	}
 
-		/* If this is a single iclog write, go fast... */
-		if (!contwr && lv == log_vector) {
-			record_cnt = xlog_write_single(lv, ticket, iclog,
-						log_offset, len);
-			len = 0;
-			data_cnt = 0;
+	while (lv) {
+		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
+					&len, &record_cnt, &data_cnt);
+		if (!lv)
 			break;
-		}
-
-		/*
-		 * This loop writes out as many regions as can fit in the amount
-		 * of space which was allocated by xlog_state_get_iclog_space().
-		 */
-		ptr = iclog->ic_datap + log_offset;
-		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
-			struct xfs_log_iovec	*reg;
-			struct xlog_op_header	*ophdr;
-			int			copy_len;
-			int			copy_off;
-			bool			ordered = false;
-			bool			added_ophdr = false;
-
-			/* ordered log vectors have no regions to write */
-			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
-				ASSERT(lv->lv_niovecs == 0);
-				ordered = true;
-				goto next_lv;
-			}
-
-			reg = &vecp[index];
-			ASSERT(reg->i_len % sizeof(int32_t) == 0);
-			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
-
-			/*
-			 * Regions always have their ophdr at the start of the
-			 * region, except for:
-			 * - a transaction start which has a start record ophdr
-			 *   before the first region ophdr; and
-			 * - the previous region didn't fully fit into an iclog
-			 *   so needs a continuation ophdr to prepend the region
-			 *   in this new iclog.
-			 */
-			ophdr = reg->i_addr;
-			if (optype && index) {
-				optype &= ~XLOG_START_TRANS;
-			} else if (partial_copy) {
-				ophdr = xlog_write_setup_ophdr(ptr, ticket);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-					   sizeof(struct xlog_op_header));
-				added_ophdr = true;
-			}
-			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-
-			len += xlog_write_setup_copy(ticket, ophdr,
-						     iclog->ic_size-log_offset,
-						     reg->i_len,
-						     &copy_off, &copy_len,
-						     &partial_copy,
-						     &partial_copy_len);
-			xlog_verify_dest_ptr(log, ptr);
-
 
-			/*
-			 * Wart: need to update length in embedded ophdr not
-			 * to include it's own length.
-			 */
-			if (!added_ophdr) {
-				ophdr->oh_len = cpu_to_be32(copy_len -
-						sizeof(struct xlog_op_header));
-			}
-
-			ASSERT(copy_len > 0);
-			memcpy(ptr, reg->i_addr + copy_off, copy_len);
-			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
-
-			if (added_ophdr)
-				copy_len += sizeof(struct xlog_op_header);
-			record_cnt++;
-			data_cnt += contwr ? copy_len : 0;
-
-			error = xlog_write_copy_finish(log, iclog, optype,
-						       &record_cnt, &data_cnt,
-						       &partial_copy,
-						       &partial_copy_len,
-						       log_offset,
-						       commit_iclog);
-			if (error)
-				return error;
-
-			/*
-			 * if we had a partial copy, we need to get more iclog
-			 * space but we don't want to increment the region
-			 * index because there is still more is this region to
-			 * write.
-			 *
-			 * If we completed writing this region, and we flushed
-			 * the iclog (indicated by resetting of the record
-			 * count), then we also need to get more log space. If
-			 * this was the last record, though, we are done and
-			 * can just return.
-			 */
-			if (partial_copy)
-				break;
-
-			if (++index == lv->lv_niovecs) {
-next_lv:
-				lv = lv->lv_next;
-				index = 0;
-				if (lv)
-					vecp = lv->lv_iovecp;
-			}
-			if (record_cnt == 0 && !ordered) {
-				if (!lv)
-					return 0;
-				break;
-			}
+		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
+		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
+					&len, &record_cnt, &data_cnt, &contwr);
+		if (IS_ERR_OR_NULL(lv)) {
+			error = PTR_ERR_OR_ZERO(lv);
+			break;
 		}
 	}
+	ASSERT((len == 0 && !lv) || error);
 
-	ASSERT(len == 0);
-
+	/*
+	 * We've already been guaranteed that the last writes will fit inside
+	 * the current iclog, and hence it will already have the space used by
+	 * those writes accounted to it. Hence we do not need to update the
+	 * iclog with the number of bytes written here.
+	 */
+	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
 	spin_lock(&log->l_icloglock);
-	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
+	xlog_state_finish_copy(log, iclog, record_cnt, 0);
 	if (commit_iclog) {
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
@@ -2930,7 +2906,7 @@ xlog_state_get_iclog_space(
 	 * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's
 	 * can fit into remaining data section.
 	 */
-	if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) {
+	if (iclog->ic_size - iclog->ic_offset < 3*sizeof(xlog_op_header_t)) {
 		int		error = 0;
 
 		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
@@ -3633,11 +3609,12 @@ xlog_verify_iclog(
 					iclog->ic_header.h_cycle_data[idx]);
 			}
 		}
-		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG)
+		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) {
 			xfs_warn(log->l_mp,
-				"%s: invalid clientid %d op "PTR_FMT" offset 0x%lx",
-				__func__, clientid, ophead,
+				"%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx",
+				__func__, i, clientid, ophead,
 				(unsigned long)field_offset);
+		}
 
 		/* check length */
 		p = &ophead->oh_len;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 30/45] xfs: xlog_write() no longer needs contwr state
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (28 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 29/45] xfs: introduce xlog_write_partial() Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  3:01   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Dave Chinner
                   ` (14 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The rework of xlog_write() no longer requires
xlog_state_get_iclog_space() to tell it about the internal iclog
space reservation state to direct what it should do. Remove the
continued write (contwr) parameter.

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  26520	    560	      8	  27088	   69d0	fs/xfs/xfs_log.o.orig
  26384	    560	      8	  26952	   6948	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 10916b99bf0f..8f4f7ae84358 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -47,7 +47,6 @@ xlog_state_get_iclog_space(
 	int			len,
 	struct xlog_in_core	**iclog,
 	struct xlog_ticket	*ticket,
-	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
 xlog_grant_push_ail(
@@ -2167,8 +2166,7 @@ xlog_write_get_more_iclog_space(
 	uint32_t		*log_offset,
 	uint32_t		len,
 	uint32_t		*record_cnt,
-	uint32_t		*data_cnt,
-	int			*contwr)
+	uint32_t		*data_cnt)
 {
 	struct xlog_in_core	*iclog = *iclogp;
 	int			error;
@@ -2182,8 +2180,8 @@ xlog_write_get_more_iclog_space(
 	if (error)
 		return error;
 
-	error = xlog_state_get_iclog_space(log, len, &iclog,
-				ticket, contwr, log_offset);
+	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
+					log_offset);
 	if (error)
 		return error;
 	*record_cnt = 0;
@@ -2207,8 +2205,7 @@ xlog_write_partial(
 	uint32_t		*log_offset,
 	uint32_t		*len,
 	uint32_t		*record_cnt,
-	uint32_t		*data_cnt,
-	int			*contwr)
+	uint32_t		*data_cnt)
 {
 	struct xlog_in_core	*iclog = *iclogp;
 	struct xfs_log_vec	*lv = log_vector;
@@ -2242,7 +2239,7 @@ xlog_write_partial(
 					sizeof(struct xlog_op_header)) {
 			error = xlog_write_get_more_iclog_space(log, ticket,
 					&iclog, log_offset, *len, record_cnt,
-					data_cnt, contwr);
+					data_cnt);
 			if (error)
 				return ERR_PTR(error);
 			ptr = iclog->ic_datap + *log_offset;
@@ -2298,7 +2295,7 @@ xlog_write_partial(
 			}
 			error = xlog_write_get_more_iclog_space(log, ticket,
 					&iclog, log_offset, *len, record_cnt,
-					data_cnt, contwr);
+					data_cnt);
 			if (error)
 				return ERR_PTR(error);
 			ptr = iclog->ic_datap + *log_offset;
@@ -2396,7 +2393,6 @@ xlog_write(
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
-	int			contwr = 0;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
@@ -2410,7 +2406,7 @@ xlog_write(
 	}
 
 	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
-					   &contwr, &log_offset);
+					   &log_offset);
 	if (error)
 		return error;
 
@@ -2425,10 +2421,8 @@ xlog_write(
 	 * is correctly maintained in the storage media. This will always
 	 * fit in the iclog we have already been passed.
 	 */
-	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
+	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
 		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-		ASSERT(!contwr);
-	}
 
 	while (lv) {
 		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
@@ -2438,7 +2432,7 @@ xlog_write(
 
 		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
 		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
-					&len, &record_cnt, &data_cnt, &contwr);
+					&len, &record_cnt, &data_cnt);
 		if (IS_ERR_OR_NULL(lv)) {
 			error = PTR_ERR_OR_ZERO(lv);
 			break;
@@ -2452,7 +2446,6 @@ xlog_write(
 	 * those writes accounted to it. Hence we do not need to update the
 	 * iclog with the number of bytes written here.
 	 */
-	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, 0);
 	if (commit_iclog) {
@@ -2856,7 +2849,6 @@ xlog_state_get_iclog_space(
 	int			len,
 	struct xlog_in_core	**iclogp,
 	struct xlog_ticket	*ticket,
-	int			*continued_write,
 	int			*logoffsetp)
 {
 	int		  log_offset;
@@ -2932,13 +2924,10 @@ xlog_state_get_iclog_space(
 	 * iclogs (to mark it taken), this particular iclog will release/sync
 	 * to disk in xlog_write().
 	 */
-	if (len <= iclog->ic_size - iclog->ic_offset) {
-		*continued_write = 0;
+	if (len <= iclog->ic_size - iclog->ic_offset)
 		iclog->ic_offset += len;
-	} else {
-		*continued_write = 1;
+	else
 		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
-	}
 	*iclogp = iclog;
 
 	ASSERT(iclog->ic_offset <= iclog->ic_size);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 31/45] xfs: CIL context doesn't need to count iovecs
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (29 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-09  3:16   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Dave Chinner
                   ` (13 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we account for log opheaders in the log item formatting
code, we don't actually use the aggregated count of log iovecs in
the CIL for anything. Remove it and the tracking code that
calculates it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 34abc3bae587..4047f95a0fc4 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -252,22 +252,18 @@ xlog_cil_alloc_shadow_bufs(
 
 /*
  * Prepare the log item for insertion into the CIL. Calculate the difference in
- * log space and vectors it will consume, and if it is a new item pin it as
- * well.
+ * log space it will consume, and if it is a new item pin it as well.
  */
 STATIC void
 xfs_cil_prepare_item(
 	struct xlog		*log,
 	struct xfs_log_vec	*lv,
 	struct xfs_log_vec	*old_lv,
-	int			*diff_len,
-	int			*diff_iovecs)
+	int			*diff_len)
 {
 	/* Account for the new LV being passed in */
-	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
+	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
 		*diff_len += lv->lv_bytes;
-		*diff_iovecs += lv->lv_niovecs;
-	}
 
 	/*
 	 * If there is no old LV, this is the first time we've seen the item in
@@ -284,7 +280,6 @@ xfs_cil_prepare_item(
 		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
 
 		*diff_len -= old_lv->lv_bytes;
-		*diff_iovecs -= old_lv->lv_niovecs;
 		lv->lv_item->li_lv_shadow = old_lv;
 	}
 
@@ -333,12 +328,10 @@ static void
 xlog_cil_insert_format_items(
 	struct xlog		*log,
 	struct xfs_trans	*tp,
-	int			*diff_len,
-	int			*diff_iovecs)
+	int			*diff_len)
 {
 	struct xfs_log_item	*lip;
 
-
 	/* Bail out if we didn't find a log item.  */
 	if (list_empty(&tp->t_items)) {
 		ASSERT(0);
@@ -381,7 +374,6 @@ xlog_cil_insert_format_items(
 			 * set the item up as though it is a new insertion so
 			 * that the space reservation accounting is correct.
 			 */
-			*diff_iovecs -= lv->lv_niovecs;
 			*diff_len -= lv->lv_bytes;
 
 			/* Ensure the lv is set up according to ->iop_size */
@@ -406,7 +398,7 @@ xlog_cil_insert_format_items(
 		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
 		lip->li_ops->iop_format(lip, lv);
 insert:
-		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
+		xfs_cil_prepare_item(log, lv, old_lv, diff_len);
 	}
 }
 
@@ -426,7 +418,6 @@ xlog_cil_insert_items(
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
 	struct xfs_log_item	*lip;
 	int			len = 0;
-	int			diff_iovecs = 0;
 	int			iclog_space;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
@@ -436,7 +427,7 @@ xlog_cil_insert_items(
 	 * We can do this safely because the context can't checkpoint until we
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
-	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
+	xlog_cil_insert_format_items(log, tp, &len);
 
 	spin_lock(&cil->xc_cil_lock);
 
@@ -471,7 +462,6 @@ xlog_cil_insert_items(
 	}
 	tp->t_ticket->t_curr_res -= len;
 	ctx->space_used += len;
-	ctx->nvecs += diff_iovecs;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (30 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-10 23:01   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
                   ` (12 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In the next patches we are going to make the CIL list itself
per-cpu, and so we cannot use list_empty() to check if the list is
empty. Replace the list_empty() checks with a flag in the CIL to
indicate we have committed at least one transaction to the CIL and
hence the CIL is not empty.

We need this flag to be an atomic so that we can clear it without
holding any locks in the commit fast path, but we also need to be
careful to avoid atomic operations in the fast path. Hence we use
the fact that test_bit() is not an atomic op to first check if the
flag is set and then run the atomic test_and_clear_bit() operation
to clear it and steal the initial unit reservation for the CIL
context checkpoint.

When we are switching to a new context in a push, we place the
setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
allows all the other places that need to check whether the CIL is
empty to use test_bit() and still be serialised correctly with the
CIL context swaps that set the bit.
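
A minimal sketch of that check-then-clear pattern, lifted from the
hunk below (the surrounding xc_ctx_lock serialisation is elided):

	/*
	 * test_bit() is a plain read, so the common case where the
	 * bit is already clear costs no atomic operation. Only the
	 * first commit after a context switch pays for the locked
	 * test_and_clear_bit() and steals the unit reservation.
	 */
	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
		ctx_res = ctx->ticket->t_unit_res;
		ctx->ticket->t_curr_res = ctx_res;
		tp->t_ticket->t_curr_res -= ctx_res;
	}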

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 49 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h |  4 ++++
 2 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 4047f95a0fc4..e6e36488f0c7 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -70,6 +70,7 @@ xlog_cil_ctx_switch(
 	struct xfs_cil		*cil,
 	struct xfs_cil_ctx	*ctx)
 {
+	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
 	ctx->sequence = ++cil->xc_current_sequence;
 	ctx->cil = cil;
 	cil->xc_ctx = ctx;
@@ -436,13 +437,12 @@ xlog_cil_insert_items(
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 
 	/*
-	 * Now transfer enough transaction reservation to the context ticket
-	 * for the checkpoint. The context ticket is special - the unit
-	 * reservation has to grow as well as the current reservation as we
-	 * steal from tickets so we can correctly determine the space used
-	 * during the transaction commit.
+	 * We need to take the CIL checkpoint unit reservation on the first
+	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
+	 * unnecessarily do an atomic op in the fast path here.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
+	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		ctx_res = ctx->ticket->t_unit_res;
 		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
@@ -771,7 +771,7 @@ xlog_cil_push_work(
 	 * move on to a new sequence number and so we have to be able to push
 	 * this sequence again later.
 	 */
-	if (list_empty(&cil->xc_cil)) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		cil->xc_push_seq = 0;
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
@@ -1019,9 +1019,10 @@ xlog_cil_push_background(
 
 	/*
 	 * The cil won't be empty because we are called while holding the
-	 * context lock so whatever we added to the CIL will still be there
+	 * context lock so whatever we added to the CIL will still be there.
 	 */
 	ASSERT(!list_empty(&cil->xc_cil));
+	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
 	 * Don't do a background push if we haven't used up all the
@@ -1108,7 +1109,8 @@ xlog_cil_push_now(
 	 * there's no work we need to do.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
+	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -1128,7 +1130,7 @@ xlog_cil_empty(
 	bool		empty = false;
 
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil))
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		empty = true;
 	spin_unlock(&cil->xc_push_lock);
 	return empty;
@@ -1289,7 +1291,7 @@ xlog_cil_force_seq(
 	 * we would have found the context on the committing list.
 	 */
 	if (sequence == cil->xc_current_sequence &&
-	    !list_empty(&cil->xc_cil)) {
+	    !test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		spin_unlock(&cil->xc_push_lock);
 		goto restart;
 	}
@@ -1320,21 +1322,19 @@ xlog_cil_force_seq(
  */
 bool
 xfs_log_item_in_current_chkpt(
-	struct xfs_log_item *lip)
+	struct xfs_log_item	*lip)
 {
-	struct xfs_cil_ctx *ctx;
+	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
 
-	if (list_empty(&lip->li_cil))
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		return false;
 
-	ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
-
 	/*
 	 * li_seq is written on the first commit of a log item to record the
 	 * first checkpoint it is written to. Hence if it is different to the
 	 * current sequence, we're in a new checkpoint.
 	 */
-	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
+	if (XFS_LSN_CMP(lip->li_seq, cil->xc_ctx->sequence) != 0)
 		return false;
 	return true;
 }
@@ -1373,13 +1373,16 @@ void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil	*cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
+	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 003c11653955..b0dc3bc9de59 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -248,6 +248,7 @@ struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	unsigned long		xc_flags;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
 
@@ -263,6 +264,9 @@ struct xfs_cil {
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
+/* xc_flags bit values */
+#define	XLOG_CIL_EMPTY		1
+
 /*
  * The amount of log space we allow the CIL to aggregate is difficult to size.
  * Whatever we choose, we have to make sure we can get a reservation for the
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (31 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-10 23:25   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 34/45] xfs: rework per-iclog header CIL reservation Dave Chinner
                   ` (11 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The xc_cil_lock is the most highly contended lock in XFS now. To
start the process of getting rid of it, lift the initial reservation
of the CIL log space out from under the xc_cil_lock.
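
The safety argument in brief, annotated against the hunk below: the
XLOG_CIL_EMPTY bit can only be set again by a push that holds
xc_ctx_lock exclusively, and commits hold it shared, so the
test-and-clear no longer needs xc_cil_lock around it:

	/* xc_ctx_lock held shared; pushes that set XLOG_CIL_EMPTY
	 * hold it exclusive, so no xc_cil_lock is needed here. */
	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
		ctx_res = ctx->ticket->t_unit_res;

	spin_lock(&cil->xc_cil_lock);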

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index e6e36488f0c7..50101336a7f4 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -430,23 +430,19 @@ xlog_cil_insert_items(
 	 */
 	xlog_cil_insert_format_items(log, tp, &len);
 
-	spin_lock(&cil->xc_cil_lock);
-
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
-
 	/*
 	 * We need to take the CIL checkpoint unit reservation on the first
 	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
-	 * unnecessarily do an atomic op in the fast path here.
+	 * unnecessarily do an atomic op in the fast path here. We don't need to
+	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
+	 * under the xc_ctx_lock here and that needs to be held exclusively to
+	 * reset the XLOG_CIL_EMPTY bit.
 	 */
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
-	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
+	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
-		tp->t_ticket->t_curr_res -= ctx_res;
-	}
+
+	spin_lock(&cil->xc_cil_lock);
 
 	/* do we need space for more log record headers? */
 	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
@@ -456,11 +452,9 @@ xlog_cil_insert_items(
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
 		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
-		tp->t_ticket->t_curr_res -= split_res;
-		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
-	tp->t_ticket->t_curr_res -= len;
+	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
+	ctx->ticket->t_curr_res += split_res + ctx_res;
 	ctx->space_used += len;
 
 	/*
@@ -498,6 +492,9 @@ xlog_cil_insert_items(
 			list_move_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 34/45] xfs: rework per-iclog header CIL reservation
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (32 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  0:03   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure Dave Chinner
                   ` (10 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For every iclog that a CIL push will use up, we need to ensure we
have space reserved for the iclog header in each iclog. It is
extremely difficult to do this accurately with a per-cpu counter
without expensive summing of the counter in every commit. However,
we know what the maximum CIL size is going to be because of the
hard space limit we have, and hence we know exactly how many iclogs
we are going to need to write out the CIL.

We are constrained by the requirement that small transactions only
have reservation space for a single iclog header built into them.
At commit time we don't know how much of the current transaction
reservation is made up of iclog header reservations as calculated by
xfs_log_calc_unit_res() when the ticket was reserved. As larger
reservations have multiple header spaces reserved, we can steal
more than one iclog header reservation at a time, but we only steal
the exact number needed for the given log vector size delta.

As a result, we don't know exactly when we are going to steal iclog
header reservations, nor do we know exactly how many we are going to
need for a given CIL.

To make things simple, start by calculating the worst case number of
iclog headers a full CIL push will require. Record this into an
atomic variable in the CIL. Then add a byte counter to the log
ticket that records exactly how much iclog header space has been
reserved in this ticket by xfs_log_calc_unit_res(). This tells us
exactly how much space we can steal from the ticket at transaction
commit time.

Now, at transaction commit time, we can check if the CIL has a full
iclog header reservation and, if not, steal the entire reservation
the current ticket holds for iclog headers. This minimises the
number of times we need to do atomic operations in the fast path,
but still guarantees we get all the reservations we need.
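
The worst case calculation reduces to dividing the CIL hard
(blocking) space limit by the usable, non-header space in each
iclog. This is exactly what xlog_cil_set_iclog_hdr_count() in the
hunk below does:

	/* one header per iclog a completely full CIL push can fill */
	atomic_set(&cil->xc_iclog_hdrs,
		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
			(log->l_iclog_size - log->l_iclog_hsize)));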

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_rlimit.c |  2 +-
 fs/xfs/libxfs/xfs_shared.h     |  3 +-
 fs/xfs/xfs_log.c               | 12 +++++---
 fs/xfs/xfs_log_cil.c           | 55 ++++++++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h          | 20 +++++++------
 5 files changed, 64 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
index 7f55eb3f3653..75390134346d 100644
--- a/fs/xfs/libxfs/xfs_log_rlimit.c
+++ b/fs/xfs/libxfs/xfs_log_rlimit.c
@@ -88,7 +88,7 @@ xfs_log_calc_minimum_size(
 
 	xfs_log_get_max_trans_res(mp, &tres);
 
-	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres);
+	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres, NULL);
 	if (tres.tr_logcount > 1)
 		max_logres *= tres.tr_logcount;
 
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 8c61a461bf7b..b4791b817fe3 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -48,7 +48,8 @@ extern const struct xfs_buf_ops xfs_symlink_buf_ops;
 extern const struct xfs_buf_ops xfs_rtbuf_ops;
 
 /* log size calculation functions */
-int	xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes);
+int	xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes,
+				int *niclogs);
 int	xfs_log_calc_minimum_size(struct xfs_mount *);
 
 struct xfs_trans_res;
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 8f4f7ae84358..46a006d41184 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3312,7 +3312,8 @@ xfs_log_ticket_get(
 static int
 xlog_calc_unit_res(
 	struct xlog		*log,
-	int			unit_bytes)
+	int			unit_bytes,
+	int			*niclogs)
 {
 	int			iclog_space;
 	uint			num_headers;
@@ -3392,15 +3393,18 @@ xlog_calc_unit_res(
 	/* roundoff padding for transaction data and one for commit record */
 	unit_bytes += 2 * log->l_iclog_roundoff;
 
+	if (niclogs)
+		*niclogs = num_headers;
 	return unit_bytes;
 }
 
 int
 xfs_log_calc_unit_res(
 	struct xfs_mount	*mp,
-	int			unit_bytes)
+	int			unit_bytes,
+	int			*niclogs)
 {
-	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+	return xlog_calc_unit_res(mp->m_log, unit_bytes, niclogs);
 }
 
 /*
@@ -3418,7 +3422,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xlog_calc_unit_res(log, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 50101336a7f4..f8fb2f59e24c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -44,9 +44,20 @@ xlog_cil_ticket_alloc(
 	 * transaction overhead reservation from the first transaction commit.
 	 */
 	tic->t_curr_res = 0;
+	tic->t_iclog_hdrs = 0;
 	return tic;
 }
 
+static inline void
+xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
+{
+	struct xlog	*log = cil->xc_log;
+
+	atomic_set(&cil->xc_iclog_hdrs,
+		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
+			(log->l_iclog_size - log->l_iclog_hsize)));
+}
+
 /*
  * Unavoidable forward declaration - xlog_cil_push_work() calls
  * xlog_cil_ctx_alloc() itself.
@@ -70,6 +81,7 @@ xlog_cil_ctx_switch(
 	struct xfs_cil		*cil,
 	struct xfs_cil_ctx	*ctx)
 {
+	xlog_cil_set_iclog_hdr_count(cil);
 	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
 	ctx->sequence = ++cil->xc_current_sequence;
 	ctx->cil = cil;
@@ -92,6 +104,7 @@ xlog_cil_init_post_recovery(
 {
 	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
 	log->l_cilp->xc_ctx->sequence = 1;
+	xlog_cil_set_iclog_hdr_count(log->l_cilp);
 }
 
 static inline int
@@ -419,7 +432,6 @@ xlog_cil_insert_items(
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
 	struct xfs_log_item	*lip;
 	int			len = 0;
-	int			iclog_space;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -442,19 +454,36 @@ xlog_cil_insert_items(
 	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		ctx_res = ctx->ticket->t_unit_res;
 
-	spin_lock(&cil->xc_cil_lock);
-
-	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
-		split_res = (len + iclog_space - 1) / iclog_space;
-		/* need to take into account split region headers, too */
-		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
+	/*
+	 * Check if we need to steal iclog headers. atomic_read() is not a
+	 * locked atomic operation, so we can check the value before we do any
+	 * real atomic ops in the fast path. If we've already taken the CIL unit
+	 * reservation from this commit, we've already got one iclog header
+	 * space reserved so we have to account for that otherwise we risk
+	 * overrunning the reservation on this ticket.
+	 *
+	 * If the CIL is already at the hard limit, we might need more header
+	 * space than was originally reserved. So steal more header space from every
+	 * commit that occurs once we are over the hard limit to ensure the CIL
+	 * push won't run out of reservation space.
+	 *
+	 * This can steal more than we need, but that's OK.
+	 */
+	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
+	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+		int	split_res = log->l_iclog_hsize +
+					sizeof(struct xlog_op_header);
+		if (ctx_res)
+			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
+		else
+			ctx_res = split_res * tp->t_ticket->t_iclog_hdrs;
+		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
 	}
-	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
-	ctx->ticket->t_curr_res += split_res + ctx_res;
+
+	spin_lock(&cil->xc_cil_lock);
+	tp->t_ticket->t_curr_res -= ctx_res + len;
+	ctx->ticket->t_unit_res += ctx_res;
+	ctx->ticket->t_curr_res += ctx_res;
 	ctx->space_used += len;
 
 	/*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b0dc3bc9de59..e72d14c76e03 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -140,15 +140,16 @@ enum xlog_iclog_state {
 #define XLOG_TIC_LEN_MAX	15
 
 typedef struct xlog_ticket {
-	struct list_head   t_queue;	 /* reserve/write queue */
-	struct task_struct *t_task;	 /* task that owns this ticket */
-	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
-	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
-	int		   t_curr_res;	 /* current reservation in bytes : 4  */
-	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
-	char		   t_ocnt;	 /* original count		 : 1  */
-	char		   t_cnt;	 /* current count		 : 1  */
-	char		   t_flags;	 /* properties of reservation	 : 1  */
+	struct list_head	t_queue;	/* reserve/write queue */
+	struct task_struct	*t_task;	/* task that owns this ticket */
+	xlog_tid_t		t_tid;		/* transaction identifier */
+	atomic_t		t_ref;		/* ticket reference count */
+	int			t_curr_res;	/* current reservation */
+	int			t_unit_res;	/* unit reservation */
+	char			t_ocnt;		/* original count */
+	char			t_cnt;		/* current count */
+	char			t_flags;	/* properties of reservation */
+	int			t_iclog_hdrs;	/* iclog hdrs in t_curr_res */
 } xlog_ticket_t;
 
 /*
@@ -249,6 +250,7 @@ struct xfs_cil_ctx {
 struct xfs_cil {
 	struct xlog		*xc_log;
 	unsigned long		xc_flags;
+	atomic_t		xc_iclog_hdrs;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (33 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 34/45] xfs: rework per-iclog header CIL reservation Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  0:11   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 36/45] xfs: implement percpu cil space used calculation Dave Chinner
                   ` (9 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The CIL push lock is highly contended on larger machines, becoming a
hard bottleneck that about 700,000 transaction commits/s on >16p
machines. To address this, start moving the CIL tracking
infrastructure to utilise per-CPU structures.

We need to track the space used, the amount of log reservation space
reserved to write the CIL, the log items in the CIL and the busy
extents that need to be completed by the CIL commit.  This requires
a couple of per-cpu counters, an unordered per-cpu list and a
globally ordered per-cpu list.

Create a per-cpu structure to hold these and all the management
interfaces needed, as well as the hooks to handle hotplug CPUs.
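
For reference, a hedged sketch of the per-cpu access pattern the
following patches build on; the field names mirror struct
xlog_cil_pcp below, and the aggregation loop is illustrative only:

	struct xlog_cil_pcp	*cilpcp;
	uint32_t		total = 0;
	int			cpu;

	/* commit fast path: preemption disabled, only the local
	 * CPU's copy is touched, no shared cachelines or locks */
	cilpcp = get_cpu_ptr(cil->xc_pcp);
	cilpcp->space_used += len;
	put_cpu_ptr(cilpcp);

	/* push slow path: walk all CPUs and fold their private
	 * counts back into a single total while commits are
	 * locked out by xc_ctx_lock */
	for_each_online_cpu(cpu) {
		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
		total += cilpcp->space_used;
		cilpcp->space_used = 0;
	}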

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c       | 94 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_priv.h      | 15 ++++++
 include/linux/cpuhotplug.h |  1 +
 3 files changed, 110 insertions(+)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f8fb2f59e24c..1bcf0d423d30 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1365,6 +1365,93 @@ xfs_log_item_in_current_chkpt(
 	return true;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(xlog_cil_pcp_list);
+static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
+static bool xlog_cil_pcp_init;
+
+static int
+xlog_cil_pcp_dead(
+	unsigned int		cpu)
+{
+	struct xfs_cil		*cil;
+
+	spin_lock(&xlog_cil_pcp_lock);
+	list_for_each_entry(cil, &xlog_cil_pcp_list, xc_pcp_list) {
+		/* move stuff on dead CPU to context */
+	}
+	spin_unlock(&xlog_cil_pcp_lock);
+	return 0;
+}
+
+static int
+xlog_cil_pcp_hpadd(
+	struct xfs_cil		*cil)
+{
+	if (!xlog_cil_pcp_init) {
+		int	ret;
+		ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD,
+						"xfs/cil_pcp:dead", NULL,
+						xlog_cil_pcp_dead);
+		if (ret < 0) {
+			xfs_warn(cil->xc_log->l_mp,
+	"Failed to initialise CIL hotplug, error %d. XFS is non-functional.",
+				ret);
+			ASSERT(0);
+			return -ENOMEM;
+		}
+		xlog_cil_pcp_init = true;
+	}
+
+	INIT_LIST_HEAD(&cil->xc_pcp_list);
+	spin_lock(&xlog_cil_pcp_lock);
+	list_add(&cil->xc_pcp_list, &xlog_cil_pcp_list);
+	spin_unlock(&xlog_cil_pcp_lock);
+	return 0;
+}
+
+static void
+xlog_cil_pcp_hpremove(
+	struct xfs_cil		*cil)
+{
+	spin_lock(&xlog_cil_pcp_lock);
+	list_del(&cil->xc_pcp_list);
+	spin_unlock(&xlog_cil_pcp_lock);
+}
+
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline int xlog_cil_pcp_hpadd(struct xfs_cil *cil) { return 0; }
+static inline void xlog_cil_pcp_hpremove(struct xfs_cil *cil) {}
+#endif
+
+static void __percpu *
+xlog_cil_pcp_alloc(
+	struct xfs_cil		*cil)
+{
+	struct xlog_cil_pcp	*cilpcp;
+
+	cilpcp = alloc_percpu(struct xlog_cil_pcp);
+	if (!cilpcp)
+		return NULL;
+
+	if (xlog_cil_pcp_hpadd(cil) < 0) {
+		free_percpu(cilpcp);
+		return NULL;
+	}
+	return cilpcp;
+}
+
+static void
+xlog_cil_pcp_free(
+	struct xfs_cil		*cil,
+	struct xlog_cil_pcp	*cilpcp)
+{
+	if (!cilpcp)
+		return;
+	xlog_cil_pcp_hpremove(cil);
+	free_percpu(cilpcp);
+}
+
 /*
  * Perform initial CIL structure initialisation.
  */
@@ -1379,6 +1466,12 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
+	cil->xc_pcp = xlog_cil_pcp_alloc(cil);
+	if (!cil->xc_pcp) {
+		kmem_free(cil);
+		return -ENOMEM;
+	}
+
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1409,6 +1502,7 @@ xlog_cil_destroy(
 
 	ASSERT(list_empty(&cil->xc_cil));
 	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
+	xlog_cil_pcp_free(cil, cil->xc_pcp);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index e72d14c76e03..2562f29c8986 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -231,6 +231,16 @@ struct xfs_cil_ctx {
 	struct work_struct	push_work;
 };
 
+/*
+ * Per-cpu CIL tracking items
+ */
+struct xlog_cil_pcp {
+	uint32_t		space_used;
+	uint32_t		curr_res;
+	struct list_head	busy_extents;
+	struct list_head	log_items;
+};
+
 /*
  * Committed Item List structure
  *
@@ -264,6 +274,11 @@ struct xfs_cil {
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
+
+	struct xlog_cil_pcp __percpu *xc_pcp;
+#ifdef CONFIG_HOTPLUG_CPU
+	struct list_head	xc_pcp_list;
+#endif
 } ____cacheline_aligned_in_smp;
 
 /* xc_flags bit values */
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f14adb882338..b13b21d825b3 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -52,6 +52,7 @@ enum cpuhp_state {
 	CPUHP_FS_BUFF_DEAD,
 	CPUHP_PRINTK_DEAD,
 	CPUHP_MM_MEMCQ_DEAD,
+	CPUHP_XFS_CIL_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
 	CPUHP_PAGE_ALLOC_DEAD,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 36/45] xfs: implement percpu cil space used calculation
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (34 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  0:20   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Dave Chinner
                   ` (8 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have the CIL percpu structures in place, implement the
space used counter with a fast sum check similar to the
percpu_counter infrastructure.
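
The fold heuristic is the interesting part. An annotated copy of the
hunk below; the batch size shrinks as the CIL approaches the
blocking limit, so the global counter becomes more accurate exactly
when the hard limit needs to be enforced:

	cilpcp = get_cpu_ptr(cil->xc_pcp);
	cilpcp->space_used += len;
	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
	    cilpcp->space_used >
			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
					num_online_cpus())) {
		/* fold the local count into the global counter */
		atomic_add(cilpcp->space_used, &ctx->space_used);
		cilpcp->space_used = 0;
	}
	put_cpu_ptr(cilpcp);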

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 42 ++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1bcf0d423d30..5519d112c1fd 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -433,6 +433,8 @@ xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
+	int			space_used;
+	struct xlog_cil_pcp	*cilpcp;
 
 	ASSERT(tp);
 
@@ -469,8 +471,9 @@ xlog_cil_insert_items(
 	 *
 	 * This can steal more than we need, but that's OK.
 	 */
+	space_used = atomic_read(&ctx->space_used);
 	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
-	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		int	split_res = log->l_iclog_hsize +
 					sizeof(struct xlog_op_header);
 		if (ctx_res)
@@ -480,16 +483,34 @@ xlog_cil_insert_items(
 		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
 	}
 
+	/*
+	 * Update the CIL percpu pointer. This updates the global counter when
+	 * over the percpu batch size or when the CIL is over the space limit.
+	 * This means low lock overhead for normal updates, and when over the
+	 * limit the space used is immediately accounted. This makes enforcing
+	 * the hard limit much more accurate. The per cpu fold threshold is
+	 * based on how close we are to the hard limit.
+	 */
+	cilpcp = get_cpu_ptr(cil->xc_pcp);
+	cilpcp->space_used += len;
+	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
+	    cilpcp->space_used >
+			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
+					num_online_cpus())) {
+		atomic_add(cilpcp->space_used, &ctx->space_used);
+		cilpcp->space_used = 0;
+	}
+	put_cpu_ptr(cilpcp);
+
 	spin_lock(&cil->xc_cil_lock);
-	tp->t_ticket->t_curr_res -= ctx_res + len;
 	ctx->ticket->t_unit_res += ctx_res;
 	ctx->ticket->t_curr_res += ctx_res;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
 	 * the log items. Shutdown is imminent...
 	 */
+	tp->t_ticket->t_curr_res -= ctx_res + len;
 	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
 		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
 		xfs_warn(log->l_mp,
@@ -769,12 +790,20 @@ xlog_cil_push_work(
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			commit_iclog_sync = false;
+	int			cpu;
+	struct xlog_cil_pcp	*cilpcp;
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
 	down_write(&cil->xc_ctx_lock);
 
+	/* Reset the CIL pcp counters */
+	for_each_online_cpu(cpu) {
+		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+		cilpcp->space_used = 0;
+	}
+
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
@@ -1042,6 +1071,7 @@ xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	int		space_used = atomic_read(&cil->xc_ctx->space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -1054,7 +1084,7 @@ xlog_cil_push_background(
 	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -1083,10 +1113,10 @@ xlog_cil_push_background(
 	 * The ctx->xc_push_lock provides the serialisation necessary for safely
 	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
 	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
 		return;
 	}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2562f29c8986..4eb373357f26 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -222,7 +222,7 @@ struct xfs_cil_ctx {
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
 	int			nvecs;		/* number of regions */
-	int			space_used;	/* aggregate size of regions */
+	atomic_t		space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (35 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 36/45] xfs: implement percpu cil space used calculation Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  0:26   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Dave Chinner
                   ` (7 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To get it out from under the cil spinlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 11 ++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 5519d112c1fd..a2f93bd7644b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -492,6 +492,7 @@ xlog_cil_insert_items(
 	 * based on how close we are to the hard limit.
 	 */
 	cilpcp = get_cpu_ptr(cil->xc_pcp);
+	cilpcp->space_reserved += ctx_res;
 	cilpcp->space_used += len;
 	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
 	    cilpcp->space_used >
@@ -502,10 +503,6 @@ xlog_cil_insert_items(
 	}
 	put_cpu_ptr(cilpcp);
 
-	spin_lock(&cil->xc_cil_lock);
-	ctx->ticket->t_unit_res += ctx_res;
-	ctx->ticket->t_curr_res += ctx_res;
-
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
 	 * the log items. Shutdown is imminent...
@@ -527,6 +524,7 @@ xlog_cil_insert_items(
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
 	 */
+	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
 		/* Skip items which aren't dirty in this transaction. */
@@ -798,10 +796,13 @@ xlog_cil_push_work(
 
 	down_write(&cil->xc_ctx_lock);
 
-	/* Reset the CIL pcp counters */
+	/* Aggregate and reset the CIL pcp counters */
 	for_each_online_cpu(cpu) {
 		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+		ctx->ticket->t_curr_res += cilpcp->space_reserved;
 		cilpcp->space_used = 0;
+		cilpcp->space_reserved = 0;
+
 	}
 
 	spin_lock(&cil->xc_push_lock);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4eb373357f26..278b9eaea582 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -236,7 +236,7 @@ struct xfs_cil_ctx {
  */
 struct xlog_cil_pcp {
 	uint32_t		space_used;
-	uint32_t		curr_res;
+	uint32_t		space_reserved;
 	struct list_head	busy_extents;
 	struct list_head	log_items;
 };
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 38/45] xfs: convert CIL busy extents to per-cpu
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (36 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  0:36   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 39/45] xfs: Add order IDs to log items in CIL Dave Chinner
                   ` (6 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To get them out from under the CIL lock.

This is an unordered list, so we can simply punt it to per-cpu lists
during transaction commits and reaggregate it back into a single
list during the CIL push work.
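
In outline, taken from the hunks below:

	/* commit: punt this transaction's busy extents to the local
	 * CPU's unordered list without taking any global lock */
	if (!list_empty(&tp->t_busy))
		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);

	/* push: reaggregate every CPU's list back into the context
	 * while new commits are excluded by xc_ctx_lock */
	for_each_online_cpu(cpu) {
		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
		if (!list_empty(&cilpcp->busy_extents)) {
			list_splice_init(&cilpcp->busy_extents,
					&ctx->busy_extents);
		}
	}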

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index a2f93bd7644b..7428b98c8279 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -501,6 +501,9 @@ xlog_cil_insert_items(
 		atomic_add(cilpcp->space_used, &ctx->space_used);
 		cilpcp->space_used = 0;
 	}
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
 	put_cpu_ptr(cilpcp);
 
 	/*
@@ -540,9 +543,6 @@ xlog_cil_insert_items(
 			list_move_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
@@ -802,7 +802,10 @@ xlog_cil_push_work(
 		ctx->ticket->t_curr_res += cilpcp->space_reserved;
 		cilpcp->space_used = 0;
 		cilpcp->space_reserved = 0;
-
+		if (!list_empty(&cilpcp->busy_extents)) {
+			list_splice_init(&cilpcp->busy_extents,
+					&ctx->busy_extents);
+		}
 	}
 
 	spin_lock(&cil->xc_push_lock);
@@ -1459,17 +1462,24 @@ static void __percpu *
 xlog_cil_pcp_alloc(
 	struct xfs_cil		*cil)
 {
+	void __percpu		*pcptr;
 	struct xlog_cil_pcp	*cilpcp;
+	int			cpu;
 
-	cilpcp = alloc_percpu(struct xlog_cil_pcp);
-	if (!cilpcp)
+	pcptr = alloc_percpu(struct xlog_cil_pcp);
+	if (!pcptr)
 		return NULL;
 
+	for_each_possible_cpu(cpu) {
+		cilpcp = per_cpu_ptr(pcptr, cpu);
+		INIT_LIST_HEAD(&cilpcp->busy_extents);
+	}
+
 	if (xlog_cil_pcp_hpadd(cil) < 0) {
-		free_percpu(cilpcp);
+		free_percpu(pcptr);
 		return NULL;
 	}
-	return cilpcp;
+	return pcptr;
 }
 
 static void
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 39/45] xfs: Add order IDs to log items in CIL
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (37 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  1:00   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Dave Chinner
                   ` (5 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Before we split the ordered CIL up into per cpu lists, we need a
mechanism to track the order of the items in the CIL. We need to do
this because there are rules around the order in which related items
must physically appear in the log even inside a single checkpoint
transaction.

An example of this is intents - an intent must appear in the log
before its intent-done record so that log recovery can cancel the
intent correctly. If we have these two records misordered in the
CIL, then they will not be recovered correctly by journal replay.

We also will not be able to move items to the tail of
the CIL list when they are relogged, hence the log items will need
some mechanism to allow the correct log item order to be recreated
before we write log items to the journal.

Hence we need to have a mechanism for recording global order of
transactions in the log items so that we can recover that order
from unordered per-cpu lists.

Do this with a simple monotonically increasing commit counter in the CIL
context. Each log item in the transaction gets stamped with the
current commit order ID before it is added to the CIL. If the item
is already in the CIL, leave it where it is instead of moving it to
the tail of the list and instead sort the list before we start the
push work.

XXX: list_sort() under the cil_ctx_lock held exclusively starts
hurting at >16 threads. Front end commits are waiting on the push
to switch contexts much longer. The item order id should likely be
moved into the logvecs when they are detached from the items, then
the sort can be done on the logvec after the cil_ctx_lock has been
released. logvecs will need to use a list_head for this rather than
a single linked list like they do now....
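
One property worth noting about the sort in the hunk below:
list_sort() is a stable merge sort, so log items committed by the
same transaction (equal li_order_id values) keep their relative
order while whole transactions are reordered into global commit
order:

	/* recreate global commit order from the per-item order ids */
	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);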

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 34 ++++++++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h |  1 +
 fs/xfs/xfs_trans.h    |  1 +
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7428b98c8279..7420389f4cee 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -434,6 +434,7 @@ xlog_cil_insert_items(
 	int			len = 0;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 	int			space_used;
+	int			order;
 	struct xlog_cil_pcp	*cilpcp;
 
 	ASSERT(tp);
@@ -523,10 +524,12 @@ xlog_cil_insert_items(
 	}
 
 	/*
-	 * Now (re-)position everything modified at the tail of the CIL.
+	 * Now update the order of everything modified in the transaction
+	 * and insert items into the CIL if they aren't already there.
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
 	 */
+	order = atomic_inc_return(&ctx->order_id);
 	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
@@ -534,13 +537,10 @@ xlog_cil_insert_items(
 		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
 			continue;
 
-		/*
-		 * Only move the item if it isn't already at the tail. This is
-		 * to prevent a transient list_empty() state when reinserting
-		 * an item that is already the only item in the CIL.
-		 */
-		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
-			list_move_tail(&lip->li_cil, &cil->xc_cil);
+		lip->li_order_id = order;
+		if (!list_empty(&lip->li_cil))
+			continue;
+		list_add(&lip->li_cil, &cil->xc_cil);
 	}
 
 	spin_unlock(&cil->xc_cil_lock);
@@ -753,6 +753,22 @@ xlog_cil_build_trans_hdr(
 	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
+static int
+xlog_cil_order_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
+	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
+
+	if (l1->li_order_id > l2->li_order_id)
+		return 1;
+	if (l1->li_order_id < l2->li_order_id)
+		return -1;
+	return 0;
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -891,6 +907,7 @@ xlog_cil_push_work(
 	 * needed on the transaction commit side which is currently locked out
 	 * by the flush lock.
 	 */
+	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
 	lv = NULL;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
@@ -898,6 +915,7 @@ xlog_cil_push_work(
 		item = list_first_entry(&cil->xc_cil,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
+		item->li_order_id = 0;
 		if (!ctx->lv_chain)
 			ctx->lv_chain = item->li_lv;
 		else
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 278b9eaea582..92d9e1a03a07 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -229,6 +229,7 @@ struct xfs_cil_ctx {
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
 	struct work_struct	push_work;
+	atomic_t		order_id;
 };
 
 /*
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 6276c7d251e6..226c0f5e7870 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -44,6 +44,7 @@ struct xfs_log_item {
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
 	xfs_csn_t			li_seq;		/* CIL commit seq */
+	uint32_t			li_order_id;	/* CIL commit order */
 };
 
 /*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 40/45] xfs: convert CIL to unordered per cpu lists
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (38 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 39/45] xfs: Add order IDs to log items in CIL Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  1:15   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 41/45] xfs: move CIL ordering to the logvec chain Dave Chinner
                   ` (4 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

So that we can remove the cil_lock which is a global serialisation
point. We've already got ordering sorted, so all we need to do is
treat the CIL list like the busy extent list and reconstruct it
before the push starts.

This is what we're trying to avoid:

 -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
    - 46.35% xfs_log_commit_cil
       - 41.54% _raw_spin_lock
          - 67.30% do_raw_spin_lock
               66.96% __pv_queued_spin_lock_slowpath

Which happens on a 32p system when running a 32-way 'rm -rf'
workload. After this patch:

-   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
   - 17.67% xfs_log_commit_cil
      - 6.51% xfs_log_ticket_ungrant
           1.40% xfs_log_space_wake
        2.32% memcpy_erms
      - 2.18% xfs_buf_item_committing
         - 2.12% xfs_buf_item_release
            - 1.03% xfs_buf_unlock
                 0.96% up
              0.72% xfs_buf_rele
        1.33% xfs_inode_item_format
        1.19% down_read
        0.91% up_read
        0.76% xfs_buf_item_format
      - 0.68% kmem_alloc_large
         - 0.67% kmem_alloc
              0.64% __kmalloc
        0.50% xfs_buf_item_size

It kinda looks like the workload is running out of log space all
the time. But all the spinlock contention is gone and the
transaction commit rate has gone from 800k/s to 1.3M/s so the amount
of real work being done has gone up a *lot*.
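
A minimal sketch of the resulting structure (illustrative only; the names
follow this series and it omits the empty-CIL accounting):

	/* commit side: no global lock, only this CPU's private list */
	cilpcp = get_cpu_ptr(cil->xc_pcp);
	list_add(&lip->li_cil, &cilpcp->log_items);
	put_cpu_ptr(cilpcp);

	/* push side: rebuild a single list from all CPUs, then sort it */
	for_each_possible_cpu(cpu) {
		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
		list_splice_init(&cilpcp->log_items, &log_items);
	}
	list_sort(NULL, &log_items, xlog_cil_order_cmp);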

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++-----------------------
 fs/xfs/xfs_log_priv.h |  2 --
 2 files changed, 29 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7420389f4cee..3d43a5088154 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -448,10 +448,9 @@ xlog_cil_insert_items(
 	/*
 	 * We need to take the CIL checkpoint unit reservation on the first
 	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
-	 * unnecessarily do an atomic op in the fast path here. We don't need to
-	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
-	 * under the xc_ctx_lock here and that needs to be held exclusively to
-	 * reset the XLOG_CIL_EMPTY bit.
+	 * unnecessarily do an atomic op in the fast path here. We can clear the
+	 * XLOG_CIL_EMPTY bit as we are under the xc_ctx_lock here and that
+	 * needs to be held exclusively to reset the XLOG_CIL_EMPTY bit.
 	 */
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
 	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
@@ -505,24 +504,6 @@ xlog_cil_insert_items(
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
-	put_cpu_ptr(cilpcp);
-
-	/*
-	 * If we've overrun the reservation, dump the tx details before we move
-	 * the log items. Shutdown is imminent...
-	 */
-	tp->t_ticket->t_curr_res -= ctx_res + len;
-	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
-		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
-		xfs_warn(log->l_mp,
-			 "  log items: %d bytes (iov hdrs: %d bytes)",
-			 len, iovhdr_res);
-		xfs_warn(log->l_mp, "  split region headers: %d bytes",
-			 split_res);
-		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
-		xlog_print_trans(tp);
-	}
-
 	/*
 	 * Now update the order of everything modified in the transaction
 	 * and insert items into the CIL if they aren't already there.
@@ -530,7 +511,6 @@ xlog_cil_insert_items(
 	 * the transaction commit.
 	 */
 	order = atomic_inc_return(&ctx->order_id);
-	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
 		/* Skip items which aren't dirty in this transaction. */
@@ -540,10 +520,26 @@ xlog_cil_insert_items(
 		lip->li_order_id = order;
 		if (!list_empty(&lip->li_cil))
 			continue;
-		list_add(&lip->li_cil, &cil->xc_cil);
+		list_add(&lip->li_cil, &cilpcp->log_items);
+	}
+	put_cpu_ptr(cilpcp);
+
+	/*
+	 * If we've overrun the reservation, dump the tx details before we move
+	 * the log items. Shutdown is imminent...
+	 */
+	tp->t_ticket->t_curr_res -= ctx_res + len;
+	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
+		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
+		xfs_warn(log->l_mp,
+			 "  log items: %d bytes (iov hdrs: %d bytes)",
+			 len, iovhdr_res);
+		xfs_warn(log->l_mp, "  split region headers: %d bytes",
+			 split_res);
+		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
+		xlog_print_trans(tp);
 	}
 
-	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -806,6 +802,7 @@ xlog_cil_push_work(
 	bool			commit_iclog_sync = false;
 	int			cpu;
 	struct xlog_cil_pcp	*cilpcp;
+	LIST_HEAD		(log_items);
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -822,6 +819,9 @@ xlog_cil_push_work(
 			list_splice_init(&cilpcp->busy_extents,
 					&ctx->busy_extents);
 		}
+		if (!list_empty(&cilpcp->log_items)) {
+			list_splice_init(&cilpcp->log_items, &log_items);
+		}
 	}
 
 	spin_lock(&cil->xc_push_lock);
@@ -907,12 +907,12 @@ xlog_cil_push_work(
 	 * needed on the transaction commit side which is currently locked out
 	 * by the flush lock.
 	 */
-	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
+	list_sort(NULL, &log_items, xlog_cil_order_cmp);
 	lv = NULL;
-	while (!list_empty(&cil->xc_cil)) {
+	while (!list_empty(&log_items)) {
 		struct xfs_log_item	*item;
 
-		item = list_first_entry(&cil->xc_cil,
+		item = list_first_entry(&log_items,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
 		item->li_order_id = 0;
@@ -1099,7 +1099,6 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there.
 	 */
-	ASSERT(!list_empty(&cil->xc_cil));
 	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
@@ -1491,6 +1490,7 @@ xlog_cil_pcp_alloc(
 	for_each_possible_cpu(cpu) {
 		cilpcp = per_cpu_ptr(pcptr, cpu);
 		INIT_LIST_HEAD(&cilpcp->busy_extents);
+		INIT_LIST_HEAD(&cilpcp->log_items);
 	}
 
 	if (xlog_cil_pcp_hpadd(cil) < 0) {
@@ -1531,9 +1531,7 @@ xlog_cil_init(
 		return -ENOMEM;
 	}
 
-	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
-	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
@@ -1559,7 +1557,6 @@ xlog_cil_destroy(
 		kmem_free(cil->xc_ctx);
 	}
 
-	ASSERT(list_empty(&cil->xc_cil));
 	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 	xlog_cil_pcp_free(cil, cil->xc_pcp);
 	kmem_free(cil);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 92d9e1a03a07..12a1a36eef7e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -262,8 +262,6 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	unsigned long		xc_flags;
 	atomic_t		xc_iclog_hdrs;
-	struct list_head	xc_cil;
-	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
 	struct xfs_cil_ctx	*xc_ctx;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 41/45] xfs: move CIL ordering to the logvec chain
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (39 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  1:34   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Dave Chinner
                   ` (3 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Adding a list_sort() call to the CIL push work while the xc_ctx_lock
is held exclusively has resulted in fairly long lock hold times and
that stops all front end transaction commits from making progress.

We can move the sorting out of the xc_ctx_lock if we can transfer
the ordering information to the log vectors as they are detached
from the log items and then we can sort the log vectors. This
requires log vectors to use a list_head rather than a single linked
list and to hold an order ID field. With these changes, we can move
the list_sort() call to just before we call xlog_write() when we
aren't holding any locks at all.
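
A condensed view of the change (illustrative; names are from this series):

	/* under xc_ctx_lock: transfer the order ID, no sorting yet */
	lv = item->li_lv;
	lv->lv_order_id = item->li_order_id;
	list_add_tail(&lv->lv_chain, &ctx->lv_chain);

	/* after xc_ctx_lock is dropped: sort vectors, not log items */
	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);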

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c        | 46 +++++++++++++++++++++---------
 fs/xfs/xfs_log.h        |  3 +-
 fs/xfs/xfs_log_cil.c    | 63 +++++++++++++++++++++++++----------------
 fs/xfs/xfs_log_priv.h   |  4 +--
 fs/xfs/xfs_trans.c      |  4 +--
 fs/xfs/xfs_trans_priv.h |  4 +--
 6 files changed, 78 insertions(+), 46 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 46a006d41184..fd58c3213ebf 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -846,6 +846,9 @@ xlog_write_unmount_record(
 		.lv_niovecs = 1,
 		.lv_iovecp = &reg,
 	};
+	LIST_HEAD(lv_chain);
+	INIT_LIST_HEAD(&vec.lv_chain);
+	list_add(&vec.lv_chain, &lv_chain);
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(unmount_rec);
@@ -857,8 +860,8 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
-				reg.i_len);
+	return xlog_write(log, &lv_chain, ticket, NULL, NULL,
+				XLOG_UNMOUNT_TRANS, reg.i_len);
 }
 
 /*
@@ -1571,14 +1574,17 @@ xlog_commit_record(
 		.lv_iovecp = &reg,
 	};
 	int	error;
+	LIST_HEAD(lv_chain);
+	INIT_LIST_HEAD(&vec.lv_chain);
+	list_add(&vec.lv_chain, &lv_chain);
 
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
-				reg.i_len);
+	error = xlog_write(log, &lv_chain, ticket, lsn, iclog,
+				XLOG_COMMIT_TRANS, reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2109,6 +2115,7 @@ xlog_print_trans(
  */
 static struct xfs_log_vec *
 xlog_write_single(
+	struct list_head	*lv_chain,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
 	struct xlog_in_core	*iclog,
@@ -2117,7 +2124,7 @@ xlog_write_single(
 	uint32_t		*record_cnt,
 	uint32_t		*data_cnt)
 {
-	struct xfs_log_vec	*lv = log_vector;
+	struct xfs_log_vec	*lv;
 	void			*ptr;
 	int			index;
 
@@ -2125,10 +2132,13 @@ xlog_write_single(
 		iclog->ic_state == XLOG_STATE_WANT_SYNC);
 
 	ptr = iclog->ic_datap + *log_offset;
-	for (lv = log_vector; lv; lv = lv->lv_next) {
+	for (lv = log_vector;
+	     !list_entry_is_head(lv, lv_chain, lv_chain);
+	     lv = list_next_entry(lv, lv_chain)) {
 		/*
-		 * If the entire log vec does not fit in the iclog, punt it to
-		 * the partial copy loop which can handle this case.
+		 * If the log vec contains data that needs to be copied and does
+		 * not entirely fit in the iclog, punt it to the partial copy
+		 * loop which can handle this case.
 		 */
 		if (lv->lv_niovecs &&
 		    lv->lv_bytes > iclog->ic_size - *log_offset)
@@ -2154,6 +2164,8 @@ xlog_write_single(
 			*data_cnt += reg->i_len;
 		}
 	}
+	if (list_entry_is_head(lv, lv_chain, lv_chain))
+		lv = NULL;
 	ASSERT(*len == 0 || lv);
 	return lv;
 }
@@ -2199,6 +2211,7 @@ xlog_write_get_more_iclog_space(
 static struct xfs_log_vec *
 xlog_write_partial(
 	struct xlog		*log,
+	struct list_head	*lv_chain,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
 	struct xlog_in_core	**iclogp,
@@ -2338,7 +2351,10 @@ xlog_write_partial(
 	 * the caller so it can go back to fast path copying.
 	 */
 	*iclogp = iclog;
-	return lv->lv_next;
+	lv = list_next_entry(lv, lv_chain);
+	if (list_entry_is_head(lv, lv_chain, lv_chain))
+		return NULL;
+	return lv;
 }
 
 /*
@@ -2384,7 +2400,7 @@ xlog_write_partial(
 int
 xlog_write(
 	struct xlog		*log,
-	struct xfs_log_vec	*log_vector,
+	struct list_head	*lv_chain,
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
@@ -2392,7 +2408,7 @@ xlog_write(
 	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
-	struct xfs_log_vec	*lv = log_vector;
+	struct xfs_log_vec	*lv;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
@@ -2424,15 +2440,17 @@ xlog_write(
 	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
 		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
+	lv = list_first_entry_or_null(lv_chain, struct xfs_log_vec, lv_chain);
 	while (lv) {
-		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
+		lv = xlog_write_single(lv_chain, lv, ticket, iclog, &log_offset,
 					&len, &record_cnt, &data_cnt);
 		if (!lv)
 			break;
 
 		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
-		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
-					&len, &record_cnt, &data_cnt);
+		lv = xlog_write_partial(log, lv_chain, lv, ticket, &iclog,
+					&log_offset, &len, &record_cnt,
+					&data_cnt);
 		if (IS_ERR_OR_NULL(lv)) {
 			error = PTR_ERR_OR_ZERO(lv);
 			break;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index af54ea3f8c90..0445dd6acbce 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -9,7 +9,8 @@
 struct xfs_cil_ctx;
 
 struct xfs_log_vec {
-	struct xfs_log_vec	*lv_next;	/* next lv in build list */
+	struct list_head	lv_chain;	/* lv chain ptrs */
+	int			lv_order_id;	/* chain ordering info */
 	int			lv_niovecs;	/* number of iovecs in lv */
 	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
 	struct xfs_log_item	*lv_item;	/* owner */
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 3d43a5088154..6dcc23829bef 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
 	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_LIST_HEAD(&ctx->lv_chain);
 	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
 	return ctx;
 }
@@ -237,6 +238,7 @@ xlog_cil_alloc_shadow_bufs(
 			lv = kmem_alloc_large(buf_size, KM_NOFS);
 			memset(lv, 0, xlog_cil_iovec_space(niovecs));
 
+			INIT_LIST_HEAD(&lv->lv_chain);
 			lv->lv_item = lip;
 			lv->lv_size = buf_size;
 			if (ordered)
@@ -252,7 +254,6 @@ xlog_cil_alloc_shadow_bufs(
 			else
 				lv->lv_buf_len = 0;
 			lv->lv_bytes = 0;
-			lv->lv_next = NULL;
 		}
 
 		/* Ensure the lv is set up according to ->iop_size */
@@ -379,8 +380,6 @@ xlog_cil_insert_format_items(
 		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
 			/* same or smaller, optimise common overwrite case */
 			lv = lip->li_lv;
-			lv->lv_next = NULL;
-
 			if (ordered)
 				goto insert;
 
@@ -547,14 +546,14 @@ xlog_cil_insert_items(
 
 static void
 xlog_cil_free_logvec(
-	struct xfs_log_vec	*log_vector)
+	struct list_head	*lv_chain)
 {
 	struct xfs_log_vec	*lv;
 
-	for (lv = log_vector; lv; ) {
-		struct xfs_log_vec *next = lv->lv_next;
+	while(!list_empty(lv_chain)) {
+		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_chain);
+		list_del_init(&lv->lv_chain);
 		kmem_free(lv);
-		lv = next;
 	}
 }
 
@@ -653,7 +652,7 @@ xlog_cil_committed(
 		spin_unlock(&ctx->cil->xc_push_lock);
 	}
 
-	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
+	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain,
 					ctx->start_lsn, abort);
 
 	xfs_extent_busy_sort(&ctx->busy_extents);
@@ -664,7 +663,7 @@ xlog_cil_committed(
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_push_lock);
 
-	xlog_cil_free_logvec(ctx->lv_chain);
+	xlog_cil_free_logvec(&ctx->lv_chain);
 
 	if (!list_empty(&ctx->busy_extents))
 		xlog_discard_busy_extents(mp, ctx);
@@ -744,7 +743,7 @@ xlog_cil_build_trans_hdr(
 	lvhdr->lv_niovecs = 2;
 	lvhdr->lv_iovecp = &hdr->lhdr[0];
 	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
-	lvhdr->lv_next = ctx->lv_chain;
+	list_add(&lvhdr->lv_chain, &ctx->lv_chain);
 
 	tic->t_curr_res -= lvhdr->lv_bytes;
 }
@@ -755,12 +754,14 @@ xlog_cil_order_cmp(
 	struct list_head	*a,
 	struct list_head	*b)
 {
-	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
-	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
+	struct xfs_log_vec	*l1 = container_of(a, struct xfs_log_vec,
+							lv_chain);
+	struct xfs_log_vec	*l2 = container_of(b, struct xfs_log_vec,
+							lv_chain);
 
-	if (l1->li_order_id > l2->li_order_id)
+	if (l1->lv_order_id > l2->lv_order_id)
 		return 1;
-	if (l1->li_order_id < l2->li_order_id)
+	if (l1->lv_order_id < l2->lv_order_id)
 		return -1;
 	return 0;
 }
@@ -907,26 +908,25 @@ xlog_cil_push_work(
 	 * needed on the transaction commit side which is currently locked out
 	 * by the flush lock.
 	 */
-	list_sort(NULL, &log_items, xlog_cil_order_cmp);
 	lv = NULL;
 	while (!list_empty(&log_items)) {
 		struct xfs_log_item	*item;
 
 		item = list_first_entry(&log_items,
 					struct xfs_log_item, li_cil);
-		list_del_init(&item->li_cil);
-		item->li_order_id = 0;
-		if (!ctx->lv_chain)
-			ctx->lv_chain = item->li_lv;
-		else
-			lv->lv_next = item->li_lv;
+
 		lv = item->li_lv;
-		item->li_lv = NULL;
+		lv->lv_order_id = item->li_order_id;
 		num_iovecs += lv->lv_niovecs;
-
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
 			num_bytes += lv->lv_bytes;
+		list_add_tail(&lv->lv_chain, &ctx->lv_chain);
+
+		list_del_init(&item->li_cil);
+		item->li_order_id = 0;
+		item->li_lv = NULL;
+
 	}
 
 	/*
@@ -959,6 +959,13 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
+	/*
+	 * Sort the log vector chain before we add the transaction headers.
+	 * This ensures we always have the transaction headers at the start
+	 * of the chain.
+	 */
+	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
+
 	/*
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
@@ -981,8 +988,14 @@ xlog_cil_push_work(
 	 * use the commit record lsn then we can move the tail beyond the grant
 	 * write head.
 	 */
-	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS, num_bytes);
+	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
+				NULL, XLOG_START_TRANS, num_bytes);
+
+	/*
+	 * Take the lvhdr back off the lv_chain as it should not be passed
+	 * to log IO completion.
+	 */
+	list_del(&lvhdr.lv_chain);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 12a1a36eef7e..6a4160200417 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -224,7 +224,7 @@ struct xfs_cil_ctx {
 	int			nvecs;		/* number of regions */
 	atomic_t		space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
-	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
+	struct list_head	lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
@@ -480,7 +480,7 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
 
 void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
-int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
+int	xlog_write(struct xlog *log, struct list_head *lv_chain,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
 		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 83c2b7f22eb7..b20e68279808 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -747,7 +747,7 @@ xfs_log_item_batch_insert(
 void
 xfs_trans_committed_bulk(
 	struct xfs_ail		*ailp,
-	struct xfs_log_vec	*log_vector,
+	struct list_head	*lv_chain,
 	xfs_lsn_t		commit_lsn,
 	bool			aborted)
 {
@@ -762,7 +762,7 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 
 	/* unpin all the log items */
-	for (lv = log_vector; lv; lv = lv->lv_next ) {
+	list_for_each_entry(lv, lv_chain, lv_chain) {
 		struct xfs_log_item	*lip = lv->lv_item;
 		xfs_lsn_t		item_lsn;
 
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 3004aeac9110..b0bf78e6ff76 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -18,8 +18,8 @@ void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
-void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
-				xfs_lsn_t commit_lsn, bool aborted);
+void	xfs_trans_committed_bulk(struct xfs_ail *ailp,
+		struct list_head *lv_chain, xfs_lsn_t commit_lsn, bool aborted);
 /*
  * AIL traversal cursor.
  *
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (40 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 41/45] xfs: move CIL ordering to the logvec chain Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  1:36   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 43/45] xfs: avoid cil push lock if possible Dave Chinner
                   ` (2 subsequent siblings)
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

 - 21.92% __xfs_trans_commit
     - 21.62% xfs_log_commit_cil
	- 11.69% xfs_trans_unreserve_and_mod_sb
	   - 11.58% __percpu_counter_compare
	      - 11.45% __percpu_counter_sum
		 - 10.29% _raw_spin_lock_irqsave
		    - 10.28% do_raw_spin_lock
			 __pv_queued_spin_lock_slowpath

We debated just getting rid of it last time this came up and
there was no real objection to removing it. Now it's the biggest
scalability limitation for debug kernels even on smallish machines,
so let's just get rid of it.
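
For context, the debug check requires an exact sum of the counter, which
looks roughly like this (a simplified sketch of the lib/percpu_counter.c
behaviour, not the exact kernel code):

	s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
	{
		s64 ret;
		int cpu;

		raw_spin_lock(&fbc->lock);	/* one global lock */
		ret = fbc->count;
		for_each_online_cpu(cpu)	/* O(nr_cpus) remote reads */
			ret += *per_cpu_ptr(fbc->counters, cpu);
		raw_spin_unlock(&fbc->lock);
		return ret;
	}

Every inode-freeing commit paid that cost on debug kernels.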

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index b20e68279808..637d084c8aa8 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -616,19 +616,12 @@ xfs_trans_unreserve_and_mod_sb(
 		ASSERT(!error);
 	}
 
-	if (idelta) {
+	if (idelta)
 		percpu_counter_add_batch(&mp->m_icount, idelta,
 					 XFS_ICOUNT_BATCH);
-		if (idelta < 0)
-			ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
-							XFS_ICOUNT_BATCH) >= 0);
-	}
 
-	if (ifreedelta) {
+	if (ifreedelta)
 		percpu_counter_add(&mp->m_ifree, ifreedelta);
-		if (ifreedelta < 0)
-			ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
-	}
 
 	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
 		return;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 43/45] xfs: avoid cil push lock if possible
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (41 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  1:47   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
  2021-03-05  5:11 ` [PATCH 45/45] xfs: expanding delayed logging design with background material Dave Chinner
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because now it hurts when the CIL fills up.

  - 37.20% __xfs_trans_commit
      - 35.84% xfs_log_commit_cil
         - 19.34% _raw_spin_lock
            - do_raw_spin_lock
                 19.01% __pv_queued_spin_lock_slowpath
         - 4.20% xfs_log_ticket_ungrant
              0.90% xfs_log_space_wake


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 6dcc23829bef..d60c72ad391a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1115,10 +1115,18 @@ xlog_cil_push_background(
 	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
-	 * Don't do a background push if we haven't used up all the
-	 * space available yet.
+	 * We are done if:
+	 * - we haven't used up all the space available yet; or
+	 * - we've already queued up a push; and
+	 * - we're not over the hard limit; and
+	 * - nothing has been over the hard limit.
+	 *
+	 * If so, we don't need to take the push lock as there's nothing to do.
 	 */
-	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log) ||
+	    (cil->xc_push_seq == cil->xc_current_sequence &&
+	     space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) &&
+	     !waitqueue_active(&cil->xc_push_wait))) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (42 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 43/45] xfs: avoid cil push lock if possible Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  2:00   ` Darrick J. Wong
  2021-03-05  5:11 ` [PATCH 45/45] xfs: expanding delayed logging design with background material Dave Chinner
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When xlog_sync() rounds off the tail of the iclog that is being
flushed, it manually subtracts that space from the grant heads. This
space is actually reserved by the transaction ticket that covers
the xlog_sync() call from xlog_write(), but we don't plumb the
ticket down far enough for it to account for the space consumed in
the current log ticket.

The grant heads are hot, so we really should be accounting this to
the ticket if we can, rather than adding thousands of extra grant
head updates every CIL commit.

Interestingly, this actually indicates a potential log space overrun
can occur when we force the log. By the time that xfs_log_force()
pushes out an active iclog and consumes the roundoff space, the
reservation for that roundoff space has been returned to the grant
heads and is no longer covered by a reservation. In theory the
roundoff added to log force on an already full log could push the
write head past the tail. In practice, the CIL commit that writes to
the log and needs the iclog pushed will have reserved space for
roundoff, so when it releases the ticket there will still be
physical space for the roundoff to be committed to the log, even
though it is no longer reserved. This roundoff won't be enough space
to allow a transaction to be woken if the log is full, so overruns
should not actually occur in practice.

That said, it indicates that we should not release the CIL context
log ticket until after we've released the commit iclog. It also
means that xlog_sync() still needs the direct grant head
manipulation if we don't provide it with a ticket. Log forces are
rare when we are in fast paths running 1.5 million transactions/s
that make the grant heads hot, so let's optimise the hot case and
pass CIL log tickets down to the xlog_sync() code.
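
As a reference point, the roundoff being accounted here is just the padding
that brings the iclog write out to an aligned size, conceptually (an
illustrative sketch only; see xlog_calc_iclog_size() for the real
calculation):

	/* pad is BBSIZE, or the log stripe unit on v2 logs */
	count = round_up(used, pad);
	roundoff = count - used;

The patch below accounts that padding to the CIL ticket when one is
available and falls back to direct grant head updates otherwise.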

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 39 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_log_cil.c  | 19 ++++++++++++++-----
 fs/xfs/xfs_log_priv.h |  3 ++-
 3 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fd58c3213ebf..1c7d522b12cd 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -55,7 +55,8 @@ xlog_grant_push_ail(
 STATIC void
 xlog_sync(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog);
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket);
 #if defined(DEBUG)
 STATIC void
 xlog_verify_dest_ptr(
@@ -535,7 +536,8 @@ __xlog_state_release_iclog(
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket)
 {
 	lockdep_assert_held(&log->l_icloglock);
 
@@ -545,7 +547,7 @@ xlog_state_release_iclog(
 	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
 	    __xlog_state_release_iclog(log, iclog)) {
 		spin_unlock(&log->l_icloglock);
-		xlog_sync(log, iclog);
+		xlog_sync(log, iclog, ticket);
 		spin_lock(&log->l_icloglock);
 	}
 
@@ -898,7 +900,7 @@ xlog_unmount_write(
 	else
 		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 		       iclog->ic_state == XLOG_STATE_IOERROR);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, tic);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -1930,7 +1932,8 @@ xlog_calc_iclog_size(
 STATIC void
 xlog_sync(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket)
 {
 	unsigned int		count;		/* byte count of bwrite */
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
@@ -1941,12 +1944,20 @@ xlog_sync(
 
 	count = xlog_calc_iclog_size(log, iclog, &roundoff);
 
-	/* move grant heads by roundoff in sync */
-	xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
-	xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	/*
+	 * If we have a ticket, account for the roundoff via the ticket
+	 * reservation to avoid touching the hot grant heads needlessly.
+	 * Otherwise, we have to move grant heads directly.
+	 */
+	if (ticket) {
+		ticket->t_curr_res -= roundoff;
+	} else {
+		xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
+		xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	}
 
 	/* put cycle number in every block */
-	xlog_pack_data(log, iclog, roundoff); 
+	xlog_pack_data(log, iclog, roundoff);
 
 	/* real byte length */
 	size = iclog->ic_offset;
@@ -2187,7 +2198,7 @@ xlog_write_get_more_iclog_space(
 	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
 	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 	       iclog->ic_state == XLOG_STATE_IOERROR);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
@@ -2470,7 +2481,7 @@ xlog_write(
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, ticket);
 	}
 	spin_unlock(&log->l_icloglock);
 
@@ -2929,7 +2940,7 @@ xlog_state_get_iclog_space(
 		 * reference to the iclog.
 		 */
 		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-			error = xlog_state_release_iclog(log, iclog);
+			error = xlog_state_release_iclog(log, iclog, ticket);
 		spin_unlock(&log->l_icloglock);
 		if (error)
 			return error;
@@ -3157,7 +3168,7 @@ xfs_log_force(
 			atomic_inc(&iclog->ic_refcnt);
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog))
+			if (xlog_state_release_iclog(log, iclog, NULL))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3250,7 +3261,7 @@ xlog_force_lsn(
 		}
 		atomic_inc(&iclog->ic_refcnt);
 		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog))
+		if (xlog_state_release_iclog(log, iclog, NULL))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d60c72ad391a..aef60f19ab05 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -804,6 +804,7 @@ xlog_cil_push_work(
 	int			cpu;
 	struct xlog_cil_pcp	*cilpcp;
 	LIST_HEAD		(log_items);
+	struct xlog_ticket	*ticket;
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -1037,12 +1038,10 @@ xlog_cil_push_work(
 	if (error)
 		goto out_abort_free_ticket;
 
-	xfs_log_ticket_ungrant(log, ctx->ticket);
-
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
 		spin_unlock(&commit_iclog->ic_callback_lock);
-		goto out_abort;
+		goto out_abort_free_ticket;
 	}
 	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
 		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
@@ -1073,12 +1072,23 @@ xlog_cil_push_work(
 		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 	}
 
+	/*
+	 * Pull the ticket off the ctx so we can ungrant it after releasing the
+	 * commit_iclog. The ctx may be freed by the time we return from
+	 * releasing the commit_iclog (i.e. checkpoint has been completed and
+	 * callback run) so we can't reference the ctx after the call to
+	 * xlog_state_release_iclog().
+	 */
+	ticket = ctx->ticket;
+
 	/* release the hounds! */
 	spin_lock(&log->l_icloglock);
 	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(log, commit_iclog, 0);
-	xlog_state_release_iclog(log, commit_iclog);
+	xlog_state_release_iclog(log, commit_iclog, ticket);
 	spin_unlock(&log->l_icloglock);
+
+	xfs_log_ticket_ungrant(log, ticket);
 	return;
 
 out_skip:
@@ -1089,7 +1099,6 @@ xlog_cil_push_work(
 
 out_abort_free_ticket:
 	xfs_log_ticket_ungrant(log, ctx->ticket);
-out_abort:
 	ASSERT(XLOG_FORCED_SHUTDOWN(log));
 	xlog_cil_committed(ctx);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 6a4160200417..3d43d3940757 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -487,7 +487,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
 		int eventual_size);
-int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
+int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog,
+		struct xlog_ticket *ticket);
 
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 45/45] xfs: expanding delayed logging design with background material
  2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
                   ` (43 preceding siblings ...)
  2021-03-05  5:11 ` [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
@ 2021-03-05  5:11 ` Dave Chinner
  2021-03-11  2:30   ` Darrick J. Wong
  44 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-05  5:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

I wrote up a description of how transactions, space reservations and
relogging work together in response to a question for background
material on the delayed logging design. Add this to the existing
document for ease of future reference.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 .../xfs-delayed-logging-design.rst            | 362 ++++++++++++++++--
 1 file changed, 323 insertions(+), 39 deletions(-)

diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
index 464405d2801e..e02235911ff3 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
@@ -1,29 +1,315 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==========================
-XFS Delayed Logging Design
-==========================
-
-Introduction to Re-logging in XFS
-=================================
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on. While originally focussed on just the design of
+the delayed logging extension introduced in 2010, it assumed the reader already
+had a fair amount of in-depth knowledge about how XFS transactions are formed
+and executed. It also largely omitted any details of how journal space
+reservations are accounted for to ensure the operation of the logging subsystem
+is guaranteed to be deadlock-free.
+
+Much of the original document is retained unmodified because it is still valid
+and correct. It also allows the new background material to avoid long
+descriptions for various algorithms because they are already well documented
+in the original document (e.g. what "relogging" is and why it is needed).
+
+Hence we start with an overview of transactions, followed by the way
+transaction reservations are structured and accounted, and then move into how
+we guarantee forwards progress for long running transactions with finite
+initial reservation bounds. At this point we need to explain how relogging
+works, and that is where the original document started.
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. And long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so
+keeping the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document. It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together are different and are dependent on the object and/or modification being
+performed. The logging subsystem only cares that certain specific rules are
+followed to guarantee forwards progress and prevent deadlocks.
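+
+For example, a long running operation such as freeing a large extent chains
+its steps together with an intent/done pair of log items (sketched below; in
+XFS these are the EFI and EFD log items)::
+
+	Transaction A: log an intent to free extent X
+	Transaction B: free part of extent X, log the "done" item for the
+		       intent, then log a new intent for the remainder
+
+If the system fails between the two transactions, recovery finds an intent
+without a matching done item and knows it must finish freeing extent X.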
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transaction reservations can be used for one-shot
+transactions, but one-shot reservations cannot be used for permanent
+transactions. Reservations must be matched to the modification taking place.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+	tp = xfs_trans_alloc(<reservation>)
+	<lock items>
+	<do modification>
+	xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle.
+Once the transaction is committed, all resources joined to it are released,
+along with the remaining unused reservation space that was taken at the
+transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+	tp = xfs_trans_alloc(<reservation>)
+	xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+	loop {
+		xfs_trans_ijoin(tp, ip, 0);
+		<do modification>
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_roll(&tp);
+	}
+
+	xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+	ntp = xfs_trans_dup(tp);
+	xfs_trans_commit(tp);
+	xfs_log_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions.  Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stable storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first modification reached the
+journal.
+
+For one-shot operations that need to reach stable storage immediately, or for
+ensuring that a long running permanent transaction is fully committed once it is
+complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
+
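+For example, a modification that must be durable before the operation returns
+can be tagged before it is committed (a sketch using the existing helper)::
+
+	xfs_trans_set_sync(tp);		/* commit will issue a log force */
+	error = xfs_trans_commit(tp);
+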
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Then
+inserting the extent into the inode extent map requires modifying another btree,
+which might require more allocation that modifies the free space btrees again.
+Then we might have to update reverse mappings, which modifies yet another btree
+which might require more space. And so on.  Hence the amount of metadata that a
+"simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it.  Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space when we roll the transaction. This ensures that the common
+modifications we make only need to reserve log space once.
+
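+As a rough sketch, the space taken at transaction allocation is therefore
+(illustrative only - the real unit reservation calculation also accounts for
+log headers and roundoff)::
+
+	initial reservation = unit reservation * reservation log count
+
+with one unit consumed by each transaction roll until the count is exhausted.
+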
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log.  An LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do relatively simple LSN based math to keep track of
+available space in the log.
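+
+For example (illustrative C; the kernel's CYCLE_LSN() and BLOCK_LSN() macros
+perform the same extraction)::
+
+	lsn   = ((xfs_lsn_t)cycle << 32) | block;	/* block in basic blocks */
+	cycle = lsn >> 32;
+	block = lsn & 0xffffffff;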
+
+Log space accounting is done via a pair of constructs called "grant heads".  The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head. This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient space is available. This FIFO
+mechanism ensures no transaction is starved of resources when log space
+shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences come into play when a permanent transaction is rolled and the
+internal "log count" reaches zero and the initial unit reservations have been
+exhausted. At this point, we still require a log space reservation to continue
+the next transaction in the sequence, but we have none remaining. We cannot
+sleep during the transaction commit process waiting for new log space to become
+available, as we may end up on the end of the FIFO queue and the items we have
+locked while we sleep could end up pinning the tail of the log before there is
+enough free space in the log to fulfil all of the pending reservations and
+then wake up the transaction commit in progress.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. Once the commit completes, we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+	Code using permanent reservations must always log the items they hold
+	locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+the transaction chain is rolling are always relocated to the physical head of
+the log and so do not pin the tail of the log. If a locked item pins the tail of
+the log when we sleep on the write reservation, then we will deadlock the log as
+we cannot take the locks needed to write back that item and move the tail of the
+log forwards to free up write grant space. Re-logging the locked items avoids
+this deadlock and guarantees that the log reservation we are making cannot
+self-deadlock.
+
+If all rolling transactions obey this rule, then they can all make forwards
+progress independently because nothing will block the progress of the log
+tail moving forwards and hence ensuring that write grant space is always
+(eventually) made available to permanent transactions no matter how many times
+they roll.
+
+
+Re-logging Explained
+====================
+
+XFS allows multiple separate modifications to a single object to be carried in
+the log at any given time.  This allows the log to avoid needing to flush each
+change to disk before recording a new change to the object. XFS does this via a
+method called "re-logging". Conceptually, this is quite simple - all it requires
+is that any new change to the object is recorded with a *new copy* of all the
+existing changes in the new transaction that is written to the log.
 
 That is, if we have a sequence of changes A through to F, and the object was
 written to disk after change D, we would see in the log the following series
@@ -42,16 +328,13 @@ transaction::
 In other words, each time an object is relogged, the new transaction contains
 the aggregation of all the previous changes currently held only in the log.
 
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward.  This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
+This relogging technique allows objects to be moved forward in the log so that
+an object being relogged does not prevent the tail of the log from ever moving
+forward.  This can be seen in the table above by the changing (increasing) LSN
+of each subsequent transaction, and it's the technique that allows us to
+implement long-running, multiple-commit permanent transactions. 
 
-This relogging is also used to implement long-running, multiple-commit
-transactions.  These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
+A typical example of a rolling transaction is the removal of extents from an
 inode which can only be done at a rate of two extents per transaction because
 of reservation size limitations. Hence a rolling extent removal transaction
 keeps relogging the inode and btree buffers as they get modified in each
@@ -67,12 +350,13 @@ the log over and over again. Worse is the fact that objects tend to get
 dirtier as they get relogged, so each subsequent transaction is writing more
 metadata into the log.
 
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
+It should now also be obvious how relogging and asynchronous transactions go
+hand in hand. That is, transactions don't get written to the physical journal
+until either a log buffer is filled (a log buffer can hold multiple
+transactions) or a synchronous operation forces the log buffers holding the
+transactions to disk. This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
 
 The limitation on asynchronous transaction throughput is the number and size of
 log buffers made available by the log manager. By default there are 8 log
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* Re: [PATCH 03/45] xfs: separate CIL commit record IO
  2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
@ 2021-03-08  8:34   ` Chandan Babu R
  2021-03-15 14:40   ` Brian Foster
  2021-03-16  8:40   ` Christoph Hellwig
  2 siblings, 0 replies; 145+ messages in thread
From: Chandan Babu R @ 2021-03-08  8:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
>
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes require completion ordering against
> the other iclogs submitted by the checkpoint.
>
> If the entire checkpoint and commit are in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
>
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
>
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
>

I see that xlog_state_clean_iclog() wakes up tasks waiting on
iclog->ic_force_wait, and that xlog_state_clean_iclog() itself is invoked after
the corresponding iclog has been written to disk and the log vectors have been
moved to the AIL. Hence, waiting on iclog->ic_force_wait for the previous
iclogs to complete their I/O ensures that the commit record iclog is written
to disk only after the previous iclogs have been written.
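
That ordering corresponds to this fragment of the series (reproduced here in
simplified form; error handling and the shutdown cases are omitted):

	/*
	 * If the checkpoint spans multiple iclogs, wait for all previous
	 * iclogs to complete before submitting the commit record iclog.
	 */
	if (ctx->start_lsn != commit_lsn) {
		spin_lock(&log->l_icloglock);
		xlog_wait_on_iclog(commit_iclog->ic_prev);
	}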

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

--
chandan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
@ 2021-03-08  9:31   ` Chandan Babu R
  2021-03-08 22:21   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 145+ messages in thread
From: Chandan Babu R @ 2021-03-08  9:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> It's a one-line wrapper around blkdev_issue_flush(). Just replace it
> with direct calls to blkdev_issue_flush().
>

Looks good.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

-- 
chandan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
@ 2021-03-08  9:48   ` Chandan Babu R
  2021-03-08 22:24     ` Darrick J. Wong
  2021-03-08 22:26   ` Darrick J. Wong
  2021-03-15 14:42   ` Brian Foster
  2 siblings, 1 reply; 145+ messages in thread
From: Chandan Babu R @ 2021-03-08  9:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> > The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
>
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
>
> > To do this, we need async cache flush primitives to submit the cache
> > flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  2 ++
>  2 files changed, 38 insertions(+)
>
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 17f36db2f792..668f8bd27b4a 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return bio_max_segs(howmany(count, PAGE_SIZE));
>  }
>  
> +void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the request queue does
> > + * not require flush operations, just skip it altogether. If the caller needs
> + * to wait for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.  The
> + * caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct bio		*bio,
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct request_queue	*q = bdev->bd_disk->queue;
> +
> +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> +		complete(done);

complete() should be invoked only when "done" has a non-NULL value.

> +		return;
> +	}

-- 
chandan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-03-08 10:49   ` Chandan Babu R
  2021-03-08 12:25   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Chandan Babu R @ 2021-03-08 10:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. The two ordering constraints are:
>
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
>
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
>
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
>
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
>
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> ("Add support for drive write cache flushing") in the xfs-archive
> tree.
>
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
>
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
>
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not on any other
> journal IO done within the context of that journal transaction.
>
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case. As an added bonus,
> this ordering mechanism works for both internal and external logs,
> meaning we can remove the explicit data device cache flushes from
> the iclog write code when using external logs.
>
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We
> also require unmount records to do this. Hence for both
> XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
> to mark the first iclog being written with REQ_PREFLUSH.
>
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we want to also mark the iclogs
> that contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
>
> And for external logs, unmount records need a pre-write data device
> cache flush similar to the CIL checkpoint cache pre-flush as the
> internal iclog write code does not do this implicitly anymore.
>
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
>
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
>
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
>
> As a result:
>
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
>
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
>
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
>

I see that the missing preflush w.r.t. previous iclogs of a multi-iclog
checkpoint transaction has been handled in this version. Hence,
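
For reference, the core of the new flagging, condensed from the patch
(comments added here for illustration only):

	/* xlog_write(): mark iclogs carrying commit/unmount records */
	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);

	/* xlog_write_iclog(): translate the flags into bio op flags */
	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;	/* prior writes stable */
	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
		iclog->ic_bio.bi_opf |= REQ_FUA;	/* this write stable */
	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);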

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

-- 
chandan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
  2021-03-08 10:49   ` Chandan Babu R
@ 2021-03-08 12:25   ` Brian Foster
  2021-03-09  1:13     ` Dave Chinner
  1 sibling, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-08 12:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:06PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
> 
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
> 
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
> 
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
> 
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
> 
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> ("Add support for drive write cache flushing") in the xfs-archive
> tree.
> 
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
> 
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
> 
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not on any other
> journal IO done within the context of that journal transaction.
> 
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case. As an added bonus,
> this ordering mechanism works for both internal and external logs,
> meaning we can remove the explicit data device cache flushes from
> the iclog write code when using external logs.
> 
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We
> also require unmount records to do this. Hence for both
> XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
> to mark the first iclog being written with REQ_PREFLUSH.
> 
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we want to also mark the iclogs
> that contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
> 
> And for external logs, unmount records need a pre-write data device
> cache flush similar to the CIL checkpoint cache pre-flush as the
> internal iclog write code does not do this implicitly anymore.
> 
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
> 
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
> 
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
> 
> As a result:
> 
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
> 
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
> 
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Thoughts on my previous feedback to this patch, particularly the locking
bits..? I thought I saw a subsequent patch somewhere that increased the
parallelism of this code..

Brian

>  fs/xfs/xfs_log.c      | 53 +++++++++++++++++++++++--------------------
>  fs/xfs/xfs_log_cil.c  |  7 +++++-
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  3 files changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 364694a83de6..ed44d67d7099 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -835,6 +835,14 @@ xlog_write_unmount_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
> +
> +	/*
> +	 * For external log devices, we need to flush the data device cache
> +	 * first to ensure all metadata writeback is on stable storage before we
> +	 * stamp the tail LSN into the unmount record.
> +	 */
> +	if (log->l_targ != log->l_mp->m_ddev_targp)
> +		blkdev_issue_flush(log->l_targ->bt_bdev);
>  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
> @@ -1753,8 +1761,7 @@ xlog_write_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
>  	uint64_t		bno,
> -	unsigned int		count,
> -	bool			need_flush)
> +	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
>  
> @@ -1792,10 +1799,12 @@ xlog_write_iclog(
>  	 * writeback throttle from throttling log writes behind background
>  	 * metadata writeback and causing priority inversions.
>  	 */
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> -				REQ_IDLE | REQ_FUA;
> -	if (need_flush)
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
> +		iclog->ic_bio.bi_opf |= REQ_FUA;
> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -1898,7 +1907,6 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>  
> @@ -1923,10 +1931,8 @@ xlog_sync(
>  	bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn));
>  
>  	/* Do we need to split this write into 2 parts? */
> -	if (bno + BTOBB(count) > log->l_logBBsize) {
> +	if (bno + BTOBB(count) > log->l_logBBsize)
>  		xlog_split_iclog(log, &iclog->ic_header, bno, count);
> -		split = true;
> -	}
>  
>  	/* calculcate the checksum */
>  	iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header,
> @@ -1947,22 +1953,8 @@ xlog_sync(
>  			 be64_to_cpu(iclog->ic_header.h_lsn));
>  	}
>  #endif
> -
> -	/*
> -	 * Flush the data device before flushing the log to make sure all meta
> -	 * data written back from the AIL actually made it to disk before
> -	 * stamping the new log tail LSN into the log buffer.  For an external
> -	 * log we need to issue the flush explicitly, and unfortunately
> -	 * synchronously here; for an internal log we can simply use the block
> -	 * layer state machine for preflushes.
> -	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> -	}
> -
>  	xlog_verify_iclog(log, iclog, count);
> -	xlog_write_iclog(log, iclog, bno, count, need_flush);
> +	xlog_write_iclog(log, iclog, bno, count);
>  }
>  
>  /*
> @@ -2416,10 +2408,21 @@ xlog_write(
>  		ASSERT(log_offset <= iclog->ic_size - 1);
>  		ptr = iclog->ic_datap + log_offset;
>  
> -		/* start_lsn is the first lsn written to. That's all we need. */
> +		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> +		/*
> +		 * iclogs containing commit records or unmount records need
> +		 * to issue ordering cache flushes and commit immediately
> +		 * to stable storage to guarantee journal vs metadata ordering
> +		 * is correctly maintained in the storage media.
> +		 */
> +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> +						XLOG_ICL_NEED_FUA);
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index c04d5d37a3a2..263c8d907221 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -896,11 +896,16 @@ xlog_cil_push_work(
>  
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> +	 * cache flush because there are no other iclogs we need to order
> +	 * against.
>  	 */
>  	if (ctx->start_lsn != commit_lsn) {
>  		spin_lock(&log->l_icloglock);
>  		xlog_wait_on_iclog(commit_iclog->ic_prev);
> +	} else {
> +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}
>  
>  	/* release the hounds! */
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 56e1942c47df..0552e96d2b64 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -133,6 +133,9 @@ enum xlog_iclog_state {
>  
>  #define XLOG_COVER_OPS		5
>  
> +#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
> +#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
> +
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  
> @@ -201,6 +204,7 @@ typedef struct xlog_in_core {
>  	u32			ic_size;
>  	u32			ic_offset;
>  	enum xlog_iclog_state	ic_state;
> +	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
>  
>  	/* Callback structures need their own cacheline */
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 01/45] xfs: initialise attr fork on inode create
  2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
@ 2021-03-08 22:20   ` Darrick J. Wong
  2021-03-16  8:35   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 22:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:10:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we allocate a new inode, we often need to add an attribute to
> the inode as part of the create. This can happen as a result of
> needing to add default ACLs or security labels before the inode is
> made visible to userspace.
> 
> This is highly inefficient right now. We do the create transaction
> to allocate the inode, then we do an "add attr fork" transaction to
> modify the just created empty inode to set the inode fork offset to
> allow attributes to be stored, then we go and do the attribute
> creation.
> 
> This means 3 transactions instead of 1 to allocate an inode, and
> this greatly increases the load on the CIL commit code, resulting in
> excessive contention on the CIL spin locks and performance
> degradation:
> 
>  18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
>   3.57%  [kernel]                [k] do_raw_spin_lock
>   2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
>   2.48%  [kernel]                [k] memcpy
>   2.34%  [kernel]                [k] xfs_log_commit_cil
> 
> The typical profile resulting from running fsmark on a selinux-enabled
> filesystem adds this overhead to the create path:
> 
>   - 15.30% xfs_init_security
>      - 15.23% security_inode_init_security
> 	- 13.05% xfs_initxattrs
> 	   - 12.94% xfs_attr_set
> 	      - 6.75% xfs_bmap_add_attrfork
> 		 - 5.51% xfs_trans_commit
> 		    - 5.48% __xfs_trans_commit
> 		       - 5.35% xfs_log_commit_cil
> 			  - 3.86% _raw_spin_lock
> 			     - do_raw_spin_lock
> 				  __pv_queued_spin_lock_slowpath
> 		 - 0.70% xfs_trans_alloc
> 		      0.52% xfs_trans_reserve
> 	      - 5.41% xfs_attr_set_args
> 		 - 5.39% xfs_attr_set_shortform.constprop.0
> 		    - 4.46% xfs_trans_commit
> 		       - 4.46% __xfs_trans_commit
> 			  - 4.33% xfs_log_commit_cil
> 			     - 2.74% _raw_spin_lock
> 				- do_raw_spin_lock
> 				     __pv_queued_spin_lock_slowpath
> 			       0.60% xfs_inode_item_format
> 		      0.90% xfs_attr_try_sf_addname
> 	- 1.99% selinux_inode_init_security
> 	   - 1.02% security_sid_to_context_force
> 	      - 1.00% security_sid_to_context_core
> 		 - 0.92% sidtab_entry_to_string
> 		    - 0.90% sidtab_sid2str_get
> 			 0.59% sidtab_sid2str_put.part.0
> 	   - 0.82% selinux_determine_inode_label
> 	      - 0.77% security_transition_sid
> 		   0.70% security_compute_sid.part.0
> 
> And fsmark creation rate performance drops by ~25%. The key point to
> note here is that half the additional overhead comes from adding the
> attribute fork to the newly created inode. That's crazy, considering
> we can do this same thing at inode create time with a couple of
> lines of code and no extra overhead.
> 
> So, if we know we are going to add an attribute immediately after
> creating the inode, let's just initialise the attribute fork inside
> the create transaction and chop that whole chunk of code out of
> the create fast path. This completely removes the performance
> drop caused by enabling SELinux, and the profile looks like:
> 
>      - 8.99% xfs_init_security
>          - 9.00% security_inode_init_security
>             - 6.43% xfs_initxattrs
>                - 6.37% xfs_attr_set
>                   - 5.45% xfs_attr_set_args
>                      - 5.42% xfs_attr_set_shortform.constprop.0
>                         - 4.51% xfs_trans_commit
>                            - 4.54% __xfs_trans_commit
>                               - 4.59% xfs_log_commit_cil
>                                  - 2.67% _raw_spin_lock
>                                     - 3.28% do_raw_spin_lock
>                                          3.08% __pv_queued_spin_lock_slowpath
>                                    0.66% xfs_inode_item_format
>                         - 0.90% xfs_attr_try_sf_addname
>                   - 0.60% xfs_trans_alloc
>             - 2.35% selinux_inode_init_security
>                - 1.25% security_sid_to_context_force
>                   - 1.21% security_sid_to_context_core
>                      - 1.19% sidtab_entry_to_string
>                         - 1.20% sidtab_sid2str_get
>                            - 0.86% sidtab_sid2str_put.part.0
>                               - 0.62% _raw_spin_lock_irqsave
>                                  - 0.77% do_raw_spin_lock
>                                       __pv_queued_spin_lock_slowpath
>                - 0.84% selinux_determine_inode_label
>                   - 0.83% security_transition_sid
>                        0.86% security_compute_sid.part.0
> 
> Which indicates the XFS overhead of creating the selinux xattr has
> been halved. This doesn't fix the CIL lock contention problem, just
> means it's not a limiting factor for this workload. Lock contention
> in the security subsystems is going to be an issue soon, though...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good to me now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c       |  9 ++++-----
>  fs/xfs/libxfs/xfs_inode_fork.c | 20 +++++++++++++++-----
>  fs/xfs/libxfs/xfs_inode_fork.h |  2 ++
>  fs/xfs/xfs_inode.c             | 24 +++++++++++++++++++++---
>  fs/xfs/xfs_inode.h             |  6 ++++--
>  fs/xfs/xfs_iops.c              | 34 +++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_qm.c                |  2 +-
>  fs/xfs/xfs_symlink.c           |  2 +-
>  8 files changed, 81 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index e0905ad171f0..5574d345d066 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1027,7 +1027,9 @@ xfs_bmap_add_attrfork_local(
>  	return -EFSCORRUPTED;
>  }
>  
> -/* Set an inode attr fork off based on the format */
> +/*
> + * Set an inode attr fork offset based on the format of the data fork.
> + */
>  int
>  xfs_bmap_set_attrforkoff(
>  	struct xfs_inode	*ip,
> @@ -1092,10 +1094,7 @@ xfs_bmap_add_attrfork(
>  		goto trans_cancel;
>  	ASSERT(ip->i_afp == NULL);
>  
> -	ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone,
> -				      GFP_KERNEL | __GFP_NOFAIL);
> -
> -	ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
> +	ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
>  	ip->i_afp->if_flags = XFS_IFEXTENTS;
>  	logflags = 0;
>  	switch (ip->i_df.if_format) {
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index e080d7e07643..c606c1a77e5a 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -282,6 +282,19 @@ xfs_dfork_attr_shortform_size(
>  	return be16_to_cpu(atp->hdr.totsize);
>  }
>  
> +struct xfs_ifork *
> +xfs_ifork_alloc(
> +	enum xfs_dinode_fmt	format,
> +	xfs_extnum_t		nextents)
> +{
> +	struct xfs_ifork	*ifp;
> +
> +	ifp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL);
> +	ifp->if_format = format;
> +	ifp->if_nextents = nextents;
> +	return ifp;
> +}
> +
>  int
>  xfs_iformat_attr_fork(
>  	struct xfs_inode	*ip,
> @@ -293,11 +306,8 @@ xfs_iformat_attr_fork(
>  	 * Initialize the extent count early, as the per-format routines may
>  	 * depend on it.
>  	 */
> -	ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL);
> -	ip->i_afp->if_format = dip->di_aformat;
> -	if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */
> -		ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
> -	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents);
> +	ip->i_afp = xfs_ifork_alloc(dip->di_aformat,
> +				be16_to_cpu(dip->di_anextents));
>  
>  	switch (ip->i_afp->if_format) {
>  	case XFS_DINODE_FMT_LOCAL:
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index 9e2137cd7372..a0717ab0e5c5 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -141,6 +141,8 @@ static inline int8_t xfs_ifork_format(struct xfs_ifork *ifp)
>  	return ifp->if_format;
>  }
>  
> +struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format,
> +				xfs_extnum_t nextents);
>  struct xfs_ifork *xfs_iext_state_to_fork(struct xfs_inode *ip, int state);
>  
>  int		xfs_iformat_data_fork(struct xfs_inode *, struct xfs_dinode *);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 46a861d55e48..bed2beb169e4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -774,6 +774,7 @@ xfs_init_new_inode(
>  	xfs_nlink_t		nlink,
>  	dev_t			rdev,
>  	prid_t			prid,
> +	bool			init_xattrs,
>  	struct xfs_inode	**ipp)
>  {
>  	struct inode		*dir = pip ? VFS_I(pip) : NULL;
> @@ -877,6 +878,20 @@ xfs_init_new_inode(
>  		ASSERT(0);
>  	}
>  
> +	/*
> +	 * If we need to create attributes immediately after allocating the
> +	 * inode, initialise an empty attribute fork right now. We use the
> +	 * default fork offset for attributes here as we don't know exactly what
> +	 * size or how many attributes we might be adding. We can do this
> +	 * safely here because we know the data fork is completely empty and
> +	 * this saves us from needing to run a separate transaction to set the
> +	 * fork offset in the immediate future.
> +	 */
> +	if (init_xattrs) {
> +		ip->i_d.di_forkoff = xfs_default_attroffset(ip) >> 3;
> +		ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
> +	}
> +
>  	/*
>  	 * Log the new values stuffed into the inode.
>  	 */
> @@ -910,6 +925,7 @@ xfs_dir_ialloc(
>  	xfs_nlink_t		nlink,
>  	dev_t			rdev,
>  	prid_t			prid,
> +	bool			init_xattrs,
>  	struct xfs_inode	**ipp)
>  {
>  	struct xfs_buf		*agibp;
> @@ -937,7 +953,7 @@ xfs_dir_ialloc(
>  	ASSERT(ino != NULLFSINO);
>  
>  	return xfs_init_new_inode(mnt_userns, *tpp, dp, ino, mode, nlink, rdev,
> -				  prid, ipp);
> +				  prid, init_xattrs, ipp);
>  }
>  
>  /*
> @@ -982,6 +998,7 @@ xfs_create(
>  	struct xfs_name		*name,
>  	umode_t			mode,
>  	dev_t			rdev,
> +	bool			init_xattrs,
>  	xfs_inode_t		**ipp)
>  {
>  	int			is_dir = S_ISDIR(mode);
> @@ -1052,7 +1069,7 @@ xfs_create(
>  	 * pointing to itself.
>  	 */
>  	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, is_dir ? 2 : 1, rdev,
> -			       prid, &ip);
> +			       prid, init_xattrs, &ip);
>  	if (error)
>  		goto out_trans_cancel;
>  
> @@ -1171,7 +1188,8 @@ xfs_create_tmpfile(
>  	if (error)
>  		goto out_release_dquots;
>  
> -	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid, &ip);
> +	error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid,
> +				false, &ip);
>  	if (error)
>  		goto out_trans_cancel;
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 88ee4c3930ae..a2cacdb76d55 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -371,7 +371,8 @@ int		xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
>  			   struct xfs_inode **ipp, struct xfs_name *ci_name);
>  int		xfs_create(struct user_namespace *mnt_userns,
>  			   struct xfs_inode *dp, struct xfs_name *name,
> -			   umode_t mode, dev_t rdev, struct xfs_inode **ipp);
> +			   umode_t mode, dev_t rdev, bool need_xattr,
> +			   struct xfs_inode **ipp);
>  int		xfs_create_tmpfile(struct user_namespace *mnt_userns,
>  			   struct xfs_inode *dp, umode_t mode,
>  			   struct xfs_inode **ipp);
> @@ -413,7 +414,8 @@ xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
>  int		xfs_dir_ialloc(struct user_namespace *mnt_userns,
>  			       struct xfs_trans **tpp, struct xfs_inode *dp,
>  			       umode_t mode, xfs_nlink_t nlink, dev_t dev,
> -			       prid_t prid, struct xfs_inode **ipp);
> +			       prid_t prid, bool need_xattr,
> +			       struct xfs_inode **ipp);
>  
>  static inline int
>  xfs_itruncate_extents(
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66ebccb5a6ff..a9d466b78646 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -126,6 +126,37 @@ xfs_cleanup_inode(
>  	xfs_remove(XFS_I(dir), &teardown, XFS_I(inode));
>  }
>  
> +/*
> + * Check to see if we are likely to need an extended attribute to be added to
> + * the inode we are about to allocate. This allows the attribute fork to be
> + * created during the inode allocation, reducing the number of transactions we
> + * need to do in this fast path.
> + *
> + * The security checks are optimistic, but not guaranteed. The two LSMs that
> + * require xattrs to be added here (selinux and smack) are also the only two
> + * LSMs that add a sb->s_security structure to the superblock. Hence if security
> + * is enabled and sb->s_security is set, we have a pretty good idea that we are
> + * going to be asked to add a security xattr immediately after allocating the
> + * xfs inode and instantiating the VFS inode.
> + */
> +static inline bool
> +xfs_create_need_xattr(
> +	struct inode	*dir,
> +	struct posix_acl *default_acl,
> +	struct posix_acl *acl)
> +{
> +	if (acl)
> +		return true;
> +	if (default_acl)
> +		return true;
> +	if (!IS_ENABLED(CONFIG_SECURITY))
> +		return false;
> +	if (dir->i_sb->s_security)
> +		return true;
> +	return false;
> +}
> +
> +
>  STATIC int
>  xfs_generic_create(
>  	struct user_namespace	*mnt_userns,
> @@ -163,7 +194,8 @@ xfs_generic_create(
>  
>  	if (!tmpfile) {
>  		error = xfs_create(mnt_userns, XFS_I(dir), &name, mode, rdev,
> -				   &ip);
> +				xfs_create_need_xattr(dir, default_acl, acl),
> +				&ip);
>  	} else {
>  		error = xfs_create_tmpfile(mnt_userns, XFS_I(dir), mode, &ip);
>  	}
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index bfa4164990b1..6fde318b9fed 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -788,7 +788,7 @@ xfs_qm_qino_alloc(
>  
>  	if (need_alloc) {
>  		error = xfs_dir_ialloc(&init_user_ns, &tp, NULL, S_IFREG, 1, 0,
> -				       0, ipp);
> +				       0, false, ipp);
>  		if (error) {
>  			xfs_trans_cancel(tp);
>  			return error;
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 1379013d74b8..162cf69bd982 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -223,7 +223,7 @@ xfs_symlink(
>  	 * Allocate an inode for the symlink.
>  	 */
>  	error = xfs_dir_ialloc(mnt_userns, &tp, dp, S_IFLNK | (mode & ~S_IFMT),
> -			       1, 0, prid, &ip);
> +			       1, 0, prid, false, &ip);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
  2021-03-08  9:31   ` Chandan Babu R
@ 2021-03-08 22:21   ` Darrick J. Wong
  2021-03-15 14:40   ` Brian Foster
  2021-03-16  8:41   ` Christoph Hellwig
  3 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 22:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a one-line wrapper around blkdev_issue_flush(). Just replace it
> with direct calls to blkdev_issue_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Woot!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c   | 2 +-
>  fs/xfs/xfs_file.c  | 6 +++---
>  fs/xfs/xfs_log.c   | 2 +-
>  fs/xfs/xfs_super.c | 7 -------
>  fs/xfs/xfs_super.h | 1 -
>  5 files changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 37a1d12762d8..7043546a04b8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1958,7 +1958,7 @@ xfs_free_buftarg(
>  	percpu_counter_destroy(&btp->bt_io_count);
>  	list_lru_destroy(&btp->bt_lru);
>  
> -	xfs_blkdev_issue_flush(btp);
> +	blkdev_issue_flush(btp->bt_bdev);
>  
>  	kmem_free(btp);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a007ca0711d9..24c7f45fc4eb 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -197,9 +197,9 @@ xfs_file_fsync(
>  	 * inode size in case of an extending write.
>  	 */
>  	if (XFS_IS_REALTIME_INODE(ip))
> -		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
> +		blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>  
>  	/*
>  	 * Any inode that has dirty modifications in the log is pinned.  The
> @@ -219,7 +219,7 @@ xfs_file_fsync(
>  	 */
>  	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>  	    mp->m_logdev_targp == mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 317c466232d4..fee76c485727 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1962,7 +1962,7 @@ xlog_sync(
>  	 * layer state machine for preflushes.
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
> +		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
>  		need_flush = false;
>  	}
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e5e0713bebcd..ca2cb0448b5e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -339,13 +339,6 @@ xfs_blkdev_put(
>  		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>  }
>  
> -void
> -xfs_blkdev_issue_flush(
> -	xfs_buftarg_t		*buftarg)
> -{
> -	blkdev_issue_flush(buftarg->bt_bdev);
> -}
> -
>  STATIC void
>  xfs_close_devices(
>  	struct xfs_mount	*mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index 1ca484b8357f..79cb2dece811 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -88,7 +88,6 @@ struct block_device;
>  
>  extern void xfs_quiesce_attr(struct xfs_mount *mp);
>  extern void xfs_flush_inodes(struct xfs_mount *mp);
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
>  extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
>  					   xfs_agnumber_t agcount);
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-08  9:48   ` Chandan Babu R
@ 2021-03-08 22:24     ` Darrick J. Wong
  2021-03-15 14:41       ` Brian Foster
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 22:24 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Dave Chinner, linux-xfs

On Mon, Mar 08, 2021 at 03:18:09PM +0530, Chandan Babu R wrote:
> On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The new checkpoint cache flush mechanism requires us to issue an
> > unconditional cache flush before we start a new checkpoint. We don't
> > want to block for this if we can help it, and we have a fair chunk
> > of CPU work to do between starting the checkpoint and issuing the
> > first journal IO.
> >
> > Hence it makes sense to amortise the latency cost of the cache flush
> > by issuing it asynchronously and then waiting for it only when we
> > need to issue the first IO in the transaction.
> >
> > To do this, we need async cache flush primitives to submit the cache
> > flush bio and to wait on it. The block layer has no such primitives
> > for filesystems, so roll our own for the moment.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_linux.h  |  2 ++
> >  2 files changed, 38 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > index 17f36db2f792..668f8bd27b4a 100644
> > --- a/fs/xfs/xfs_bio_io.c
> > +++ b/fs/xfs/xfs_bio_io.c
> > @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> >  	return bio_max_segs(howmany(count, PAGE_SIZE));
> >  }
> >  
> > +void
> > +xfs_flush_bdev_async_endio(
> > +	struct bio	*bio)
> > +{
> > +	if (bio->bi_private)
> > +		complete(bio->bi_private);
> > +}
> > +
> > +/*
> > + * Submit a request for an async cache flush to run. If the request queue does
> > + * not require flush operations, just skip it altogether. If the caller needs
> > + * to wait for the flush completion at a later point in time, they must supply a
> > + * valid completion. This will be signalled when the flush completes.  The
> > + * caller never sees the bio that is issued here.
> > + */
> > +void
> > +xfs_flush_bdev_async(
> > +	struct bio		*bio,
> > +	struct block_device	*bdev,
> > +	struct completion	*done)
> > +{
> > +	struct request_queue	*q = bdev->bd_disk->queue;
> > +
> > +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> > +		complete(done);
> 
> complete() should be invoked only when "done" has a non-NULL value.

The only caller always provides a completion.

--D

> > +		return;
> > +	}
> 
> -- 
> chandan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
  2021-03-08  9:48   ` Chandan Babu R
@ 2021-03-08 22:26   ` Darrick J. Wong
  2021-03-15 14:42   ` Brian Foster
  2 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 22:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
> 
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
> 
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  2 ++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 17f36db2f792..668f8bd27b4a 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return bio_max_segs(howmany(count, PAGE_SIZE));
>  }
>  
> +void

static void?

> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);

Er... when would bi_private be null?  We always set it in
xfs_flush_bdev_async, and nobody else uses this helper, right?
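
In other words, assuming the completion is always supplied, the helper could
shrink to this (sketch):

	static void
	xfs_flush_bdev_async_endio(
		struct bio	*bio)
	{
		/* bi_private is always set by xfs_flush_bdev_async() */
		complete(bio->bi_private);
	}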

--D

> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the request queue does
> + * not require flush operations, just skip it altogether. If the caller needs
> + * to wait for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.  The
> + * caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct bio		*bio,
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct request_queue	*q = bdev->bd_disk->queue;
> +
> +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> +		complete(done);
> +		return;
> +	}
> +
> +	bio_init(bio, NULL, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +	bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
> +	submit_bio(bio);
> +}
>  int
>  xfs_rw_bdev(
>  	struct block_device	*bdev,
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index af6be9b9ccdf..953d98bc4832 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
> +void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
> +		struct completion *done);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-03-05  5:11 ` [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-03-08 22:53   ` Darrick J. Wong
  2021-03-11  0:26     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 22:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In doing an investigation into AIL push stalls, I was looking at the
> log force code to see if an async CIL push could be done instead.
> This lead me to xfs_log_force_lsn() and looking at how it works.
> 
> xfs_log_force_lsn() is only called from inode synchronisation
> contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> value as the LSN to sync the log to. This gets passed to
> xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> the journal.
> 
> The problem with this is that ip->i_itemp->ili_last_lsn does not store a
> log sequence number. What it stores is passed to it from the
> ->iop_committing method, which is called by xfs_log_commit_cil().
> The value this passes to the iop_committing method is the CIL
> context sequence number that the item was committed to.
> 
> As it turns out, xlog_cil_force_lsn() converts the sequence to an
> actual commit LSN for the related context and returns that to
> xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
> variable that contained a sequence with an actual LSN and then uses
> that to sync the iclogs.
> 
> This caused me some confusion for a while, even though I originally
> wrote all this code a decade ago. ->iop_committing is only used by
> a couple of log item types, and only inode items use the sequence
> number it is passed.
> 
> Let's clean up the API, CIL structures and inode log item to call it
> a sequence number, and make it clear that the high level code is
> using CIL sequence numbers and not on-disk LSNs for integrity
> synchronisation purposes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_types.h |  1 +
>  fs/xfs/xfs_buf_item.c     |  2 +-
>  fs/xfs/xfs_dquot_item.c   |  2 +-
>  fs/xfs/xfs_file.c         | 14 +++++++-------
>  fs/xfs/xfs_inode.c        | 10 +++++-----
>  fs/xfs/xfs_inode_item.c   |  4 ++--
>  fs/xfs/xfs_inode_item.h   |  2 +-
>  fs/xfs/xfs_log.c          | 27 ++++++++++++++-------------
>  fs/xfs/xfs_log.h          |  4 +---
>  fs/xfs/xfs_log_cil.c      | 22 +++++++++-------------
>  fs/xfs/xfs_log_priv.h     | 15 +++++++--------
>  fs/xfs/xfs_trans.c        |  6 +++---
>  fs/xfs/xfs_trans.h        |  4 ++--
>  13 files changed, 54 insertions(+), 59 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 064bd6e8c922..0870ef6f933d 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -21,6 +21,7 @@ typedef int32_t		xfs_suminfo_t;	/* type of bitmap summary info */
>  typedef uint32_t	xfs_rtword_t;	/* word type for bitmap manipulations */
>  
>  typedef int64_t		xfs_lsn_t;	/* log sequence number */
> +typedef int64_t		xfs_csn_t;	/* CIL sequence number */

I'm unfamiliar with the internal format of CIL sequence numbers.  Do
they have the same cycle:offset segmented structure as LSNs do?  Or are
they a simple linear integer that increases as we checkpoint committed
items?

Looking through the current code, I see a couple of places where we
initialize them to 1, and I also see that when we create a new cil
context we set its sequence to one more than the context that it will
replace.

I also see a bunch of comparisons of cil context sequence numbers that
use standard integer operators, but then I also see one instance of:

	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
		return false;
	return true;

in xfs_log_item_in_current_chkpt.  AFAICT this could be replaced with a
simple:

	return lip->li_seq == ctx->sequence;

But the fact that we're using LSN_CMP in /one/ place sticks out like a
sore thumb to me, and now I'm confused.  Is my understanding incorrect,
or is this operator use incorrect?
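
For context, XFS_LSN_CMP compares the packed cycle:block halves of an
xfs_lsn_t, roughly like this (paraphrased from xfs_log.h from memory, so
treat it as illustrative):

	/* an xfs_lsn_t packs a 32-bit cycle above a 32-bit block number */
	if (CYCLE_LSN(lsn1) != CYCLE_LSN(lsn2))
		return CYCLE_LSN(lsn1) < CYCLE_LSN(lsn2) ? -999 : 999;
	if (BLOCK_LSN(lsn1) != BLOCK_LSN(lsn2))
		return BLOCK_LSN(lsn1) < BLOCK_LSN(lsn2) ? -1 : 1;
	return 0;

If that's the right shape, the comparison only returns 0 when the two 64-bit
values are equal, so the plain equality test above would behave identically --
which would make the XFS_LSN_CMP use merely confusing rather than incorrect.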

>  typedef uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
>  typedef uint32_t	xfs_dahash_t;	/* dir/attr hash value */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 14d1fefcbf4c..1cb087b320b1 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -713,7 +713,7 @@ xfs_buf_item_release(
>  STATIC void
>  xfs_buf_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
>  	return xfs_buf_item_release(lip);
>  }
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 8c1fdf37ee8f..8ed47b739b6c 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
>  STATIC void
>  xfs_qm_dquot_logitem_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
>  	return xfs_qm_dquot_logitem_release(lip);

Weird, I didn't know that you could use return like this -- both
_release and _committing return void.

--D

>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 24c7f45fc4eb..ac3120dfe477 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -119,8 +119,8 @@ xfs_dir_fsync(
>  	return xfs_log_force_inode(ip);
>  }
>  
> -static xfs_lsn_t
> -xfs_fsync_lsn(
> +static xfs_csn_t
> +xfs_fsync_seq(
>  	struct xfs_inode	*ip,
>  	bool			datasync)
>  {
> @@ -128,7 +128,7 @@ xfs_fsync_lsn(
>  		return 0;
>  	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>  		return 0;
> -	return ip->i_itemp->ili_last_lsn;
> +	return ip->i_itemp->ili_commit_seq;
>  }
>  
>  /*
> @@ -151,12 +151,12 @@ xfs_fsync_flush_log(
>  	int			*log_flushed)
>  {
>  	int			error = 0;
> -	xfs_lsn_t		lsn;
> +	xfs_csn_t		seq;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> -	lsn = xfs_fsync_lsn(ip, datasync);
> -	if (lsn) {
> -		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
> +	seq = xfs_fsync_seq(ip, datasync);
> +	if (seq) {
> +		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
>  					  log_flushed);
>  
>  		spin_lock(&ip->i_itemp->ili_lock);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index bed2beb169e4..1c2ef1f1859a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2644,7 +2644,7 @@ xfs_iunpin(
>  	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
>  
>  	/* Give the log a push to start the unpinning I/O */
> -	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
> +	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
>  
>  }
>  
> @@ -3652,16 +3652,16 @@ int
>  xfs_log_force_inode(
>  	struct xfs_inode	*ip)
>  {
> -	xfs_lsn_t		lsn = 0;
> +	xfs_csn_t		seq = 0;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip))
> -		lsn = ip->i_itemp->ili_last_lsn;
> +		seq = ip->i_itemp->ili_commit_seq;
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> -	if (!lsn)
> +	if (!seq)
>  		return 0;
> -	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
> +	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6ff91e5bf3cd..3aba4559469f 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -617,9 +617,9 @@ xfs_inode_item_committed(
>  STATIC void
>  xfs_inode_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
> -	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
> +	INODE_ITEM(lip)->ili_commit_seq = seq;
>  	return xfs_inode_item_release(lip);
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4b926e32831c..403b45ab9aa2 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -33,7 +33,7 @@ struct xfs_inode_log_item {
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> +	xfs_csn_t		ili_commit_seq;	   /* last transaction commit */
>  };
>  
>  static inline int xfs_inode_clean(struct xfs_inode *ip)
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index ed44d67d7099..145db0f88060 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3273,14 +3273,13 @@ xfs_log_force(
>  }
>  
>  static int
> -__xfs_log_force_lsn(
> -	struct xfs_mount	*mp,
> +xlog_force_lsn(
> +	struct xlog		*log,
>  	xfs_lsn_t		lsn,
>  	uint			flags,
>  	int			*log_flushed,
>  	bool			already_slept)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
>  
>  	spin_lock(&log->l_icloglock);
> @@ -3313,8 +3312,6 @@ __xfs_log_force_lsn(
>  		if (!already_slept &&
>  		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
>  		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
> -			XFS_STATS_INC(mp, xs_log_force_sleep);
> -
>  			xlog_wait(&iclog->ic_prev->ic_write_wait,
>  					&log->l_icloglock);
>  			return -EAGAIN;
> @@ -3352,25 +3349,29 @@ __xfs_log_force_lsn(
>   * to disk, that thread will wake up all threads waiting on the queue.
>   */
>  int
> -xfs_log_force_lsn(
> +xfs_log_force_seq(
>  	struct xfs_mount	*mp,
> -	xfs_lsn_t		lsn,
> +	xfs_csn_t		seq,
>  	uint			flags,
>  	int			*log_flushed)
>  {
> +	struct xlog		*log = mp->m_log;
> +	xfs_lsn_t		lsn;
>  	int			ret;
> -	ASSERT(lsn != 0);
> +	ASSERT(seq != 0);
>  
>  	XFS_STATS_INC(mp, xs_log_force);
> -	trace_xfs_log_force(mp, lsn, _RET_IP_);
> +	trace_xfs_log_force(mp, seq, _RET_IP_);
>  
> -	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
> +	lsn = xlog_cil_force_seq(log, seq);
>  	if (lsn == NULLCOMMITLSN)
>  		return 0;
>  
> -	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
> -	if (ret == -EAGAIN)
> -		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
> +	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
> +	if (ret == -EAGAIN) {
> +		XFS_STATS_INC(mp, xs_log_force_sleep);
> +		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
> +	}
>  	return ret;
>  }
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 044e02cb8921..ba96f4ad9576 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -107,7 +106,7 @@ struct xfs_item_ops;
>  struct xfs_trans;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
> -int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
> +int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
>  		int *log_forced);
>  int	  xfs_log_mount(struct xfs_mount	*mp,
>  			struct xfs_buftarg	*log_target,
> @@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
>  void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
>  
> -void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
> -				xfs_lsn_t *commit_lsn, bool regrant);
>  void	xlog_cil_process_committed(struct list_head *list);
>  bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
>  
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 2f0adc35d8ec..44bb7cc17541 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -794,7 +794,7 @@ xlog_cil_push_work(
>  	 * that higher sequences will wait for us to write out a commit record
>  	 * before they do.
>  	 *
> -	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> +	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
>  	 * structure atomically with the addition of this sequence to the
>  	 * committing list. This also ensures that we can do unlocked checks
>  	 * against the current sequence in log forces without risking
> @@ -1058,16 +1058,14 @@ xlog_cil_empty(
>   * allowed again.
>   */
>  void
> -xfs_log_commit_cil(
> -	struct xfs_mount	*mp,
> +xlog_cil_commit(
> +	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	xfs_lsn_t		*commit_lsn,
> +	xfs_csn_t		*commit_seq,
>  	bool			regrant)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> -	xfs_lsn_t		xc_commit_lsn;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1081,10 +1079,6 @@ xfs_log_commit_cil(
>  
>  	xlog_cil_insert_items(log, tp);
>  
> -	xc_commit_lsn = cil->xc_ctx->sequence;
> -	if (commit_lsn)
> -		*commit_lsn = xc_commit_lsn;
> -
>  	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
>  	else
> @@ -1107,8 +1101,10 @@ xfs_log_commit_cil(
>  	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
>  		xfs_trans_del_item(lip);
>  		if (lip->li_ops->iop_committing)
> -			lip->li_ops->iop_committing(lip, xc_commit_lsn);
> +			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
>  	}
> +	if (commit_seq)
> +		*commit_seq = cil->xc_ctx->sequence;
>  
>  	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
>  	xlog_cil_push_background(log);
> @@ -1125,9 +1121,9 @@ xfs_log_commit_cil(
>   * iclog flush is necessary following this call.
>   */
>  xfs_lsn_t
> -xlog_cil_force_lsn(
> +xlog_cil_force_seq(
>  	struct xlog	*log,
> -	xfs_lsn_t	sequence)
> +	xfs_csn_t	sequence)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 0552e96d2b64..31ce2ce21e27 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -234,7 +234,7 @@ struct xfs_cil;
>  
>  struct xfs_cil_ctx {
>  	struct xfs_cil		*cil;
> -	xfs_lsn_t		sequence;	/* chkpt sequence # */
> +	xfs_csn_t		sequence;	/* chkpt sequence # */
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> @@ -272,10 +272,10 @@ struct xfs_cil {
>  	struct xfs_cil_ctx	*xc_ctx;
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
> -	xfs_lsn_t		xc_push_seq;
> +	xfs_csn_t		xc_push_seq;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
> -	xfs_lsn_t		xc_current_sequence;
> +	xfs_csn_t		xc_current_sequence;
>  	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
> @@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
>  void	xlog_cil_init_post_recovery(struct xlog *log);
>  void	xlog_cil_destroy(struct xlog *log);
>  bool	xlog_cil_empty(struct xlog *log);
> +void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
> +			xfs_csn_t *commit_seq, bool regrant);
>  
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t
> -xlog_cil_force_lsn(
> -	struct xlog *log,
> -	xfs_lsn_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index b22a09e9daee..21ac7c048380 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -851,7 +851,7 @@ __xfs_trans_commit(
>  	bool			regrant)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_lsn_t		commit_lsn = -1;
> +	xfs_csn_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
>  
> @@ -893,7 +893,7 @@ __xfs_trans_commit(
>  		xfs_trans_apply_sb_deltas(tp);
>  	xfs_trans_apply_dquot_deltas(tp);
>  
> -	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> +	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
>  
>  	xfs_trans_free(tp);
>  
> @@ -902,7 +902,7 @@ __xfs_trans_commit(
>  	 * log out now and wait for it.
>  	 */
>  	if (sync) {
> -		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
> +		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
>  	} else {
>  		XFS_STATS_INC(mp, xs_trans_async);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9dd745cf77c9..6276c7d251e6 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -43,7 +43,7 @@ struct xfs_log_item {
>  	struct list_head		li_cil;		/* CIL pointers */
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
> -	xfs_lsn_t			li_seq;		/* CIL commit seq */
> +	xfs_csn_t			li_seq;		/* CIL commit seq */
>  };
>  
>  /*
> @@ -69,7 +69,7 @@ struct xfs_item_ops {
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
>  	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> -	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> +	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
>  	int (*iop_recover)(struct xfs_log_item *lip,
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-05  5:11 ` [PATCH 15/45] xfs: CIL work is serialised, not pipelined Dave Chinner
@ 2021-03-08 23:14   ` Darrick J. Wong
  2021-03-08 23:38     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because we use a single work structure attached to the CIL rather
> than the CIL context, we can only queue a single work item at a
> time. This results in the CIL being single threaded and limits
> performance when it becomes CPU bound.
> 
> The design of the CIL is that it is pipelined and multiple commits
> can be running concurrently, but the way the work is currently
> implemented means that it is not pipelining as it was intended. The
> critical work to switch the CIL context can take a few milliseconds
> to run, but the rest of the CIL context flush can take hundreds of
> milliseconds to complete. The context switching is the serialisation
> point of the CIL; once the context has been switched, the rest of the
> context push can run asynchronously with all other context pushes.
> 
> Hence we can move the work to the CIL context so that we can run
> multiple CIL pushes at the same time and spread the majority of
> the work out over multiple CPUs. We can keep the per-cpu CIL commit
> state on the CIL rather than the context, because the context is
> pinned to the CIL until the switch is done and we aggregate and
> drain the per-cpu state held on the CIL during the context switch.
> 
> However, because we no longer serialise the CIL work, we can have
> effectively unlimited CIL pushes in progress. We don't want to do
> this - not only does it create contention on the iclogs and the
> state machine locks, we can run the log right out of space with
> outstanding pushes. Instead, limit the work concurrency to 4
> concurrent work items being processed at a time. This is enough

Four?  Was that determined experimentally, or is that a fundamental
limit of how many CIL checkpoints we can be working on at a time?  The
current one, the previous one, and ... something else that was already
in progress?

I think the rest of the patch looks reasonable, FWIW.

--D

> concurrency to remove the CIL from being a CPU bound bottleneck but
> not enough to create new contention points or unbound concurrency
> issues.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h |  2 +-
>  fs/xfs/xfs_super.c    |  2 +-
>  3 files changed, 44 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b101c25cc9a9..dfc9ef692a80 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
>  	return tic;
>  }
>  
> +/*
> + * Unavoidable forward declaration - xlog_cil_push_work() calls
> + * xlog_cil_ctx_alloc() itself.
> + */
> +static void xlog_cil_push_work(struct work_struct *work);
> +
> +static struct xfs_cil_ctx *
> +xlog_cil_ctx_alloc(void)
> +{
> +	struct xfs_cil_ctx	*ctx;
> +
> +	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
> +	INIT_LIST_HEAD(&ctx->committing);
> +	INIT_LIST_HEAD(&ctx->busy_extents);
> +	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
> +	return ctx;
> +}
> +
> +static void
> +xlog_cil_ctx_switch(
> +	struct xfs_cil		*cil,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	ctx->sequence = ++cil->xc_current_sequence;
> +	ctx->cil = cil;
> +	cil->xc_ctx = ctx;
> +}
> +
>  /*
>   * After the first stage of log recovery is done, we know where the head and
>   * tail of the log are. We need this log initialisation done before we can
> @@ -641,11 +669,11 @@ static void
>  xlog_cil_push_work(
>  	struct work_struct	*work)
>  {
> -	struct xfs_cil		*cil =
> -		container_of(work, struct xfs_cil, xc_push_work);
> +	struct xfs_cil_ctx	*ctx =
> +		container_of(work, struct xfs_cil_ctx, push_work);
> +	struct xfs_cil		*cil = ctx->cil;
>  	struct xlog		*log = cil->xc_log;
>  	struct xfs_log_vec	*lv;
> -	struct xfs_cil_ctx	*ctx;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
>  	struct xlog_ticket	*tic;
> @@ -660,11 +688,10 @@ xlog_cil_push_work(
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  	bool			commit_iclog_sync = false;
>  
> -	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
> +	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
>  
>  	down_write(&cil->xc_ctx_lock);
> -	ctx = cil->xc_ctx;
>  
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
> @@ -696,7 +723,7 @@ xlog_cil_push_work(
>  
>  
>  	/* check for a previously pushed sequence */
> -	if (push_seq < cil->xc_ctx->sequence) {
> +	if (push_seq < ctx->sequence) {
>  		spin_unlock(&cil->xc_push_lock);
>  		goto out_skip;
>  	}
> @@ -767,19 +794,7 @@ xlog_cil_push_work(
>  	}
>  
>  	/*
> -	 * initialise the new context and attach it to the CIL. Then attach
> -	 * the current context to the CIL committing list so it can be found
> -	 * during log forces to extract the commit lsn of the sequence that
> -	 * needs to be forced.
> -	 */
> -	INIT_LIST_HEAD(&new_ctx->committing);
> -	INIT_LIST_HEAD(&new_ctx->busy_extents);
> -	new_ctx->sequence = ctx->sequence + 1;
> -	new_ctx->cil = cil;
> -	cil->xc_ctx = new_ctx;
> -
> -	/*
> -	 * The switch is now done, so we can drop the context lock and move out
> +	 * Switch the contexts so we can drop the context lock and move out
>  	 * of a shared context. We can't just go straight to the commit record,
>  	 * though - we need to synchronise with previous and future commits so
>  	 * that the commit records are correctly ordered in the log to ensure
> @@ -804,7 +819,7 @@ xlog_cil_push_work(
>  	 * deferencing a freed context pointer.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> -	cil->xc_current_sequence = new_ctx->sequence;
> +	xlog_cil_ctx_switch(cil, new_ctx);
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
>  
> @@ -968,7 +983,7 @@ xlog_cil_push_background(
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
> -		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	}
>  
>  	/*
> @@ -1034,7 +1049,7 @@ xlog_cil_push_now(
>  
>  	/* start on any pending background push to minimise wait time on it */
>  	if (sync)
> -		flush_work(&cil->xc_push_work);
> +		flush_workqueue(log->l_mp->m_cil_workqueue);
>  
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> @@ -1049,7 +1064,7 @@ xlog_cil_push_now(
>  	cil->xc_push_seq = push_seq;
>  	if (!sync)
>  		cil->xc_push_async = true;
> -	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	spin_unlock(&cil->xc_push_lock);
>  }
>  
> @@ -1286,13 +1301,6 @@ xlog_cil_init(
>  	if (!cil)
>  		return -ENOMEM;
>  
> -	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
> -	if (!ctx) {
> -		kmem_free(cil);
> -		return -ENOMEM;
> -	}
> -
> -	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
>  	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
> @@ -1300,16 +1308,12 @@ xlog_cil_init(
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
>  	init_waitqueue_head(&cil->xc_commit_wait);
> -
> -	INIT_LIST_HEAD(&ctx->committing);
> -	INIT_LIST_HEAD(&ctx->busy_extents);
> -	ctx->sequence = 1;
> -	ctx->cil = cil;
> -	cil->xc_ctx = ctx;
> -	cil->xc_current_sequence = ctx->sequence;
> -
>  	cil->xc_log = log;
>  	log->l_cilp = cil;
> +
> +	ctx = xlog_cil_ctx_alloc();
> +	xlog_cil_ctx_switch(cil, ctx);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a4e46258b2aa..bb5fa6b71114 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -245,6 +245,7 @@ struct xfs_cil_ctx {
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
> +	struct work_struct	push_work;
>  };
>  
>  /*
> @@ -277,7 +278,6 @@ struct xfs_cil {
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
> -	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ca2cb0448b5e..962f03a541e7 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -502,7 +502,7 @@ xfs_init_mount_workqueues(
>  
>  	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
>  			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
> -			0, mp->m_super->s_id);
> +			4, mp->m_super->s_id);
>  	if (!mp->m_cil_workqueue)
>  		goto out_destroy_unwritten;
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-08 23:14   ` Darrick J. Wong
@ 2021-03-08 23:38     ` Dave Chinner
  2021-03-09  1:55       ` Darrick J. Wong
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-08 23:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 03:14:32PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because we use a single work structure attached to the CIL rather
> > than the CIL context, we can only queue a single work item at a
> > time. This results in the CIL being single threaded and limits
> > performance when it becomes CPU bound.
> > 
> > The design of the CIL is that it is pipelined and multiple commits
> > can be running concurrently, but the way the work is currently
> > implemented means that it is not pipelining as it was intended. The
> > critical work to switch the CIL context can take a few milliseconds
> > to run, but the rest of the CIL context flush can take hundreds of
> > milliseconds to complete. The context switching is the serialisation
> > point of the CIL; once the context has been switched, the rest of the
> > context push can run asynchronously with all other context pushes.
> > 
> > Hence we can move the work to the CIL context so that we can run
> > multiple CIL pushes at the same time and spread the majority of
> > the work out over multiple CPUs. We can keep the per-cpu CIL commit
> > state on the CIL rather than the context, because the context is
> > pinned to the CIL until the switch is done and we aggregate and
> > drain the per-cpu state held on the CIL during the context switch.
> > 
> > However, because we no longer serialise the CIL work, we can have
> > effectively unlimited CIL pushes in progress. We don't want to do
> > this - not only does it create contention on the iclogs and the
> > state machine locks, we can run the log right out of space with
> > outstanding pushes. Instead, limit the work concurrency to 4
> > concurrent work items being processed at a time. This is enough
> 
> Four?  Was that determined experimentally, or is that a fundamental
> limit of how many cil checkpoints we can working on at a time?  The
> current one, the previous one, and ... something else that was already
> in progress?

No fundamental limit, but....

> > concurrency to remove the CIL from being a CPU bound bottleneck but
> > not enough to create new contention points or unbound concurrency
> > issues.

spinlocks in well written code scale linearly to 3-4 CPUs banging on
them frequently.  Beyond that they start to show non-linear
behaviour before they break down completely somewhere between 8 and
16 threads banging on them. If we have 4 CIL writes going on, we
have 4 CPUs banging on the log->l_icloglock through xlog_write()
through xlog_state_get_iclog_space() and then releasing the iclogs
when they are full. We then have iclog IO completion banging on the
icloglock to serialise completion processing so that it can change
iclog state on completion.

Hence with 4 CIL push works running, we're starting to get back to the point
where the icloglock will start to see non-linear access cost. This
was a problem before delayed logging removed the icloglock from the
front end transaction commit path where it could see unbound
concurrency and was the hottest lock in the log.

Allowing a limited amount of concurrency prevents us from
unnecessarily allowing wasteful and performance limiting lock
contention from occurring. And given that I'm only hitting the
single CPU limit of the CIL push when there are 31 other CPUs all
running transactions flat out, having 4 CPUs to run the same work is
more than enough. Especially as those 31 other CPUs running
transactions are already pushing VFS level spinlocks
(sb->sb_inode_list_lock, dentry ref count locking, etc) to breakdown
point so we're not going to be able to push enough change into the
CIL to keep 4 CPUs fully busy any time soon.
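
The cap itself is just the max_active argument to alloc_workqueue()
in the xfs_super.c hunk.  Roughly, as an illustrative sketch rather
than the exact XFS code (wq and sb_id stand in for mp->m_cil_workqueue
and mp->m_super->s_id):

	/*
	 * max_active = 4: at most four work items from this workqueue
	 * run concurrently.  queue_work() on a fifth still succeeds,
	 * but that work sits queued until one of the four slots frees.
	 */
	wq = alloc_workqueue("xfs-cil/%s",
			WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND,
			4, sb_id);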

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing
  2021-03-05  5:11 ` [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-03-08 23:45   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 23:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:12PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results is relatively few items in the AIL, but those that are
> are often pinned as directories items are being removed from are
> still being logged.
> 
> As a result, many push cycles through the AIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
> ....
>  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 270ms delay>
>  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
>  400.099623:  xfsaild: first lsn 0x11002f3600
>  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> <wanted 50ms, got 500ms delay>
>  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
>  400.589349:  xfsaild: first lsn 0x1100305000
>  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> <wanted 50ms, got 460ms delay>
>  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
>  400.950343:  xfsaild: first lsn 0x1100317c00
>  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 200ms delay>
>  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
>  401.142334:  xfsaild: first lsn 0x110032e600
>  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> <wanted 10ms, got 10ms delay>
>  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
>  401.154328:  xfsaild: first lsn 0x1100353000
>  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 300ms delay>
>  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
>  401.451526:  xfsaild: first lsn 0x1100353000
>  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> <wanted 50ms, got 500ms delay>
>  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions whose stale items never reach the
> AIL but still consume log space. Hence we have a largely empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.
> 
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c       | 59 ++++++++++++++++--------------------------
>  fs/xfs/xfs_log.h       |  2 +-
>  fs/xfs/xfs_log_cil.c   | 58 +++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_log_priv.h  | 10 +++++--
>  fs/xfs/xfs_sysfs.c     |  1 +
>  fs/xfs/xfs_trace.c     |  1 +
>  fs/xfs/xfs_trans.c     |  2 +-
>  fs/xfs/xfs_trans_ail.c | 11 +++++---
>  8 files changed, 90 insertions(+), 54 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 145db0f88060..f54d48f4584e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -50,11 +50,6 @@ xlog_state_get_iclog_space(
>  	int			*continued_write,
>  	int			*logoffsetp);
>  STATIC void
> -xlog_state_switch_iclogs(
> -	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	int			eventual_size);
> -STATIC void
>  xlog_grant_push_ail(
>  	struct xlog		*log,
>  	int			need_bytes);
> @@ -511,7 +506,7 @@ __xlog_state_release_iclog(
>   * Flush iclog to disk if this is the last reference to the given iclog and the
>   * it is in the WANT_SYNC state.
>   */
> -static int
> +int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
> @@ -531,23 +526,6 @@ xlog_state_release_iclog(
>  	return 0;
>  }
>  
> -void
> -xfs_log_release_iclog(
> -	struct xlog_in_core	*iclog)
> -{
> -	struct xlog		*log = iclog->ic_log;
> -	bool			sync = false;
> -
> -	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
> -		if (iclog->ic_state != XLOG_STATE_IOERROR)
> -			sync = __xlog_state_release_iclog(log, iclog);
> -		spin_unlock(&log->l_icloglock);
> -	}
> -
> -	if (sync)
> -		xlog_sync(log, iclog);
> -}
> -
>  /*
>   * Mount a log filesystem
>   *
> @@ -3125,7 +3103,7 @@ xfs_log_ticket_ungrant(
>   * This routine will mark the current iclog in the ring as WANT_SYNC and move
>   * the current iclog pointer to the next iclog in the ring.
>   */
> -STATIC void
> +void
>  xlog_state_switch_iclogs(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
> @@ -3272,6 +3250,20 @@ xfs_log_force(
>  	return -EIO;
>  }
>  
> +/*
> + * Force the log to a specific LSN.
> + *
> + * If an iclog with that lsn can be found:
> + *	If it is in the DIRTY state, just return.
> + *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
> + *		state and go to sleep or return.
> + *	If it is in any other state, go to sleep or return.
> + *
> + * Synchronous forces are implemented with a wait queue.  All callers trying
> + * to force a given lsn to disk must wait on the queue attached to the
> + * specific in-core log.  When given in-core log finally completes its write
> + * to disk, that thread will wake up all threads waiting on the queue.
> + */
>  static int
>  xlog_force_lsn(
>  	struct xlog		*log,
> @@ -3335,18 +3327,13 @@ xlog_force_lsn(
>  }
>  
>  /*
> - * Force the in-core log to disk for a specific LSN.
> - *
> - * Find in-core log with lsn.
> - *	If it is in the DIRTY state, just return.
> - *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
> - *		state and go to sleep or return.
> - *	If it is in any other state, go to sleep or return.
> + * Force the log to a specific checkpoint sequence.
>   *
> - * Synchronous forces are implemented with a wait queue.  All callers trying
> - * to force a given lsn to disk must wait on the queue attached to the
> - * specific in-core log.  When given in-core log finally completes its write
> - * to disk, that thread will wake up all threads waiting on the queue.
> + * First force the CIL so that all the required changes have been flushed to the
> + * iclogs. If the CIL force completed it will return a commit LSN that indicates
> + * the iclog that needs to be flushed to stable storage. If the caller needs
> + * a synchronous log force, we will wait on the iclog with the LSN returned by
> + * xlog_cil_force_seq() to be completed.
>   */
>  int
>  xfs_log_force_seq(
> @@ -3363,7 +3350,7 @@ xfs_log_force_seq(
>  	XFS_STATS_INC(mp, xs_log_force);
>  	trace_xfs_log_force(mp, seq, _RET_IP_);
>  
> -	lsn = xlog_cil_force_seq(log, seq);
> +	lsn = xlog_cil_force_seq(log, XFS_LOG_SYNC, seq);
>  	if (lsn == NULLCOMMITLSN)
>  		return 0;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index ba96f4ad9576..1bd080ce3a95 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -104,6 +104,7 @@ struct xlog_ticket;
>  struct xfs_log_item;
>  struct xfs_item_ops;
>  struct xfs_trans;
> +struct xlog;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
>  int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
> @@ -117,7 +118,6 @@ void	xfs_log_mount_cancel(struct xfs_mount *);
>  xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
>  xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
>  void	  xfs_log_space_wake(struct xfs_mount *mp);
> -void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
>  int	  xfs_log_reserve(struct xfs_mount *mp,
>  			  int		   length,
>  			  int		   count,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 44bb7cc17541..b101c25cc9a9 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -658,6 +658,7 @@ xlog_cil_push_work(
>  	xfs_lsn_t		push_seq;
>  	struct bio		bio;
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> +	bool			commit_iclog_sync = false;
>  
>  	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -668,6 +669,8 @@ xlog_cil_push_work(
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
> +	commit_iclog_sync = cil->xc_push_async;

Confused about the naming here -- we're assigning from a variable named
"async" to a variable named "sync"?

Oh, I see.  This is the mechanism by which the CIL figures out that the
AIL asked us to perform an async push of the current iclog to unpin some
items, so we want to move on to the next iclog so that we can flush
commit_iclog to disk, which will (eventually) help out the AIL?

This underscores the need for a quick comment explaining all this
cleverness, I think...
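
i.e. the flow, pulling the relevant lines out of the patch itself, is
something like:

	/* AIL side: kick the CIL but don't wait for it */
	xlog_cil_push_now(log, seq, false);	/* sets cil->xc_push_async */

	/* push work side: latch and clear the flag under xc_push_lock */
	commit_iclog_sync = cil->xc_push_async;
	cil->xc_push_async = false;

	/* after the commit record is written: don't leave it in memory */
	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
		xlog_state_switch_iclogs(log, commit_iclog, 0);
	xlog_state_release_iclog(log, commit_iclog);

where the "sync" in commit_iclog_sync reads backwards -- it is set when
the *async* caller needs the commit record pushed to disk immediately.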

> +	cil->xc_push_async = false;
>  
>  	/*
>  	 * As we are about to switch to a new, empty CIL context, we no longer
> @@ -914,7 +917,11 @@ xlog_cil_push_work(
>  	}
>  
>  	/* release the hounds! */
> -	xfs_log_release_iclog(commit_iclog);
> +	spin_lock(&log->l_icloglock);


...something like this?

	/*
	 * If the AIL asked for an asynchronous CIL push and the current
	 * iclog is still open, move on to the next iclog and flush
	 * the current one to unpin the AIL.
	 */

--D

> +	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
> +		xlog_state_switch_iclogs(log, commit_iclog, 0);
> +	xlog_state_release_iclog(log, commit_iclog);
> +	spin_unlock(&log->l_icloglock);
>  	return;
>  
>  out_skip:
> @@ -997,13 +1004,26 @@ xlog_cil_push_background(
>  /*
>   * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
>   * number that is passed. When it returns, the work will be queued for
> - * @push_seq, but it won't be completed. The caller is expected to do any
> - * waiting for push_seq to complete if it is required.
> + * @push_seq, but it won't be completed.
> + *
> + * If the caller is performing a synchronous force, we will flush the workqueue
> + * to get previously queued work moving to minimise the wait time they will
> + * undergo waiting for all outstanding pushes to complete. The caller is
> + * expected to do the required waiting for push_seq to complete.
> + *
> + * If the caller is performing an async push, we need to ensure that the
> + * checkpoint is fully flushed out of the iclogs when we finish the push. If we
> + * don't do this, then the commit record may remain sitting in memory in an
> + * ACTIVE iclog. This then requires another full log force to push to disk,
> + * which defeats the purpose of having an async, non-blocking CIL force
> + * mechanism. Hence in this case we need to pass a flag to the push work to
> + * indicate it needs to flush the commit record itself.
>   */
>  static void
>  xlog_cil_push_now(
>  	struct xlog	*log,
> -	xfs_lsn_t	push_seq)
> +	xfs_lsn_t	push_seq,
> +	bool		sync)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -1013,7 +1033,8 @@ xlog_cil_push_now(
>  	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
>  
>  	/* start on any pending background push to minimise wait time on it */
> -	flush_work(&cil->xc_push_work);
> +	if (sync)
> +		flush_work(&cil->xc_push_work);
>  
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> @@ -1026,6 +1047,8 @@ xlog_cil_push_now(
>  	}
>  
>  	cil->xc_push_seq = push_seq;
> +	if (!sync)
> +		cil->xc_push_async = true;
>  	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	spin_unlock(&cil->xc_push_lock);
>  }
> @@ -1113,16 +1136,22 @@ xlog_cil_commit(
>  /*
>   * Conditionally push the CIL based on the sequence passed in.
>   *
> - * We only need to push if we haven't already pushed the sequence
> - * number given. Hence the only time we will trigger a push here is
> - * if the push sequence is the same as the current context.
> + * We only need to push if we haven't already pushed the sequence number given.
> + * Hence the only time we will trigger a push here is if the push sequence is
> + * the same as the current context.
>   *
> - * We return the current commit lsn to allow the callers to determine if a
> - * iclog flush is necessary following this call.
> + * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in
> + * the flags, wait for it to complete; otherwise just return NULLCOMMITLSN to
> + * indicate we didn't wait for a commit lsn.
> + *
> + * If we waited for the push to complete, then we return the current commit lsn
> + * to allow the callers to determine if an iclog flush is necessary following
> + * this call.
>   */
>  xfs_lsn_t
>  xlog_cil_force_seq(
>  	struct xlog	*log,
> +	uint32_t	flags,
>  	xfs_csn_t	sequence)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
> @@ -1131,13 +1160,19 @@ xlog_cil_force_seq(
>  
>  	ASSERT(sequence <= cil->xc_current_sequence);
>  
> +	if (!sequence)
> +		sequence = cil->xc_current_sequence;
> +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> +
>  	/*
>  	 * check to see if we need to force out the current context.
>  	 * xlog_cil_push() handles racing pushes for the same sequence,
>  	 * so no need to deal with it here.
>  	 */
>  restart:
> -	xlog_cil_push_now(log, sequence);
> +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> +	if (!(flags & XFS_LOG_SYNC))
> +		return commit_lsn;
>  
>  	/*
>  	 * See if we can find a previous sequence still committing.
> @@ -1161,6 +1196,7 @@ xlog_cil_force_seq(
>  			 * It is still being pushed! Wait for the push to
>  			 * complete, then start again from the beginning.
>  			 */
> +			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
>  			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
>  			goto restart;
>  		}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 31ce2ce21e27..a4e46258b2aa 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -273,6 +273,7 @@ struct xfs_cil {
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
>  	xfs_csn_t		xc_push_seq;
> +	bool			xc_push_async;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
> @@ -487,6 +488,10 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> +		int eventual_size);
> +int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
> +
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> @@ -558,12 +563,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
> +				xfs_csn_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index f1bc88f4367c..18dc5eca6c04 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_mount.h"
>  
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index 9b8d703dc9fd..d111a994b7b6 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -20,6 +20,7 @@
>  #include "xfs_bmap.h"
>  #include "xfs_attr.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_quota.h"
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 21ac7c048380..52f3fdf1e0de 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,7 +9,6 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> -#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -17,6 +16,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_defer.h"
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index dbb69b4bf3ed..dfc0206c0d36 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -429,8 +430,12 @@ xfsaild_push(
>  
>  	/*
>  	 * If we encountered pinned items or did not finish writing out all
> -	 * buffers the last time we ran, force the log first and wait for it
> -	 * before pushing again.
> +	 * buffers the last time we ran, force a background CIL push to get the
> +	 * items unpinned in the near future. We do not wait on the CIL push as
> +	 * that could stall us for seconds if there is enough background IO
> +	 * load. Stalling for that long when the tail of the log is pinned and
> +	 * needs flushing will hard stop the transaction subsystem when log
> +	 * space runs out.
>  	 */
>  	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
>  	    (!list_empty_careful(&ailp->ail_buf_list) ||
> @@ -438,7 +443,7 @@ xfsaild_push(
>  		ailp->ail_log_flush = 0;
>  
>  		XFS_STATS_INC(mp, xs_push_ail_flush);
> -		xfs_log_force(mp, XFS_LOG_SYNC);
> +		xlog_cil_force_seq(mp->m_log, 0, 0);
>  	}
>  
>  	spin_lock(&ailp->ail_lock);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 19/45] xfs: factor out the CIL transaction header building
  2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
@ 2021-03-08 23:47   ` Darrick J. Wong
  2021-03-16 14:50   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-08 23:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:17PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It is static code deep in the middle of the CIL push logic. Factor
> it out into a helper so that it is clear and easy to modify
> separately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Looks straightforward,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c | 71 +++++++++++++++++++++++++++++---------------
>  1 file changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index dfc9ef692a80..b515002e7959 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -651,6 +651,41 @@ xlog_cil_process_committed(
>  	}
>  }
>  
> +struct xlog_cil_trans_hdr {
> +	struct xfs_trans_header	thdr;
> +	struct xfs_log_iovec	lhdr;
> +};
> +
> +/*
> + * Build a checkpoint transaction header to begin the journal transaction.  We
> + * need to account for the space used by the transaction header here as it is
> + * not accounted for in xlog_write().
> + */
> +static void
> +xlog_cil_build_trans_hdr(
> +	struct xfs_cil_ctx	*ctx,
> +	struct xlog_cil_trans_hdr *hdr,
> +	struct xfs_log_vec	*lvhdr,
> +	int			num_iovecs)
> +{
> +	struct xlog_ticket	*tic = ctx->ticket;
> +
> +	memset(hdr, 0, sizeof(*hdr));
> +
> +	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
> +	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
> +	hdr->thdr.th_tid = tic->t_tid;
> +	hdr->thdr.th_num_items = num_iovecs;
> +	hdr->lhdr.i_addr = &hdr->thdr;
> +	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
> +	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
> +	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t);
> +
> +	lvhdr->lv_niovecs = 1;
> +	lvhdr->lv_iovecp = &hdr->lhdr;
> +	lvhdr->lv_next = ctx->lv_chain;
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -676,11 +711,9 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
> -	struct xlog_ticket	*tic;
>  	int			num_iovecs;
>  	int			error = 0;
> -	struct xfs_trans_header thdr;
> -	struct xfs_log_iovec	lhdr;
> +	struct xlog_cil_trans_hdr thdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_lsn_t		commit_lsn;
>  	xfs_lsn_t		push_seq;
> @@ -827,24 +860,8 @@ xlog_cil_push_work(
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
>  	 * transaction header here as it is not accounted for in xlog_write().
> -	 *
> -	 * The LSN we need to pass to the log items on transaction commit is
> -	 * the LSN reported by the first log vector write. If we use the commit
> -	 * record lsn then we can move the tail beyond the grant write head.
>  	 */
> -	tic = ctx->ticket;
> -	thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
> -	thdr.th_type = XFS_TRANS_CHECKPOINT;
> -	thdr.th_tid = tic->t_tid;
> -	thdr.th_num_items = num_iovecs;
> -	lhdr.i_addr = &thdr;
> -	lhdr.i_len = sizeof(xfs_trans_header_t);
> -	lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
> -	tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t);
> -
> -	lvhdr.lv_niovecs = 1;
> -	lvhdr.lv_iovecp = &lhdr;
> -	lvhdr.lv_next = ctx->lv_chain;
> +	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -852,7 +869,13 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
> +	/*
> +	 * The LSN we need to pass to the log items on transaction commit is the
> +	 * LSN reported by the first log vector write, not the commit lsn. If we
> +	 * use the commit record lsn then we can move the tail beyond the grant
> +	 * write head.
> +	 */
> +	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
>  				XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
> @@ -891,11 +914,11 @@ xlog_cil_push_work(
>  	}
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);
> +	error = xlog_commit_record(log, ctx->ticket, &commit_iclog, &commit_lsn);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> -	xfs_log_ticket_ungrant(log, tic);
> +	xfs_log_ticket_ungrant(log, ctx->ticket);
>  
>  	spin_lock(&commit_iclog->ic_callback_lock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
> @@ -946,7 +969,7 @@ xlog_cil_push_work(
>  	return;
>  
>  out_abort_free_ticket:
> -	xfs_log_ticket_ungrant(log, tic);
> +	xfs_log_ticket_ungrant(log, ctx->ticket);
>  out_abort:
>  	ASSERT(XLOG_FORCED_SHUTDOWN(log));
>  	xlog_cil_committed(ctx);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 20/45] xfs: only CIL pushes require a start record
  2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
@ 2021-03-09  0:07   ` Darrick J. Wong
  2021-03-16 14:51   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  0:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:18PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So move the one-off start record writing in xlog_write() out into
> the static header that the CIL push builds to write into the log
> initially. This simplifies the xlog_write() logic a lot.
> 
> pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
> 32 bit aligned and packed for copying the log op and transaction
> headers directly into the log as a single log region copy.
> 
> struct xlog_cil_trans_hdr {
> 	struct xlog_op_header      oph[2];               /*     0    24 */
> 	struct xfs_trans_header    thdr;                 /*    24    16 */
> 	struct xfs_log_iovec       lhdr;                 /*    40    16 */
> 
> 	/* size: 56, cachelines: 1, members: 3 */
> 	/* last cacheline: 56 bytes */
> };
> 
> A wart is needed to handle the fact that the length of the region the
> opheader points to doesn't include the opheader length. Hence if
> we embed the opheader, we have to subtract the opheader length from
> the length written into the opheader by the generic copying code.
> This will eventually go away when everything is converted to
> embedded opheaders.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ... ugly, but looking forward a few patches you're clearly getting
ready to refactor a bunch of grody 4-indent code so...

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c     | 90 ++++++++++++++++++++++----------------------
>  fs/xfs/xfs_log_cil.c | 44 ++++++++++++++++++----
>  2 files changed, 81 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f54d48f4584e..b2f9fb1b4fed 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2106,9 +2106,9 @@ xlog_print_trans(
>  }
>  
>  /*
> - * Calculate the potential space needed by the log vector.  We may need a start
> - * record, and each region gets its own struct xlog_op_header and may need to be
> - * double word aligned.
> + * Calculate the potential space needed by the log vector. If this is a start
> + * transaction, the caller has already accounted for both opheaders in the start
> + * transaction, so we don't need to account for them here.
>   */
>  static int
>  xlog_write_calc_vec_length(
> @@ -2121,9 +2121,6 @@ xlog_write_calc_vec_length(
>  	int			len = 0;
>  	int			i;
>  
> -	if (optype & XLOG_START_TRANS)
> -		headers++;
> -
>  	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> @@ -2139,24 +2136,20 @@ xlog_write_calc_vec_length(
>  		}
>  	}
>  
> +	/* Don't account for regions with embedded ophdrs */
> +	if (optype && headers > 0) {
> +		if (optype & XLOG_START_TRANS) {
> +			ASSERT(headers >= 2);
> +			headers -= 2;
> +		}
> +	}
> +
>  	ticket->t_res_num_ophdrs += headers;
>  	len += headers * sizeof(struct xlog_op_header);
>  
>  	return len;
>  }
>  
> -static void
> -xlog_write_start_rec(
> -	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket)
> -{
> -	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
> -	ophdr->oh_clientid = ticket->t_clientid;
> -	ophdr->oh_len = 0;
> -	ophdr->oh_flags = XLOG_START_TRANS;
> -	ophdr->oh_res2 = 0;
> -}
> -
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
>  	struct xlog		*log,
> @@ -2361,9 +2354,11 @@ xlog_write(
>  	 * If this is a commit or unmount transaction, we don't need a start
>  	 * record to be written.  We do, however, have to account for the
>  	 * commit or unmount header that gets written. Hence we always have
> -	 * to account for an extra xlog_op_header here.
> +	 * to account for an extra xlog_op_header here for commit and unmount
> +	 * records.
>  	 */
> -	ticket->t_curr_res -= sizeof(struct xlog_op_header);
> +	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
> +		ticket->t_curr_res -= sizeof(struct xlog_op_header);
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
>  		     "ctx ticket reservation ran out. Need to up reservation");
> @@ -2411,7 +2406,7 @@ xlog_write(
>  			int			copy_len;
>  			int			copy_off;
>  			bool			ordered = false;
> -			bool			wrote_start_rec = false;
> +			bool			added_ophdr = false;
>  
>  			/* ordered log vectors have no regions to write */
>  			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> @@ -2425,25 +2420,24 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * Before we start formatting log vectors, we need to
> -			 * write a start record. Only do this for the first
> -			 * iclog we write to.
> +			 * The XLOG_START_TRANS has embedded ophdrs for the
> +			 * start record and transaction header. They will always
> +			 * be the first two regions in the lv chain.
>  			 */
>  			if (optype & XLOG_START_TRANS) {
> -				xlog_write_start_rec(ptr, ticket);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						sizeof(struct xlog_op_header));
> -				optype &= ~XLOG_START_TRANS;
> -				wrote_start_rec = true;
> -			}
> -
> -			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
> -			if (!ophdr)
> -				return -EIO;
> +				ophdr = reg->i_addr;
> +				if (index)
> +					optype &= ~XLOG_START_TRANS;
> +			} else {
> +				ophdr = xlog_write_setup_ophdr(log, ptr,
> +							ticket, optype);
> +				if (!ophdr)
> +					return -EIO;
>  
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset,
> +				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
> -
> +				added_ophdr = true;
> +			}
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
> @@ -2452,13 +2446,22 @@ xlog_write(
>  						     &partial_copy_len);
>  			xlog_verify_dest_ptr(log, ptr);
>  
> +
> +			/*
> +			 * Wart: need to update length in embedded ophdr not
> +			 * to include it's own length.
> +			 * to include its own length.
> +			if (!added_ophdr) {
> +				ophdr->oh_len = cpu_to_be32(copy_len -
> +						sizeof(struct xlog_op_header));
> +			}
>  			/*
>  			 * Copy region.
>  			 *
> -			 * Unmount records just log an opheader, so can have
> -			 * empty payloads with no data region to copy. Hence we
> -			 * only copy the payload if the vector says it has data
> -			 * to copy.
> +			 * Commit and unmount records just log an opheader, so
> +			 * we can have empty payloads with no data region to
> +			 * copy.  Hence we only copy the payload if the vector
> +			 * says it has data to copy.
>  			 */
>  			ASSERT(copy_len >= 0);
>  			if (copy_len > 0) {
> @@ -2466,12 +2469,9 @@ xlog_write(
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  						   copy_len);
>  			}
> -			copy_len += sizeof(struct xlog_op_header);
> -			record_cnt++;
> -			if (wrote_start_rec) {
> +			if (added_ophdr)
>  				copy_len += sizeof(struct xlog_op_header);
> -				record_cnt++;
> -			}
> +			record_cnt++;
>  			data_cnt += contwr ? copy_len : 0;
>  
>  			error = xlog_write_copy_finish(log, iclog, optype,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b515002e7959..e9da074ecd69 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -652,14 +652,22 @@ xlog_cil_process_committed(
>  }
>  
>  struct xlog_cil_trans_hdr {
> +	struct xlog_op_header	oph[2];
>  	struct xfs_trans_header	thdr;
> -	struct xfs_log_iovec	lhdr;
> +	struct xfs_log_iovec	lhdr[2];
>  };
>  
>  /*
>   * Build a checkpoint transaction header to begin the journal transaction.  We
>   * need to account for the space used by the transaction header here as it is
>   * not accounted for in xlog_write().
> + *
> + * This is the only place we write a transaction header, so we also build the
> + * log opheaders that indicate the start of a log transaction and wrap the
> + * transaction header. We keep the start record in its own log vector rather
> + * than compacting them into a single region as this ends up making the logic
> + * in xlog_write() for handling empty opheaders for start, commit and unmount
> + * records much simpler.
>   */
>  static void
>  xlog_cil_build_trans_hdr(
> @@ -669,20 +677,40 @@ xlog_cil_build_trans_hdr(
>  	int			num_iovecs)
>  {
>  	struct xlog_ticket	*tic = ctx->ticket;
> +	uint32_t		tid = cpu_to_be32(tic->t_tid);
>  
>  	memset(hdr, 0, sizeof(*hdr));
>  
> +	/* Log start record */
> +	hdr->oph[0].oh_tid = tid;
> +	hdr->oph[0].oh_clientid = XFS_TRANSACTION;
> +	hdr->oph[0].oh_flags = XLOG_START_TRANS;
> +
> +	/* log iovec region pointer */
> +	hdr->lhdr[0].i_addr = &hdr->oph[0];
> +	hdr->lhdr[0].i_len = sizeof(struct xlog_op_header);
> +	hdr->lhdr[0].i_type = XLOG_REG_TYPE_LRHEADER;
> +
> +	/* log opheader */
> +	hdr->oph[1].oh_tid = tid;
> +	hdr->oph[1].oh_clientid = XFS_TRANSACTION;
> +
> +	/* transaction header */
>  	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
>  	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
> -	hdr->thdr.th_tid = tic->t_tid;
> +	hdr->thdr.th_tid = tid;
>  	hdr->thdr.th_num_items = num_iovecs;
> -	hdr->lhdr.i_addr = &hdr->thdr;
> -	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
> -	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
> -	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t);
>  
> -	lvhdr->lv_niovecs = 1;
> -	lvhdr->lv_iovecp = &hdr->lhdr;
> +	/* log iovec region pointer */
> +	hdr->lhdr[1].i_addr = &hdr->oph[1];
> +	hdr->lhdr[1].i_len = sizeof(struct xlog_op_header) +
> +				sizeof(struct xfs_trans_header);
> +	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
> +
> +	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> +
> +	lvhdr->lv_niovecs = 2;
> +	lvhdr->lv_iovecp = &hdr->lhdr[0];
>  	lvhdr->lv_next = ctx->lv_chain;
>  }
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record
  2021-03-05  5:11 ` [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Dave Chinner
@ 2021-03-09  0:15   ` Darrick J. Wong
  2021-03-11  2:54     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  0:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:19PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Remove another case where xlog_write() has to prepend an opheader to
> a log transaction. The unmount record + ophdr is smaller than the
> minimum amount of space guaranteed to be free in an iclog (2 *
> sizeof(ophdr)) and so we don't have to care about an unmount record
> being split across 2 iclogs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log.c | 35 ++++++++++++++++++++++++-----------
>  1 file changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index b2f9fb1b4fed..94711b9ff007 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -798,12 +798,22 @@ xlog_write_unmount_record(
>  	struct xlog		*log,
>  	struct xlog_ticket	*ticket)
>  {
> -	struct xfs_unmount_log_format ulf = {
> -		.magic = XLOG_UNMOUNT_TYPE,
> +	struct  {
> +		struct xlog_op_header ophdr;
> +		struct xfs_unmount_log_format ulf;
> +	} unmount_rec = {

I wonder, should we have a BUILD_BUG_ON to confirm sizeof(unmount_rec)
just in case some weird architecture injects padding between these two?
Prior to this code we formatted the op header and unmount record in
separate incore buffers and wrote them to disk with no gap, right?
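
Something like this, maybe (untested sketch, using the names from the
patch):

	/*
	 * Catch any compiler padding between the embedded ophdr and
	 * the unmount record at build time.
	 */
	BUILD_BUG_ON(sizeof(unmount_rec) !=
		     sizeof(struct xlog_op_header) +
		     sizeof(struct xfs_unmount_log_format));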

--D

> +		.ophdr = {
> +			.oh_clientid = XFS_LOG,
> +			.oh_tid = cpu_to_be32(ticket->t_tid),
> +			.oh_flags = XLOG_UNMOUNT_TRANS,
> +		},
> +		.ulf = {
> +			.magic = XLOG_UNMOUNT_TYPE,
> +		},
>  	};
>  	struct xfs_log_iovec reg = {
> -		.i_addr = &ulf,
> -		.i_len = sizeof(ulf),
> +		.i_addr = &unmount_rec,
> +		.i_len = sizeof(unmount_rec),
>  		.i_type = XLOG_REG_TYPE_UNMOUNT,
>  	};
>  	struct xfs_log_vec vec = {
> @@ -812,7 +822,7 @@ xlog_write_unmount_record(
>  	};
>  
>  	/* account for space used by record data */
> -	ticket->t_curr_res -= sizeof(ulf);
> +	ticket->t_curr_res -= sizeof(unmount_rec);
>  
>  	/*
>  	 * For external log devices, we need to flush the data device cache
> @@ -2138,6 +2148,8 @@ xlog_write_calc_vec_length(
>  
>  	/* Don't account for regions with embedded ophdrs */
>  	if (optype && headers > 0) {
> +		if (optype & XLOG_UNMOUNT_TRANS)
> +			headers--;
>  		if (optype & XLOG_START_TRANS) {
>  			ASSERT(headers >= 2);
>  			headers -= 2;
> @@ -2352,12 +2364,11 @@ xlog_write(
>  
>  	/*
>  	 * If this is a commit or unmount transaction, we don't need a start
> -	 * record to be written.  We do, however, have to account for the
> -	 * commit or unmount header that gets written. Hence we always have
> -	 * to account for an extra xlog_op_header here for commit and unmount
> -	 * records.
> +	 * record to be written.  We do, however, have to account for the commit
> +	 * header that gets written. Hence we always have to account for an
> +	 * extra xlog_op_header here for commit records.
>  	 */
> -	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
> +	if (optype & XLOG_COMMIT_TRANS)
>  		ticket->t_curr_res -= sizeof(struct xlog_op_header);
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
> @@ -2428,6 +2439,8 @@ xlog_write(
>  				ophdr = reg->i_addr;
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
> +			} else if (optype & XLOG_UNMOUNT_TRANS) {
> +				ophdr = reg->i_addr;
>  			} else {
>  				ophdr = xlog_write_setup_ophdr(log, ptr,
>  							ticket, optype);
> @@ -2458,7 +2471,7 @@ xlog_write(
>  			/*
>  			 * Copy region.
>  			 *
> -			 * Commit and unmount records just log an opheader, so
> +			 * Commit records just log an opheader, so
>  			 * we can have empty payloads with no data region to
>  			 * copy.  Hence we only copy the payload if the vector
>  			 * says it has data to copy.
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 22/45] xfs: embed the xlog_op_header in the commit record
  2021-03-05  5:11 ` [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Dave Chinner
@ 2021-03-09  0:17   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  0:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:20PM +1100, Dave Chinner wrote:
> Remove the final case where xlog_write() has to prepend an opheader
> to a log transaction. Similar to the start record, the commit record
> is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can
> just make this the payload for the region being passed to
> xlog_write() and remove the special handling in xlog_write() for
> the commit record.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks sane enough...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 33 +++++++++++++++------------------
>  1 file changed, 15 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 94711b9ff007..c2e69a1f5cad 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1529,9 +1529,14 @@ xlog_commit_record(
>  	struct xlog_in_core	**iclog,
>  	xfs_lsn_t		*lsn)
>  {
> +	struct xlog_op_header	ophdr = {
> +		.oh_clientid = XFS_TRANSACTION,
> +		.oh_tid = cpu_to_be32(ticket->t_tid),
> +		.oh_flags = XLOG_COMMIT_TRANS,
> +	};
>  	struct xfs_log_iovec reg = {
> -		.i_addr = NULL,
> -		.i_len = 0,
> +		.i_addr = &ophdr,
> +		.i_len = sizeof(struct xlog_op_header),
>  		.i_type = XLOG_REG_TYPE_COMMIT,
>  	};
>  	struct xfs_log_vec vec = {
> @@ -1543,6 +1548,8 @@ xlog_commit_record(
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
> +	/* account for space used by record data */
> +	ticket->t_curr_res -= reg.i_len;
>  	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -2148,11 +2155,10 @@ xlog_write_calc_vec_length(
>  
>  	/* Don't account for regions with embedded ophdrs */
>  	if (optype && headers > 0) {
> -		if (optype & XLOG_UNMOUNT_TRANS)
> -			headers--;
> +		headers--;
>  		if (optype & XLOG_START_TRANS) {
> -			ASSERT(headers >= 2);
> -			headers -= 2;
> +			ASSERT(headers >= 1);
> +			headers--;
>  		}
>  	}
>  
> @@ -2362,14 +2368,6 @@ xlog_write(
>  	int			data_cnt = 0;
>  	int			error = 0;
>  
> -	/*
> -	 * If this is a commit or unmount transaction, we don't need a start
> -	 * record to be written.  We do, however, have to account for the commit
> -	 * header that gets written. Hence we always have to account for an
> -	 * extra xlog_op_header here for commit records.
> -	 */
> -	if (optype & XLOG_COMMIT_TRANS)
> -		ticket->t_curr_res -= sizeof(struct xlog_op_header);
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
>  		     "ctx ticket reservation ran out. Need to up reservation");
> @@ -2433,14 +2431,13 @@ xlog_write(
>  			/*
>  			 * The XLOG_START_TRANS has embedded ophdrs for the
>  			 * start record and transaction header. They will always
> -			 * be the first two regions in the lv chain.
> +			 * be the first two regions in the lv chain. Commit and
> +			 * unmount records also have embedded ophdrs.
>  			 */
> -			if (optype & XLOG_START_TRANS) {
> +			if (optype) {
>  				ophdr = reg->i_addr;
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
> -			} else if (optype & XLOG_UNMOUNT_TRANS) {
> -				ophdr = reg->i_addr;
>  			} else {
>  				ophdr = xlog_write_setup_ophdr(log, ptr,
>  							ticket, optype);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
@ 2021-03-09  0:21   ` Darrick J. Wong
  2021-03-09  1:19     ` Dave Chinner
  2021-03-16 14:51   ` Brian Foster
  1 sibling, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  0:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently set the log ticket client ID when we reserve a
> transaction. This client ID is only ever written to the log by
> a CIL checkpoint or unmount record, and so anything using a high
> level transaction allocated through xfs_trans_alloc() does not need
> a log ticket client ID to be set.
> 
> For the CIL checkpoint, the client ID written to the journal is
> always XFS_TRANSACTION, and for the unmount record it is always
> XFS_LOG, and nothing else writes to the log. All of these operations
> tell xlog_write() exactly what they need to write to the log (the
> optype) and build their own opheaders for start, commit and unmount
> records. Hence we no longer need to set the client id in either the
> log ticket or the xfs_trans.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log.c      | 47 ++++++++-----------------------------------
>  fs/xfs/xfs_log.h      | 16 ++++++---------
>  fs/xfs/xfs_log_cil.c  |  2 +-
>  fs/xfs/xfs_log_priv.h | 10 ++-------
>  fs/xfs/xfs_trans.c    |  6 ++----
>  5 files changed, 19 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index c2e69a1f5cad..429cb1e7cc67 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -431,10 +431,9 @@ xfs_log_regrant(
>  int
>  xfs_log_reserve(
>  	struct xfs_mount	*mp,
> -	int		 	unit_bytes,
> -	int		 	cnt,
> +	int			unit_bytes,
> +	int			cnt,
>  	struct xlog_ticket	**ticp,
> -	uint8_t		 	client,
>  	bool			permanent)
>  {
>  	struct xlog		*log = mp->m_log;
> @@ -442,15 +441,13 @@ xfs_log_reserve(
>  	int			need_bytes;
>  	int			error = 0;
>  
> -	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
> -
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
>  	XFS_STATS_INC(mp, xs_try_logspace);
>  
>  	ASSERT(*ticp == NULL);
> -	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent);
> +	tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> @@ -847,7 +844,7 @@ xlog_unmount_write(
>  	struct xlog_ticket	*tic = NULL;
>  	int			error;
>  
> -	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> +	error = xfs_log_reserve(mp, 600, 1, &tic, 0);
>  	if (error)
>  		goto out_err;
>  
> @@ -2170,35 +2167,13 @@ xlog_write_calc_vec_length(
>  
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
> -	struct xlog		*log,
>  	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -	ophdr->oh_clientid = ticket->t_clientid;
> +	ophdr->oh_clientid = XFS_TRANSACTION;
>  	ophdr->oh_res2 = 0;
> -
> -	/* are we copying a commit or unmount record? */
> -	ophdr->oh_flags = flags;
> -
> -	/*
> -	 * We've seen logs corrupted with bad transaction client ids.  This
> -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> -	 * and shut down the filesystem.
> -	 */
> -	switch (ophdr->oh_clientid)  {
> -	case XFS_TRANSACTION:
> -	case XFS_VOLUME:

Reading between the lines, I'm guessing this clientid is some
now-vestigial organ from the Irix days, where there was some kind of
volume manager (in addition to the filesystem + log)?  And between the
three, there was a need to dispatch recovered log ops to the correct
subsystem?

> -	case XFS_LOG:
> -		break;
> -	default:
> -		xfs_warn(log->l_mp,
> -			"Bad XFS transaction clientid 0x%x in ticket "PTR_FMT,
> -			ophdr->oh_clientid, ticket);
> -		return NULL;
> -	}
> -
> +	ophdr->oh_flags = 0;
>  	return ophdr;
>  }
>  
> @@ -2439,11 +2414,7 @@ xlog_write(
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
>  			} else {
> -				ophdr = xlog_write_setup_ophdr(log, ptr,
> -							ticket, optype);
> -				if (!ophdr)
> -					return -EIO;
> -
> +                                ophdr = xlog_write_setup_ophdr(ptr, ticket);

Nit: use tabs, not spaces.

With that fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
> @@ -3499,7 +3470,6 @@ xlog_ticket_alloc(
>  	struct xlog		*log,
>  	int			unit_bytes,
>  	int			cnt,
> -	char			client,
>  	bool			permanent)
>  {
>  	struct xlog_ticket	*tic;
> @@ -3517,7 +3487,6 @@ xlog_ticket_alloc(
>  	tic->t_cnt		= cnt;
>  	tic->t_ocnt		= cnt;
>  	tic->t_tid		= prandom_u32();
> -	tic->t_clientid		= client;
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 1bd080ce3a95..c0c3141944ea 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -117,16 +117,12 @@ int	  xfs_log_mount_finish(struct xfs_mount *mp);
>  void	xfs_log_mount_cancel(struct xfs_mount *);
>  xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
>  xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
> -void	  xfs_log_space_wake(struct xfs_mount *mp);
> -int	  xfs_log_reserve(struct xfs_mount *mp,
> -			  int		   length,
> -			  int		   count,
> -			  struct xlog_ticket **ticket,
> -			  uint8_t		   clientid,
> -			  bool		   permanent);
> -int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> -void      xfs_log_unmount(struct xfs_mount *mp);
> -int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
> +void	xfs_log_space_wake(struct xfs_mount *mp);
> +int	xfs_log_reserve(struct xfs_mount *mp, int length, int count,
> +			struct xlog_ticket **ticket, bool permanent);
> +int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> +void	xfs_log_unmount(struct xfs_mount *mp);
> +int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
>  bool	xfs_log_writable(struct xfs_mount *mp);
>  
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index e9da074ecd69..0c81c13e2cf6 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
>  {
>  	struct xlog_ticket *tic;
>  
> -	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0);
> +	tic = xlog_ticket_alloc(log, 0, 1, 0);
>  
>  	/*
>  	 * set the current reservation to zero so we know to steal the basic
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index bb5fa6b71114..7f601c1c9f45 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -158,7 +158,6 @@ typedef struct xlog_ticket {
>  	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
>  	char		   t_ocnt;	 /* original count		 : 1  */
>  	char		   t_cnt;	 /* current count		 : 1  */
> -	char		   t_clientid;	 /* who does this belong to;	 : 1  */
>  	char		   t_flags;	 /* properties of reservation	 : 1  */
>  
>          /* reservation array fields */
> @@ -465,13 +464,8 @@ extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
>  			    char *dp, int size);
>  
>  extern kmem_zone_t *xfs_log_ticket_zone;
> -struct xlog_ticket *
> -xlog_ticket_alloc(
> -	struct xlog	*log,
> -	int		unit_bytes,
> -	int		count,
> -	char		client,
> -	bool		permanent);
> +struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes,
> +		int count, bool permanent);
>  
>  static inline void
>  xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 52f3fdf1e0de..83c2b7f22eb7 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -194,11 +194,9 @@ xfs_trans_reserve(
>  			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
>  			error = xfs_log_regrant(mp, tp->t_ticket);
>  		} else {
> -			error = xfs_log_reserve(mp,
> -						resp->tr_logres,
> +			error = xfs_log_reserve(mp, resp->tr_logres,
>  						resp->tr_logcount,
> -						&tp->t_ticket, XFS_TRANSACTION,
> -						permanent);
> +						&tp->t_ticket, permanent);
>  		}
>  
>  		if (error)
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-08 12:25   ` Brian Foster
@ 2021-03-09  1:13     ` Dave Chinner
  2021-03-10 20:49       ` Brian Foster
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-09  1:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 07:25:26AM -0500, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:06PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. THe two ordering constraints are:
....
> > The rm -rf times are included because I ran them, but the
> > differences are largely noise. This workload is largely metadata
> > read IO latency bound and the changes to the journal cache flushing
> > doesn't really make any noticable difference to behaviour apart from
> > a reduction in noiclog events from background CIL pushing.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Thoughts on my previous feedback to this patch, particularly the locking
> bits..? I thought I saw a subsequent patch somewhere that increased the
> parallelism of this code..

I seem to have missed that email, too.

I guess you are referring to these two hunks:

> > @@ -2416,10 +2408,21 @@ xlog_write(
> >  		ASSERT(log_offset <= iclog->ic_size - 1);
> >  		ptr = iclog->ic_datap + log_offset;
> >  
> > -		/* start_lsn is the first lsn written to. That's all we need. */
> > +		/* Start_lsn is the first lsn written to. */
> >  		if (start_lsn && !*start_lsn)
> >  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> >  
> > +		/*
> > +		 * iclogs containing commit records or unmount records need
> > +		 * to issue ordering cache flushes and commit immediately
> > +		 * to stable storage to guarantee journal vs metadata ordering
> > +		 * is correctly maintained in the storage media.
> > +		 */
> > +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> > +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> > +						XLOG_ICL_NEED_FUA);
> > +		}
> > +
> >  		/*
> >  		 * This loop writes out as many regions as can fit in the amount
> >  		 * of space which was allocated by xlog_state_get_iclog_space().
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index c04d5d37a3a2..263c8d907221 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -896,11 +896,16 @@ xlog_cil_push_work(
> >  
> >  	/*
> >  	 * If the checkpoint spans multiple iclogs, wait for all previous
> > -	 * iclogs to complete before we submit the commit_iclog.
> > +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> > +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> > +	 * cache flush because there are no other iclogs we need to order
> > +	 * against.
> >  	 */
> >  	if (ctx->start_lsn != commit_lsn) {
> >  		spin_lock(&log->l_icloglock);
> >  		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > +	} else {
> > +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> >  	}

.... that set/clear the flags on the iclog?  Yes, they probably
should be atomic.

On second thoughts, we can't just clear XLOG_ICL_NEED_FLUSH here
because there may be multiple commit records on this iclog and a
previous one might require the flush. I'll just remove this
optimisation from the patch right now, because it's more complex
than it initially seemed.
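
FWIW, if these flag updates do get converted to atomics down the
track, I'd expect plain bitops to do the job. Sketch only - it assumes
the XLOG_ICL_* flags get redefined as bit numbers rather than masks
and ic_flags becomes an unsigned long:

	set_bit(XLOG_ICL_NEED_FLUSH, &iclog->ic_flags);
	set_bit(XLOG_ICL_NEED_FUA, &iclog->ic_flags);
	...
	clear_bit(XLOG_ICL_NEED_FLUSH, &iclog->ic_flags);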

And looking at the aggregated code that I have now (including the
stuff I haven't sent out), the need for xlog_write() to set the
flush flags on the iclog is gone. This is because the unmount record
flushes the iclog directly itself so it can add flags there, and
the iclog that the commit record is written to is returned to the
caller.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-09  0:21   ` Darrick J. Wong
@ 2021-03-09  1:19     ` Dave Chinner
  2021-03-09  1:48       ` Darrick J. Wong
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-09  1:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 04:21:34PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:21PM +1100, Dave Chinner wrote:
> >  static xlog_op_header_t *
> >  xlog_write_setup_ophdr(
> > -	struct xlog		*log,
> >  	struct xlog_op_header	*ophdr,
> > -	struct xlog_ticket	*ticket,
> > -	uint			flags)
> > +	struct xlog_ticket	*ticket)
> >  {
> >  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > -	ophdr->oh_clientid = ticket->t_clientid;
> > +	ophdr->oh_clientid = XFS_TRANSACTION;
> >  	ophdr->oh_res2 = 0;
> > -
> > -	/* are we copying a commit or unmount record? */
> > -	ophdr->oh_flags = flags;
> > -
> > -	/*
> > -	 * We've seen logs corrupted with bad transaction client ids.  This
> > -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> > -	 * and shut down the filesystem.
> > -	 */
> > -	switch (ophdr->oh_clientid)  {
> > -	case XFS_TRANSACTION:
> > -	case XFS_VOLUME:
> 
> Reading between the lines, I'm guessing this clientid is some
> now-vestigial organ from the Irix days, where there was some kind of
> volume manager (in addition to the filesystem + log)?  And between the
> three, there was a need to dispatch recovered log ops to the correct
> subsystem?

I guess that was the original thought. It was included in the
initial commit of the log code to XFS in 1993 and never, ever used
in any code anywhere. So it's never been written to an XFS log,
ever.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-09  1:19     ` Dave Chinner
@ 2021-03-09  1:48       ` Darrick J. Wong
  2021-03-11  3:01         ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  1:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 09, 2021 at 12:19:56PM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 04:21:34PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:21PM +1100, Dave Chinner wrote:
> > >  static xlog_op_header_t *
> > >  xlog_write_setup_ophdr(
> > > -	struct xlog		*log,
> > >  	struct xlog_op_header	*ophdr,
> > > -	struct xlog_ticket	*ticket,
> > > -	uint			flags)
> > > +	struct xlog_ticket	*ticket)
> > >  {
> > >  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > > -	ophdr->oh_clientid = ticket->t_clientid;
> > > +	ophdr->oh_clientid = XFS_TRANSACTION;
> > >  	ophdr->oh_res2 = 0;
> > > -
> > > -	/* are we copying a commit or unmount record? */
> > > -	ophdr->oh_flags = flags;
> > > -
> > > -	/*
> > > -	 * We've seen logs corrupted with bad transaction client ids.  This
> > > -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> > > -	 * and shut down the filesystem.
> > > -	 */
> > > -	switch (ophdr->oh_clientid)  {
> > > -	case XFS_TRANSACTION:
> > > -	case XFS_VOLUME:
> > 
> > Reading between the lines, I'm guessing this clientid is some
> > now-vestigial organ from the Irix days, where there was some kind of
> > volume manager (in addition to the filesystem + log)?  And between the
> > three, there was a need to dispatch recovered log ops to the correct
> > subsystem?
> 
> I guess that was the original thought. It was included in the
> initial commit of the log code to XFS in 1993 and never, ever used
> in any code anywhere. So it's never been written to an XFS log,
> ever.

In that case, can you get rid of the #define too, please?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-08 23:38     ` Dave Chinner
@ 2021-03-09  1:55       ` Darrick J. Wong
  2021-03-09 22:35         ` Andi Kleen
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  1:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 09, 2021 at 10:38:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 03:14:32PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Because we use a single work structure attached to the CIL rather
> > > than the CIL context, we can only queue a single work item at a
> > > time. This results in the CIL being single threaded and limits
> > > performance when it becomes CPU bound.
> > > 
> > > The design of the CIL is that it is pipelined and multiple commits
> > > can be running concurrently, but the way the work is currently
> > > implemented means that it is not pipelining as it was intended. The
> > > critical work to switch the CIL context can take a few milliseconds
> > > to run, but the rest of the CIL context flush can take hundreds of
> > > milliseconds to complete. The context switching is the serialisation
> > > point of the CIL, once the context has been switched the rest of the
> > > context push can run asynchronously with all other context pushes.
> > > 
> > > Hence we can move the work to the CIL context so that we can run
> > > multiple CIL pushes at the same time and spread the majority of
> > > the work out over multiple CPUs. We can keep the per-cpu CIL commit
> > > state on the CIL rather than the context, because the context is
> > > pinned to the CIL until the switch is done and we aggregate and
> > > drain the per-cpu state held on the CIL during the context switch.
> > > 
> > > However, because we no longer serialise the CIL work, we can have
> > > effectively unlimited CIL pushes in progress. We don't want to do
> > > this - not only does it create contention on the iclogs and the
> > > state machine locks, we can run the log right out of space with
> > > outstanding pushes. Instead, limit the work concurrency to 4
> > > concurrent works being processed at a time. This is enough
> > 
> > Four?  Was that determined experimentally, or is that a fundamental
> > limit of how many cil checkpoints we can working on at a time?  The
> > current one, the previous one, and ... something else that was already
> > in progress?
> 
> No fundamental limit, but....
> 
> > > concurrency to remove the CIL from being a CPU bound bottleneck but
> > > not enough to create new contention points or unbound concurrency
> > > issues.
> 
> spinlocks in well written code scale linearly to 3-4 CPUs banging on
> them frequently.  Beyond that they start to show non-linear
> behaviour before they break down completely at somewhere between
> 8-16 threads banging on them. If we have 4 CIL writes going on, we
> have 4 CPUs banging on the log->l_icloglock through xlog_write()
> through xlog_state_get_iclog_space() and then releasing the iclogs
> when they are full. We then have iclog IO completion banging on the
> icloglock to serialise completion can change iclog state on
> completion.
> 
> Hence with 4 CIL push workers, we're starting to get back to the point
> where the icloglock will start to see non-linear access cost. This
> was a problem before delayed logging removed the icloglock from the
> front end transaction commit path where it could see unbound
> concurrency and was the hottest lock in the log.
> 
> Allowing a limited amount of concurrency prevents us from
> unnecessarily allowing wasteful and performance limiting lock
> contention from occurring. And given that I'm only hitting the
> single CPU limit of the CIL push when there's 31 other CPUs all
> running transactions flat out, having 4 CPUs to run the same work is
> more than enough. Especially as those 31 other CPUs running
> transactions are already pushing VFS level spinlocks
> (sb->sb_inode_list_lock, dentry ref count locking, etc) to breakdown
> point so we're not going to be able to push enough change into the
> CIL to keep 4 CPUs fully busy any time soon.

It might be nice to leave that as a breadcrumb, then, in case the
spinlock scalability problems ever get solved.

	/*
	 * Limit ourselves to 4 CIL push workers per log to avoid
	 * excessive contention of the icloglock spinlock.
	 */
	wq = alloc_workqueue(..., 4, ...);
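
Spelled out a bit further - names and flags here are from memory, so
treat this as a sketch rather than the exact call site:

	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
			WQ_FREEZABLE | WQ_MEM_RECLAIM, 4,
			mp->m_super->s_id);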

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 24/45] xfs: move log iovec alignment to preparation function
  2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
@ 2021-03-09  2:14   ` Darrick J. Wong
  2021-03-16 14:51   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:22PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To include log op headers directly into the log iovec regions that
> the ophdrs wrap, we need to move the buffer alignment code from
> xlog_finish_iovec() to xlog_prepare_iovec(). This is because the
> xlog_op_header is only 12 bytes long, and we need the buffer that
> the caller formats their data into to be 8 byte aligned.
> 
> Hence once we start prepending the ophdr in xlog_prepare_iovec(), we
> are going to need to manage the padding directly to ensure that the
> buffer pointer returned is correctly aligned.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Ok, now that I grok what's going on in the /next/ patch, this makes
sense to me as the way into the next patch.
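
To make the alignment arithmetic concrete (my worked example, not from
the patch): an empty lv has lv_buf_len = 0, so len = 0 + 12 for the
ophdr; round_up(12, 8) = 16, so lv_buf_len becomes 16 - 12 = 4.  The
ophdr then occupies offsets 4-15 and the buffer returned to the caller
starts at offset 16, i.e. 8 byte aligned and packed hard against the
tail of the 12 byte ophdr.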

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.h | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index c0c3141944ea..1ca4f2edbdaf 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -21,6 +21,16 @@ struct xfs_log_vec {
>  
>  #define XFS_LOG_VEC_ORDERED	(-1)
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + */
>  static inline void *
>  xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		uint type)
> @@ -34,6 +44,9 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		vec = &lv->lv_iovecp[0];
>  	}
>  
> +	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
> +		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
> +
>  	vec->i_type = type;
>  	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
>  
> @@ -43,20 +56,10 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  	return vec->i_addr;
>  }
>  
> -/*
> - * We need to make sure the next buffer is naturally aligned for the biggest
> - * basic data type we put into it.  We already accounted for this padding when
> - * sizing the buffer.
> - *
> - * However, this padding does not get written into the log, and hence we have to
> - * track the space used by the log vectors separately to prevent log space hangs
> - * due to inaccurate accounting (i.e. a leak) of the used log space through the
> - * CIL context ticket.
> - */
>  static inline void
>  xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
>  {
> -	lv->lv_buf_len += round_up(len, sizeof(uint64_t));
> +	lv->lv_buf_len += len;
>  	lv->lv_bytes += len;
>  	vec->i_len = len;
>  }
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
@ 2021-03-09  2:21   ` Darrick J. Wong
  2021-03-11  3:29     ` Dave Chinner
  2021-03-16 14:53   ` Brian Foster
  1 sibling, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Current xlog_write() adds op headers to the log manually for every
> log item region that is in the vector passed to it. While
> xlog_write() needs to stamp the transaction ID into the ophdr, we
> already know its length, flags, clientid, etc at CIL commit time.
> 
> This means the only time that xlog write really needs to format and
> reserve space for a new ophdr is when a region is split across two
> iclogs. Adding the opheader and accounting for it as part of the
> normal formatted item region means we simplify the accounting
> of space used by a transaction and we don't have to special case
> reserving of space in for the ophdrs in xlog_write(). It also means
> we can largely initialise the ophdr in transaction commit instead
> of xlog_write, making the xlog_write formatting inner loop much
> tighter.
> 
> xlog_prepare_iovec() is now too large to stay as an inline function,
> so we move it out of line and into xfs_log.c.
> 
> Object sizes:
> text	   data	    bss	    dec	    hex	filename
> 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> 
> So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> out of line, even though it grew in size itself.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Sooo... if I understand this part of the patchset correctly, the goal
here is to simplify and shorten the inner loop of xlog_write. Callers
are now required to create their own log op headers at the start of the
xfs_log_iovec chain in the xfs_log_vec, which means that the only time
xlog_write has to create an ophdr is when we fill up the current iclog
and must continue in a new one, because that's not something the callers
should ever have to know about.  Correct?
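
IOWs (my sketch of the call pattern, not code from the patch itself,
and given lv, data and size from the log item), a ->format method ends
up doing something like:

	struct xfs_log_iovec	*vecp = NULL;
	void			*buf;

	/* returns the data region placed just after the embedded ophdr */
	buf = xlog_prepare_iovec(lv, &vecp, XLOG_REG_TYPE_ICORE);
	memcpy(buf, data, size);
	xlog_finish_iovec(lv, vecp, size);

with the ophdr placement and initialisation hidden inside the two
helpers.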

If so,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

It /really/ would have been nice to have kept these patches separated by
major functional change area (i.e. separate series) instead of one
gigantic 45-patch behemoth to intimidate the reviewers...

--D

> ---
>  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_log.h     |  42 +++-------------
>  fs/xfs/xfs_log_cil.c |  25 +++++-----
>  3 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 429cb1e7cc67..98de45be80c0 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -89,6 +89,62 @@ xlog_iclogs_empty(
>  static int
>  xfs_log_cover(struct xfs_mount *);
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + *
> + * We also add space for the xlog_op_header that describes this region in the
> + * log. This prepends the data region we return to the caller to copy their data
> + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> + * is not 8 byte aligned, we have to be careful to ensure that we align the
> + * start of the buffer such that the region we return to the call is 8 byte
> + * aligned and packed against the tail of the ophdr.
> + */
> +void *
> +xlog_prepare_iovec(
> +	struct xfs_log_vec	*lv,
> +	struct xfs_log_iovec	**vecp,
> +	uint			type)
> +{
> +	struct xfs_log_iovec	*vec = *vecp;
> +	struct xlog_op_header	*oph;
> +	uint32_t		len;
> +	void			*buf;
> +
> +	if (vec) {
> +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> +		vec++;
> +	} else {
> +		vec = &lv->lv_iovecp[0];
> +	}
> +
> +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> +					sizeof(struct xlog_op_header);
> +	}
> +
> +	vec->i_type = type;
> +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> +
> +	oph = vec->i_addr;
> +	oph->oh_clientid = XFS_TRANSACTION;
> +	oph->oh_res2 = 0;
> +	oph->oh_flags = 0;
> +
> +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
> +
> +	*vecp = vec;
> +	return buf;
> +}
> +
>  static void
>  xlog_grant_sub_space(
>  	struct xlog		*log,
> @@ -2120,9 +2176,9 @@ xlog_print_trans(
>  }
>  
>  /*
> - * Calculate the potential space needed by the log vector. If this is a start
> - * transaction, the caller has already accounted for both opheaders in the start
> - * transaction, so we don't need to account for them here.
> + * Calculate the potential space needed by the log vector. All regions contain
> + * their own opheaders and they are accounted for in region space so we don't
> + * need to add them to the vector length here.
>   */
>  static int
>  xlog_write_calc_vec_length(
> @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
>  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
>  		}
>  	}
> -
> -	/* Don't account for regions with embedded ophdrs */
> -	if (optype && headers > 0) {
> -		headers--;
> -		if (optype & XLOG_START_TRANS) {
> -			ASSERT(headers >= 1);
> -			headers--;
> -		}
> -	}
> -
>  	ticket->t_res_num_ophdrs += headers;
> -	len += headers * sizeof(struct xlog_op_header);
>  
>  	return len;
>  }
> @@ -2170,7 +2215,6 @@ xlog_write_setup_ophdr(
>  	struct xlog_op_header	*ophdr,
>  	struct xlog_ticket	*ticket)
>  {
> -	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
>  	ophdr->oh_clientid = XFS_TRANSACTION;
>  	ophdr->oh_res2 = 0;
>  	ophdr->oh_flags = 0;
> @@ -2404,21 +2448,25 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * The XLOG_START_TRANS has embedded ophdrs for the
> -			 * start record and transaction header. They will always
> -			 * be the first two regions in the lv chain. Commit and
> -			 * unmount records also have embedded ophdrs.
> +			 * Regions always have their ophdr at the start of the
> +			 * region, except for:
> +			 * - a transaction start which has a start record ophdr
> +			 *   before the first region ophdr; and
> +			 * - the previous region didn't fully fit into an iclog
> +			 *   so needs a continuation ophdr to prepend the region
> +			 *   in this new iclog.
>  			 */
> -			if (optype) {
> -				ophdr = reg->i_addr;
> -				if (index)
> -					optype &= ~XLOG_START_TRANS;
> -			} else {
> +			ophdr = reg->i_addr;
> +			if (optype && index) {
> +				optype &= ~XLOG_START_TRANS;
> +			} else if (partial_copy) {
>                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
>  			}
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
> @@ -2436,20 +2484,11 @@ xlog_write(
>  				ophdr->oh_len = cpu_to_be32(copy_len -
>  						sizeof(struct xlog_op_header));
>  			}
> -			/*
> -			 * Copy region.
> -			 *
> -			 * Commit records just log an opheader, so
> -			 * we can have empty payloads with no data region to
> -			 * copy.  Hence we only copy the payload if the vector
> -			 * says it has data to copy.
> -			 */
> -			ASSERT(copy_len >= 0);
> -			if (copy_len > 0) {
> -				memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						   copy_len);
> -			}
> +
> +			ASSERT(copy_len > 0);
> +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> +
>  			if (added_ophdr)
>  				copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 1ca4f2edbdaf..af54ea3f8c90 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -21,44 +21,18 @@ struct xfs_log_vec {
>  
>  #define XFS_LOG_VEC_ORDERED	(-1)
>  
> -/*
> - * We need to make sure the buffer pointer returned is naturally aligned for the
> - * biggest basic data type we put into it. We have already accounted for this
> - * padding when sizing the buffer.
> - *
> - * However, this padding does not get written into the log, and hence we have to
> - * track the space used by the log vectors separately to prevent log space hangs
> - * due to inaccurate accounting (i.e. a leak) of the used log space through the
> - * CIL context ticket.
> - */
> -static inline void *
> -xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
> -		uint type)
> -{
> -	struct xfs_log_iovec *vec = *vecp;
> -
> -	if (vec) {
> -		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> -		vec++;
> -	} else {
> -		vec = &lv->lv_iovecp[0];
> -	}
> -
> -	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
> -		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
> -
> -	vec->i_type = type;
> -	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> -
> -	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));
> -
> -	*vecp = vec;
> -	return vec->i_addr;
> -}
> +void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
> +		uint type);
>  
>  static inline void
>  xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
>  {
> +	struct xlog_op_header	*oph = vec->i_addr;
> +
> +	/* opheader tracks payload length, logvec tracks region length */
> +	oph->oh_len = cpu_to_be32(len);
> +
> +	len += sizeof(struct xlog_op_header);
>  	lv->lv_buf_len += len;
>  	lv->lv_bytes += len;
>  	vec->i_len = len;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0c81c13e2cf6..7a5e6bdb7876 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
>  		}
>  
>  		/*
> -		 * We 64-bit align the length of each iovec so that the start
> -		 * of the next one is naturally aligned.  We'll need to
> -		 * account for that slack space here. Then round nbytes up
> -		 * to 64-bit alignment so that the initial buffer alignment is
> -		 * easy to calculate and verify.
> +		 * We 64-bit align the length of each iovec so that the start of
> +		 * the next one is naturally aligned.  We'll need to account for
> +		 * that slack space here.
> +		 *
> +		 * We also add the xlog_op_header to each region when
> +		 * formatting, but that's not accounted to the size of the item
> +		 * at this point. Hence we'll need an additional number of bytes
> +		 * for each vector to hold an opheader.
> +		 *
> +		 * Then round nbytes up to 64-bit alignment so that the initial
> +		 * buffer alignment is easy to calculate and verify.
>  		 */
> -		nbytes += niovecs * sizeof(uint64_t);
> +		nbytes += niovecs *
> +			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
>  		nbytes = round_up(nbytes, sizeof(uint64_t));
>  
>  		/*
> @@ -433,11 +440,6 @@ xlog_cil_insert_items(
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> -	/* account for space used by new iovec headers  */
> -	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> -	len += iovhdr_res;
> -	ctx->nvecs += diff_iovecs;
> -
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> @@ -469,6 +471,7 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> +	ctx->nvecs += diff_iovecs;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/45] xfs: log ticket region debug is largely useless
  2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
@ 2021-03-09  2:31   ` Darrick J. Wong
  2021-03-16 14:55   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:24PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xlog_tic_add_region() is used to trace the regions being added to a
> log ticket to provide information in the situation where a ticket
> reservation overrun occurs. The information gathered is stored in
> the ticket, and dumped if xlog_print_tic_res() is called.
> 
> For a front end struct xfs_trans overrun, the ticket only contains
> reservation tracking information - the ticket is never handed to the
> log so has no regions attached to it. The overrun debug information in this
> case comes from xlog_print_trans(), which walks the items attached
> to the transaction and dumps their attached formatted log vectors
> directly. It also dumps the ticket state, but that only contains
> reservation accounting and nothing else. Hence xlog_print_tic_res()
> never dumps region or overrun information from this path.
> 
> xlog_tic_add_region() is actually called from xlog_write(), which
> means it is being used to track the regions seen in a
> CIL checkpoint log vector chain. In looking at CIL behaviour
> recently, I've seen 32MB checkpoints regularly exceed 250,000
> regions in the LV chain. The log ticket debug code can track *15*

Yikes.  I /had/ noticed that the amount of overrun ledger info didn't
seem to come anywhere close to the numbers in the accounting data.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> regions. IOWs, if there is a ticket overrun in the CIL code, the
> ticket region tracking code is going to be completely useless for
> determining what went wrong. The only thing it can tell us is how
> much of an overrun occurred, and we really don't need extra debug
> information in the log ticket to tell us that.
> 
> Indeed, the main place we call xlog_tic_add_region() is also adding
> up the number of regions and the space used so that xlog_write()
> knows how much will be written to the log. This is exactly the same
> information that log ticket is storing once we take away the useless
> region tracking array. Hence xlog_tic_add_region() is not useful,
> but can be called 250,000 times a CIL push...
> 
> Just strip all that debug "information" out of the of the log ticket
> and only have it report reservation space information when an
> overrun occurs. This also reduces the size of a log ticket down by
> about 150 bytes...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log.c      | 107 +++---------------------------------------
>  fs/xfs/xfs_log_priv.h |  17 -------
>  2 files changed, 6 insertions(+), 118 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 98de45be80c0..412b167d8d0e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -377,30 +377,6 @@ xlog_grant_head_check(
>  	return error;
>  }
>  
> -static void
> -xlog_tic_reset_res(xlog_ticket_t *tic)
> -{
> -	tic->t_res_num = 0;
> -	tic->t_res_arr_sum = 0;
> -	tic->t_res_num_ophdrs = 0;
> -}
> -
> -static void
> -xlog_tic_add_region(xlog_ticket_t *tic, uint len, uint type)
> -{
> -	if (tic->t_res_num == XLOG_TIC_LEN_MAX) {
> -		/* add to overflow and start again */
> -		tic->t_res_o_flow += tic->t_res_arr_sum;
> -		tic->t_res_num = 0;
> -		tic->t_res_arr_sum = 0;
> -	}
> -
> -	tic->t_res_arr[tic->t_res_num].r_len = len;
> -	tic->t_res_arr[tic->t_res_num].r_type = type;
> -	tic->t_res_arr_sum += len;
> -	tic->t_res_num++;
> -}
> -
>  bool
>  xfs_log_writable(
>  	struct xfs_mount	*mp)
> @@ -448,8 +424,6 @@ xfs_log_regrant(
>  	xlog_grant_push_ail(log, tic->t_unit_res);
>  
>  	tic->t_curr_res = tic->t_unit_res;
> -	xlog_tic_reset_res(tic);
> -
>  	if (tic->t_cnt > 0)
>  		return 0;
>  
> @@ -2066,63 +2040,11 @@ xlog_print_tic_res(
>  	struct xfs_mount	*mp,
>  	struct xlog_ticket	*ticket)
>  {
> -	uint i;
> -	uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t);
> -
> -	/* match with XLOG_REG_TYPE_* in xfs_log.h */
> -#define REG_TYPE_STR(type, str)	[XLOG_REG_TYPE_##type] = str
> -	static char *res_type_str[] = {
> -	    REG_TYPE_STR(BFORMAT, "bformat"),
> -	    REG_TYPE_STR(BCHUNK, "bchunk"),
> -	    REG_TYPE_STR(EFI_FORMAT, "efi_format"),
> -	    REG_TYPE_STR(EFD_FORMAT, "efd_format"),
> -	    REG_TYPE_STR(IFORMAT, "iformat"),
> -	    REG_TYPE_STR(ICORE, "icore"),
> -	    REG_TYPE_STR(IEXT, "iext"),
> -	    REG_TYPE_STR(IBROOT, "ibroot"),
> -	    REG_TYPE_STR(ILOCAL, "ilocal"),
> -	    REG_TYPE_STR(IATTR_EXT, "iattr_ext"),
> -	    REG_TYPE_STR(IATTR_BROOT, "iattr_broot"),
> -	    REG_TYPE_STR(IATTR_LOCAL, "iattr_local"),
> -	    REG_TYPE_STR(QFORMAT, "qformat"),
> -	    REG_TYPE_STR(DQUOT, "dquot"),
> -	    REG_TYPE_STR(QUOTAOFF, "quotaoff"),
> -	    REG_TYPE_STR(LRHEADER, "LR header"),
> -	    REG_TYPE_STR(UNMOUNT, "unmount"),
> -	    REG_TYPE_STR(COMMIT, "commit"),
> -	    REG_TYPE_STR(TRANSHDR, "trans header"),
> -	    REG_TYPE_STR(ICREATE, "inode create"),
> -	    REG_TYPE_STR(RUI_FORMAT, "rui_format"),
> -	    REG_TYPE_STR(RUD_FORMAT, "rud_format"),
> -	    REG_TYPE_STR(CUI_FORMAT, "cui_format"),
> -	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
> -	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
> -	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
> -	};
> -	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
> -#undef REG_TYPE_STR
> -
>  	xfs_warn(mp, "ticket reservation summary:");
> -	xfs_warn(mp, "  unit res    = %d bytes",
> -		 ticket->t_unit_res);
> -	xfs_warn(mp, "  current res = %d bytes",
> -		 ticket->t_curr_res);
> -	xfs_warn(mp, "  total reg   = %u bytes (o/flow = %u bytes)",
> -		 ticket->t_res_arr_sum, ticket->t_res_o_flow);
> -	xfs_warn(mp, "  ophdrs      = %u (ophdr space = %u bytes)",
> -		 ticket->t_res_num_ophdrs, ophdr_spc);
> -	xfs_warn(mp, "  ophdr + reg = %u bytes",
> -		 ticket->t_res_arr_sum + ticket->t_res_o_flow + ophdr_spc);
> -	xfs_warn(mp, "  num regions = %u",
> -		 ticket->t_res_num);
> -
> -	for (i = 0; i < ticket->t_res_num; i++) {
> -		uint r_type = ticket->t_res_arr[i].r_type;
> -		xfs_warn(mp, "region[%u]: %s - %u bytes", i,
> -			    ((r_type <= 0 || r_type > XLOG_REG_TYPE_MAX) ?
> -			    "bad-rtype" : res_type_str[r_type]),
> -			    ticket->t_res_arr[i].r_len);
> -	}
> +	xfs_warn(mp, "  unit res    = %d bytes", ticket->t_unit_res);
> +	xfs_warn(mp, "  current res = %d bytes", ticket->t_curr_res);
> +	xfs_warn(mp, "  original count  = %d", ticket->t_ocnt);
> +	xfs_warn(mp, "  remaining count = %d", ticket->t_cnt);
>  }
>  
>  /*
> @@ -2187,7 +2109,6 @@ xlog_write_calc_vec_length(
>  	uint			optype)
>  {
>  	struct xfs_log_vec	*lv;
> -	int			headers = 0;
>  	int			len = 0;
>  	int			i;
>  
> @@ -2196,17 +2117,9 @@ xlog_write_calc_vec_length(
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
>  			continue;
>  
> -		headers += lv->lv_niovecs;
> -
> -		for (i = 0; i < lv->lv_niovecs; i++) {
> -			struct xfs_log_iovec	*vecp = &lv->lv_iovecp[i];
> -
> -			len += vecp->i_len;
> -			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
> -		}
> +		for (i = 0; i < lv->lv_niovecs; i++)
> +			len += lv->lv_iovecp[i].i_len;
>  	}
> -	ticket->t_res_num_ophdrs += headers;
> -
>  	return len;
>  }
>  
> @@ -2265,7 +2178,6 @@ xlog_write_setup_copy(
>  
>  	/* account for new log op header */
>  	ticket->t_curr_res -= sizeof(struct xlog_op_header);
> -	ticket->t_res_num_ophdrs++;
>  
>  	return sizeof(struct xlog_op_header);
>  }
> @@ -2973,9 +2885,6 @@ xlog_state_get_iclog_space(
>  	 */
>  	if (log_offset == 0) {
>  		ticket->t_curr_res -= log->l_iclog_hsize;
> -		xlog_tic_add_region(ticket,
> -				    log->l_iclog_hsize,
> -				    XLOG_REG_TYPE_LRHEADER);
>  		head->h_cycle = cpu_to_be32(log->l_curr_cycle);
>  		head->h_lsn = cpu_to_be64(
>  			xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block));
> @@ -3055,7 +2964,6 @@ xfs_log_ticket_regrant(
>  	xlog_grant_sub_space(log, &log->l_write_head.grant,
>  					ticket->t_curr_res);
>  	ticket->t_curr_res = ticket->t_unit_res;
> -	xlog_tic_reset_res(ticket);
>  
>  	trace_xfs_log_ticket_regrant_sub(log, ticket);
>  
> @@ -3066,7 +2974,6 @@ xfs_log_ticket_regrant(
>  		trace_xfs_log_ticket_regrant_exit(log, ticket);
>  
>  		ticket->t_curr_res = ticket->t_unit_res;
> -		xlog_tic_reset_res(ticket);
>  	}
>  
>  	xfs_log_ticket_put(ticket);
> @@ -3529,8 +3436,6 @@ xlog_ticket_alloc(
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> -	xlog_tic_reset_res(tic);
> -
>  	return tic;
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 7f601c1c9f45..8ee6a5f74396 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -139,16 +139,6 @@ enum xlog_iclog_state {
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  
> -/*
> - * Reservation region
> - * As would be stored in xfs_log_iovec but without the i_addr which
> - * we don't care about.
> - */
> -typedef struct xlog_res {
> -	uint	r_len;	/* region length		:4 */
> -	uint	r_type;	/* region's transaction type	:4 */
> -} xlog_res_t;
> -
>  typedef struct xlog_ticket {
>  	struct list_head   t_queue;	 /* reserve/write queue */
>  	struct task_struct *t_task;	 /* task that owns this ticket */
> @@ -159,13 +149,6 @@ typedef struct xlog_ticket {
>  	char		   t_ocnt;	 /* original count		 : 1  */
>  	char		   t_cnt;	 /* current count		 : 1  */
>  	char		   t_flags;	 /* properties of reservation	 : 1  */
> -
> -        /* reservation array fields */
> -	uint		   t_res_num;                    /* num in array : 4 */
> -	uint		   t_res_num_ophdrs;		 /* num op hdrs  : 4 */
> -	uint		   t_res_arr_sum;		 /* array sum    : 4 */
> -	uint		   t_res_o_flow;		 /* sum overflow : 4 */
> -	xlog_res_t	   t_res_arr[XLOG_TIC_LEN_MAX];  /* array of res : 8 * 15 */ 
>  } xlog_ticket_t;
>  
>  /*
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 27/45] xfs: pass lv chain length into xlog_write()
  2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
@ 2021-03-09  2:36   ` Darrick J. Wong
  2021-03-11  3:37     ` Dave Chinner
  2021-03-16 18:38   ` Brian Foster
  1 sibling, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The caller of xlog_write() usually has a close accounting of the
> aggregated vector length contained in the log vector chain passed to
> xlog_write(). There is no need to iterate the chain to calculate the
> length of the data in xlog_write_calc_vec_length() if the caller is
> already iterating that chain to build it.
> 
> Passing in the vector length avoids doing an extra chain iteration,
> which can be a significant amount of work given that large CIL
> commits can have hundreds of thousands of vectors attached to the
> chain.
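
To illustrate the accounting change in isolation, here's a minimal
user-space sketch (invented names, not the kernel code): fold the
length sum into the chain-building walk the caller already does, so
the writer never has to re-walk the chain.

	struct lv {
		struct lv	*next;
		unsigned int	bytes;		/* like lv_bytes */
	};

	/* link a new lv onto the chain, accounting its bytes as we go */
	static struct lv *
	chain_add(struct lv *head, struct lv *lv, unsigned int *len)
	{
		lv->next = head;
		*len += lv->bytes;
		return lv;
	}

The aggregated *len is then handed straight to the writer, just as the
new xlog_write() len parameter is in the diff below.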
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
>  fs/xfs/xfs_log_cil.c  | 18 +++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  3 files changed, 20 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 412b167d8d0e..22f97914ab99 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -858,7 +858,8 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
> +				reg.i_len);
>  }
>  
>  /*
> @@ -1577,7 +1578,8 @@ xlog_commit_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> +				reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2097,32 +2099,6 @@ xlog_print_trans(
>  	}
>  }
>  
> -/*
> - * Calculate the potential space needed by the log vector. All regions contain
> - * their own opheaders and they are accounted for in region space so we don't
> - * need to add them to the vector length here.
> - */
> -static int
> -xlog_write_calc_vec_length(
> -	struct xlog_ticket	*ticket,
> -	struct xfs_log_vec	*log_vector,
> -	uint			optype)
> -{
> -	struct xfs_log_vec	*lv;
> -	int			len = 0;
> -	int			i;
> -
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
> -		/* we don't write ordered log vectors */
> -		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> -			continue;
> -
> -		for (i = 0; i < lv->lv_niovecs; i++)
> -			len += lv->lv_iovecp[i].i_len;
> -	}
> -	return len;
> -}
> -
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
>  	struct xlog_op_header	*ophdr,
> @@ -2285,13 +2261,13 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			optype)
> +	uint			optype,
> +	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
>  	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
>  	int			index = 0;
> -	int			len;
>  	int			partial_copy = 0;
>  	int			partial_copy_len = 0;
>  	int			contwr = 0;
> @@ -2306,7 +2282,6 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
>  	if (start_lsn)
>  		*start_lsn = 0;
>  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 7a5e6bdb7876..34abc3bae587 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
>  				sizeof(struct xfs_trans_header);
>  	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
>  
> -	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> -
>  	lvhdr->lv_niovecs = 2;
>  	lvhdr->lv_iovecp = &hdr->lhdr[0];
> +	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;

Er... does this code change belong in an earlier patch?  Or if not, why
wasn't it important to set lv_bytes here?

>  	lvhdr->lv_next = ctx->lv_chain;
> +
> +	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
>  
>  /*
> @@ -742,7 +743,8 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
> -	int			num_iovecs;
> +	int			num_iovecs = 0;
> +	int			num_bytes = 0;
>  	int			error = 0;
>  	struct xlog_cil_trans_hdr thdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> @@ -841,7 +843,6 @@ xlog_cil_push_work(
>  	 * by the flush lock.
>  	 */
>  	lv = NULL;
> -	num_iovecs = 0;
>  	while (!list_empty(&cil->xc_cil)) {
>  		struct xfs_log_item	*item;
>  
> @@ -855,6 +856,10 @@ xlog_cil_push_work(
>  		lv = item->li_lv;
>  		item->li_lv = NULL;
>  		num_iovecs += lv->lv_niovecs;
> +
> +		/* we don't write ordered log vectors */
> +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> +			num_bytes += lv->lv_bytes;
>  	}
>  
>  	/*
> @@ -893,6 +898,9 @@ xlog_cil_push_work(
>  	 * transaction header here as it is not accounted for in xlog_write().
>  	 */
>  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> +	num_iovecs += lvhdr.lv_niovecs;
> +	num_bytes += lvhdr.lv_bytes;
> +
>  

No need to have two blank lines here.

--D

>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -907,7 +915,7 @@ xlog_cil_push_work(
>  	 * write head.
>  	 */
>  	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS);
> +				XLOG_START_TRANS, num_bytes);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 8ee6a5f74396..003c11653955 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -462,7 +462,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint optype);
> +		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 28/45] xfs: introduce xlog_write_single()
  2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
@ 2021-03-09  2:39   ` Darrick J. Wong
  2021-03-11  4:19     ` Dave Chinner
  2021-03-16 18:39   ` Brian Foster
  1 sibling, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:26PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce an optimised version of xlog_write() that is used when the
> entire write will fit in a single iclog. This greatly simplifies the
> implementation of writing a log vector chain into an iclog, and sets
> the ground work for a much more understandable xlog_write()
> implementation.
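
As a standalone model of the fast path (user-space sketch, invented
names): once the caller knows the whole payload fits in the current
buffer, the copy loop needs none of the split/continuation state that
makes the general case hard to follow.

	#include <string.h>

	struct buf {
		char		*data;
		unsigned int	size;
		unsigned int	off;
	};

	/* fast path: caller has verified len <= size - off */
	static void
	buf_write_single(struct buf *b, const void *src, unsigned int len)
	{
		memcpy(b->data + b->off, src, len);	/* never splits */
		b->off += len;
	}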
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 22f97914ab99..590c1e6db475 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2214,6 +2214,52 @@ xlog_write_copy_finish(
>  	return error;
>  }
>  
> +/*
> + * Write log vectors into a single iclog which is guaranteed by the caller
> + * to have enough space to write the entire log vector into. Return the number
> + * of log vectors written into the iclog.
> + */
> +static int
> +xlog_write_single(
> +	struct xfs_log_vec	*log_vector,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	*iclog,
> +	uint32_t		log_offset,
> +	uint32_t		len)
> +{
> +	struct xfs_log_vec	*lv = log_vector;
> +	void			*ptr;
> +	int			index = 0;
> +	int			record_cnt = 0;

Any reason these (and the return type) can't be unsigned?  I don't think
negative indices or record counts have any meaning, right?

Otherwise this looks ok to me.

--D

> +
> +	ASSERT(log_offset + len <= iclog->ic_size);
> +
> +	ptr = iclog->ic_datap + log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
> +
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));
> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> +			record_cnt++;
> +		}
> +	}
> +	ASSERT(len == 0);
> +	return record_cnt;
> +}
> +
> +
>  /*
>   * Write some region out to in-core log
>   *
> @@ -2294,7 +2340,6 @@ xlog_write(
>  			return error;
>  
>  		ASSERT(log_offset <= iclog->ic_size - 1);
> -		ptr = iclog->ic_datap + log_offset;
>  
>  		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
> @@ -2311,10 +2356,20 @@ xlog_write(
>  						XLOG_ICL_NEED_FUA);
>  		}
>  
> +		/* If this is a single iclog write, go fast... */
> +		if (!contwr && lv == log_vector) {
> +			record_cnt = xlog_write_single(lv, ticket, iclog,
> +						log_offset, len);
> +			len = 0;
> +			data_cnt = len;
> +			break;
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
>  		 */
> +		ptr = iclog->ic_datap + log_offset;
>  		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
>  			struct xfs_log_iovec	*reg;
>  			struct xlog_op_header	*ophdr;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/45] xfs: introduce xlog_write_partial()
  2021-03-05  5:11 ` [PATCH 29/45] xfs: introduce xlog_write_partial() Dave Chinner
@ 2021-03-09  2:59   ` Darrick J. Wong
  2021-03-11  4:33     ` Dave Chinner
  2021-03-18 13:22   ` Brian Foster
  1 sibling, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  2:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Handle writing of a logvec chain into an iclog that doesn't have
> enough space to fit it all. The iclog has already been changed to
> WANT_SYNC by xlog_state_get_iclog_space(), so the entire remaining space
> in the iclog is exclusively owned by this logvec chain.
> 
> The difference between the single and partial cases is that
> we end up with partial iovec writes in the iclog and have to split
> log vec regions across two iclogs. The state handling for this is
> currently awful and so we're building up the pieces needed to
> handle this more cleanly one at a time.
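
As a standalone model of the continuation behaviour (user-space
sketch; buffer size and names invented): a region larger than the
remaining space gets copied out in chunks, with every chunk except
the last flagged as a continuation so recovery can stitch the region
back together.

	#include <stdio.h>
	#include <string.h>

	#define BUFSZ	8	/* stand-in for iclog space */

	static void
	write_region(const char *reg, size_t len)
	{
		size_t	off = 0;

		while (off < len) {
			size_t	rlen = len - off;
			char	buf[BUFSZ];

			if (rlen > BUFSZ)
				rlen = BUFSZ;
			memcpy(buf, reg + off, rlen);
			off += rlen;
			/* CONT on all but the final chunk */
			printf("wrote %zu bytes%s\n", rlen,
			       off < len ? " (CONT)" : " (END)");
		}
	}

	int main(void)
	{
		write_region("a region bigger than one buffer", 31);
		return 0;
	}

The awkward part in the real code is that each new chunk also needs a
fresh opheader, and there must always be room for one - that is what
the do/while loop in xlog_write_partial() below is managing.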
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
>  1 file changed, 251 insertions(+), 274 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 590c1e6db475..10916b99bf0f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2099,166 +2099,250 @@ xlog_print_trans(
>  	}
>  }
>  
> -static xlog_op_header_t *
> -xlog_write_setup_ophdr(
> -	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket)
> -{
> -	ophdr->oh_clientid = XFS_TRANSACTION;
> -	ophdr->oh_res2 = 0;
> -	ophdr->oh_flags = 0;
> -	return ophdr;
> -}
> -
>  /*
> - * Set up the parameters of the region copy into the log. This has
> - * to handle region write split across multiple log buffers - this
> - * state is kept external to this function so that this code can
> - * be written in an obvious, self documenting manner.
> + * Write whole log vectors into a single iclog which is guaranteed to have
> + * either sufficient space for the entire log vector chain to be written or
> + * exclusive access to the remaining space in the iclog.
> + *
> + * Return the number of iovecs and data written into the iclog, as well as
> + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> + * end of the chain).
>   */
> -static int
> -xlog_write_setup_copy(
> +static struct xfs_log_vec *
> +xlog_write_single(

Ouch.  Could you fix the previous patch to move this new function a
little higher in the file (like above xlog_write_setup_ophdr) so that it
doesn't get shredded like this?

Sooo... I /think/ this looks all right, but this is a pretty long
reorganization.  I might revisit this in the morning. :/

(Skip to the second-to-last hunk, that's where the next comment is...)

> +	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_op_header	*ophdr,
> -	int			space_available,
> -	int			space_required,
> -	int			*copy_off,
> -	int			*copy_len,
> -	int			*last_was_partial_copy,
> -	int			*bytes_consumed)
> -{
> -	int			still_to_copy;
> -
> -	still_to_copy = space_required - *bytes_consumed;
> -	*copy_off = *bytes_consumed;
> -
> -	if (still_to_copy <= space_available) {
> -		/* write of region completes here */
> -		*copy_len = still_to_copy;
> -		ophdr->oh_len = cpu_to_be32(*copy_len);
> -		if (*last_was_partial_copy)
> -			ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS);
> -		*last_was_partial_copy = 0;
> -		*bytes_consumed = 0;
> -		return 0;
> -	}
> -
> -	/* partial write of region, needs extra log op header reservation */
> -	*copy_len = space_available;
> -	ophdr->oh_len = cpu_to_be32(*copy_len);
> -	ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> -	if (*last_was_partial_copy)
> -		ophdr->oh_flags |= XLOG_WAS_CONT_TRANS;
> -	*bytes_consumed += *copy_len;
> -	(*last_was_partial_copy)++;
> -
> -	/* account for new log op header */
> -	ticket->t_curr_res -= sizeof(struct xlog_op_header);
> -
> -	return sizeof(struct xlog_op_header);
> -}
> -
> -static int
> -xlog_write_copy_finish(
> -	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
> -	uint			flags,
> -	int			*record_cnt,
> -	int			*data_cnt,
> -	int			*partial_copy,
> -	int			*partial_copy_len,
> -	int			log_offset,
> -	struct xlog_in_core	**commit_iclog)
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt)
>  {
> -	int			error;
> +	struct xfs_log_vec	*lv = log_vector;
> +	void			*ptr;
> +	int			index;
>  
> -	if (*partial_copy) {
> +	ASSERT(*log_offset + *len <= iclog->ic_size ||
> +		iclog->ic_state == XLOG_STATE_WANT_SYNC);
> +
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * This iclog has already been marked WANT_SYNC by
> -		 * xlog_state_get_iclog_space.
> +		 * If the entire log vec does not fit in the iclog, punt it to
> +		 * the partial copy loop which can handle this case.
>  		 */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> -		goto release_iclog;
> -	}
> +		if (lv->lv_niovecs &&
> +		    lv->lv_bytes > iclog->ic_size - *log_offset)
> +			break;
>  
> -	*partial_copy = 0;
> -	*partial_copy_len = 0;
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
>  
> -	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
> -		/* no more space in this iclog - push it. */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
> -		if (iclog->ic_state == XLOG_STATE_ACTIVE)
> -			xlog_state_switch_iclogs(log, iclog, 0);
> -		else
> -			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> -			       iclog->ic_state == XLOG_STATE_IOERROR);
> -		if (!commit_iclog)
> -			goto release_iclog;
> -		spin_unlock(&log->l_icloglock);
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> -		*commit_iclog = iclog;
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));
> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len);
> +			(*record_cnt)++;
> +			*data_cnt += reg->i_len;
> +		}
>  	}
> +	ASSERT(*len == 0 || lv);
> +	return lv;
> +}
>  
> -	return 0;
> +static int
> +xlog_write_get_more_iclog_space(
> +	struct xlog		*log,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
> +{
> +	struct xlog_in_core	*iclog = *iclogp;
> +	int			error;
>  
> -release_iclog:
> +	spin_lock(&log->l_icloglock);
> +	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> +	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> +	       iclog->ic_state == XLOG_STATE_IOERROR);
>  	error = xlog_state_release_iclog(log, iclog);
>  	spin_unlock(&log->l_icloglock);
> -	return error;
> +	if (error)
> +		return error;
> +
> +	error = xlog_state_get_iclog_space(log, len, &iclog,
> +				ticket, contwr, log_offset);
> +	if (error)
> +		return error;
> +	*record_cnt = 0;
> +	*data_cnt = 0;
> +	*iclogp = iclog;
> +	return 0;
>  }
>  
>  /*
> - * Write log vectors into a single iclog which is guaranteed by the caller
> - * to have enough space to write the entire log vector into. Return the number
> - * of log vectors written into the iclog.
> + * Write log vectors into a single iclog which is smaller than the current chain
> + * length. We write until we cannot fit a full record into the remaining space
> + * and then stop. We return the first log vector that cannot wholly fit in
> + * the iclog.
>   */
> -static int
> -xlog_write_single(
> +static struct xfs_log_vec *
> +xlog_write_partial(
> +	struct xlog		*log,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_in_core	*iclog,
> -	uint32_t		log_offset,
> -	uint32_t		len)
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
>  {
> +	struct xlog_in_core	*iclog = *iclogp;
>  	struct xfs_log_vec	*lv = log_vector;
> +	struct xfs_log_iovec	*reg;
> +	struct xlog_op_header	*ophdr;
>  	void			*ptr;
>  	int			index = 0;
> -	int			record_cnt = 0;
> +	uint32_t		rlen;
> +	int			error;
>  
> -	ASSERT(log_offset + len <= iclog->ic_size);
> +	/* walk the logvec, copying until we run out of space in the iclog */
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (index = 0; index < lv->lv_niovecs; index++) {
> +		uint32_t	reg_offset = 0;
> +
> +		reg = &lv->lv_iovecp[index];
> +		ASSERT(reg->i_len % sizeof(int32_t) == 0);
>  
> -	ptr = iclog->ic_datap + log_offset;
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * Ordered log vectors have no regions to write so this
> -		 * loop will naturally skip them.
> +		 * The first region of a continuation must have a non-zero
> +		 * length otherwise log recovery will just skip over it and
> +		 * start recovering from the next opheader it finds. Because we
> +		 * mark the next opheader as a continuation, recovery will then
> +		 * incorrectly add the continuation to the previous region and
> +		 * that breaks stuff.
> +		 *
> +		 * Hence if there isn't space for region data after the
> +		 * opheader, then we need to start afresh with a new iclog.
>  		 */
> -		for (index = 0; index < lv->lv_niovecs; index++) {
> -			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> -			struct xlog_op_header	*ophdr = reg->i_addr;
> +		if (iclog->ic_size - *log_offset <=
> +					sizeof(struct xlog_op_header)) {
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
> +		}
>  
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		ophdr = reg->i_addr;
> +		rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset);
> +
> +		ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +		ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header));
> +		if (rlen != reg->i_len)
> +			ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
>  
> +		ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		xlog_verify_dest_ptr(log, ptr);
> +		memcpy(ptr, reg->i_addr, rlen);
> +		xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +		(*record_cnt)++;
> +		*data_cnt += rlen;
> +
> +		if (rlen == reg->i_len)
> +			continue;
> +
> +		/*
> +		 * We now have a partially written iovec, but it can span
> +		 * multiple iclogs so we loop here. First we release the iclog
> +		 * we currently have, then we get a new iclog and add a new
> +		 * opheader. Then we continue copying from where we were until
> +		 * we either complete the iovec or fill the iclog. If we
> +		 * complete the iovec, then we increment the index and go right
> +		 * back to the top of the outer loop. if we fill the iclog, we
> +		 * run the inner loop again.
> +		 *
> +		 * This is complicated by the tail of a region using all the
> +		 * space in an iclog and hence requiring us to release the iclog
> +		 * and get a new one before returning to the outer loop. We must
> +		 * always guarantee that we exit this inner loop with at least
> +		 * space for log transaction opheaders left in the current
> +		 * iclog, hence we cannot just terminate the loop at the end
> +		 * of the continuation. So we loop while there is no
> +		 * space left in the current iclog, and check for the end of the
> +		 * continuation after getting a new iclog.
> +		 */
> +		do {
> +			/*
> +			 * Account for the continuation opheader before we get
> +			 * a new iclog. This is necessary so that we reserve
> +			 * space in the iclog for it.
> +			 */
> +			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {
> +				*len += sizeof(struct xlog_op_header);
> +				ticket->t_curr_res -= sizeof(struct xlog_op_header);
> +			}
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
> +
> +			ophdr = ptr;
>  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +			ophdr->oh_clientid = XFS_TRANSACTION;
> +			ophdr->oh_res2 = 0;
> +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> +
> +			xlog_write_adv_cnt(&ptr, len, log_offset,
>  						sizeof(struct xlog_op_header));
> -			memcpy(ptr, reg->i_addr, reg->i_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> -			record_cnt++;
> -		}
> +			*data_cnt += sizeof(struct xlog_op_header);
> +
> +			/*
> +			 * If rlen fits in the iclog, then end the region
> +			 * continuation. Otherwise we're going around again.
> +			 */
> +			reg_offset += rlen;
> +			rlen = reg->i_len - reg_offset;
> +			if (rlen <= iclog->ic_size - *log_offset)
> +				ophdr->oh_flags |= XLOG_END_TRANS;
> +			else
> +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> +
> +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> +			ophdr->oh_len = cpu_to_be32(rlen);
> +
> +			xlog_verify_dest_ptr(log, ptr);
> +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +			(*record_cnt)++;
> +			*data_cnt += rlen;
> +
> +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
>  	}
> -	ASSERT(len == 0);
> -	return record_cnt;
> -}
>  
> +	/*
> +	 * No more iovecs remain in this logvec so return the next log vec to
> +	 * the caller so it can go back to fast path copying.
> +	 */
> +	*iclogp = iclog;
> +	return lv->lv_next;
> +}
>  
>  /*
>   * Write some region out to in-core log
> @@ -2312,14 +2396,11 @@ xlog_write(
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> -	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
> -	int			index = 0;
> -	int			partial_copy = 0;
> -	int			partial_copy_len = 0;
>  	int			contwr = 0;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> +	int			log_offset;
>  
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
> @@ -2328,157 +2409,52 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	if (start_lsn)
> -		*start_lsn = 0;
> -	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -		void		*ptr;
> -		int		log_offset;
> -
> -		error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> -						   &contwr, &log_offset);
> -		if (error)
> -			return error;
> -
> -		ASSERT(log_offset <= iclog->ic_size - 1);
> +	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> +					   &contwr, &log_offset);
> +	if (error)
> +		return error;
>  
> -		/* Start_lsn is the first lsn written to. */
> -		if (start_lsn && !*start_lsn)
> -			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +	/* start_lsn is the LSN of the first iclog written to. */
> +	if (start_lsn)
> +		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> -		/*
> -		 * iclogs containing commit records or unmount records need
> -		 * to issue ordering cache flushes and commit immediately
> -		 * to stable storage to guarantee journal vs metadata ordering
> -		 * is correctly maintained in the storage media.
> -		 */
> -		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> -			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> -						XLOG_ICL_NEED_FUA);
> -		}
> +	/*
> +	 * iclogs containing commit records or unmount records need
> +	 * to issue ordering cache flushes and commit immediately
> +	 * to stable storage to guarantee journal vs metadata ordering
> +	 * is correctly maintained in the storage media. This will always
> +	 * fit in the iclog we have already been passed.
> +	 */
> +	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> +		ASSERT(!contwr);
> +	}
>  
> -		/* If this is a single iclog write, go fast... */
> -		if (!contwr && lv == log_vector) {
> -			record_cnt = xlog_write_single(lv, ticket, iclog,
> -						log_offset, len);
> -			len = 0;
> -			data_cnt = len;
> +	while (lv) {
> +		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt);
> +		if (!lv)
>  			break;
> -		}
> -
> -		/*
> -		 * This loop writes out as many regions as can fit in the amount
> -		 * of space which was allocated by xlog_state_get_iclog_space().
> -		 */
> -		ptr = iclog->ic_datap + log_offset;
> -		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -			struct xfs_log_iovec	*reg;
> -			struct xlog_op_header	*ophdr;
> -			int			copy_len;
> -			int			copy_off;
> -			bool			ordered = false;
> -			bool			added_ophdr = false;
> -
> -			/* ordered log vectors have no regions to write */
> -			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> -				ASSERT(lv->lv_niovecs == 0);
> -				ordered = true;
> -				goto next_lv;
> -			}
> -
> -			reg = &vecp[index];
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> -
> -			/*
> -			 * Regions always have their ophdr at the start of the
> -			 * region, except for:
> -			 * - a transaction start which has a start record ophdr
> -			 *   before the first region ophdr; and
> -			 * - the previous region didn't fully fit into an iclog
> -			 *   so needs a continuation ophdr to prepend the region
> -			 *   in this new iclog.
> -			 */
> -			ophdr = reg->i_addr;
> -			if (optype && index) {
> -				optype &= ~XLOG_START_TRANS;
> -			} else if (partial_copy) {
> -                                ophdr = xlog_write_setup_ophdr(ptr, ticket);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -					   sizeof(struct xlog_op_header));
> -				added_ophdr = true;
> -			}
> -			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -
> -			len += xlog_write_setup_copy(ticket, ophdr,
> -						     iclog->ic_size-log_offset,
> -						     reg->i_len,
> -						     &copy_off, &copy_len,
> -						     &partial_copy,
> -						     &partial_copy_len);
> -			xlog_verify_dest_ptr(log, ptr);
> -
>  
> -			/*
> -			 * Wart: need to update length in embedded ophdr not
> -			 * to include it's own length.
> -			 */
> -			if (!added_ophdr) {
> -				ophdr->oh_len = cpu_to_be32(copy_len -
> -						sizeof(struct xlog_op_header));
> -			}
> -
> -			ASSERT(copy_len > 0);
> -			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> -
> -			if (added_ophdr)
> -				copy_len += sizeof(struct xlog_op_header);
> -			record_cnt++;
> -			data_cnt += contwr ? copy_len : 0;
> -
> -			error = xlog_write_copy_finish(log, iclog, optype,
> -						       &record_cnt, &data_cnt,
> -						       &partial_copy,
> -						       &partial_copy_len,
> -						       log_offset,
> -						       commit_iclog);
> -			if (error)
> -				return error;
> -
> -			/*
> -			 * if we had a partial copy, we need to get more iclog
> -			 * space but we don't want to increment the region
> -			 * index because there is still more is this region to
> -			 * write.
> -			 *
> -			 * If we completed writing this region, and we flushed
> -			 * the iclog (indicated by resetting of the record
> -			 * count), then we also need to get more log space. If
> -			 * this was the last record, though, we are done and
> -			 * can just return.
> -			 */
> -			if (partial_copy)
> -				break;
> -
> -			if (++index == lv->lv_niovecs) {
> -next_lv:
> -				lv = lv->lv_next;
> -				index = 0;
> -				if (lv)
> -					vecp = lv->lv_iovecp;
> -			}
> -			if (record_cnt == 0 && !ordered) {
> -				if (!lv)
> -					return 0;
> -				break;
> -			}
> +		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
> +		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt, &contwr);
> +		if (IS_ERR_OR_NULL(lv)) {
> +			error = PTR_ERR_OR_ZERO(lv);
> +			break;
>  		}
>  	}
> +	ASSERT((len == 0 && !lv) || error);
>  
> -	ASSERT(len == 0);
> -
> +	/*
> +	 * We've already been guaranteed that the last writes will fit inside
> +	 * the current iclog, and hence it will already have the space used by
> +	 * those writes accounted to it. Hence we do not need to update the
> +	 * iclog with the number of bytes written here.
> +	 */
> +	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
>  	spin_lock(&log->l_icloglock);
> -	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
> +	xlog_state_finish_copy(log, iclog, record_cnt, 0);
>  	if (commit_iclog) {
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
> @@ -2930,7 +2906,7 @@ xlog_state_get_iclog_space(
>  	 * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's
>  	 * can fit into remaining data section.
>  	 */
> -	if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) {
> +	if (iclog->ic_size - iclog->ic_offset < 3*sizeof(xlog_op_header_t)) {

Why does this change to 3?  Does the comment need amending?

--D

>  		int		error = 0;
>  
>  		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
> @@ -3633,11 +3609,12 @@ xlog_verify_iclog(
>  					iclog->ic_header.h_cycle_data[idx]);
>  			}
>  		}
> -		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG)
> +		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) {
>  			xfs_warn(log->l_mp,
> -				"%s: invalid clientid %d op "PTR_FMT" offset 0x%lx",
> -				__func__, clientid, ophead,
> +				"%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx",
> +				__func__, i, clientid, ophead,
>  				(unsigned long)field_offset);
> +		}
>  
>  		/* check length */
>  		p = &ophead->oh_len;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 30/45] xfs: xlog_write() no longer needs contwr state
  2021-03-05  5:11 ` [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Dave Chinner
@ 2021-03-09  3:01   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  3:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The rework of xlog_write() no longer requires xlog_state_get_iclog_space()
> to tell it about internal iclog space reservation state to direct what
> it does. Remove this parameter.
> 
> $ size fs/xfs/xfs_log.o.*
>    text	   data	    bss	    dec	    hex	filename
>   26520	    560	      8	  27088	   69d0	fs/xfs/xfs_log.o.orig
>   26384	    560	      8	  26952	   6948	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This seems pretty straightforward,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 33 +++++++++++----------------------
>  1 file changed, 11 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 10916b99bf0f..8f4f7ae84358 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -47,7 +47,6 @@ xlog_state_get_iclog_space(
>  	int			len,
>  	struct xlog_in_core	**iclog,
>  	struct xlog_ticket	*ticket,
> -	int			*continued_write,
>  	int			*logoffsetp);
>  STATIC void
>  xlog_grant_push_ail(
> @@ -2167,8 +2166,7 @@ xlog_write_get_more_iclog_space(
>  	uint32_t		*log_offset,
>  	uint32_t		len,
>  	uint32_t		*record_cnt,
> -	uint32_t		*data_cnt,
> -	int			*contwr)
> +	uint32_t		*data_cnt)
>  {
>  	struct xlog_in_core	*iclog = *iclogp;
>  	int			error;
> @@ -2182,8 +2180,8 @@ xlog_write_get_more_iclog_space(
>  	if (error)
>  		return error;
>  
> -	error = xlog_state_get_iclog_space(log, len, &iclog,
> -				ticket, contwr, log_offset);
> +	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> +					log_offset);
>  	if (error)
>  		return error;
>  	*record_cnt = 0;
> @@ -2207,8 +2205,7 @@ xlog_write_partial(
>  	uint32_t		*log_offset,
>  	uint32_t		*len,
>  	uint32_t		*record_cnt,
> -	uint32_t		*data_cnt,
> -	int			*contwr)
> +	uint32_t		*data_cnt)
>  {
>  	struct xlog_in_core	*iclog = *iclogp;
>  	struct xfs_log_vec	*lv = log_vector;
> @@ -2242,7 +2239,7 @@ xlog_write_partial(
>  					sizeof(struct xlog_op_header)) {
>  			error = xlog_write_get_more_iclog_space(log, ticket,
>  					&iclog, log_offset, *len, record_cnt,
> -					data_cnt, contwr);
> +					data_cnt);
>  			if (error)
>  				return ERR_PTR(error);
>  			ptr = iclog->ic_datap + *log_offset;
> @@ -2298,7 +2295,7 @@ xlog_write_partial(
>  			}
>  			error = xlog_write_get_more_iclog_space(log, ticket,
>  					&iclog, log_offset, *len, record_cnt,
> -					data_cnt, contwr);
> +					data_cnt);
>  			if (error)
>  				return ERR_PTR(error);
>  			ptr = iclog->ic_datap + *log_offset;
> @@ -2396,7 +2393,6 @@ xlog_write(
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> -	int			contwr = 0;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> @@ -2410,7 +2406,7 @@ xlog_write(
>  	}
>  
>  	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> -					   &contwr, &log_offset);
> +					   &log_offset);
>  	if (error)
>  		return error;
>  
> @@ -2425,10 +2421,8 @@ xlog_write(
>  	 * is correctly maintained in the storage media. This will always
>  	 * fit in the iclog we have been already been passed.
>  	 */
> -	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
>  		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> -		ASSERT(!contwr);
> -	}
>  
>  	while (lv) {
>  		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> @@ -2438,7 +2432,7 @@ xlog_write(
>  
>  		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
>  		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> -					&len, &record_cnt, &data_cnt, &contwr);
> +					&len, &record_cnt, &data_cnt);
>  		if (IS_ERR_OR_NULL(lv)) {
>  			error = PTR_ERR_OR_ZERO(lv);
>  			break;
> @@ -2452,7 +2446,6 @@ xlog_write(
>  	 * those writes accounted to it. Hence we do not need to update the
>  	 * iclog with the number of bytes written here.
>  	 */
> -	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, 0);
>  	if (commit_iclog) {
> @@ -2856,7 +2849,6 @@ xlog_state_get_iclog_space(
>  	int			len,
>  	struct xlog_in_core	**iclogp,
>  	struct xlog_ticket	*ticket,
> -	int			*continued_write,
>  	int			*logoffsetp)
>  {
>  	int		  log_offset;
> @@ -2932,13 +2924,10 @@ xlog_state_get_iclog_space(
>  	 * iclogs (to mark it taken), this particular iclog will release/sync
>  	 * to disk in xlog_write().
>  	 */
> -	if (len <= iclog->ic_size - iclog->ic_offset) {
> -		*continued_write = 0;
> +	if (len <= iclog->ic_size - iclog->ic_offset)
>  		iclog->ic_offset += len;
> -	} else {
> -		*continued_write = 1;
> +	else
>  		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
> -	}
>  	*iclogp = iclog;
>  
>  	ASSERT(iclog->ic_offset <= iclog->ic_size);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 31/45] xfs: CIL context doesn't need to count iovecs
  2021-03-05  5:11 ` [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Dave Chinner
@ 2021-03-09  3:16   ` Darrick J. Wong
  2021-03-11  5:03     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-09  3:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:29PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we account for log opheaders in the log item formatting
> code, we don't actually use the aggregated count of log iovecs in
> the CIL for anything. Remove it and the tracking code that
> calculates it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 22 ++++++----------------
>  1 file changed, 6 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 34abc3bae587..4047f95a0fc4 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -252,22 +252,18 @@ xlog_cil_alloc_shadow_bufs(
>  
>  /*
>   * Prepare the log item for insertion into the CIL. Calculate the difference in
> - * log space and vectors it will consume, and if it is a new item pin it as
> - * well.
> + * log space it will consume, and if it is a new item pin it as well.
>   */
>  STATIC void
>  xfs_cil_prepare_item(
>  	struct xlog		*log,
>  	struct xfs_log_vec	*lv,
>  	struct xfs_log_vec	*old_lv,
> -	int			*diff_len,
> -	int			*diff_iovecs)
> +	int			*diff_len)
>  {
>  	/* Account for the new LV being passed in */
> -	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
> +	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
>  		*diff_len += lv->lv_bytes;
> -		*diff_iovecs += lv->lv_niovecs;
> -	}
>  
>  	/*
>  	 * If there is no old LV, this is the first time we've seen the item in
> @@ -284,7 +280,6 @@ xfs_cil_prepare_item(
>  		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
>  
>  		*diff_len -= old_lv->lv_bytes;
> -		*diff_iovecs -= old_lv->lv_niovecs;
>  		lv->lv_item->li_lv_shadow = old_lv;
>  	}
>  
> @@ -333,12 +328,10 @@ static void
>  xlog_cil_insert_format_items(
>  	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	int			*diff_len,
> -	int			*diff_iovecs)
> +	int			*diff_len)
>  {
>  	struct xfs_log_item	*lip;
>  
> -
>  	/* Bail out if we didn't find a log item.  */
>  	if (list_empty(&tp->t_items)) {
>  		ASSERT(0);
> @@ -381,7 +374,6 @@ xlog_cil_insert_format_items(
>  			 * set the item up as though it is a new insertion so
>  			 * that the space reservation accounting is correct.
>  			 */
> -			*diff_iovecs -= lv->lv_niovecs;
>  			*diff_len -= lv->lv_bytes;
>  
>  			/* Ensure the lv is set up according to ->iop_size */
> @@ -406,7 +398,7 @@ xlog_cil_insert_format_items(
>  		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
>  		lip->li_ops->iop_format(lip, lv);
>  insert:
> -		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
> +		xfs_cil_prepare_item(log, lv, old_lv, diff_len);
>  	}
>  }
>  
> @@ -426,7 +418,6 @@ xlog_cil_insert_items(
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
> -	int			diff_iovecs = 0;
>  	int			iclog_space;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  
> @@ -436,7 +427,7 @@ xlog_cil_insert_items(
>  	 * We can do this safely because the context can't checkpoint until we
>  	 * are done so it doesn't matter exactly how we update the CIL.
>  	 */
> -	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> +	xlog_cil_insert_format_items(log, tp, &len);
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> @@ -471,7 +462,6 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> -	ctx->nvecs += diff_iovecs;

If the tracking variable isn't necessary any more, should the field go
away from xfs_cil_ctx?

--D

>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-09  1:55       ` Darrick J. Wong
@ 2021-03-09 22:35         ` Andi Kleen
  2021-03-10  6:11           ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Andi Kleen @ 2021-03-09 22:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

"Darrick J. Wong" <djwong@kernel.org> writes:
> It might be nice to leave that as a breadcrumb, then, in case the
> spinlock scalability problems ever get solved.

It might already be solved, depending on whether Dave's rule of thumb
was determined before Linux spinlocks switched to MCS locks.

In my experience spinlock scalability depends a lot on how long the
critical section is (that is very important; short sections are a lot
worse than long sections), as well as whether the contention is inside a
socket or across sockets, and the actual hardware behaves differently too.

So I would be quite surprised if the "rule of 4" generally holds.
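
If anyone wants to poke at that, a rough harness along these lines
(standalone sketch, nothing to do with the XFS code) makes it easy to
vary thread count and critical section length over a fixed amount of
total work:

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	static pthread_spinlock_t lock;
	static volatile unsigned long csec_len;

	static void *worker(void *arg)
	{
		long i, iters = (long)arg;
		unsigned long j;

		for (i = 0; i < iters; i++) {
			pthread_spin_lock(&lock);
			for (j = 0; j < csec_len; j++)
				;	/* simulated critical section */
			pthread_spin_unlock(&lock);
		}
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int t, nthreads = argc > 1 ? atoi(argv[1]) : 4;
		long total = 1L << 22;
		pthread_t tid[64];

		if (nthreads > 64)
			nthreads = 64;
		csec_len = argc > 2 ? atoi(argv[2]) : 10;
		pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
		for (t = 0; t < nthreads; t++)
			pthread_create(&tid[t], NULL, worker,
				       (void *)(total / nthreads));
		for (t = 0; t < nthreads; t++)
			pthread_join(tid[t], NULL);
		printf("%d threads, csec %lu: done\n", nthreads, csec_len);
		return 0;
	}

Run it under /usr/bin/time while sweeping both knobs (and pinning
in-socket vs cross-socket); the interesting part is where the curve
bends for short vs long sections.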

-Andi

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/45] xfs: CIL work is serialised, not pipelined
  2021-03-09 22:35         ` Andi Kleen
@ 2021-03-10  6:11           ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-10  6:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Darrick J. Wong, linux-xfs

On Tue, Mar 09, 2021 at 02:35:42PM -0800, Andi Kleen wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> > It might be nice to leave that as a breadcrumb, then, in case the
> > spinlock scalability problems ever get solved.
> 
> It might be already solved, depending on if Dave's rule of thumb
> was determined before the Linux spinlocks switched to MCS locks or not.

It's what I see on my current 2-socket, 32p/64t machine with a
handful of optane DC4800 SSDs attached to it running 5.12-rc1+.  MCS
doesn't make spin contention go away, just stops spinlocks from
bouncing the same cacheline all over the machine.

i.e. if you've got more than a single CPU's worth of critical
section to execute, then spinlocks are going to spin, no matter how
they are implemented. So AFAICT the contention I'm measuring is not
cacheline bouncing, but just the cost of multiple CPUs
spinning while queued waiting for the spinlock...

> In my experience spinlock scalability depends a lot on how long the
> critical section is (that is very important, short sections are a lot
> worse than long sections), as well as if the contention is inside a
> socket or over sockets, and the actual hardware behaves differently too.

Yup, and most of the critical sections that the icloglock is used
to protect are quite short.

> So I would be quite surprised if the "rule of 4" generally holds.

It's served me well for the past couple of decades, especially when
working with machines that have thousands of CPUs that can turn even
lightly trafficked spin locks into highly contended locks. That was
the lesson I learnt from this commit:

commit 249a8c1124653fa90f3a3afff869095a31bc229f
Author: David Chinner <dgc@sgi.com>
Date:   Tue Feb 5 12:13:32 2008 +1100

    [XFS] Move AIL pushing into it's own thread
    
    When many hundreds to thousands of threads all try to do simultaneous
    transactions and the log is in a tail-pushing situation (i.e. full), we
    can get multiple threads walking the AIL list and contending on the AIL
    lock.
 
Get half a dozen simultaneous AIL pushes, and the AIL spinlock
would break down and burn an entire 2048p machine for half a day
doing what should only take half a second. Unbound concurrency
-always- finds spinlocks to contend on. And if the machine is large
enough, it will then block up the entire machine as more and more
CPUs hit the serialisation point.

As long as I've worked with 500+ cpu machines (since 2002),
scalability has always been about either removing spinlocks from
hot paths or controlling concurrency to a level below where a
spinlock breaks down. You see it again and again in XFS commit logs
where I've either changed something to be lockless or to strictly
constrain concurrency to be less than 4-8p across known hot and/or
contended spinlocks and mutexes.

And I've used it outside XFS, too. It was the basic concept behind
the NUMA aware shrinker infrastructure and the per-node LRU lists
that it uses. Even the internal spinlocks on those lists start to
break down when bashing on the inode and dentry caches on systems
with per-node CPU counts of 8-16...
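
The shape of that trick, as a standalone sketch (invented names, not
the kernel code): give each node its own list and lock, so no single
spinlock ever sees more than one node's worth of CPUs.

	#include <pthread.h>

	#define NR_NODES	4

	struct item {
		struct item		*next;
	};

	struct shard {
		pthread_spinlock_t	lock;
		struct item		*head;
	};

	static struct shard shards[NR_NODES];

	static void shards_init(void)
	{
		int	i;

		for (i = 0; i < NR_NODES; i++)
			pthread_spin_init(&shards[i].lock,
					  PTHREAD_PROCESS_PRIVATE);
	}

	/* contention is bounded by the CPUs mapped to one node */
	static void shard_add(int node, struct item *it)
	{
		struct shard	*s = &shards[node % NR_NODES];

		pthread_spin_lock(&s->lock);
		it->next = s->head;
		s->head = it;
		pthread_spin_unlock(&s->lock);
	}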

Oh, another "rule of 4" I came across a couple of days ago. My test
machine has 4 nodes, so 4 kswapd threads. One buffered IO reader,
running 100% cpu bound at 2GB/s from a 6GB/s capable block device.
The reader was burning 12% of that CPU on the mapping spinlock
inserting pages into the page cache. The kswapds were each burning
12% of a CPU on the same mapping spinlock reclaiming page cache
pages. So, overall, the system was burning over 50% of a CPU
spinning on the mapping spinlock and really only doing about half a
CPU worth of real work.

Same workload with one kswapd (i.e. single node)? Contention on the
mapping lock is barely measurable.  IOWs, at just 5 concurrent
threads doing repeated fast accesses to the inode mapping spinlock
we have reached lock breakdown conditions.

Perhaps we should consider spinlocks harmful these days...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-09  1:13     ` Dave Chinner
@ 2021-03-10 20:49       ` Brian Foster
  2021-03-10 21:28         ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-10 20:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 09, 2021 at 12:13:52PM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 07:25:26AM -0500, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 04:11:06PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > > guarantee the ordering requirements the journal has w.r.t. metadata
> > > > writeback. The two ordering constraints are:
> ....
> > > The rm -rf times are included because I ran them, but the
> > > differences are largely noise. This workload is largely metadata
> > > read IO latency bound and the changes to the journal cache flushing
> > > don't really make any noticeable difference to behaviour apart from
> > > a reduction in noiclog events from background CIL pushing.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Thoughts on my previous feedback to this patch, particularly the locking
> > bits..? I thought I saw a subsequent patch somewhere that increased the
> > parallelism of this code..
> 
> I seem to have missed that email, too.
> 

Seems this occurs more frequently than it should. :/ Mailer problems?

> I guess you are refering to these two hunks:
> 

Yes.

> > > @@ -2416,10 +2408,21 @@ xlog_write(
> > >  		ASSERT(log_offset <= iclog->ic_size - 1);
> > >  		ptr = iclog->ic_datap + log_offset;
> > >  
> > > -		/* start_lsn is the first lsn written to. That's all we need. */
> > > +		/* Start_lsn is the first lsn written to. */
> > >  		if (start_lsn && !*start_lsn)
> > >  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> > >  
> > > +		/*
> > > +		 * iclogs containing commit records or unmount records need
> > > +		 * to issue ordering cache flushes and commit immediately
> > > +		 * to stable storage to guarantee journal vs metadata ordering
> > > +		 * is correctly maintained in the storage media.
> > > +		 */
> > > +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> > > +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> > > +						XLOG_ICL_NEED_FUA);
> > > +		}
> > > +
> > >  		/*
> > >  		 * This loop writes out as many regions as can fit in the amount
> > >  		 * of space which was allocated by xlog_state_get_iclog_space().
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index c04d5d37a3a2..263c8d907221 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -896,11 +896,16 @@ xlog_cil_push_work(
> > >  
> > >  	/*
> > >  	 * If the checkpoint spans multiple iclogs, wait for all previous
> > > -	 * iclogs to complete before we submit the commit_iclog.
> > > +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> > > +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> > > +	 * cache flush because there are no other iclogs we need to order
> > > +	 * against.
> > >  	 */
> > >  	if (ctx->start_lsn != commit_lsn) {
> > >  		spin_lock(&log->l_icloglock);
> > >  		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > +	} else {
> > > +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> > >  	}
> 
> .... that set/clear the flags on the iclog?  Yes, they probably
> should be atomic.
> 
> On second thoughts, we can't just clear XLOG_ICL_NEED_FLUSH here
> because there may be multiple commit records on this iclog and a
> previous one might require the flush. I'll just remove this
> optimisation from the patch right now, because it's more complex
> than it initially seemed.
> 

Ok.

> And looking at the aggregated code that I have now (including the
> stuff I haven't sent out), the need for xlog_write() to set the
> flush flags on the iclog is gone. THis is because the unmount record
> flushes the iclog directly itself so it can add flags there, and
> the iclog that the commit record is written to is returned to the
> caller.
> 

Ok.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 08/45] xfs: journal IO cache flush reductions
  2021-03-10 20:49       ` Brian Foster
@ 2021-03-10 21:28         ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-10 21:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 03:49:28PM -0500, Brian Foster wrote:
> On Tue, Mar 09, 2021 at 12:13:52PM +1100, Dave Chinner wrote:
> > On Mon, Mar 08, 2021 at 07:25:26AM -0500, Brian Foster wrote:
> > > On Fri, Mar 05, 2021 at 04:11:06PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > > > guarantee the ordering requirements the journal has w.r.t. metadata
> > > > > writeback. The two ordering constraints are:
> > ....
> > > > The rm -rf times are included because I ran them, but the
> > > > differences are largely noise. This workload is largely metadata
> > > > read IO latency bound and the changes to the journal cache flushing
> > > > don't really make any noticeable difference to behaviour apart from
> > > > a reduction in noiclog events from background CIL pushing.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > 
> > > Thoughts on my previous feedback to this patch, particularly the locking
> > > bits..? I thought I saw a subsequent patch somewhere that increased the
> > > parallelism of this code..
> > 
> > I seem to have missed that email, too.
> > 
> 
> Seems this occurs more frequently than it should. :/ Mailer problems?

vger has been causing all sorts of problems recently - fromorbit.com
is backed by gmail, and gmail has been one of the mail targets that
has caused vger the most problems. I've also noticed that gmail is
classifying and awful lot of mailing list traffic as spam in recent
months - I'm typically having to manulaly pull 50 "[PATCH ...]
emails a month out of the spam folders, including stuff from
Christoph, Darrick and @redhat.com addresses. There isn't anything I
can do about either of these things - email does not guarantee
delivery...

> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index c04d5d37a3a2..263c8d907221 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -896,11 +896,16 @@ xlog_cil_push_work(
> > > >  
> > > >  	/*
> > > >  	 * If the checkpoint spans multiple iclogs, wait for all previous
> > > > -	 * iclogs to complete before we submit the commit_iclog.
> > > > +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> > > > +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> > > > +	 * cache flush because there are no other iclogs we need to order
> > > > +	 * against.
> > > >  	 */
> > > >  	if (ctx->start_lsn != commit_lsn) {
> > > >  		spin_lock(&log->l_icloglock);
> > > >  		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > > +	} else {
> > > > +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> > > >  	}
> > 
> > .... that set/clear the flags on the iclog?  Yes, they probably
> > should be atomic.
> > 
> > On second thoughts, we can't just clear XLOG_ICL_NEED_FLUSH here
> > because there may be multiple commit records on this iclog and a
> > previous one might require the flush. I'll just remove this
> > optimisation from the patch right now, because it's more complex
> > than it initially seemed.
> > 
> 
> Ok.

On the gripping hand, the optimisation can stay once this:

> > And looking at the aggregated code that I have now (including the
> > stuff I haven't sent out), the need for xlog_write() to set the
> > flush flags on the iclog is gone. THis is because the unmount record
> > flushes the iclog directly itself so it can add flags there, and
> > the iclog that the commit record is written to is returned to the
> > caller.

is done.

That's because we are only setting new flags on each commit and so
not removing flags that previous commits to this iclog may have set.
Hence if a previous commit in this iclog set the flush flag, it will
remain set even when a new commit that is wholly within the current
iclog is run.
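
In miniature (standalone sketch, not the kernel flags themselves):

	#include <assert.h>

	#define NEED_FLUSH	(1 << 0)
	#define NEED_FUA	(1 << 1)

	int main(void)
	{
		unsigned int ic_flags = 0;

		ic_flags |= NEED_FLUSH | NEED_FUA; /* multi-iclog commit */
		ic_flags |= NEED_FUA;	/* later single-iclog commit */

		assert(ic_flags & NEED_FLUSH);	/* earlier flush survives */
		return 0;
	}

Because commits only ever OR flags in, a later commit can't lose an
earlier commit's flush requirement.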

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks
  2021-03-05  5:11 ` [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Dave Chinner
@ 2021-03-10 23:01   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-10 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:30PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In the next patches we are going to make the CIL list itself
> per-cpu, and so we cannot use list_empty() to check if the list is
> empty. Replace the list_empty() checks with a flag in the CIL to
> indicate we have committed at least one transaction to the CIL and
> hence the CIL is not empty.
> 
> We need this flag to be an atomic so that we can clear it without
> holding any locks in the commit fast path, but we also need to be
> careful to avoid atomic operations in the fast path. Hence we use
> the fact that test_bit() is not a locked atomic op to first check if the
> flag is set and then run the atomic test_and_clear_bit() operation
> to clear it and steal the initial unit reservation for the CIL
> context checkpoint.
> 
> When we are switching to a new context in a push, we place the
> setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
> allows all the other places that need to check whether the CIL is
> empty to use test_bit() and still be serialised correctly with the
> CIL context swaps that set the bit.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 49 +++++++++++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  2 files changed, 30 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4047f95a0fc4..e6e36488f0c7 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -70,6 +70,7 @@ xlog_cil_ctx_switch(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx)
>  {
> +	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
>  	ctx->sequence = ++cil->xc_current_sequence;
>  	ctx->cil = cil;
>  	cil->xc_ctx = ctx;
> @@ -436,13 +437,12 @@ xlog_cil_insert_items(
>  		list_splice_init(&tp->t_busy, &ctx->busy_extents);
>  
>  	/*
> -	 * Now transfer enough transaction reservation to the context ticket
> -	 * for the checkpoint. The context ticket is special - the unit
> -	 * reservation has to grow as well as the current reservation as we
> -	 * steal from tickets so we can correctly determine the space used
> -	 * during the transaction commit.
> +	 * We need to take the CIL checkpoint unit reservation on the first
> +	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
> +	 * unnecessarily do an atomic op in the fast path here.
>  	 */
> -	if (ctx->ticket->t_curr_res == 0) {
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
> +	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {

Hm, it'll be amusing to see where this goes.  Usually I tell myself in
situations like these "I think this is ok, let's see where we are in
another 4-5 patches" but now I'm 7 patches out and my brain is getting
close to ENOSPC so I'll tentatively say:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

With the caveat that I could have more to say after the fact...
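
FWIW, condensing both halves of the handoff out of the patch helped
me convince myself it's sound (my paraphrase, not new code):

	/* push side: set under xc_push_lock at context switch */
	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);

	/*
	 * commit side: the unlocked test fronts the locked RMW, so
	 * only the first committer into an empty CIL pays for an
	 * atomic op and steals the unit reservation.
	 */
	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
		ctx_res = ctx->ticket->t_unit_res;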

--D

>  		ctx_res = ctx->ticket->t_unit_res;
>  		ctx->ticket->t_curr_res = ctx_res;
>  		tp->t_ticket->t_curr_res -= ctx_res;
> @@ -771,7 +771,7 @@ xlog_cil_push_work(
>  	 * move on to a new sequence number and so we have to be able to push
>  	 * this sequence again later.
>  	 */
> -	if (list_empty(&cil->xc_cil)) {
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
>  		cil->xc_push_seq = 0;
>  		spin_unlock(&cil->xc_push_lock);
>  		goto out_skip;
> @@ -1019,9 +1019,10 @@ xlog_cil_push_background(
>  
>  	/*
>  	 * The cil won't be empty because we are called while holding the
> -	 * context lock so whatever we added to the CIL will still be there
> +	 * context lock so whatever we added to the CIL will still be there.
>  	 */
>  	ASSERT(!list_empty(&cil->xc_cil));
> +	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  
>  	/*
>  	 * Don't do a background push if we haven't used up all the
> @@ -1108,7 +1109,8 @@ xlog_cil_push_now(
>  	 * there's no work we need to do.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> -	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
> +	    push_seq <= cil->xc_push_seq) {
>  		spin_unlock(&cil->xc_push_lock);
>  		return;
>  	}
> @@ -1128,7 +1130,7 @@ xlog_cil_empty(
>  	bool		empty = false;
>  
>  	spin_lock(&cil->xc_push_lock);
> -	if (list_empty(&cil->xc_cil))
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
>  		empty = true;
>  	spin_unlock(&cil->xc_push_lock);
>  	return empty;
> @@ -1289,7 +1291,7 @@ xlog_cil_force_seq(
>  	 * we would have found the context on the committing list.
>  	 */
>  	if (sequence == cil->xc_current_sequence &&
> -	    !list_empty(&cil->xc_cil)) {
> +	    !test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
>  		spin_unlock(&cil->xc_push_lock);
>  		goto restart;
>  	}
> @@ -1320,21 +1322,19 @@ xlog_cil_force_seq(
>   */
>  bool
>  xfs_log_item_in_current_chkpt(
> -	struct xfs_log_item *lip)
> +	struct xfs_log_item	*lip)
>  {
> -	struct xfs_cil_ctx *ctx;
> +	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
>  
> -	if (list_empty(&lip->li_cil))
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
>  		return false;
>  
> -	ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
> -
>  	/*
>  	 * li_seq is written on the first commit of a log item to record the
>  	 * first checkpoint it is written to. Hence if it is different to the
>  	 * current sequence, we're in a new checkpoint.
>  	 */
> -	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
> +	if (XFS_LSN_CMP(lip->li_seq, cil->xc_ctx->sequence) != 0)
>  		return false;
>  	return true;
>  }
> @@ -1373,13 +1373,16 @@ void
>  xlog_cil_destroy(
>  	struct xlog	*log)
>  {
> -	if (log->l_cilp->xc_ctx) {
> -		if (log->l_cilp->xc_ctx->ticket)
> -			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
> -		kmem_free(log->l_cilp->xc_ctx);
> +	struct xfs_cil	*cil = log->l_cilp;
> +
> +	if (cil->xc_ctx) {
> +		if (cil->xc_ctx->ticket)
> +			xfs_log_ticket_put(cil->xc_ctx->ticket);
> +		kmem_free(cil->xc_ctx);
>  	}
>  
> -	ASSERT(list_empty(&log->l_cilp->xc_cil));
> -	kmem_free(log->l_cilp);
> +	ASSERT(list_empty(&cil->xc_cil));
> +	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
> +	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 003c11653955..b0dc3bc9de59 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -248,6 +248,7 @@ struct xfs_cil_ctx {
>   */
>  struct xfs_cil {
>  	struct xlog		*xc_log;
> +	unsigned long		xc_flags;
>  	struct list_head	xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
> @@ -263,6 +264,9 @@ struct xfs_cil {
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
>  
> +/* xc_flags bit values */
> +#define	XLOG_CIL_EMPTY		1
> +
>  /*
>   * The amount of log space we allow the CIL to aggregate is difficult to size.
>   * Whatever we choose, we have to make sure we can get a reservation for the
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock
  2021-03-05  5:11 ` [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
@ 2021-03-10 23:25   ` Darrick J. Wong
  2021-03-11  5:42     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-10 23:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The xc_cil_lock is the most highly contended lock in XFS now. To
> start the process of getting rid of it, lift the initial reservation
> of the CIL log space out from under the xc_cil_lock.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index e6e36488f0c7..50101336a7f4 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -430,23 +430,19 @@ xlog_cil_insert_items(
>  	 */
>  	xlog_cil_insert_format_items(log, tp, &len);
>  
> -	spin_lock(&cil->xc_cil_lock);

Hm, so looking ahead, the next few patches keep kicking this spin_lock
call further and further down in the file, and the commit messages give
me the impression that this might even go away entirely?

Let me see, the CIL locks are:

xc_ctx_lock, which prevents transactions from committing (into the cil)
any time the CIL itself is preparing a new commited item context so that
it can xlog_write (to disk) the log vectors associated with the current
context.

xc_cil_lock, which serializes transactions adding their items to the CIL
in the first place, hence the motivation to reduce this hot lock?

xc_push_lock, which I think is used to coordinate the CIL push worker
with all the upper level callers that want to force log items to disk?

And the locking order of these three locks is...

xc_ctx_lock --> xc_push_lock
    |
    \---------> xc_cil_lock

Assuming I grokked all that, then I guess moving the spin_lock call
works out because the test_and_clear_bit is atomic.  The rest of the
accounting stuff here is just getting moved further down in the file and
is still protected by xc_cil_lock.
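
If that mental model is right, it might be worth capturing it in a
comment somewhere, e.g. (my reading, not something from the patch):

	/*
	 * CIL lock ordering: both spinlocks nest inside the rwsem,
	 * with no ordering between the two spinlocks themselves.
	 *
	 *	xc_ctx_lock
	 *	    -> xc_push_lock
	 *	    -> xc_cil_lock
	 */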

If I understood all that,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> -
> -	/* attach the transaction to the CIL if it has any busy extents */
> -	if (!list_empty(&tp->t_busy))
> -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> -
>  	/*
>  	 * We need to take the CIL checkpoint unit reservation on the first
>  	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
> -	 * unnecessarily do an atomic op in the fast path here.
> +	 * unnecessarily do an atomic op in the fast path here. We don't need to
> +	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
> +	 * under the xc_ctx_lock here and that needs to be held exclusively to
> +	 * reset the XLOG_CIL_EMPTY bit.
>  	 */
>  	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
> -	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
> +	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
>  		ctx_res = ctx->ticket->t_unit_res;
> -		ctx->ticket->t_curr_res = ctx_res;
> -		tp->t_ticket->t_curr_res -= ctx_res;
> -	}
> +
> +	spin_lock(&cil->xc_cil_lock);
>  
>  	/* do we need space for more log record headers? */
>  	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> @@ -456,11 +452,9 @@ xlog_cil_insert_items(
>  		/* need to take into account split region headers, too */
>  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
>  		ctx->ticket->t_unit_res += split_res;
> -		ctx->ticket->t_curr_res += split_res;
> -		tp->t_ticket->t_curr_res -= split_res;
> -		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}
> -	tp->t_ticket->t_curr_res -= len;
> +	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
> +	ctx->ticket->t_curr_res += split_res + ctx_res;
>  	ctx->space_used += len;
>  
>  	/*
> @@ -498,6 +492,9 @@ xlog_cil_insert_items(
>  			list_move_tail(&lip->li_cil, &cil->xc_cil);
>  	}
>  
> +	/* attach the transaction to the CIL if it has any busy extents */
> +	if (!list_empty(&tp->t_busy))
> +		list_splice_init(&tp->t_busy, &ctx->busy_extents);
>  	spin_unlock(&cil->xc_cil_lock);
>  
>  	if (tp->t_ticket->t_curr_res < 0)
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 34/45] xfs: rework per-iclog header CIL reservation
  2021-03-05  5:11 ` [PATCH 34/45] xfs: rework per-iclog header CIL reservation Dave Chinner
@ 2021-03-11  0:03   ` Darrick J. Wong
  2021-03-11  6:03     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  0:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:32PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For every iclog that a CIL push will use up, we need to ensure we
> have space reserved for the iclog header in each iclog. It is
> extremely difficult to do this accurately with a per-cpu counter
> without expensive summing of the counter in every commit. However,
> we know what the maximum CIL size is going to be because of the
> hard space limit we have, and hence we know exactly how many iclogs
> we are going to need to write out the CIL.
> 
> We are constrained by the requirement that small transactions only
> have reservation space for a single iclog header built into them.
> At commit time we don't know how much of the current transaction
> reservation is made up of iclog header reservations as calculated by
> xfs_log_calc_unit_res() when the ticket was reserved. As larger
> reservations have multiple header spaces reserved, we can steal
> more than one iclog header reservation at a time, but we only steal
> the exact number needed for the given log vector size delta.
> 
> As a result, we don't know exactly when we are going to steal iclog
> header reservations, nor do we know exactly how many we are going to
> need for a given CIL.
> 
> To make things simple, start by calculating the worst case number of
> iclog headers a full CIL push will require. Record this into an
> atomic variable in the CIL. Then add a byte counter to the log
> ticket that records exactly how much iclog header space has been
> reserved in this ticket by xfs_log_calc_unit_res(). This tells us
> exactly how much space we can steal from the ticket at transaction
> commit time.
> 
> Now, at transaction commit time, we can check if the CIL has a full
> iclog header reservation and, if not, steal the entire reservation
> the current ticket holds for iclog headers. This minimises the
> number of times we need to do atomic operations in the fast path,
> but still guarantees we get all the reservations we need.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_rlimit.c |  2 +-
>  fs/xfs/libxfs/xfs_shared.h     |  3 +-
>  fs/xfs/xfs_log.c               | 12 +++++---
>  fs/xfs/xfs_log_cil.c           | 55 ++++++++++++++++++++++++++--------
>  fs/xfs/xfs_log_priv.h          | 20 +++++++------
>  5 files changed, 64 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
> index 7f55eb3f3653..75390134346d 100644
> --- a/fs/xfs/libxfs/xfs_log_rlimit.c
> +++ b/fs/xfs/libxfs/xfs_log_rlimit.c
> @@ -88,7 +88,7 @@ xfs_log_calc_minimum_size(
>  
>  	xfs_log_get_max_trans_res(mp, &tres);
>  
> -	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres);
> +	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres, NULL);

This is currently the only call site of xfs_log_calc_unit_res, so if a
subsequent patch doesn't make use of that last argument it should go
away.  (I don't know yet, I haven't looked...)

>  	if (tres.tr_logcount > 1)
>  		max_logres *= tres.tr_logcount;
>  
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 8c61a461bf7b..b4791b817fe3 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -48,7 +48,8 @@ extern const struct xfs_buf_ops xfs_symlink_buf_ops;
>  extern const struct xfs_buf_ops xfs_rtbuf_ops;
>  
>  /* log size calculation functions */
> -int	xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes);
> +int	xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes,
> +				int *niclogs);
>  int	xfs_log_calc_minimum_size(struct xfs_mount *);
>  
>  struct xfs_trans_res;
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 8f4f7ae84358..46a006d41184 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3312,7 +3312,8 @@ xfs_log_ticket_get(
>  static int
>  xlog_calc_unit_res(
>  	struct xlog		*log,
> -	int			unit_bytes)
> +	int			unit_bytes,
> +	int			*niclogs)
>  {
>  	int			iclog_space;
>  	uint			num_headers;
> @@ -3392,15 +3393,18 @@ xlog_calc_unit_res(
>  	/* roundoff padding for transaction data and one for commit record */
>  	unit_bytes += 2 * log->l_iclog_roundoff;
>  
> +	if (niclogs)
> +		*niclogs = num_headers;
>  	return unit_bytes;
>  }
>  
>  int
>  xfs_log_calc_unit_res(
>  	struct xfs_mount	*mp,
> -	int			unit_bytes)
> +	int			unit_bytes,
> +	int			*niclogs)
>  {
> -	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes, niclogs);
>  }
>  
>  /*
> @@ -3418,7 +3422,7 @@ xlog_ticket_alloc(
>  
>  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>  
> -	unit_res = xlog_calc_unit_res(log, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);

Ok, so each transaction ticket now gets to know the maximum number of
iclog headers that the transaction can consume if we use every last byte
of the reservation...

>  
>  	atomic_set(&tic->t_ref, 1);
>  	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 50101336a7f4..f8fb2f59e24c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -44,9 +44,20 @@ xlog_cil_ticket_alloc(
>  	 * transaction overhead reservation from the first transaction commit.
>  	 */
>  	tic->t_curr_res = 0;
> +	tic->t_iclog_hdrs = 0;
>  	return tic;
>  }
>  
> +static inline void
> +xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
> +{
> +	struct xlog	*log = cil->xc_log;
> +
> +	atomic_set(&cil->xc_iclog_hdrs,
> +		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
> +			(log->l_iclog_size - log->l_iclog_hsize)));
> +}
> +
>  /*
>   * Unavoidable forward declaration - xlog_cil_push_work() calls
>   * xlog_cil_ctx_alloc() itself.
> @@ -70,6 +81,7 @@ xlog_cil_ctx_switch(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx)
>  {
> +	xlog_cil_set_iclog_hdr_count(cil);

...and I guess every time the CIL gets a fresh context, we also record
the maximum number of iclog headers that we might be pushing to disk in
one go?  Which I guess happens if someone commits a lot of updates to a
> filesystem, a committing thread hits the throttle threshold, and now the
CIL has to switch contexts and write the old context's transactions to
disk?
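
Plugging in invented numbers to check my understanding of the
precompute:

	/*
	 * e.g. (numbers made up): a 64MB blocking space limit with
	 * 32kB iclogs and a 512 byte header means each iclog carries
	 * 32256 bytes of checkpoint data, so:
	 *
	 *	64 * 1024 * 1024 / (32768 - 512) ~= 2080
	 *
	 * iclog header reservations get precomputed per CIL context.
	 */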

>  	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
>  	ctx->sequence = ++cil->xc_current_sequence;
>  	ctx->cil = cil;
> @@ -92,6 +104,7 @@ xlog_cil_init_post_recovery(
>  {
>  	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
>  	log->l_cilp->xc_ctx->sequence = 1;
> +	xlog_cil_set_iclog_hdr_count(log->l_cilp);
>  }
>  
>  static inline int
> @@ -419,7 +432,6 @@ xlog_cil_insert_items(
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
> -	int			iclog_space;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  
>  	ASSERT(tp);
> @@ -442,19 +454,36 @@ xlog_cil_insert_items(
>  	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
>  		ctx_res = ctx->ticket->t_unit_res;
>  
> -	spin_lock(&cil->xc_cil_lock);
> -
> -	/* do we need space for more log record headers? */
> -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> -	if (len > 0 && (ctx->space_used / iclog_space !=
> -				(ctx->space_used + len) / iclog_space)) {
> -		split_res = (len + iclog_space - 1) / iclog_space;
> -		/* need to take into account split region headers, too */
> -		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> -		ctx->ticket->t_unit_res += split_res;
> +	/*
> +	 * Check if we need to steal iclog headers. atomic_read() is not a
> +	 * locked atomic operation, so we can check the value before we do any
> +	 * real atomic ops in the fast path. If we've already taken the CIL unit
> +	 * reservation from this commit, we've already got one iclog header
> +	 * space reserved so we have to account for that otherwise we risk
> +	 * overrunning the reservation on this ticket.
> +	 *
> +	 * If the CIL is already at the hard limit, we might need more header
> +	 * space than originally reserved. So steal more header space from every
> +	 * commit that occurs once we are over the hard limit to ensure the CIL
> +	 * push won't run out of reservation space.
> +	 *
> +	 * This can steal more than we need, but that's OK.
> +	 */
> +	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||

If we haven't stolen enough iclog header space...

> +	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {

...or we've hit a throttling threshold, in which case we know we're
going to push, so we might as well take everything and (I guess?) not
give back any reservation that would encourage more commits before we're
ready?

> +		int	split_res = log->l_iclog_hsize +
> +					sizeof(struct xlog_op_header);
> +		if (ctx_res)
> +			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
> +		else
> +			ctx_res = split_res * tp->t_ticket->t_iclog_hdrs;
> +		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);

What happens if xc_iclog_hdrs goes negative?  Does that merely mean that
we stole more space from the transaction than we needed?  Or does it
indicate that we're trying to cram too much into a single context?

I suppose I worry about what might happen if each transaction's
committed items actually somehow eats up every byte of reservation and
that actually translates to t_iclog_hdrs iclogs being written out with a
particular context, where sum(t_iclog_hdrs) is larger than what
xlog_cil_set_iclog_hdr_count() precomputes?
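
To put invented numbers on that worry:

	precomputed xc_iclog_hdrs		~2000
	1000 committers stealing 8 each		-8000
	resulting counter			~-6000

Deeply negative presumably just means "we stole far more than any
push can need", but it'd be good to have that confirmed.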

--D

>  	}
> -	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
> -	ctx->ticket->t_curr_res += split_res + ctx_res;
> +
> +	spin_lock(&cil->xc_cil_lock);
> +	tp->t_ticket->t_curr_res -= ctx_res + len;
> +	ctx->ticket->t_unit_res += ctx_res;
> +	ctx->ticket->t_curr_res += ctx_res;
>  	ctx->space_used += len;
>  
>  	/*
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b0dc3bc9de59..e72d14c76e03 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -140,15 +140,16 @@ enum xlog_iclog_state {
>  #define XLOG_TIC_LEN_MAX	15
>  
>  typedef struct xlog_ticket {
> -	struct list_head   t_queue;	 /* reserve/write queue */
> -	struct task_struct *t_task;	 /* task that owns this ticket */
> -	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
> -	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
> -	int		   t_curr_res;	 /* current reservation in bytes : 4  */
> -	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
> -	char		   t_ocnt;	 /* original count		 : 1  */
> -	char		   t_cnt;	 /* current count		 : 1  */
> -	char		   t_flags;	 /* properties of reservation	 : 1  */
> +	struct list_head	t_queue;	/* reserve/write queue */
> +	struct task_struct	*t_task;	/* task that owns this ticket */
> +	xlog_tid_t		t_tid;		/* transaction identifier */
> +	atomic_t		t_ref;		/* ticket reference count */
> +	int			t_curr_res;	/* current reservation */
> +	int			t_unit_res;	/* unit reservation */
> +	char			t_ocnt;		/* original count */
> +	char			t_cnt;		/* current count */
> +	char			t_flags;	/* properties of reservation */
> +	int			t_iclog_hdrs;	/* iclog hdrs in t_curr_res */
>  } xlog_ticket_t;
>  
>  /*
> @@ -249,6 +250,7 @@ struct xfs_cil_ctx {
>  struct xfs_cil {
>  	struct xlog		*xc_log;
>  	unsigned long		xc_flags;
> +	atomic_t		xc_iclog_hdrs;
>  	struct list_head	xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure
  2021-03-05  5:11 ` [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure Dave Chinner
@ 2021-03-11  0:11   ` Darrick J. Wong
  2021-03-11  6:33     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  0:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The CIL push lock is highly contended on larger machines, becoming a
> hard bottleneck at about 700,000 transaction commits/s on >16p
> machines. To address this, start moving the CIL tracking
> infrastructure to utilise per-CPU structures.
> 
> We need to track the space used, the amount of log reservation space
> reserved to write the CIL, the log items in the CIL and the busy
> extents that need to be completed by the CIL commit.  This requires
> a couple of per-cpu counters, an unordered per-cpu list and a
> globally ordered per-cpu list.
> 
> Create a per-cpu structure to hold these and all the management
> interfaces needed, as well as the hooks to handle hotplug CPUs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c       | 94 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_priv.h      | 15 ++++++
>  include/linux/cpuhotplug.h |  1 +
>  3 files changed, 110 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index f8fb2f59e24c..1bcf0d423d30 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1365,6 +1365,93 @@ xfs_log_item_in_current_chkpt(
>  	return true;
>  }
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> +static LIST_HEAD(xlog_cil_pcp_list);
> +static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
> +static bool xlog_cil_pcp_init;
> +
> +static int
> +xlog_cil_pcp_dead(
> +	unsigned int		cpu)
> +{
> +	struct xfs_cil		*cil;
> +
> +        spin_lock(&xlog_cil_pcp_lock);
> +        list_for_each_entry(cil, &xlog_cil_pcp_list, xc_pcp_list) {

Weird indentation.

> +		/* move stuff on dead CPU to context */

Should this have some actual code?  I don't think any of the remaining
patches add anything here.

> +	}
> +	spin_unlock(&xlog_cil_pcp_lock);
> +	return 0;
> +}
> +
> +static int
> +xlog_cil_pcp_hpadd(
> +	struct xfs_cil		*cil)
> +{
> +	if (!xlog_cil_pcp_init) {
> +		int	ret;
> +		ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD,
> +						"xfs/cil_pcp:dead", NULL,
> +						xlog_cil_pcp_dead);
> +		if (ret < 0) {
> +			xfs_warn(cil->xc_log->l_mp,
> +	"Failed to initialise CIL hotplug, error %d. XFS is non-functional.",

How likely is to happen?

> +				ret);
> +			ASSERT(0);

I guess not that often?

> +			return -ENOMEM;

Why not return ret here?  I guess it's because ret could be any number
of (not centrally documented?) error codes, and we don't really care to
expose that to userspace?

--D

> +		}
> +		xlog_cil_pcp_init = true;
> +	}
> +
> +	INIT_LIST_HEAD(&cil->xc_pcp_list);
> +	spin_lock(&xlog_cil_pcp_lock);
> +	list_add(&cil->xc_pcp_list, &xlog_cil_pcp_list);
> +	spin_unlock(&xlog_cil_pcp_lock);
> +	return 0;
> +}
> +
> +static void
> +xlog_cil_pcp_hpremove(
> +	struct xfs_cil		*cil)
> +{
> +	spin_lock(&xlog_cil_pcp_lock);
> +	list_del(&cil->xc_pcp_list);
> +	spin_unlock(&xlog_cil_pcp_lock);
> +}
> +
> +#else /* !CONFIG_HOTPLUG_CPU */
> +static inline void xlog_cil_pcp_hpadd(struct xfs_cil *cil) {}
> +static inline void xlog_cil_pcp_hpremove(struct xfs_cil *cil) {}
> +#endif
> +
> +static void __percpu *
> +xlog_cil_pcp_alloc(
> +	struct xfs_cil		*cil)
> +{
> +	struct xlog_cil_pcp	*cilpcp;
> +
> +	cilpcp = alloc_percpu(struct xlog_cil_pcp);
> +	if (!cilpcp)
> +		return NULL;
> +
> +	if (xlog_cil_pcp_hpadd(cil) < 0) {
> +		free_percpu(cilpcp);
> +		return NULL;
> +	}
> +	return cilpcp;
> +}
> +
> +static void
> +xlog_cil_pcp_free(
> +	struct xfs_cil		*cil,
> +	struct xlog_cil_pcp	*cilpcp)
> +{
> +	if (!cilpcp)
> +		return;
> +	xlog_cil_pcp_hpremove(cil);
> +	free_percpu(cilpcp);
> +}
> +
>  /*
>   * Perform initial CIL structure initialisation.
>   */
> @@ -1379,6 +1466,12 @@ xlog_cil_init(
>  	if (!cil)
>  		return -ENOMEM;
>  
> +	cil->xc_pcp = xlog_cil_pcp_alloc(cil);
> +	if (!cil->xc_pcp) {
> +		kmem_free(cil);
> +		return -ENOMEM;
> +	}
> +
>  	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
> @@ -1409,6 +1502,7 @@ xlog_cil_destroy(
>  
>  	ASSERT(list_empty(&cil->xc_cil));
>  	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
> +	xlog_cil_pcp_free(cil, cil->xc_pcp);
>  	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index e72d14c76e03..2562f29c8986 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -231,6 +231,16 @@ struct xfs_cil_ctx {
>  	struct work_struct	push_work;
>  };
>  
> +/*
> + * Per-cpu CIL tracking items
> + */
> +struct xlog_cil_pcp {
> +	uint32_t		space_used;
> +	uint32_t		curr_res;
> +	struct list_head	busy_extents;
> +	struct list_head	log_items;
> +};
> +
>  /*
>   * Committed Item List structure
>   *
> @@ -264,6 +274,11 @@ struct xfs_cil {
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
> +
> +	struct xlog_cil_pcp __percpu *xc_pcp;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	struct list_head	xc_pcp_list;
> +#endif
>  } ____cacheline_aligned_in_smp;
>  
>  /* xc_flags bit values */
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index f14adb882338..b13b21d825b3 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -52,6 +52,7 @@ enum cpuhp_state {
>  	CPUHP_FS_BUFF_DEAD,
>  	CPUHP_PRINTK_DEAD,
>  	CPUHP_MM_MEMCQ_DEAD,
> +	CPUHP_XFS_CIL_DEAD,
>  	CPUHP_PERCPU_CNT_DEAD,
>  	CPUHP_RADIX_DEAD,
>  	CPUHP_PAGE_ALLOC_DEAD,
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 36/45] xfs: implement percpu cil space used calculation
  2021-03-05  5:11 ` [PATCH 36/45] xfs: implement percpu cil space used calculation Dave Chinner
@ 2021-03-11  0:20   ` Darrick J. Wong
  2021-03-11  6:51     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  0:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the CIL percpu structures in place, implement the
> space used counter with a fast sum check similar to the
> percpu_counter infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 42 ++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_log_priv.h |  2 +-
>  2 files changed, 37 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 1bcf0d423d30..5519d112c1fd 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -433,6 +433,8 @@ xlog_cil_insert_items(
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> +	int			space_used;
> +	struct xlog_cil_pcp	*cilpcp;
>  
>  	ASSERT(tp);
>  
> @@ -469,8 +471,9 @@ xlog_cil_insert_items(
>  	 *
>  	 * This can steal more than we need, but that's OK.
>  	 */
> +	space_used = atomic_read(&ctx->space_used);
>  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> -	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		int	split_res = log->l_iclog_hsize +
>  					sizeof(struct xlog_op_header);
>  		if (ctx_res)
> @@ -480,16 +483,34 @@ xlog_cil_insert_items(
>  		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
>  	}
>  
> +	/*
> +	 * Update the CIL percpu pointer. This updates the global counter when
> +	 * over the percpu batch size or when the CIL is over the space limit.
> +	 * This means low lock overhead for normal updates, and when over the
> +	 * limit the space used is immediately accounted. This makes enforcing
> +	 * the hard limit much more accurate. The per cpu fold threshold is
> +	 * based on how close we are to the hard limit.
> +	 */
> +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> +	cilpcp->space_used += len;
> +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> +	    cilpcp->space_used >
> +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> +					num_online_cpus())) {

What happens if the log is very small and there are hundreds of CPUs?
Can we end up on this slow path on a regular basis even if the amount of
space used is not that large?

Granted I can't think of a good way out of that, since I suspect that if
you do that you're already going to be hurting in 5 other places anyway.
That said ... I /do/ keep getting bugs from people with tiny logs on big
iron.  Some day I'll (ha!) stomp out all the bugs that are "NO do not
let your deployment system growfs 10000x, this is not ext4"...
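
To put invented numbers on that: a 10MB log might give a blocking
limit of ~2.5MB, so with 256 CPUs and space_used still near zero the
fold threshold works out to roughly:

	(2.5MB - 0) / 256 ~= 10kB per CPU

i.e. a handful of commits per CPU before every one of them starts
hitting the global atomic anyway.  Tiny log plus big iron degrades
towards a plain shared counter.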

> +		atomic_add(cilpcp->space_used, &ctx->space_used);
> +		cilpcp->space_used = 0;
> +	}
> +	put_cpu_ptr(cilpcp);
> +
>  	spin_lock(&cil->xc_cil_lock);
> -	tp->t_ticket->t_curr_res -= ctx_res + len;
>  	ctx->ticket->t_unit_res += ctx_res;
>  	ctx->ticket->t_curr_res += ctx_res;
> -	ctx->space_used += len;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
>  	 * the log items. Shutdown is imminent...
>  	 */
> +	tp->t_ticket->t_curr_res -= ctx_res + len;

Is moving this really necessary?

--D

>  	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
>  		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
>  		xfs_warn(log->l_mp,
> @@ -769,12 +790,20 @@ xlog_cil_push_work(
>  	struct bio		bio;
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  	bool			commit_iclog_sync = false;
> +	int			cpu;
> +	struct xlog_cil_pcp	*cilpcp;
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
>  
>  	down_write(&cil->xc_ctx_lock);
>  
> +	/* Reset the CIL pcp counters */
> +	for_each_online_cpu(cpu) {
> +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +		cilpcp->space_used = 0;
> +	}
> +
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
> @@ -1042,6 +1071,7 @@ xlog_cil_push_background(
>  	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
> +	int		space_used = atomic_read(&cil->xc_ctx->space_used);
>  
>  	/*
>  	 * The cil won't be empty because we are called while holding the
> @@ -1054,7 +1084,7 @@ xlog_cil_push_background(
>  	 * Don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> @@ -1083,10 +1113,10 @@ xlog_cil_push_background(
>  	 * The ctx->xc_push_lock provides the serialisation necessary for safely
>  	 * using the lockless waitqueue_active() check in this context.
>  	 */
> -	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
> +	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
>  	    waitqueue_active(&cil->xc_push_wait)) {
>  		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> -		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		ASSERT(space_used < log->l_logsize);
>  		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>  		return;
>  	}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 2562f29c8986..4eb373357f26 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -222,7 +222,7 @@ struct xfs_cil_ctx {
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
>  	int			nvecs;		/* number of regions */
> -	int			space_used;	/* aggregate size of regions */
> +	atomic_t		space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure
  2021-03-05  5:11 ` [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Dave Chinner
@ 2021-03-11  0:26   ` Darrick J. Wong
  2021-03-12  0:47     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  0:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To get it out from under the cil spinlock.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 11 ++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 5519d112c1fd..a2f93bd7644b 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -492,6 +492,7 @@ xlog_cil_insert_items(
>  	 * based on how close we are to the hard limit.
>  	 */
>  	cilpcp = get_cpu_ptr(cil->xc_pcp);
> +	cilpcp->space_reserved += ctx_res;
>  	cilpcp->space_used += len;
>  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>  	    cilpcp->space_used >
> @@ -502,10 +503,6 @@ xlog_cil_insert_items(
>  	}
>  	put_cpu_ptr(cilpcp);
>  
> -	spin_lock(&cil->xc_cil_lock);
> -	ctx->ticket->t_unit_res += ctx_res;
> -	ctx->ticket->t_curr_res += ctx_res;
> -
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
>  	 * the log items. Shutdown is imminent...
> @@ -527,6 +524,7 @@ xlog_cil_insert_items(
>  	 * We do this here so we only need to take the CIL lock once during
>  	 * the transaction commit.
>  	 */
> +	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
>  		/* Skip items which aren't dirty in this transaction. */
> @@ -798,10 +796,13 @@ xlog_cil_push_work(
>  
>  	down_write(&cil->xc_ctx_lock);
>  
> -	/* Reset the CIL pcp counters */
> +	/* Aggregate and reset the CIL pcp counters */
>  	for_each_online_cpu(cpu) {
>  		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +		ctx->ticket->t_curr_res += cilpcp->space_reserved;

Why isn't it necessary to update ctx->ticket->t_unit_res any more?

(Admittedly I'm struggling to figure out why it matters to keep it
updated even in the current code base...)

--D

>  		cilpcp->space_used = 0;
> +		cilpcp->space_reserved = 0;
> +
>  	}
>  
>  	spin_lock(&cil->xc_push_lock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 4eb373357f26..278b9eaea582 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -236,7 +236,7 @@ struct xfs_cil_ctx {
>   */
>  struct xlog_cil_pcp {
>  	uint32_t		space_used;
> -	uint32_t		curr_res;
> +	uint32_t		space_reserved;
>  	struct list_head	busy_extents;
>  	struct list_head	log_items;
>  };
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-03-08 22:53   ` Darrick J. Wong
@ 2021-03-11  0:26     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  0:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 02:53:23PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:11PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > In doing an investigation into AIL push stalls, I was looking at the
> > log force code to see if an async CIL push could be done instead.
> > This lead me to xfs_log_force_lsn() and looking at how it works.
> > 
> > xfs_log_force_lsn() is only called from inode synchronisation
> > contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> > value as the LSN to sync the log to. This gets passed to
> > xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> > journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> > the journal.
> > 
> > The problem with this is that ip->i_itemp->ili_last_lsn does not store a
> > log sequence number. What it stores is passed to it from the
> > ->iop_committing method, which is called by xfs_log_commit_cil().
> > The value this passes to the iop_committing method is the CIL
> > context sequence number that the item was committed to.
> > 
> > As it turns out, xlog_cil_force_lsn() converts the sequence to an
> > actual commit LSN for the related context and returns that to
> > xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
> > variable that contained a sequence with an actual LSN and then uses
> > that to sync the iclogs.
> > 
> > This caused me some confusion for a while, even though I originally
> > wrote all this code a decade ago. ->iop_committing is only used by
> > a couple of log item types, and only inode items use the sequence
> > number it is passed.
> > 
> > Let's clean up the API, CIL structures and inode log item to call it
> > a sequence number, and make it clear that the high level code is
> > using CIL sequence numbers and not on-disk LSNs for integrity
> > synchronisation purposes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_types.h |  1 +
> >  fs/xfs/xfs_buf_item.c     |  2 +-
> >  fs/xfs/xfs_dquot_item.c   |  2 +-
> >  fs/xfs/xfs_file.c         | 14 +++++++-------
> >  fs/xfs/xfs_inode.c        | 10 +++++-----
> >  fs/xfs/xfs_inode_item.c   |  4 ++--
> >  fs/xfs/xfs_inode_item.h   |  2 +-
> >  fs/xfs/xfs_log.c          | 27 ++++++++++++++-------------
> >  fs/xfs/xfs_log.h          |  4 +---
> >  fs/xfs/xfs_log_cil.c      | 22 +++++++++-------------
> >  fs/xfs/xfs_log_priv.h     | 15 +++++++--------
> >  fs/xfs/xfs_trans.c        |  6 +++---
> >  fs/xfs/xfs_trans.h        |  4 ++--
> >  13 files changed, 54 insertions(+), 59 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 064bd6e8c922..0870ef6f933d 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -21,6 +21,7 @@ typedef int32_t		xfs_suminfo_t;	/* type of bitmap summary info */
> >  typedef uint32_t	xfs_rtword_t;	/* word type for bitmap manipulations */
> >  
> >  typedef int64_t		xfs_lsn_t;	/* log sequence number */
> > +typedef int64_t		xfs_csn_t;	/* CIL sequence number */
> 
> I'm unfamiliar with the internal format of CIL sequence numbers.  Do
> they have the same cycle:offset segmented structure as LSNs do?  Or are
> they a simple linear integer that increases as we checkpoint committed
> items?

Monotonic increasing integer, only ever used in memory and never
written to disk.
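
i.e. it's nothing more than the counter bumped in
xlog_cil_ctx_switch():

	ctx->sequence = ++cil->xc_current_sequence;

so plain integer comparisons on it are always valid.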

> 
> Looking through the current code, I see a couple of places where we
> initialize them to 1, and I also see that when we create a new cil
> context we set its sequence to one more than the context that it will
> replace.
> 
> I also see a bunch of comparisons of cil context sequence numbers that
> use standard integer operators, but then I also see one instance of:
> 
> 	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
> 		return false;
> 	return true
> 
> in xfs_log_item_in_current_chkpt.  AFAICT this could be replaced with a
> simple:
> 
> 	return lip->li_seq == ctx->sequence;

Yup, missed that, will fix.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 38/45] xfs: convert CIL busy extents to per-cpu
  2021-03-05  5:11 ` [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Dave Chinner
@ 2021-03-11  0:36   ` Darrick J. Wong
  2021-03-12  1:15     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  0:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To get them out from under the CIL lock.
> 
> This is an unordered list, so we can simply punt it to per-cpu lists
> during transaction commits and reaggregate it back into a single
> list during the CIL push work.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index a2f93bd7644b..7428b98c8279 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -501,6 +501,9 @@ xlog_cil_insert_items(
>  		atomic_add(cilpcp->space_used, &ctx->space_used);
>  		cilpcp->space_used = 0;
>  	}
> +	/* attach the transaction to the CIL if it has any busy extents */
> +	if (!list_empty(&tp->t_busy))
> +		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
>  	put_cpu_ptr(cilpcp);
>  
>  	/*
> @@ -540,9 +543,6 @@ xlog_cil_insert_items(
>  			list_move_tail(&lip->li_cil, &cil->xc_cil);
>  	}
>  
> -	/* attach the transaction to the CIL if it has any busy extents */
> -	if (!list_empty(&tp->t_busy))
> -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
>  	spin_unlock(&cil->xc_cil_lock);
>  
>  	if (tp->t_ticket->t_curr_res < 0)
> @@ -802,7 +802,10 @@ xlog_cil_push_work(
>  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
>  		cilpcp->space_used = 0;
>  		cilpcp->space_reserved = 0;
> -
> +		if (!list_empty(&cilpcp->busy_extents)) {
> +			list_splice_init(&cilpcp->busy_extents,
> +					&ctx->busy_extents);
> +		}
>  	}
>  
>  	spin_lock(&cil->xc_push_lock);
> @@ -1459,17 +1462,24 @@ static void __percpu *
>  xlog_cil_pcp_alloc(
>  	struct xfs_cil		*cil)
>  {
> +	void __percpu		*pcptr;
>  	struct xlog_cil_pcp	*cilpcp;
> +	int			cpu;
>  
> -	cilpcp = alloc_percpu(struct xlog_cil_pcp);
> -	if (!cilpcp)
> +	pcptr = alloc_percpu(struct xlog_cil_pcp);
> +	if (!pcptr)
>  		return NULL;
>  
> +	for_each_possible_cpu(cpu) {
> +		cilpcp = per_cpu_ptr(pcptr, cpu);

So... in my mind, "cilpcp" and "pcptr" aren't really all that distinct
from each other.  I /think/ you're trying to use "cilpcp" everywhere
else to mean "pointer to a particular CPU's CIL data", and this change
makes that usage consistent in the alloc function.

However, this leaves xlog_cil_pcp_free using "cilpcp" to refer to the
entire chunk of per-CPU data structures.  Given that the first refers to
a specific structure and the second refers to them all in aggregate,
maybe _pcp_alloc and _pcp_free should use a name that at least sounds
plural?

e.g.

	void __percpu	*all_cilpcps = alloc_percpu(...);

	for_each_possible_cpu(cpu) {
		cilpcp = per_cpu_ptr(all_cilpcps, cpu);
		cilpcp->magicval = 7777;
	}

and

	cil->xc_all_pcps = xlog_cil_pcp_alloc(...);

Hm?

--D

> +		INIT_LIST_HEAD(&cilpcp->busy_extents);
> +	}
> +
>  	if (xlog_cil_pcp_hpadd(cil) < 0) {
> -		free_percpu(cilpcp);
> +		free_percpu(pcptr);
>  		return NULL;
>  	}
> -	return cilpcp;
> +	return pcptr;
>  }
>  
>  static void
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 39/45] xfs: Add order IDs to log items in CIL
  2021-03-05  5:11 ` [PATCH 39/45] xfs: Add order IDs to log items in CIL Dave Chinner
@ 2021-03-11  1:00   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  1:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Before we split the ordered CIL up into per cpu lists, we need a
> mechanism to track the order of the items in the CIL. We need to do
> this because there are rules around the order in which related items
> must physically appear in the log even inside a single checkpoint
> transaction.
> 
> An example of this is intents - an intent must appear in the log
> before it's intent done record so taht log recovery can cancel the
> intent correctly. If we have these two records misordered in the
> CIL, then they will not be recovered correctly by journal replay.
> 
> We also will not be able to move items to the tail of
> the CIL list when they are relogged, hence the log items will need
> some mechanism to allow the correct log item order to be recreated
> before we write log items to the journal.
> 
> Hence we need to have a mechanism for recording global order of
> transactions in the log items so that we can recover that order
> from un-ordered per-cpu lists.
> 
> Do this with a simple monotonic increasing commit counter in the CIL
> context. Each log item in the transaction gets stamped with the
> current commit order ID before it is added to the CIL. If the item
> is already in the CIL, leave it where it is instead of moving it to
> the tail of the list and instead sort the list before we start the
> push work.
> 
> XXX: list_sort() under the cil_ctx_lock held exclusive starts
> hurting at >16 threads. Front end commits are waiting on the push
> to switch contexts much longer. The item order id should likely be
> moved into the logvecs when they are detached from the items, then
> the sort can be done on the logvec after the cil_ctx_lock has been
> released. logvecs will need to use a list_head for this rather than
> a single linked list like they do now....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 34 ++++++++++++++++++++++++++--------
>  fs/xfs/xfs_log_priv.h |  1 +
>  fs/xfs/xfs_trans.h    |  1 +
>  3 files changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 7428b98c8279..7420389f4cee 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -434,6 +434,7 @@ xlog_cil_insert_items(
>  	int			len = 0;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  	int			space_used;
> +	int			order;
>  	struct xlog_cil_pcp	*cilpcp;
>  
>  	ASSERT(tp);
> @@ -523,10 +524,12 @@ xlog_cil_insert_items(
>  	}
>  
>  	/*
> -	 * Now (re-)position everything modified at the tail of the CIL.
> +	 * Now update the order of everything modified in the transaction
> +	 * and insert items into the CIL if they aren't already there.
>  	 * We do this here so we only need to take the CIL lock once during
>  	 * the transaction commit.
>  	 */
> +	order = atomic_inc_return(&ctx->order_id);
>  	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
> @@ -534,13 +537,10 @@ xlog_cil_insert_items(
>  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
>  			continue;
>  
> -		/*
> -		 * Only move the item if it isn't already at the tail. This is
> -		 * to prevent a transient list_empty() state when reinserting
> -		 * an item that is already the only item in the CIL.
> -		 */
> -		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> -			list_move_tail(&lip->li_cil, &cil->xc_cil);
> +		lip->li_order_id = order;
> +		if (!list_empty(&lip->li_cil))
> +			continue;
> +		list_add(&lip->li_cil, &cil->xc_cil);

If the goal here is to end up an xc_cil list where all the log items are
sorted in commit order, why isn't the existing strategy of moving dirty
items to the tail sufficient to keep them in sorted order?

Hm, looking at the /next/ patch, I see you start adding the items to the
per-CPU CIL structure and only combining them into a single list at push
time.  Maybe that's a better place to talk about this.

--D

>  	}
>  
>  	spin_unlock(&cil->xc_cil_lock);
> @@ -753,6 +753,22 @@ xlog_cil_build_trans_hdr(
>  	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
>  
> +static int
> +xlog_cil_order_cmp(
> +	void			*priv,
> +	struct list_head	*a,
> +	struct list_head	*b)
> +{
> +	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
> +	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
> +
> +	if (l1->li_order_id > l2->li_order_id)
> +		return 1;
> +	if (l1->li_order_id < l2->li_order_id)
> +		return -1;
> +	return 0;
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -891,6 +907,7 @@ xlog_cil_push_work(
>  	 * needed on the transaction commit side which is currently locked out
>  	 * by the flush lock.
>  	 */
> +	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
>  	lv = NULL;
>  	while (!list_empty(&cil->xc_cil)) {
>  		struct xfs_log_item	*item;
> @@ -898,6 +915,7 @@ xlog_cil_push_work(
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
>  		list_del_init(&item->li_cil);
> +		item->li_order_id = 0;
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
>  		else
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 278b9eaea582..92d9e1a03a07 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -229,6 +229,7 @@ struct xfs_cil_ctx {
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
>  	struct work_struct	push_work;
> +	atomic_t		order_id;
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 6276c7d251e6..226c0f5e7870 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -44,6 +44,7 @@ struct xfs_log_item {
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
>  	xfs_csn_t			li_seq;		/* CIL commit seq */
> +	uint32_t			li_order_id;	/* CIL commit order */
>  };
>  
>  /*
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 40/45] xfs: convert CIL to unordered per cpu lists
  2021-03-05  5:11 ` [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Dave Chinner
@ 2021-03-11  1:15   ` Darrick J. Wong
  2021-03-12  2:18     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  1:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:38PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So that we can remove the cil_lock which is a global serialisation
> point. We've already got ordering sorted, so all we need to do is
> treat the CIL list like the busy extent list and reconstruct it
> before the push starts.
> 
> This is what we're trying to avoid:
> 
>  -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
>     - 46.35% xfs_log_commit_cil
>        - 41.54% _raw_spin_lock
>           - 67.30% do_raw_spin_lock
>                66.96% __pv_queued_spin_lock_slowpath
> 
> Which happens on a 32p system when running a 32-way 'rm -rf'
> workload. After this patch:
> 
> -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
>    - 17.67% xfs_log_commit_cil
>       - 6.51% xfs_log_ticket_ungrant
>            1.40% xfs_log_space_wake
>         2.32% memcpy_erms
>       - 2.18% xfs_buf_item_committing
>          - 2.12% xfs_buf_item_release
>             - 1.03% xfs_buf_unlock
>                  0.96% up
>               0.72% xfs_buf_rele
>         1.33% xfs_inode_item_format
>         1.19% down_read
>         0.91% up_read
>         0.76% xfs_buf_item_format
>       - 0.68% kmem_alloc_large
>          - 0.67% kmem_alloc
>               0.64% __kmalloc
>         0.50% xfs_buf_item_size
> 
> It kinda looks like the workload is running out of log space all
> the time. But all the spinlock contention is gone and the
> transaction commit rate has gone from 800k/s to 1.3M/s so the amount
> of real work being done has gone up a *lot*.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++-----------------------
>  fs/xfs/xfs_log_priv.h |  2 --
>  2 files changed, 29 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 7420389f4cee..3d43a5088154 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -448,10 +448,9 @@ xlog_cil_insert_items(
>  	/*
>  	 * We need to take the CIL checkpoint unit reservation on the first
>  	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
> -	 * unnecessarily do an atomic op in the fast path here. We don't need to
> -	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
> -	 * under the xc_ctx_lock here and that needs to be held exclusively to
> -	 * reset the XLOG_CIL_EMPTY bit.
> +	 * unnecessarily do an atomic op in the fast path here. We can clear the
> +	 * XLOG_CIL_EMPTY bit as we are under the xc_ctx_lock here and that
> +	 * needs to be held exclusively to reset the XLOG_CIL_EMPTY bit.
>  	 */
>  	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
>  	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
> @@ -505,24 +504,6 @@ xlog_cil_insert_items(
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
> -	put_cpu_ptr(cilpcp);
> -
> -	/*
> -	 * If we've overrun the reservation, dump the tx details before we move
> -	 * the log items. Shutdown is imminent...
> -	 */
> -	tp->t_ticket->t_curr_res -= ctx_res + len;
> -	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> -		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> -		xfs_warn(log->l_mp,
> -			 "  log items: %d bytes (iov hdrs: %d bytes)",
> -			 len, iovhdr_res);
> -		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> -			 split_res);
> -		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> -		xlog_print_trans(tp);
> -	}
> -
>  	/*
>  	 * Now update the order of everything modified in the transaction
>  	 * and insert items into the CIL if they aren't already there.
> @@ -530,7 +511,6 @@ xlog_cil_insert_items(
>  	 * the transaction commit.
>  	 */
>  	order = atomic_inc_return(&ctx->order_id);
> -	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
>  		/* Skip items which aren't dirty in this transaction. */
> @@ -540,10 +520,26 @@ xlog_cil_insert_items(
>  		lip->li_order_id = order;
>  		if (!list_empty(&lip->li_cil))
>  			continue;
> -		list_add(&lip->li_cil, &cil->xc_cil);
> +		list_add(&lip->li_cil, &cilpcp->log_items);

Ok, so if I understand this correctly -- every time a transaction
commits, it marks every dirty log item with a monotonically increasing
counter.  If the log item isn't already on another CPU's CIL list, it
gets added to the current CPU's CIL list...
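
(IOWs, squashing the new insert path down to its essence -- my own
sketch, not the literal patch code:)

	order = atomic_inc_return(&ctx->order_id);

	cilpcp = get_cpu_ptr(cil->xc_pcp);
	list_for_each_entry(lip, &tp->t_items, li_trans) {
		/* dirty-item checks omitted for brevity */
		lip->li_order_id = order;	/* global commit order */
		if (list_empty(&lip->li_cil))	/* not on any CPU list yet */
			list_add(&lip->li_cil, &cilpcp->log_items);
	}
	put_cpu_ptr(cilpcp);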

> +	}
> +	put_cpu_ptr(cilpcp);
> +
> +	/*
> +	 * If we've overrun the reservation, dump the tx details before we move
> +	 * the log items. Shutdown is imminent...
> +	 */
> +	tp->t_ticket->t_curr_res -= ctx_res + len;
> +	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> +		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> +		xfs_warn(log->l_mp,
> +			 "  log items: %d bytes (iov hdrs: %d bytes)",
> +			 len, iovhdr_res);
> +		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> +			 split_res);
> +		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> +		xlog_print_trans(tp);
>  	}
>  
> -	spin_unlock(&cil->xc_cil_lock);
>  
>  	if (tp->t_ticket->t_curr_res < 0)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -806,6 +802,7 @@ xlog_cil_push_work(
>  	bool			commit_iclog_sync = false;
>  	int			cpu;
>  	struct xlog_cil_pcp	*cilpcp;
> +	LIST_HEAD		(log_items);
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -822,6 +819,9 @@ xlog_cil_push_work(
>  			list_splice_init(&cilpcp->busy_extents,
>  					&ctx->busy_extents);
>  		}
> +		if (!list_empty(&cilpcp->log_items)) {
> +			list_splice_init(&cilpcp->log_items, &log_items);

...and then at CIL push time, we splice each per-CPU list into a big
list, sort the dirty log items by counter number, and process them.

The first thought I had was that it's a darn shame that _insert_items
can't steal a log item from another CPU's CIL list, because you could
then mergesort the per-CPU CIL lists into @log_items.  Unfortunately, I
don't think there's a safe way to steal items from a per-CPU list
without involving locks.

The second thought I had was that we have the xfs_pwork mechanism for
launching a bunch of worker threads.  A pwork workqueue is (probably)
too costly when the item list is short or there aren't that many CPUs,
but once list_sort starts getting painful, would it be faster to launch
a bunch of threads in push_work to sort each per-CPU list and then merge
sort them into the final list?
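
Untested sketch of the merge step, assuming each per-CPU list has
already been list_sort()ed by li_order_id (function and variable names
here are mine, not from the patch):

	static void
	merge_sorted_cil_lists(
		struct list_head	*dst,	/* sorted by li_order_id */
		struct list_head	*src)	/* sorted; emptied on return */
	{
		struct list_head	*pos = dst->next;
		struct xfs_log_item	*item, *cur;

		while (!list_empty(src)) {
			item = list_first_entry(src, struct xfs_log_item,
						li_cil);

			/* find the first dst entry that sorts after item */
			while (pos != dst) {
				cur = list_entry(pos, struct xfs_log_item,
						 li_cil);
				if (cur->li_order_id > item->li_order_id)
					break;
				pos = pos->next;
			}

			/* moves item off src, inserting it before pos */
			list_move_tail(&item->li_cil, pos);
		}
	}

Since both inputs are already sorted, pos never has to move backwards,
so each pairwise merge is a single O(len(dst) + len(src)) pass.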

FWIW at least mechanically, the last two patches look reasonable to me.

--D

> +		}
>  	}
>  
>  	spin_lock(&cil->xc_push_lock);
> @@ -907,12 +907,12 @@ xlog_cil_push_work(
>  	 * needed on the transaction commit side which is currently locked out
>  	 * by the flush lock.
>  	 */
> -	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
> +	list_sort(NULL, &log_items, xlog_cil_order_cmp);
>  	lv = NULL;
> -	while (!list_empty(&cil->xc_cil)) {
> +	while (!list_empty(&log_items)) {
>  		struct xfs_log_item	*item;
>  
> -		item = list_first_entry(&cil->xc_cil,
> +		item = list_first_entry(&log_items,
>  					struct xfs_log_item, li_cil);
>  		list_del_init(&item->li_cil);
>  		item->li_order_id = 0;
> @@ -1099,7 +1099,6 @@ xlog_cil_push_background(
>  	 * The cil won't be empty because we are called while holding the
>  	 * context lock so whatever we added to the CIL will still be there.
>  	 */
> -	ASSERT(!list_empty(&cil->xc_cil));
>  	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  
>  	/*
> @@ -1491,6 +1490,7 @@ xlog_cil_pcp_alloc(
>  	for_each_possible_cpu(cpu) {
>  		cilpcp = per_cpu_ptr(pcptr, cpu);
>  		INIT_LIST_HEAD(&cilpcp->busy_extents);
> +		INIT_LIST_HEAD(&cilpcp->log_items);
>  	}
>  
>  	if (xlog_cil_pcp_hpadd(cil) < 0) {
> @@ -1531,9 +1531,7 @@ xlog_cil_init(
>  		return -ENOMEM;
>  	}
>  
> -	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
> -	spin_lock_init(&cil->xc_cil_lock);
>  	spin_lock_init(&cil->xc_push_lock);
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
> @@ -1559,7 +1557,6 @@ xlog_cil_destroy(
>  		kmem_free(cil->xc_ctx);
>  	}
>  
> -	ASSERT(list_empty(&cil->xc_cil));
>  	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  	xlog_cil_pcp_free(cil, cil->xc_pcp);
>  	kmem_free(cil);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 92d9e1a03a07..12a1a36eef7e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -262,8 +262,6 @@ struct xfs_cil {
>  	struct xlog		*xc_log;
>  	unsigned long		xc_flags;
>  	atomic_t		xc_iclog_hdrs;
> -	struct list_head	xc_cil;
> -	spinlock_t		xc_cil_lock;
>  
>  	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
>  	struct xfs_cil_ctx	*xc_ctx;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 41/45] xfs: move CIL ordering to the logvec chain
  2021-03-05  5:11 ` [PATCH 41/45] xfs: move CIL ordering to the logvec chain Dave Chinner
@ 2021-03-11  1:34   ` Darrick J. Wong
  2021-03-12  2:29     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  1:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Adding a list_sort() call to the CIL push work while the xc_ctx_lock
> is held exclusively has resulted in fairly long lock hold times and
> that stops all front end transaction commits from making progress.

Heh, nice solution. :)

> We can move the sorting out of the xc_ctx_lock if we can transfer
> the ordering information to the log vectors as they are detached
> from the log items and then we can sort the log vectors. This
> requires log vectors to use a list_head rather than a single linked
> list

Ergh, could pull out the list conversion into a separate piece?
Some of the lv_chain usage is ... not entirely textbook.

> and to hold an order ID field. With these changes, we can move
> the list_sort() call to just before we call xlog_write() when we
> aren't holding any locks at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c        | 46 +++++++++++++++++++++---------
>  fs/xfs/xfs_log.h        |  3 +-
>  fs/xfs/xfs_log_cil.c    | 63 +++++++++++++++++++++++++----------------
>  fs/xfs/xfs_log_priv.h   |  4 +--
>  fs/xfs/xfs_trans.c      |  4 +--
>  fs/xfs/xfs_trans_priv.h |  4 +--
>  6 files changed, 78 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 46a006d41184..fd58c3213ebf 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -846,6 +846,9 @@ xlog_write_unmount_record(
>  		.lv_niovecs = 1,
>  		.lv_iovecp = &reg,
>  	};
> +	LIST_HEAD(lv_chain);
> +	INIT_LIST_HEAD(&vec.lv_chain);
> +	list_add(&vec.lv_chain, &lv_chain);
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(unmount_rec);
> @@ -857,8 +860,8 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
> -				reg.i_len);
> +	return xlog_write(log, &lv_chain, ticket, NULL, NULL,
> +				XLOG_UNMOUNT_TRANS, reg.i_len);
>  }
>  
>  /*
> @@ -1571,14 +1574,17 @@ xlog_commit_record(
>  		.lv_iovecp = &reg,
>  	};
>  	int	error;
> +	LIST_HEAD(lv_chain);
> +	INIT_LIST_HEAD(&vec.lv_chain);
> +	list_add(&vec.lv_chain, &lv_chain);
>  
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> -				reg.i_len);
> +	error = xlog_write(log, &lv_chain, ticket, lsn, iclog,
> +				XLOG_COMMIT_TRANS, reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2109,6 +2115,7 @@ xlog_print_trans(
>   */
>  static struct xfs_log_vec *
>  xlog_write_single(
> +	struct list_head	*lv_chain,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	*iclog,
> @@ -2117,7 +2124,7 @@ xlog_write_single(
>  	uint32_t		*record_cnt,
>  	uint32_t		*data_cnt)
>  {
> -	struct xfs_log_vec	*lv = log_vector;
> +	struct xfs_log_vec	*lv;
>  	void			*ptr;
>  	int			index;
>  
> @@ -2125,10 +2132,13 @@ xlog_write_single(
>  		iclog->ic_state == XLOG_STATE_WANT_SYNC);
>  
>  	ptr = iclog->ic_datap + *log_offset;
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
> +	for (lv = log_vector;
> +	     !list_entry_is_head(lv, lv_chain, lv_chain);
> +	     lv = list_next_entry(lv, lv_chain)) {
>  		/*
> -		 * If the entire log vec does not fit in the iclog, punt it to
> -		 * the partial copy loop which can handle this case.
> +		 * If the log vec contains data that needs to be copied and does
> +		 * not entirely fit in the iclog, punt it to the partial copy
> +		 * loop which can handle this case.
>  		 */
>  		if (lv->lv_niovecs &&
>  		    lv->lv_bytes > iclog->ic_size - *log_offset)
> @@ -2154,6 +2164,8 @@ xlog_write_single(
>  			*data_cnt += reg->i_len;
>  		}
>  	}
> +	if (list_entry_is_head(lv, lv_chain, lv_chain))
> +		lv = NULL;
>  	ASSERT(*len == 0 || lv);
>  	return lv;
>  }
> @@ -2199,6 +2211,7 @@ xlog_write_get_more_iclog_space(
>  static struct xfs_log_vec *
>  xlog_write_partial(
>  	struct xlog		*log,
> +	struct list_head	*lv_chain,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	**iclogp,
> @@ -2338,7 +2351,10 @@ xlog_write_partial(
>  	 * the caller so it can go back to fast path copying.
>  	 */
>  	*iclogp = iclog;
> -	return lv->lv_next;
> +	lv = list_next_entry(lv, lv_chain);
> +	if (list_entry_is_head(lv, lv_chain, lv_chain))
> +		return NULL;
> +	return lv;
>  }
>  
>  /*
> @@ -2384,7 +2400,7 @@ xlog_write_partial(
>  int
>  xlog_write(
>  	struct xlog		*log,
> -	struct xfs_log_vec	*log_vector,
> +	struct list_head	*lv_chain,
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> @@ -2392,7 +2408,7 @@ xlog_write(
>  	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
> -	struct xfs_log_vec	*lv = log_vector;
> +	struct xfs_log_vec	*lv;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> @@ -2424,15 +2440,17 @@ xlog_write(
>  	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
>  		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  
> +	lv = list_first_entry_or_null(lv_chain, struct xfs_log_vec, lv_chain);
>  	while (lv) {
> -		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> +		lv = xlog_write_single(lv_chain, lv, ticket, iclog, &log_offset,
>  					&len, &record_cnt, &data_cnt);
>  		if (!lv)
>  			break;
>  
>  		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
> -		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> -					&len, &record_cnt, &data_cnt);
> +		lv = xlog_write_partial(log, lv_chain, lv, ticket, &iclog,
> +					&log_offset, &len, &record_cnt,
> +					&data_cnt);
>  		if (IS_ERR_OR_NULL(lv)) {
>  			error = PTR_ERR_OR_ZERO(lv);
>  			break;
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index af54ea3f8c90..0445dd6acbce 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -9,7 +9,8 @@
>  struct xfs_cil_ctx;
>  
>  struct xfs_log_vec {
> -	struct xfs_log_vec	*lv_next;	/* next lv in build list */
> +	struct list_head	lv_chain;	/* lv chain ptrs */
> +	int			lv_order_id;	/* chain ordering info */

uint32_t to match li_order_id?

>  	int			lv_niovecs;	/* number of iovecs in lv */
>  	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
>  	struct xfs_log_item	*lv_item;	/* owner */
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 3d43a5088154..6dcc23829bef 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
>  	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	INIT_LIST_HEAD(&ctx->lv_chain);
>  	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
>  	return ctx;
>  }
> @@ -237,6 +238,7 @@ xlog_cil_alloc_shadow_bufs(
>  			lv = kmem_alloc_large(buf_size, KM_NOFS);
>  			memset(lv, 0, xlog_cil_iovec_space(niovecs));
>  
> +			INIT_LIST_HEAD(&lv->lv_chain);
>  			lv->lv_item = lip;
>  			lv->lv_size = buf_size;
>  			if (ordered)
> @@ -252,7 +254,6 @@ xlog_cil_alloc_shadow_bufs(
>  			else
>  				lv->lv_buf_len = 0;
>  			lv->lv_bytes = 0;
> -			lv->lv_next = NULL;
>  		}
>  
>  		/* Ensure the lv is set up according to ->iop_size */
> @@ -379,8 +380,6 @@ xlog_cil_insert_format_items(
>  		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
>  			/* same or smaller, optimise common overwrite case */
>  			lv = lip->li_lv;
> -			lv->lv_next = NULL;

What /did/ these null assignments do?

> -
>  			if (ordered)
>  				goto insert;
>  
> @@ -547,14 +546,14 @@ xlog_cil_insert_items(
>  
>  static void
>  xlog_cil_free_logvec(
> -	struct xfs_log_vec	*log_vector)
> +	struct list_head	*lv_chain)
>  {
>  	struct xfs_log_vec	*lv;
>  
> -	for (lv = log_vector; lv; ) {
> -		struct xfs_log_vec *next = lv->lv_next;
> +	while(!list_empty(lv_chain)) {

Nit: space after "while".

> +		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_chain);
> +		list_del_init(&lv->lv_chain);
>  		kmem_free(lv);
> -		lv = next;
>  	}
>  }
>  
> @@ -653,7 +652,7 @@ xlog_cil_committed(
>  		spin_unlock(&ctx->cil->xc_push_lock);
>  	}
>  
> -	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
> +	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain,
>  					ctx->start_lsn, abort);
>  
>  	xfs_extent_busy_sort(&ctx->busy_extents);
> @@ -664,7 +663,7 @@ xlog_cil_committed(
>  	list_del(&ctx->committing);
>  	spin_unlock(&ctx->cil->xc_push_lock);
>  
> -	xlog_cil_free_logvec(ctx->lv_chain);
> +	xlog_cil_free_logvec(&ctx->lv_chain);
>  
>  	if (!list_empty(&ctx->busy_extents))
>  		xlog_discard_busy_extents(mp, ctx);
> @@ -744,7 +743,7 @@ xlog_cil_build_trans_hdr(
>  	lvhdr->lv_niovecs = 2;
>  	lvhdr->lv_iovecp = &hdr->lhdr[0];
>  	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> -	lvhdr->lv_next = ctx->lv_chain;
> +	list_add(&lvhdr->lv_chain, &ctx->lv_chain);
>  
>  	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
> @@ -755,12 +754,14 @@ xlog_cil_order_cmp(
>  	struct list_head	*a,
>  	struct list_head	*b)
>  {
> -	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
> -	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
> +	struct xfs_log_vec	*l1 = container_of(a, struct xfs_log_vec,
> +							lv_chain);
> +	struct xfs_log_vec	*l2 = container_of(b, struct xfs_log_vec,
> +							lv_chain);
>  
> -	if (l1->li_order_id > l2->li_order_id)
> +	if (l1->lv_order_id > l2->lv_order_id)
>  		return 1;
> -	if (l1->li_order_id < l2->li_order_id)
> +	if (l1->lv_order_id < l2->lv_order_id)
>  		return -1;
>  	return 0;
>  }
> @@ -907,26 +908,25 @@ xlog_cil_push_work(
>  	 * needed on the transaction commit side which is currently locked out
>  	 * by the flush lock.
>  	 */
> -	list_sort(NULL, &log_items, xlog_cil_order_cmp);
>  	lv = NULL;
>  	while (!list_empty(&log_items)) {
>  		struct xfs_log_item	*item;
>  
>  		item = list_first_entry(&log_items,
>  					struct xfs_log_item, li_cil);
> -		list_del_init(&item->li_cil);
> -		item->li_order_id = 0;
> -		if (!ctx->lv_chain)
> -			ctx->lv_chain = item->li_lv;
> -		else
> -			lv->lv_next = item->li_lv;
> +
>  		lv = item->li_lv;
> -		item->li_lv = NULL;
> +		lv->lv_order_id = item->li_order_id;
>  		num_iovecs += lv->lv_niovecs;
> -
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
>  			num_bytes += lv->lv_bytes;
> +		list_add_tail(&lv->lv_chain, &ctx->lv_chain);
> +
> +		list_del_init(&item->li_cil);

Do the list manipulations need moving, or could they have stayed further
up in the loop body for a cleaner patch?

> +		item->li_order_id = 0;
> +		item->li_lv = NULL;
> +
>  	}
>  
>  	/*
> @@ -959,6 +959,13 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
>  
> +	/*
> +	 * Sort the log vector chain before we add the transaction headers.
> +	 * This ensures we always have the transaction headers at the start
> +	 * of the chain.
> +	 */
> +	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
> +
>  	/*
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
> @@ -981,8 +988,14 @@ xlog_cil_push_work(
>  	 * use the commit record lsn then we can move the tail beyond the grant
>  	 * write head.
>  	 */
> -	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS, num_bytes);
> +	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
> +				NULL, XLOG_START_TRANS, num_bytes);
> +
> +	/*
> +	 * Take the lvhdr back off the lv_chain as it should not be passed
> +	 * to log IO completion.
> +	 */
> +	list_del(&lvhdr.lv_chain);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 12a1a36eef7e..6a4160200417 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -224,7 +224,7 @@ struct xfs_cil_ctx {
>  	int			nvecs;		/* number of regions */
>  	atomic_t		space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
> -	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
> +	struct list_head	lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
> @@ -480,7 +480,7 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
>  
>  void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
> -int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
> +int	xlog_write(struct xlog *log, struct list_head *lv_chain,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
>  		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 83c2b7f22eb7..b20e68279808 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -747,7 +747,7 @@ xfs_log_item_batch_insert(
>  void
>  xfs_trans_committed_bulk(
>  	struct xfs_ail		*ailp,
> -	struct xfs_log_vec	*log_vector,
> +	struct list_head	*lv_chain,
>  	xfs_lsn_t		commit_lsn,
>  	bool			aborted)
>  {
> @@ -762,7 +762,7 @@ xfs_trans_committed_bulk(
>  	spin_unlock(&ailp->ail_lock);
>  
>  	/* unpin all the log items */
> -	for (lv = log_vector; lv; lv = lv->lv_next ) {
> +	list_for_each_entry(lv, lv_chain, lv_chain) {
>  		struct xfs_log_item	*lip = lv->lv_item;
>  		xfs_lsn_t		item_lsn;
>  
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 3004aeac9110..b0bf78e6ff76 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -18,8 +18,8 @@ void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
>  void	xfs_trans_del_item(struct xfs_log_item *);
>  void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
>  
> -void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
> -				xfs_lsn_t commit_lsn, bool aborted);
> +void	xfs_trans_committed_bulk(struct xfs_ail *ailp,
> +		struct list_head *lv_chain, xfs_lsn_t commit_lsn, bool aborted);
>  /*
>   * AIL traversal cursor.
>   *
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive
  2021-03-05  5:11 ` [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Dave Chinner
@ 2021-03-11  1:36   ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  1:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:40PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
>  - 21.92% __xfs_trans_commit
>      - 21.62% xfs_log_commit_cil
> 	- 11.69% xfs_trans_unreserve_and_mod_sb
> 	   - 11.58% __percpu_counter_compare
> 	      - 11.45% __percpu_counter_sum
> 		 - 10.29% _raw_spin_lock_irqsave
> 		    - 10.28% do_raw_spin_lock
> 			 __pv_queued_spin_lock_slowpath
> 
> We debated just getting rid of it last time this came up and
> there was no real objection to removing it. Now it's the biggest
> scalability limitation for debug kernels even on smallish machines,
> so let's just get rid of it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

...unless you want a CONFIG_XFS_DEBUG_SLOW to hide these things behind?

--D

> ---
>  fs/xfs/xfs_trans.c | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index b20e68279808..637d084c8aa8 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -616,19 +616,12 @@ xfs_trans_unreserve_and_mod_sb(
>  		ASSERT(!error);
>  	}
>  
> -	if (idelta) {
> +	if (idelta)
>  		percpu_counter_add_batch(&mp->m_icount, idelta,
>  					 XFS_ICOUNT_BATCH);
> -		if (idelta < 0)
> -			ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
> -							XFS_ICOUNT_BATCH) >= 0);
> -	}
>  
> -	if (ifreedelta) {
> +	if (ifreedelta)
>  		percpu_counter_add(&mp->m_ifree, ifreedelta);
> -		if (ifreedelta < 0)
> -			ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
> -	}
>  
>  	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
>  		return;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 43/45] xfs: avoid cil push lock if possible
  2021-03-05  5:11 ` [PATCH 43/45] xfs: avoid cil push lock if possible Dave Chinner
@ 2021-03-11  1:47   ` Darrick J. Wong
  2021-03-12  2:36     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  1:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:41PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because now it hurts when the CIL fills up.
> 
>   - 37.20% __xfs_trans_commit
>       - 35.84% xfs_log_commit_cil
>          - 19.34% _raw_spin_lock
>             - do_raw_spin_lock
>                  19.01% __pv_queued_spin_lock_slowpath
>          - 4.20% xfs_log_ticket_ungrant
>               0.90% xfs_log_space_wake
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 6dcc23829bef..d60c72ad391a 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1115,10 +1115,18 @@ xlog_cil_push_background(
>  	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  
>  	/*
> -	 * Don't do a background push if we haven't used up all the
> -	 * space available yet.
> +	 * We are done if:
> +	 * - we haven't used up all the space available yet; or
> +	 * - we've already queued up a push; and
> +	 * - we're not over the hard limit; and
> +	 * - nothing has been over the hard limit.

Er... do these last three bullet points correspond to the last three
lines of the if test?  I'm not sure how !waitqueue_active() determines
that nothing has been over the hard limit?  Or for that matter how
comparing push_seq against current_seq tells us if we've queued a
push?

--D

> +	 *
> +	 * If so, we don't need to take the push lock as there's nothing to do.
>  	 */
> -	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log) ||
> +	    (cil->xc_push_seq == cil->xc_current_sequence &&
> +	     space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) &&
> +	     !waitqueue_active(&cil->xc_push_wait))) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space
  2021-03-05  5:11 ` [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
@ 2021-03-11  2:00   ` Darrick J. Wong
  2021-03-16  3:04     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  2:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When xlog_sync() rounds off the tail of the iclog that is being
> flushed, it manually subtracts that space from the grant heads. This
> space is actually reserved by the transaction ticket that covers
> the xlog_sync() call from xlog_write(), but we don't plumb the
> ticket down far enough for it to account for the space consumed in
> the current log ticket.
> 
> The grant heads are hot, so we really should be accounting this to
> the ticket if we can, rather than adding thousands of extra grant
> head updates every CIL commit.
> 
> Interestingly, this actually indicates a potential log space overrun
> can occur when we force the log. By the time that xfs_log_force()
> pushes out an active iclog and consumes the roundoff space, the

Ok I was wondering about that when I was trying to figure out what all
this ticket space stealing code was doing.

So in addition to fixing the theoretical overrun, I guess the
performance fix here is that every time we write an iclog we might have
to move the grant heads forward so that we always write a full log
sector / log stripe unit?  And since a CIL context might write a lot of
iclogs, it's cheaper to make those grant adjustments to the CIL ticket
(which already asked for enough space to handle the roundoffs) since the
ticket only jumps in the hot path once when the ticket is ungranted?

If I got that right,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> reservation for that roundoff space has been returned to the grant
> heads and is no longer covered by a reservation. In theory the
> roundoff added by a log force on an already full log could push the
> write head past the tail. In practice, the CIL commit that writes to
> the log and needs the iclog pushed will have reserved space for
> roundoff, so when it releases the ticket there will still be
> physical space for the roundoff to be committed to the log, even
> though it is no longer reserved. This roundoff won't be enough space
> to allow a transaction to be woken if the log is full, so overruns
> should not actually occur in practice.
> 
> That said, it indicates that we should not release the CIL context
> log ticket until after we've released the commit iclog. It also
> means that xlog_sync() still needs the direct grant head
> manipulation if we don't provide it with a ticket. Log forces are
> rare when we are in fast paths running 1.5 million transactions/s
> that make the grant heads hot, so let's optimise the hot case and
> pass CIL log tickets down to the xlog_sync() code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 39 +++++++++++++++++++++++++--------------
>  fs/xfs/xfs_log_cil.c  | 19 ++++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  3 ++-
>  3 files changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fd58c3213ebf..1c7d522b12cd 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -55,7 +55,8 @@ xlog_grant_push_ail(
>  STATIC void
>  xlog_sync(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog);
> +	struct xlog_in_core	*iclog,
> +	struct xlog_ticket	*ticket);
>  #if defined(DEBUG)
>  STATIC void
>  xlog_verify_dest_ptr(
> @@ -535,7 +536,8 @@ __xlog_state_release_iclog(
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> +	struct xlog_in_core	*iclog,
> +	struct xlog_ticket	*ticket)
>  {
>  	lockdep_assert_held(&log->l_icloglock);
>  
> @@ -545,7 +547,7 @@ xlog_state_release_iclog(
>  	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
>  	    __xlog_state_release_iclog(log, iclog)) {
>  		spin_unlock(&log->l_icloglock);
> -		xlog_sync(log, iclog);
> +		xlog_sync(log, iclog, ticket);
>  		spin_lock(&log->l_icloglock);
>  	}
>  
> @@ -898,7 +900,7 @@ xlog_unmount_write(
>  	else
>  		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
>  		       iclog->ic_state == XLOG_STATE_IOERROR);
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, tic);
>  	xlog_wait_on_iclog(iclog);
>  
>  	if (tic) {
> @@ -1930,7 +1932,8 @@ xlog_calc_iclog_size(
>  STATIC void
>  xlog_sync(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> +	struct xlog_in_core	*iclog,
> +	struct xlog_ticket	*ticket)
>  {
>  	unsigned int		count;		/* byte count of bwrite */
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
> @@ -1941,12 +1944,20 @@ xlog_sync(
>  
>  	count = xlog_calc_iclog_size(log, iclog, &roundoff);
>  
> -	/* move grant heads by roundoff in sync */
> -	xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
> -	xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
> +	/*
> +	 * If we have a ticket, account for the roundoff via the ticket
> +	 * reservation to avoid touching the hot grant heads needlessly.
> +	 * Otherwise, we have to move grant heads directly.
> +	 */
> +	if (ticket) {
> +		ticket->t_curr_res -= roundoff;
> +	} else {
> +		xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
> +		xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
> +	}
>  
>  	/* put cycle number in every block */
> -	xlog_pack_data(log, iclog, roundoff); 
> +	xlog_pack_data(log, iclog, roundoff);
>  
>  	/* real byte length */
>  	size = iclog->ic_offset;
> @@ -2187,7 +2198,7 @@ xlog_write_get_more_iclog_space(
>  	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
>  	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
>  	       iclog->ic_state == XLOG_STATE_IOERROR);
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, ticket);
>  	spin_unlock(&log->l_icloglock);
>  	if (error)
>  		return error;
> @@ -2470,7 +2481,7 @@ xlog_write(
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
> -		error = xlog_state_release_iclog(log, iclog);
> +		error = xlog_state_release_iclog(log, iclog, ticket);
>  	}
>  	spin_unlock(&log->l_icloglock);
>  
> @@ -2929,7 +2940,7 @@ xlog_state_get_iclog_space(
>  		 * reference to the iclog.
>  		 */
>  		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
> -			error = xlog_state_release_iclog(log, iclog);
> +			error = xlog_state_release_iclog(log, iclog, ticket);
>  		spin_unlock(&log->l_icloglock);
>  		if (error)
>  			return error;
> @@ -3157,7 +3168,7 @@ xfs_log_force(
>  			atomic_inc(&iclog->ic_refcnt);
>  			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  			xlog_state_switch_iclogs(log, iclog, 0);
> -			if (xlog_state_release_iclog(log, iclog))
> +			if (xlog_state_release_iclog(log, iclog, NULL))
>  				goto out_error;
>  
>  			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> @@ -3250,7 +3261,7 @@ xlog_force_lsn(
>  		}
>  		atomic_inc(&iclog->ic_refcnt);
>  		xlog_state_switch_iclogs(log, iclog, 0);
> -		if (xlog_state_release_iclog(log, iclog))
> +		if (xlog_state_release_iclog(log, iclog, NULL))
>  			goto out_error;
>  		if (log_flushed)
>  			*log_flushed = 1;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index d60c72ad391a..aef60f19ab05 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -804,6 +804,7 @@ xlog_cil_push_work(
>  	int			cpu;
>  	struct xlog_cil_pcp	*cilpcp;
>  	LIST_HEAD		(log_items);
> +	struct xlog_ticket	*ticket;
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1037,12 +1038,10 @@ xlog_cil_push_work(
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> -	xfs_log_ticket_ungrant(log, ctx->ticket);
> -
>  	spin_lock(&commit_iclog->ic_callback_lock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
>  		spin_unlock(&commit_iclog->ic_callback_lock);
> -		goto out_abort;
> +		goto out_abort_free_ticket;
>  	}
>  	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
>  		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
> @@ -1073,12 +1072,23 @@ xlog_cil_push_work(
>  		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}
>  
> +	/*
> +	 * Pull the ticket off the ctx so we can ungrant it after releasing the
> +	 * commit_iclog. The ctx may be freed by the time we return from
> +	 * releasing the commit_iclog (i.e. checkpoint has been completed and
> +	 * callback run) so we can't reference the ctx after the call to
> +	 * xlog_state_release_iclog().
> +	 */
> +	ticket = ctx->ticket;
> +
>  	/* release the hounds! */
>  	spin_lock(&log->l_icloglock);
>  	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
>  		xlog_state_switch_iclogs(log, commit_iclog, 0);
> -	xlog_state_release_iclog(log, commit_iclog);
> +	xlog_state_release_iclog(log, commit_iclog, ticket);
>  	spin_unlock(&log->l_icloglock);
> +
> +	xfs_log_ticket_ungrant(log, ticket);
>  	return;
>  
>  out_skip:
> @@ -1089,7 +1099,6 @@ xlog_cil_push_work(
>  
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, ctx->ticket);
> -out_abort:
>  	ASSERT(XLOG_FORCED_SHUTDOWN(log));
>  	xlog_cil_committed(ctx);
>  }
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 6a4160200417..3d43d3940757 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -487,7 +487,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
>  		int eventual_size);
> -int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
> +int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog,
> +		struct xlog_ticket *ticket);
>  
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 45/45] xfs: expanding delayed logging design with background material
  2021-03-05  5:11 ` [PATCH 45/45] xfs: expanding delayed logging design with background material Dave Chinner
@ 2021-03-11  2:30   ` Darrick J. Wong
  2021-03-16  3:28     ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  2:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:43PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I wrote up a description of how transactions, space reservations and
> relogging work together in response to a question for background
> material on the delayed logging design. Add this to the existing
> document for ease of future reference.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  .../xfs-delayed-logging-design.rst            | 362 ++++++++++++++++--
>  1 file changed, 323 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
> index 464405d2801e..e02235911ff3 100644
> --- a/Documentation/filesystems/xfs-delayed-logging-design.rst
> +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
> @@ -1,29 +1,315 @@
>  .. SPDX-License-Identifier: GPL-2.0
>  
> -==========================
> -XFS Delayed Logging Design
> -==========================
> -
> -Introduction to Re-logging in XFS
> -=================================
> -
> -XFS logging is a combination of logical and physical logging. Some objects,
> -such as inodes and dquots, are logged in logical format where the details
> -logged are made up of the changes to in-core structures rather than on-disk
> -structures. Other objects - typically buffers - have their physical changes
> -logged. The reason for these differences is to reduce the amount of log space
> -required for objects that are frequently logged. Some parts of inodes are more
> -frequently logged than others, and inodes are typically more frequently logged
> -than any other object (except maybe the superblock buffer) so keeping the
> -amount of metadata logged low is of prime importance.
> -
> -The reason that this is such a concern is that XFS allows multiple separate
> -modifications to a single object to be carried in the log at any given time.
> -This allows the log to avoid needing to flush each change to disk before
> -recording a new change to the object. XFS does this via a method called
> -"re-logging". Conceptually, this is quite simple - all it requires is that any
> -new change to the object is recorded with a *new copy* of all the existing
> -changes in the new transaction that is written to the log.
> +==================
> +XFS Logging Design
> +==================
> +
> +Preamble
> +========
> +
> +This document describes the design and algorithms that the XFS journalling
> +subsystem is based on. While originally focussed on just the design of
> +the delayed logging extension introduced in 2010, it assumed the reader already
> +had a fair amount of in-depth knowledge about how XFS transactions are formed
> +and executed. It also largely omitted any details of how journal space
> +reservations are accounted for to ensure the operation of the logging subsystem
> +is guaranteed to be deadlock-free.
> +
> +Much of the original document is retained unmodified because it is still valid
> +and correct. It also allows the new background material to avoid long
> +descriptions for various algorithms because they are already well documented
> +in the original document (e.g. what "relogging" is and why it is needed).

I wonder how useful is this synopsis of this document's history?
When I'm reading a piece of documentation I really only care about
what's in the document I'm reading now, not what it looked like in 2010.
I think you could shorten the preamble by deleting everything between
the second sentence and the last paragraph:

"This document describes the design and algorithms that the XFS
journalling subsystem is based on.  Readers may wish to familiarize
themselves with the general concepts of how transaction processing
works.  We begin with an overview of transactions in XFS, followed by
the way transaction reservations..."

> +Hence we start with an overview of transactions, followed by the way
> +transaction reservations are structured and accounted,
> +and then move into how we guarantee forwards progress for long running
> +transactions with finite initial reservation bounds. At this point we need
> +to explain how relogging works, and that is where the original document
> +started.
> +
> +Introduction
> +============
> +
> +XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
> +are atomic and recoverable. For reasons of space and time efficiency, the
> +logging mechanisms are varied and complex, combining intents, logical and
> +physical logging mechanisms to provide the necessary recovery guarantees the
> +filesystem requires.
> +
> +Some objects, such as inodes and dquots, are logged in logical format where the
> +details logged are made up of the changes to in-core structures rather than
> +on-disk structures. Other objects - typically buffers - have their physical
> +changes logged. And long running atomic modifications have individual changes

No need for the 'And' at the start of the sentence.

> +chained together by intents, ensuring that journal recovery can restart and
> +finish an operation that was only partially done when the system stopped
> +functioning.
> +
> +The reason for these differences is to keep the amount of log space and CPU time
> +required to process objects being modified as small as possible and hence the
> +logging overhead as low as possible. Some items are very frequently modified,
> +and some parts of objects are more frequently modified than others, so so

Double "so".

> +keeping the overhead of metadata logging low is of prime importance.
> +
> +The method used to log an item or chain modifications together isn't
> +particularly important in the scope of this document. It suffices to know that
> +the method used for logging a particular object or chaining modifications
> +together are different and are dependent on the object and/or modification being
> +performed. The logging subsystem only cares that certain specific rules are
> +followed to guarantee forwards progress and prevent deadlocks.

(Aww, maybe we /should/ document how we choose between intents, logical
updates, and physical updates.  But that can come in a later patch.)

> +
> +
> +Transactions in XFS
> +===================
> +
> +XFS has two types of high level transactions, defined by the type of log space
> +reservation they take. These are known as "one shot" and "permanent"

(Ugh, I wish we could call them 'compound' or 'rolling'
transactions....)

> +transactions.

It might be a good idea to introduce the idea that a permanent
transaction has the ability to reserve space across commit boundaries.

> +Permanent transaction reservations can be used for one-shot
> +transactions, but one-shot reservations cannot be used for permanent
> +transactions. Reservations must be matched to the modification taking place.

"The size and type of reservation must be matched..."

> +
> +In the code, a one-shot transaction pattern looks somewhat like this::
> +
> +        tp = xfs_trans_alloc(<reservation>)
> +	<lock items>
> +        <do modification>

Indenting is not consistent here.

> +        xfs_trans_commit(tp);
> +
> +As items are modified in the transaction, the dirty regions in those items are

...are what?  I think this sentence got truncated.

> +Once the transaction is committed, all resources joined to it are released,
> +along with the remaining unused reservation space that was taken at the
> +transaction allocation time.
> +
> +In contrast, a permanent transaction is made up of multiple linked individual
> +transactions, and the pattern looks like this::
> +
> +	tp = xfs_trans_alloc(<reservation>)
> +	xfs_ilock(ip, XFS_ILOCK_EXCL)
> +
> +	loop {
> +		xfs_trans_ijoin(tp, 0);
> +		<do modification>
> +		xfs_trans_log_inode(tp, ip);
> +		xfs_trans_roll(&tp);
> +	}
> +
> +	xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +While this might look similar to a one-shot transaction, there is an important
> +difference: xfs_trans_roll() performs a specific operation that links two
> +transactions together::
> +
> +	ntp = xfs_trans_dup(tp);
> +	xfs_trans_commit(tp);
> +	xfs_log_reserve(ntp);
> +
> +This results in a series of "rolling transactions" where the inode is locked
> +across the entire chain of transactions.  Hence while this series of rolling
> +transactions is running, nothing else can read from or write to the inode and
> +this provides a mechanism for complex changes to appear atomic from an external
> +observer's point of view.
> +
> +It is important to note that a series of rolling transactions in a permanent
> +transaction does not form an atomic change in the journal. While each
> +individual modification is atomic, the chain is *not atomic*. If we crash half
> +way through, then recovery will only replay up to the last transactional
> +modification the loop made that was committed to the journal.

Can you add:

"Log intent items are used to track the progress of and to restart
metadata updates that span multiple transactions."

Or maybe just hoist the third paragraph of the next section into a
separate section introducing log intents?

> +
> +
> +Transactions are Asynchronous
> +=============================
> +
> +In XFS, all high level transactions are asynchronous by default. This means that
> +xfs_trans_commit() does not guarantee that the modification has been committed
> +to stable storage when it returns. Hence when a system crashes, not all the
> +completed transactions will be replayed during recovery.
> +
> +However, the logging subsystem does provide global ordering guarantees, such
> +that if a specific change is seen after recovery, all metadata modifications
> +that were committed prior to that change will also be seen.
> +
> +This affects long running permanent transactions in that it is not possible to
> +predict how much of a long running operation will actually be recovered because
> +there is no guarantee of how much of the operation reached stable storage. Hence
> +if a long running operation requires multiple transactions to fully complete,
> +the high level operation must use intents and defered operations to guarantee

"deferred"

> +recovery can complete the operation once the first modification reached the
> +journal.

"...once the first transaction is persisted in the on-disk journal."

> +
> +For single shot operations that need to reach stable storage immediately, or
> +to ensure that a long running permanent transaction is fully committed once it is
> +complete, we can explicitly tag a transaction as synchronous. This will trigger
> +a "log force" to flush the outstanding committed transactions to stable storage
> +in the journal and wait for that to complete.
> +
> +Synchronous transactions are rarely used, however, because they limit logging
> +throughput to the IO latency limitations of the underlying storage. Instead, we
> +tend to use log forces to ensure modifications are on stable storage only when
> +a user operation requires a synchronisation point to occur (e.g. fsync).
> +
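
Might be worth showing the knob at this point in the text.  Something
like this, I think (sketch from memory, so double-check the helper
name):

	/*
	 * One-shot modification that must be on stable storage before
	 * we return: the sync flag makes xfs_trans_commit() issue a
	 * log force and wait for it to complete.
	 */
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);
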
> +
> +Transaction Reservations
> +========================
> +
> +It has been mentioned a number of times now that the logging subsystem needs to
> +provide a forwards progress guarantee so that no modification ever stalls
> +because it can't be written to the journal due to a lack of space in the
> +journal. This is acheived by the transaction reservations that are made when

"achieved"

> +a transaction is first allocated. For permanent transactions, these reservations
> +are maintained as part of the transaction rolling mechanism.
> +
> +A transaction reservation provides a guarantee that there is physical log space
> +available to write the modification into the journal before we start making
> +modifications to objects and items. As such, the reservation needs to be large
> +enough to take into account the amount of metadata that the change might need to
> +log in the worst case. This means that if we are modifying a btree in the
> +transaction, we have to reserve enough space to record a full leaf-to-root split
> +of the btree. As such, the reservations are quite complex because we have to
> +take into account all the hidden changes that might occur.
> +
> +For example, a user data extent allocation involves allocating an extent from
> +free space, which modifies the free space trees. That's two btrees. Then
> +inserting the extent into the inode extent map requires modify another btree,
> +which might require mor allocation that modifies the free space btrees again.

"Inserting the extent into the inode's extent map might require a split
of the extent map btree, which requires another allocation that can
modify the free space btrees again."

> +Then we might have to update reverse mappings, which modifies yet another btree
> +which might require more space. And so on.  Hence the amount of metadata that a
> +"simple" operation can modify can be quite large.
> +
> +This "worst case" calculation provides us with the static "unit reservation"
> +for the transaction that is calculated at mount time. We must guarantee that the
> +log has this much space available before the transaction is allowed to proceed
> +so that when we come to write the dirty metadata into the log we don't run out
> +of log space half way through the write.
> +
> +For one-shot transactions, a single unit space reservation is all that is
> +required for the transaction to proceed. For permanent transactions, however, we
> +also have a "log count" that affects the size of the reservation that is to be
> +made.
> +
> +While a permanent transaction can get by with a single unit of space
> +reservation, it is somewhat inefficient to do this as it requires the
> +transaction rolling mechanism to re-reserve space on every transaction roll. We
> +know from the implementation of the permanent transactions how many transaction
> +rolls are likely for the common modifications that need to be made.
> +
> +For example, an inode allocation is typically two transactions - one to
> +physically allocate a free inode chunk on disk, and another to allocate an inode
> +from an inode chunk that has free inodes in it.  Hence for an inode allocation
> +transaction, we might set the reservation log count to a value of 2 to indicate
> +that the common/fast path transaction will commit two linked transactions in a
> +chain. Each time a permanent transaction rolls, it consumes an entire unit
> +reservation.
> +
> +Hence when the permanent transaction is first allocated, the log space
> +reservation is increased from a single unit reservation to multiple unit
> +reservations. That multiple is defined by the reservation log count, and this
> +means we can roll the transaction multiple times before we have to re-reserve
> +log space when we roll the transaction. This ensures that the common
> +modifications we make only need to reserve log space once.
> +
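
A concrete example might help readers here.  Something like this
(illustrative only -- the real logres/logcount values come from the
mount-time reservation calculations):

	/*
	 * Permanent reservation: logcount unit reservations are taken
	 * up front, so the chain can roll logcount times before it
	 * has to re-reserve log space.
	 */
	struct xfs_trans_res tr_example = {
		.tr_logres	= unit_bytes,	/* worst case for one roll */
		.tr_logcount	= 2,		/* expected rolls in chain */
		.tr_logflags	= XFS_TRANS_PERM_LOG_RES,
	};

	/* initial grant at allocation = tr_logres * tr_logcount bytes */
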
> +If the log count for a permanent transation reaches zero, then it needs to

"transaction"

> +re-reserve physical space in the log. This is somewhat complex, and requires
> +an understanding of how the log accounts for space that has been reserved.
> +
> +
> +Log Space Accounting
> +====================
> +
> +The position in the log is typically referred to as a Log Sequence Number (LSN).
> +The log is circular, so the positions in the log are defined by the combination
> +of a cycle number - the number of times the log has been overwritten - and the
> +offset into the log.  A LSN carries the cycle in the upper 32 bits and the
> +offset in the lower 32 bits. The offset is in units of "basic blocks" (512
> +bytes). Hence we can do relatively simple LSN based math to keep track of
> +available space in the log.
> +
> +Log space accounting is done via a pair of constructs called "grant heads".  The
> +position of the grant heads is an absolute value, so the amount of space
> +available in the log is defined by the distance between the position of the
> +grant head and the current log tail. That is, how much space can be
> +reserved/consumed before the grant heads would fully wrap the log and overtake
> +the tail position.
> +
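
It might be worth pointing readers at the actual encoding here.  From
memory (so double-check the names), the helpers and the space
calculation look roughly like:

	/* LSN encoding: {cycle, basic block offset} packed in 64 bits */
	#define CYCLE_LSN(lsn)	((uint)((lsn) >> 32))
	#define BLOCK_LSN(lsn)	((uint)(lsn))

	/*
	 * Simplified free space calculation: the distance from the
	 * head (or grant) position back around to the tail.
	 */
	if (head_cycle == tail_cycle)		/* head has not wrapped */
		free_blocks = log_blocks - (head_block - tail_block);
	else					/* head one cycle ahead */
		free_blocks = tail_block - head_block;
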
> +The first grant head is the "reserve" head. This tracks the byte count of the
> +reservations currently held by active transactions. It is a purely in-memory
> +accounting of the space reservation and, as such, actually tracks byte offsets
> +into the log rather than basic blocks. Hence it technically isn't using LSNs to
> +represent the log position, but it is still treated like a split {cycle,offset}
> +tuple for the purposes of tracking reservation space.

Lol, the grant head is delalloc for transactions.

> +
> +The reserve grant head is used to accurately account for exact transaction
> +reservation amounts and the exact byte count that modifications actually make
> +and need to write into the log. The reserve head is used to prevent new
> +transactions from taking new reservations when the head reaches the current
> +tail. It will block new reservations in a FIFO queue and as the log tail moves
> +forward it will wake them in order once sufficient space is available. This FIFO
> +mechanism ensures no transaction is starved of resources when log space
> +shortages occur.
> +
> +The other grant head is the "write" head. Unlike the reserve head, this grant
> +head contains an LSN and it tracks the physical space usage in the log. While
> +this might sound like it is accounting the same state as the reserve grant head
> +- and it mostly does track exactly the same location as the reserve grant head -
> +there are critical differences in behaviour between them that provide the
> +forwards progress guarantees that rolling permanent transactions require.
> +
> +These differences matter when a permanent transaction is rolled and the internal
> +"log count" reaches zero and the initial set of unit reservations have been
> +exhausted. At this point, we still require a log space reservation to continue
> +the next transaction in the sequence, but we have none remaining. We cannot
> +sleep during the transaction commit process waiting for new log space to become
> +available, as we may end up on the end of the FIFO queue and the items we have
> +locked while we sleep could end up pinning the tail of the log before there is
> +enough free space in the log to fulfil all of the pending reservations and
> +then wake up the transaction commit in progress.
> +
> +To take a new reservation without sleeping requires us to be able to take a
> +reservation even if there is no reservation space currently available. That is,
> +we need to be able to *overcommit* the log reservation space. As has already
> +been detailed, we cannot overcommit physical log space. However, the reserve
> +grant head does not track physical space - it only accounts for the amount of
> +reservations we currently have outstanding. Hence if the reserve head passes
> +over the tail of the log all it means is that new reservations will be throttled
> +immediately and remain throttled until the log tail is moved forward far enough
> +to remove the overcommit and start taking new reservations. In other words, we
> +can overcommit the reserve head without violating the physical log head and tail
> +rules.

I hadn't fully realized this.  Good information. :)

> +As a result, permanent transactions only "regrant" reservation space during
> +xfs_trans_commit() calls, while the physical log space reservation - tracked by
> +the write head - is then reserved separately by a call to xfs_log_reserve()
> +after the commit completes. Once the commit completes, we can sleep waiting for
> +physical log space to be reserved from the write grant head, but only if one
> +critical rule has been observed::
> +
> +	Code using permanent reservations must always log the items they hold
> +	locked across each transaction they roll in the chain.
> +
> +"Re-logging" the locked items on every transaction roll ensures that the items
> +the transaction chain is rolling are always relocated to the physical head of
> +the log and so do not pin the tail of the log. If a locked item pins the tail of
> +the log when we sleep on the write reservation, then we will deadlock the log as
> +we cannot take the locks needed to write back that item and move the tail of the
> +log forwards to free up write grant space. Re-logging the locked items avoids
> +this deadlock and guarantees that the log reservation we are making cannot
> +self-deadlock.
> +
> +If all rolling transactions obey this rule, then they can all make forwards
> +progress independently because nothing will block the progress of the log
> +tail moving forwards and hence ensuring that write grant space is always
> +(eventually) made available to permanent transactions no matter how many times
> +they roll.
> +
> +
> +Re-logging Explained
> +====================
> +
> +XFS allows multiple separate modifications to a single object to be carried in
> +the log at any given time.  This allows the log to avoid needing to flush each
> +change to disk before recording a new change to the object. XFS does this via a
> +method called "re-logging". Conceptually, this is quite simple - all it requires
> +is that any new change to the object is recorded with a *new copy* of all the
> +existing changes in the new transaction that is written to the log.
>  
>  That is, if we have a sequence of changes A through to F, and the object was
>  written to disk after change D, we would see in the log the following series
> @@ -42,16 +328,13 @@ transaction::
>  In other words, each time an object is relogged, the new transaction contains
>  the aggregation of all the previous changes currently held only in the log.
>  
> -This relogging technique also allows objects to be moved forward in the log so
> -that an object being relogged does not prevent the tail of the log from ever
> -moving forward.  This can be seen in the table above by the changing
> -(increasing) LSN of each subsequent transaction - the LSN is effectively a
> -direct encoding of the location in the log of the transaction.
> +This relogging technique allows objects to be moved forward in the log so that
> +an object being relogged does not prevent the tail of the log from ever moving
> +forward.  This can be seen in the table above by the changing (increasing) LSN
> +of each subsequent transaction, and it's the technique that allows us to
> +implement long-running, multiple-commit permanent transactions. 
>  
> -This relogging is also used to implement long-running, multiple-commit
> -transactions.  These transaction are known as rolling transactions, and require
> -a special log reservation known as a permanent transaction reservation. A
> -typical example of a rolling transaction is the removal of extents from an
> +A typical example of a rolling transaction is the removal of extents from an
>  inode which can only be done at a rate of two extents per transaction because
>  of reservation size limitations. Hence a rolling extent removal transaction
>  keeps relogging the inode and btree buffers as they get modified in each
> @@ -67,12 +350,13 @@ the log over and over again. Worse is the fact that objects tend to get
>  dirtier as they get relogged, so each subsequent transaction is writing more
>  metadata into the log.
>  
> -Another feature of the XFS transaction subsystem is that most transactions are
> -asynchronous. That is, they don't commit to disk until either a log buffer is
> -filled (a log buffer can hold multiple transactions) or a synchronous operation
> -forces the log buffers holding the transactions to disk. This means that XFS is
> -doing aggregation of transactions in memory - batching them, if you like - to
> -minimise the impact of the log IO on transaction throughput.
> +It should now also be obvious how relogging and asynchronous transactions go
> +hand in hand. That is, transactions don't get written to the physical journal
> +until either a log buffer is filled (a log buffer can hold multiple
> +transactions) or a synchronous operation forces the log buffers holding the
> +transactions to disk. This means that XFS is doing aggregation of transactions
> +in memory - batching them, if you like - to minimise the impact of the log IO on
> +transaction throughput.

...microtransaction fusion, yippee!

--D

>  
>  The limitation on asynchronous transaction throughput is the number and size of
>  log buffers made available by the log manager. By default there are 8 log
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record
  2021-03-09  0:15   ` Darrick J. Wong
@ 2021-03-11  2:54     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  2:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 04:15:23PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:19PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Remove another case where xlog_write() has to prepend an opheader to
> > a log transaction. The unmount record + ophdr is smaller than the
> > minimum amount of space guaranteed to be free in an iclog (2 *
> > sizeof(ophdr)) and so we don't have to care about an unmount record
> > being split across 2 iclogs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_log.c | 35 ++++++++++++++++++++++++-----------
> >  1 file changed, 24 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index b2f9fb1b4fed..94711b9ff007 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -798,12 +798,22 @@ xlog_write_unmount_record(
> >  	struct xlog		*log,
> >  	struct xlog_ticket	*ticket)
> >  {
> > -	struct xfs_unmount_log_format ulf = {
> > -		.magic = XLOG_UNMOUNT_TYPE,
> > +	struct  {
> > +		struct xlog_op_header ophdr;
> > +		struct xfs_unmount_log_format ulf;
> > +	} unmount_rec = {
> 
> I wonder, should we have a BUILD_BUG_ON to confirm sizeof(umount_rec)
> just in case some weird architecture injects padding between these two?
> Prior to this code we formatted the op header and unmount record in
> separate incore buffers and wrote them to disk with no gap, right?

Yup. Easy enough to add.
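
Something like this, I guess (sketch against the struct quoted
above):

	BUILD_BUG_ON(sizeof(unmount_rec) !=
		     sizeof(struct xlog_op_header) +
		     sizeof(struct xfs_unmount_log_format));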

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-09  1:48       ` Darrick J. Wong
@ 2021-03-11  3:01         ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  3:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 05:48:00PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 12:19:56PM +1100, Dave Chinner wrote:
> > On Mon, Mar 08, 2021 at 04:21:34PM -0800, Darrick J. Wong wrote:
> > > On Fri, Mar 05, 2021 at 04:11:21PM +1100, Dave Chinner wrote:
> > > >  static xlog_op_header_t *
> > > >  xlog_write_setup_ophdr(
> > > > -	struct xlog		*log,
> > > >  	struct xlog_op_header	*ophdr,
> > > > -	struct xlog_ticket	*ticket,
> > > > -	uint			flags)
> > > > +	struct xlog_ticket	*ticket)
> > > >  {
> > > >  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > > > -	ophdr->oh_clientid = ticket->t_clientid;
> > > > +	ophdr->oh_clientid = XFS_TRANSACTION;
> > > >  	ophdr->oh_res2 = 0;
> > > > -
> > > > -	/* are we copying a commit or unmount record? */
> > > > -	ophdr->oh_flags = flags;
> > > > -
> > > > -	/*
> > > > -	 * We've seen logs corrupted with bad transaction client ids.  This
> > > > -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> > > > -	 * and shut down the filesystem.
> > > > -	 */
> > > > -	switch (ophdr->oh_clientid)  {
> > > > -	case XFS_TRANSACTION:
> > > > -	case XFS_VOLUME:
> > > 
> > > Reading between the lines, I'm guessing this clientid is some
> > > now-vestigial organ from the Irix days, where there was some kind of
> > > volume manager (in addition to the filesystem + log)?  And between the
> > > three, there was a need to dispatch recovered log ops to the correct
> > > subsystem?
> > 
> > I guess that was the original thought. It was included in the
> > initial commit of the log code to XFS in 1993 and never, ever used
> > in any code anywhere. So it's never been written to an XFS log,
> > ever.
> 
> In that case, can you get rid of the #define too, please?

Done.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-09  2:21   ` Darrick J. Wong
@ 2021-03-11  3:29     ` Dave Chinner
  2021-03-11  3:41       ` Darrick J. Wong
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  3:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Current xlog_write() adds op headers to the log manually for every
> > log item region that is in the vector passed to it. While
> > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > already know its length, flags, clientid, etc at CIL commit time.
> > 
> > This means the only time that xlog write really needs to format and
> > reserve space for a new ophdr is when a region is split across two
> > iclogs. Adding the opheader and accounting for it as part of the
> > normal formatted item region means we simplify the accounting
> > of space used by a transaction and we don't have to special case
> > reserving of space in for the ophdrs in xlog_write(). It also means
> > we can largely initialise the ophdr in transaction commit instead
> > of xlog_write, making the xlog_write formatting inner loop much
> > tighter.
> > 
> > xlog_prepare_iovec() is now too large to stay as an inline function,
> > so we move it out of line and into xfs_log.c.
> > 
> > Object sizes:
> > text	   data	    bss	    dec	    hex	filename
> > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > 
> > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > out of line, even though it grew in size itself.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Sooo... if I understand this part of the patchset correctly, the goal
> here is to simplify and shorten the inner loop of xlog_write.

That's one of the goals. The other goal is to avoid needing to
account for log op headers separately in the high level CIL commit
code.

> Callers
> are now required to create their own log op headers at the start of the
> xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> xlog_write has to create an ophdr is when we fill up the current iclog
> and must continue in a new one, because that's not something the callers
> should ever have to know about.  Correct?

Yes.

> If so,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

> It /really/ would have been nice to have kept these patches separated by
> major functional change area (i.e. separate series) instead of one
> gigantic 45-patch behemoth to intimidate the reviewers...

How is that any different from sending out 6-7 separate dependent
patchsets one immediately after another?  A change to one patch in
one series results in needing to rebase at least one patch in each
of the smaller patchsets, so I've still got to treat them all as one
big patchset in my development trees. Then I have to start
reposting patchsets just because another patchset was changed, and
that gets even more confusing trying to work out what patchset goes
with which version and so on. It's much easier for me to manage them
as a single patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 27/45] xfs: pass lv chain length into xlog_write()
  2021-03-09  2:36   ` Darrick J. Wong
@ 2021-03-11  3:37     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  3:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 06:36:44PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:25PM +1100, Dave Chinner wrote:
> > @@ -2306,7 +2282,6 @@ xlog_write(
> >  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> >  	}
> >  
> > -	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
> >  	if (start_lsn)
> >  		*start_lsn = 0;
> >  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 7a5e6bdb7876..34abc3bae587 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
> >  				sizeof(struct xfs_trans_header);
> >  	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
> >  
> > -	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> > -
> >  	lvhdr->lv_niovecs = 2;
> >  	lvhdr->lv_iovecp = &hdr->lhdr[0];
> > +	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> 
> Er... does this code change belong in an earlier patch?  Or if not, why
> wasn't it important to set lv_bytes here?

It belongs in this patch.

It was not necessary to set lv_bytes before this because
xlog_write_calc_vec_length() walked the entire change calculating
the length of the chain by adding up the region lengths itself.
Because this isn't a dynamically allocated log vector associated
with a log item, we never actually use the lv_bytes field in it for
anything.

In the case of this patch, we need to add the size of the data in
the log vector to our accumulated total that xlog_cil_push_work()
now passes in to xlog_write() to replace the chain walk that
xlog_write_calc_vec_length() did to calculate the length. Hence we
have to pass the accumulated region length back to the caller, and
rather than add another parameter we fill out lv_bytes so that it
matches all the other log vectors in the chain....


> > @@ -893,6 +898,9 @@ xlog_cil_push_work(
> >  	 * transaction header here as it is not accounted for in xlog_write().
> >  	 */
> >  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> > +	num_iovecs += lvhdr.lv_niovecs;
> > +	num_bytes += lvhdr.lv_bytes;
> > +
> >  
> 
> No need to have two blank lines here.

Fixed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-11  3:29     ` Dave Chinner
@ 2021-03-11  3:41       ` Darrick J. Wong
  2021-03-16 14:54         ` Brian Foster
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Mar 12, 2021 at 02:29:32PM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Current xlog_write() adds op headers to the log manually for every
> > > log item region that is in the vector passed to it. While
> > > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > > already know its length, flags, clientid, etc at CIL commit time.
> > > 
> > > This means the only time that xlog write really needs to format and
> > > reserve space for a new ophdr is when a region is split across two
> > > iclogs. Adding the opheader and accounting for it as part of the
> > > normal formatted item region means we simplify the accounting
> > > of space used by a transaction and we don't have to special case
> > > reserving of space in for the ophdrs in xlog_write(). It also means
> > > we can largely initialise the ophdr in transaction commit instead
> > > of xlog_write, making the xlog_write formatting inner loop much
> > > tighter.
> > > 
> > > xlog_prepare_iovec() is now too large to stay as an inline function,
> > > so we move it out of line and into xfs_log.c.
> > > 
> > > Object sizes:
> > > text	   data	    bss	    dec	    hex	filename
> > > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > > 
> > > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > > out of line, even though it grew in size itself.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Sooo... if I understand this part of the patchset correctly, the goal
> > here is to simplify and shorten the inner loop of xlog_write.
> 
> That's one of the goals. The other goal is to avoid needing to
> account for log op headers separately in the high level CIL commit
> code.
> 
> > Callers
> > are now required to create their own log op headers at the start of the
> > xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> > xlog_write has to create an ophdr is when we fill up the current iclog
> > and must continue in a new one, because that's not something the callers
> > should ever have to know about.  Correct?
> 
> Yes.
> 
> > If so,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Thanks!
> 
> > It /really/ would have been nice to have kept these patches separated by
> > major functional change area (i.e. separate series) instead of one
> > gigantic 45-patch behemoth to intimidate the reviewers...
> 
> How is that any different from sending out 6-7 separate dependent
> patchsets one immediately after another?  A change to one patch in
> one series results in needing to rebase at least one patch in each
> of the smaller patchsets, so I've still got to treat them all as one
> big patchset in my development trees. Then I have to start
> reposting patchsets just because another patchset was changed, and
> that gets even more confusing trying to work out what patchset goes
> with which version and so on. It's much easier for me to manage them
> as a single patchset....

Well, ok, but it would have been nice for the cover letter to give
/some/ hint as to what's changing in various subranges, e.g.

"Patches 32-36 reduce the xc_cil_lock critical sections,
 Patches 37-41 create per-cpu cil structures and move log items and
       vectors to use them,
 Patches 42-44 are more cleanups,
 Patch 45 documents the whole mess."

So I could see the outlines of where the 45 patches were going.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 28/45] xfs: introduce xlog_write_single()
  2021-03-09  2:39   ` Darrick J. Wong
@ 2021-03-11  4:19     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  4:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 06:39:27PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:26PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Introduce an optimised version of xlog_write() that is used when the
> > entire write will fit in a single iclog. This greatly simplifies the
> > implementation of writing a log vector chain into an iclog, and sets
> > the ground work for a much more understandable xlog_write()
> > implementation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 56 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 22f97914ab99..590c1e6db475 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2214,6 +2214,52 @@ xlog_write_copy_finish(
> >  	return error;
> >  }
> >  
> > +/*
> > + * Write log vectors into a single iclog which is guaranteed by the caller
> > + * to have enough space to write the entire log vector into. Return the number
> > + * of log vectors written into the iclog.
> > + */
> > +static int
> > +xlog_write_single(
> > +	struct xfs_log_vec	*log_vector,
> > +	struct xlog_ticket	*ticket,
> > +	struct xlog_in_core	*iclog,
> > +	uint32_t		log_offset,
> > +	uint32_t		len)
> > +{
> > +	struct xfs_log_vec	*lv = log_vector;
> > +	void			*ptr;
> > +	int			index = 0;
> > +	int			record_cnt = 0;
> 
> Any reason these (and the return type) can't be unsigned?  I don't think
> negative indices or record counts have any meaning, right?

Correct, but I'm going to ignore that because the next patch already
addresses both of these things. Changing it here just means a
massive reject of the next patch where it replaces the return value
with a log vector, and the record count moves to a function
parameter that is, indeed, a uint32_t.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/45] xfs: introduce xlog_write_partial()
  2021-03-09  2:59   ` Darrick J. Wong
@ 2021-03-11  4:33     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  4:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 06:59:32PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Handle writing of a logvec chain into an iclog that doesn't have
> > enough space to fit it all. The iclog has already been changed to
> > WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> > in the iclog is exclusively owned by this logvec chain.
> > 
> > The difference between the single and partial cases is that
> > we end up with partial iovec writes in the iclog and have to split
> > a log vec regions across two iclogs. The state handling for this is
> > currently awful and so we're building up the pieces needed to
> > handle this more cleanly one at a time.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
> >  1 file changed, 251 insertions(+), 274 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 590c1e6db475..10916b99bf0f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2099,166 +2099,250 @@ xlog_print_trans(
> >  	}
> >  }
> >  
> > -static xlog_op_header_t *
> > -xlog_write_setup_ophdr(
> > -	struct xlog_op_header	*ophdr,
> > -	struct xlog_ticket	*ticket)
> > -{
> > -	ophdr->oh_clientid = XFS_TRANSACTION;
> > -	ophdr->oh_res2 = 0;
> > -	ophdr->oh_flags = 0;
> > -	return ophdr;
> > -}
> > -
> >  /*
> > - * Set up the parameters of the region copy into the log. This has
> > - * to handle region write split across multiple log buffers - this
> > - * state is kept external to this function so that this code can
> > - * be written in an obvious, self documenting manner.
> > + * Write whole log vectors into a single iclog which is guaranteed to have
> > + * either sufficient space for the entire log vector chain to be written or
> > + * exclusive access to the remaining space in the iclog.
> > + *
> > + * Return the number of iovecs and data written into the iclog, as well as
> > + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> > + * end of the chain).
> >   */
> > -static int
> > -xlog_write_setup_copy(
> > +static struct xfs_log_vec *
> > +xlog_write_single(
> 
> Ouch.  Could you fix the previous patch to move this new function a
> little higher in the file (like above xlog_write_setup_ophdr) so that it
> doesn't get shredded like this?

Not possible because xlog_write_setup_ophdr() is removed by this
patch. I can't help it if the diffs are unreadable - I can't really
control what git is doing here...

> > @@ -2930,7 +2906,7 @@ xlog_state_get_iclog_space(
> >  	 * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's
> >  	 * can fit into remaining data section.
> >  	 */
> > -	if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) {
> > +	if (iclog->ic_size - iclog->ic_offset < 3*sizeof(xlog_op_header_t)) {
> 
> Why does this change to 3?  Does the comment need amending?

Ah, that was to do with avoiding the need to split the start
record/transaction header across two iclogs. That was because the
partial copy loop didn't have special handling for start records
and so that log vector had to be wholly handled by the
xlog_write_single() loop to set the iclog flush flags.

However, with all the changes since then that have added explicit
pre-flushes before the start record is formatted and the lifting of
the iclog flush flags to the callers, we've removed all
the special optype handling in xlog_write(). Hence we no longer need
to guarantee the start record is handled by the single path, it now
can be handled by this partial path just fine. So I can revert this
hunk.

Did I mention that this code was full of all sorts of subtle corner
cases? :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 31/45] xfs: CIL context doesn't need to count iovecs
  2021-03-09  3:16   ` Darrick J. Wong
@ 2021-03-11  5:03     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  5:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 08, 2021 at 07:16:04PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:29PM +1100, Dave Chinner wrote:
> > @@ -471,7 +462,6 @@ xlog_cil_insert_items(
> >  	}
> >  	tp->t_ticket->t_curr_res -= len;
> >  	ctx->space_used += len;
> > -	ctx->nvecs += diff_iovecs;
> 
> If the tracking variable isn't necessary any more, should the field go
> away from xfs_cil_ctx?

Yes. I thought I cleaned that up. Fixed.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock
  2021-03-10 23:25   ` Darrick J. Wong
@ 2021-03-11  5:42     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  5:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 03:25:41PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:31PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The xc_cil_lock is the most highly contended lock in XFS now. To
> > start the process of getting rid of it, lift the initial reservation
> > of the CIL log space out from under the xc_cil_lock.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 27 ++++++++++++---------------
> >  1 file changed, 12 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index e6e36488f0c7..50101336a7f4 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -430,23 +430,19 @@ xlog_cil_insert_items(
> >  	 */
> >  	xlog_cil_insert_format_items(log, tp, &len);
> >  
> > -	spin_lock(&cil->xc_cil_lock);
> 
> Hm, so looking ahead, the next few patches keep kicking this spin_lock
> call further and further down in the file, and the commit messages give
> me the impression that this might even go away entirely?
> 
> Let me see, the CIL locks are:
> 
> xc_ctx_lock, which prevents transactions from committing (into the cil)
> any time the CIL itself is preparing a new commited item context so that
> it can xlog_write (to disk) the log vectors associated with the current
> context.

Yes.

> xc_cil_lock, which serializes transactions adding their items to the CIL
> in the first place, hence the motivation to reduce this hot lock?

Right - it protects manipulations to the CIL log item tracking and
tracking state.  This spin lock is the first global serialisation
point in the transaction commit path, so it effectively sees unbound
concurrency (reservations allow hundreds of transactions can be
committing simultaneously).

> xc_push_lock, which I think is used to coordinate the CIL push worker
> with all the upper level callers that want to force log items to disk?

Yes, this one protects the current push state.

> And the locking order of these three locks is...
> 
> xc_ctx_lock --> xc_push_lock
>     |
>     \---------> xc_cil_lock
> 
> Assuming I grokked all that, then I guess moving the spin_lock call
> works out because the test_and_clear_bit is atomic.  The rest of the
> accounting stuff here is just getting moved further down in the file and
> is still protected by xc_cil_lock.

Yes.

> If I understood all that,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 34/45] xfs: rework per-iclog header CIL reservation
  2021-03-11  0:03   ` Darrick J. Wong
@ 2021-03-11  6:03     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  6:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 04:03:38PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:32PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For every iclog that a CIL push will use up, we need to ensure we
> > have space reserved for the iclog header in each iclog. It is
> > extremely difficult to do this accurately with a per-cpu counter
> > without expensive summing of the counter in every commit. However,
> > we know what the maximum CIL size is going to be because of the
> > hard space limit we have, and hence we know exactly how many iclogs
> > we are going to need to write out the CIL.
> > 
> > We are constrained by the requirement that small transactions only
> > have reservation space for a single iclog header built into them.
> > At commit time we don't know how much of the current transaction
> > reservation is made up of iclog header reservations as calculated by
> > xfs_log_calc_unit_res() when the ticket was reserved. As larger
> > reservations have multiple header spaces reserved, we can steal
> > more than one iclog header reservation at a time, but we only steal
> > the exact number needed for the given log vector size delta.
> > 
> > As a result, we don't know exactly when we are going to steal iclog
> > header reservations, nor do we know exactly how many we are going to
> > need for a given CIL.
> > 
> > To make things simple, start by calculating the worst case number of
> > iclog headers a full CIL push will require. Record this into an
> > atomic variable in the CIL. Then add a byte counter to the log
> > ticket that records exactly how much iclog header space has been
> > reserved in this ticket by xfs_log_calc_unit_res(). This tells us
> > exactly how much space we can steal from the ticket at transaction
> > commit time.
> > 
> > Now, at transaction commit time, we can check if the CIL has a full
> > iclog header reservation and, if not, steal the entire reservation
> > the current ticket holds for iclog headers. This minimises the
> > number of times we need to do atomic operations in the fast path,
> > but still guarantees we get all the reservations we need.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_log_rlimit.c |  2 +-
> >  fs/xfs/libxfs/xfs_shared.h     |  3 +-
> >  fs/xfs/xfs_log.c               | 12 +++++---
> >  fs/xfs/xfs_log_cil.c           | 55 ++++++++++++++++++++++++++--------
> >  fs/xfs/xfs_log_priv.h          | 20 +++++++------
> >  5 files changed, 64 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
> > index 7f55eb3f3653..75390134346d 100644
> > --- a/fs/xfs/libxfs/xfs_log_rlimit.c
> > +++ b/fs/xfs/libxfs/xfs_log_rlimit.c
> > @@ -88,7 +88,7 @@ xfs_log_calc_minimum_size(
> >  
> >  	xfs_log_get_max_trans_res(mp, &tres);
> >  
> > -	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres);
> > +	max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres, NULL);
> 
> This is currently the only call site of xfs_log_calc_unit_res, so if a
> subsequent patch doesn't make use of that last argument it should go
> away.  (I don't know yet, I haven't looked...)

Can't remember, I'll have to check.

> > @@ -3418,7 +3422,7 @@ xlog_ticket_alloc(
> >  
> >  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
> >  
> > -	unit_res = xlog_calc_unit_res(log, unit_bytes);
> > +	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);
> 
> Ok, so each transaction ticket now gets to know the maximum number of
> iclog headers that the transaction can consume if we use every last byte
> of the reservation...

yes.

> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 50101336a7f4..f8fb2f59e24c 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -44,9 +44,20 @@ xlog_cil_ticket_alloc(
> >  	 * transaction overhead reservation from the first transaction commit.
> >  	 */
> >  	tic->t_curr_res = 0;
> > +	tic->t_iclog_hdrs = 0;
> >  	return tic;
> >  }
> >  
> > +static inline void
> > +xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
> > +{
> > +	struct xlog	*log = cil->xc_log;
> > +
> > +	atomic_set(&cil->xc_iclog_hdrs,
> > +		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
> > +			(log->l_iclog_size - log->l_iclog_hsize)));
> > +}
> > +
> >  /*
> >   * Unavoidable forward declaration - xlog_cil_push_work() calls
> >   * xlog_cil_ctx_alloc() itself.
> > @@ -70,6 +81,7 @@ xlog_cil_ctx_switch(
> >  	struct xfs_cil		*cil,
> >  	struct xfs_cil_ctx	*ctx)
> >  {
> > +	xlog_cil_set_iclog_hdr_count(cil);
> 
> ...and I guess every time the CIL gets a fresh context, we also record
> the maximum number of iclog headers that we might be pushing to disk in
> one go?

Yes. that defines the maximum size of the iclog header reservation
the CIL checkpoint is going to need if it stays within the hard
limit.
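
(To put illustrative numbers on that - these are not the real macro
values: 32kB iclogs with a 512 byte header leave 32256 usable bytes
each, so a hypothetical 8MB blocking limit would pre-count
8388608 / 32256 = 260 iclog headers for the context.)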

> Which I guess happens if someone commits a lot of updates to a
> filesystem, a comitting thread hits the throttle threshold, and now the
> CIL has to switch contexts and write the old context's transactions to
> disk?

Right - it reserves enough space for delays in context switches to
use all the overrun without having to do anything ... slow.

> > @@ -442,19 +454,36 @@ xlog_cil_insert_items(
> >  	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
> >  		ctx_res = ctx->ticket->t_unit_res;
> >  
> > -	spin_lock(&cil->xc_cil_lock);
> > -
> > -	/* do we need space for more log record headers? */
> > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > -				(ctx->space_used + len) / iclog_space)) {
> > -		split_res = (len + iclog_space - 1) / iclog_space;
> > -		/* need to take into account split region headers, too */
> > -		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > -		ctx->ticket->t_unit_res += split_res;
> > +	/*
> > +	 * Check if we need to steal iclog headers. atomic_read() is not a
> > +	 * locked atomic operation, so we can check the value before we do any
> > +	 * real atomic ops in the fast path. If we've already taken the CIL unit
> > +	 * reservation from this commit, we've already got one iclog header
> > +	 * space reserved so we have to account for that otherwise we risk
> > +	 * overrunning the reservation on this ticket.
> > +	 *
> > +	 * If the CIL is already at the hard limit, we might need more header
> > +	 * space than originally reserved. So steal more header space from every
> > +	 * commit that occurs once we are over the hard limit to ensure the CIL
> > +	 * push won't run out of reservation space.
> > +	 *
> > +	 * This can steal more than we need, but that's OK.
> > +	 */
> > +	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> 
> If we haven't stolen enough iclog header space...
> 
> > +	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> 
> ...or we've hit a throttling threshold, in which case we know we're
> going to push, so we might as well take everything and (I guess?) not
> give back any reservation that would encourage more commits before we're
> ready?

Partially. This is also safety against the CIL bumping back
down below and above the space limit multiple times. It just ensures
that every transaction that commits over the hard limit is
guaranteed to have enough iclog headers reserved to write the CIL
when it goes over the hard limit.

> > +		int	split_res = log->l_iclog_hsize +
> > +					sizeof(struct xlog_op_header);
> > +		if (ctx_res)
> > +			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
> > +		else
> > +			ctx_res = split_res * tp->t_ticket->t_iclog_hdrs;
> > +		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
> 
> What happens if xc_iclog_hdrs goes negative?  Does that merely mean that
> we stole more space from the transaction than we needed?  Or does it
> indicate that we're trying to cram too much into a single context?

Nothing. Yes. Indicates that we have commits throttling on the hard
limit.

> I suppose I worry about what might happen if each transaction's
> committed items actually somehow eats up every byte of reservation and
> that actually translates to t_iclog_hdrs iclogs being written out with a
> particular context, where sum(t_iclog_hdrs) is larger than what
> xlog_cil_set_iclog_hdr_count() precomputes?

If I understand what you are asking correctly, that should never
happen because the iclog header count should always span the maximum
number of iclogs the change requires to write into the log. And the
CIL context also reserves enough headers to write the entire set of
CIL data to the iclogs, so again we should not ever get into an
overrun situation because we have maximally dirty transactions being
committed. If these sorts of overruns ever occur, we've got a unit
reservation calculation issue, not a CIL iclog header space
reservation stealing issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure
  2021-03-11  0:11   ` Darrick J. Wong
@ 2021-03-11  6:33     ` Dave Chinner
  2021-03-11  6:42       ` Dave Chinner
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  6:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 04:11:43PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:33PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The CIL push lock is highly contended on larger machines, becoming a
> > hard bottleneck at about 700,000 transaction commits/s on >16p
> > machines. To address this, start moving the CIL tracking
> > infrastructure to utilise per-CPU structures.
> > 
> > We need to track the space used, the amount of log reservation space
> > reserved to write the CIL, the log items in the CIL and the busy
> > extents that need to be completed by the CIL commit.  This requires
> > a couple of per-cpu counters, an unordered per-cpu list and a
> > globally ordered per-cpu list.
> > 
> > Create a per-cpu structure to hold these and all the management
> > interfaces needed, as well as the hooks to handle hotplug CPUs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c       | 94 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_priv.h      | 15 ++++++
> >  include/linux/cpuhotplug.h |  1 +
> >  3 files changed, 110 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index f8fb2f59e24c..1bcf0d423d30 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -1365,6 +1365,93 @@ xfs_log_item_in_current_chkpt(
> >  	return true;
> >  }
> >  
> > +#ifdef CONFIG_HOTPLUG_CPU
> > +static LIST_HEAD(xlog_cil_pcp_list);
> > +static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
> > +static bool xlog_cil_pcp_init;
> > +
> > +static int
> > +xlog_cil_pcp_dead(
> > +	unsigned int		cpu)
> > +{
> > +	struct xfs_cil		*cil;
> > +
> > +        spin_lock(&xlog_cil_pcp_lock);
> > +        list_for_each_entry(cil, &xlog_cil_pcp_list, xc_pcp_list) {
> 
> Weird indentation.
> 
> > +		/* move stuff on dead CPU to context */
> 
> Should this have some actual code?  I don't think any of the remaining
> patches add anything here.

They should be moving stuff to the current CIL ctx so it is captured
when the CPU goes down.

> 
> > +	}
> > +	spin_unlock(&xlog_cil_pcp_lock);
> > +	return 0;
> > +}
> > +
> > +static int
> > +xlog_cil_pcp_hpadd(
> > +	struct xfs_cil		*cil)
> > +{
> > +	if (!xlog_cil_pcp_init) {
> > +		int	ret;
> > +		ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD,
> > +						"xfs/cil_pcp:dead", NULL,
> > +						xlog_cil_pcp_dead);
> > +		if (ret < 0) {
> > +			xfs_warn(cil->xc_log->l_mp,
> > +	"Failed to initialise CIL hotplug, error %d. XFS is non-functional.",
> 
> How likely is to happen?

AFAICT, very unlikely, but....

> 
> > +				ret);
> > +			ASSERT(0);
> 
> I guess not that often?
> 
> > +			return -ENOMEM;
> 
> Why not return ret here?  I guess it's because ret could be any number
> of (not centrally documented?) error codes, and we don't really care to
> expose that to userspace?

... yeah.

The cpu hotplug stuff is poorly documented and the code is another
of those "maze of twisty passages" implementations that bounce
through many functions that can return all sorts of different stuff
from lots of different things that could go wrong. I see at least
EINVAL, ENOSPC, ENOMEM, EBUSY and EAGAIN could be returned, and I
have no idea what any of them would actually mean about what went wrong.

In the end I kinda just copied what the radix tree and percpu
counter code do when initialising their hotplug calls fails: WARN
and then ignore. Hence from the filesystem POV, the error may as
well be ENOMEM because we couldn't set something critical up and
that saves us getting confused over whether the error is fatal or
not at a higher level.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 35/45] xfs: introduce per-cpu CIL tracking structure
  2021-03-11  6:33     ` Dave Chinner
@ 2021-03-11  6:42       ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  6:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Mar 11, 2021 at 05:33:38PM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 04:11:43PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:33PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The CIL push lock is highly contended on larger machines, becoming a
> > > hard bottleneck at about 700,000 transaction commits/s on >16p
> > > machines. To address this, start moving the CIL tracking
> > > infrastructure to utilise per-CPU structures.
> > > 
> > > We need to track the space used, the amount of log reservation space
> > > reserved to write the CIL, the log items in the CIL and the busy
> > > extents that need to be completed by the CIL commit.  This requires
> > > a couple of per-cpu counters, an unordered per-cpu list and a
> > > globally ordered per-cpu list.
> > > 
> > > Create a per-cpu structure to hold these and all the management
> > > interfaces needed, as well as the hooks to handle hotplug CPUs.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c       | 94 ++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_log_priv.h      | 15 ++++++
> > >  include/linux/cpuhotplug.h |  1 +
> > >  3 files changed, 110 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index f8fb2f59e24c..1bcf0d423d30 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -1365,6 +1365,93 @@ xfs_log_item_in_current_chkpt(
> > >  	return true;
> > >  }
> > >  
> > > +#ifdef CONFIG_HOTPLUG_CPU
> > > +static LIST_HEAD(xlog_cil_pcp_list);
> > > +static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
> > > +static bool xlog_cil_pcp_init;
> > > +
> > > +static int
> > > +xlog_cil_pcp_dead(
> > > +	unsigned int		cpu)
> > > +{
> > > +	struct xfs_cil		*cil;
> > > +
> > > +        spin_lock(&xlog_cil_pcp_lock);
> > > +        list_for_each_entry(cil, &xlog_cil_pcp_list, xc_pcp_list) {
> > 
> > Weird indentation.
> > 
> > > +		/* move stuff on dead CPU to context */
> > 
> > Should this have some actual code?  I don't think any of the remaining
> > patches add anything here.
> 
> They should be moving stuff to the current CIL ctx so it is captured
> when the CPU goes down.

Yup, looks like I missed updating this. Will add it in the patches
that need it.
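
Roughly this shape for each CIL on the list, I think - a sketch
only, mirroring the aggregation xlog_cil_push_work() already does;
where the dead CPU's log_items get spliced is the part that needs
care:

	cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
	atomic_add(cilpcp->space_used, &cil->xc_ctx->space_used);
	cilpcp->space_used = 0;
	list_splice_init(&cilpcp->busy_extents,
			 &cil->xc_ctx->busy_extents);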

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 36/45] xfs: implement percpu cil space used calculation
  2021-03-11  0:20   ` Darrick J. Wong
@ 2021-03-11  6:51     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-11  6:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 04:20:54PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:34PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have the CIL percpu structures in place, implement the
> > space used counter with a fast sum check similar to the
> > percpu_counter infrastructure.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 42 ++++++++++++++++++++++++++++++++++++------
> >  fs/xfs/xfs_log_priv.h |  2 +-
> >  2 files changed, 37 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 1bcf0d423d30..5519d112c1fd 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -433,6 +433,8 @@ xlog_cil_insert_items(
> >  	struct xfs_log_item	*lip;
> >  	int			len = 0;
> >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> > +	int			space_used;
> > +	struct xlog_cil_pcp	*cilpcp;
> >  
> >  	ASSERT(tp);
> >  
> > @@ -469,8 +471,9 @@ xlog_cil_insert_items(
> >  	 *
> >  	 * This can steal more than we need, but that's OK.
> >  	 */
> > +	space_used = atomic_read(&ctx->space_used);
> >  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> > -	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > +	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> >  		int	split_res = log->l_iclog_hsize +
> >  					sizeof(struct xlog_op_header);
> >  		if (ctx_res)
> > @@ -480,16 +483,34 @@ xlog_cil_insert_items(
> >  		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
> >  	}
> >  
> > +	/*
> > +	 * Update the CIL percpu pointer. This updates the global counter when
> > +	 * over the percpu batch size or when the CIL is over the space limit.
> > +	 * This means low lock overhead for normal updates, and when over the
> > +	 * limit the space used is immediately accounted. This makes enforcing
> > +	 * the hard limit much more accurate. The per cpu fold threshold is
> > +	 * based on how close we are to the hard limit.
> > +	 */
> > +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > +	cilpcp->space_used += len;
> > +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> > +	    cilpcp->space_used >
> > +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > +					num_online_cpus())) {
> 
> What happens if the log is very small and there are hundreds of CPUs?
> Can we end up on this slow path on a regular basis even if the amount of
> space used is not that large?

AFAICT, no big deal - the transaction reservations limit the
amount of space and concurrency that we can have here. A small log
will not allow many more than a handful of transactions through at a
time.  IOWs, we'll already be on the transaction reservation slow
path that limits concurrency via the grant head spin lock in
xlog_grant_head_check() and sleeping in xlog_grant_head_wait()...
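
(Illustrative numbers for the fold threshold above - not the real
macro values: with a 32MB blocking limit, 16MB already folded into
ctx->space_used and 8 online CPUs, each CPU batches locally until it
exceeds (32MB - 16MB) / 8 = 2MB. The total unfolded residue is then
bounded by the remaining headroom, and the batches shrink to nothing
as the CIL approaches the hard limit, so enforcement gets more
accurate exactly when it matters.)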

> Granted I can't think of a good way out of that, since I suspect that if
> you do that you're already going to be hurting in 5 other places anyway.
> That said ... I /do/ keep getting bugs from people with tiny logs on big
> iron.  Some day I'll (ha!) stomp out all the bugs that are "NO do not
> let your deployment system growfs 10000x, this is not ext4"...

Yeah, that hurts long before we get to this transaction commit
path...

> 
> > +		atomic_add(cilpcp->space_used, &ctx->space_used);
> > +		cilpcp->space_used = 0;
> > +	}
> > +	put_cpu_ptr(cilpcp);
> > +
> >  	spin_lock(&cil->xc_cil_lock);
> > -	tp->t_ticket->t_curr_res -= ctx_res + len;
> >  	ctx->ticket->t_unit_res += ctx_res;
> >  	ctx->ticket->t_curr_res += ctx_res;
> > -	ctx->space_used += len;
> >  
> >  	/*
> >  	 * If we've overrun the reservation, dump the tx details before we move
> >  	 * the log items. Shutdown is imminent...
> >  	 */
> > +	tp->t_ticket->t_curr_res -= ctx_res + len;
> 
> Is moving this really necessary?

Not really, just gets it out of the way. I moved it because it
doesn't need to be inside the spinlock and in the end it needs to
be associated with the underrun check. So I moved it here first so
that it didn't have to keep moving every time I moved the spinlock
or changed the order of the code from this point onwards....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure
  2021-03-11  0:26   ` Darrick J. Wong
@ 2021-03-12  0:47     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-12  0:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 04:26:10PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:35PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To get it out from under the cil spinlock.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 11 ++++++-----
> >  fs/xfs/xfs_log_priv.h |  2 +-
> >  2 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 5519d112c1fd..a2f93bd7644b 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -492,6 +492,7 @@ xlog_cil_insert_items(
> >  	 * based on how close we are to the hard limit.
> >  	 */
> >  	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > +	cilpcp->space_reserved += ctx_res;
> >  	cilpcp->space_used += len;
> >  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> >  	    cilpcp->space_used >
> > @@ -502,10 +503,6 @@ xlog_cil_insert_items(
> >  	}
> >  	put_cpu_ptr(cilpcp);
> >  
> > -	spin_lock(&cil->xc_cil_lock);
> > -	ctx->ticket->t_unit_res += ctx_res;
> > -	ctx->ticket->t_curr_res += ctx_res;
> > -
> >  	/*
> >  	 * If we've overrun the reservation, dump the tx details before we move
> >  	 * the log items. Shutdown is imminent...
> > @@ -527,6 +524,7 @@ xlog_cil_insert_items(
> >  	 * We do this here so we only need to take the CIL lock once during
> >  	 * the transaction commit.
> >  	 */
> > +	spin_lock(&cil->xc_cil_lock);
> >  	list_for_each_entry(lip, &tp->t_items, li_trans) {
> >  
> >  		/* Skip items which aren't dirty in this transaction. */
> > @@ -798,10 +796,13 @@ xlog_cil_push_work(
> >  
> >  	down_write(&cil->xc_ctx_lock);
> >  
> > -	/* Reset the CIL pcp counters */
> > +	/* Aggregate and reset the CIL pcp counters */
> >  	for_each_online_cpu(cpu) {
> >  		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > +		ctx->ticket->t_curr_res += cilpcp->space_reserved;
> 
> Why isn't it necessary to update ctx->ticket->t_unit_res any more?

Because t_unit_res is never used by the CIL ticket because they
aren't permanent transaction reservations. The unit res is only
for granting new space to a ticket, yet the CIL only ever "steals"
granted space from an existing ticket. When
the ticket is dropped, we return unused reservations from the
CIL ticket, but never touch or look at the unit reservation.
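
(In other words the steal is entirely in t_curr_res - a two-line
sketch following the quoted hunks:)

	tp->t_ticket->t_curr_res -= ctx_res;	/* committing tx gives it up */
	ctx->ticket->t_curr_res += ctx_res;	/* CIL ticket takes it */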

I can add it back in here if you want, but it's largely dead code...

> (Admittedly I'm struggling to figure out why it matters to keep it
> updated even in the current code base...)

I think I originally did it a decade ago because I probably wasn't
100% sure on what impact not setting it would have. Getting the rest
of the delayed logging code right was far more important than
sweating on a tiny, largely insignificant detail like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 38/45] xfs: convert CIL busy extents to per-cpu
  2021-03-11  0:36   ` Darrick J. Wong
@ 2021-03-12  1:15     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-12  1:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 04:36:01PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To get them out from under the CIL lock.
> > 
> > This is an unordered list, so we can simply punt it to per-cpu lists
> > during transaction commits and reaggregate it back into a single
> > list during the CIL push work.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 26 ++++++++++++++++++--------
> >  1 file changed, 18 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index a2f93bd7644b..7428b98c8279 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -501,6 +501,9 @@ xlog_cil_insert_items(
> >  		atomic_add(cilpcp->space_used, &ctx->space_used);
> >  		cilpcp->space_used = 0;
> >  	}
> > +	/* attach the transaction to the CIL if it has any busy extents */
> > +	if (!list_empty(&tp->t_busy))
> > +		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
> >  	put_cpu_ptr(cilpcp);
> >  
> >  	/*
> > @@ -540,9 +543,6 @@ xlog_cil_insert_items(
> >  			list_move_tail(&lip->li_cil, &cil->xc_cil);
> >  	}
> >  
> > -	/* attach the transaction to the CIL if it has any busy extents */
> > -	if (!list_empty(&tp->t_busy))
> > -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> >  	spin_unlock(&cil->xc_cil_lock);
> >  
> >  	if (tp->t_ticket->t_curr_res < 0)
> > @@ -802,7 +802,10 @@ xlog_cil_push_work(
> >  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
> >  		cilpcp->space_used = 0;
> >  		cilpcp->space_reserved = 0;
> > -
> > +		if (!list_empty(&cilpcp->busy_extents)) {
> > +			list_splice_init(&cilpcp->busy_extents,
> > +					&ctx->busy_extents);
> > +		}
> >  	}
> >  
> >  	spin_lock(&cil->xc_push_lock);
> > @@ -1459,17 +1462,24 @@ static void __percpu *
> >  xlog_cil_pcp_alloc(
> >  	struct xfs_cil		*cil)
> >  {
> > +	void __percpu		*pcptr;
> >  	struct xlog_cil_pcp	*cilpcp;
> > +	int			cpu;
> >  
> > -	cilpcp = alloc_percpu(struct xlog_cil_pcp);
> > -	if (!cilpcp)
> > +	pcptr = alloc_percpu(struct xlog_cil_pcp);
> > +	if (!pcptr)
> >  		return NULL;
> >  
> > +	for_each_possible_cpu(cpu) {
> > +		cilpcp = per_cpu_ptr(pcptr, cpu);
> 
> So... in my mind, "cilpcp" and "pcptr" aren't really all that distinct
> from each other.  I /think/ you're trying to use "cilpcp" everywhere
> else to mean "pointer to a particular CPU's CIL data", and this change
> makes that usage consistent in the alloc function.

Yeah, it's hard to have short, concise, distinct names here because
the generic pointer returned is a pointer to per cpu memory that
contains CIL specific per-cpu structures...

> However, this leaves xlog_cil_pcp_free using "cilpcp" to refer to the
> entire chunk of per-CPU data structures.

I'll fix that, I obviously missed that when trying to clean this up
to be consistent...

> Given that the first refers to
> a specific structure and the second refers to them all in aggregate,
> maybe _pcp_alloc and _pcp_free should use a name that at least sounds
> plural?
> 
> e.g.
> 
> 	void __percpu	*all_cilpcps = alloc_percpu(...);
> 
> 	for_each_possible_cpu(cpu) {
> 		cilpcp = per_cpu_ptr(all_cilpcps, cpu);
> 		cilpcp->magicval = 7777;
> 	}

The problem with "all" is that it implies a "global" all, not
something that is owned by this specific CIL instance. i.e. there
will be a per-cpu CIL area for every filesystem that is mounted, and
they are actually all linked together into a global list for CPU
hotplug to walk. So "all" CIL pcps to me means walking this list:

static LIST_HEAD(xlog_cil_pcp_list);
static DEFINE_SPINLOCK(xlog_cil_pcp_lock);

which is linked by the cil->xc_pcp_list list heads in each CIL
instance so that CPU hotplug can do the right thing.

"pcp" is typical shorthand for a "per cpu pointer" but it's
horrible when we have pointers to per-cpu lists, lists of per-cpu
aware structures, pointers to per-cpu regions, pointers to per-cpu
data (structures) within per-cpu regions, etc.

There is no way to win here, it's going to be confusing whatever we
do. I've tried to keep it simple:

pcptr			- generic pointer to allocated per CPU region
cil->xc_pcp		- CIL instance pointer to allocated percpu region
cil->xc_pcp_list	- global list for CPU hotplug pcp management

cilpcp			- pointer to specific CPU instance of the CIL
			  percpu data inside cil->xc_pcp region.

I might just change the generic (void percpu *) regions to "pcp" so
that they align with all the other uses of "pcp" in the naming.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 40/45] xfs: convert CIL to unordered per cpu lists
  2021-03-11  1:15   ` Darrick J. Wong
@ 2021-03-12  2:18     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-12  2:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 05:15:05PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:38PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > So that we can remove the cil_lock which is a global serialisation
> > point. We've already got ordering sorted, so all we need to do is
> > treat the CIL list like the busy extent list and reconstruct it
> > before the push starts.
....
> > @@ -530,7 +511,6 @@ xlog_cil_insert_items(
> >  	 * the transaction commit.
> >  	 */
> >  	order = atomic_inc_return(&ctx->order_id);
> > -	spin_lock(&cil->xc_cil_lock);
> >  	list_for_each_entry(lip, &tp->t_items, li_trans) {
> >  
> >  		/* Skip items which aren't dirty in this transaction. */
> > @@ -540,10 +520,26 @@ xlog_cil_insert_items(
> >  		lip->li_order_id = order;
> >  		if (!list_empty(&lip->li_cil))
> >  			continue;
> > -		list_add(&lip->li_cil, &cil->xc_cil);
> > +		list_add(&lip->li_cil, &cilpcp->log_items);
> 
> Ok, so if I understand this correctly -- every time a transaction
> commits, it marks every dirty log item with a monotonically increasing
> counter.  If the log item isn't already on another CPU's CIL list, it
> gets added to the current CPU's CIL list...

Correct.

> > +	}
> > +	put_cpu_ptr(cilpcp);
> > +
> > +	/*
> > +	 * If we've overrun the reservation, dump the tx details before we move
> > +	 * the log items. Shutdown is imminent...
> > +	 */
> > +	tp->t_ticket->t_curr_res -= ctx_res + len;
> > +	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> > +		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> > +		xfs_warn(log->l_mp,
> > +			 "  log items: %d bytes (iov hdrs: %d bytes)",
> > +			 len, iovhdr_res);
> > +		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> > +			 split_res);
> > +		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> > +		xlog_print_trans(tp);
> >  	}
> >  
> > -	spin_unlock(&cil->xc_cil_lock);
> >  
> >  	if (tp->t_ticket->t_curr_res < 0)
> >  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> > @@ -806,6 +802,7 @@ xlog_cil_push_work(
> >  	bool			commit_iclog_sync = false;
> >  	int			cpu;
> >  	struct xlog_cil_pcp	*cilpcp;
> > +	LIST_HEAD		(log_items);
> >  
> >  	new_ctx = xlog_cil_ctx_alloc();
> >  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> > @@ -822,6 +819,9 @@ xlog_cil_push_work(
> >  			list_splice_init(&cilpcp->busy_extents,
> >  					&ctx->busy_extents);
> >  		}
> > +		if (!list_empty(&cilpcp->log_items)) {
> > +			list_splice_init(&cilpcp->log_items, &log_items);
> 
> ...and then at CIL push time, we splice each per-CPU list into a big
> list, sort the dirty log items by counter number, and process them.

Yup, that's pretty much it. I'm replacing insert-time ordering with
push-time ordering to get rid of the serialisation overhead of
ordering at insert time.
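
In code terms, the push-time reordering is essentially (a sketch built
from the quoted hunks; the comparison helper name is mine, and because
list_sort() is a stable merge sort, items with equal order IDs keep
their relative order):

	static int
	xlog_cil_order_cmp(
		void			*args,
		struct list_head	*a,
		struct list_head	*b)
	{
		struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
		struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);

		return l1->li_order_id > l2->li_order_id;
	}

	/* in xlog_cil_push_work(), once every per-cpu list is spliced: */
	list_sort(NULL, &log_items, xlog_cil_order_cmp);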

> The first thought I had was that it's a darn shame that _insert_items
> can't steal a log item from another CPU's CIL list, because you could
> then mergesort the per-CPU CIL lists into @log_items.  Unfortunately, I
> don't think there's a safe way to steal items from a per-CPU list
> without involving locks.

Yeah, it needs locks because we then have to serialise local inserts
with remote removals. It can be done fairly easily - I just need to
replace the "order ID" field with the CPU ID of the list it is on.

The problem is that relogging happens a lot, so in some workloads we
might be bouncing a set of commonly accessed log items around CPUs
frequently. That said, I'm not sure this would end up being a huge
problem, but it still needs a mergesort to be performed in the push
code...

> The second thought I had was that we have the xfs_pwork mechanism for
> launching a bunch of worker threads.  A pwork workqueue is (probably)
> too costly when the item list is short or there aren't that many CPUs,
> but once list_sort starts getting painful, would it be faster to launch
> a bunch of threads in push_work to sort each per-CPU list and then merge
> sort them into the final list?

Not sure, because now you have N work threads competing with the
userspace workload for CPU to do maybe 10ms of work. The scheduling
latency when the system is CPU bound is likely to introduce more
latency than you save by spreading the work out....

I've largely put these sorts of questions aside because optimising
this code further can be done later. The code as it stands doubles
the throughput of the commit path and I don't think that further
optimisation is immediately necessary. Ensuring that the splitting
and recombining of the lists still results in correctly ordered log
items is more important right now, and I think it does that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 41/45] xfs: move CIL ordering to the logvec chain
  2021-03-11  1:34   ` Darrick J. Wong
@ 2021-03-12  2:29     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-12  2:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 05:34:52PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Adding a list_sort() call to the CIL push work while the xc_ctx_lock
> > is held exclusively has resulted in fairly long lock hold times and
> > that stops all front end transaction commits from making progress.
> 
> Heh, nice solution. :)
> 
> > We can move the sorting out of the xc_ctx_lock if we can transfer
> > the ordering information to the log vectors as they are detached
> > from the log items and then we can sort the log vectors. This
> > requires log vectors to use a list_head rather than a single linked
> > list
> 
> Ergh, could pull out the list conversion into a separate piece?
> Some of the lv_chain usage is ... not entirely textbook.

Yes, I can probably do that.

> > diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> > index af54ea3f8c90..0445dd6acbce 100644
> > --- a/fs/xfs/xfs_log.h
> > +++ b/fs/xfs/xfs_log.h
> > @@ -9,7 +9,8 @@
> >  struct xfs_cil_ctx;
> >  
> >  struct xfs_log_vec {
> > -	struct xfs_log_vec	*lv_next;	/* next lv in build list */
> > +	struct list_head	lv_chain;	/* lv chain ptrs */
> > +	int			lv_order_id;	/* chain ordering info */
> 
> uint32_t to match li_order_id?

*nod*

> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 3d43a5088154..6dcc23829bef 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
> >  	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
> >  	INIT_LIST_HEAD(&ctx->committing);
> >  	INIT_LIST_HEAD(&ctx->busy_extents);
> > +	INIT_LIST_HEAD(&ctx->lv_chain);
> >  	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
> >  	return ctx;
> >  }
> > @@ -237,6 +238,7 @@ xlog_cil_alloc_shadow_bufs(
> >  			lv = kmem_alloc_large(buf_size, KM_NOFS);
> >  			memset(lv, 0, xlog_cil_iovec_space(niovecs));
> >  
> > +			INIT_LIST_HEAD(&lv->lv_chain);
> >  			lv->lv_item = lip;
> >  			lv->lv_size = buf_size;
> >  			if (ordered)
> > @@ -252,7 +254,6 @@ xlog_cil_alloc_shadow_bufs(
> >  			else
> >  				lv->lv_buf_len = 0;
> >  			lv->lv_bytes = 0;
> > -			lv->lv_next = NULL;
> >  		}
> >  
> >  		/* Ensure the lv is set up according to ->iop_size */
> > @@ -379,8 +380,6 @@ xlog_cil_insert_format_items(
> >  		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
> >  			/* same or smaller, optimise common overwrite case */
> >  			lv = lip->li_lv;
> > -			lv->lv_next = NULL;
> 
> What /did/ these null assignments do?

IIRC, at one point they ensured that the lv chain was correctly
terminated when an lv was reused and added to the tail of an existing
chain. I think that became redundant when we added the shadow
buffers to allow allocation outside the CIL lock contexts...

> > -		list_del_init(&item->li_cil);
> > -		item->li_order_id = 0;
> > -		if (!ctx->lv_chain)
> > -			ctx->lv_chain = item->li_lv;
> > -		else
> > -			lv->lv_next = item->li_lv;
> > +
> >  		lv = item->li_lv;
> > -		item->li_lv = NULL;
> > +		lv->lv_order_id = item->li_order_id;
> >  		num_iovecs += lv->lv_niovecs;
> > -
> >  		/* we don't write ordered log vectors */
> >  		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> >  			num_bytes += lv->lv_bytes;
> > +		list_add_tail(&lv->lv_chain, &ctx->lv_chain);
> > +
> > +		list_del_init(&item->li_cil);
> 
> Do the list manipulations need moving, or could they have stayed further
> up in the loop body for a cleaner patch?

I moved them so the code was structured as:

		<transfer item state to log vec>
		<manipulate lists>
		<clear item state>

Because there was no clear separation between state and list
manipulations. This will clean up if I separate the list
manipulations into their own patch...
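
i.e. the loop body becomes something like this (a sketch of the
structure described above, using only fields from the quoted diff):

	/* transfer item state to the log vector */
	lv = item->li_lv;
	lv->lv_order_id = item->li_order_id;

	/* manipulate lists */
	list_add_tail(&lv->lv_chain, &ctx->lv_chain);
	list_del_init(&item->li_cil);

	/* clear item state */
	item->li_order_id = 0;
	item->li_lv = NULL;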

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 43/45] xfs: avoid cil push lock if possible
  2021-03-11  1:47   ` Darrick J. Wong
@ 2021-03-12  2:36     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-12  2:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 05:47:09PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:41PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because now it hurts when the CIL fills up.
> > 
> >   - 37.20% __xfs_trans_commit
> >       - 35.84% xfs_log_commit_cil
> >          - 19.34% _raw_spin_lock
> >             - do_raw_spin_lock
> >                  19.01% __pv_queued_spin_lock_slowpath
> >          - 4.20% xfs_log_ticket_ungrant
> >               0.90% xfs_log_space_wake
> > 
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 6dcc23829bef..d60c72ad391a 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -1115,10 +1115,18 @@ xlog_cil_push_background(
> >  	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
> >  
> >  	/*
> > -	 * Don't do a background push if we haven't used up all the
> > -	 * space available yet.
> > +	 * We are done if:
> > +	 * - we haven't used up all the space available yet; or
> > +	 * - we've already queued up a push; and
> > +	 * - we're not over the hard limit; and
> > +	 * - nothing has been over the hard limit.
> 
> Er... do these last three bullet points correspond to the last three
> lines of the if test?  I'm not sure how !waitqueue_active() determines
> that nothing has been over the hard limit?

If a commit pushes the space used over the hard limit, it will be
throttled and put to sleep on the push wait queue. Another commit
can then return space (e.g. an inode fork gets smaller) and bring us
back under the hard limit. Hence checking the space used alone does
not tell us whether we've hit the hard limit, but checking whether
there is a throttled process on the wait queue does...

> Or for that matter how
> comparing push_seq against current_seq tells us if we've queued a
> push?

We only set push_seq == current_seq when we queue up a push in
xlog_cil_push_now() or xlog_cil_push_background().  Hence if no push
has been queued, then push_seq will be less than current_seq.
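
Putting the two answers together, the gate boils down to something
like this (a sketch only; the blocking limit macro and the wait queue
field name here are my shorthand for whatever the patch actually
defines):

	/*
	 * Done if we are under the background push threshold, or if a
	 * push is already queued and nothing is currently throttled on
	 * the hard limit.
	 */
	if (space_used < XLOG_CIL_SPACE_LIMIT(log) ||
	    (cil->xc_push_seq == cil->xc_current_seq &&
	     space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) &&
	     !waitqueue_active(&cil->xc_push_wait)))
		return;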

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 03/45] xfs: separate CIL commit record IO
  2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
  2021-03-08  8:34   ` Chandan Babu R
@ 2021-03-15 14:40   ` Brian Foster
  2021-03-16  8:40   ` Christoph Hellwig
  2 siblings, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:01PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes require completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---

I still think the patch could be titled much more descriptively. For
example:

xfs: checkpoint completion to commit record submission ordering

Otherwise (and despite my slight unease over now blocking async log
forces on iclog callback completion) looks Ok:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log.c      | 8 +++++---
>  fs/xfs/xfs_log_cil.c  | 9 +++++++++
>  fs/xfs/xfs_log_priv.h | 2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..317c466232d4 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -784,10 +784,12 @@ xfs_log_mount_cancel(
>  }
>  
>  /*
> - * Wait for the iclog to be written disk, or return an error if the log has been
> - * shut down.
> + * Wait for the iclog and all prior iclogs to be written to disk as required by the
> + * log force state machine. Waiting on ic_force_wait ensures iclog completions
> + * have been ordered and callbacks run before we are woken here, hence
> + * guaranteeing that all the iclogs up to this one are on stable storage.
>   */
> -static int
> +int
>  xlog_wait_on_iclog(
>  	struct xlog_in_core	*iclog)
>  		__releases(iclog->ic_log->l_icloglock)
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..1e5fd6f268c2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,15 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn) {
> +		spin_lock(&log->l_icloglock);
> +		xlog_wait_on_iclog(commit_iclog->ic_prev);
> +	}
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..ee7786b33da9 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog(struct xlog_in_core *iclog);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
  2021-03-08  9:31   ` Chandan Babu R
  2021-03-08 22:21   ` Darrick J. Wong
@ 2021-03-15 14:40   ` Brian Foster
  2021-03-16  8:41   ` Christoph Hellwig
  3 siblings, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a one line wrapper around blkdev_issue_flush(). Just replace it
> with direct calls to blkdev_issue_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf.c   | 2 +-
>  fs/xfs/xfs_file.c  | 6 +++---
>  fs/xfs/xfs_log.c   | 2 +-
>  fs/xfs/xfs_super.c | 7 -------
>  fs/xfs/xfs_super.h | 1 -
>  5 files changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 37a1d12762d8..7043546a04b8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1958,7 +1958,7 @@ xfs_free_buftarg(
>  	percpu_counter_destroy(&btp->bt_io_count);
>  	list_lru_destroy(&btp->bt_lru);
>  
> -	xfs_blkdev_issue_flush(btp);
> +	blkdev_issue_flush(btp->bt_bdev);
>  
>  	kmem_free(btp);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a007ca0711d9..24c7f45fc4eb 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -197,9 +197,9 @@ xfs_file_fsync(
>  	 * inode size in case of an extending write.
>  	 */
>  	if (XFS_IS_REALTIME_INODE(ip))
> -		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
> +		blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>  
>  	/*
>  	 * Any inode that has dirty modifications in the log is pinned.  The
> @@ -219,7 +219,7 @@ xfs_file_fsync(
>  	 */
>  	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>  	    mp->m_logdev_targp == mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 317c466232d4..fee76c485727 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1962,7 +1962,7 @@ xlog_sync(
>  	 * layer state machine for preflushes.
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
> +		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
>  		need_flush = false;
>  	}
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e5e0713bebcd..ca2cb0448b5e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -339,13 +339,6 @@ xfs_blkdev_put(
>  		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>  }
>  
> -void
> -xfs_blkdev_issue_flush(
> -	xfs_buftarg_t		*buftarg)
> -{
> -	blkdev_issue_flush(buftarg->bt_bdev);
> -}
> -
>  STATIC void
>  xfs_close_devices(
>  	struct xfs_mount	*mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index 1ca484b8357f..79cb2dece811 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -88,7 +88,6 @@ struct block_device;
>  
>  extern void xfs_quiesce_attr(struct xfs_mount *mp);
>  extern void xfs_flush_inodes(struct xfs_mount *mp);
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
>  extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
>  					   xfs_agnumber_t agcount);
>  
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-08 22:24     ` Darrick J. Wong
@ 2021-03-15 14:41       ` Brian Foster
  2021-03-15 16:32         ` Darrick J. Wong
  0 siblings, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, Dave Chinner, linux-xfs

On Mon, Mar 08, 2021 at 02:24:07PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 08, 2021 at 03:18:09PM +0530, Chandan Babu R wrote:
> > On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The new checkpoint caceh flush mechanism requires us to issue an
> > > unconditional cache flush before we start a new checkpoint. We don't
> > > want to block for this if we can help it, and we have a fair chunk
> > > of CPU work to do between starting the checkpoint and issuing the
> > > first journal IO.
> > >
> > > Hence it makes sense to amortise the latency cost of the cache flush
> > > by issuing it asynchronously and then waiting for it only when we
> > > need to issue the first IO in the transaction.
> > >
> > > TO do this, we need async cache flush primitives to submit the cache
> > > flush bio and to wait on it. THe block layer has no such primitives
> > > for filesystems, so roll our own for the moment.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_linux.h  |  2 ++
> > >  2 files changed, 38 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > > index 17f36db2f792..668f8bd27b4a 100644
> > > --- a/fs/xfs/xfs_bio_io.c
> > > +++ b/fs/xfs/xfs_bio_io.c
> > > @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> > >  	return bio_max_segs(howmany(count, PAGE_SIZE));
> > >  }
> > >  
> > > +void
> > > +xfs_flush_bdev_async_endio(
> > > +	struct bio	*bio)
> > > +{
> > > +	if (bio->bi_private)
> > > +		complete(bio->bi_private);
> > > +}
> > > +
> > > +/*
> > > + * Submit a request for an async cache flush to run. If the request queue does
> > > + * not require flush operations, just skip it altogether. If the caller needsi
> > > + * to wait for the flush completion at a later point in time, they must supply a
> > > + * valid completion. This will be signalled when the flush completes.  The
> > > + * caller never sees the bio that is issued here.
> > > + */
> > > +void
> > > +xfs_flush_bdev_async(
> > > +	struct bio		*bio,
> > > +	struct block_device	*bdev,
> > > +	struct completion	*done)
> > > +{
> > > +	struct request_queue	*q = bdev->bd_disk->queue;
> > > +
> > > +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> > > +		complete(done);
> > 
> > complete() should be invoked only when "done" has a non-NULL value.
> 
> The only caller always provides a completion.
> 

IMO, if the mechanism (i.e. the helper) accommodates a NULL parameter,
the underlying completion callback should as well..

Brian

> --D
> 
> > > +		return;
> > > +	}
> > 
> > -- 
> > chandan
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
  2021-03-08  9:48   ` Chandan Babu R
  2021-03-08 22:26   ` Darrick J. Wong
@ 2021-03-15 14:42   ` Brian Foster
  2 siblings, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The new checkpoint caceh flush mechanism requires us to issue an

		     cache

> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
> 
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
> 
> TO do this, we need async cache flush primitives to submit the cache

  To

> flush bio and to wait on it. THe block layer has no such primitives

			       The

> for filesystems, so roll our own for the moment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  2 ++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 17f36db2f792..668f8bd27b4a 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return bio_max_segs(howmany(count, PAGE_SIZE));
>  }
>  
> +void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the request queue does
> + * not require flush operations, just skip it altogether. If the caller needsi

									   needs

> + * to wait for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.  The
> + * caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct bio		*bio,
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct request_queue	*q = bdev->bd_disk->queue;
> +

It seems rather odd to me to accept a bio here and then init it, but I
see this was explicitly changed from the previous version to avoid an
allocation (I'd rather see the bio in the CIL context or something
rather than dropped on the stack, but whatever).
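
Something along these lines, say (a sketch only; the field names are
invented):

	struct xfs_cil_ctx {
		/* existing fields elided */
		struct bio		flush_bio;	/* checkpoint cache flush */
		struct completion	flush_done;	/* signalled on completion */
	};

That ties the bio's lifetime to the checkpoint context rather than to
the push worker's stack frame.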

> +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> +		complete(done);

The NULL or no NULL debate aside, this should be consistent with the
logic in the callback (IMO, just check for NULL here as Chandan
suggested). With that fixed up, one way or the other:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +		return;
> +	}
> +
> +	bio_init(bio, NULL, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +	bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
> +	submit_bio(bio);
> +}
>  int
>  xfs_rw_bdev(
>  	struct block_device	*bdev,
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index af6be9b9ccdf..953d98bc4832 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
> +void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
> +		struct completion *done);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally
  2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-03-15 14:43   ` Brian Foster
  2021-03-16  8:47   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log_cil.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 1e5fd6f268c2..b4cdb8b6c4c3 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -719,10 +721,25 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 *
> +	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
> +	 * cache flushes to provide this ordering guarantee, and hence for CIL
> +	 * checkpoints that require hundreds or thousands of log writes no
> +	 * longer need to issue device cache flushes to provide metadata
> +	 * writeback ordering.
> +	 */

I don't think we need to have code comments to explain why some other
code doesn't do something or doesn't exist. This seems like something
that should stick to the commit log description (between this patch and
the future patch that removes the historical behavior). IOW, I'd just
drop that second paragraph.

Otherwise (and modulo my previous thoughts on the bio) LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> +				&bdev_flush);
> +
> +	/*
> +	 * Pull all the log vectors off the items in the CIL, and remove the
> +	 * items from the CIL. We don't need the CIL lock here because it's only
> +	 * needed on the transaction commit side which is currently locked out
> +	 * by the flush lock.
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> @@ -806,6 +823,12 @@ xlog_cil_push_work(
>  	lvhdr.lv_iovecp = &lhdr;
>  	lvhdr.lv_next = ctx->lv_chain;
>  
> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);
> +
>  	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
>  	if (error)
>  		goto out_abort_free_ticket;
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write()
  2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
@ 2021-03-15 14:45   ` Brian Foster
  2021-03-16 14:15   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The CIL push is the only call to xlog_write that sets this variable
> to true. The other callers don't need a start rec, and they tell
> xlog_write what to do by passing the type of ophdr they need written
> in the flags field. The need_start_rec parameter essentially tells
> xlog_write to to write an extra ophdr with a XLOG_START_TRANS type,
> so get rid of the variable to do this and pass XLOG_START_TRANS as
> the flag value into xlog_write() from the CIL push.
> 
> $ size fs/xfs/xfs_log.o*
>   text	   data	    bss	    dec	    hex	filename
>  27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
>  27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
>  fs/xfs/xfs_log_cil.c  |  3 ++-
>  fs/xfs/xfs_log_priv.h |  3 +--
>  3 files changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fee76c485727..364694a83de6 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
...
> @@ -2449,13 +2448,15 @@ xlog_write(
>  			 * write a start record. Only do this for the first
>  			 * iclog we write to.
>  			 */
> -			if (need_start_rec) {
> +			if (optype & XLOG_START_TRANS) {
>  				xlog_write_start_rec(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  						sizeof(struct xlog_op_header));
> +				optype &= ~XLOG_START_TRANS;
> +				wrote_start_rec = true;

I think this overload of optype and op header flags is sufficiently
subtle and fragile that it warrants a comment. E.g., something like:

"Now that we've written the start record, we must clear the flag now so
it doesn't leak into subsequent op headers."

Otherwise, I think it's pretty much a guarantee that somebody will come
along later and attempt to optimize away what looks like an unnecessary
boolean by moving the flag clear further down without realizing why it's
here.

In fact, I think what would have been cleaner and simpler overall is
to translate the new optype param back into the preexisting flags (as a
local variable) and need_start_rec parameters right after optype is
consumed by xlog_write_calc_vec_length(). Then we wouldn't have to tweak
as much functional logic to accommodate a subtle variable overload
(i.e., basically just removing code changes from this patch) and the
code would be self-explanatory. That said, the remaining changes look OK
to me so long as the above is clearly documented.
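
Roughly this, as a sketch of that alternative (need_start_rec and
flags would become locals in xlog_write()):

	bool	need_start_rec;
	uint	flags;

	len = xlog_write_calc_vec_length(ticket, log_vector, optype);

	/*
	 * Translate the one-shot optype back into the old parameters so
	 * the rest of the function is untouched by the conversion.
	 */
	need_start_rec = (optype & XLOG_START_TRANS);
	flags = optype & ~XLOG_START_TRANS;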

Brian

>  			}
>  
> -			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
> +			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
>  			if (!ophdr)
>  				return -EIO;
>  
> @@ -2486,14 +2487,13 @@ xlog_write(
>  			}
>  			copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
> -			if (need_start_rec) {
> +			if (wrote_start_rec) {
>  				copy_len += sizeof(struct xlog_op_header);
>  				record_cnt++;
> -				need_start_rec = false;
>  			}
>  			data_cnt += contwr ? copy_len : 0;
>  
> -			error = xlog_write_copy_finish(log, iclog, flags,
> +			error = xlog_write_copy_finish(log, iclog, optype,
>  						       &record_cnt, &data_cnt,
>  						       &partial_copy,
>  						       &partial_copy_len,
> @@ -2537,7 +2537,7 @@ xlog_write(
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
>  	if (commit_iclog) {
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> +		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
>  		error = xlog_state_release_iclog(log, iclog);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b4cdb8b6c4c3..c04d5d37a3a2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -829,7 +829,8 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
> +	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
> +				XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index ee7786b33da9..56e1942c47df 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint flags,
> -		bool need_start_rec);
> +		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/45] xfs: reduce buffer log item shadow allocations
  2021-03-05  5:11 ` [PATCH 10/45] xfs: reduce buffer log item shadow allocations Dave Chinner
@ 2021-03-15 14:52   ` Brian Foster
  0 siblings, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-15 14:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:08PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we modify btrees repeatedly, we regularly increase the size of
> the logged region by a single chunk at a time (per transaction
> commit). This results in the CIL formatting code having to
> reallocate the log vector buffer every time the buffer dirty region
> grows. Hence over a typical 4kB btree buffer, we might grow the log
> vector 4096/128 = 32x over a short period where we repeatedly add
> or remove records to/from the buffer over a series of running
> transaction. This means we are doing 32 memory allocations and frees
> over this time during a performance critical path in the journal.
> 
> The amount of space tracked in the CIL for the object is calculated
> during the ->iop_format() call for the buffer log item, but the
> buffer memory allocated for it is calculated by the ->iop_size()
> call. The size callout determines the size of the buffer, the format
> call determines the space used in the buffer.
> 
> Hence we can oversize the buffer space required in the size
> calculation without impacting the amount of space used and accounted
> to the CIL for the changes being logged. This allows us to reduce
> the number of allocations by rounding up the buffer size to allow
> for future growth. This can save a substantial amount of CPU time in
> this path:
> 
> -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
>    - 44.49% xfs_log_commit_cil
>       - 30.78% _raw_spin_lock
>          - 30.75% do_raw_spin_lock
>               30.27% __pv_queued_spin_lock_slowpath
> 
> (oh, ouch!)
> ....
>       - 1.05% kmem_alloc_large
>          - 1.02% kmem_alloc
>               0.94% __kmalloc
> 
> The overhead here is what this patch is aimed at. After:
> 
>       - 0.76% kmem_alloc_large
>          - 0.75% kmem_alloc
>               0.70% __kmalloc
> 
> The size of 512 bytes is based on the bitmap chunk size being 128
> bytes and that random directory entry updates almost never require
> more than 3-4 128 byte regions to be logged in the directory block.
> 
> The other observation is for per-ag btrees. When we are inserting
> into a new btree block, we'll pack it from the front. Hence the
> first few records land in the first 128 bytes so we log only 128
> bytes, the next 8-16 records land in the second region so now we log
> 256 bytes. And so on.  If we are doing random updates, it will only
> allocate every 4 random 128 byte regions that are dirtied instead of
> every single one.
> 
> Any larger than 512 bytes and I noticed an increase in memory
> footprint in my scalability workloads. Any less than this and I
> didn't really see any significant benefit to CPU usage.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---

It looks like I have posted feedback (mostly reviewed-by tags, fwiw) on
previous posts of this patch and the next three that appears to have
been either ignored or lost.

Brian

>  fs/xfs/xfs_buf_item.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 17960b1ce5ef..0628a65d9c55 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -142,6 +142,7 @@ xfs_buf_item_size(
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	int			i;
> +	int			bytes;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -173,7 +174,7 @@ xfs_buf_item_size(
>  	}
>  
>  	/*
> -	 * the vector count is based on the number of buffer vectors we have
> +	 * The vector count is based on the number of buffer vectors we have
>  	 * dirty bits in. This will only be greater than one when we have a
>  	 * compound buffer with more than one segment dirty. Hence for compound
>  	 * buffers we need to track which segment the dirty bits correspond to,
> @@ -181,10 +182,18 @@ xfs_buf_item_size(
>  	 * count for the extra buf log format structure that will need to be
>  	 * written.
>  	 */
> +	bytes = 0;
>  	for (i = 0; i < bip->bli_format_count; i++) {
>  		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> -					  nvecs, nbytes);
> +					  nvecs, &bytes);
>  	}
> +
> +	/*
> +	 * Round up the buffer size required to minimise the number of memory
> +	 * allocations that need to be done as this item grows when relogged by
> +	 * repeated modifications.
> +	 */
> +	*nbytes = round_up(bytes, 512);
>  	trace_xfs_buf_item_size(bip);
>  }
>  
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-15 14:41       ` Brian Foster
@ 2021-03-15 16:32         ` Darrick J. Wong
  2021-03-16  8:43           ` Christoph Hellwig
  0 siblings, 1 reply; 145+ messages in thread
From: Darrick J. Wong @ 2021-03-15 16:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chandan Babu R, Dave Chinner, linux-xfs

On Mon, Mar 15, 2021 at 10:41:13AM -0400, Brian Foster wrote:
> On Mon, Mar 08, 2021 at 02:24:07PM -0800, Darrick J. Wong wrote:
> > On Mon, Mar 08, 2021 at 03:18:09PM +0530, Chandan Babu R wrote:
> > > On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > The new checkpoint caceh flush mechanism requires us to issue an
> > > > unconditional cache flush before we start a new checkpoint. We don't
> > > > want to block for this if we can help it, and we have a fair chunk
> > > > of CPU work to do between starting the checkpoint and issuing the
> > > > first journal IO.
> > > >
> > > > Hence it makes sense to amortise the latency cost of the cache flush
> > > > by issuing it asynchronously and then waiting for it only when we
> > > > need to issue the first IO in the transaction.
> > > >
> > > > TO do this, we need async cache flush primitives to submit the cache
> > > > flush bio and to wait on it. THe block layer has no such primitives
> > > > for filesystems, so roll our own for the moment.
> > > >
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_linux.h  |  2 ++
> > > >  2 files changed, 38 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > > > index 17f36db2f792..668f8bd27b4a 100644
> > > > --- a/fs/xfs/xfs_bio_io.c
> > > > +++ b/fs/xfs/xfs_bio_io.c
> > > > @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> > > >  	return bio_max_segs(howmany(count, PAGE_SIZE));
> > > >  }
> > > >  
> > > > +void
> > > > +xfs_flush_bdev_async_endio(
> > > > +	struct bio	*bio)
> > > > +{
> > > > +	if (bio->bi_private)
> > > > +		complete(bio->bi_private);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Submit a request for an async cache flush to run. If the request queue does
> > > > + * not require flush operations, just skip it altogether. If the caller needsi
> > > > + * to wait for the flush completion at a later point in time, they must supply a
> > > > + * valid completion. This will be signalled when the flush completes.  The
> > > > + * caller never sees the bio that is issued here.
> > > > + */
> > > > +void
> > > > +xfs_flush_bdev_async(
> > > > +	struct bio		*bio,
> > > > +	struct block_device	*bdev,
> > > > +	struct completion	*done)
> > > > +{
> > > > +	struct request_queue	*q = bdev->bd_disk->queue;
> > > > +
> > > > +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> > > > +		complete(done);
> > > 
> > > complete() should be invoked only when "done" has a non-NULL value.
> > 
> > The only caller always provides a completion.
> > 
> 
> IMO, if the mechanism (i.e. the helper) accommodates a NULL parameter,
> the underlying completion callback should as well..

Yes, I agree with that principle.  However, the use case for !done isn't
clear to me -- what is the point of issuing a flush and not waiting for
the results?

Can PREFLUSHes generate IO errors?  And if they do, why don't we return
the error to the caller?

--D

> Brian
> 
> > --D
> > 
> > > > +		return;
> > > > +	}
> > > 
> > > -- 
> > > chandan
> > 
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space
  2021-03-11  2:00   ` Darrick J. Wong
@ 2021-03-16  3:04     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-16  3:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 06:00:45PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:42PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When xlog_sync() rounds off the tail the iclog that is being
> > flushed, it manually subtracts that space from the grant heads. This
> > space is actually reserved by the transaction ticket that covers
> > the xlog_sync() call from xlog_write(), but we don't plumb the
> > ticket down far enough for it to account for the space consumed in
> > the current log ticket.
> > 
> > The grant heads are hot, so we really should be accounting this to
> > the ticket if we can, rather than adding thousands of extra grant
> > head updates every CIL commit.
> > 
> > Interestingly, this actually indicates a potential log space overrun
> > can occur when we force the log. By the time that xfs_log_force()
> > pushes out an active iclog and consumes the roundoff space, the
> 
> Ok I was wondering about that when I was trying to figure out what all
> this ticket space stealing code was doing.
> 
> So in addition to fixing the theoretical overrun, I guess the
> performance fix here is that every time we write an iclog we might have
> to move the grant heads forward so that we always write a full log
> sector / log stripe unit?  And since a CIL context might write a lot of
> iclogs, it's cheaper to make those grant adjustments to the CIL ticket
> (which already asked for enough space to handle the roundoffs) since the
> ticket only jumps in the hot path once when the ticket is ungranted?
> 
> If I got that right,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

You got it right. :)
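
In code terms, the hot path change amounts to (a sketch; "roundoff" is
whatever xlog_sync() computed for the sector/stripe unit alignment):

	/* account the roundoff to the CIL ticket, not the grant heads */
	ticket->t_curr_res -= roundoff;

rather than calling xlog_grant_add_space() on both the reserve and
write heads for every iclog we write.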

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 45/45] xfs: expanding delayed logging design with background material
  2021-03-11  2:30   ` Darrick J. Wong
@ 2021-03-16  3:28     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-03-16  3:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 06:30:20PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:43PM +1100, Dave Chinner wrote:
> > +The method used to log an item or chain modifications together isn't
> > +particularly important in the scope of this document. It suffices to know that
> > +the method used for logging a particular object or chaining modifications
> > +together are different and are dependent on the object and/or modification being
> > +performed. The logging subsystem only cares that certain specific rules are
> > +followed to guarantee forwards progress and prevent deadlocks.
> 
> (Aww, maybe we /should/ document how we choose between intents, logical
> updates, and physical updates.  But that can come in a later patch.)

*nod*

> > +Transactions in XFS
> > +===================
> > +
> > +XFS has two types of high level transactions, defined by the type of log space
> > +reservation they take. These are known as "one shot" and "permanent"
> 
> (Ugh, I wish we could call them 'compound' or 'rolling'
> transactions....)

I guess that's a set of followup patches at some point. Calling them
'rolling transactions' everywhere would certainly be an improvement.

> > +Log Space Accounting
> > +====================
> > +
> > +The position in the log is typically referred to as a Log Sequence Number (LSN).
> > +The log is circular, so the positions in the log are defined by the combination
> > +of a cycle number - the number of times the log has been overwritten - and the
> > +offset into the log.  An LSN carries the cycle in the upper 32 bits and the
> > +offset in the lower 32 bits. The offset is in units of "basic blocks" (512
> > +bytes). Hence we can do relatively simple LSN based math to keep track of
> > +available space in the log.
> > +
> > +Log space accounting is done via a pair of constructs called "grant heads".  The
> > +position of the grant heads is an absolute value, so the amount of space
> > +available in the log is defined by the distance between the position of the
> > +grant head and the current log tail. That is, how much space can be
> > +reserved/consumed before the grant heads would fully wrap the log and overtake
> > +the tail position.
> > +
> > +The first grant head is the "reserve" head. This tracks the byte count of the
> > +reservations currently held by active transactions. It is a purely in-memory
> > +accounting of the space reservation and, as such, actually tracks byte offsets
> > +into the log rather than basic blocks. Hence it technically isn't using LSNs to
> > +represent the log position, but it is still treated like a split {cycle,offset}
> > +tuple for the purposes of tracking reservation space.
> 
> Lol, the grant head is delalloc for transactions.

Yes, in effect that's exactly what it does. The log is at ENOSPC
when we run out of reservation space, and the AIL is the garbage
collector that reclaims used space....
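
(As an aside, the {cycle,offset} packing the document describes is
just shifts and masks on a 64 bit value - a sketch, assuming the
long-standing LSN helpers:

	/* given uint cycle and uint block (512 byte basic blocks): */
	xfs_lsn_t	lsn = xlog_assign_lsn(cycle, block);

	ASSERT(CYCLE_LSN(lsn) == cycle);	/* upper 32 bits */
	ASSERT(BLOCK_LSN(lsn) == block);	/* lower 32 bits */

so the distance between two log positions is simple integer math.)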

> > @@ -67,12 +350,13 @@ the log over and over again. Worse is the fact that objects tend to get
> >  dirtier as they get relogged, so each subsequent transaction is writing more
> >  metadata into the log.
> >  
> > -Another feature of the XFS transaction subsystem is that most transactions are
> > -asynchronous. That is, they don't commit to disk until either a log buffer is
> > -filled (a log buffer can hold multiple transactions) or a synchronous operation
> > -forces the log buffers holding the transactions to disk. This means that XFS is
> > -doing aggregation of transactions in memory - batching them, if you like - to
> > -minimise the impact of the log IO on transaction throughput.
> > +It should now also be obvious how relogging and asynchronous transactions go
> > +hand in hand. That is, transactions don't get written to the physical journal
> > +until either a log buffer is filled (a log buffer can hold multiple
> > +transactions) or a synchronous operation forces the log buffers holding the
> > +transactions to disk. This means that XFS is doing aggregation of transactions
> > +in memory - batching them, if you like - to minimise the impact of the log IO on
> > +transaction throughput.
> 
> ...microtransaction fusion, yippee!

I'll have to remember that next time I play algorithm buzzword
bingo... :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 01/45] xfs: initialise attr fork on inode create
  2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
  2021-03-08 22:20   ` Darrick J. Wong
@ 2021-03-16  8:35   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

I'd still prefer the xfs_ifork_alloc to be separate, but otherwise
this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 03/45] xfs: separate CIL commit record IO
  2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
  2021-03-08  8:34   ` Chandan Babu R
  2021-03-15 14:40   ` Brian Foster
@ 2021-03-16  8:40   ` Christoph Hellwig
  2 siblings, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

I'm still worried that we add a pessimisation for devices that do not
require cache flushes here, and even more so that this tradeoff isn't
even documented in the commit log.

The actual code change using xlog_wait_on_iclog and the intent for
devices with a write cache looks fine to me.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush
  2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
                     ` (2 preceding siblings ...)
  2021-03-15 14:40   ` Brian Foster
@ 2021-03-16  8:41   ` Christoph Hellwig
  3 siblings, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a one line wrapper around blkdev_issue_flush(). Just replace it
> with direct calls to blkdev_issue_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 05/45] xfs: async blkdev cache flush
  2021-03-15 16:32         ` Darrick J. Wong
@ 2021-03-16  8:43           ` Christoph Hellwig
  0 siblings, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, Chandan Babu R, Dave Chinner, linux-xfs

On Mon, Mar 15, 2021 at 09:32:22AM -0700, Darrick J. Wong wrote:
> Yes, I agree with that principle.  However, the use case for !done isn't
> clear to me -- what is the point of issuing a flush and not waiting for
> the results?

There is none.  This check should go away.

> 
> Can PREFLUSHes generate IO errors?  And if they do, why don't we return
> the error to the caller?

The caller owns the bio and can look at bi_status itself.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally
  2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
  2021-03-15 14:43   ` Brian Foster
@ 2021-03-16  8:47   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);

.. and this would be where we'd check bio.bi_status for an error ..
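
E.g., as a sketch (assuming the on-stack bio that xlog_cil_push_work()
passed to xfs_flush_bdev_async() earlier in the push):

	wait_for_completion(&bdev_flush);
	if (bio.bi_status) {
		error = blk_status_to_errno(bio.bi_status);
		goto out_abort_free_ticket;
	}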


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write()
  2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
  2021-03-15 14:45   ` Brian Foster
@ 2021-03-16 14:15   ` Christoph Hellwig
  1 sibling, 0 replies; 145+ messages in thread
From: Christoph Hellwig @ 2021-03-16 14:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> @@ -818,9 +818,7 @@ xlog_wait_on_iclog(
>  static int
>  xlog_write_unmount_record(
>  	struct xlog		*log,
> -	struct xlog_ticket	*ticket,
> -	xfs_lsn_t		*lsn,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	struct xfs_unmount_log_format ulf = {
>  		.magic = XLOG_UNMOUNT_TYPE,
> @@ -837,7 +835,7 @@ xlog_write_unmount_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
> -	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);

The removal of the lsn argument from xlog_write_unmount_record and
making the start_lsn argument to xlog_write optional is not
documented anywhere.  I still think it would be best to split such tiny
argument-passing cleanups from a more complicated and not quite trivial
one, but at the very least it needs to be clearly documented.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 19/45] xfs: factor out the CIL transaction header building
  2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
  2021-03-08 23:47   ` Darrick J. Wong
@ 2021-03-16 14:50   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:17PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It is static code deep in the middle of the CIL push logic. Factor
> it out into a helper so that it is clear and easy to modify
> separately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log_cil.c | 71 +++++++++++++++++++++++++++++---------------
>  1 file changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index dfc9ef692a80..b515002e7959 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -651,6 +651,41 @@ xlog_cil_process_committed(
>  	}
>  }
>  
> +struct xlog_cil_trans_hdr {
> +	struct xfs_trans_header	thdr;
> +	struct xfs_log_iovec	lhdr;
> +};
> +
> +/*
> + * Build a checkpoint transaction header to begin the journal transaction.  We
> + * need to account for the space used by the transaction header here as it is
> + * not accounted for in xlog_write().
> + */
> +static void
> +xlog_cil_build_trans_hdr(
> +	struct xfs_cil_ctx	*ctx,
> +	struct xlog_cil_trans_hdr *hdr,
> +	struct xfs_log_vec	*lvhdr,
> +	int			num_iovecs)
> +{
> +	struct xlog_ticket	*tic = ctx->ticket;
> +
> +	memset(hdr, 0, sizeof(*hdr));
> +
> +	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
> +	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
> +	hdr->thdr.th_tid = tic->t_tid;
> +	hdr->thdr.th_num_items = num_iovecs;
> +	hdr->lhdr.i_addr = &hdr->thdr;
> +	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
> +	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
> +	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t);
> +

Might as well drop the typedef usages while we're here. Otherwise LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>
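
E.g., a quick sketch of the quoted hunk with the typedef'd sizeofs
converted (untested):

	hdr->lhdr.i_len = sizeof(struct xfs_trans_header);
	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(struct xlog_op_header);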

> +	lvhdr->lv_niovecs = 1;
> +	lvhdr->lv_iovecp = &hdr->lhdr;
> +	lvhdr->lv_next = ctx->lv_chain;
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -676,11 +711,9 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
> -	struct xlog_ticket	*tic;
>  	int			num_iovecs;
>  	int			error = 0;
> -	struct xfs_trans_header thdr;
> -	struct xfs_log_iovec	lhdr;
> +	struct xlog_cil_trans_hdr thdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_lsn_t		commit_lsn;
>  	xfs_lsn_t		push_seq;
> @@ -827,24 +860,8 @@ xlog_cil_push_work(
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
>  	 * transaction header here as it is not accounted for in xlog_write().
> -	 *
> -	 * The LSN we need to pass to the log items on transaction commit is
> -	 * the LSN reported by the first log vector write. If we use the commit
> -	 * record lsn then we can move the tail beyond the grant write head.
>  	 */
> -	tic = ctx->ticket;
> -	thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
> -	thdr.th_type = XFS_TRANS_CHECKPOINT;
> -	thdr.th_tid = tic->t_tid;
> -	thdr.th_num_items = num_iovecs;
> -	lhdr.i_addr = &thdr;
> -	lhdr.i_len = sizeof(xfs_trans_header_t);
> -	lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
> -	tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t);
> -
> -	lvhdr.lv_niovecs = 1;
> -	lvhdr.lv_iovecp = &lhdr;
> -	lvhdr.lv_next = ctx->lv_chain;
> +	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -852,7 +869,13 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
> +	/*
> +	 * The LSN we need to pass to the log items on transaction commit is the
> +	 * LSN reported by the first log vector write, not the commit lsn. If we
> +	 * use the commit record lsn then we can move the tail beyond the grant
> +	 * write head.
> +	 */
> +	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
>  				XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
> @@ -891,11 +914,11 @@ xlog_cil_push_work(
>  	}
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);
> +	error = xlog_commit_record(log, ctx->ticket, &commit_iclog, &commit_lsn);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> -	xfs_log_ticket_ungrant(log, tic);
> +	xfs_log_ticket_ungrant(log, ctx->ticket);
>  
>  	spin_lock(&commit_iclog->ic_callback_lock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
> @@ -946,7 +969,7 @@ xlog_cil_push_work(
>  	return;
>  
>  out_abort_free_ticket:
> -	xfs_log_ticket_ungrant(log, tic);
> +	xfs_log_ticket_ungrant(log, ctx->ticket);
>  out_abort:
>  	ASSERT(XLOG_FORCED_SHUTDOWN(log));
>  	xlog_cil_committed(ctx);
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 20/45] xfs: only CIL pushes require a start record
  2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
  2021-03-09  0:07   ` Darrick J. Wong
@ 2021-03-16 14:51   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:18PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So move the one-off start record writing in xlog_write() out into
> the static header that the CIL push builds to write into the log
> initially. This simplifies the xlog_write() logic a lot.
> 
> pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
> 32 bit aligned and packed for copying the log op and transaction
> headers directly into the log as a single log region copy.
> 
> struct xlog_cil_trans_hdr {
> 	struct xlog_op_header      oph[2];               /*     0    24 */
> 	struct xfs_trans_header    thdr;                 /*    24    16 */
> 	struct xfs_log_iovec       lhdr;                 /*    40    16 */
> 
> 	/* size: 56, cachelines: 1, members: 3 */
> 	/* last cacheline: 56 bytes */
> };

FWIW, this doesn't match the structure defined in the code.

> 
> A wart is needed to handle the fact that the length of the region the
> opheader points to doesn't include the opheader length. Hence if
> we embed the opheader, we have to subtract the opheader length from
> the length written into the opheader by the generic copying code.
> This will eventually go away when everything is converted to
> embedded opheaders.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c     | 90 ++++++++++++++++++++++----------------------
>  fs/xfs/xfs_log_cil.c | 44 ++++++++++++++++++----
>  2 files changed, 81 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f54d48f4584e..b2f9fb1b4fed 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
...
> @@ -2425,25 +2420,24 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * Before we start formatting log vectors, we need to
> -			 * write a start record. Only do this for the first
> -			 * iclog we write to.
> +			 * The XLOG_START_TRANS has embedded ophdrs for the
> +			 * start record and transaction header. They will always
> +			 * be the first two regions in the lv chain.
>  			 */
>  			if (optype & XLOG_START_TRANS) {
> -				xlog_write_start_rec(ptr, ticket);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						sizeof(struct xlog_op_header));
> -				optype &= ~XLOG_START_TRANS;
> -				wrote_start_rec = true;
> -			}
> -
> -			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
> -			if (!ophdr)
> -				return -EIO;
> +				ophdr = reg->i_addr;
> +				if (index)
> +					optype &= ~XLOG_START_TRANS;

So ophdr points to the lv memory in this case, but we're going to memcpy
this into iclog anyways.

Presumably the index check is intended to track processing the first lv
in the chain (with the two embedded headers). That seems Ok, but flakey
enough that I hope it doesn't survive the end of the series.

> +			} else {
> +				ophdr = xlog_write_setup_ophdr(log, ptr,
> +							ticket, optype);
> +				if (!ophdr)
> +					return -EIO;
>  
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset,
> +				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
> -
> +				added_ophdr = true;
> +			}
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b515002e7959..e9da074ecd69 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -652,14 +652,22 @@ xlog_cil_process_committed(
>  }
>  
>  struct xlog_cil_trans_hdr {
> +	struct xlog_op_header	oph[2];
>  	struct xfs_trans_header	thdr;
> -	struct xfs_log_iovec	lhdr;
> +	struct xfs_log_iovec	lhdr[2];
>  };
...

This is all hairy enough that I think it's helpful to at least separate
the two vectors from crossing inside an array boundary. For example,
something like the appended diff (untested).

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index e9da074ecd69..76cb82f1142e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -651,10 +651,16 @@ xlog_cil_process_committed(
 	}
 }
 
+/*
+ * Consolidated structure for the first two iovecs in a CIL checkpoint.
+ */
 struct xlog_cil_trans_hdr {
-	struct xlog_op_header	oph[2];
-	struct xfs_trans_header	thdr;
-	struct xfs_log_iovec	lhdr[2];
+	struct xlog_op_header	op;	/* log start record */
+	struct {			/* trans header*/
+		struct xlog_op_header	op;
+		struct xfs_trans_header	thdr;
+	} t;
+	struct xfs_log_iovec	lhdr[2];/* region pointers for embedded hdrs */
 };
 
 /*
@@ -682,27 +688,27 @@ xlog_cil_build_trans_hdr(
 	memset(hdr, 0, sizeof(*hdr));
 
 	/* Log start record */
-	hdr->oph[0].oh_tid = tid;
-	hdr->oph[0].oh_clientid = XFS_TRANSACTION;
-	hdr->oph[0].oh_flags = XLOG_START_TRANS;
+	hdr->op.oh_tid = tid;
+	hdr->op.oh_clientid = XFS_TRANSACTION;
+	hdr->op.oh_flags = XLOG_START_TRANS;
 
 	/* log iovec region pointer */
-	hdr->lhdr[0].i_addr = &hdr->oph[0];
+	hdr->lhdr[0].i_addr = &hdr->op;
 	hdr->lhdr[0].i_len = sizeof(struct xlog_op_header);
 	hdr->lhdr[0].i_type = XLOG_REG_TYPE_LRHEADER;
 
 	/* log opheader */
-	hdr->oph[1].oh_tid = tid;
-	hdr->oph[1].oh_clientid = XFS_TRANSACTION;
+	hdr->t.op.oh_tid = tid;
+	hdr->t.op.oh_clientid = XFS_TRANSACTION;
 
 	/* transaction header */
-	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
-	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
-	hdr->thdr.th_tid = tid;
-	hdr->thdr.th_num_items = num_iovecs;
+	hdr->t.thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
+	hdr->t.thdr.th_type = XFS_TRANS_CHECKPOINT;
+	hdr->t.thdr.th_tid = tid;
+	hdr->t.thdr.th_num_items = num_iovecs;
 
 	/* log iovec region pointer */
-	hdr->lhdr[1].i_addr = &hdr->oph[1];
+	hdr->lhdr[1].i_addr = &hdr->t.op;
 	hdr->lhdr[1].i_len = sizeof(struct xlog_op_header) +
 				sizeof(struct xfs_trans_header);
 	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* Re: [PATCH 23/45] xfs: log tickets don't need log client id
  2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
  2021-03-09  0:21   ` Darrick J. Wong
@ 2021-03-16 14:51   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently set the log ticket client ID when we reserve a
> transaction. This client ID is only ever written to the log by
> CIL checkpoint or unmount records, and so anything using a high
> level transaction allocated through xfs_trans_alloc() does not need
> a log ticket client ID to be set.
> 
> For the CIL checkpoint, the client ID written to the journal is
> always XFS_TRANSACTION, and for the unmount record it is always
> XFS_LOG, and nothing else writes to the log. All of these operations
> tell xlog_write() exactly what they need to write to the log (the
> optype) and build their own opheaders for start, commit and unmount
> records. Hence we no longer need to set the client id in either the
> log ticket or the xfs_trans.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

LGTM with Darrick's suggested feedback:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log.c      | 47 ++++++++-----------------------------------
>  fs/xfs/xfs_log.h      | 16 ++++++---------
>  fs/xfs/xfs_log_cil.c  |  2 +-
>  fs/xfs/xfs_log_priv.h | 10 ++-------
>  fs/xfs/xfs_trans.c    |  6 ++----
>  5 files changed, 19 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index c2e69a1f5cad..429cb1e7cc67 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -431,10 +431,9 @@ xfs_log_regrant(
>  int
>  xfs_log_reserve(
>  	struct xfs_mount	*mp,
> -	int		 	unit_bytes,
> -	int		 	cnt,
> +	int			unit_bytes,
> +	int			cnt,
>  	struct xlog_ticket	**ticp,
> -	uint8_t		 	client,
>  	bool			permanent)
>  {
>  	struct xlog		*log = mp->m_log;
> @@ -442,15 +441,13 @@ xfs_log_reserve(
>  	int			need_bytes;
>  	int			error = 0;
>  
> -	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
> -
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
>  	XFS_STATS_INC(mp, xs_try_logspace);
>  
>  	ASSERT(*ticp == NULL);
> -	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent);
> +	tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> @@ -847,7 +844,7 @@ xlog_unmount_write(
>  	struct xlog_ticket	*tic = NULL;
>  	int			error;
>  
> -	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> +	error = xfs_log_reserve(mp, 600, 1, &tic, 0);
>  	if (error)
>  		goto out_err;
>  
> @@ -2170,35 +2167,13 @@ xlog_write_calc_vec_length(
>  
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
> -	struct xlog		*log,
>  	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -	ophdr->oh_clientid = ticket->t_clientid;
> +	ophdr->oh_clientid = XFS_TRANSACTION;
>  	ophdr->oh_res2 = 0;
> -
> -	/* are we copying a commit or unmount record? */
> -	ophdr->oh_flags = flags;
> -
> -	/*
> -	 * We've seen logs corrupted with bad transaction client ids.  This
> -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> -	 * and shut down the filesystem.
> -	 */
> -	switch (ophdr->oh_clientid)  {
> -	case XFS_TRANSACTION:
> -	case XFS_VOLUME:
> -	case XFS_LOG:
> -		break;
> -	default:
> -		xfs_warn(log->l_mp,
> -			"Bad XFS transaction clientid 0x%x in ticket "PTR_FMT,
> -			ophdr->oh_clientid, ticket);
> -		return NULL;
> -	}
> -
> +	ophdr->oh_flags = 0;
>  	return ophdr;
>  }
>  
> @@ -2439,11 +2414,7 @@ xlog_write(
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
>  			} else {
> -				ophdr = xlog_write_setup_ophdr(log, ptr,
> -							ticket, optype);
> -				if (!ophdr)
> -					return -EIO;
> -
> +                                ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
> @@ -3499,7 +3470,6 @@ xlog_ticket_alloc(
>  	struct xlog		*log,
>  	int			unit_bytes,
>  	int			cnt,
> -	char			client,
>  	bool			permanent)
>  {
>  	struct xlog_ticket	*tic;
> @@ -3517,7 +3487,6 @@ xlog_ticket_alloc(
>  	tic->t_cnt		= cnt;
>  	tic->t_ocnt		= cnt;
>  	tic->t_tid		= prandom_u32();
> -	tic->t_clientid		= client;
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 1bd080ce3a95..c0c3141944ea 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -117,16 +117,12 @@ int	  xfs_log_mount_finish(struct xfs_mount *mp);
>  void	xfs_log_mount_cancel(struct xfs_mount *);
>  xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
>  xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
> -void	  xfs_log_space_wake(struct xfs_mount *mp);
> -int	  xfs_log_reserve(struct xfs_mount *mp,
> -			  int		   length,
> -			  int		   count,
> -			  struct xlog_ticket **ticket,
> -			  uint8_t		   clientid,
> -			  bool		   permanent);
> -int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> -void      xfs_log_unmount(struct xfs_mount *mp);
> -int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
> +void	xfs_log_space_wake(struct xfs_mount *mp);
> +int	xfs_log_reserve(struct xfs_mount *mp, int length, int count,
> +			struct xlog_ticket **ticket, bool permanent);
> +int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> +void	xfs_log_unmount(struct xfs_mount *mp);
> +int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
>  bool	xfs_log_writable(struct xfs_mount *mp);
>  
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index e9da074ecd69..0c81c13e2cf6 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
>  {
>  	struct xlog_ticket *tic;
>  
> -	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0);
> +	tic = xlog_ticket_alloc(log, 0, 1, 0);
>  
>  	/*
>  	 * set the current reservation to zero so we know to steal the basic
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index bb5fa6b71114..7f601c1c9f45 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -158,7 +158,6 @@ typedef struct xlog_ticket {
>  	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
>  	char		   t_ocnt;	 /* original count		 : 1  */
>  	char		   t_cnt;	 /* current count		 : 1  */
> -	char		   t_clientid;	 /* who does this belong to;	 : 1  */
>  	char		   t_flags;	 /* properties of reservation	 : 1  */
>  
>          /* reservation array fields */
> @@ -465,13 +464,8 @@ extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
>  			    char *dp, int size);
>  
>  extern kmem_zone_t *xfs_log_ticket_zone;
> -struct xlog_ticket *
> -xlog_ticket_alloc(
> -	struct xlog	*log,
> -	int		unit_bytes,
> -	int		count,
> -	char		client,
> -	bool		permanent);
> +struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes,
> +		int count, bool permanent);
>  
>  static inline void
>  xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 52f3fdf1e0de..83c2b7f22eb7 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -194,11 +194,9 @@ xfs_trans_reserve(
>  			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
>  			error = xfs_log_regrant(mp, tp->t_ticket);
>  		} else {
> -			error = xfs_log_reserve(mp,
> -						resp->tr_logres,
> +			error = xfs_log_reserve(mp, resp->tr_logres,
>  						resp->tr_logcount,
> -						&tp->t_ticket, XFS_TRANSACTION,
> -						permanent);
> +						&tp->t_ticket, permanent);
>  		}
>  
>  		if (error)
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 24/45] xfs: move log iovec alignment to preparation function
  2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
  2021-03-09  2:14   ` Darrick J. Wong
@ 2021-03-16 14:51   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:22PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To include log op headers directly into the log iovec regions that
> the ophdrs wrap, we need to move the buffer alignment code from
> xlog_finish_iovec() to xlog_prepare_iovec(). This is because the
> xlog_op_header is only 12 bytes long, and we need the buffer that
> the caller formats their data into to be 8 byte aligned.
> 
> Hence once we start prepending the ophdr in xlog_prepare_iovec(), we
> are going to need to manage the padding directly to ensure that the
> buffer pointer returned is correctly aligned.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log.h | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index c0c3141944ea..1ca4f2edbdaf 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -21,6 +21,16 @@ struct xfs_log_vec {
>  
>  #define XFS_LOG_VEC_ORDERED	(-1)
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + */
>  static inline void *
>  xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		uint type)
> @@ -34,6 +44,9 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		vec = &lv->lv_iovecp[0];
>  	}
>  
> +	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
> +		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
> +
>  	vec->i_type = type;
>  	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
>  
> @@ -43,20 +56,10 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  	return vec->i_addr;
>  }
>  
> -/*
> - * We need to make sure the next buffer is naturally aligned for the biggest
> - * basic data type we put into it.  We already accounted for this padding when
> - * sizing the buffer.
> - *
> - * However, this padding does not get written into the log, and hence we have to
> - * track the space used by the log vectors separately to prevent log space hangs
> - * due to inaccurate accounting (i.e. a leak) of the used log space through the
> - * CIL context ticket.
> - */
>  static inline void
>  xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
>  {
> -	lv->lv_buf_len += round_up(len, sizeof(uint64_t));
> +	lv->lv_buf_len += len;
>  	lv->lv_bytes += len;
>  	vec->i_len = len;
>  }
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
  2021-03-09  2:21   ` Darrick J. Wong
@ 2021-03-16 14:53   ` Brian Foster
  2021-05-19  3:18     ` Dave Chinner
  1 sibling, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Current xlog_write() adds op headers to the log manually for every
> log item region that is in the vector passed to it. While
> xlog_write() needs to stamp the transaction ID into the ophdr, we
> already know its length, flags, clientid, etc at CIL commit time.
> 
> This means the only time that xlog write really needs to format and
> reserve space for a new ophdr is when a region is split across two
> iclogs. Adding the opheader and accounting for it as part of the
> normal formatted item region means we simplify the accounting
> of space used by a transaction and we don't have to special case
> reserving of space in for the ophdrs in xlog_write(). It also means
> we can largely initialise the ophdr in transaction commit instead
> of xlog_write, making the xlog_write formatting inner loop much
> tighter.
> 
> xlog_prepare_iovec() is now too large to stay as an inline function,
> so we move it out of line and into xfs_log.c.
> 
> Object sizes:
> text	   data	    bss	    dec	    hex	filename
> 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> 
> So the code is roughly 2.5kB smaller with xlog_prepare_iovec() now
> out of line, even though it grew in size itself.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks mostly reasonable, a couple or so questions...

>  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_log.h     |  42 +++-------------
>  fs/xfs/xfs_log_cil.c |  25 +++++-----
>  3 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 429cb1e7cc67..98de45be80c0 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -89,6 +89,62 @@ xlog_iclogs_empty(
>  static int
>  xfs_log_cover(struct xfs_mount *);
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + *
> + * We also add space for the xlog_op_header that describes this region in the
> + * log. This prepends the data region we return to the caller to copy their data
> + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> + * is not 8 byte aligned, we have to be careful to ensure that we align the
> + * start of the buffer such that the region we return to the caller is 8 byte
> + * aligned and packed against the tail of the ophdr.
> + */
> +void *
> +xlog_prepare_iovec(
> +	struct xfs_log_vec	*lv,
> +	struct xfs_log_iovec	**vecp,
> +	uint			type)
> +{
> +	struct xfs_log_iovec	*vec = *vecp;
> +	struct xlog_op_header	*oph;
> +	uint32_t		len;
> +	void			*buf;
> +
> +	if (vec) {
> +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> +		vec++;
> +	} else {
> +		vec = &lv->lv_iovecp[0];
> +	}
> +
> +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> +					sizeof(struct xlog_op_header);
> +	}
> +
> +	vec->i_type = type;
> +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> +
> +	oph = vec->i_addr;
> +	oph->oh_clientid = XFS_TRANSACTION;
> +	oph->oh_res2 = 0;
> +	oph->oh_flags = 0;
> +
> +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));

Why is it the buffer portion needs to be 8 byte aligned but not ->i_addr
itself?

> +
> +	*vecp = vec;
> +	return buf;
> +}
> +

Not worth changing now, but it's helpful to reduce the size of the patch
by separating out mechanical changes like moving functions around.

>  static void
>  xlog_grant_sub_space(
>  	struct xlog		*log,
...
> @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
>  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
>  		}
>  	}
> -
> -	/* Don't account for regions with embedded ophdrs */
> -	if (optype && headers > 0) {
> -		headers--;
> -		if (optype & XLOG_START_TRANS) {
> -			ASSERT(headers >= 1);
> -			headers--;
> -		}
> -	}
> -
>  	ticket->t_res_num_ophdrs += headers;
> -	len += headers * sizeof(struct xlog_op_header);

Hm, this seems to suggest something was off wrt ->t_res_num_ophdrs
prior to this change.  Granted this looks like it's just a debug field,
but the previous logic filtered out embedded op headers unconditionally
whereas now it looks like we go back to accounting them. Am I missing
something?

>  
>  	return len;
>  }
...
> @@ -2404,21 +2448,25 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * The XLOG_START_TRANS has embedded ophdrs for the
> -			 * start record and transaction header. They will always
> -			 * be the first two regions in the lv chain. Commit and
> -			 * unmount records also have embedded ophdrs.
> +			 * Regions always have their ophdr at the start of the
> +			 * region, except for:
> +			 * - a transaction start which has a start record ophdr
> +			 *   before the first region ophdr; and
> +			 * - the previous region didn't fully fit into an iclog
> +			 *   so needs a continuation ophdr to prepend the region
> +			 *   in this new iclog.
>  			 */
> -			if (optype) {
> -				ophdr = reg->i_addr;
> -				if (index)
> -					optype &= ~XLOG_START_TRANS;
> -			} else {
> +			ophdr = reg->i_addr;
> +			if (optype && index) {
> +				optype &= ~XLOG_START_TRANS;
> +			} else if (partial_copy) {
>                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
>  			}

So in the partial_copy continuation case we're still stamping an ophdr
directly into the iclog. Otherwise we're processing/modifying flags and
whatnot on the ophdr already stamped at commit time in the log vector.
However, this is Ok because a relog would reformat the op header.

> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
> @@ -2436,20 +2484,11 @@ xlog_write(
>  				ophdr->oh_len = cpu_to_be32(copy_len -
>  						sizeof(struct xlog_op_header));
>  			}
> -			/*
> -			 * Copy region.
> -			 *
> -			 * Commit records just log an opheader, so
> -			 * we can have empty payloads with no data region to
> -			 * copy.  Hence we only copy the payload if the vector
> -			 * says it has data to copy.
> -			 */
> -			ASSERT(copy_len >= 0);
> -			if (copy_len > 0) {
> -				memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						   copy_len);
> -			}
> +
> +			ASSERT(copy_len > 0);
> +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> +

I assume the checks in xlog_write_copy_finish() that require a minimum of
one op header worth of space in the iclog prevent doing a partial write
across an embedded op header boundary, but it would be nice to have an
assert or something that ensures that. For example, assert that if a
partial_copy occurs, partial_copy_len was at least the length of an op
header into the region, as in the sketch below.
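
Something along these lines, using the local names in xlog_write()
(untested):

	ASSERT(!partial_copy ||
	       partial_copy_len >= sizeof(struct xlog_op_header));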

>  			if (added_ophdr)
>  				copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0c81c13e2cf6..7a5e6bdb7876 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
>  		}
>  
>  		/*
> -		 * We 64-bit align the length of each iovec so that the start
> -		 * of the next one is naturally aligned.  We'll need to
> -		 * account for that slack space here. Then round nbytes up
> -		 * to 64-bit alignment so that the initial buffer alignment is
> -		 * easy to calculate and verify.
> +		 * We 64-bit align the length of each iovec so that the start of
> +		 * the next one is naturally aligned.  We'll need to account for
> +		 * that slack space here.
> +		 *

Related to my question above, I'm a little confused by the (preexisting)
comment. If the start of the next iovec is now the ophdr, doesn't that
mean the "start of the next one (iovec)" is technically no longer
naturally aligned?

Brian

> +		 * We also add the xlog_op_header to each region when
> +		 * formatting, but that's not accounted to the size of the item
> +		 * at this point. Hence we'll need an additional number of bytes
> +		 * for each vector to hold an opheader.
> +		 *
> +		 * Then round nbytes up to 64-bit alignment so that the initial
> +		 * buffer alignment is easy to calculate and verify.
>  		 */
> -		nbytes += niovecs * sizeof(uint64_t);
> +		nbytes += niovecs *
> +			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
>  		nbytes = round_up(nbytes, sizeof(uint64_t));
>  
>  		/*
> @@ -433,11 +440,6 @@ xlog_cil_insert_items(
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> -	/* account for space used by new iovec headers  */
> -	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> -	len += iovhdr_res;
> -	ctx->nvecs += diff_iovecs;
> -
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> @@ -469,6 +471,7 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> +	ctx->nvecs += diff_iovecs;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-11  3:41       ` Darrick J. Wong
@ 2021-03-16 14:54         ` Brian Foster
  0 siblings, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Mar 10, 2021 at 07:41:14PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 12, 2021 at 02:29:32PM +1100, Dave Chinner wrote:
> > On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> > > On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Current xlog_write() adds op headers to the log manually for every
> > > > log item region that is in the vector passed to it. While
> > > > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > > > already know its length, flags, clientid, etc at CIL commit time.
> > > > 
> > > > This means the only time that xlog write really needs to format and
> > > > reserve space for a new ophdr is when a region is split across two
> > > > iclogs. Adding the opheader and accounting for it as part of the
> > > > normal formatted item region means we simplify the accounting
> > > > of space used by a transaction and we don't have to special case
> > > > reserving space for the ophdrs in xlog_write(). It also means
> > > > we can largely initialise the ophdr in transaction commit instead
> > > > of xlog_write, making the xlog_write formatting inner loop much
> > > > tighter.
> > > > 
> > > > xlog_prepare_iovec() is now too large to stay as an inline function,
> > > > so we move it out of line and into xfs_log.c.
> > > > 
> > > > Object sizes:
> > > > text	   data	    bss	    dec	    hex	filename
> > > > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > > > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > > > 
> > > > So the code is roughly 2.5kB smaller with xlog_prepare_iovec() now
> > > > out of line, even though it grew in size itself.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Sooo... if I understand this part of the patchset correctly, the goal
> > > here is to simplify and shorten the inner loop of xlog_write.
> > 
> > That's one of the goals. The other goal is to avoid needing to
> > account for log op headers separately in the high level CIL commit
> > code.
> > 
> > > Callers
> > > are now required to create their own log op headers at the start of the
> > > xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> > > xlog_write has to create an ophdr is when we fill up the current iclog
> > > and must continue in a new one, because that's not something the callers
> > > should ever have to know about.  Correct?
> > 
> > Yes.
> > 
> > > If so,
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Thanks!
> > 
> > > It /really/ would have been nice to have kept these patches separated by
> > > major functional change area (i.e. separate series) instead of one
> > > gigantic 45-patch behemoth to intimidate the reviewers...
> > 
> > How is that any different from sending out 6-7 separate dependent
> > patchsets one immediately after another?  A change to one patch in
> > one series results in needing to rebase at least one patch in each
> > of the smaller patchsets, so I've still got to treat them all as one
> > big patchset in my development trees. Then I have to start
> > reposting patchsets just because another patchset was changed, and
> > that gets even more confusing trying to work out what patchset goes
> > with which version and so on. It's much easier for me to manage them
> > as a single patchset....
> 
> Well, ok, but it would have been nice for the cover letter to give
> /some/ hint as to what's changing in various subranges, e.g.
> 
> "Patches 32-36 reduce the xc_cil_lock critical sections,
>  Patches 37-41 create per-cpu cil structures and move log items and
>        vectors to use them,
>  Patches 42-44 are more cleanups,
>  Patch 45 documents the whole mess."
> 
> So I could see the outlines of where the 45 patches were going.
> 

Agreed. The purpose of separate patch series is to facilitate upstream
review and patch processing. This series strikes me as not only separate
logical changes, but changes probably with different trajectories toward
merge as well. E.g., do we expect to land this whole series together at
the same time? That would seem... unwise.

If not (or if we don't otherwise want to unnecessarily delay the earlier
parts of the series until the whole percpu cil thing at the end is
worked out), then I think it probably makes sense to split off into
three or so subseries. The first can cover the log flush optimizations
and whatever one off fixes that are all probably close to merge-worthy,
the second can cover this op header formatting rework and associated
cleanups, and the last covers all of the percpu stuff at the end. If
there's a real concern over rebase churn, there's probably no huge need
to respin the entire collection on every review cycle of one of the
earlier subseries.

Brian

> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/45] xfs: log ticket region debug is largely useless
  2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
  2021-03-09  2:31   ` Darrick J. Wong
@ 2021-03-16 14:55   ` Brian Foster
  2021-05-19  3:27     ` Dave Chinner
  1 sibling, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-16 14:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:24PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xlog_tic_add_region() is used to trace the regions being added to a
> log ticket to provide information in the situation where a ticket
> reservation overrun occurs. The information gathered is stored in
> the ticket, and dumped if xlog_print_tic_res() is called.
> 
> For a front end struct xfs_trans overrun, the ticket only contains
> reservation tracking information - the ticket is never handed to the
> log so has no regions attached to it. The overrun debug information in this
> case comes from xlog_print_trans(), which walks the items attached
> to the transaction and dumps their attached formatted log vectors
> directly. It also dumps the ticket state, but that only contains
> reservation accounting and nothing else. Hence xlog_print_tic_res()
> never dumps region or overrun information from this path.
> 
> xlog_tic_add_region() is actually called from xlog_write(), which
> means it is being used to track the regions seen in a
> CIL checkpoint log vector chain. In looking at CIL behaviour
> recently, I've seen 32MB checkpoints regularly exceed 250,000
> regions in the LV chain. The log ticket debug code can track *15*
> regions. IOWs, if there is a ticket overrun in the CIL code, the
> ticket region tracking code is going to be completely useless for
> determining what went wrong. The only thing it can tell us is how
> much of an overrun occurred, and we really don't need extra debug
> information in the log ticket to tell us that.
> 
> Indeed, the main place we call xlog_tic_add_region() is also adding
> up the number of regions and the space used so that xlog_write()
> knows how much will be written to the log. This is exactly the same
> information that the log ticket is storing once we take away the useless
> region tracking array. Hence xlog_tic_add_region() is not useful,
> but can be called 250,000 times a CIL push...
> 
> Just strip all that debug "information" out of the of the log ticket
> and only have it report reservation space information when an
> overrun occurs. This also reduces the size of a log ticket down by
> about 150 bytes...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log.c      | 107 +++---------------------------------------
>  fs/xfs/xfs_log_priv.h |  17 -------
>  2 files changed, 6 insertions(+), 118 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 7f601c1c9f45..8ee6a5f74396 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -139,16 +139,6 @@ enum xlog_iclog_state {
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  

This is unused now.

> -/*
> - * Reservation region
> - * As would be stored in xfs_log_iovec but without the i_addr which
> - * we don't care about.
> - */
> -typedef struct xlog_res {
> -	uint	r_len;	/* region length		:4 */
> -	uint	r_type;	/* region's transaction type	:4 */
> -} xlog_res_t;
> -
>  typedef struct xlog_ticket {
>  	struct list_head   t_queue;	 /* reserve/write queue */
>  	struct task_struct *t_task;	 /* task that owns this ticket */
> @@ -159,13 +149,6 @@ typedef struct xlog_ticket {
>  	char		   t_ocnt;	 /* original count		 : 1  */
>  	char		   t_cnt;	 /* current count		 : 1  */
>  	char		   t_flags;	 /* properties of reservation	 : 1  */
> -
> -        /* reservation array fields */
> -	uint		   t_res_num;                    /* num in array : 4 */
> -	uint		   t_res_num_ophdrs;		 /* num op hdrs  : 4 */

I'm curious why we wouldn't want to retain the ophdr count..? That's
managed separately from the _add_region() bits and provides some info on
the total number of vectors, etc. Otherwise looks reasonable.
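
That would just mean keeping this one field from the removed hunk
(sketch):

	uint		   t_res_num_ophdrs;	/* num op hdrs  : 4 */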

Brian

> -	uint		   t_res_arr_sum;		 /* array sum    : 4 */
> -	uint		   t_res_o_flow;		 /* sum overflow : 4 */
> -	xlog_res_t	   t_res_arr[XLOG_TIC_LEN_MAX];  /* array of res : 8 * 15 */ 
>  } xlog_ticket_t;
>  
>  /*
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 27/45] xfs: pass lv chain length into xlog_write()
  2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
  2021-03-09  2:36   ` Darrick J. Wong
@ 2021-03-16 18:38   ` Brian Foster
  1 sibling, 0 replies; 145+ messages in thread
From: Brian Foster @ 2021-03-16 18:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The caller of xlog_write() usually has a close accounting of the
> aggregated vector length contained in the log vector chain passed to
> xlog_write(). There is no need to iterate the chain to calculate the
> length of the data in xlog_write_calculate_len() if the caller is
> already iterating that chain to build it.
> 
> Passing in the vector length avoids doing an extra chain iteration,
> which can be a significant amount of work given that large CIL
> commits can have hundreds of thousands of vectors attached to the
> chain.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
>  fs/xfs/xfs_log_cil.c  | 18 +++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  3 files changed, 20 insertions(+), 37 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 7a5e6bdb7876..34abc3bae587 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -893,6 +898,9 @@ xlog_cil_push_work(
>  	 * transaction header here as it is not accounted for in xlog_write().
>  	 */
>  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> +	num_iovecs += lvhdr.lv_niovecs;

What's the point of this if num_iovecs is only used by
xlog_cil_build_trans_hdr()?
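
I.e., presumably the increment could just be dropped (sketch):

	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
	num_bytes += lvhdr.lv_bytes;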

Brian

> +	num_bytes += lvhdr.lv_bytes;
> +
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -907,7 +915,7 @@ xlog_cil_push_work(
>  	 * write head.
>  	 */
>  	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS);
> +				XLOG_START_TRANS, num_bytes);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 8ee6a5f74396..003c11653955 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -462,7 +462,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint optype);
> +		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 28/45] xfs: introduce xlog_write_single()
  2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
  2021-03-09  2:39   ` Darrick J. Wong
@ 2021-03-16 18:39   ` Brian Foster
  2021-05-19  3:44     ` Dave Chinner
  1 sibling, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-16 18:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:26PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce an optimised version of xlog_write() that is used when the
> entire write will fit in a single iclog. This greatly simplifies the
> implementation of writing a log vector chain into an iclog, and sets
> the ground work for a much more understandable xlog_write()
> implementation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 22f97914ab99..590c1e6db475 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2214,6 +2214,52 @@ xlog_write_copy_finish(
>  	return error;
>  }
>  
> +/*
> + * Write log vectors into a single iclog which is guaranteed by the caller
> + * to have enough space to write the entire log vector into. Return the number
> + * of log vectors written into the iclog.
> + */
> +static int
> +xlog_write_single(
> +	struct xfs_log_vec	*log_vector,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	*iclog,
> +	uint32_t		log_offset,
> +	uint32_t		len)
> +{
> +	struct xfs_log_vec	*lv = log_vector;

This is initialized here and in the loop below.
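
I.e. the initializer could simply go away, since the loop below assigns
lv before its first use (sketch):

	struct xfs_log_vec	*lv;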

> +	void			*ptr;
> +	int			index = 0;
> +	int			record_cnt = 0;
> +
> +	ASSERT(log_offset + len <= iclog->ic_size);
> +
> +	ptr = iclog->ic_datap + log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
> +
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));

Perhaps we should retain the xlog_verify_dest_ptr() call here? It's
DEBUG code and otherwise compiled out, so it shouldn't impact production
builds.
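
For instance (a sketch, untested; note xlog_write_single() doesn't take
the log pointer in this version, so it would need to be passed in):

	xlog_verify_dest_ptr(log, ptr);
	memcpy(ptr, reg->i_addr, reg->i_len);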

> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> +			record_cnt++;
> +		}
> +	}
> +	ASSERT(len == 0);
> +	return record_cnt;
> +}
> +
> +
>  /*
>   * Write some region out to in-core log
>   *
> @@ -2294,7 +2340,6 @@ xlog_write(
>  			return error;
>  
>  		ASSERT(log_offset <= iclog->ic_size - 1);
> -		ptr = iclog->ic_datap + log_offset;
>  
>  		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
> @@ -2311,10 +2356,20 @@ xlog_write(
>  						XLOG_ICL_NEED_FUA);
>  		}
>  
> +		/* If this is a single iclog write, go fast... */
> +		if (!contwr && lv == log_vector) {
> +			record_cnt = xlog_write_single(lv, ticket, iclog,
> +						log_offset, len);
> +			len = 0;

I assume this is here to satisfy the assert further down in the
function.. This seems a bit contrived when you consider we pass len to
the helper, the helper reduces it and asserts that it goes to zero, then
we do so again here just for another assert. Unless this is all just
removed later, it might be more straightforward to pass a reference.
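
Roughly (a sketch, untested), with xlog_write_single() taking a
uint32_t *len that it decrements through xlog_write_adv_cnt() as it
already does internally:

	record_cnt = xlog_write_single(lv, ticket, iclog, log_offset, &len);
	ASSERT(len == 0);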

> +			data_cnt = len;

Similarly, this looks a bit odd because it seems data_cnt should be zero
in the case where contwr == 0. xlog_state_get_iclog_space() has already
bumped ->ic_offset by len (so xlog_state_finish_copy() doesn't need to
via data_cnt).

Brian

> +			break;
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
>  		 */
> +		ptr = iclog->ic_datap + log_offset;
>  		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
>  			struct xfs_log_iovec	*reg;
>  			struct xlog_op_header	*ophdr;
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/45] xfs:_introduce xlog_write_partial()
  2021-03-05  5:11 ` [PATCH 29/45] xfs:_introduce xlog_write_partial() Dave Chinner
  2021-03-09  2:59   ` Darrick J. Wong
@ 2021-03-18 13:22   ` Brian Foster
  2021-05-19  4:49     ` Dave Chinner
  1 sibling, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-03-18 13:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Handle writing of a logvec chain into an iclog that doesn't have
> enough space to fit it all. The iclog has already been changed to
> WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> in the iclog is exclusively owned by this logvec chain.
> 
> The difference between the single and partial cases is that
> we end up with partial iovec writes in the iclog and have to split
> a log vec regions across two iclogs. The state handling for this is
> currently awful and so we're building up the pieces needed to
> handle this more cleanly one at a time.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

FWIW, git --patience mode generates a more readable diff for this patch
than what it generates by default. I'm referring to that locally and
will try to leave feedback at the appropriate points here.

>  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
>  1 file changed, 251 insertions(+), 274 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 590c1e6db475..10916b99bf0f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2099,166 +2099,250 @@ xlog_print_trans(
>  	}
>  }
>  
> -static xlog_op_header_t *
> -xlog_write_setup_ophdr(
> -	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket)
> -{
> -	ophdr->oh_clientid = XFS_TRANSACTION;
> -	ophdr->oh_res2 = 0;
> -	ophdr->oh_flags = 0;
> -	return ophdr;
> -}
> -
>  /*
> - * Set up the parameters of the region copy into the log. This has
> - * to handle region write split across multiple log buffers - this
> - * state is kept external to this function so that this code can
> - * be written in an obvious, self documenting manner.
> + * Write whole log vectors into a single iclog which is guaranteed to have
> + * either sufficient space for the entire log vector chain to be written or
> + * exclusive access to the remaining space in the iclog.
> + *
> + * Return the number of iovecs and data written into the iclog, as well as
> + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> + * end of the chain).
>   */
> -static int
> -xlog_write_setup_copy(
> +static struct xfs_log_vec *
> +xlog_write_single(
> +	struct xfs_log_vec	*log_vector,

So xlog_write_single() was initially for single CIL xlog_write() calls
and now it appears to be slightly different in that it writes as many
full log vectors as fit in the current iclog and cycles through
xlog_write_partial() (and back) to process log vectors that span iclogs
differently from those that don't.

>  	struct xlog_ticket	*ticket,
> -	struct xlog_op_header	*ophdr,
> -	int			space_available,
> -	int			space_required,
> -	int			*copy_off,
> -	int			*copy_len,
> -	int			*last_was_partial_copy,
> -	int			*bytes_consumed)
> -{
> -	int			still_to_copy;
> -
> -	still_to_copy = space_required - *bytes_consumed;
> -	*copy_off = *bytes_consumed;
> -
> -	if (still_to_copy <= space_available) {
> -		/* write of region completes here */
> -		*copy_len = still_to_copy;
> -		ophdr->oh_len = cpu_to_be32(*copy_len);
> -		if (*last_was_partial_copy)
> -			ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS);
> -		*last_was_partial_copy = 0;
> -		*bytes_consumed = 0;
> -		return 0;
> -	}
> -
> -	/* partial write of region, needs extra log op header reservation */
> -	*copy_len = space_available;
> -	ophdr->oh_len = cpu_to_be32(*copy_len);
> -	ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> -	if (*last_was_partial_copy)
> -		ophdr->oh_flags |= XLOG_WAS_CONT_TRANS;
> -	*bytes_consumed += *copy_len;
> -	(*last_was_partial_copy)++;
> -
> -	/* account for new log op header */
> -	ticket->t_curr_res -= sizeof(struct xlog_op_header);
> -
> -	return sizeof(struct xlog_op_header);
> -}
> -
> -static int
> -xlog_write_copy_finish(
> -	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
> -	uint			flags,
> -	int			*record_cnt,
> -	int			*data_cnt,
> -	int			*partial_copy,
> -	int			*partial_copy_len,
> -	int			log_offset,
> -	struct xlog_in_core	**commit_iclog)
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt)
>  {
> -	int			error;
> +	struct xfs_log_vec	*lv = log_vector;
> +	void			*ptr;
> +	int			index;
>  
> -	if (*partial_copy) {
> +	ASSERT(*log_offset + *len <= iclog->ic_size ||
> +		iclog->ic_state == XLOG_STATE_WANT_SYNC);
> +
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * This iclog has already been marked WANT_SYNC by
> -		 * xlog_state_get_iclog_space.
> +		 * If the entire log vec does not fit in the iclog, punt it to
> +		 * the partial copy loop which can handle this case.
>  		 */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> -		goto release_iclog;
> -	}
> +		if (lv->lv_niovecs &&
> +		    lv->lv_bytes > iclog->ic_size - *log_offset)
> +			break;
>  
> -	*partial_copy = 0;
> -	*partial_copy_len = 0;
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
>  
> -	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
> -		/* no more space in this iclog - push it. */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
> -		if (iclog->ic_state == XLOG_STATE_ACTIVE)
> -			xlog_state_switch_iclogs(log, iclog, 0);
> -		else
> -			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> -			       iclog->ic_state == XLOG_STATE_IOERROR);
> -		if (!commit_iclog)
> -			goto release_iclog;
> -		spin_unlock(&log->l_icloglock);
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> -		*commit_iclog = iclog;
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));
> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len);
> +			(*record_cnt)++;
> +			*data_cnt += reg->i_len;
> +		}
>  	}
> +	ASSERT(*len == 0 || lv);
> +	return lv;
> +}
>  
> -	return 0;
> +static int
> +xlog_write_get_more_iclog_space(
> +	struct xlog		*log,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
> +{
> +	struct xlog_in_core	*iclog = *iclogp;
> +	int			error;
>  
> -release_iclog:
> +	spin_lock(&log->l_icloglock);
> +	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> +	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> +	       iclog->ic_state == XLOG_STATE_IOERROR);
>  	error = xlog_state_release_iclog(log, iclog);
>  	spin_unlock(&log->l_icloglock);
> -	return error;
> +	if (error)
> +		return error;
> +
> +	error = xlog_state_get_iclog_space(log, len, &iclog,
> +				ticket, contwr, log_offset);
> +	if (error)
> +		return error;
> +	*record_cnt = 0;
> +	*data_cnt = 0;
> +	*iclogp = iclog;
> +	return 0;
>  }
>  
>  /*
> - * Write log vectors into a single iclog which is guaranteed by the caller
> - * to have enough space to write the entire log vector into. Return the number
> - * of log vectors written into the iclog.
> + * Write log vectors into a single iclog which is smaller than the current chain
> + * length. We write until we cannot fit a full record into the remaining space
> + * and then stop. We return the log vector that is to be written that cannot
> + * wholly fit in the iclog.
>   */
> -static int
> -xlog_write_single(
> +static struct xfs_log_vec *
> +xlog_write_partial(
> +	struct xlog		*log,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_in_core	*iclog,
> -	uint32_t		log_offset,
> -	uint32_t		len)
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
>  {
> +	struct xlog_in_core	*iclog = *iclogp;
>  	struct xfs_log_vec	*lv = log_vector;

The log_vector -> lv assignment seems spurious at this point since this
function only processes lv and returns the next.

> +	struct xfs_log_iovec	*reg;
> +	struct xlog_op_header	*ophdr;
>  	void			*ptr;
>  	int			index = 0;
> -	int			record_cnt = 0;
> +	uint32_t		rlen;
> +	int			error;
>  
> -	ASSERT(log_offset + len <= iclog->ic_size);
> +	/* walk the logvec, copying until we run out of space in the iclog */
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (index = 0; index < lv->lv_niovecs; index++) {
> +		uint32_t	reg_offset = 0;
> +
> +		reg = &lv->lv_iovecp[index];
> +		ASSERT(reg->i_len % sizeof(int32_t) == 0);
>  
> -	ptr = iclog->ic_datap + log_offset;
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * Ordered log vectors have no regions to write so this
> -		 * loop will naturally skip them.
> +		 * The first region of a continuation must have a non-zero
> +		 * length otherwise log recovery will just skip over it and
> +		 * start recovering from the next opheader it finds. Because we
> +		 * mark the next opheader as a continuation, recovery will then
> +		 * incorrectly add the continuation to the previous region and
> +		 * that breaks stuff.
> +		 *
> +		 * Hence if there isn't space for region data after the
> +		 * opheader, then we need to start afresh with a new iclog.
>  		 */
> -		for (index = 0; index < lv->lv_niovecs; index++) {
> -			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> -			struct xlog_op_header	*ophdr = reg->i_addr;
> +		if (iclog->ic_size - *log_offset <=
> +					sizeof(struct xlog_op_header)) {
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
> +		}
>  
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		ophdr = reg->i_addr;
> +		rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset);
> +
> +		ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +		ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header));
> +		if (rlen != reg->i_len)
> +			ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
>  
> +		ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		xlog_verify_dest_ptr(log, ptr);
> +		memcpy(ptr, reg->i_addr, rlen);
> +		xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +		(*record_cnt)++;
> +		*data_cnt += rlen;
> +

		/* if we fit the full region, jump to the next */

> +		if (rlen == reg->i_len)
> +			continue;
> +
> +		/*
> +		 * We now have a partially written iovec, but it can span
> +		 * multiple iclogs so we loop here. First we release the iclog
> +		 * we currently have, then we get a new iclog and add a new
> +		 * opheader. Then we continue copying from where we were until
> +		 * we either complete the iovec or fill the iclog. If we
> +		 * complete the iovec, then we increment the index and go right
> +		 * back to the top of the outer loop. if we fill the iclog, we
> +		 * run the inner loop again.
> +		 *
> +		 * This is complicated by the tail of a region using all the
> +		 * space in an iclog and hence requiring us to release the iclog
> +		 * and get a new one before returning to the outer loop. We must
> +		 * always guarantee that we exit this inner loop with at least
> +		 * space for log transaction opheaders left in the current
> +		 * iclog, hence we cannot just terminate the loop at the end
> +		 * of the continuation. So we loop while there is no
> +		 * space left in the current iclog, and check for the end of the
> +		 * continuation after getting a new iclog.
> +		 */

Ok, so we land in this function if an lv spans an iclog boundary. The
upper loop writes full vectors until we hit said iclog boundary, then we
fall into the inner loop...

> +		do {
> +			/*
> +			 * Account for the continuation opheader before we get
> +			 * a new iclog. This is necessary so that we reserve
> +			 * space in the iclog for it.
> +			 */
> +			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {

(Is this ever not true here?)

> +				*len += sizeof(struct xlog_op_header);
> +				ticket->t_curr_res -= sizeof(struct xlog_op_header);
> +			}
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
> +
> +			ophdr = ptr;
>  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +			ophdr->oh_clientid = XFS_TRANSACTION;
> +			ophdr->oh_res2 = 0;
> +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> +
> +			xlog_write_adv_cnt(&ptr, len, log_offset,
>  						sizeof(struct xlog_op_header));
> -			memcpy(ptr, reg->i_addr, reg->i_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> -			record_cnt++;
> -		}
> +			*data_cnt += sizeof(struct xlog_op_header);
> +

... which switches to the next iclog, writes the continuation header...

> +			/*
> +			 * If rlen fits in the iclog, then end the region
> +			 * continuation. Otherwise we're going around again.
> +			 */
> +			reg_offset += rlen;
> +			rlen = reg->i_len - reg_offset;
> +			if (rlen <= iclog->ic_size - *log_offset)
> +				ophdr->oh_flags |= XLOG_END_TRANS;
> +			else
> +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> +
> +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> +			ophdr->oh_len = cpu_to_be32(rlen);
> +
> +			xlog_verify_dest_ptr(log, ptr);
> +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +			(*record_cnt)++;
> +			*data_cnt += rlen;
> +
> +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);

... writes more of the region (iclog space permitting), and then
determines whether we need further continuations (and partial writes of
the same region) or can move onto the next region, until we're done with
the lv.

I think I follow the high level flow and it seems reasonable from a
functional standpoint, but this also seems like quite a bit of churn for
not much reduction in overall complexity. The higher level loop is much
more simple and I think the per lv/vector iteration is an improvement,
but we also seem to have duplicate functionality throughout the updated
code and have introduced new forms of complexity around the state
expectations for the transitions between the different write modes and
between each write mode and the higher level loop.

I.e., xlog_write_single() implements a straightforward loop to write out
full log vectors. That seems fine, but the outer loop of
xlog_write_partial() reimplements nearly the same per-region
functionality with some added flexibility to handle op header flags and
the special iclog processing associated with the continuation case. The
inner loop factors out the continuation iclog management bits and op
header injection, which I think is an improvement, but then duplicates
region copying (yet again) pretty much only to implement partial copies,
which really just involves offset management (i.e., fairly trivial
relative to the broader complexity of the function).

I dunno. I'd certainly need to stare more at this to cover all of the
details, but given the amount of swizzling going on in a single patch
I'm kind of wondering if/why we couldn't land on a single iterator in
the spirit of xlog_write_partial() in that it primarily iterates on
regions and factors out the grotty reservation and continuation
management bits, but doesn't unroll as much and leave so much duplicate
functionality around.

For example, it looks to me that xlog_write_partial() almost
already supports a high level algorithm along the lines of the following
(pseudocode):

xlog_write(len)
{
	get_iclog_space(len)

	for_each_lv() {
		for_each_reg() {
			reg_offset = 0;
cont_write:
			/* write as much as will fit in the iclog, return count,
			 * and set ophdr cont flag based on write result */
			reg_offset += write_region(reg, &len, &reg_offset, ophdr, ...);

			/* handle continuation writes */
			if (reg_offset != reg->i_len) {
				get_more_iclog_space(len);
				/* stamp a WAS_CONT op hdr, set END if rlen fits
				 * into new space, then continue with the same region */
				stamp_cont_op_hdr();
				goto cont_write;
			}

			if (need_more_iclog_space(len))
				get_more_iclog_space(len);
		}
	}
}

That puts the whole thing back into a single high level walk and thus
reintroduces the need for some of the continuation vs. non-continuation
tracking wrt the op header and iclog, but ISTM that complexity can be
managed by the continuation abstraction you've already started to
introduce (as opposed to the current scheme of conditionally
accumulating data_cnt). It might even be fine to dump some of the
requisite state into a context struct to carry between iclog reservation
and copy finish processing rather than pass around so many independent
and poorly named variables like the current upstream implementation
does, but that's probably getting too deep into the weeds.
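
(For concreteness, such a context might be nothing more than a bag of
the state we currently thread through as separate parameters - purely
hypothetical naming:

	struct xlog_write_ctx {
		struct xlog_in_core	*iclog;		/* current target iclog */
		uint32_t		log_offset;	/* offset of next write */
		uint32_t		len;		/* bytes left to write */
		uint32_t		record_cnt;	/* ophdrs written so far */
		uint32_t		data_cnt;	/* bytes to account to iclog */
		int			contwr;		/* continuation write state */
	};

...but as I said, that's probably weeds.)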

FWIW, I can also see an approach of moving from the implementation in
this patch toward something like the above, but I'm not sure I'd want to
subject the upstream code to that process...

Brian

>  	}
> -	ASSERT(len == 0);
> -	return record_cnt;
> -}
>  
> +	/*
> +	 * No more iovecs remain in this logvec so return the next log vec to
> +	 * the caller so it can go back to fast path copying.
> +	 */
> +	*iclogp = iclog;
> +	return lv->lv_next;
> +}
>  
>  /*
>   * Write some region out to in-core log
> @@ -2312,14 +2396,11 @@ xlog_write(
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> -	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
> -	int			index = 0;
> -	int			partial_copy = 0;
> -	int			partial_copy_len = 0;
>  	int			contwr = 0;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> +	int			log_offset;
>  
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
> @@ -2328,157 +2409,52 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	if (start_lsn)
> -		*start_lsn = 0;
> -	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -		void		*ptr;
> -		int		log_offset;
> -
> -		error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> -						   &contwr, &log_offset);
> -		if (error)
> -			return error;
> -
> -		ASSERT(log_offset <= iclog->ic_size - 1);
> +	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> +					   &contwr, &log_offset);
> +	if (error)
> +		return error;
>  
> -		/* Start_lsn is the first lsn written to. */
> -		if (start_lsn && !*start_lsn)
> -			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +	/* start_lsn is the LSN of the first iclog written to. */
> +	if (start_lsn)
> +		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> -		/*
> -		 * iclogs containing commit records or unmount records need
> -		 * to issue ordering cache flushes and commit immediately
> -		 * to stable storage to guarantee journal vs metadata ordering
> -		 * is correctly maintained in the storage media.
> -		 */
> -		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> -			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> -						XLOG_ICL_NEED_FUA);
> -		}
> +	/*
> +	 * iclogs containing commit records or unmount records need
> +	 * to issue ordering cache flushes and commit immediately
> +	 * to stable storage to guarantee journal vs metadata ordering
> +	 * is correctly maintained in the storage media. This will always
> +	 * fit in the iclog we have been already been passed.
> +	 * fit in the iclog we have already been passed.
> +	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +		iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> +		ASSERT(!contwr);
> +	}
>  
> -		/* If this is a single iclog write, go fast... */
> -		if (!contwr && lv == log_vector) {
> -			record_cnt = xlog_write_single(lv, ticket, iclog,
> -						log_offset, len);
> -			len = 0;
> -			data_cnt = len;
> +	while (lv) {
> +		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt);
> +		if (!lv)
>  			break;
> -		}
> -
> -		/*
> -		 * This loop writes out as many regions as can fit in the amount
> -		 * of space which was allocated by xlog_state_get_iclog_space().
> -		 */
> -		ptr = iclog->ic_datap + log_offset;
> -		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -			struct xfs_log_iovec	*reg;
> -			struct xlog_op_header	*ophdr;
> -			int			copy_len;
> -			int			copy_off;
> -			bool			ordered = false;
> -			bool			added_ophdr = false;
> -
> -			/* ordered log vectors have no regions to write */
> -			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> -				ASSERT(lv->lv_niovecs == 0);
> -				ordered = true;
> -				goto next_lv;
> -			}
> -
> -			reg = &vecp[index];
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> -
> -			/*
> -			 * Regions always have their ophdr at the start of the
> -			 * region, except for:
> -			 * - a transaction start which has a start record ophdr
> -			 *   before the first region ophdr; and
> -			 * - the previous region didn't fully fit into an iclog
> -			 *   so needs a continuation ophdr to prepend the region
> -			 *   in this new iclog.
> -			 */
> -			ophdr = reg->i_addr;
> -			if (optype && index) {
> -				optype &= ~XLOG_START_TRANS;
> -			} else if (partial_copy) {
> -                                ophdr = xlog_write_setup_ophdr(ptr, ticket);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -					   sizeof(struct xlog_op_header));
> -				added_ophdr = true;
> -			}
> -			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -
> -			len += xlog_write_setup_copy(ticket, ophdr,
> -						     iclog->ic_size-log_offset,
> -						     reg->i_len,
> -						     &copy_off, &copy_len,
> -						     &partial_copy,
> -						     &partial_copy_len);
> -			xlog_verify_dest_ptr(log, ptr);
> -
>  
> -			/*
> -			 * Wart: need to update length in embedded ophdr not
> -			 * to include it's own length.
> -			 */
> -			if (!added_ophdr) {
> -				ophdr->oh_len = cpu_to_be32(copy_len -
> -						sizeof(struct xlog_op_header));
> -			}
> -
> -			ASSERT(copy_len > 0);
> -			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> -
> -			if (added_ophdr)
> -				copy_len += sizeof(struct xlog_op_header);
> -			record_cnt++;
> -			data_cnt += contwr ? copy_len : 0;
> -
> -			error = xlog_write_copy_finish(log, iclog, optype,
> -						       &record_cnt, &data_cnt,
> -						       &partial_copy,
> -						       &partial_copy_len,
> -						       log_offset,
> -						       commit_iclog);
> -			if (error)
> -				return error;
> -
> -			/*
> -			 * if we had a partial copy, we need to get more iclog
> -			 * space but we don't want to increment the region
> -			 * index because there is still more is this region to
> -			 * write.
> -			 *
> -			 * If we completed writing this region, and we flushed
> -			 * the iclog (indicated by resetting of the record
> -			 * count), then we also need to get more log space. If
> -			 * this was the last record, though, we are done and
> -			 * can just return.
> -			 */
> -			if (partial_copy)
> -				break;
> -
> -			if (++index == lv->lv_niovecs) {
> -next_lv:
> -				lv = lv->lv_next;
> -				index = 0;
> -				if (lv)
> -					vecp = lv->lv_iovecp;
> -			}
> -			if (record_cnt == 0 && !ordered) {
> -				if (!lv)
> -					return 0;
> -				break;
> -			}
> +		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
> +		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt, &contwr);
> +		if (IS_ERR_OR_NULL(lv)) {
> +			error = PTR_ERR_OR_ZERO(lv);
> +			break;
>  		}
>  	}
> +	ASSERT((len == 0 && !lv) || error);
>  
> -	ASSERT(len == 0);
> -
> +	/*
> +	 * We've already been guaranteed that the last writes will fit inside
> +	 * the current iclog, and hence it will already have the space used by
> +	 * those writes accounted to it. Hence we do not need to update the
> +	 * iclog with the number of bytes written here.
> +	 */
> +	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
>  	spin_lock(&log->l_icloglock);
> -	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
> +	xlog_state_finish_copy(log, iclog, record_cnt, 0);
>  	if (commit_iclog) {
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
> @@ -2930,7 +2906,7 @@ xlog_state_get_iclog_space(
>  	 * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's
>  	 * can fit into remaining data section.
>  	 */
> -	if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) {
> +	if (iclog->ic_size - iclog->ic_offset < 3*sizeof(xlog_op_header_t)) {
>  		int		error = 0;
>  
>  		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
> @@ -3633,11 +3609,12 @@ xlog_verify_iclog(
>  					iclog->ic_header.h_cycle_data[idx]);
>  			}
>  		}
> -		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG)
> +		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) {
>  			xfs_warn(log->l_mp,
> -				"%s: invalid clientid %d op "PTR_FMT" offset 0x%lx",
> -				__func__, clientid, ophead,
> +				"%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx",
> +				__func__, i, clientid, ophead,
>  				(unsigned long)field_offset);
> +		}
>  
>  		/* check length */
>  		p = &ophead->oh_len;
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-03-16 14:53   ` Brian Foster
@ 2021-05-19  3:18     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-05-19  3:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 16, 2021 at 10:53:01AM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Current xlog_write() adds op headers to the log manually for every
> > log item region that is in the vector passed to it. While
> > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > already know its length, flags, clientid, etc at CIL commit time.
> > 
> > This means the only time that xlog write really needs to format and
> > reserve space for a new ophdr is when a region is split across two
> > iclogs. Adding the opheader and accounting for it as part of the
> > normal formatted item region means we simplify the accounting
> > of space used by a transaction and we don't have to special case
> > reserving space for the ophdrs in xlog_write(). It also means
> > we can largely initialise the ophdr in transaction commit instead
> > of xlog_write, making the xlog_write formatting inner loop much
> > tighter.
> > 
> > xlog_prepare_iovec() is now too large to stay as an inline function,
> > so we move it out of line and into xfs_log.c.
> > 
> > Object sizes:
> > text	   data	    bss	    dec	    hex	filename
> > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > 
> > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > out of line, even though it grew in size itself.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Looks mostly reasonable, a couple or so questions...
> 
> >  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_log.h     |  42 +++-------------
> >  fs/xfs/xfs_log_cil.c |  25 +++++-----
> >  3 files changed, 99 insertions(+), 83 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 429cb1e7cc67..98de45be80c0 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -89,6 +89,62 @@ xlog_iclogs_empty(
> >  static int
> >  xfs_log_cover(struct xfs_mount *);
> >  
> > +/*
> > + * We need to make sure the buffer pointer returned is naturally aligned for the
> > + * biggest basic data type we put into it. We have already accounted for this
> > + * padding when sizing the buffer.
> > + *
> > + * However, this padding does not get written into the log, and hence we have to
> > + * track the space used by the log vectors separately to prevent log space hangs
> > + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> > + * CIL context ticket.
> > + *
> > + * We also add space for the xlog_op_header that describes this region in the
> > + * log. This prepends the data region we return to the caller to copy their data
> > + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> > + * is not 8 byte aligned, we have to be careful to ensure that we align the
> > + * start of the buffer such that the region we return to the call is 8 byte
> > + * aligned and packed against the tail of the ophdr.
> > + */
> > +void *
> > +xlog_prepare_iovec(
> > +	struct xfs_log_vec	*lv,
> > +	struct xfs_log_iovec	**vecp,
> > +	uint			type)
> > +{
> > +	struct xfs_log_iovec	*vec = *vecp;
> > +	struct xlog_op_header	*oph;
> > +	uint32_t		len;
> > +	void			*buf;
> > +
> > +	if (vec) {
> > +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> > +		vec++;
> > +	} else {
> > +		vec = &lv->lv_iovecp[0];
> > +	}
> > +
> > +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> > +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> > +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> > +					sizeof(struct xlog_op_header);
> > +	}
> > +
> > +	vec->i_type = type;
> > +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> > +
> > +	oph = vec->i_addr;
> > +	oph->oh_clientid = XFS_TRANSACTION;
> > +	oph->oh_res2 = 0;
> > +	oph->oh_flags = 0;
> > +
> > +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> > +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
> 
> Why is it the buffer portion needs to be 8 byte aligned but not ->i_addr
> itself?

Same reason the returned buffer has always been 8 byte aligned -
because we cast it to on-disk structures that assume 8 byte
memory alignment of the structure. e.g. when formatting inodes,
intents, etc.

We don't care about the ophdr itself, because it only contains
4 byte aligned (32 bit) variables, and the log only requires 4 byte
alignment of regions (xlog_write() asserts this on the pointers it
casts to ophdrs to update them)...
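
To put concrete numbers on it (assuming lv->lv_buf is itself 8 byte
aligned, and a 12 byte struct xlog_op_header): if lv_buf_len is
currently 48, then len = 48 + 12 = 60, which is not 8 byte aligned,
so lv_buf_len gets bumped to round_up(60, 8) - 12 = 52. The ophdr
then sits at offset 52 and the buffer handed back to the caller
starts at offset 64 - 8 byte aligned and packed hard against the
tail of the ophdr, with the 4 bytes of padding accounted in the
buffer but never written to the log.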

> >  static void
> >  xlog_grant_sub_space(
> >  	struct xlog		*log,
> ...
> > @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
> >  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
> >  		}
> >  	}
> > -
> > -	/* Don't account for regions with embedded ophdrs */
> > -	if (optype && headers > 0) {
> > -		headers--;
> > -		if (optype & XLOG_START_TRANS) {
> > -			ASSERT(headers >= 1);
> > -			headers--;
> > -		}
> > -	}
> > -
> >  	ticket->t_res_num_ophdrs += headers;
> > -	len += headers * sizeof(struct xlog_op_header);
> 
> Hm, this seems to suggest something was off wrt to ->t_res_num_ophdrs
> prior to this change.  Granted this looks like it's just a debug field,
> but the previous logic filtered out embedded op headers unconditionally
> whereas now it looks like we go back to accounting them. Am I missing
> something?

Nothing wrong here with the old or the new code. The old code *adds*
the unaccounted ophdr space for each region here, but now that
everything has an embedded opheader the space is accounted directly
to the region size.
Hence the size of the regions we summed in the above loop already
accounts for the ophdr space and we no longer need to account for
opheaders individually here.
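
As a made up example: a 128 byte region used to be summed as 128
bytes in that loop with its 12 byte ophdr added separately here; now
xlog_prepare_iovec() embeds the ophdr, so the same region is built
with an i_len of 140 and the loop sums it directly, leaving nothing
to add afterwards.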

> > @@ -2404,21 +2448,25 @@ xlog_write(
> >  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> >  
> >  			/*
> > -			 * The XLOG_START_TRANS has embedded ophdrs for the
> > -			 * start record and transaction header. They will always
> > -			 * be the first two regions in the lv chain. Commit and
> > -			 * unmount records also have embedded ophdrs.
> > +			 * Regions always have their ophdr at the start of the
> > +			 * region, except for:
> > +			 * - a transaction start which has a start record ophdr
> > +			 *   before the first region ophdr; and
> > +			 * - the previous region didn't fully fit into an iclog
> > +			 *   so needs a continuation ophdr to prepend the region
> > +			 *   in this new iclog.
> >  			 */
> > -			if (optype) {
> > -				ophdr = reg->i_addr;
> > -				if (index)
> > -					optype &= ~XLOG_START_TRANS;
> > -			} else {
> > +			ophdr = reg->i_addr;
> > +			if (optype && index) {
> > +				optype &= ~XLOG_START_TRANS;
> > +			} else if (partial_copy) {
> >                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
> >  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> >  					   sizeof(struct xlog_op_header));
> >  				added_ophdr = true;
> >  			}
> 
> So in the partial_copy continuation case we're still stamping an ophdr
> directly into the iclog. Otherwise we're processing/modifying flags and
> whatnot on the ophdr already stamped at commit time in the log vector.
> However, this is Ok because a relog would reformat the op header.

Right. We don't know ahead of time when a region is going to be
split across two iclogs, so we still have to handle the case of
adding an ophdr for a split region. This gets much simpler later on
in the series as the xlog_write() code is factored.
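
Roughly, the layout for a region split across an iclog boundary ends
up as:

	iclog N:   | ... | ophdr (CONTINUE_TRANS) | start of region |
	iclog N+1: | ophdr (WAS_CONT_TRANS [|END_TRANS]) | rest of region |

and it's that second ophdr we cannot format at commit time, because
we only discover the split when we run out of space in iclog N.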

> > +			ASSERT(copy_len > 0);
> > +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> > +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> > +
> 
> I assume the checks in xlog_write_copy_finish() to require a minimum of
> one op header worth of space in the iclog prevent doing a partial write
> across an embedded op header boundary, but it would be nice to have an
> assert or something that ensures that. For example, assert for something
> like if a partial_copy occurs, partial_copy_len was at least the length
> of an op header into the region.

This code is completely reworked later in the series and these
whacky corner cases largely go away. There's not much point in
making this "robust" because of this...

> 
> >  			if (added_ophdr)
> >  				copy_len += sizeof(struct xlog_op_header);
> >  			record_cnt++;
> ...
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 0c81c13e2cf6..7a5e6bdb7876 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
> >  		}
> >  
> >  		/*
> > -		 * We 64-bit align the length of each iovec so that the start
> > -		 * of the next one is naturally aligned.  We'll need to
> > -		 * account for that slack space here. Then round nbytes up
> > -		 * to 64-bit alignment so that the initial buffer alignment is
> > -		 * easy to calculate and verify.
> > +		 * We 64-bit align the length of each iovec so that the start of
> > +		 * the next one is naturally aligned.  We'll need to account for
> > +		 * that slack space here.
> > +		 *
> 
> Related to my question above, I'm a little confused by the (preexisting)
> comment. If the start of the next iovec is now the ophdr, doesn't that
> mean the "start of the next one (iovec)" is technically no longer
> naturally aligned?

The ophdr doesn't need to be 64 bit aligned, just the
buffer that is returned from xlog_prepare_iovec(). That means we
still have to guarantee space in the buffer for that to be rounded
to 64 bits if it's not already aligned. All of the iovecs might
require padding to align them, so the presence of the ophdr doesn't
really change anything at all here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/45] xfs: log ticket region debug is largely useless
  2021-03-16 14:55   ` Brian Foster
@ 2021-05-19  3:27     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-05-19  3:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 16, 2021 at 10:55:34AM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:24PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xlog_tic_add_region() is used to trace the regions being added to a
> > log ticket to provide information in the situation where a ticket
> > reservation overrun occurs. The information gathered is stored in
> > the ticket, and dumped if xlog_print_tic_res() is called.
> > 
> > For a front end struct xfs_trans overrun, the ticket only contains
> > reservation tracking information - the ticket is never handed to the
> > log so has no regions attached to it. The overrun debug information in this
> > case comes from xlog_print_trans(), which walks the items attached
> > to the transaction and dumps their attached formatted log vectors
> > directly. It also dumps the ticket state, but that only contains
> > reservation accounting and nothing else. Hence xlog_print_tic_res()
> > never dumps region or overrun information from this path.
> > 
> > xlog_tic_add_region() is actually called from xlog_write(), which
> > means it is being used to track the regions seen in a
> > CIL checkpoint log vector chain. In looking at CIL behaviour
> > recently, I've seen 32MB checkpoints regularly exceed 250,000
> > regions in the LV chain. The log ticket debug code can track *15*
> > regions. IOWs, if there is a ticket overrun in the CIL code, the
> > ticket region tracking code is going to be completely useless for
> > determining what went wrong. The only thing it can tell us is how
> > much of an overrun occurred, and we really don't need extra debug
> > information in the log ticket to tell us that.
> > 
> > Indeed, the main place we call xlog_tic_add_region() is also adding
> > up the number of regions and the space used so that xlog_write()
> > knows how much will be written to the log. This is exactly the same
> > information that log ticket is storing once we take away the useless
> > region tracking array. Hence xlog_tic_add_region() is not useful,
> > but can be called 250,000 times a CIL push...
> > 
> > Just strip all that debug "information" out of the log ticket
> > and only have it report reservation space information when an
> > overrun occurs. This also reduces the size of a log ticket down by
> > about 150 bytes...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_log.c      | 107 +++---------------------------------------
> >  fs/xfs/xfs_log_priv.h |  17 -------
> >  2 files changed, 6 insertions(+), 118 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index 7f601c1c9f45..8ee6a5f74396 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -139,16 +139,6 @@ enum xlog_iclog_state {
> >  /* Ticket reservation region accounting */ 
> >  #define XLOG_TIC_LEN_MAX	15
> >  
> 
> This is unused now.

Removed.

> 
> > -/*
> > - * Reservation region
> > - * As would be stored in xfs_log_iovec but without the i_addr which
> > - * we don't care about.
> > - */
> > -typedef struct xlog_res {
> > -	uint	r_len;	/* region length		:4 */
> > -	uint	r_type;	/* region's transaction type	:4 */
> > -} xlog_res_t;
> > -
> >  typedef struct xlog_ticket {
> >  	struct list_head   t_queue;	 /* reserve/write queue */
> >  	struct task_struct *t_task;	 /* task that owns this ticket */
> > @@ -159,13 +149,6 @@ typedef struct xlog_ticket {
> >  	char		   t_ocnt;	 /* original count		 : 1  */
> >  	char		   t_cnt;	 /* current count		 : 1  */
> >  	char		   t_flags;	 /* properties of reservation	 : 1  */
> > -
> > -        /* reservation array fields */
> > -	uint		   t_res_num;                    /* num in array : 4 */
> > -	uint		   t_res_num_ophdrs;		 /* num op hdrs  : 4 */
> 
> I'm curious why we wouldn't want to retain the ophdr count..? That's
> managed separately from the _add_region() bits and provides some info on
> the total number of vectors, etc. Otherwise looks reasonable.

Because we calculate it when we build the lv chain in a push and
only use the value in the checkpoint transaction header. This is now
entirely encapsulated within the CIL push code and is no longer
something xlog_write() needs to know because it doesn't build
on-disk transaction headers. If we need debug to track how long the
lv chain is, then grab it in the CIL code where it is already
known...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 28/45] xfs: introduce xlog_write_single()
  2021-03-16 18:39   ` Brian Foster
@ 2021-05-19  3:44     ` Dave Chinner
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Chinner @ 2021-05-19  3:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 16, 2021 at 02:39:59PM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:26PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Introduce an optimised version of xlog_write() that is used when the
> > entire write will fit in a single iclog. This greatly simplifies the
> > implementation of writing a log vector chain into an iclog, and sets
> > the ground work for a much more understandable xlog_write()
> > implementation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 56 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 22f97914ab99..590c1e6db475 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2214,6 +2214,52 @@ xlog_write_copy_finish(
> >  	return error;
> >  }
> >  
> > +/*
> > + * Write log vectors into a single iclog which is guaranteed by the caller
> > + * to have enough space to write the entire log vector into. Return the number
> > + * of log vectors written into the iclog.
> > + */
> > +static int
> > +xlog_write_single(
> > +	struct xfs_log_vec	*log_vector,
> > +	struct xlog_ticket	*ticket,
> > +	struct xlog_in_core	*iclog,
> > +	uint32_t		log_offset,
> > +	uint32_t		len)
> > +{
> > +	struct xfs_log_vec	*lv = log_vector;
> 
> This is initialized here and in the loop below.

Fixed.

> 
> > +	void			*ptr;
> > +	int			index = 0;
> > +	int			record_cnt = 0;
> > +
> > +	ASSERT(log_offset + len <= iclog->ic_size);
> > +
> > +	ptr = iclog->ic_datap + log_offset;
> > +	for (lv = log_vector; lv; lv = lv->lv_next) {
> > +		/*
> > +		 * Ordered log vectors have no regions to write so this
> > +		 * loop will naturally skip them.
> > +		 */
> > +		for (index = 0; index < lv->lv_niovecs; index++) {
> > +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> > +			struct xlog_op_header	*ophdr = reg->i_addr;
> > +
> > +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> > +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> > +
> > +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> > +						sizeof(struct xlog_op_header));
> 
> Perhaps we should retain the xlog_verify_dest_ptr() call here? It's
> DEBUG code and otherwise compiled out, so shouldn't impact production

The pointer check does nothing to actually prevent memory
corruption. It only catches problems after we've already memcpy()d
off the end of the iclog in the previous loop. So if the last region
overruns the log, then it won't be triggered.

And, well, we've already checked and asserted that the copy is going
to fit entirely within the current iclog, so checking whether the
pointer has overrun outside the iclog buffer is both redundant and
too late.  Hence I removed it...
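
To spell out the ordering, the old loop was shaped roughly like:

	for (each region) {
		xlog_verify_dest_ptr(log, ptr);	/* only trips on damage the
						 * previous memcpy() did */
		memcpy(ptr, reg->i_addr + copy_off, copy_len);
		xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
	}
	/* an overrun by the final region is never checked at all */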

> > +			memcpy(ptr, reg->i_addr, reg->i_len);
> > +			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> > +			record_cnt++;
> > +		}
> > +	}
> > +	ASSERT(len == 0);
> > +	return record_cnt;
> > +}
> > +
> > +
> >  /*
> >   * Write some region out to in-core log
> >   *
> > @@ -2294,7 +2340,6 @@ xlog_write(
> >  			return error;
> >  
> >  		ASSERT(log_offset <= iclog->ic_size - 1);
> > -		ptr = iclog->ic_datap + log_offset;
> >  
> >  		/* Start_lsn is the first lsn written to. */
> >  		if (start_lsn && !*start_lsn)
> > @@ -2311,10 +2356,20 @@ xlog_write(
> >  						XLOG_ICL_NEED_FUA);
> >  		}
> >  
> > +		/* If this is a single iclog write, go fast... */
> > +		if (!contwr && lv == log_vector) {
> > +			record_cnt = xlog_write_single(lv, ticket, iclog,
> > +						log_offset, len);
> > +			len = 0;
> 
> I assume this is here to satisfy the assert further down in the
> function.. This seems a bit contrived when you consider we pass len to
> the helper, the helper reduces it and asserts that it goes to zero, then
> we do so again here just for another assert. Unless this is all just
> removed later, it might be more straightforward to pass a reference.
> 
> > +			data_cnt = len;
> 
> Similarly, this looks a bit odd because it seems data_cnt should be zero
> in the case where contwr == 0. xlog_state_get_iclog_space() has already
> bumped ->ic_offset by len (so xlog_state_finish_copy() doesn't need to
> via data_cnt).

Yes, it's entirely contrived to make it possible to split this code
out in a simple fashion to ease review of the simple, fast path case
this code will end up with. The next patch changes all this context
and the parameters passed to the function, but this was the only way
I could easily split the complex xlog_write() rewrite change into
something a little bit simpler....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/45] xfs:_introduce xlog_write_partial()
  2021-03-18 13:22   ` Brian Foster
@ 2021-05-19  4:49     ` Dave Chinner
  2021-05-20 12:33       ` Brian Foster
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Chinner @ 2021-05-19  4:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Mar 18, 2021 at 09:22:08AM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Handle writing of a logvec chain into an iclog that doesn't have
> > enough space to fit it all. The iclog has already been changed to
> > WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> > in the iclog is exclusively owned by this logvec chain.
> > 
> > The difference between the single and partial cases is that
> > we end up with partial iovec writes in the iclog and have to split
> > a log vec regions across two iclogs. The state handling for this is
> > currently awful and so we're building up the pieces needed to
> > handle this more cleanly one at a time.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> FWIW, git --patience mode generates a more readable diff for this patch
> than what it generates by default. I'm referring to that locally and
> will try to leave feedback in the appropriate points here.
> 
> >  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
> >  1 file changed, 251 insertions(+), 274 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 590c1e6db475..10916b99bf0f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2099,166 +2099,250 @@ xlog_print_trans(
> >  	}
> >  }
> >  
> > -static xlog_op_header_t *
> > -xlog_write_setup_ophdr(
> > -	struct xlog_op_header	*ophdr,
> > -	struct xlog_ticket	*ticket)
> > -{
> > -	ophdr->oh_clientid = XFS_TRANSACTION;
> > -	ophdr->oh_res2 = 0;
> > -	ophdr->oh_flags = 0;
> > -	return ophdr;
> > -}
> > -
> >  /*
> > - * Set up the parameters of the region copy into the log. This has
> > - * to handle region write split across multiple log buffers - this
> > - * state is kept external to this function so that this code can
> > - * be written in an obvious, self documenting manner.
> > + * Write whole log vectors into a single iclog which is guaranteed to have
> > + * either sufficient space for the entire log vector chain to be written or
> > + * exclusive access to the remaining space in the iclog.
> > + *
> > + * Return the number of iovecs and data written into the iclog, as well as
> > + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> > + * end of the chain).
> >   */
> > -static int
> > -xlog_write_setup_copy(
> > +static struct xfs_log_vec *
> > +xlog_write_single(
> > +	struct xfs_log_vec	*log_vector,
> 
> So xlog_write_single() was initially for single CIL xlog_write() calls
> and now it appears to be slightly different in that it writes as many
> full log vectors that fit in the current iclog and cycles through
> xlog_write_partial() (and back) to process log vectors that span iclogs
> differently from those that don't.

Yes, that is what it does, but no, you've got the process and
meaning backwards. I wrote xlog_write_single() as it appears in
this patch first, then split it out backwards to ease review. IOWs,
"single" means "write everything that fits within this single
iclog", not "only call this function if the entire lv chain fits
inside a single iclog".

The latter is what I split out to make it simpler to review, but it
was not the reason it was called xlog_write_single()....

> > +		do {
> > +			/*
> > +			 * Account for the continuation opheader before we get
> > +			 * a new iclog. This is necessary so that we reserve
> > +			 * space in the iclog for it.
> > +			 */
> > +			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {
> 
> (Is this ever not true here?)

It is now, wasn't always. Fixed.

> 
> > +				*len += sizeof(struct xlog_op_header);
> > +				ticket->t_curr_res -= sizeof(struct xlog_op_header);
> > +			}
> > +			error = xlog_write_get_more_iclog_space(log, ticket,
> > +					&iclog, log_offset, *len, record_cnt,
> > +					data_cnt, contwr);
> > +			if (error)
> > +				return ERR_PTR(error);
> > +			ptr = iclog->ic_datap + *log_offset;
> > +
> > +			ophdr = ptr;
> >  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> > +			ophdr->oh_clientid = XFS_TRANSACTION;
> > +			ophdr->oh_res2 = 0;
> > +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> > +
> > +			xlog_write_adv_cnt(&ptr, len, log_offset,
> >  						sizeof(struct xlog_op_header));
> > -			memcpy(ptr, reg->i_addr, reg->i_len);
> > -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> > -			record_cnt++;
> > -		}
> > +			*data_cnt += sizeof(struct xlog_op_header);
> > +
> 
> ... which switches to the next iclog, writes the continuation header...
> 
> > +			/*
> > +			 * If rlen fits in the iclog, then end the region
> > +			 * continuation. Otherwise we're going around again.
> > +			 */
> > +			reg_offset += rlen;
> > +			rlen = reg->i_len - reg_offset;
> > +			if (rlen <= iclog->ic_size - *log_offset)
> > +				ophdr->oh_flags |= XLOG_END_TRANS;
> > +			else
> > +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> > +
> > +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> > +			ophdr->oh_len = cpu_to_be32(rlen);
> > +
> > +			xlog_verify_dest_ptr(log, ptr);
> > +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> > +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> > +			(*record_cnt)++;
> > +			*data_cnt += rlen;
> > +
> > +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
> 
> ... writes more of the region (iclog space permitting), and then
> determines whether we need further continuations (and partial writes of
> the same region) or can move onto the next region, until we're done with
> the lv.

Yup.

> I think I follow the high level flow and it seems reasonable from a
> functional standpoint, but this also seems like quite a bit of churn for
> not much reduction in overall complexity. The higher level loop is much
> more simple and I think the per lv/vector iteration is an improvement,
> but we also seem to have duplicate functionality throughout the updated
> code and have introduced new forms of complexity around the state
> expectations for the transitions between the different write modes and
> between each write mode and the higher level loop.

Just untangling the code to get it to this point
has been hard enough. I've held off doing more factoring and
changing this code so I can actually test it and find the bugs I
might have left in it.

Yes, it can be further improved by factoring the region copying
stuff, but that's secondary to the major work of refactoring this
code in the first place. The fact that you actually understood this
fairly easily indicates just how much better this code already is
compared to what is currently upstream....

> I.e., xlog_write_single() implements a straightforward loop to write out
> full log vectors. That seems fine, but the outer loop of
> xlog_write_partial() reimplements nearly the same per-region
> functionality with some added flexibility to handle op header flags and
> the special iclog processing associated with the continuation case. The
> inner loop factors out the continuation iclog management bits and op
> header injection, which I think is an improvement, but then duplicates
> region copying (yet again) pretty much only to implement partial copies,
> which really just involves offset management (i.e., fairly trivial
> relative to the broader complexity of the function).
> 
> I dunno. I'd certainly need to stare more at this to cover all of the
> details, but given the amount of swizzling going on in a single patch
> I'm kind of wondering if/why we couldn't land on a single iterator in
> the spirit of xlog_write_partial() in that it primarily iterates on
> regions and factors out the grotty reservation and continuation
> management bits, but doesn't unroll as much and leave so much duplicate
> functionality around.
> 
> For example, it looks to me that xlog_write_partial() almost
> already supports a high level algorithm along the lines of the following
> (pseudocode):
> 
> xlog_write(len)
> {
> 	get_iclog_space(len)
> 
> 	for_each_lv() {
> 		for_each_reg() {
> 			reg_offset = 0;
> cont_write:
> 			/* write as much as will fit in the iclog, return count,
> 			 * and set ophdr cont flag based on write result */
> 			reg_offset += write_region(reg, &len, &reg_offset, ophdr, ...);
> 
> 			/* handle continuation writes */
> 			if (reg_offset != reg->i_len) {
> 				get_more_iclog_space(len);
> 				/* stamp a WAS_CONT op hdr, set END if rlen fits
> 				 * into new space, then continue with the same region */
> 				stamp_cont_op_hdr();
> 				goto cont_write;
> 			}
> 
> 			if (need_more_iclog_space(len))
> 				get_more_iclog_space(len);
> 		}
> 	}
> }

Yeah, na. That is exactly the mess that I've just untangled.

I don't want to rewrite this code again, and I don't want it more
tightly tied to iclogs than it already is - I'm trying to move the
code towards a common, simple fast path that knows nothing about
iclogs and a slow path that handles the partial regions and
obtaining a new buffer to write into. I want the two cases to be
completely separate logic, because that makes both cases simpler to
modify and reason about.

Indeed, I want xlog_write to move away from iclogs because I want to
use this code with direct mapped pmem regions, not just fixed memory
buffers held in iclogs.

IOWs, the code as it stands is a beginning, not an end. And even as
a beginning, it works, is much better and faster than the current
code, has been tested for some time now, can be further factored to
make it simpler, easier to understand and provide infrastructure for
new features.


> That puts the whole thing back into a single high level walk and thus
> reintroduces the need for some of the continuation vs. non-continuation
> tracking wrt the op header and iclog, but ISTM that complexity can be
> managed by the continuation abstraction you've already started to
> introduce (as opposed to the current scheme of conditionally
> accumulating data_cnt). It might even be fine to dump some of the
> requisite state into a context struct to carry between iclog reservation
> and copy finish processing rather than pass around so many independent
> and poorly named variables like the current upstream implementation
> does, but that's probably getting too deep into the weeds.
> 
> FWIW, I can also see an approach of moving from the implementation in
> this patch toward something like the above, but I'm not sure I'd want to
> subject the upstream code to that process...

This is exactly what upstream is for - iterative improvement via
small steps. This is the first step of many, and what you propose
takes the code in the wrong direction for the steps I've already
taken and am planning to take.

Perfect is the enemy of good, and if upstream is not the place to
make iterative improvements like this that build towards a bigger
picture goal, then where the hell are we supposed to do them?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/45] xfs:_introduce xlog_write_partial()
  2021-05-19  4:49     ` Dave Chinner
@ 2021-05-20 12:33       ` Brian Foster
  2021-05-27 18:03         ` Darrick J. Wong
  0 siblings, 1 reply; 145+ messages in thread
From: Brian Foster @ 2021-05-20 12:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 02:49:03PM +1000, Dave Chinner wrote:
> On Thu, Mar 18, 2021 at 09:22:08AM -0400, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Handle writing of a logvec chain into an iclog that doesn't have
> > > enough space to fit it all. The iclog has already been changed to
> > > WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> > > in the iclog is exclusively owned by this logvec chain.
> > > 
> > > The difference between the single and partial cases is that
> > > we end up with partial iovec writes in the iclog and have to split
> > > a log vec regions across two iclogs. The state handling for this is
> > > currently awful and so we're building up the pieces needed to
> > > handle this more cleanly one at a time.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > FWIW, git --patience mode generates a more readable diff for this patch
> > than what it generates by default. I'm referring to that locally and
> > will try to leave feedback in the appropriate points here.
> > 
> > >  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
> > >  1 file changed, 251 insertions(+), 274 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 590c1e6db475..10916b99bf0f 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -2099,166 +2099,250 @@ xlog_print_trans(
> > >  	}
> > >  }
> > >  
> > > -static xlog_op_header_t *
> > > -xlog_write_setup_ophdr(
> > > -	struct xlog_op_header	*ophdr,
> > > -	struct xlog_ticket	*ticket)
> > > -{
> > > -	ophdr->oh_clientid = XFS_TRANSACTION;
> > > -	ophdr->oh_res2 = 0;
> > > -	ophdr->oh_flags = 0;
> > > -	return ophdr;
> > > -}
> > > -
> > >  /*
> > > - * Set up the parameters of the region copy into the log. This has
> > > - * to handle region write split across multiple log buffers - this
> > > - * state is kept external to this function so that this code can
> > > - * be written in an obvious, self documenting manner.
> > > + * Write whole log vectors into a single iclog which is guaranteed to have
> > > + * either sufficient space for the entire log vector chain to be written or
> > > + * exclusive access to the remaining space in the iclog.
> > > + *
> > > + * Return the number of iovecs and data written into the iclog, as well as
> > > + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> > > + * end of the chain).
> > >   */
> > > -static int
> > > -xlog_write_setup_copy(
> > > +static struct xfs_log_vec *
> > > +xlog_write_single(
> > > +	struct xfs_log_vec	*log_vector,
> > 
> > So xlog_write_single() was initially for single CIL xlog_write() calls
> > and now it appears to be slightly different in that it writes as many
> > full log vectors that fit in the current iclog and cycles through
> > xlog_write_partial() (and back) to process log vectors that span iclogs
> > differently from those that don't.
> 
> Yes, that is what it does, but no, you've got the process and
> meaning backwards. I wrote xlog_write_single() as it appears in
> this patch first, then split it out backwards to ease review. IOWs,
> "single" means "write everything that fits within this single
> iclog", not "only call this function if the entire lv chain fits
> inside a single iclog".
> 
> The latter is what I split out to make it simpler to review, but it
> was not the reason it was called xlog_write_single()....
> 
> > > +		do {
> > > +			/*
> > > +			 * Account for the continuation opheader before we get
> > > +			 * a new iclog. This is necessary so that we reserve
> > > +			 * space in the iclog for it.
> > > +			 */
> > > +			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {
> > 
> > (Is this ever not true here?)
> 
> It is now, wasn't always. Fixed.
> 
> > 
> > > +				*len += sizeof(struct xlog_op_header);
> > > +				ticket->t_curr_res -= sizeof(struct xlog_op_header);
> > > +			}
> > > +			error = xlog_write_get_more_iclog_space(log, ticket,
> > > +					&iclog, log_offset, *len, record_cnt,
> > > +					data_cnt, contwr);
> > > +			if (error)
> > > +				return ERR_PTR(error);
> > > +			ptr = iclog->ic_datap + *log_offset;
> > > +
> > > +			ophdr = ptr;
> > >  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > > -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> > > +			ophdr->oh_clientid = XFS_TRANSACTION;
> > > +			ophdr->oh_res2 = 0;
> > > +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> > > +
> > > +			xlog_write_adv_cnt(&ptr, len, log_offset,
> > >  						sizeof(struct xlog_op_header));
> > > -			memcpy(ptr, reg->i_addr, reg->i_len);
> > > -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> > > -			record_cnt++;
> > > -		}
> > > +			*data_cnt += sizeof(struct xlog_op_header);
> > > +
> > 
> > ... which switches to the next iclog, writes the continuation header...
> > 
> > > +			/*
> > > +			 * If rlen fits in the iclog, then end the region
> > > +			 * continuation. Otherwise we're going around again.
> > > +			 */
> > > +			reg_offset += rlen;
> > > +			rlen = reg->i_len - reg_offset;
> > > +			if (rlen <= iclog->ic_size - *log_offset)
> > > +				ophdr->oh_flags |= XLOG_END_TRANS;
> > > +			else
> > > +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> > > +
> > > +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> > > +			ophdr->oh_len = cpu_to_be32(rlen);
> > > +
> > > +			xlog_verify_dest_ptr(log, ptr);
> > > +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> > > +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> > > +			(*record_cnt)++;
> > > +			*data_cnt += rlen;
> > > +
> > > +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
> > 
> > ... writes more of the region (iclog space permitting), and then
> > determines whether we need further continuations (and partial writes of
> > the same region) or can move onto the next region, until we're done with
> > the lv.
> 
> Yup.
> 
> > I think I follow the high level flow and it seems reasonable from a
> > functional standpoint, but this also seems like quite a bit of churn for
> > not much reduction in overall complexity. The higher level loop is much
> > more simple and I think the per lv/vector iteration is an improvement,
> > but we also seem to have duplicate functionality throughout the updated
> > code and have introduced new forms of complexity around the state
> > expectations for the transitions between the different write modes and
> > between each write mode and the higher level loop.
> 
> Just untangling the code to get it to this point
> has been hard enough. I've held off doing more factoring and
> changing this code so I can actually test it and find the bugs I
> might have left in it.
> 
> Yes, it can be further improved by factoring the region copying
> stuff, but that's secondary to the major work of refactoring this
> code in the first place. The fact that you actually understood this
> fairly easily indicates just how much better this code already is
> compared to what is currently upstream....
> 

Heh. "You understood the patch, so it must be better!" :P

I've paged much of this out in the 2 months or so since this review was
posted, but my recollection is quite different. I use the existing code
as a baseline to confirm behavior and assess readability of the updated
code.

> > I.e., xlog_write_single() implements a straightforward loop to write out
> > full log vectors. That seems fine, but the outer loop of
> > xlog_write_partial() reimplements nearly the same per-region
> > functionality with some added flexibility to handle op header flags and
> > the special iclog processing associated with the continuation case. The
> > inner loop factors out the continuation iclog management bits and op
> > header injection, which I think is an improvement, but then duplicates
> > region copying (yet again) pretty much only to implement partial copies,
> > which really just involves offset management (i.e., fairly trivial
> > relative to the broader complexity of the function).
> > 
> > I dunno. I'd certainly need to stare more at this to cover all of the
> > details, but given the amount of swizzling going on in a single patch
> > I'm kind of wondering if/why we couldn't land on a single iterator in
> > the spirit of xlog_write_partial() in that it primarily iterates on
> > regions and factors out the grotty reservation and continuation
> > management bits, but doesn't unroll as much and leave so much duplicate
> > functionality around.
> > 
> > For example, it looks to me that xlog_write_partial() nearly
> > already supports a high level algorithm along the lines of the following
> > (pseudocode):
> > 
> > xlog_write(len)
> > {
> > 	get_iclog_space(len)
> > 
> > 	for_each_lv() {
> > 		for_each_reg() {
> > 			reg_offset = 0;
> > cont_write:
> > 			/* write as much as will fit in the iclog, return count,
> > 			 * and set ophdr cont flag based on write result */
> > 			reg_offset += write_region(reg, &len, &reg_offset, ophdr, ...);
> > 
> > 			/* handle continuation writes */
> > 			if (reg_offset != reg->i_len) {
> > 				get_more_iclog_space(len);
> > 				/* stamp a WAS_CONT op hdr, set END if rlen fits
> > 				 * into new space, then continue with the same region */
> > 				stamp_cont_op_hdr();
> > 				goto cont_write;
> > 			}
> > 
> > 			if (need_more_iclog_space(len))
> > 				get_more_iclog_space(len);
> > 		}
> > 	}
> > }
> 
> Yeah, na. That is exactly the mess that I've just untangled.
> 
> I don't want to rewrite this code again, and I don't want it more
> tightly tied to iclogs than it already is - I'm trying to move the
> code towards a common, simple fast path that knows nothing about
> iclogs and a slow path that handles the partial regions and
> obtaining a new buffer to write into. I want the two cases
> completely separate logic, because that makes both cases simpler to
> modify and reason about.
> 

Well, this review has been on the list for more than a couple of months
now. Given that the response seems to have appeared only after the next
version of the series, I'm not sure it's worth digging my head back into
the details to make a more detailed argument. Suffice it to say that
what I proposed was intended to be a fairly reasonable incremental step
from what you ended up at: replace the large amount of resulting
duplication with a single implementation that otherwise preserves the
majority of the other cleanups. Not a rewrite or anything of the sort.

In any event, no single one of us is ultimately the authority on
"better" or "simple." I'm just providing feedback that I didn't find the
resulting factoring a clear improvement, found it a bit annoying to have
to dig through duplicate implementations to locate the subtle and
unnecessary differences, and provided a suggestion on how to address
that concern (one that doesn't involve rewriting the thing) with
specific details on how and why I think it improves readability.
*shrug* Perhaps others will look at this, disagree with that assessment
and find the separate functions more straightforward.
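
To make the context-struct suggestion (quoted further below) concrete,
it could bundle the cursor state that xlog_write() currently threads
through as separate variables. This is a hypothetical sketch; the
struct and field names are invented, not from any posted patch.

#include <stdint.h>

struct xlog_write_ctx {
        void            *ptr;           /* next copy destination in the iclog */
        int             log_offset;     /* byte offset into the current iclog */
        int             len;            /* bytes still to be written */
        int             record_cnt;     /* opheaders stamped so far */
        uint32_t        data_cnt;       /* payload bytes copied so far */
};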

> Indeed, I want xlog_write to move away from iclogs because I want to
> use this code with direct mapped pmem regions, not just fixed memory
> buffers held in iclogs.
> 

That context, and how it relates to the proposed structure, is not
clear to me. That said, I _thought_ I looked far enough into this
series to grok how intertwined the resulting structure might be with
subsequent patches before providing feedback, but I could be mistaken.

> IOWs, the code as it stands is a beginning, not an end. And even as
> a beginning, it works, is much better and faster than the current
> code, has been tested for some time now, can be further factored to
> make it simpler, easier to understand and provide infrastructure for
> new features.
> 
> 
> > That puts the whole thing back into a single high level walk and thus
> > reintroduces the need for some of the continuation vs. non-continuation
> > tracking wrt the op header and iclog, but ISTM that complexity can be
> > managed by the continuation abstraction you've already started to
> > introduce (as opposed to the current scheme of conditionally
> > accumulating data_cnt). It might even be fine to dump some of the
> > requisite state into a context struct to carry between iclog reservation
> > and copy finish processing rather than pass around so many independent
> > and poorly named variables like the current upstream implementation
> > does, but that's probably getting too deep into the weeds.
> > 
> > FWIW, I can also see an approach of moving from the implementation in
> > this patch toward something like the above, but I'm not sure I'd want to
> > subject the upstream code to that process...
> 
> This is exactly what upstream is for - iterative improvement via
> small steps. This is the first step of many, and what you propose
> takes the code in the wrong direction for the steps I've already
> taken and are planning to take.
> 
> Perfect is the enemy of good, and if upstream is not the place to
> make iterative improvements like this that build towards a bigger
> picture goal, then where the hell are we supposed to do them?
> 

Not every incremental development step is necessarily a suitable point
for an upstream release. My comment above is basically to say that I
think this refactoring is nearly at that point, but should go a bit
further to reduce the duplication. If the argument against that step is
dependence on future work, then propose the factoring close enough to
that work such that sufficient context is available to review.

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 29/45] xfs:_introduce xlog_write_partial()
  2021-05-20 12:33       ` Brian Foster
@ 2021-05-27 18:03         ` Darrick J. Wong
  0 siblings, 0 replies; 145+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, May 20, 2021 at 08:33:04AM -0400, Brian Foster wrote:

<snipping the earlier comments out because I want only to respond to the
discussion pertaining to handling of large patchsets>

> > > I think I follow the high level flow and it seems reasonable from a
> > > functional standpoint, but this also seems like quite a bit of churn for
> > > not much reduction in overall complexity. The higher level loop is much
> > > simpler, and I think the per lv/vector iteration is an improvement,
> > > but we also seem to have duplicate functionality throughout the updated
> > > code and have introduced new forms of complexity around the state
> > > expectations for the transitions between the different write modes and
> > > between each write mode and the higher level loop.
> > 
> > Just untangling the code to get it to this point
> > has been hard enough. I've held off doing more factoring and
> > changing this code so I can actually test it and find the bugs I
> > might have left in it.
> > 
> > Yes, it can be further improved by factoring the region copying
> > stuff, but that's secondary to the major work of refactoring this
> > code in the first place. The fact that you actually understood this
> > fairly easily indicates just how much better this code already is
> > compared to what is currently upstream....
> > 
> 
> Heh. "You understood the patch, so it must be better!" :P
> 
> I've paged much of this out in the 2 months or so since this review was
> posted, but my recollection is quite different. I use the existing code
> as a baseline to confirm behavior and assess readability of the updated
> code.
> 
> > > I.e., xlog_write_single() implements a straightforward loop to write out
> > > full log vectors. That seems fine, but the outer loop of
> > > xlog_write_partial() reimplements nearly the same per-region
> > > functionality with some added flexibility to handle op header flags and
> > > the special iclog processing associated with the continuation case. The
> > > inner loop factors out the continuation iclog management bits and op
> > > header injection, which I think is an improvement, but then duplicates
> > > region copying (yet again) pretty much only to implement partial copies,
> > > which really just involves offset management (i.e., fairly trivial
> > > relative to the broader complexity of the function).
> > > 
> > > I dunno. I'd certainly need to stare more at this to cover all of the
> > > details, but given the amount of swizzling going on in a single patch
> > > I'm kind of wondering if/why we couldn't land on a single iterator in
> > > the spirit of xlog_write_partial() in that it primarily iterates on
> > > regions and factors out the grotty reservation and continuation
> > > management bits, but doesn't unroll as much and leave so much duplicate
> > > functionality around.
> > > 
> > > For example, it looks to me that xlog_write_partial() nearly
> > > already supports a high level algorithm along the lines of the following
> > > (pseudocode):
> > > 
> > > xlog_write(len)
> > > {
> > > 	get_iclog_space(len)
> > > 
> > > 	for_each_lv() {
> > > 		for_each_reg() {
> > > 			reg_offset = 0;
> > > cont_write:
> > > 			/* write as much as will fit in the iclog, return count,
> > > 			 * and set ophdr cont flag based on write result */
> > > 			reg_offset += write_region(reg, &len, &reg_offset, ophdr, ...);
> > > 
> > > 			/* handle continuation writes */
> > > 			if (reg_offset != reg->i_len) {
> > > 				get_more_iclog_space(len);
> > > 				/* stamp a WAS_CONT op hdr, set END if rlen fits
> > > 				 * into new space, then continue with the same region */
> > > 				stamp_cont_op_hdr();
> > > 				goto cont_write;
> > > 			}
> > > 
> > > 			if (need_more_iclog_space(len))
> > > 				get_more_iclog_space(len);
> > > 		}
> > > 	}
> > > }
> > 
> > Yeah, na. That is exactly the mess that I've just untangled.
> > 
> > I don't want to rewrite this code again, and I don't want it more
> > tightly tied to iclogs than it already is - I'm trying to move the
> > code towards a common, simple fast path that knows nothing about
> > iclogs and a slow path that handles the partial regions and
> > obtaining a new buffer to write into. I want the two cases
> > completely separate logic, because that makes both cases simpler to
> > modify and reason about.
> > 
> 
> Well, this review has been on the list for more than a couple of months
> now. Given that the response seems to have appeared only after the next
> version of the series, I'm not sure it's worth digging my head back into
> the details to make a more detailed argument. Suffice it to say that
> what I proposed was intended to be a fairly reasonable incremental step
> from what you ended up at: replace the large amount of resulting
> duplication with a single implementation that otherwise preserves the
> majority of the other cleanups. Not a rewrite or anything of the sort.
> 
> In any event, no single one of us is ultimately the authority on
> "better" or "simple." I'm just providing feedback that I didn't find the
> resulting factoring a clear improvement, found it a bit annoying to have
> to dig through duplicate implementations to locate the subtle and
> unnecessary differences, and provided a suggestion on how to address
> that concern (one that doesn't involve rewriting the thing) with
> specific details on how and why I think it improves readability.
> *shrug* Perhaps others will look at this, disagree with that assessment
> and find the separate functions more straightforward.

Admittedly I did look at the:

	xlog_verify_dest_ptr(log, ptr);
	memcpy(ptr, reg->i_addr + reg_offset, rlen);
	xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
	(*record_cnt)++;
	*data_cnt += rlen;

sprinkled in three places and wondered why that couldn't have been a
single function.  Eh, well.
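
Such a helper is easy enough to sketch: the body below mirrors those
five lines, with xlog_write_adv_cnt() open-coded and the kernel-only
xlog_verify_dest_ptr() check elided. The function name is invented for
illustration.

#include <stdint.h>
#include <string.h>

static void
xlog_write_copy_region(void **ptr, int *len, int *log_offset,
                       const void *src, uint32_t rlen,
                       int *record_cnt, uint32_t *data_cnt)
{
        memcpy(*ptr, src, rlen);
        /* advance the write cursor, as xlog_write_adv_cnt() does */
        *ptr = (char *)*ptr + rlen;
        *len -= rlen;
        *log_offset += rlen;
        (*record_cnt)++;
        *data_cnt += rlen;
}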

Leaving the ophdr manipulations as separate clauses actually helps me to
figure out /why/ they're different.
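
The continuation clause quoted earlier shows the shape of that
difference: XLOG_WAS_CONT_TRANS is always stamped on the new opheader,
and the bytes remaining in the region pick END vs CONTINUE. Rendered
standalone (the flag bit values below are placeholders; only the names
and the logic come from the patch):

#include <stdint.h>

#define XLOG_CONTINUE_TRANS     (1u << 0)       /* placeholder bit values */
#define XLOG_WAS_CONT_TRANS     (1u << 1)
#define XLOG_END_TRANS          (1u << 2)

static uint32_t
cont_ophdr_flags(uint32_t rlen, uint32_t space_left)
{
        uint32_t        flags = XLOG_WAS_CONT_TRANS;

        if (rlen <= space_left)
                flags |= XLOG_END_TRANS;        /* region ends in this iclog */
        else
                flags |= XLOG_CONTINUE_TRANS;   /* going around again */
        return flags;
}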

> 
> > Indeed, I want xlog_write to move away from iclogs because I want to
> > use this code with direct mapped pmem regions, not just fixed memory
> > buffers held in iclogs.
> > 
> 
> That context, and how it relates to the proposed structure, is not
> clear to me. That said, I _thought_ I looked far enough into this
> series to grok how intertwined the resulting structure might be with
> subsequent patches before providing feedback, but I could be mistaken.
> 
> > IOWs, the code as it stands is a beginning, not an end. And even as
> > a beginning, it works, is much better and faster than the current
> > code, has been tested for some time now, can be further factored to
> > make it simpler, easier to understand and provide infrastructure for
> > new features.
> > 
> > 
> > > That puts the whole thing back into a single high level walk and thus
> > > reintroduces the need for some of the continuation vs. non-continuation
> > > tracking wrt the op header and iclog, but ISTM that complexity can be
> > > managed by the continuation abstraction you've already started to
> > > introduce (as opposed to the current scheme of conditionally
> > > accumulating data_cnt). It might even be fine to dump some of the
> > > requisite state into a context struct to carry between iclog reservation
> > > and copy finish processing rather than pass around so many independent
> > > and poorly named variables like the current upstream implementation
> > > does, but that's probably getting too deep into the weeds.
> > > 
> > > FWIW, I can also see an approach of moving from the implementation in
> > > this patch toward something like the above, but I'm not sure I'd want to
> > > subject the upstream code to that process...
> > 
> > This is exactly what upstream is for - iterative improvement via
> > small steps. This is the first step of many, and what you propose
> > takes the code in the wrong direction for the steps I've already
> > taken and are planning to take.
> > 
> > Perfect is the enemy of good, and if upstream is not the place to
> > make iterative improvements like this that build towards a bigger
> > picture goal, then where the hell are we supposed to do them?
> > 
> 
> Not every incremental development step is necessarily a suitable point
> for an upstream release. My comment above is basically to say that I
> think this refactoring is nearly at that point, but should go a bit
> further to reduce the duplication. If the argument against that step is
> dependence on future work, then propose the factoring close enough to
> that work such that sufficient context is available to review.

For a short patchset I agree, but I don't think dumping the /next/ forty
patches on the list as an RFC is going to help much.  We're keyed to the
kernel release cycle, which means (to me anyway) that the criteria are a
little different for Gigantic Patchsets that are never going to land in
a single cycle.

Whereas for small patchsets I think it's reasonable to ask that all the
weird warts get fixed by the end of review, for bigger things I think
it's ok to lower that standard to "Can we understand it in case the
author disappears, and does it not introduce obvious regressions?"

I've applied the same principle to this really long story arc of adding
parent pointers to the filesystem -- yes, the delayed xattrs series has
some strange things in it structurally, but I was ok with only asking
for obvious cleanups (like fixing the naming inconsistencies) so that we
can get to the next series, which justifies all the slicing and dicing
by turning the xattr state machine into a deferred log item.

Posting the full set as a git branch somewhere, so that we can at least
pull it and see the even bigger picture, might help, though.  That has
helped immensely for reviewing the delayed xattrs series and throwing
some early feedback to Allison w.r.t. deferred xattrs.

All right, back to the latest posting.

--D

> 
> Brian
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 


end of thread  [newest: 2021-05-27 18:03 UTC]

Thread overview: 145+ messages
2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
2021-03-08 22:20   ` Darrick J. Wong
2021-03-16  8:35   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 02/45] xfs: log stripe roundoff is a property of the log Dave Chinner
2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
2021-03-08  8:34   ` Chandan Babu R
2021-03-15 14:40   ` Brian Foster
2021-03-16  8:40   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
2021-03-08  9:31   ` Chandan Babu R
2021-03-08 22:21   ` Darrick J. Wong
2021-03-15 14:40   ` Brian Foster
2021-03-16  8:41   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
2021-03-08  9:48   ` Chandan Babu R
2021-03-08 22:24     ` Darrick J. Wong
2021-03-15 14:41       ` Brian Foster
2021-03-15 16:32         ` Darrick J. Wong
2021-03-16  8:43           ` Christoph Hellwig
2021-03-08 22:26   ` Darrick J. Wong
2021-03-15 14:42   ` Brian Foster
2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
2021-03-15 14:43   ` Brian Foster
2021-03-16  8:47   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
2021-03-15 14:45   ` Brian Foster
2021-03-16 14:15   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
2021-03-08 10:49   ` Chandan Babu R
2021-03-08 12:25   ` Brian Foster
2021-03-09  1:13     ` Dave Chinner
2021-03-10 20:49       ` Brian Foster
2021-03-10 21:28         ` Dave Chinner
2021-03-05  5:11 ` [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
2021-03-05  5:11 ` [PATCH 10/45] xfs: reduce buffer log item shadow allocations Dave Chinner
2021-03-15 14:52   ` Brian Foster
2021-03-05  5:11 ` [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
2021-03-05  5:11 ` [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
2021-03-05  5:11 ` [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
2021-03-08 22:53   ` Darrick J. Wong
2021-03-11  0:26     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-03-08 23:45   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 15/45] xfs: CIL work is serialised, not pipelined Dave Chinner
2021-03-08 23:14   ` Darrick J. Wong
2021-03-08 23:38     ` Dave Chinner
2021-03-09  1:55       ` Darrick J. Wong
2021-03-09 22:35         ` Andi Kleen
2021-03-10  6:11           ` Dave Chinner
2021-03-05  5:11 ` [PATCH 16/45] xfs: type verification is expensive Dave Chinner
2021-03-05  5:11 ` [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check Dave Chinner
2021-03-05  5:11 ` [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks Dave Chinner
2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
2021-03-08 23:47   ` Darrick J. Wong
2021-03-16 14:50   ` Brian Foster
2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
2021-03-09  0:07   ` Darrick J. Wong
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Dave Chinner
2021-03-09  0:15   ` Darrick J. Wong
2021-03-11  2:54     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Dave Chinner
2021-03-09  0:17   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
2021-03-09  0:21   ` Darrick J. Wong
2021-03-09  1:19     ` Dave Chinner
2021-03-09  1:48       ` Darrick J. Wong
2021-03-11  3:01         ` Dave Chinner
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
2021-03-09  2:14   ` Darrick J. Wong
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
2021-03-09  2:21   ` Darrick J. Wong
2021-03-11  3:29     ` Dave Chinner
2021-03-11  3:41       ` Darrick J. Wong
2021-03-16 14:54         ` Brian Foster
2021-03-16 14:53   ` Brian Foster
2021-05-19  3:18     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
2021-03-09  2:31   ` Darrick J. Wong
2021-03-16 14:55   ` Brian Foster
2021-05-19  3:27     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
2021-03-09  2:36   ` Darrick J. Wong
2021-03-11  3:37     ` Dave Chinner
2021-03-16 18:38   ` Brian Foster
2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
2021-03-09  2:39   ` Darrick J. Wong
2021-03-11  4:19     ` Dave Chinner
2021-03-16 18:39   ` Brian Foster
2021-05-19  3:44     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 29/45] xfs:_introduce xlog_write_partial() Dave Chinner
2021-03-09  2:59   ` Darrick J. Wong
2021-03-11  4:33     ` Dave Chinner
2021-03-18 13:22   ` Brian Foster
2021-05-19  4:49     ` Dave Chinner
2021-05-20 12:33       ` Brian Foster
2021-05-27 18:03         ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Dave Chinner
2021-03-09  3:01   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Dave Chinner
2021-03-09  3:16   ` Darrick J. Wong
2021-03-11  5:03     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Dave Chinner
2021-03-10 23:01   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
2021-03-10 23:25   ` Darrick J. Wong
2021-03-11  5:42     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 34/45] xfs: rework per-iclog header CIL reservation Dave Chinner
2021-03-11  0:03   ` Darrick J. Wong
2021-03-11  6:03     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 35/45] xfs: introduce per-cpu CIL tracking sructure Dave Chinner
2021-03-11  0:11   ` Darrick J. Wong
2021-03-11  6:33     ` Dave Chinner
2021-03-11  6:42       ` Dave Chinner
2021-03-05  5:11 ` [PATCH 36/45] xfs: implement percpu cil space used calculation Dave Chinner
2021-03-11  0:20   ` Darrick J. Wong
2021-03-11  6:51     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Dave Chinner
2021-03-11  0:26   ` Darrick J. Wong
2021-03-12  0:47     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Dave Chinner
2021-03-11  0:36   ` Darrick J. Wong
2021-03-12  1:15     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 39/45] xfs: Add order IDs to log items in CIL Dave Chinner
2021-03-11  1:00   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Dave Chinner
2021-03-11  1:15   ` Darrick J. Wong
2021-03-12  2:18     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 41/45] xfs: move CIL ordering to the logvec chain Dave Chinner
2021-03-11  1:34   ` Darrick J. Wong
2021-03-12  2:29     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Dave Chinner
2021-03-11  1:36   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 43/45] xfs: avoid cil push lock if possible Dave Chinner
2021-03-11  1:47   ` Darrick J. Wong
2021-03-12  2:36     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
2021-03-11  2:00   ` Darrick J. Wong
2021-03-16  3:04     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 45/45] xfs: expanding delayed logging design with background material Dave Chinner
2021-03-11  2:30   ` Darrick J. Wong
2021-03-16  3:28     ` Dave Chinner
