* [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1)
@ 2022-05-27 13:02 Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 1/5] xfs: detect overflows in bmbt records Amir Goldstein
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable

Hi Greg and Sasha!

It has been a while since you heard from the xfs team.

We are trying to change things and get xfs fixes flowing to stable
again. Crossing my fingers that we will make this last this time :)

Please see this message from Darrick [4] about xfs stable plans.
My team will be focusing on 5.10.y and Ted and Leah's team will be
focusing on 5.15.y at this time.

This v2 is being sent to stable after testing and after v1 was sent
for review on the xfs list [5].

v2 includes an extra patch that Christoph has backported and tested
and was going to send to stable.

Please see my cover letter to xfs with more details about my plans
for 5.10.y below:

Hi all!

During LSFMM 2022, I had the opportunity to speak with developers
from several different companies who showed interest in collaborating
on the effort of improving the state of xfs code in LTS kernels.

I would like to kick off this effort for the 5.10 LTS kernel, in the
hope that others will join me in the future to produce a better common
baseline for everyone to build on.

This is the first of 6 series of stable patch candidates that
I collected from xfs releases v5.11..v5.18 [1].

My intention is to post the parts for review on the xfs list on
a ~weekly basis and forward them to stable only after xfs developers
have had the chance to review the selection.

I used a gadget that I developed, "b4 rn", which produces high-level
"release notes" with references to the posted patch series and also
looks for mentions of fstest names in the discussions on lore.
I then used an elimination process to select the stable tree candidate
patches. The selection process is documented in the git log of [1].

After I had candidates, Luis helped me set up a kdevops testing
environment on a server that Samsung has contributed to the effort.
Luis and I have spent a considerable amount of time establishing the
expunge lists that produce stable baseline results for v5.10.y [2].
Eventually, we ran the auto group test over 100 times to sanitize the
baseline, on the following configurations:
reflink_normapbt (default), reflink, reflink_1024, nocrc, nocrc_512.

The patches in this part are from around the v5.11 release.
They have been through 36 auto group runs with the configs listed above
and no regressions from baseline were observed.

At least two of the fixes have regression tests in fstests that were used
to verify the fix. I also annotated [3] the fix commits in the tests.

I would like to thank Luis for his huge part in this still ongoing effort
and I would like to thank Samsung for contributing the hardware resources
to drive this effort.

Your input on the selection in this part and in upcoming parts [1]
is most welcome!

Thanks,
Amir.

[1] https://github.com/amir73il/b4/blob/xfs-5.10.y/xfs-5.10..5.17-fixes.rst
[2] https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/5.10.105/xfs/unassigned
[3] https://lore.kernel.org/fstests/20220520143249.2103631-1-amir73il@gmail.com/
[4] https://lore.kernel.org/linux-xfs/Yo6ePjvpC7nhgek+@magnolia/
[5] https://lore.kernel.org/linux-xfs/20220525111715.2769700-1-amir73il@gmail.com/

Changes since v1:
- Send to stable
- Add patch from Christoph

Darrick J. Wong (3):
  xfs: detect overflows in bmbt records
  xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks
  xfs: fix an ABBA deadlock in xfs_rename

Dave Chinner (1):
  xfs: Fix CIL throttle hang when CIL space used going backwards

Kaixu Xia (1):
  xfs: show the proper user quota options

 fs/xfs/libxfs/xfs_bmap.c    |  5 +++++
 fs/xfs/libxfs/xfs_dir2.h    |  2 --
 fs/xfs/libxfs/xfs_dir2_sf.c |  2 +-
 fs/xfs/xfs_buf_item.c       | 37 ++++++++++++++++----------------
 fs/xfs/xfs_inode.c          | 42 ++++++++++++++++++++++---------------
 fs/xfs/xfs_inode_item.c     | 14 +++++++++++++
 fs/xfs/xfs_iwalk.c          |  2 +-
 fs/xfs/xfs_log_cil.c        | 22 ++++++++++++++-----
 fs/xfs/xfs_super.c          | 10 +++++----
 9 files changed, 87 insertions(+), 49 deletions(-)

-- 
2.25.1


* [PATCH 5.10 v2 1/5] xfs: detect overflows in bmbt records
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
@ 2022-05-27 13:02 ` Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 2/5] xfs: show the proper user quota options Amir Goldstein
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable

From: "Darrick J. Wong" <darrick.wong@oracle.com>

commit acf104c2331c1ba2a667e65dd36139d1555b1432 upstream.

Detect file block mappings with a blockcount that is either zero or so
large that integer overflow occurs, because neither is valid in the
filesystem.  Worse yet, attempting directory modifications causes the
iext code to trip over the bmbt key handling and takes the filesystem
down.  We can fix most of this by preventing the bad metadata from
entering the incore structures in the first place.

Found by setting blockcount=0 in a directory data fork mapping and
watching the fireworks.
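
As an illustration only (not part of the patch), the check can be
sketched in plain C.  The struct and function names below are
hypothetical stand-ins, and the fields are assumed to be unsigned
64-bit values.  With unsigned arithmetic, "start + len <= start" is
true exactly when len is zero or the addition wrapped around:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three mapping fields the check reads. */
struct mapping {
	uint64_t startoff;	/* file offset, in blocks */
	uint64_t startblock;	/* disk address, in blocks */
	uint64_t blockcount;	/* length, in blocks */
};

/*
 * Return true if the mapping is empty or if start + length wraps.
 * Unsigned overflow is well defined in C, so the comparison is safe.
 */
bool mapping_len_is_bad(const struct mapping *m)
{
	assert(m);
	if (m->startblock + m->blockcount <= m->startblock)
		return true;
	if (m->startoff + m->blockcount <= m->startoff)
		return true;
	return false;
}
```

A zero-length mapping and a mapping whose end wraps past the top of
the address space are both rejected by the same comparison.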

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d9a692484eae..de9c27ef68d8 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6229,6 +6229,11 @@ xfs_bmap_validate_extent(
 	xfs_fsblock_t		endfsb;
 	bool			isrt;
 
+	if (irec->br_startblock + irec->br_blockcount <= irec->br_startblock)
+		return __this_address;
+	if (irec->br_startoff + irec->br_blockcount <= irec->br_startoff)
+		return __this_address;
+
 	isrt = XFS_IS_REALTIME_INODE(ip);
 	endfsb = irec->br_startblock + irec->br_blockcount - 1;
 	if (isrt && whichfork == XFS_DATA_FORK) {
-- 
2.25.1


* [PATCH 5.10 v2 2/5] xfs: show the proper user quota options
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 1/5] xfs: detect overflows in bmbt records Amir Goldstein
@ 2022-05-27 13:02 ` Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 3/5] xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks Amir Goldstein
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable, Kaixu Xia

From: Kaixu Xia <kaixuxia@tencent.com>

commit 237d7887ae723af7d978e8b9a385fdff416f357b upstream.

The quota option 'usrquota' should be shown if both the XFS_UQUOTA_ACCT
and XFS_UQUOTA_ENFD flags are set. The option 'uqnoenforce' should be
shown when only the XFS_UQUOTA_ACCT flag is set. The current code logic
seems wrong. Fix it and show the proper options.
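
As an illustration only (not part of the patch), the difference
between the two tests can be sketched in user-space C.  The flag
values and function names below are made up for the example; the real
XFS_UQUOTA_* bits live in the kernel headers.  The old test masks
with both bits at once, so it is true when either bit is set and the
"uqnoenforce" branch is never reached:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative values only, not the kernel's definitions. */
#define UQUOTA_ACCT	0x1u
#define UQUOTA_ENFD	0x2u

/* Fixed logic: test accounting first, then enforcement. */
const char *user_quota_option(unsigned int qflags)
{
	if (!(qflags & UQUOTA_ACCT))
		return NULL;		/* no user quota option shown */
	return (qflags & UQUOTA_ENFD) ? "usrquota" : "uqnoenforce";
}

/* Old logic: "any bit set" test shadows the else-if branch. */
const char *user_quota_option_old(unsigned int qflags)
{
	if (qflags & (UQUOTA_ACCT | UQUOTA_ENFD))
		return "usrquota";
	else if (qflags & UQUOTA_ACCT)
		return "uqnoenforce";
	return NULL;
}
```

With only the accounting bit set, the old function reports
"usrquota" while the fixed one reports "uqnoenforce".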

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/xfs/xfs_super.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e3e229e52512..5ebd6cdc44a7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -199,10 +199,12 @@ xfs_fs_show_options(
 		seq_printf(m, ",swidth=%d",
 				(int)XFS_FSB_TO_BB(mp, mp->m_swidth));
 
-	if (mp->m_qflags & (XFS_UQUOTA_ACCT|XFS_UQUOTA_ENFD))
-		seq_puts(m, ",usrquota");
-	else if (mp->m_qflags & XFS_UQUOTA_ACCT)
-		seq_puts(m, ",uqnoenforce");
+	if (mp->m_qflags & XFS_UQUOTA_ACCT) {
+		if (mp->m_qflags & XFS_UQUOTA_ENFD)
+			seq_puts(m, ",usrquota");
+		else
+			seq_puts(m, ",uqnoenforce");
+	}
 
 	if (mp->m_qflags & XFS_PQUOTA_ACCT) {
 		if (mp->m_qflags & XFS_PQUOTA_ENFD)
-- 
2.25.1


* [PATCH 5.10 v2 3/5] xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 1/5] xfs: detect overflows in bmbt records Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 2/5] xfs: show the proper user quota options Amir Goldstein
@ 2022-05-27 13:02 ` Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 4/5] xfs: fix an ABBA deadlock in xfs_rename Amir Goldstein
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable, zlang, Dave Chinner

From: "Darrick J. Wong" <darrick.wong@oracle.com>

commit a5336d6bb2d02d0e9d4d3c8be04b80b8b68d56c8 upstream.

In commit 27c14b5daa82 we started tracking the last inode seen during an
inode walk to avoid infinite loops if a corrupt inobt record happens to
have a lower ir_startino than the record preceding it.  Unfortunately,
the assertion trips over the case where there are completely empty inobt
records (which can happen quite easily on 64k page filesystems) because
we advance the tracking cursor without actually putting the empty record
into the processing buffer.  Fix the assert to allow for this case.
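
As an illustration only (not part of the patch), here is a simplified
user-space model of the situation.  All names are hypothetical, and
the model treats a record as empty when all 64 inodes are free, which
glosses over details of the real walk code.  The tracking cursor moves
for every record, but only non-empty records are buffered, so the
cursor can end up further ahead than "last buffered record + 64":

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64u	/* same value as XFS_INODES_PER_CHUNK */

/* Hypothetical, cut-down inobt record. */
struct rec {
	uint32_t startino;	/* first inode in the chunk */
	uint32_t freecount;	/* number of free inodes in the chunk */
};

/* Hypothetical walk state. */
struct walk {
	uint32_t next_agino;		/* first inode after last record seen */
	uint32_t last_cached_startino;	/* startino of last buffered record */
	int	 nr_cached;		/* number of buffered records */
};

/* Advance the cursor for every record, but buffer only non-empty ones. */
void walk_visit(struct walk *w, const struct rec *r)
{
	assert(r->startino >= w->next_agino);	/* must move forward */
	w->next_agino = r->startino + INODES_PER_CHUNK;
	if (r->freecount == INODES_PER_CHUNK)
		return;				/* empty record: skipped */
	w->last_cached_startino = r->startino;
	w->nr_cached++;
}
```

After visiting a full chunk at inode 0 followed by an empty chunk at
inode 64, the cursor is at 128 while the last buffered record ends at
64, so an equality assertion would trip and ">=" would not.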

Reported-by: zlang@redhat.com
Fixes: 27c14b5daa82 ("xfs: ensure inobt record walks always make forward progress")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/xfs/xfs_iwalk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 2a45138831e3..eae3aff9bc97 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -363,7 +363,7 @@ xfs_iwalk_run_callbacks(
 	/* Delete cursor but remember the last record we cached... */
 	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
 	irec = &iwag->recs[iwag->nr_recs - 1];
-	ASSERT(next_agino == irec->ir_startino + XFS_INODES_PER_CHUNK);
+	ASSERT(next_agino >= irec->ir_startino + XFS_INODES_PER_CHUNK);
 
 	error = xfs_iwalk_ag_recs(iwag);
 	if (error)
-- 
2.25.1


* [PATCH 5.10 v2 4/5] xfs: fix an ABBA deadlock in xfs_rename
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
                   ` (2 preceding siblings ...)
  2022-05-27 13:02 ` [PATCH 5.10 v2 3/5] xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks Amir Goldstein
@ 2022-05-27 13:02 ` Amir Goldstein
  2022-05-27 13:02 ` [PATCH 5.10 v2 5/5] xfs: Fix CIL throttle hang when CIL space used going backwards Amir Goldstein
  2022-06-03 14:32 ` [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Greg Kroah-Hartman
  5 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable, wenli xie, Brian Foster

From: "Darrick J. Wong" <darrick.wong@oracle.com>

commit 6da1b4b1ab36d80a3994fd4811c8381de10af604 upstream.

When overlayfs is running on top of xfs and the user unlinks a file in
the overlay, overlayfs will create a whiteout inode and ask xfs to
"rename" the whiteout file atop the one being unlinked.  If the file
being unlinked loses its one nlink, we then have to put the inode on the
unlinked list.

This requires us to grab the AGI buffer of the whiteout inode to take it
off the unlinked list (which is where whiteouts are created) and to grab
the AGI buffer of the file being deleted.  If the whiteout was created
in a higher numbered AG than the file being deleted, we'll lock the AGIs
in the wrong order and deadlock.

Therefore, grab all the AGI locks we think we'll need ahead of time, and
in order of increasing AG number per the locking rules.
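
As an illustration only (not part of the patch), the ordering rule can
be sketched in user-space C.  The helper below is hypothetical: it
takes the AG numbers a task expects to need, sorts them ascending and
drops duplicates, giving the order in which the locks would be taken.
The patch itself does not sort AG numbers; it appears to rely on the
inodes[] array already being in sorted order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for unsigned AG numbers, ascending. */
static int cmp_agno(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the AG numbers in place and remove duplicates.  Returns the
 * number of distinct AGs; agnos[0..ret-1] is the lock order.
 */
size_t ag_lock_order(unsigned int *agnos, size_t n)
{
	size_t i, out = 0;

	if (n == 0)
		return 0;
	assert(agnos);
	qsort(agnos, n, sizeof(*agnos), cmp_agno);
	for (i = 0; i < n; i++) {
		if (out == 0 || agnos[out - 1] != agnos[i])
			agnos[out++] = agnos[i];
	}
	return out;
}
```

If every task takes its locks in the order this helper returns, two
tasks can never hold one AG's lock each while waiting for the other's,
which is the ABBA pattern described above.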

Reported-by: wenli xie <wlxie7296@gmail.com>
Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/xfs/libxfs/xfs_dir2.h    |  2 --
 fs/xfs/libxfs/xfs_dir2_sf.c |  2 +-
 fs/xfs/xfs_inode.c          | 42 ++++++++++++++++++++++---------------
 3 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index e55378640b05..d03e6098ded9 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -47,8 +47,6 @@ extern int xfs_dir_lookup(struct xfs_trans *tp, struct xfs_inode *dp,
 extern int xfs_dir_removename(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t ino,
 				xfs_extlen_t tot);
-extern bool xfs_dir2_sf_replace_needblock(struct xfs_inode *dp,
-				xfs_ino_t inum);
 extern int xfs_dir_replace(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t inum,
 				xfs_extlen_t tot);
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 2463b5d73447..8c4f76bba88b 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -1018,7 +1018,7 @@ xfs_dir2_sf_removename(
 /*
  * Check whether the sf dir replace operation need more blocks.
  */
-bool
+static bool
 xfs_dir2_sf_replace_needblock(
 	struct xfs_inode	*dp,
 	xfs_ino_t		inum)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2bfbcf28b1bd..e958b1c74561 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3152,7 +3152,7 @@ xfs_rename(
 	struct xfs_trans	*tp;
 	struct xfs_inode	*wip = NULL;		/* whiteout inode */
 	struct xfs_inode	*inodes[__XFS_SORT_INODES];
-	struct xfs_buf		*agibp;
+	int			i;
 	int			num_inodes = __XFS_SORT_INODES;
 	bool			new_parent = (src_dp != target_dp);
 	bool			src_is_directory = S_ISDIR(VFS_I(src_ip)->i_mode);
@@ -3265,6 +3265,30 @@ xfs_rename(
 		}
 	}
 
+	/*
+	 * Lock the AGI buffers we need to handle bumping the nlink of the
+	 * whiteout inode off the unlinked list and to handle dropping the
+	 * nlink of the target inode.  Per locking order rules, do this in
+	 * increasing AG order and before directory block allocation tries to
+	 * grab AGFs because we grab AGIs before AGFs.
+	 *
+	 * The (vfs) caller must ensure that if src is a directory then
+	 * target_ip is either null or an empty directory.
+	 */
+	for (i = 0; i < num_inodes && inodes[i] != NULL; i++) {
+		if (inodes[i] == wip ||
+		    (inodes[i] == target_ip &&
+		     (VFS_I(target_ip)->i_nlink == 1 || src_is_directory))) {
+			struct xfs_buf	*bp;
+			xfs_agnumber_t	agno;
+
+			agno = XFS_INO_TO_AGNO(mp, inodes[i]->i_ino);
+			error = xfs_read_agi(mp, tp, agno, &bp);
+			if (error)
+				goto out_trans_cancel;
+		}
+	}
+
 	/*
 	 * Directory entry creation below may acquire the AGF. Remove
 	 * the whiteout from the unlinked list first to preserve correct
@@ -3317,22 +3341,6 @@ xfs_rename(
 		 * In case there is already an entry with the same
 		 * name at the destination directory, remove it first.
 		 */
-
-		/*
-		 * Check whether the replace operation will need to allocate
-		 * blocks.  This happens when the shortform directory lacks
-		 * space and we have to convert it to a block format directory.
-		 * When more blocks are necessary, we must lock the AGI first
-		 * to preserve locking order (AGI -> AGF).
-		 */
-		if (xfs_dir2_sf_replace_needblock(target_dp, src_ip->i_ino)) {
-			error = xfs_read_agi(mp, tp,
-					XFS_INO_TO_AGNO(mp, target_ip->i_ino),
-					&agibp);
-			if (error)
-				goto out_trans_cancel;
-		}
-
 		error = xfs_dir_replace(tp, target_dp, target_name,
 					src_ip->i_ino, spaceres);
 		if (error)
-- 
2.25.1


* [PATCH 5.10 v2 5/5] xfs: Fix CIL throttle hang when CIL space used going backwards
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
                   ` (3 preceding siblings ...)
  2022-05-27 13:02 ` [PATCH 5.10 v2 4/5] xfs: fix an ABBA deadlock in xfs_rename Amir Goldstein
@ 2022-05-27 13:02 ` Amir Goldstein
  2022-06-03 14:32 ` [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Greg Kroah-Hartman
  5 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2022-05-27 13:02 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable, Dave Chinner, Donald Buczek, Brian Foster,
	Chandan Babu R, Darrick J . Wong, Allison Henderson

From: Dave Chinner <dchinner@redhat.com>

commit 19f4e7cc819771812a7f527d7897c2deffbf7a00 upstream.

A hang with tasks stuck on the CIL hard throttle was reported and
largely diagnosed by Donald Buczek, who discovered that it was a
result of the CIL context space usage decrementing in committed
transactions once the hard throttle limit had been hit and processes
were already blocked.  This resulted in the CIL push not waking up
those waiters because the CIL context was no longer over the hard
throttle limit.

The surprising aspect of this was the CIL space usage going
backwards regularly enough to trigger this situation. Assumptions
had been made in design that the relogging process would only
increase the size of the objects in the CIL, and so that space would
only increase.

This change and commit message fix the issue and document the
result of an audit of the triggers that can cause the CIL space to
go backwards, how large the backwards steps tend to be, the
frequency with which they occur, and what the impact on the CIL
accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only
do so if the log item is already logged to the CIL and contains a
space reservation for its entire logged state. This is tracked by
the shadow buffer state on the log item. If the item is not
previously logged in the CIL it has no shadow buffer nor log vector,
and hence the entire size of the logged item copied to the log
vector is accounted to the CIL space usage. i.e.  it will always go
up in this case.

If the item has a log vector (i.e. already in the CIL) and the size
decreases, then the existing log vector will be overwritten and the
space usage will go down. This is the only condition where the space
usage reduces, and it can only occur when an item is already tracked
in the CIL. Hence we are safe from CIL space usage underruns as a
result of log items decreasing in size when they are relogged.

Typically this reduction in CIL usage occurs from metadata blocks
being freed, such as when a btree block merge occurs or a directory
entry/xattr entry is removed and the da-tree is reduced in size.
This generally results in a reduction in size of around a single
block in the CIL, but also tends to increase the number of log
vectors because the parent and sibling nodes in the tree need to be
updated when a btree block is removed. If a multi-level merge
occurs, then we see a reduction in size of 2+ blocks, but again the
log vector count goes up.

The other vector is inode fork size changes, which only log the
current size of the fork and ignore the previously logged size when
the fork is relogged. Hence if we are removing items from the inode
fork (dir/xattr removal in shortform, extent record removal in
extent form, etc) the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a
fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
an intent actually creates a new intent log item and doesn't relog
the old item at all.) Hence the only two vectors for CIL context
size reduction are relogging inode forks and marking buffers active
in the CIL as stale.

Long story short: the majority of the code does the right thing and
handles the reduction in log item size correctly, and only the CIL
hard throttle implementation is problematic and needs fixing. This
patch makes that fix, as well as adds comments in the log item code
that result in items shrinking in size when they are relogged as a
clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it
goes further to ensure that once the throttle is activated, it
captures all tasks until the CIL push issues a wakeup, regardless of
whether the CIL space used has gone back under the throttle
threshold.

This ensures that we prevent tasks reducing the CIL slightly under
the throttle threshold and then making more changes that push it
well over the throttle limit. This is achieved by checking if the
throttle wait queue is already active as a condition of throttling.
Hence once we start throttling, we continue to apply the throttle
until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and
checks as they are all done under the ctx->xc_push_lock. Hence the
waitqueue has external serialisation and we can safely peek inside
the wait queue without holding the internal waitqueue locks.
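
As an illustration only (not part of the patch), the old and new
conditions can be compared in a small user-space model.  The struct
and function names are hypothetical; the waiters counter stands in
for waitqueue_active() returning true, and locking is left out
entirely:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the throttle looks at. */
struct cil_model {
	long space_used;	/* stands in for ctx->space_used */
	long hard_limit;	/* stands in for the blocking space limit */
	int  waiters;		/* > 0 models an active wait queue */
};

/* Old rule: throttle a commit only while over the hard limit. */
bool throttle_old(const struct cil_model *c)
{
	assert(c);
	return c->space_used >= c->hard_limit;
}

/* New rule: also throttle if anyone is already waiting. */
bool throttle_new(const struct cil_model *c)
{
	assert(c);
	return c->space_used >= c->hard_limit || c->waiters > 0;
}

/* Old rule: the push wakes waiters only if still over the limit. */
bool push_wakes_old(const struct cil_model *c)
{
	assert(c);
	return c->space_used >= c->hard_limit;
}

/* New rule: the push wakes waiters whenever there are any. */
bool push_wakes_new(const struct cil_model *c)
{
	assert(c);
	return c->waiters > 0;
}
```

With space_used just under the limit and two tasks already blocked,
the old push rule would skip the wakeup and leave them stuck, while
the new rules wake them and keep throttling new commits until then.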

Many thanks to Donald for his diagnostic and analysis work to
isolate the cause of this hang.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf_item.c   | 37 ++++++++++++++++++-------------------
 fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
 fs/xfs/xfs_log_cil.c    | 22 +++++++++++++++++-----
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0356f2e340a1..8c6e26d62ef2 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,14 +56,12 @@ xfs_buf_log_format_size(
 }
 
 /*
- * This returns the number of log iovecs needed to log the
- * given buf log item.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item segment.
  *
- * It calculates this as 1 iovec for the buf log format structure
- * and 1 for each stretch of non-contiguous chunks to be logged.
- * Contiguous chunks are logged in a single iovec.
- *
- * If the XFS_BLI_STALE flag has been set, then log nothing.
+ * It calculates this as 1 iovec for the buf log format structure and 1 for each
+ * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
+ * in a single iovec.
  */
 STATIC void
 xfs_buf_item_size_segment(
@@ -119,11 +117,8 @@ xfs_buf_item_size_segment(
 }
 
 /*
- * This returns the number of log iovecs needed to log the given buf log item.
- *
- * It calculates this as 1 iovec for the buf log format structure and 1 for each
- * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
- * in a single iovec.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item.
  *
  * Discontiguous buffers need a format structure per region that is being
  * logged. This makes the changes in the buffer appear to log recovery as though
@@ -133,7 +128,11 @@ xfs_buf_item_size_segment(
  * what ends up on disk.
  *
  * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
- * format structures.
+ * format structures. If the item has previously been logged and has dirty
+ * regions, we do not relog them in stale buffers. This has the effect of
+ * reducing the size of the relogged item by the amount of dirty data tracked
+ * by the log item. This can result in the committing transaction reducing the
+ * amount of space being consumed by the CIL.
  */
 STATIC void
 xfs_buf_item_size(
@@ -147,9 +146,9 @@ xfs_buf_item_size(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
-		 * The buffer is stale, so all we need to log
-		 * is the buf log format structure with the
-		 * cancel flag in it.
+		 * The buffer is stale, so all we need to log is the buf log
+		 * format structure with the cancel flag in it as we are never
+		 * going to replay the changes tracked in the log item.
 		 */
 		trace_xfs_buf_item_size_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
@@ -164,9 +163,9 @@ xfs_buf_item_size(
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
-		 * The buffer has been logged just to order it.
-		 * It is not being included in the transaction
-		 * commit, so no vectors are used at all.
+		 * The buffer has been logged just to order it. It is not being
+		 * included in the transaction commit, so no vectors are used at
+		 * all.
 		 */
 		trace_xfs_buf_item_size_ordered(bip);
 		*nvecs = XFS_LOG_VEC_ORDERED;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 17e20a6d8b4e..6ff91e5bf3cd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+/*
+ * The logged size of an inode fork is always the current size of the inode
+ * fork. This means that when an inode fork is relogged, the size of the logged
+ * region is determined by the current state, not the combination of the
+ * previously logged state + the current state. This is different relogging
+ * behaviour to most other log items which will retain the size of the
+ * previously logged changes when smaller regions are relogged.
+ *
+ * Hence operations that remove data from the inode fork (e.g. shortform
+ * dir/attr remove, extent form extent removal, etc), the size of the relogged
+ * inode gets -smaller- rather than stays the same size as the previously logged
+ * size and this can result in the committing transaction reducing the amount of
+ * space being consumed by the CIL.
+ */
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..cd5c04dabe2e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -668,9 +668,14 @@ xlog_cil_push_work(
 	ASSERT(push_seq <= ctx->sequence);
 
 	/*
-	 * Wake up any background push waiters now this context is being pushed.
+	 * As we are about to switch to a new, empty CIL context, we no longer
+	 * need to throttle tasks on CIL space overruns. Wake any waiters that
+	 * the hard push throttle may have caught so they can start committing
+	 * to the new context. The ctx->xc_push_lock provides the serialisation
+	 * necessary for safely using the lockless waitqueue_active() check in
+	 * this context.
 	 */
-	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+	if (waitqueue_active(&cil->xc_push_wait))
 		wake_up_all(&cil->xc_push_wait);
 
 	/*
@@ -907,7 +912,7 @@ xlog_cil_push_background(
 	ASSERT(!list_empty(&cil->xc_cil));
 
 	/*
-	 * don't do a background push if we haven't used up all the
+	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
 	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
@@ -931,9 +936,16 @@ xlog_cil_push_background(
 
 	/*
 	 * If we are well over the space limit, throttle the work that is being
-	 * done until the push work on this context has begun.
+	 * done until the push work on this context has begun. Enforce the hard
+	 * throttle on all transaction commits once it has been activated, even
+	 * if the committing transactions have resulted in the space usage
+	 * dipping back down under the hard limit.
+	 *
+	 * The ctx->xc_push_lock provides the serialisation necessary for safely
+	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
 		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
-- 
2.25.1


* Re: [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1)
  2022-05-27 13:02 [PATCH 5.10 v2 0/5] xfs fixes for 5.10.y (part 1) Amir Goldstein
                   ` (4 preceding siblings ...)
  2022-05-27 13:02 ` [PATCH 5.10 v2 5/5] xfs: Fix CIL throttle hang when CIL space used going backwards Amir Goldstein
@ 2022-06-03 14:32 ` Greg Kroah-Hartman
  5 siblings, 0 replies; 7+ messages in thread
From: Greg Kroah-Hartman @ 2022-06-03 14:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sasha Levin, Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, Leah Rumancik,
	Chandan Babu R, Adam Manzanares, Tyler Hicks, Jan Kara,
	linux-xfs, stable

On Fri, May 27, 2022 at 04:02:14PM +0300, Amir Goldstein wrote:
> Hi Greg and Shasha!
> 
> It has been a while since you heard from xfs team.
> 
> We are trying to change things and get xfs fixes flowing to stable
> again. Crossing my fingers that we will make this last this time :)
> 
> Please see this message from Darrick [4] about xfs stable plans.
> My team will be focusing on 5.10.y and Ted and Leah's team will be
> focusing on 5.15.y at this time.
> 
> This v2 is being sent to stable after testing and after v1 was sent
> for review of the xfs list [5].
> 
> v2 includes an extra patch that Christoph has backported and tested
> and was going to send to stable.
> 
> Please see my cover letter to xfs with more details about my plans
> for 5.10.y below:

All now queued up, thanks for doing this and I look forward to more xfs
patches being sent to us!

greg k-h

