* [PATCH 00/11] xfs-4.18: online scrub fixes
@ 2018-04-18  2:39 Darrick J. Wong
  2018-04-18  2:39 ` [PATCH 01/11] xfs: skip scrub xref if corruption already noted Darrick J. Wong
                   ` (10 more replies)
  0 siblings, 11 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:39 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Here's a series cleaning up some problems and bugs in the online scrub
code.  Most of the problems fixed here were uncovered through more
thorough testing of the online repair series, which will be posted
shortly.

First, we amend scrub to skip cross-referencing once the first error is
found; this patch hasn't changed since its posting during the 4.17 cycle.

The next five patches refactor the quota code to make it possible to
iterate all the dquots in the system with the quota inode's ILOCK held.
After this, we can drop all the weird code that attaches to the quota ip
in favor of the well-known mechanism that we use for all other inodes.
This enables us to run the data fork scrubber on the quota inodes and to
reuse the dquot iterator that we created earlier.

The next two patches fix some broken interaction between scrub and the
rest of xfs.

Following that are three fixes to the scrubbers themselves -- the first
checks that a btree block never contains fewer than minrecs records
unless it's the top block in a tree.  The second refactors the
transaction allocation helper, and the last one fixes a potential ABBA
deadlock in the parent scrubber.

If you're going to start using this mess, you probably ought to just
pull from my git trees.  The kernel patches[1] should apply against
4.17-rc1.  xfsprogs[2] and xfstests[3] can be found in their usual
places.  The git trees contain all four series' worth of changes.

This is an extraordinary way to eat your data.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/11] xfs: skip scrub xref if corruption already noted
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
@ 2018-04-18  2:39 ` Darrick J. Wong
  2018-04-18 15:03   ` Brian Foster
  2018-04-18  2:39 ` [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag Darrick J. Wong
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:39 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/alloc.c    |    2 +-
 fs/xfs/scrub/common.c   |    4 ++++
 fs/xfs/scrub/common.h   |    6 ++++++
 fs/xfs/scrub/ialloc.c   |    2 +-
 fs/xfs/scrub/refcount.c |    4 ++--
 fs/xfs/scrub/rmap.c     |    4 ++--
 fs/xfs/scrub/rtbitmap.c |    3 +++
 fs/xfs/scrub/scrub.c    |    3 +--
 8 files changed, 20 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index 517c079..f5c6993 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -172,7 +172,7 @@ xfs_scrub_xref_is_used_space(
 	bool				is_freesp;
 	int				error;
 
-	if (!sc->sa.bno_cur)
+	if (!sc->sa.bno_cur || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp);
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 8ed91d5..3f72176 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -727,6 +727,10 @@ xfs_scrub_should_check_xref(
 	int				*error,
 	struct xfs_btree_cur		**curpp)
 {
+	/* No point in xref if we already know we're corrupt. */
+	if (xfs_scrub_found_corruption(sc->sm))
+		return false;
+
 	if (*error == 0)
 		return true;
 
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index deaf604..30e9039 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -157,4 +157,10 @@ int xfs_scrub_setup_inode_contents(struct xfs_scrub_context *sc,
 				   struct xfs_inode *ip, unsigned int resblks);
 void xfs_scrub_buffer_recheck(struct xfs_scrub_context *sc, struct xfs_buf *bp);
 
+static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT);
+}
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 106ca4b..05e6191 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -496,7 +496,7 @@ xfs_scrub_xref_inode_check(
 	bool				has_inodes;
 	int				error;
 
-	if (!(*icur))
+	if (!(*icur) || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes);
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 400f156..823bda3 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -460,7 +460,7 @@ xfs_scrub_xref_is_cow_staging(
 	int				has_refcount;
 	int				error;
 
-	if (!sc->sa.refc_cur)
+	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	/* Find the CoW staging extent. */
@@ -504,7 +504,7 @@ xfs_scrub_xref_is_not_shared(
 	bool				shared;
 	int				error;
 
-	if (!sc->sa.refc_cur)
+	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	error = xfs_refcount_has_record(sc->sa.refc_cur, agbno, len, &shared);
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 8f2a7c3..9ca92e4 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -207,7 +207,7 @@ xfs_scrub_xref_check_owner(
 	bool				has_rmap;
 	int				error;
 
-	if (!sc->sa.rmap_cur)
+	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo,
@@ -250,7 +250,7 @@ xfs_scrub_xref_has_no_owner(
 	bool				has_rmap;
 	int				error;
 
-	if (!sc->sa.rmap_cur)
+	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
 		return;
 
 	error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap);
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 39c41dfe..c8672f7 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -110,6 +110,9 @@ xfs_scrub_xref_is_used_rt_space(
 	bool				is_free;
 	int				error;
 
+	if (xfs_scrub_found_corruption(sc->sm))
+		return;
+
 	xfs_ilock(sc->mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
 	error = xfs_rtalloc_extent_is_free(sc->mp, sc->tp, fsbno, len,
 			&is_free);
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 26c7596..cb1e727 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -446,8 +446,7 @@ xfs_scrub_metadata(
 	} else if (error)
 		goto out_teardown;
 
-	if (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-			       XFS_SCRUB_OFLAG_XCORRUPT))
+	if (xfs_scrub_found_corruption(sc.sm))
 		xfs_alert_ratelimited(mp, "Corruption detected during scrub.");
 
 out_teardown:
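
For context, after this patch every xref helper follows the same shape:
bail out early if corruption has already been recorded, run the btree
query, then let xfs_scrub_should_check_xref() decide whether the result
is usable.  A composite sketch of that pattern (drawn from the hunks
above; the variable names are illustrative):

	if (!sc->sa.bno_cur || xfs_scrub_found_corruption(sc->sm))
		return;

	error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp);
	if (!xfs_scrub_should_check_xref(sc, &error, &sc->sa.bno_cur))
		return;

	/* ...flag corruption if the record contradicts the primary metadata... */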


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
  2018-04-18  2:39 ` [PATCH 01/11] xfs: skip scrub xref if corruption already noted Darrick J. Wong
@ 2018-04-18  2:39 ` Darrick J. Wong
  2018-04-18 15:33   ` Brian Foster
  2018-04-19  8:32   ` Christoph Hellwig
  2018-04-18  2:39 ` [PATCH 03/11] xfs: report failing address when dquot verifier fails Darrick J. Wong
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:39 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new QMOPT flag to signal that we've already locked the quota
inode.  This will be used by the quota scrub code refactoring to avoid
dropping the quota ip lock during scrub.  The flag is incompatible with
the DQALLOC flag because dquot allocation creates a transaction, and we
shouldn't be doing that with the ILOCK held.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_quota_defs.h |    2 ++
 fs/xfs/xfs_dquot.c             |   32 ++++++++++++++++++++------------
 2 files changed, 22 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
index bb1b13a..cfc9938 100644
--- a/fs/xfs/libxfs/xfs_quota_defs.h
+++ b/fs/xfs/libxfs/xfs_quota_defs.h
@@ -107,6 +107,8 @@ typedef uint16_t	xfs_qwarncnt_t;
  * to a single function. None of these XFS_QMOPT_* flags are meant to have
  * persistent values (ie. their values can and will change between versions)
  */
+/* Quota ip already locked.  Cannot be used with DQALLOC. */
+#define XFS_QMOPT_QUOTIP_LOCKED	0x0000001
 #define XFS_QMOPT_DQALLOC	0x0000002 /* alloc dquot ondisk if needed */
 #define XFS_QMOPT_UQUOTA	0x0000004 /* user dquot requested */
 #define XFS_QMOPT_PQUOTA	0x0000008 /* project dquot requested */
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index a7daef9..ed2e37c 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -417,18 +417,20 @@ xfs_qm_dqtobp(
 	struct xfs_mount	*mp = dqp->q_mount;
 	xfs_dqid_t		id = be32_to_cpu(dqp->q_core.d_id);
 	struct xfs_trans	*tp = (tpp ? *tpp : NULL);
-	uint			lock_mode;
+	uint			lock_mode = 0;
 
 	quotip = xfs_quota_inode(dqp->q_mount, dqp->dq_flags);
 	dqp->q_fileoffset = (xfs_fileoff_t)id / mp->m_quotainfo->qi_dqperchunk;
 
-	lock_mode = xfs_ilock_data_map_shared(quotip);
+	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
+		lock_mode = xfs_ilock_data_map_shared(quotip);
 	if (!xfs_this_quota_on(dqp->q_mount, dqp->dq_flags)) {
 		/*
 		 * Return if this type of quotas is turned off while we
 		 * didn't have the quota inode lock.
 		 */
-		xfs_iunlock(quotip, lock_mode);
+		if (lock_mode)
+			xfs_iunlock(quotip, lock_mode);
 		return -ESRCH;
 	}
 
@@ -438,7 +440,8 @@ xfs_qm_dqtobp(
 	error = xfs_bmapi_read(quotip, dqp->q_fileoffset,
 			       XFS_DQUOT_CLUSTER_SIZE_FSB, &map, &nmaps, 0);
 
-	xfs_iunlock(quotip, lock_mode);
+	if (lock_mode)
+		xfs_iunlock(quotip, lock_mode);
 	if (error)
 		return error;
 
@@ -458,7 +461,7 @@ xfs_qm_dqtobp(
 		 */
 		if (!(flags & XFS_QMOPT_DQALLOC))
 			return -ENOENT;
-
+		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
 		ASSERT(tp);
 		error = xfs_qm_dqalloc(tpp, mp, dqp, quotip,
 					dqp->q_fileoffset, &bp);
@@ -553,6 +556,7 @@ xfs_qm_dqread(
 	trace_xfs_dqread(dqp);
 
 	if (flags & XFS_QMOPT_DQALLOC) {
+		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
 				XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
 		if (error)
@@ -634,11 +638,12 @@ static int
 xfs_dq_get_next_id(
 	struct xfs_mount	*mp,
 	uint			type,
-	xfs_dqid_t		*id)
+	xfs_dqid_t		*id,
+	uint			flags)
 {
 	struct xfs_inode	*quotip = xfs_quota_inode(mp, type);
 	xfs_dqid_t		next_id = *id + 1; /* simple advance */
-	uint			lock_flags;
+	uint			lock_flags = 0;
 	struct xfs_bmbt_irec	got;
 	struct xfs_iext_cursor	cur;
 	xfs_fsblock_t		start;
@@ -657,7 +662,8 @@ xfs_dq_get_next_id(
 	/* Nope, next_id is now past the current chunk, so find the next one */
 	start = (xfs_fsblock_t)next_id / mp->m_quotainfo->qi_dqperchunk;
 
-	lock_flags = xfs_ilock_data_map_shared(quotip);
+	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
+		lock_flags = xfs_ilock_data_map_shared(quotip);
 	if (!(quotip->i_df.if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, quotip, XFS_DATA_FORK);
 		if (error)
@@ -673,7 +679,8 @@ xfs_dq_get_next_id(
 		error = -ENOENT;
 	}
 
-	xfs_iunlock(quotip, lock_flags);
+	if (lock_flags)
+		xfs_iunlock(quotip, lock_flags);
 
 	return error;
 }
@@ -733,7 +740,8 @@ xfs_qm_dqget(
 			if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
 				xfs_dqunlock(dqp);
 				mutex_unlock(&qi->qi_tree_lock);
-				error = xfs_dq_get_next_id(mp, type, &id);
+				error = xfs_dq_get_next_id(mp, type, &id,
+						flags);
 				if (error)
 					return error;
 				goto restart;
@@ -768,7 +776,7 @@ xfs_qm_dqget(
 
 	/* If we are asked to find next active id, keep looking */
 	if (error == -ENOENT && (flags & XFS_QMOPT_DQNEXT)) {
-		error = xfs_dq_get_next_id(mp, type, &id);
+		error = xfs_dq_get_next_id(mp, type, &id, flags);
 		if (!error)
 			goto restart;
 	}
@@ -827,7 +835,7 @@ xfs_qm_dqget(
 	if (flags & XFS_QMOPT_DQNEXT) {
 		if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
 			xfs_qm_dqput(dqp);
-			error = xfs_dq_get_next_id(mp, type, &id);
+			error = xfs_dq_get_next_id(mp, type, &id, flags);
 			if (error)
 				return error;
 			goto restart;
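
As a rough usage sketch (not part of this patch; the surrounding caller
is hypothetical), a caller that already holds the quota inode's ILOCK
would pass the new flag like so:

	/* Caller has already done xfs_ilock(quotip, XFS_ILOCK_EXCL). */
	error = xfs_qm_dqget(mp, NULL, id, dqtype,
			XFS_QMOPT_DQNEXT | XFS_QMOPT_QUOTIP_LOCKED, &dq);
	if (error)
		return error;
	/* ...examine the dquot; no DQALLOC allowed with the ILOCK held... */
	xfs_qm_dqput(dq);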


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 03/11] xfs: report failing address when dquot verifier fails
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
  2018-04-18  2:39 ` [PATCH 01/11] xfs: skip scrub xref if corruption already noted Darrick J. Wong
  2018-04-18  2:39 ` [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag Darrick J. Wong
@ 2018-04-18  2:39 ` Darrick J. Wong
  2018-04-18 18:33   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 04/11] xfs: refactor dquot iteration Darrick J. Wong
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:39 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Pass the failing address through to the corruption report when dquot
verifiers fail.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_dquot_buf.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_dquot_buf.c b/fs/xfs/libxfs/xfs_dquot_buf.c
index 8b7a6c3..a1e6cf1 100644
--- a/fs/xfs/libxfs/xfs_dquot_buf.c
+++ b/fs/xfs/libxfs/xfs_dquot_buf.c
@@ -229,7 +229,7 @@ xfs_dquot_buf_read_verify(
 	else {
 		fa = xfs_dquot_buf_verify(mp, bp);
 		if (fa)
-			xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+			xfs_verifier_error(bp, -EFSCORRUPTED, fa);
 	}
 }
 
@@ -266,7 +266,7 @@ xfs_dquot_buf_write_verify(
 
 	fa = xfs_dquot_buf_verify(mp, bp);
 	if (fa)
-		xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
 }
 
 const struct xfs_buf_ops xfs_dquot_buf_ops = {


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 04/11] xfs: refactor dquot iteration
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (2 preceding siblings ...)
  2018-04-18  2:39 ` [PATCH 03/11] xfs: report failing address when dquot verifier fails Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-18 18:34   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 05/11] xfs: avoid ilock games in the quota scrubber Darrick J. Wong
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it.  This will get more
use in the quota repair code.  Note that the new function differs from
xfs_qm_dqiterate in that _dqiterate iterates dquot buffers, not the
in-core structures themselves.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/quota.c |   46 +++++++++++++++++++++-------------------------
 fs/xfs/xfs_dquot.c   |   41 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot.h   |    5 +++++
 3 files changed, 67 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 6ba465e..363e318 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -77,14 +77,21 @@ xfs_scrub_setup_quota(
 
 /* Quotas. */
 
+struct xfs_scrub_quota_info {
+	struct xfs_scrub_context	*sc;
+	xfs_dqid_t			last_id;
+};
+
 /* Scrub the fields in an individual quota item. */
-STATIC void
+STATIC int
 xfs_scrub_quota_item(
-	struct xfs_scrub_context	*sc,
-	uint				dqtype,
 	struct xfs_dquot		*dq,
-	xfs_dqid_t			id)
+	uint				dqtype,
+	xfs_dqid_t			id,
+	void				*priv)
 {
+	struct xfs_scrub_quota_info	*sqi = priv;
+	struct xfs_scrub_context	*sc = sqi->sc;
 	struct xfs_mount		*mp = sc->mp;
 	struct xfs_disk_dquot		*d = &dq->q_core;
 	struct xfs_quotainfo		*qi = mp->m_quotainfo;
@@ -100,6 +107,7 @@ xfs_scrub_quota_item(
 	unsigned long long		rcount;
 	xfs_ino_t			fs_icount;
 
+	sqi->last_id = id;
 	offset = id / qi->qi_dqperchunk;
 
 	/*
@@ -183,6 +191,8 @@ xfs_scrub_quota_item(
 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
 	if (id != 0 && rhard != 0 && rcount > rhard)
 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+
+	return 0;
 }
 
 /* Scrub all of a quota type's items. */
@@ -191,13 +201,12 @@ xfs_scrub_quota(
 	struct xfs_scrub_context	*sc)
 {
 	struct xfs_bmbt_irec		irec = { 0 };
+	struct xfs_scrub_quota_info	sqi;
 	struct xfs_mount		*mp = sc->mp;
 	struct xfs_inode		*ip;
 	struct xfs_quotainfo		*qi = mp->m_quotainfo;
-	struct xfs_dquot		*dq;
 	xfs_fileoff_t			max_dqid_off;
 	xfs_fileoff_t			off = 0;
-	xfs_dqid_t			id = 0;
 	uint				dqtype;
 	int				nimaps;
 	int				error = 0;
@@ -264,25 +273,12 @@ xfs_scrub_quota(
 		goto out;
 
 	/* Check all the quota items. */
-	while (id < ((xfs_dqid_t)-1ULL)) {
-		if (xfs_scrub_should_terminate(sc, &error))
-			break;
-
-		error = xfs_qm_dqget(mp, NULL, id, dqtype, XFS_QMOPT_DQNEXT,
-				&dq);
-		if (error == -ENOENT)
-			break;
-		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK,
-				id * qi->qi_dqperchunk, &error))
-			break;
-
-		xfs_scrub_quota_item(sc, dqtype, dq, id);
-
-		id = be32_to_cpu(dq->q_core.d_id) + 1;
-		xfs_qm_dqput(dq);
-		if (!id)
-			break;
-	}
+	sqi.sc = sc;
+	sqi.last_id = 0;
+	error = xfs_dquot_iterate(mp, dqtype, 0, xfs_scrub_quota_item, &sqi);
+	if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK,
+			sqi.last_id * qi->qi_dqperchunk, &error))
+		goto out;
 
 out:
 	/* We set sc->ip earlier, so make sure we clear it now. */
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index ed2e37c..ec00402 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1127,3 +1127,44 @@ xfs_qm_exit(void)
 	kmem_zone_destroy(xfs_qm_dqtrxzone);
 	kmem_zone_destroy(xfs_qm_dqzone);
 }
+
+/*
+ * Iterate every dquot of a particular type.  The caller must ensure that the
+ * particular quota type is active.  iter_fn can return negative error codes,
+ * or XFS_BTREE_QUERY_RANGE_ABORT to indicate that it wants to stop iterating.
+ *
+ * Note that xfs_qm_dqiterate iterates all the dquot bufs, not the dquots
+ * themselves.
+ */
+int
+xfs_dquot_iterate(
+	struct xfs_mount	*mp,
+	uint			dqtype,
+	uint			iter_flags,
+	xfs_dquot_iterate_fn	iter_fn,
+	void			*priv)
+{
+	struct xfs_dquot	*dq;
+	xfs_dqid_t		id = 0;
+	int			error;
+
+	while (id < ((xfs_dqid_t)-1ULL)) {
+		error = xfs_qm_dqget(mp, NULL, id, dqtype,
+				XFS_QMOPT_DQNEXT | iter_flags, &dq);
+		if (error == -ENOENT) {
+			error = 0;
+			break;
+		}
+		if (error)
+			break;
+
+		id = be32_to_cpu(dq->q_core.d_id);
+		error = iter_fn(dq, dqtype, id, priv);
+		xfs_qm_dqput(dq);
+		id++;
+		if (error || id == 0)
+			break;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 2f536f3..db0511e 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -185,4 +185,9 @@ static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 	return dqp;
 }
 
+typedef int (*xfs_dquot_iterate_fn)(struct xfs_dquot *dq, uint dqtype,
+		xfs_dqid_t id, void *priv);
+int xfs_dquot_iterate(struct xfs_mount *mp, uint dqtype, uint iter_flags,
+		xfs_dquot_iterate_fn iter_fn, void *priv);
+
 #endif /* __XFS_DQUOT_H__ */
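
A minimal caller of the new iterator might look like the following
(hypothetical example, not part of the patch; the callback name and the
counting are made up):

	/* Hypothetical callback: count every dquot of the given type. */
	STATIC int
	xfs_example_count_dquots(
		struct xfs_dquot	*dq,
		uint			dqtype,
		xfs_dqid_t		id,
		void			*priv)
	{
		uint64_t		*count = priv;

		(*count)++;
		/* Negative errno or XFS_BTREE_QUERY_RANGE_ABORT stops the walk. */
		return 0;
	}

	/* ...later, with a struct xfs_mount *mp in hand... */
	uint64_t		count = 0;
	int			error;

	error = xfs_dquot_iterate(mp, XFS_DQ_USER, 0,
			xfs_example_count_dquots, &count);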


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 05/11] xfs: avoid ilock games in the quota scrubber
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (3 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 04/11] xfs: refactor dquot iteration Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-18 18:34   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber Darrick J. Wong
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we have a new QMOPT_QUOTIP_LOCKED flag, pass it to dqget in the
quota scrubber so that we can ilock the quota ip at the start of scrub
and keep it locked all the way to the end.  This lets us remove
quite a few locking-related games and manage the quota ip the same
way we treat other inodes being scrubbed.  Allocate a transaction so
that we can evade deadlock issues in bmbt lookups.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/quota.c |   46 ++++++++++++++++++----------------------------
 fs/xfs/scrub/scrub.c |    4 ++++
 fs/xfs/scrub/scrub.h |    1 +
 3 files changed, 23 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 363e318..1068a91 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -66,12 +66,24 @@ xfs_scrub_setup_quota(
 	struct xfs_inode		*ip)
 {
 	uint				dqtype;
+	int				error;
+
+	if (!XFS_IS_QUOTA_RUNNING(sc->mp) || !XFS_IS_QUOTA_ON(sc->mp))
+		return -ENOENT;
 
 	dqtype = xfs_scrub_quota_to_dqtype(sc);
 	if (dqtype == 0)
 		return -EINVAL;
+	sc->has_quotaofflock = true;
+	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	if (!xfs_this_quota_on(sc->mp, dqtype))
 		return -ENOENT;
+	error = xfs_scrub_setup_fs(sc, ip);
+	if (error)
+		return error;
+	sc->ip = xfs_quota_inode(sc->mp, dqtype);
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+	sc->ilock_flags = XFS_ILOCK_EXCL;
 	return 0;
 }
 
@@ -203,7 +215,6 @@ xfs_scrub_quota(
 	struct xfs_bmbt_irec		irec = { 0 };
 	struct xfs_scrub_quota_info	sqi;
 	struct xfs_mount		*mp = sc->mp;
-	struct xfs_inode		*ip;
 	struct xfs_quotainfo		*qi = mp->m_quotainfo;
 	xfs_fileoff_t			max_dqid_off;
 	xfs_fileoff_t			off = 0;
@@ -211,25 +222,12 @@ xfs_scrub_quota(
 	int				nimaps;
 	int				error = 0;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
-		return -ENOENT;
-
-	mutex_lock(&qi->qi_quotaofflock);
 	dqtype = xfs_scrub_quota_to_dqtype(sc);
-	if (!xfs_this_quota_on(sc->mp, dqtype)) {
-		error = -ENOENT;
-		goto out_unlock_quota;
-	}
-
-	/* Attach to the quota inode and set sc->ip so that reporting works. */
-	ip = xfs_quota_inode(sc->mp, dqtype);
-	sc->ip = ip;
 
 	/* Look for problem extents. */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
 		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino);
-		goto out_unlock_inode;
+		goto out;
 	}
 	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
 	while (1) {
@@ -238,11 +236,11 @@ xfs_scrub_quota(
 
 		off = irec.br_startoff + irec.br_blockcount;
 		nimaps = 1;
-		error = xfs_bmapi_read(ip, off, -1, &irec, &nimaps,
+		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
 				XFS_BMAPI_ENTIRE);
 		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
 				&error))
-			goto out_unlock_inode;
+			goto out;
 		if (!nimaps)
 			break;
 		if (irec.br_startblock == HOLESTARTBLOCK)
@@ -268,26 +266,18 @@ xfs_scrub_quota(
 		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
 			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
 	}
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
 
 	/* Check all the quota items. */
 	sqi.sc = sc;
 	sqi.last_id = 0;
-	error = xfs_dquot_iterate(mp, dqtype, 0, xfs_scrub_quota_item, &sqi);
+	error = xfs_dquot_iterate(mp, dqtype, XFS_QMOPT_QUOTIP_LOCKED,
+			xfs_scrub_quota_item, &sqi);
 	if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK,
 			sqi.last_id * qi->qi_dqperchunk, &error))
 		goto out;
 
 out:
-	/* We set sc->ip earlier, so make sure we clear it now. */
-	sc->ip = NULL;
-out_unlock_quota:
-	mutex_unlock(&qi->qi_quotaofflock);
 	return error;
-
-out_unlock_inode:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	goto out;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index cb1e727..c43ee9e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -42,6 +42,8 @@
 #include "xfs_refcount_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -166,6 +168,8 @@ xfs_scrub_teardown(
 			iput(VFS_I(sc->ip));
 		sc->ip = NULL;
 	}
+	if (sc->has_quotaofflock)
+		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	if (sc->buf) {
 		kmem_free(sc->buf);
 		sc->buf = NULL;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 0d92af8..5d79731 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -73,6 +73,7 @@ struct xfs_scrub_context {
 	void				*buf;
 	uint				ilock_flags;
 	bool				try_harder;
+	bool				has_quotaofflock;
 
 	/* State tracking for single-AG operations. */
 	struct xfs_scrub_ag		sa;
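
Roughly, the lock/resource lifetime for a quota scrub after this patch
becomes (a flow sketch, not verbatim code):

	mutex_lock(&qi->qi_quotaofflock);	/* setup; held across the scrub */
	xfs_scrub_setup_fs(sc, ip);		/* allocates the empty scrub transaction */
	sc->ip = xfs_quota_inode(mp, dqtype);
	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);	/* held until teardown */

	/* ...xfs_scrub_quota() checks the data fork and the dquots... */

	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);	/* generic inode teardown */
	mutex_unlock(&qi->qi_quotaofflock);	/* new quotaofflock teardown hook */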


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (4 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 05/11] xfs: avoid ilock games in the quota scrubber Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-18 18:34   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 07/11] xfs: superblock scrub should use uncached buffers Darrick J. Wong
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber.  This strengthens the quota
scrub to include all the cross-referencing that the bmapbtd scrubber
already performs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/quota.c |   95 ++++++++++++++++++++++++++++----------------------
 1 file changed, 54 insertions(+), 41 deletions(-)


diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 1068a91..67f94c4 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -207,65 +207,78 @@ xfs_scrub_quota_item(
 	return 0;
 }
 
-/* Scrub all of a quota type's items. */
-int
-xfs_scrub_quota(
+/* Check the quota's data fork. */
+STATIC int
+xfs_scrub_quota_data_fork(
 	struct xfs_scrub_context	*sc)
 {
 	struct xfs_bmbt_irec		irec = { 0 };
-	struct xfs_scrub_quota_info	sqi;
-	struct xfs_mount		*mp = sc->mp;
-	struct xfs_quotainfo		*qi = mp->m_quotainfo;
+	struct xfs_iext_cursor		icur;
+	struct xfs_scrub_metadata	fake_sm;
+	struct xfs_scrub_metadata	*real_sm = sc->sm;
+	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
+	struct xfs_ifork		*ifp;
 	xfs_fileoff_t			max_dqid_off;
-	xfs_fileoff_t			off = 0;
-	uint				dqtype;
-	int				nimaps;
 	int				error = 0;
 
-	dqtype = xfs_scrub_quota_to_dqtype(sc);
-
-	/* Look for problem extents. */
+	/* Quotas don't live on the rt device. */
 	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
 		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino);
-		goto out;
+		return 0;
 	}
+
+	/* Invoke the data fork scrubber. */
+	memcpy(&fake_sm, real_sm, sizeof(fake_sm));
+	fake_sm.sm_type = XFS_SCRUB_TYPE_BMBTD;
+	fake_sm.sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	sc->sm = &fake_sm;
+	error = xfs_scrub_bmap_data(sc);
+	sc->sm = real_sm;
+	if (error)
+		return error;
+	sc->sm->sm_flags |= (fake_sm.sm_flags & XFS_SCRUB_FLAGS_OUT);
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return error;
+
+	/* Check for data fork problems that apply only to quota files. */
 	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
-	while (1) {
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &irec) {
 		if (xfs_scrub_should_terminate(sc, &error))
 			break;
-
-		off = irec.br_startoff + irec.br_blockcount;
-		nimaps = 1;
-		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
-				XFS_BMAPI_ENTIRE);
-		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
-				&error))
-			goto out;
-		if (!nimaps)
-			break;
-		if (irec.br_startblock == HOLESTARTBLOCK)
-			continue;
-
-		/* Check the extent record doesn't point to crap. */
-		if (irec.br_startblock + irec.br_blockcount <=
-		    irec.br_startblock)
-			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
-					irec.br_startoff);
-		if (!xfs_verify_fsbno(mp, irec.br_startblock) ||
-		    !xfs_verify_fsbno(mp, irec.br_startblock +
-					irec.br_blockcount - 1))
-			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
-					irec.br_startoff);
-
 		/*
-		 * Unwritten extents or blocks mapped above the highest
+		 * delalloc extents or blocks mapped above the highest
 		 * quota id shouldn't happen.
 		 */
 		if (isnullstartblock(irec.br_startblock) ||
 		    irec.br_startoff > max_dqid_off ||
-		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
-			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
+		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
+					irec.br_startoff);
+			break;
+		}
 	}
+
+	return error;
+}
+
+/* Scrub all of a quota type's items. */
+int
+xfs_scrub_quota(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_scrub_quota_info	sqi;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_quotainfo		*qi = mp->m_quotainfo;
+	uint				dqtype;
+	int				error = 0;
+
+	dqtype = xfs_scrub_quota_to_dqtype(sc);
+
+	/* Look for problem extents. */
+	error = xfs_scrub_quota_data_fork(sc);
+	if (error)
+		goto out;
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
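
The sm_type redirection used here is a general pattern for borrowing
another scrubber; pulled out of the hunk above and annotated:

	memcpy(&fake_sm, real_sm, sizeof(fake_sm));
	fake_sm.sm_type = XFS_SCRUB_TYPE_BMBTD;	  /* scrubber we borrow */
	fake_sm.sm_flags &= ~XFS_SCRUB_FLAGS_OUT; /* clean output flags */
	sc->sm = &fake_sm;
	error = xfs_scrub_bmap_data(sc);	  /* run the bmapbtd scrubber */
	sc->sm = real_sm;
	/* Fold the borrowed scrubber's findings back into the real request. */
	sc->sm->sm_flags |= (fake_sm.sm_flags & XFS_SCRUB_FLAGS_OUT);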
 


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (5 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-19 12:55   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS Darrick J. Wong
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

We've never cached buffers when reading or writing superblocks, so we
need to change scrub to do likewise or risk screwing up the uncached sb
buffer usage everywhere else.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 018aabbd..aacbc3f 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -157,7 +157,7 @@ xfs_scrub_superblock(
 	if (agno == 0)
 		return 0;
 
-	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+	error = xfs_buf_read_uncached(mp->m_ddev_targp,
 		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
 		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
 	/*
@@ -421,6 +421,7 @@ xfs_scrub_superblock(
 		xfs_scrub_block_set_corrupt(sc, bp);
 
 	xfs_scrub_superblock_xref(sc, bp);
+	xfs_buf_relse(bp);
 
 	return error;
 }


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (6 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 07/11] xfs: superblock scrub should use uncached buffers Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-19 12:55   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 09/11] xfs: btree scrub should check minrecs Darrick J. Wong
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context.  It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader.c |    3 ++-
 fs/xfs/scrub/btree.c    |    2 +-
 fs/xfs/scrub/refcount.c |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index aacbc3f..6c6e4d8 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -798,7 +798,8 @@ xfs_scrub_agfl(
 	}
 	memset(&sai, 0, sizeof(sai));
 	sai.sz_entries = agflcount;
-	sai.entries = kmem_zalloc(sizeof(xfs_agblock_t) * agflcount, KM_NOFS);
+	sai.entries = kmem_zalloc(sizeof(xfs_agblock_t) * agflcount,
+			KM_MAYFAIL);
 	if (!sai.entries) {
 		error = -ENOMEM;
 		goto out;
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 5421816..ea972da 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -442,7 +442,7 @@ xfs_scrub_btree_check_owner(
 	 */
 	if (cur->bc_btnum == XFS_BTNUM_BNO || cur->bc_btnum == XFS_BTNUM_RMAP) {
 		co = kmem_alloc(sizeof(struct check_owner),
-				KM_MAYFAIL | KM_NOFS);
+				KM_MAYFAIL);
 		if (!co)
 			return -ENOMEM;
 		co->level = level;
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 823bda3..5fff94d 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -150,7 +150,7 @@ xfs_scrub_refcountbt_rmap_check(
 		 * so we don't need insertion sort here.
 		 */
 		frag = kmem_alloc(sizeof(struct xfs_scrub_refcnt_frag),
-				KM_MAYFAIL | KM_NOFS);
+				KM_MAYFAIL);
 		if (!frag)
 			return -ENOMEM;
 		memcpy(&frag->rm, rec, sizeof(frag->rm));
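
To spell out why KM_NOFS is redundant (an illustrative note, not code
from this patch): the transaction that scrub holds already does roughly
the following for its whole lifetime, so allocations made under it are
implicitly GFP_NOFS anyway:

	unsigned int	nofs_flags;

	nofs_flags = memalloc_nofs_save();	/* sets PF_MEMALLOC_NOFS */
	/* ...allocations here implicitly drop __GFP_FS... */
	memalloc_nofs_restore(nofs_flags);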


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 09/11] xfs: btree scrub should check minrecs
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (7 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-19 12:55   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 10/11] xfs: refactor scrub transaction allocation function Darrick J. Wong
  2018-04-18  2:40 ` [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers Darrick J. Wong
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count.  Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/btree.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index ea972da..2d29dce 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -455,6 +455,44 @@ xfs_scrub_btree_check_owner(
 }
 
 /*
+ * Check that this btree block has at least minrecs records or is one of the
+ * special blocks that don't require that.
+ */
+STATIC void
+xfs_scrub_btree_check_minrecs(
+	struct xfs_scrub_btree	*bs,
+	int			level,
+	struct xfs_btree_block	*block)
+{
+	unsigned int		numrecs;
+	int			ok_level;
+
+	numrecs = be16_to_cpu(block->bb_numrecs);
+
+	/* More records than minrecs means the block is ok. */
+	if (numrecs >= bs->cur->bc_ops->get_minrecs(bs->cur, level))
+		return;
+
+	/*
+	 * Certain btree blocks /can/ have fewer than minrecs records.  Any
+	 * level greater than or equal to the level of the highest dedicated
+	 * btree block are allowed to violate this constraint.
+	 *
+	 * For a btree rooted in a block, the btree root can have fewer than
+	 * minrecs records.  If the btree is rooted in an inode and does not
+	 * store records in the root, the direct children of the root and the
+	 * root itself can have fewer than minrecs records.
+	 */
+	ok_level = bs->cur->bc_nlevels - 1;
+	if (bs->cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
+		ok_level--;
+	if (level >= ok_level)
+		return;
+
+	xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, level);
+}
+
+/*
  * Grab and scrub a btree block given a btree pointer.  Returns block
  * and buffer pointers (if applicable) if they're ok to use.
  */
@@ -491,6 +529,8 @@ xfs_scrub_btree_get_block(
 	if (*pbp)
 		xfs_scrub_buffer_recheck(bs->sc, *pbp);
 
+	xfs_scrub_btree_check_minrecs(bs, level, *pblock);
+
 	/*
 	 * Check the block's owner; this function absorbs error codes
 	 * for us.
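
To make the ok_level arithmetic concrete, a worked example with made-up
values:

	/*
	 * Illustration only: an inode-rooted bmbt with bc_nlevels == 3,
	 * levels numbered 0 (leaves) through 2 (root in the inode fork).
	 *
	 *	ok_level = bc_nlevels - 1 = 2;		root may be undersized
	 *	XFS_BTREE_ROOT_IN_INODE, so ok_level = 1; root's children too
	 *
	 * Only level-0 blocks must satisfy numrecs >= minrecs; levels 1
	 * and 2 pass the check unconditionally.
	 */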


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 10/11] xfs: refactor scrub transaction allocation function
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (8 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 09/11] xfs: btree scrub should check minrecs Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-19 12:56   ` Brian Foster
  2018-04-18  2:40 ` [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers Darrick J. Wong
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/bmap.c   |    3 +--
 fs/xfs/scrub/common.c |   16 +++++++++++++---
 fs/xfs/scrub/common.h |   14 +-------------
 fs/xfs/scrub/inode.c  |    5 ++---
 4 files changed, 17 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 639d14b..3f8fd10 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -51,7 +51,6 @@ xfs_scrub_setup_inode_bmap(
 	struct xfs_scrub_context	*sc,
 	struct xfs_inode		*ip)
 {
-	struct xfs_mount		*mp = sc->mp;
 	int				error;
 
 	error = xfs_scrub_get_inode(sc, ip);
@@ -75,7 +74,7 @@ xfs_scrub_setup_inode_bmap(
 	}
 
 	/* Got the inode, lock it and we're ready to go. */
-	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	error = xfs_scrub_trans_alloc(sc);
 	if (error)
 		goto out;
 	sc->ilock_flags |= XFS_ILOCK_EXCL;
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 3f72176..f5e281a 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -568,13 +568,24 @@ xfs_scrub_ag_init(
 
 /* Per-scrubber setup functions */
 
+/*
+ * Grab an empty transaction so that we can re-grab locked buffers if
+ * one of our btrees turns out to be cyclic.
+ */
+int
+xfs_scrub_trans_alloc(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_trans_alloc_empty(sc->mp, &sc->tp);
+}
+
 /* Set us up with a transaction and an empty context. */
 int
 xfs_scrub_setup_fs(
 	struct xfs_scrub_context	*sc,
 	struct xfs_inode		*ip)
 {
-	return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
+	return xfs_scrub_trans_alloc(sc);
 }
 
 /* Set us up with AG headers and btree cursors. */
@@ -695,7 +706,6 @@ xfs_scrub_setup_inode_contents(
 	struct xfs_inode		*ip,
 	unsigned int			resblks)
 {
-	struct xfs_mount		*mp = sc->mp;
 	int				error;
 
 	error = xfs_scrub_get_inode(sc, ip);
@@ -705,7 +715,7 @@ xfs_scrub_setup_inode_contents(
 	/* Got the inode, lock it and we're ready to go. */
 	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_ilock(sc->ip, sc->ilock_flags);
-	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	error = xfs_scrub_trans_alloc(sc);
 	if (error)
 		goto out;
 	sc->ilock_flags |= XFS_ILOCK_EXCL;
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 30e9039..8296873 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -38,19 +38,7 @@ xfs_scrub_should_terminate(
 	return false;
 }
 
-/*
- * Grab an empty transaction so that we can re-grab locked buffers if
- * one of our btrees turns out to be cyclic.
- */
-static inline int
-xfs_scrub_trans_alloc(
-	struct xfs_scrub_metadata	*sm,
-	struct xfs_mount		*mp,
-	struct xfs_trans		**tpp)
-{
-	return xfs_trans_alloc_empty(mp, tpp);
-}
-
+int xfs_scrub_trans_alloc(struct xfs_scrub_context *sc);
 bool xfs_scrub_process_error(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
 		xfs_agblock_t bno, int *error);
 bool xfs_scrub_fblock_process_error(struct xfs_scrub_context *sc, int whichfork,
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index df14930..e15b1bc 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -55,7 +55,6 @@ xfs_scrub_setup_inode(
 	struct xfs_scrub_context	*sc,
 	struct xfs_inode		*ip)
 {
-	struct xfs_mount		*mp = sc->mp;
 	int				error;
 
 	/*
@@ -68,7 +67,7 @@ xfs_scrub_setup_inode(
 		break;
 	case -EFSCORRUPTED:
 	case -EFSBADCRC:
-		return xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+		return xfs_scrub_trans_alloc(sc);
 	default:
 		return error;
 	}
@@ -76,7 +75,7 @@ xfs_scrub_setup_inode(
 	/* Got the inode, lock it and we're ready to go. */
 	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_ilock(sc->ip, sc->ilock_flags);
-	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	error = xfs_scrub_trans_alloc(sc);
 	if (error)
 		goto out;
 	sc->ilock_flags |= XFS_ILOCK_EXCL;


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
                   ` (9 preceding siblings ...)
  2018-04-18  2:40 ` [PATCH 10/11] xfs: refactor scrub transaction allocation function Darrick J. Wong
@ 2018-04-18  2:40 ` Darrick J. Wong
  2018-04-19 12:56   ` Brian Foster
  10 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18  2:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction.  However, when scrubbing parent inodes
this is inverted -- we allocate the transaction to do the scrub, and
only then try to grab the parent's iolock.  This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.

Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
 fs/xfs/scrub/common.h |    2 ++
 fs/xfs/scrub/parent.c |   16 ++++++++++++++--
 3 files changed, 38 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index f5e281a..93f9e7d 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
 }
+
+/*
+ * Try to lock an inode in violation of the usual locking order rules.  For
+ * example, trying to get the IOLOCK while in transaction context, or just
+ * plain breaking AG-order or inode-order inode locking rules.  Either way,
+ * the only way to avoid an ABBA deadlock is to use trylock and back off if
+ * we can't.
+ */
+int
+xfs_scrub_ilock_inverted(
+	struct xfs_inode	*ip,
+	uint			lock_mode)
+{
+	int			i;
+
+	for (i = 0; i < 20; i++) {
+		if (xfs_ilock_nowait(ip, lock_mode))
+			return 0;
+		delay(1);
+	}
+	return -EDEADLOCK;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 8296873..191c369 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
 			       XFS_SCRUB_OFLAG_XCORRUPT);
 }
 
+int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 1fb88c1..19cd54d 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
 	 */
 	xfs_iunlock(sc->ip, sc->ilock_flags);
 	sc->ilock_flags = 0;
-	xfs_ilock(dp, XFS_IOLOCK_SHARED);
+	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
+	if (error)
+		goto out_rele;
 
 	/* Go looking for our dentry. */
 	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
@@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
 
 	/* Drop the parent lock, relock this inode. */
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
+	if (error)
+		goto out_rele;
 	sc->ilock_flags = XFS_IOLOCK_EXCL;
-	xfs_ilock(sc->ip, sc->ilock_flags);
 
 	/*
 	 * If we're an unlinked directory, the parent /won't/ have a link
@@ -323,5 +327,13 @@ xfs_scrub_parent(
 	if (try_again && tries == 20)
 		xfs_scrub_set_incomplete(sc);
 out:
+	/*
+	 * If we failed to lock the parent inode even after a retry, just mark
+	 * this scrub incomplete and return.
+	 */
+	if (sc->try_harder && error == -EDEADLOCK) {
+		error = 0;
+		xfs_scrub_set_incomplete(sc);
+	}
 	return error;
 }
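
The calling pattern for any future out-of-order lock site would be
roughly as follows (hypothetical caller, not from this patch):

	/* Take the lock out of order; back off instead of deadlocking. */
	error = xfs_scrub_ilock_inverted(ip, XFS_IOLOCK_SHARED);
	if (error) {
		/*
		 * -EDEADLOCK either propagates up so the scrub core can
		 * retry with sc->try_harder set, or the scrubber marks
		 * itself incomplete as done above.
		 */
		goto out;
	}
	/* ...do the cross-inode checks... */
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);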


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/11] xfs: skip scrub xref if corruption already noted
  2018-04-18  2:39 ` [PATCH 01/11] xfs: skip scrub xref if corruption already noted Darrick J. Wong
@ 2018-04-18 15:03   ` Brian Foster
  2018-04-18 16:02     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-18 15:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:39:33PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Don't bother looking for cross-referencing problems if the metadata is
> already corrupt or we've already found a cross-referencing problem.
> Since we added a helper function for flags testing, convert existing
> users to use it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/alloc.c    |    2 +-
>  fs/xfs/scrub/common.c   |    4 ++++
>  fs/xfs/scrub/common.h   |    6 ++++++
>  fs/xfs/scrub/ialloc.c   |    2 +-
>  fs/xfs/scrub/refcount.c |    4 ++--
>  fs/xfs/scrub/rmap.c     |    4 ++--
>  fs/xfs/scrub/rtbitmap.c |    3 +++
>  fs/xfs/scrub/scrub.c    |    3 +--
>  8 files changed, 20 insertions(+), 8 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
> index 517c079..f5c6993 100644
> --- a/fs/xfs/scrub/alloc.c
> +++ b/fs/xfs/scrub/alloc.c
> @@ -172,7 +172,7 @@ xfs_scrub_xref_is_used_space(
>  	bool				is_freesp;
>  	int				error;
>  
> -	if (!sc->sa.bno_cur)
> +	if (!sc->sa.bno_cur || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp);
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 8ed91d5..3f72176 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -727,6 +727,10 @@ xfs_scrub_should_check_xref(
>  	int				*error,
>  	struct xfs_btree_cur		**curpp)
>  {
> +	/* No point in xref if we already know we're corrupt. */
> +	if (xfs_scrub_found_corruption(sc->sm))
> +		return false;
> +

This seems like a fairly innocuous change to somebody not terribly
familiar with the scrub code, but a high-level thought...

This function has 30+ callers, and now we have a bunch of these
found_corruption() checks as well, one of which is the hunk above.  The
new found_corruption() checks look like they easily have tens of call
sites throughout the scrub xref code, which kind of makes this look like
quite the maze of logic just to shortcut various checks.

This is not an objection to this patch, but it makes me wonder how much
"empty" processing is involved in a situation where we assume an xref
error is detected early on in a scrub. Secondarily, the logic makes this
hard enough to follow that the only way to really surmise that is to
trace through all of the scrub code. It also looks like we have some
places where it's checked multiple times. For example,
xfs_scrub_inode_xref() checks the corrupt flag and then calls a bunch of
sub-handlers that call either found_corruption() or the
should_check_xref(). One of the latter callers is
xfs_scrub_inode_xref_finobt(), which does a btree lookup before it ever
gets to this call only to then bail on the corruption flag being set.

Granted, this would probably only execute if the flag was set after the
higher level scrub_inode_xref() check, but isn't it kind of strange to
bury a check that is intended to control high-level xref execution flow
in a call that is (also) designed to check the error of some btree
lookup that shouldn't be required at all if the xref was going to be
skipped anyways? I'm wondering if this would be more clear with
consistent but separate error/corruption checks at the top of every xref
helper to dictate high-level scrub execution flow and leave
scrub_should_check_xref() to continue to care only about local
processing decisions specific to the current branch of the scan (i.e.,
my btree lookup failed and so we must set the corrupt flag and not
proceed).

Brian

>  	if (*error == 0)
>  		return true;
>  
> diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> index deaf604..30e9039 100644
> --- a/fs/xfs/scrub/common.h
> +++ b/fs/xfs/scrub/common.h
> @@ -157,4 +157,10 @@ int xfs_scrub_setup_inode_contents(struct xfs_scrub_context *sc,
>  				   struct xfs_inode *ip, unsigned int resblks);
>  void xfs_scrub_buffer_recheck(struct xfs_scrub_context *sc, struct xfs_buf *bp);
>  
> +static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> +{
> +	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> +			       XFS_SCRUB_OFLAG_XCORRUPT);
> +}
> +
>  #endif	/* __XFS_SCRUB_COMMON_H__ */
> diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
> index 106ca4b..05e6191 100644
> --- a/fs/xfs/scrub/ialloc.c
> +++ b/fs/xfs/scrub/ialloc.c
> @@ -496,7 +496,7 @@ xfs_scrub_xref_inode_check(
>  	bool				has_inodes;
>  	int				error;
>  
> -	if (!(*icur))
> +	if (!(*icur) || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes);
> diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> index 400f156..823bda3 100644
> --- a/fs/xfs/scrub/refcount.c
> +++ b/fs/xfs/scrub/refcount.c
> @@ -460,7 +460,7 @@ xfs_scrub_xref_is_cow_staging(
>  	int				has_refcount;
>  	int				error;
>  
> -	if (!sc->sa.refc_cur)
> +	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	/* Find the CoW staging extent. */
> @@ -504,7 +504,7 @@ xfs_scrub_xref_is_not_shared(
>  	bool				shared;
>  	int				error;
>  
> -	if (!sc->sa.refc_cur)
> +	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	error = xfs_refcount_has_record(sc->sa.refc_cur, agbno, len, &shared);
> diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
> index 8f2a7c3..9ca92e4 100644
> --- a/fs/xfs/scrub/rmap.c
> +++ b/fs/xfs/scrub/rmap.c
> @@ -207,7 +207,7 @@ xfs_scrub_xref_check_owner(
>  	bool				has_rmap;
>  	int				error;
>  
> -	if (!sc->sa.rmap_cur)
> +	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo,
> @@ -250,7 +250,7 @@ xfs_scrub_xref_has_no_owner(
>  	bool				has_rmap;
>  	int				error;
>  
> -	if (!sc->sa.rmap_cur)
> +	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
>  		return;
>  
>  	error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap);
> diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
> index 39c41dfe..c8672f7 100644
> --- a/fs/xfs/scrub/rtbitmap.c
> +++ b/fs/xfs/scrub/rtbitmap.c
> @@ -110,6 +110,9 @@ xfs_scrub_xref_is_used_rt_space(
>  	bool				is_free;
>  	int				error;
>  
> +	if (xfs_scrub_found_corruption(sc->sm))
> +		return;
> +
>  	xfs_ilock(sc->mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
>  	error = xfs_rtalloc_extent_is_free(sc->mp, sc->tp, fsbno, len,
>  			&is_free);
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 26c7596..cb1e727 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -446,8 +446,7 @@ xfs_scrub_metadata(
>  	} else if (error)
>  		goto out_teardown;
>  
> -	if (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> -			       XFS_SCRUB_OFLAG_XCORRUPT))
> +	if (xfs_scrub_found_corruption(sc.sm))
>  		xfs_alert_ratelimited(mp, "Corruption detected during scrub.");
>  
>  out_teardown:
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-18  2:39 ` [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag Darrick J. Wong
@ 2018-04-18 15:33   ` Brian Foster
  2018-04-18 16:55     ` Darrick J. Wong
  2018-04-19  8:32   ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-18 15:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:39:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a new QMOPT flag to signal that we've already locked the quota
> inode.  This will be used by the quota scrub code refactoring to avoid
> dropping the quota ip lock during scrub.  The flag is incompatible with
> the DQALLOC flag because dquot allocation creates a transaction, and we
> shouldn't be doing that with the ILOCK held.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_quota_defs.h |    2 ++
>  fs/xfs/xfs_dquot.c             |   32 ++++++++++++++++++++------------
>  2 files changed, 22 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
> index bb1b13a..cfc9938 100644
> --- a/fs/xfs/libxfs/xfs_quota_defs.h
> +++ b/fs/xfs/libxfs/xfs_quota_defs.h
> @@ -107,6 +107,8 @@ typedef uint16_t	xfs_qwarncnt_t;
>   * to a single function. None of these XFS_QMOPT_* flags are meant to have
>   * persistent values (ie. their values can and will change between versions)
>   */
> +/* Quota ip already locked.  Cannot be used with DQALLOC. */
> +#define XFS_QMOPT_QUOTIP_LOCKED	0x0000001
>  #define XFS_QMOPT_DQALLOC	0x0000002 /* alloc dquot ondisk if needed */
>  #define XFS_QMOPT_UQUOTA	0x0000004 /* user dquot requested */
>  #define XFS_QMOPT_PQUOTA	0x0000008 /* project dquot requested */
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index a7daef9..ed2e37c 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -417,18 +417,20 @@ xfs_qm_dqtobp(
>  	struct xfs_mount	*mp = dqp->q_mount;
>  	xfs_dqid_t		id = be32_to_cpu(dqp->q_core.d_id);
>  	struct xfs_trans	*tp = (tpp ? *tpp : NULL);
> -	uint			lock_mode;
> +	uint			lock_mode = 0;
>  
>  	quotip = xfs_quota_inode(dqp->q_mount, dqp->dq_flags);
>  	dqp->q_fileoffset = (xfs_fileoff_t)id / mp->m_quotainfo->qi_dqperchunk;
>  
> -	lock_mode = xfs_ilock_data_map_shared(quotip);
> +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> +		lock_mode = xfs_ilock_data_map_shared(quotip);
>  	if (!xfs_this_quota_on(dqp->q_mount, dqp->dq_flags)) {
>  		/*
>  		 * Return if this type of quotas is turned off while we
>  		 * didn't have the quota inode lock.
>  		 */
> -		xfs_iunlock(quotip, lock_mode);
> +		if (lock_mode)
> +			xfs_iunlock(quotip, lock_mode);
>  		return -ESRCH;
>  	}
>  
> @@ -438,7 +440,8 @@ xfs_qm_dqtobp(
>  	error = xfs_bmapi_read(quotip, dqp->q_fileoffset,
>  			       XFS_DQUOT_CLUSTER_SIZE_FSB, &map, &nmaps, 0);
>  
> -	xfs_iunlock(quotip, lock_mode);
> +	if (lock_mode)
> +		xfs_iunlock(quotip, lock_mode);
>  	if (error)
>  		return error;
>  
> @@ -458,7 +461,7 @@ xfs_qm_dqtobp(
>  		 */
>  		if (!(flags & XFS_QMOPT_DQALLOC))
>  			return -ENOENT;
> -
> +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
>  		ASSERT(tp);
>  		error = xfs_qm_dqalloc(tpp, mp, dqp, quotip,
>  					dqp->q_fileoffset, &bp);
> @@ -553,6 +556,7 @@ xfs_qm_dqread(
>  	trace_xfs_dqread(dqp);
>  
>  	if (flags & XFS_QMOPT_DQALLOC) {
> +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));

Perhaps we should just have an explicit check for both flags above and
return with -EINVAL if necessary?

>  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
>  				XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
>  		if (error)
> @@ -634,11 +638,12 @@ static int
>  xfs_dq_get_next_id(
>  	struct xfs_mount	*mp,
>  	uint			type,
> -	xfs_dqid_t		*id)
> +	xfs_dqid_t		*id,
> +	uint			flags)
>  {
>  	struct xfs_inode	*quotip = xfs_quota_inode(mp, type);
>  	xfs_dqid_t		next_id = *id + 1; /* simple advance */
> -	uint			lock_flags;
> +	uint			lock_flags = 0;
>  	struct xfs_bmbt_irec	got;
>  	struct xfs_iext_cursor	cur;
>  	xfs_fsblock_t		start;
> @@ -657,7 +662,8 @@ xfs_dq_get_next_id(
>  	/* Nope, next_id is now past the current chunk, so find the next one */
>  	start = (xfs_fsblock_t)next_id / mp->m_quotainfo->qi_dqperchunk;
>  
> -	lock_flags = xfs_ilock_data_map_shared(quotip);
> +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> +		lock_flags = xfs_ilock_data_map_shared(quotip);
>  	if (!(quotip->i_df.if_flags & XFS_IFEXTENTS)) {
>  		error = xfs_iread_extents(NULL, quotip, XFS_DATA_FORK);
>  		if (error)
> @@ -673,7 +679,8 @@ xfs_dq_get_next_id(
>  		error = -ENOENT;
>  	}
>  
> -	xfs_iunlock(quotip, lock_flags);
> +	if (lock_flags)
> +		xfs_iunlock(quotip, lock_flags);
>  
>  	return error;
>  }
> @@ -733,7 +740,8 @@ xfs_qm_dqget(

Earlier code in this function drops/reacquires the ip ILOCK with a note
in the comment around lock ordering with the quota inode. Assuming an
inode is passed, is that lock cycle safe if the caller also has the
quotip held locked?

Brian

>  			if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
>  				xfs_dqunlock(dqp);
>  				mutex_unlock(&qi->qi_tree_lock);
> -				error = xfs_dq_get_next_id(mp, type, &id);
> +				error = xfs_dq_get_next_id(mp, type, &id,
> +						flags);
>  				if (error)
>  					return error;
>  				goto restart;
> @@ -768,7 +776,7 @@ xfs_qm_dqget(
>  
>  	/* If we are asked to find next active id, keep looking */
>  	if (error == -ENOENT && (flags & XFS_QMOPT_DQNEXT)) {
> -		error = xfs_dq_get_next_id(mp, type, &id);
> +		error = xfs_dq_get_next_id(mp, type, &id, flags);
>  		if (!error)
>  			goto restart;
>  	}
> @@ -827,7 +835,7 @@ xfs_qm_dqget(
>  	if (flags & XFS_QMOPT_DQNEXT) {
>  		if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
>  			xfs_qm_dqput(dqp);
> -			error = xfs_dq_get_next_id(mp, type, &id);
> +			error = xfs_dq_get_next_id(mp, type, &id, flags);
>  			if (error)
>  				return error;
>  			goto restart;
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/11] xfs: skip scrub xref if corruption already noted
  2018-04-18 15:03   ` Brian Foster
@ 2018-04-18 16:02     ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18 16:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 11:03:08AM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:39:33PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Don't bother looking for cross-referencing problems if the metadata is
> > already corrupt or we've already found a cross-referencing problem.
> > Since we added a helper function for flags testing, convert existing
> > users to use it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/alloc.c    |    2 +-
> >  fs/xfs/scrub/common.c   |    4 ++++
> >  fs/xfs/scrub/common.h   |    6 ++++++
> >  fs/xfs/scrub/ialloc.c   |    2 +-
> >  fs/xfs/scrub/refcount.c |    4 ++--
> >  fs/xfs/scrub/rmap.c     |    4 ++--
> >  fs/xfs/scrub/rtbitmap.c |    3 +++
> >  fs/xfs/scrub/scrub.c    |    3 +--
> >  8 files changed, 20 insertions(+), 8 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
> > index 517c079..f5c6993 100644
> > --- a/fs/xfs/scrub/alloc.c
> > +++ b/fs/xfs/scrub/alloc.c
> > @@ -172,7 +172,7 @@ xfs_scrub_xref_is_used_space(
> >  	bool				is_freesp;
> >  	int				error;
> >  
> > -	if (!sc->sa.bno_cur)
> > +	if (!sc->sa.bno_cur || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp);
> > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > index 8ed91d5..3f72176 100644
> > --- a/fs/xfs/scrub/common.c
> > +++ b/fs/xfs/scrub/common.c
> > @@ -727,6 +727,10 @@ xfs_scrub_should_check_xref(
> >  	int				*error,
> >  	struct xfs_btree_cur		**curpp)
> >  {
> > +	/* No point in xref if we already know we're corrupt. */
> > +	if (xfs_scrub_found_corruption(sc->sm))
> > +		return false;
> > +
> 
> This seems like a fairly innocuous change to someone not terribly
> familiar with the scrub code, but a high level thought...
> 
> This function has 30+ callers, and then we have a bunch of these
> found_corruption() checks as well, one of which is the former. The new
> found_corruption() helper looks like it will easily have tens of
> callers throughout the scrub xref code, which kind of makes this look
> like quite a maze of logic just to shortcut various checks.
> 
> This is not an objection to this patch, but it makes me wonder how much
> "empty" processing is involved in a situation where we assume an xref
> error is detected early on in a scrub. Secondarily, the logic makes this
> hard enough to follow that the only way to really surmise that is to
> trace through all of the scrub code. It also looks like we have some
> places where it's checked multiple times.

Yep.  Originally I designed scrub always to run all the way to the end
so that we could find and stream to userspace every corruption &
discrepancy that we found.  Then I started looking into how to actually
communicate with userspace in a sane way, found none, and decided to
migrate towards bailing out as soon as we know something's bad:

 1. Stop checking at the first outright corruption we see.
 2. Stop cross-referencing at the first xref discrepancy we see.

The goal is to revise the structure of the scrubbing code like so:

1. for each record:
   a. call function that:
        i. look for corruption errors
       ii. if corruption errors then return
      iii. for each thing we can cross-reference with:
           A. if xref errors, break out of loop
           B. confirm cross-reference
   b. if corruption errors, break out of loop

Note: the xref checks in 1.a.iii.B. should never set OFLAG_CORRUPT.

Therefore I have to outfit every xref helper and subfunction with a
check that skips the xref if there are xref errors...
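
In code, that gives each record scrubber roughly the shape below.  This
is purely an illustrative sketch: every name containing "foo" is made
up, and the only real helper used is xfs_scrub_found_corruption() from
patch 1.

/* Hypothetical record scrubber following the outline above. */
STATIC int
xfs_scrub_foo_rec(
	struct xfs_scrub_context	*sc,
	struct xfs_foo_rec		*rec)
{
	/* 1.a.i-ii: check the record itself; bail before any xref. */
	xfs_scrub_foo_check_rec(sc, rec);	/* may set OFLAG_CORRUPT */
	if (xfs_scrub_found_corruption(sc->sm))
		return 0;

	/*
	 * 1.a.iii: cross-reference with the other metadata.  These
	 * helpers only ever set OFLAG_XCORRUPT, and each one skips
	 * itself once any corruption flag has been set.
	 */
	xfs_scrub_foo_xref_rmap(sc, rec);
	xfs_scrub_foo_xref_refcount(sc, rec);

	return 0;
}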

> For example, xfs_scrub_inode_xref() checks the corrupt flag and then
> calls a bunch of sub-handlers that call either found_corruption() or
> the should_check_xref(). One of the latter callers is
> xfs_scrub_inode_xref_finobt(), which does a btree lookup before it
> ever gets to this call only to then bail on the corruption flag being
> set.

...but I see what you're saying and I think I just plain missed those. :(

> Granted, this would probably only execute if the flag was set after the
> higher level scrub_inode_xref() check, but isn't it kind of strange to
> bury a check that is intended to control high-level xref execution flow
> in a call that is (also) designed to check the error of some btree
> lookup that shouldn't be required at all if the xref was going to be
> skipped anyways? I'm wondering if this would be more clear with
> consistent but separate error/corruption checks at the top of every xref
> helper to dictate high-level scrub execution flow and leave
> scrub_should_check_xref() to continue to care only about local
> processing decisions specific to the current branch of the scan (i.e.,
> my btree lookup failed and so we must set the corrupt flag and not
> proceed).

Yep, exactly right.  I'll go scan the source code again.

(My filthy excuse for missing things is distraction on account of making
sure that scrub reliably triggers repair, and that repair works
correctly.)

--D

> Brian
> 
> >  	if (*error == 0)
> >  		return true;
> >  
> > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > index deaf604..30e9039 100644
> > --- a/fs/xfs/scrub/common.h
> > +++ b/fs/xfs/scrub/common.h
> > @@ -157,4 +157,10 @@ int xfs_scrub_setup_inode_contents(struct xfs_scrub_context *sc,
> >  				   struct xfs_inode *ip, unsigned int resblks);
> >  void xfs_scrub_buffer_recheck(struct xfs_scrub_context *sc, struct xfs_buf *bp);
> >  
> > +static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> > +{
> > +	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> > +			       XFS_SCRUB_OFLAG_XCORRUPT);
> > +}
> > +
> >  #endif	/* __XFS_SCRUB_COMMON_H__ */
> > diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
> > index 106ca4b..05e6191 100644
> > --- a/fs/xfs/scrub/ialloc.c
> > +++ b/fs/xfs/scrub/ialloc.c
> > @@ -496,7 +496,7 @@ xfs_scrub_xref_inode_check(
> >  	bool				has_inodes;
> >  	int				error;
> >  
> > -	if (!(*icur))
> > +	if (!(*icur) || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes);
> > diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> > index 400f156..823bda3 100644
> > --- a/fs/xfs/scrub/refcount.c
> > +++ b/fs/xfs/scrub/refcount.c
> > @@ -460,7 +460,7 @@ xfs_scrub_xref_is_cow_staging(
> >  	int				has_refcount;
> >  	int				error;
> >  
> > -	if (!sc->sa.refc_cur)
> > +	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	/* Find the CoW staging extent. */
> > @@ -504,7 +504,7 @@ xfs_scrub_xref_is_not_shared(
> >  	bool				shared;
> >  	int				error;
> >  
> > -	if (!sc->sa.refc_cur)
> > +	if (!sc->sa.refc_cur || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	error = xfs_refcount_has_record(sc->sa.refc_cur, agbno, len, &shared);
> > diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
> > index 8f2a7c3..9ca92e4 100644
> > --- a/fs/xfs/scrub/rmap.c
> > +++ b/fs/xfs/scrub/rmap.c
> > @@ -207,7 +207,7 @@ xfs_scrub_xref_check_owner(
> >  	bool				has_rmap;
> >  	int				error;
> >  
> > -	if (!sc->sa.rmap_cur)
> > +	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo,
> > @@ -250,7 +250,7 @@ xfs_scrub_xref_has_no_owner(
> >  	bool				has_rmap;
> >  	int				error;
> >  
> > -	if (!sc->sa.rmap_cur)
> > +	if (!sc->sa.rmap_cur || xfs_scrub_found_corruption(sc->sm))
> >  		return;
> >  
> >  	error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap);
> > diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
> > index 39c41dfe..c8672f7 100644
> > --- a/fs/xfs/scrub/rtbitmap.c
> > +++ b/fs/xfs/scrub/rtbitmap.c
> > @@ -110,6 +110,9 @@ xfs_scrub_xref_is_used_rt_space(
> >  	bool				is_free;
> >  	int				error;
> >  
> > +	if (xfs_scrub_found_corruption(sc->sm))
> > +		return;
> > +
> >  	xfs_ilock(sc->mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
> >  	error = xfs_rtalloc_extent_is_free(sc->mp, sc->tp, fsbno, len,
> >  			&is_free);
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 26c7596..cb1e727 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -446,8 +446,7 @@ xfs_scrub_metadata(
> >  	} else if (error)
> >  		goto out_teardown;
> >  
> > -	if (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> > -			       XFS_SCRUB_OFLAG_XCORRUPT))
> > +	if (xfs_scrub_found_corruption(sc.sm))
> >  		xfs_alert_ratelimited(mp, "Corruption detected during scrub.");
> >  
> >  out_teardown:
> > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-18 15:33   ` Brian Foster
@ 2018-04-18 16:55     ` Darrick J. Wong
  2018-04-18 17:09       ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18 16:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 11:33:17AM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:39:39PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a new QMOPT flag to signal that we've already locked the quota
> > inode.  This will be used by the quota scrub code refactoring to avoid
> > dropping the quota ip lock during scrub.  The flag is incompatible with
> > the DQALLOC flag because dquot allocation creates a transaction, and we
> > shouldn't be doing that with the ILOCK held.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_quota_defs.h |    2 ++
> >  fs/xfs/xfs_dquot.c             |   32 ++++++++++++++++++++------------
> >  2 files changed, 22 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
> > index bb1b13a..cfc9938 100644
> > --- a/fs/xfs/libxfs/xfs_quota_defs.h
> > +++ b/fs/xfs/libxfs/xfs_quota_defs.h
> > @@ -107,6 +107,8 @@ typedef uint16_t	xfs_qwarncnt_t;
> >   * to a single function. None of these XFS_QMOPT_* flags are meant to have
> >   * persistent values (ie. their values can and will change between versions)
> >   */
> > +/* Quota ip already locked.  Cannot be used with DQALLOC. */
> > +#define XFS_QMOPT_QUOTIP_LOCKED	0x0000001
> >  #define XFS_QMOPT_DQALLOC	0x0000002 /* alloc dquot ondisk if needed */
> >  #define XFS_QMOPT_UQUOTA	0x0000004 /* user dquot requested */
> >  #define XFS_QMOPT_PQUOTA	0x0000008 /* project dquot requested */
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index a7daef9..ed2e37c 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -417,18 +417,20 @@ xfs_qm_dqtobp(
> >  	struct xfs_mount	*mp = dqp->q_mount;
> >  	xfs_dqid_t		id = be32_to_cpu(dqp->q_core.d_id);
> >  	struct xfs_trans	*tp = (tpp ? *tpp : NULL);
> > -	uint			lock_mode;
> > +	uint			lock_mode = 0;
> >  
> >  	quotip = xfs_quota_inode(dqp->q_mount, dqp->dq_flags);
> >  	dqp->q_fileoffset = (xfs_fileoff_t)id / mp->m_quotainfo->qi_dqperchunk;
> >  
> > -	lock_mode = xfs_ilock_data_map_shared(quotip);
> > +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> > +		lock_mode = xfs_ilock_data_map_shared(quotip);
> >  	if (!xfs_this_quota_on(dqp->q_mount, dqp->dq_flags)) {
> >  		/*
> >  		 * Return if this type of quotas is turned off while we
> >  		 * didn't have the quota inode lock.
> >  		 */
> > -		xfs_iunlock(quotip, lock_mode);
> > +		if (lock_mode)
> > +			xfs_iunlock(quotip, lock_mode);
> >  		return -ESRCH;
> >  	}
> >  
> > @@ -438,7 +440,8 @@ xfs_qm_dqtobp(
> >  	error = xfs_bmapi_read(quotip, dqp->q_fileoffset,
> >  			       XFS_DQUOT_CLUSTER_SIZE_FSB, &map, &nmaps, 0);
> >  
> > -	xfs_iunlock(quotip, lock_mode);
> > +	if (lock_mode)
> > +		xfs_iunlock(quotip, lock_mode);
> >  	if (error)
> >  		return error;
> >  
> > @@ -458,7 +461,7 @@ xfs_qm_dqtobp(
> >  		 */
> >  		if (!(flags & XFS_QMOPT_DQALLOC))
> >  			return -ENOENT;
> > -
> > +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
> >  		ASSERT(tp);
> >  		error = xfs_qm_dqalloc(tpp, mp, dqp, quotip,
> >  					dqp->q_fileoffset, &bp);
> > @@ -553,6 +556,7 @@ xfs_qm_dqread(
> >  	trace_xfs_dqread(dqp);
> >  
> >  	if (flags & XFS_QMOPT_DQALLOC) {
> > +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
> 
> Perhaps we should just have an explicit check for both flags above and
> return with -EINVAL if necessary?

Sounds like a good idea.

> >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
> >  				XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
> >  		if (error)
> > @@ -634,11 +638,12 @@ static int
> >  xfs_dq_get_next_id(
> >  	struct xfs_mount	*mp,
> >  	uint			type,
> > -	xfs_dqid_t		*id)
> > +	xfs_dqid_t		*id,
> > +	uint			flags)
> >  {
> >  	struct xfs_inode	*quotip = xfs_quota_inode(mp, type);
> >  	xfs_dqid_t		next_id = *id + 1; /* simple advance */
> > -	uint			lock_flags;
> > +	uint			lock_flags = 0;
> >  	struct xfs_bmbt_irec	got;
> >  	struct xfs_iext_cursor	cur;
> >  	xfs_fsblock_t		start;
> > @@ -657,7 +662,8 @@ xfs_dq_get_next_id(
> >  	/* Nope, next_id is now past the current chunk, so find the next one */
> >  	start = (xfs_fsblock_t)next_id / mp->m_quotainfo->qi_dqperchunk;
> >  
> > -	lock_flags = xfs_ilock_data_map_shared(quotip);
> > +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> > +		lock_flags = xfs_ilock_data_map_shared(quotip);
> >  	if (!(quotip->i_df.if_flags & XFS_IFEXTENTS)) {
> >  		error = xfs_iread_extents(NULL, quotip, XFS_DATA_FORK);
> >  		if (error)
> > @@ -673,7 +679,8 @@ xfs_dq_get_next_id(
> >  		error = -ENOENT;
> >  	}
> >  
> > -	xfs_iunlock(quotip, lock_flags);
> > +	if (lock_flags)
> > +		xfs_iunlock(quotip, lock_flags);
> >  
> >  	return error;
> >  }
> > @@ -733,7 +740,8 @@ xfs_qm_dqget(
> 
> Earlier code in this function drops/reacquires the ip ILOCK with a note
> in the comment around lock ordering with the quota inode. Assuming an
> inode is passed, is that lock cycle safe if the caller also has the
> quotip held locked?

I'm not sure; for the case that I want to support (iterate dquots
without a specific file inode in mind) I don't need (ip != NULL) at all.
So my answer is that I don't want to support this scenario at all. :)

Ok, so at the top of _dqget, add:

if (flags & _QUOTIP_LOCKED) {
	ASSERT(ip == NULL);
	ASSERT(!(flags & _DQALLOC));
	if (ip || (flags & DQALLOC))
		return -EINVAL;
}

--D

> 
> Brian
> 
> >  			if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
> >  				xfs_dqunlock(dqp);
> >  				mutex_unlock(&qi->qi_tree_lock);
> > -				error = xfs_dq_get_next_id(mp, type, &id);
> > +				error = xfs_dq_get_next_id(mp, type, &id,
> > +						flags);
> >  				if (error)
> >  					return error;
> >  				goto restart;
> > @@ -768,7 +776,7 @@ xfs_qm_dqget(
> >  
> >  	/* If we are asked to find next active id, keep looking */
> >  	if (error == -ENOENT && (flags & XFS_QMOPT_DQNEXT)) {
> > -		error = xfs_dq_get_next_id(mp, type, &id);
> > +		error = xfs_dq_get_next_id(mp, type, &id, flags);
> >  		if (!error)
> >  			goto restart;
> >  	}
> > @@ -827,7 +835,7 @@ xfs_qm_dqget(
> >  	if (flags & XFS_QMOPT_DQNEXT) {
> >  		if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
> >  			xfs_qm_dqput(dqp);
> > -			error = xfs_dq_get_next_id(mp, type, &id);
> > +			error = xfs_dq_get_next_id(mp, type, &id, flags);
> >  			if (error)
> >  				return error;
> >  			goto restart;
> > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-18 16:55     ` Darrick J. Wong
@ 2018-04-18 17:09       ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-18 17:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 09:55:11AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 18, 2018 at 11:33:17AM -0400, Brian Foster wrote:
> > On Tue, Apr 17, 2018 at 07:39:39PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a new QMOPT flag to signal that we've already locked the quota
> > > inode.  This will be used by the quota scrub code refactoring to avoid
> > > dropping the quota ip lock during scrub.  The flag is incompatible with
> > > the DQALLOC flag because dquot allocation creates a transaction, and we
> > > shouldn't be doing that with the ILOCK held.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_quota_defs.h |    2 ++
> > >  fs/xfs/xfs_dquot.c             |   32 ++++++++++++++++++++------------
> > >  2 files changed, 22 insertions(+), 12 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
> > > index bb1b13a..cfc9938 100644
> > > --- a/fs/xfs/libxfs/xfs_quota_defs.h
> > > +++ b/fs/xfs/libxfs/xfs_quota_defs.h
> > > @@ -107,6 +107,8 @@ typedef uint16_t	xfs_qwarncnt_t;
> > >   * to a single function. None of these XFS_QMOPT_* flags are meant to have
> > >   * persistent values (ie. their values can and will change between versions)
> > >   */
> > > +/* Quota ip already locked.  Cannot be used with DQALLOC. */
> > > +#define XFS_QMOPT_QUOTIP_LOCKED	0x0000001
> > >  #define XFS_QMOPT_DQALLOC	0x0000002 /* alloc dquot ondisk if needed */
> > >  #define XFS_QMOPT_UQUOTA	0x0000004 /* user dquot requested */
> > >  #define XFS_QMOPT_PQUOTA	0x0000008 /* project dquot requested */
> > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > > index a7daef9..ed2e37c 100644
> > > --- a/fs/xfs/xfs_dquot.c
> > > +++ b/fs/xfs/xfs_dquot.c
> > > @@ -417,18 +417,20 @@ xfs_qm_dqtobp(
> > >  	struct xfs_mount	*mp = dqp->q_mount;
> > >  	xfs_dqid_t		id = be32_to_cpu(dqp->q_core.d_id);
> > >  	struct xfs_trans	*tp = (tpp ? *tpp : NULL);
> > > -	uint			lock_mode;
> > > +	uint			lock_mode = 0;
> > >  
> > >  	quotip = xfs_quota_inode(dqp->q_mount, dqp->dq_flags);
> > >  	dqp->q_fileoffset = (xfs_fileoff_t)id / mp->m_quotainfo->qi_dqperchunk;
> > >  
> > > -	lock_mode = xfs_ilock_data_map_shared(quotip);
> > > +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> > > +		lock_mode = xfs_ilock_data_map_shared(quotip);
> > >  	if (!xfs_this_quota_on(dqp->q_mount, dqp->dq_flags)) {
> > >  		/*
> > >  		 * Return if this type of quotas is turned off while we
> > >  		 * didn't have the quota inode lock.
> > >  		 */
> > > -		xfs_iunlock(quotip, lock_mode);
> > > +		if (lock_mode)
> > > +			xfs_iunlock(quotip, lock_mode);
> > >  		return -ESRCH;
> > >  	}
> > >  
> > > @@ -438,7 +440,8 @@ xfs_qm_dqtobp(
> > >  	error = xfs_bmapi_read(quotip, dqp->q_fileoffset,
> > >  			       XFS_DQUOT_CLUSTER_SIZE_FSB, &map, &nmaps, 0);
> > >  
> > > -	xfs_iunlock(quotip, lock_mode);
> > > +	if (lock_mode)
> > > +		xfs_iunlock(quotip, lock_mode);
> > >  	if (error)
> > >  		return error;
> > >  
> > > @@ -458,7 +461,7 @@ xfs_qm_dqtobp(
> > >  		 */
> > >  		if (!(flags & XFS_QMOPT_DQALLOC))
> > >  			return -ENOENT;
> > > -
> > > +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
> > >  		ASSERT(tp);
> > >  		error = xfs_qm_dqalloc(tpp, mp, dqp, quotip,
> > >  					dqp->q_fileoffset, &bp);
> > > @@ -553,6 +556,7 @@ xfs_qm_dqread(
> > >  	trace_xfs_dqread(dqp);
> > >  
> > >  	if (flags & XFS_QMOPT_DQALLOC) {
> > > +		ASSERT(!(flags & XFS_QMOPT_QUOTIP_LOCKED));
> > 
> > Perhaps we should just have an explicit check for both flags above and
> > return with -EINVAL if necessary?
> 
> Sounds like a good idea.
> 
> > >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
> > >  				XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
> > >  		if (error)
> > > @@ -634,11 +638,12 @@ static int
> > >  xfs_dq_get_next_id(
> > >  	struct xfs_mount	*mp,
> > >  	uint			type,
> > > -	xfs_dqid_t		*id)
> > > +	xfs_dqid_t		*id,
> > > +	uint			flags)
> > >  {
> > >  	struct xfs_inode	*quotip = xfs_quota_inode(mp, type);
> > >  	xfs_dqid_t		next_id = *id + 1; /* simple advance */
> > > -	uint			lock_flags;
> > > +	uint			lock_flags = 0;
> > >  	struct xfs_bmbt_irec	got;
> > >  	struct xfs_iext_cursor	cur;
> > >  	xfs_fsblock_t		start;
> > > @@ -657,7 +662,8 @@ xfs_dq_get_next_id(
> > >  	/* Nope, next_id is now past the current chunk, so find the next one */
> > >  	start = (xfs_fsblock_t)next_id / mp->m_quotainfo->qi_dqperchunk;
> > >  
> > > -	lock_flags = xfs_ilock_data_map_shared(quotip);
> > > +	if (!(flags & XFS_QMOPT_QUOTIP_LOCKED))
> > > +		lock_flags = xfs_ilock_data_map_shared(quotip);
> > >  	if (!(quotip->i_df.if_flags & XFS_IFEXTENTS)) {
> > >  		error = xfs_iread_extents(NULL, quotip, XFS_DATA_FORK);
> > >  		if (error)
> > > @@ -673,7 +679,8 @@ xfs_dq_get_next_id(
> > >  		error = -ENOENT;
> > >  	}
> > >  
> > > -	xfs_iunlock(quotip, lock_flags);
> > > +	if (lock_flags)
> > > +		xfs_iunlock(quotip, lock_flags);
> > >  
> > >  	return error;
> > >  }
> > > @@ -733,7 +740,8 @@ xfs_qm_dqget(
> > 
> > Earlier code in this function drops/reacquires the ip ILOCK with a note
> > in the comment around lock ordering with the quota inode. Assuming an
> > inode is passed, is that lock cycle safe if the caller also has the
> > quotip held locked?
> 
> I'm not sure; for the case that I want to support (iterate dquots
> without a specific file inode in mind) I don't need (ip != NULL) at all.
> So my answer is that I don't want to support this scenario at all. :)
> 

Ok, works for me.

> Ok, so at the top of _dqget, add:
> 
> if (flags & _QUOTIP_LOCKED) {
> 	ASSERT(ip == NULL);
> 	ASSERT(!(flags & _DQALLOC));
> 	if (ip || (flags & DQALLOC))
> 		return -EINVAL;
> }
> 

It might be better to split the DQALLOC part off to xfs_qm_dqread()
since it is not static and that's where we act on the dqalloc flag.
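
Purely as a sketch of the placement (not the final form), something
like this at the top of xfs_qm_dqread():

	/*
	 * DQALLOC may start a transaction to allocate the dquot, which
	 * we can't do with the quota inode ILOCK already held.
	 */
	if (flags & XFS_QMOPT_QUOTIP_LOCKED) {
		ASSERT(!(flags & XFS_QMOPT_DQALLOC));
		if (flags & XFS_QMOPT_DQALLOC)
			return -EINVAL;
	}

...leaving xfs_qm_dqget() to enforce only the ip == NULL part.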

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  			if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
> > >  				xfs_dqunlock(dqp);
> > >  				mutex_unlock(&qi->qi_tree_lock);
> > > -				error = xfs_dq_get_next_id(mp, type, &id);
> > > +				error = xfs_dq_get_next_id(mp, type, &id,
> > > +						flags);
> > >  				if (error)
> > >  					return error;
> > >  				goto restart;
> > > @@ -768,7 +776,7 @@ xfs_qm_dqget(
> > >  
> > >  	/* If we are asked to find next active id, keep looking */
> > >  	if (error == -ENOENT && (flags & XFS_QMOPT_DQNEXT)) {
> > > -		error = xfs_dq_get_next_id(mp, type, &id);
> > > +		error = xfs_dq_get_next_id(mp, type, &id, flags);
> > >  		if (!error)
> > >  			goto restart;
> > >  	}
> > > @@ -827,7 +835,7 @@ xfs_qm_dqget(
> > >  	if (flags & XFS_QMOPT_DQNEXT) {
> > >  		if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
> > >  			xfs_qm_dqput(dqp);
> > > -			error = xfs_dq_get_next_id(mp, type, &id);
> > > +			error = xfs_dq_get_next_id(mp, type, &id, flags);
> > >  			if (error)
> > >  				return error;
> > >  			goto restart;
> > > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/11] xfs: report failing address when dquot verifier fails
  2018-04-18  2:39 ` [PATCH 03/11] xfs: report failing address when dquot verifier fails Darrick J. Wong
@ 2018-04-18 18:33   ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-18 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:39:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Pass the failing address through to the corruption report when dquot
> verifiers fail.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_dquot_buf.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_dquot_buf.c b/fs/xfs/libxfs/xfs_dquot_buf.c
> index 8b7a6c3..a1e6cf1 100644
> --- a/fs/xfs/libxfs/xfs_dquot_buf.c
> +++ b/fs/xfs/libxfs/xfs_dquot_buf.c
> @@ -229,7 +229,7 @@ xfs_dquot_buf_read_verify(
>  	else {
>  		fa = xfs_dquot_buf_verify(mp, bp);
>  		if (fa)
> -			xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
> +			xfs_verifier_error(bp, -EFSCORRUPTED, fa);
>  	}
>  }
>  
> @@ -266,7 +266,7 @@ xfs_dquot_buf_write_verify(
>  
>  	fa = xfs_dquot_buf_verify(mp, bp);
>  	if (fa)
> -		xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
> +		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
>  }
>  
>  const struct xfs_buf_ops xfs_dquot_buf_ops = {
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/11] xfs: refactor dquot iteration
  2018-04-18  2:40 ` [PATCH 04/11] xfs: refactor dquot iteration Darrick J. Wong
@ 2018-04-18 18:34   ` Brian Foster
  2018-04-18 22:20     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-18 18:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a helper function to iterate all the dquots of a given type in
> the system, and refactor the dquot scrub to use it.  This will get more
> use in the quota repair code.  Note that the new function differs from
> xfs_qm_dqiterate in that _dqiterate iterates dquot buffers, not the
> in-core structures themselves.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/quota.c |   46 +++++++++++++++++++++-------------------------
>  fs/xfs/xfs_dquot.c   |   41 +++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_dquot.h   |    5 +++++
>  3 files changed, 67 insertions(+), 25 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index ed2e37c..ec00402 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1127,3 +1127,44 @@ xfs_qm_exit(void)
>  	kmem_zone_destroy(xfs_qm_dqtrxzone);
>  	kmem_zone_destroy(xfs_qm_dqzone);
>  }
> +
> +/*
> + * Iterate every dquot of a particular type.  The caller must ensure that the
> + * particular quota type is active.  iter_fn can return negative error codes,
> + * or XFS_BTREE_QUERY_RANGE_ABORT to indicate that it wants to stop iterating.
> + *
> + * Note that xfs_qm_dqiterate iterates all the dquot bufs, not the dquots
> + * themselves.
> + */
> +int
> +xfs_dquot_iterate(
> +	struct xfs_mount	*mp,
> +	uint			dqtype,
> +	uint			iter_flags,
> +	xfs_dquot_iterate_fn	iter_fn,
> +	void			*priv)
> +{
> +	struct xfs_dquot	*dq;
> +	xfs_dqid_t		id = 0;
> +	int			error;
> +
> +	while (id < ((xfs_dqid_t)-1ULL)) {
> +		error = xfs_qm_dqget(mp, NULL, id, dqtype,
> +				XFS_QMOPT_DQNEXT | iter_flags, &dq);
> +		if (error == -ENOENT) {
> +			error = 0;
> +			break;
> +		}
> +		if (error)
> +			break;
> +

It looks like this logic could be simplified a bit. E.g.:

	while ((error = xfs_qm_dqget(mp, NULL, id, dqtype,
				XFS_QMOPT_DQNEXT | iter_flags, &dq)) == 0) {
		id = be32_to_cpu(dq->q_core.d_id);
		...

	}
	return error == -ENOENT ? 0 : error;

Otherwise looks good.

Brian

> +		id = be32_to_cpu(dq->q_core.d_id);
> +		error = iter_fn(dq, dqtype, id, priv);
> +		xfs_qm_dqput(dq);
> +		id++;
> +		if (error || id == 0)
> +			break;
> +	}
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index 2f536f3..db0511e 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -185,4 +185,9 @@ static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
>  	return dqp;
>  }
>  
> +typedef int (*xfs_dquot_iterate_fn)(struct xfs_dquot *dq, uint dqtype,
> +		xfs_dqid_t id, void *priv);
> +int xfs_dquot_iterate(struct xfs_mount *mp, uint dqtype, uint iter_flags,
> +		xfs_dquot_iterate_fn iter_fn, void *priv);
> +
>  #endif /* __XFS_DQUOT_H__ */
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 05/11] xfs: avoid ilock games in the quota scrubber
  2018-04-18  2:40 ` [PATCH 05/11] xfs: avoid ilock games in the quota scrubber Darrick J. Wong
@ 2018-04-18 18:34   ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-18 18:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we have a new QMOPT_QUOTIP_LOCKED flag, use it to dqget in the
> quota scrubber so that we can ilock the quota ip at the start of scrub
> and keep it locked all the way to the end.  This enables us to remove
> quite a bit of locking-related games and manage the quota ip the same
> way we treat other inodes being scrubbed.  Allocate a transaction so
> that we can evade deadlock issues in bmbt lookups.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/quota.c |   46 ++++++++++++++++++----------------------------
>  fs/xfs/scrub/scrub.c |    4 ++++
>  fs/xfs/scrub/scrub.h |    1 +
>  3 files changed, 23 insertions(+), 28 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
> index 363e318..1068a91 100644
> --- a/fs/xfs/scrub/quota.c
> +++ b/fs/xfs/scrub/quota.c
> @@ -66,12 +66,24 @@ xfs_scrub_setup_quota(
>  	struct xfs_inode		*ip)
>  {
>  	uint				dqtype;
> +	int				error;
> +
> +	if (!XFS_IS_QUOTA_RUNNING(sc->mp) || !XFS_IS_QUOTA_ON(sc->mp))
> +		return -ENOENT;
>  
>  	dqtype = xfs_scrub_quota_to_dqtype(sc);
>  	if (dqtype == 0)
>  		return -EINVAL;
> +	sc->has_quotaofflock = true;
> +	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
>  	if (!xfs_this_quota_on(sc->mp, dqtype))
>  		return -ENOENT;
> +	error = xfs_scrub_setup_fs(sc, ip);

Looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

Not really related, but FYI the ip parameter to xfs_scrub_setup_fs()
appears to be unused.

Brian

> +	if (error)
> +		return error;
> +	sc->ip = xfs_quota_inode(sc->mp, dqtype);
> +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> +	sc->ilock_flags = XFS_ILOCK_EXCL;
>  	return 0;
>  }
>  
> @@ -203,7 +215,6 @@ xfs_scrub_quota(
>  	struct xfs_bmbt_irec		irec = { 0 };
>  	struct xfs_scrub_quota_info	sqi;
>  	struct xfs_mount		*mp = sc->mp;
> -	struct xfs_inode		*ip;
>  	struct xfs_quotainfo		*qi = mp->m_quotainfo;
>  	xfs_fileoff_t			max_dqid_off;
>  	xfs_fileoff_t			off = 0;
> @@ -211,25 +222,12 @@ xfs_scrub_quota(
>  	int				nimaps;
>  	int				error = 0;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> -		return -ENOENT;
> -
> -	mutex_lock(&qi->qi_quotaofflock);
>  	dqtype = xfs_scrub_quota_to_dqtype(sc);
> -	if (!xfs_this_quota_on(sc->mp, dqtype)) {
> -		error = -ENOENT;
> -		goto out_unlock_quota;
> -	}
> -
> -	/* Attach to the quota inode and set sc->ip so that reporting works. */
> -	ip = xfs_quota_inode(sc->mp, dqtype);
> -	sc->ip = ip;
>  
>  	/* Look for problem extents. */
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
> +	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
>  		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino);
> -		goto out_unlock_inode;
> +		goto out;
>  	}
>  	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
>  	while (1) {
> @@ -238,11 +236,11 @@ xfs_scrub_quota(
>  
>  		off = irec.br_startoff + irec.br_blockcount;
>  		nimaps = 1;
> -		error = xfs_bmapi_read(ip, off, -1, &irec, &nimaps,
> +		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
>  				XFS_BMAPI_ENTIRE);
>  		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
>  				&error))
> -			goto out_unlock_inode;
> +			goto out;
>  		if (!nimaps)
>  			break;
>  		if (irec.br_startblock == HOLESTARTBLOCK)
> @@ -268,26 +266,18 @@ xfs_scrub_quota(
>  		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
>  			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
>  	}
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
>  		goto out;
>  
>  	/* Check all the quota items. */
>  	sqi.sc = sc;
>  	sqi.last_id = 0;
> -	error = xfs_dquot_iterate(mp, dqtype, 0, xfs_scrub_quota_item, &sqi);
> +	error = xfs_dquot_iterate(mp, dqtype, XFS_QMOPT_QUOTIP_LOCKED,
> +			xfs_scrub_quota_item, &sqi);
>  	if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK,
>  			sqi.last_id * qi->qi_dqperchunk, &error))
>  		goto out;
>  
>  out:
> -	/* We set sc->ip earlier, so make sure we clear it now. */
> -	sc->ip = NULL;
> -out_unlock_quota:
> -	mutex_unlock(&qi->qi_quotaofflock);
>  	return error;
> -
> -out_unlock_inode:
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	goto out;
>  }
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index cb1e727..c43ee9e 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -42,6 +42,8 @@
>  #include "xfs_refcount_btree.h"
>  #include "xfs_rmap.h"
>  #include "xfs_rmap_btree.h"
> +#include "xfs_quota.h"
> +#include "xfs_qm.h"
>  #include "scrub/xfs_scrub.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
> @@ -166,6 +168,8 @@ xfs_scrub_teardown(
>  			iput(VFS_I(sc->ip));
>  		sc->ip = NULL;
>  	}
> +	if (sc->has_quotaofflock)
> +		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
>  	if (sc->buf) {
>  		kmem_free(sc->buf);
>  		sc->buf = NULL;
> diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
> index 0d92af8..5d79731 100644
> --- a/fs/xfs/scrub/scrub.h
> +++ b/fs/xfs/scrub/scrub.h
> @@ -73,6 +73,7 @@ struct xfs_scrub_context {
>  	void				*buf;
>  	uint				ilock_flags;
>  	bool				try_harder;
> +	bool				has_quotaofflock;
>  
>  	/* State tracking for single-AG operations. */
>  	struct xfs_scrub_ag		sa;
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber
  2018-04-18  2:40 ` [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber Darrick J. Wong
@ 2018-04-18 18:34   ` Brian Foster
  2018-04-18 20:00     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-18 18:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:14PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Replace the quota scrubber's open-coded data fork scrubber with a
> redirected call to the bmapbtd scrubber.  This strengthens the quota
> scrub to include all the cross-referencing that it does.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/quota.c |   95 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 54 insertions(+), 41 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
> index 1068a91..67f94c4 100644
> --- a/fs/xfs/scrub/quota.c
> +++ b/fs/xfs/scrub/quota.c
> @@ -207,65 +207,78 @@ xfs_scrub_quota_item(
>  	return 0;
>  }
>  
> -/* Scrub all of a quota type's items. */
> -int
> -xfs_scrub_quota(
> +/* Check the quota's data fork. */
> +STATIC int
> +xfs_scrub_quota_data_fork(
>  	struct xfs_scrub_context	*sc)
>  {
>  	struct xfs_bmbt_irec		irec = { 0 };
> -	struct xfs_scrub_quota_info	sqi;
> -	struct xfs_mount		*mp = sc->mp;
> -	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_scrub_metadata	fake_sm;
> +	struct xfs_scrub_metadata	*real_sm = sc->sm;
> +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> +	struct xfs_ifork		*ifp;
>  	xfs_fileoff_t			max_dqid_off;
> -	xfs_fileoff_t			off = 0;
> -	uint				dqtype;
> -	int				nimaps;
>  	int				error = 0;
>  
> -	dqtype = xfs_scrub_quota_to_dqtype(sc);
> -
> -	/* Look for problem extents. */
> +	/* Quotas don't live on the rt device. */
>  	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
>  		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino);
> -		goto out;
> +		return 0;
>  	}
> +
> +	/* Invoke the data fork scrubber. */
> +	memcpy(&fake_sm, real_sm, sizeof(fake_sm));
> +	fake_sm.sm_type = XFS_SCRUB_TYPE_BMBTD;
> +	fake_sm.sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
> +	sc->sm = &fake_sm;
> +	error = xfs_scrub_bmap_data(sc);
> +	sc->sm = real_sm;

Is the sm swap out of caution or is there really some conflict here?
AFAICT sm_type is only used in the setup handler (and tracepoints).
With respect to sm_flags, what's the difference between the above and
just passing in the original sm without stripping FLAGS_OUT (whatever it
is, might be useful to note in the comment)?

> +	if (error)
> +		return error;
> +	sc->sm->sm_flags |= (fake_sm.sm_flags & XFS_SCRUB_FLAGS_OUT);
> +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> +		return error;
> +
> +	/* Check for data fork problems that apply only to quota files. */
>  	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> -	while (1) {
> +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> +	for_each_xfs_iext(ifp, &icur, &irec) {
>  		if (xfs_scrub_should_terminate(sc, &error))
>  			break;
> -
> -		off = irec.br_startoff + irec.br_blockcount;
> -		nimaps = 1;
> -		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
> -				XFS_BMAPI_ENTIRE);
> -		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
> -				&error))
> -			goto out;
> -		if (!nimaps)
> -			break;
> -		if (irec.br_startblock == HOLESTARTBLOCK)
> -			continue;
> -
> -		/* Check the extent record doesn't point to crap. */
> -		if (irec.br_startblock + irec.br_blockcount <=
> -		    irec.br_startblock)
> -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> -					irec.br_startoff);
> -		if (!xfs_verify_fsbno(mp, irec.br_startblock) ||
> -		    !xfs_verify_fsbno(mp, irec.br_startblock +
> -					irec.br_blockcount - 1))
> -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> -					irec.br_startoff);
> -
>  		/*
> -		 * Unwritten extents or blocks mapped above the highest
> +		 * delalloc extents or blocks mapped above the highest
>  		 * quota id shouldn't happen.
>  		 */
>  		if (isnullstartblock(irec.br_startblock) ||
>  		    irec.br_startoff > max_dqid_off ||
> -		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
> -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
> +		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1) {

Not introduced by this patch, but what's with the +1 to max_dqid_off
here?

Brian

> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> +					irec.br_startoff);
> +			break;
> +		}
>  	}
> +
> +	return error;
> +}
> +
> +/* Scrub all of a quota type's items. */
> +int
> +xfs_scrub_quota(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_scrub_quota_info	sqi;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> +	uint				dqtype;
> +	int				error = 0;
> +
> +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> +
> +	/* Look for problem extents. */
> +	error = xfs_scrub_quota_data_fork(sc);
> +	if (error)
> +		goto out;
>  	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
>  		goto out;
>  
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber
  2018-04-18 18:34   ` Brian Foster
@ 2018-04-18 20:00     ` Darrick J. Wong
  2018-04-19 11:20       ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18 20:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 02:34:47PM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:40:14PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Replace the quota scrubber's open-coded data fork scrubber with a
> > redirected call to the bmapbtd scrubber.  This strengthens the quota
> > scrub to include all the cross-referencing that it does.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/quota.c |   95 ++++++++++++++++++++++++++++----------------------
> >  1 file changed, 54 insertions(+), 41 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
> > index 1068a91..67f94c4 100644
> > --- a/fs/xfs/scrub/quota.c
> > +++ b/fs/xfs/scrub/quota.c
> > @@ -207,65 +207,78 @@ xfs_scrub_quota_item(
> >  	return 0;
> >  }
> >  
> > -/* Scrub all of a quota type's items. */
> > -int
> > -xfs_scrub_quota(
> > +/* Check the quota's data fork. */
> > +STATIC int
> > +xfs_scrub_quota_data_fork(
> >  	struct xfs_scrub_context	*sc)
> >  {
> >  	struct xfs_bmbt_irec		irec = { 0 };
> > -	struct xfs_scrub_quota_info	sqi;
> > -	struct xfs_mount		*mp = sc->mp;
> > -	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> > +	struct xfs_iext_cursor		icur;
> > +	struct xfs_scrub_metadata	fake_sm;
> > +	struct xfs_scrub_metadata	*real_sm = sc->sm;
> > +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> > +	struct xfs_ifork		*ifp;
> >  	xfs_fileoff_t			max_dqid_off;
> > -	xfs_fileoff_t			off = 0;
> > -	uint				dqtype;
> > -	int				nimaps;
> >  	int				error = 0;
> >  
> > -	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > -
> > -	/* Look for problem extents. */
> > +	/* Quotas don't live on the rt device. */
> >  	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) {
> >  		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino);
> > -		goto out;
> > +		return 0;
> >  	}
> > +
> > +	/* Invoke the data fork scrubber. */
> > +	memcpy(&fake_sm, real_sm, sizeof(fake_sm));
> > +	fake_sm.sm_type = XFS_SCRUB_TYPE_BMBTD;
> > +	fake_sm.sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
> > +	sc->sm = &fake_sm;
> > +	error = xfs_scrub_bmap_data(sc);
> > +	sc->sm = real_sm;
> 
> Is the sm swap out of caution or is there really some conflict here?
> AFAICT sm_type is only used in the setup handler (and tracepoints).

The fake_sm is out of caution so that anything that scribbles on
_scrub_bmap_data can't make it back out to userspace.  It could probably
get dropped.

The sm_type override is so the tracepoints report the scrub type
correctly ("type bmapbtd ino <uquotino>").

But stack space is precious, so I'll just override the two fields we
need.
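
Roughly, as a sketch only (the exact save/restore is illustrative):

	__u32	saved_type = sc->sm->sm_type;

	sc->sm->sm_type = XFS_SCRUB_TYPE_BMBTD;
	error = xfs_scrub_bmap_data(sc);
	sc->sm->sm_type = saved_type;

with the sm_flags side handled by the early corruption check mentioned
below.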

> With respect to sm_flags, what's the difference between the above and
> just passing in the original sm without stripping FLAGS_OUT (whatever it
> is, might be useful to note in the comment)?

Hm, yes, the whole thing warrants an "if (_CORRUPT) return;" at the top
so that we don't then need to clean out sm_flags.

> 
> > +	if (error)
> > +		return error;
> > +	sc->sm->sm_flags |= (fake_sm.sm_flags & XFS_SCRUB_FLAGS_OUT);
> > +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > +		return error;
> > +
> > +	/* Check for data fork problems that apply only to quota files. */
> >  	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> > -	while (1) {
> > +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> > +	for_each_xfs_iext(ifp, &icur, &irec) {
> >  		if (xfs_scrub_should_terminate(sc, &error))
> >  			break;
> > -
> > -		off = irec.br_startoff + irec.br_blockcount;
> > -		nimaps = 1;
> > -		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
> > -				XFS_BMAPI_ENTIRE);
> > -		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
> > -				&error))
> > -			goto out;
> > -		if (!nimaps)
> > -			break;
> > -		if (irec.br_startblock == HOLESTARTBLOCK)
> > -			continue;
> > -
> > -		/* Check the extent record doesn't point to crap. */
> > -		if (irec.br_startblock + irec.br_blockcount <=
> > -		    irec.br_startblock)
> > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > -					irec.br_startoff);
> > -		if (!xfs_verify_fsbno(mp, irec.br_startblock) ||
> > -		    !xfs_verify_fsbno(mp, irec.br_startblock +
> > -					irec.br_blockcount - 1))
> > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > -					irec.br_startoff);
> > -
> >  		/*
> > -		 * Unwritten extents or blocks mapped above the highest
> > +		 * delalloc extents or blocks mapped above the highest
> >  		 * quota id shouldn't happen.
> >  		 */
> >  		if (isnullstartblock(irec.br_startblock) ||
> >  		    irec.br_startoff > max_dqid_off ||
> > -		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
> > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
> > +		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1) {
> 
> Not introduced by this patch, but what's with the +1 to max_dqid_off
> here?

max_dqid_off is the file block offset of the dquot for the highest
quota id (0xFFFFFFFF), so this is checking that the extent doesn't map
any block past that last reachable offset.  The logic is a bit twisted;
maybe (irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) would
be clearer?
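
(To put numbers on it, with a made-up qi_dqperchunk of 32:
max_dqid_off = 0xFFFFFFFF / 32 = 134217727, so an extent with
br_startoff = 134217720 and br_blockcount = 8 maps offsets up through
134217727 and is fine, while br_blockcount = 9 would also map
134217728, which no dquot id can ever reach.)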

--D

> 
> Brian
> 
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > +					irec.br_startoff);
> > +			break;
> > +		}
> >  	}
> > +
> > +	return error;
> > +}
> > +
> > +/* Scrub all of a quota type's items. */
> > +int
> > +xfs_scrub_quota(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_scrub_quota_info	sqi;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> > +	uint				dqtype;
> > +	int				error = 0;
> > +
> > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > +
> > +	/* Look for problem extents. */
> > +	error = xfs_scrub_quota_data_fork(sc);
> > +	if (error)
> > +		goto out;
> >  	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> >  		goto out;
> >  
> > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/11] xfs: refactor dquot iteration
  2018-04-18 18:34   ` Brian Foster
@ 2018-04-18 22:20     ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-18 22:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 02:34:30PM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:40:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a helper function to iterate all the dquots of a given type in
> > the system, and refactor the dquot scrub to use it.  This will get more
> > use in the quota repair code.  Note that the new function differs from
> > xfs_qm_dqiterate in that _dqiterate iterates dquot buffers, not the
> > in-core structures themselves.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/quota.c |   46 +++++++++++++++++++++-------------------------
> >  fs/xfs/xfs_dquot.c   |   41 +++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_dquot.h   |    5 +++++
> >  3 files changed, 67 insertions(+), 25 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index ed2e37c..ec00402 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -1127,3 +1127,44 @@ xfs_qm_exit(void)
> >  	kmem_zone_destroy(xfs_qm_dqtrxzone);
> >  	kmem_zone_destroy(xfs_qm_dqzone);
> >  }
> > +
> > +/*
> > + * Iterate every dquot of a particular type.  The caller must ensure that the
> > + * particular quota type is active.  iter_fn can return negative error codes,
> > + * or XFS_BTREE_QUERY_RANGE_ABORT to indicate that it wants to stop iterating.
> > + *
> > + * Note that xfs_qm_dqiterate iterates all the dquot bufs, not the dquots
> > + * themselves.
> > + */
> > +int
> > +xfs_dquot_iterate(
> > +	struct xfs_mount	*mp,
> > +	uint			dqtype,
> > +	uint			iter_flags,
> > +	xfs_dquot_iterate_fn	iter_fn,
> > +	void			*priv)
> > +{
> > +	struct xfs_dquot	*dq;
> > +	xfs_dqid_t		id = 0;
> > +	int			error;
> > +
> > +	while (id < ((xfs_dqid_t)-1ULL)) {
> > +		error = xfs_qm_dqget(mp, NULL, id, dqtype,
> > +				XFS_QMOPT_DQNEXT | iter_flags, &dq);
> > +		if (error == -ENOENT) {
> > +			error = 0;
> > +			break;
> > +		}
> > +		if (error)
> > +			break;
> > +
> 
> It looks like this logic could be simplified a bit. E.g.:
> 
> 	while ((error = xfs_qm_dqget(mp, NULL, id, dqtype,
> 				XFS_QMOPT_DQNEXT | iter_flags, &dq)) == 0) {
> 		id = be32_to_cpu(dq->q_core.d_id);
> 		...
> 
> 	}
> 	return error == -ENOENT ? 0 : error;

The downside of this (the return line specifically) is that we'll squash
iter_fn returning -ENOENT, even though the function comment says iter_fn
is allowed to return that.

That said, the current while loop test is wrong anyway; 0xFFFFFFFF is a
valid project id because the kernel will let you set it, even if
xfs_quota/xfs_io won't.

So I think this loop can become:

do {
	error = xfs_qm_dqget(mp, NULL, id, dqtype,
			XFS_QMOPT_DQNEXT | iter_flags, &dq);
	if (error == -ENOENT)
		return 0;
	if (error)
		return error;

	id = be32_to_cpu(dq->q_core.d_id);
	error = iter_fn(dq, dqtype, id, priv);
	xfs_qm_dqput(dq);
	id++;
} while (error == 0 && id != 0);

--D

> 
> Otherwise looks good.
> 
> Brian
> 
> > +		id = be32_to_cpu(dq->q_core.d_id);
> > +		error = iter_fn(dq, dqtype, id, priv);
> > +		xfs_qm_dqput(dq);
> > +		id++;
> > +		if (error || id == 0)
> > +			break;
> > +	}
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> > index 2f536f3..db0511e 100644
> > --- a/fs/xfs/xfs_dquot.h
> > +++ b/fs/xfs/xfs_dquot.h
> > @@ -185,4 +185,9 @@ static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
> >  	return dqp;
> >  }
> >  
> > +typedef int (*xfs_dquot_iterate_fn)(struct xfs_dquot *dq, uint dqtype,
> > +		xfs_dqid_t id, void *priv);
> > +int xfs_dquot_iterate(struct xfs_mount *mp, uint dqtype, uint iter_flags,
> > +		xfs_dquot_iterate_fn iter_fn, void *priv);
> > +
> >  #endif /* __XFS_DQUOT_H__ */
> > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-18  2:39 ` [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag Darrick J. Wong
  2018-04-18 15:33   ` Brian Foster
@ 2018-04-19  8:32   ` Christoph Hellwig
  2018-04-21 18:42     ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2018-04-19  8:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:39:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a new QMOPT flag to signal that we've already locked the quota
> inode.  This will be used by the quota scrub code refactoring to avoid
> dropping the quota ip lock during scrub.  The flag is incompatible with
> the DQALLOC flag because dquot allocation creates a transaction, and we
> shouldn't be doing that with the ILOCK held.

Locked flags are always a sign of code smell.  And this already is
pretty smelly code before that flag is added.  I think someone (i.e.
either you or me :)) needs to do a major refactoring here first.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber
  2018-04-18 20:00     ` Darrick J. Wong
@ 2018-04-19 11:20       ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-19 11:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Apr 18, 2018 at 01:00:36PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 18, 2018 at 02:34:47PM -0400, Brian Foster wrote:
> > On Tue, Apr 17, 2018 at 07:40:14PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Replace the quota scrubber's open-coded data fork scrubber with a
> > > redirected call to the bmapbtd scrubber.  This strengthens the quota
> > > scrub to include all the cross-referencing that it does.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/quota.c |   95 ++++++++++++++++++++++++++++----------------------
> > >  1 file changed, 54 insertions(+), 41 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
> > > index 1068a91..67f94c4 100644
> > > --- a/fs/xfs/scrub/quota.c
> > > +++ b/fs/xfs/scrub/quota.c
...
> > > +	if (error)
> > > +		return error;
> > > +	sc->sm->sm_flags |= (fake_sm.sm_flags & XFS_SCRUB_FLAGS_OUT);
> > > +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > > +		return error;
> > > +
> > > +	/* Check for data fork problems that apply only to quota files. */
> > >  	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> > > -	while (1) {
> > > +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> > > +	for_each_xfs_iext(ifp, &icur, &irec) {
> > >  		if (xfs_scrub_should_terminate(sc, &error))
> > >  			break;
> > > -
> > > -		off = irec.br_startoff + irec.br_blockcount;
> > > -		nimaps = 1;
> > > -		error = xfs_bmapi_read(sc->ip, off, -1, &irec, &nimaps,
> > > -				XFS_BMAPI_ENTIRE);
> > > -		if (!xfs_scrub_fblock_process_error(sc, XFS_DATA_FORK, off,
> > > -				&error))
> > > -			goto out;
> > > -		if (!nimaps)
> > > -			break;
> > > -		if (irec.br_startblock == HOLESTARTBLOCK)
> > > -			continue;
> > > -
> > > -		/* Check the extent record doesn't point to crap. */
> > > -		if (irec.br_startblock + irec.br_blockcount <=
> > > -		    irec.br_startblock)
> > > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > > -					irec.br_startoff);
> > > -		if (!xfs_verify_fsbno(mp, irec.br_startblock) ||
> > > -		    !xfs_verify_fsbno(mp, irec.br_startblock +
> > > -					irec.br_blockcount - 1))
> > > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > > -					irec.br_startoff);
> > > -
> > >  		/*
> > > -		 * Unwritten extents or blocks mapped above the highest
> > > +		 * delalloc extents or blocks mapped above the highest
> > >  		 * quota id shouldn't happen.
> > >  		 */
> > >  		if (isnullstartblock(irec.br_startblock) ||
> > >  		    irec.br_startoff > max_dqid_off ||
> > > -		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
> > > -			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
> > > +		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1) {
> > 
> > Not introduced by this patch, but what's with the +1 to max_dqid_off
> > here?
> 
> max_dqid_off is the file block offset of the dquot for the highest
> quota id (0xFFFFFFFF), so this is checking that the next block after the
> extent doesn't map an unreachable offset.  The logic is a bit twisted;
> maybe (irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) would
> be clearer?
> 

Ah, I see. The calculated offset is exclusive and the max_dqid_off value
is inclusive. Yeah... not a big deal obviously but the tweak you propose
does seem a bit more clear to me. Thanks.
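
Just to spell out the equivalence for anyone reading along (sketch only,
not a patch):

	/* last file offset actually mapped by this extent, inclusive */
	xfs_fileoff_t	last_off = irec.br_startoff + irec.br_blockcount - 1;

	/* same condition as startoff + blockcount > max_dqid_off + 1 */
	if (last_off > max_dqid_off)
		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
				irec.br_startoff);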

Brian

> --D
> 
> > 
> > Brian
> > 
> > > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > > +					irec.br_startoff);
> > > +			break;
> > > +		}
> > >  	}
> > > +
> > > +	return error;
> > > +}
> > > +
> > > +/* Scrub all of a quota type's items. */
> > > +int
> > > +xfs_scrub_quota(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	struct xfs_scrub_quota_info	sqi;
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> > > +	uint				dqtype;
> > > +	int				error = 0;
> > > +
> > > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > > +
> > > +	/* Look for problem extents. */
> > > +	error = xfs_scrub_quota_data_fork(sc);
> > > +	if (error)
> > > +		goto out;
> > >  	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > >  		goto out;
> > >  
> > > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-18  2:40 ` [PATCH 07/11] xfs: superblock scrub should use uncached buffers Darrick J. Wong
@ 2018-04-19 12:55   ` Brian Foster
  2018-04-19 17:25     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-19 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> We've never cached buffers when reading or writing superblocks, so we
> need to change scrub to do likewise or risk screwing up the uncached sb
> buffer usage everywhere else.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Hmm, so this goes back to this[1] thread, right? IIRC, we wanted to end
up with an lru-bypassing uncached buffer lookup mechanism to provide
uncached behavior for resource-saving purposes but without introducing
serialization issues between multiple users of uncached buffers.

On a quick look back, growfs currently uses cached buffers for secondary
superblocks and the associated patch was looking to change that to
something like the above. Don't we have the same requirement here (since
growfs currently still uses cached buffers)?

Brian

[1] https://marc.info/?l=linux-xfs&m=151746733326282&w=2

>  fs/xfs/scrub/agheader.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index 018aabbd..aacbc3f 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -157,7 +157,7 @@ xfs_scrub_superblock(
>  	if (agno == 0)
>  		return 0;
>  
> -	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +	error = xfs_buf_read_uncached(mp->m_ddev_targp,
>  		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
>  		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
>  	/*
> @@ -421,6 +421,7 @@ xfs_scrub_superblock(
>  		xfs_scrub_block_set_corrupt(sc, bp);
>  
>  	xfs_scrub_superblock_xref(sc, bp);
> +	xfs_buf_relse(bp);
>  
>  	return error;
>  }
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS
  2018-04-18  2:40 ` [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS Darrick J. Wong
@ 2018-04-19 12:55   ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-19 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> All scrub code runs in transaction context, which means that memory
> allocations are automatically run in PF_MEMALLOC_NOFS context.  It's
> therefore unnecessary to pass in KM_NOFS to allocation routines, so
> clean them all out.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/scrub/agheader.c |    3 ++-
>  fs/xfs/scrub/btree.c    |    2 +-
>  fs/xfs/scrub/refcount.c |    2 +-
>  3 files changed, 4 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index aacbc3f..6c6e4d8 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -798,7 +798,8 @@ xfs_scrub_agfl(
>  	}
>  	memset(&sai, 0, sizeof(sai));
>  	sai.sz_entries = agflcount;
> -	sai.entries = kmem_zalloc(sizeof(xfs_agblock_t) * agflcount, KM_NOFS);
> +	sai.entries = kmem_zalloc(sizeof(xfs_agblock_t) * agflcount,
> +			KM_MAYFAIL);
>  	if (!sai.entries) {
>  		error = -ENOMEM;
>  		goto out;
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index 5421816..ea972da 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -442,7 +442,7 @@ xfs_scrub_btree_check_owner(
>  	 */
>  	if (cur->bc_btnum == XFS_BTNUM_BNO || cur->bc_btnum == XFS_BTNUM_RMAP) {
>  		co = kmem_alloc(sizeof(struct check_owner),
> -				KM_MAYFAIL | KM_NOFS);
> +				KM_MAYFAIL);
>  		if (!co)
>  			return -ENOMEM;
>  		co->level = level;
> diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> index 823bda3..5fff94d 100644
> --- a/fs/xfs/scrub/refcount.c
> +++ b/fs/xfs/scrub/refcount.c
> @@ -150,7 +150,7 @@ xfs_scrub_refcountbt_rmap_check(
>  		 * so we don't need insertion sort here.
>  		 */
>  		frag = kmem_alloc(sizeof(struct xfs_scrub_refcnt_frag),
> -				KM_MAYFAIL | KM_NOFS);
> +				KM_MAYFAIL);
>  		if (!frag)
>  			return -ENOMEM;
>  		memcpy(&frag->rm, rec, sizeof(frag->rm));
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/11] xfs: btree scrub should check minrecs
  2018-04-18  2:40 ` [PATCH 09/11] xfs: btree scrub should check minrecs Darrick J. Wong
@ 2018-04-19 12:55   ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-19 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:32PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Strengthen the btree block header checks to detect the number of records
> being less than the btree type's minimum record count.  Certain blocks
> are allowed to violate this constraint -- specifically any btree block
> at the top of the tree can have fewer than minrecs records.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/scrub/btree.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> 
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index ea972da..2d29dce 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -455,6 +455,44 @@ xfs_scrub_btree_check_owner(
>  }
>  
>  /*
> + * Check that this btree block has at least minrecs records or is one of the
> + * special blocks that don't require that.
> + */
> +STATIC void
> +xfs_scrub_btree_check_minrecs(
> +	struct xfs_scrub_btree	*bs,
> +	int			level,
> +	struct xfs_btree_block	*block)
> +{
> +	unsigned int		numrecs;
> +	int			ok_level;
> +
> +	numrecs = be16_to_cpu(block->bb_numrecs);
> +
> +	/* More records than minrecs means the block is ok. */
> +	if (numrecs >= bs->cur->bc_ops->get_minrecs(bs->cur, level))
> +		return;
> +
> +	/*
> +	 * Certain btree blocks /can/ have fewer than minrecs records.  Any
> +	 * level greater than or equal to the level of the highest dedicated
> +	 * btree block are allowed to violate this constraint.
> +	 *
> +	 * For a btree rooted in a block, the btree root can have fewer than
> +	 * minrecs records.  If the btree is rooted in an inode and does not
> +	 * store records in the root, the direct children of the root and the
> +	 * root itself can have fewer than minrecs records.
> +	 */
> +	ok_level = bs->cur->bc_nlevels - 1;
> +	if (bs->cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
> +		ok_level--;
> +	if (level >= ok_level)
> +		return;
> +
> +	xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, level);
> +}
> +
> +/*
>   * Grab and scrub a btree block given a btree pointer.  Returns block
>   * and buffer pointers (if applicable) if they're ok to use.
>   */
> @@ -491,6 +529,8 @@ xfs_scrub_btree_get_block(
>  	if (*pbp)
>  		xfs_scrub_buffer_recheck(bs->sc, *pbp);
>  
> +	xfs_scrub_btree_check_minrecs(bs, level, *pblock);
> +
>  	/*
>  	 * Check the block's owner; this function absorbs error codes
>  	 * for us.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/11] xfs: refactor scrub transaction allocation function
  2018-04-18  2:40 ` [PATCH 10/11] xfs: refactor scrub transaction allocation function Darrick J. Wong
@ 2018-04-19 12:56   ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2018-04-19 12:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:38PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Since the transaction allocation helper is about to become more complex,
> move it to common.c and remove the redundant parameters.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/scrub/bmap.c   |    3 +--
>  fs/xfs/scrub/common.c |   16 +++++++++++++---
>  fs/xfs/scrub/common.h |   14 +-------------
>  fs/xfs/scrub/inode.c  |    5 ++---
>  4 files changed, 17 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
> index 639d14b..3f8fd10 100644
> --- a/fs/xfs/scrub/bmap.c
> +++ b/fs/xfs/scrub/bmap.c
> @@ -51,7 +51,6 @@ xfs_scrub_setup_inode_bmap(
>  	struct xfs_scrub_context	*sc,
>  	struct xfs_inode		*ip)
>  {
> -	struct xfs_mount		*mp = sc->mp;
>  	int				error;
>  
>  	error = xfs_scrub_get_inode(sc, ip);
> @@ -75,7 +74,7 @@ xfs_scrub_setup_inode_bmap(
>  	}
>  
>  	/* Got the inode, lock it and we're ready to go. */
> -	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	error = xfs_scrub_trans_alloc(sc);
>  	if (error)
>  		goto out;
>  	sc->ilock_flags |= XFS_ILOCK_EXCL;
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 3f72176..f5e281a 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -568,13 +568,24 @@ xfs_scrub_ag_init(
>  
>  /* Per-scrubber setup functions */
>  
> +/*
> + * Grab an empty transaction so that we can re-grab locked buffers if
> + * one of our btrees turns out to be cyclic.
> + */
> +int
> +xfs_scrub_trans_alloc(
> +	struct xfs_scrub_context	*sc)
> +{
> +	return xfs_trans_alloc_empty(sc->mp, &sc->tp);
> +}
> +
>  /* Set us up with a transaction and an empty context. */
>  int
>  xfs_scrub_setup_fs(
>  	struct xfs_scrub_context	*sc,
>  	struct xfs_inode		*ip)
>  {
> -	return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
> +	return xfs_scrub_trans_alloc(sc);
>  }
>  
>  /* Set us up with AG headers and btree cursors. */
> @@ -695,7 +706,6 @@ xfs_scrub_setup_inode_contents(
>  	struct xfs_inode		*ip,
>  	unsigned int			resblks)
>  {
> -	struct xfs_mount		*mp = sc->mp;
>  	int				error;
>  
>  	error = xfs_scrub_get_inode(sc, ip);
> @@ -705,7 +715,7 @@ xfs_scrub_setup_inode_contents(
>  	/* Got the inode, lock it and we're ready to go. */
>  	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	xfs_ilock(sc->ip, sc->ilock_flags);
> -	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	error = xfs_scrub_trans_alloc(sc);
>  	if (error)
>  		goto out;
>  	sc->ilock_flags |= XFS_ILOCK_EXCL;
> diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> index 30e9039..8296873 100644
> --- a/fs/xfs/scrub/common.h
> +++ b/fs/xfs/scrub/common.h
> @@ -38,19 +38,7 @@ xfs_scrub_should_terminate(
>  	return false;
>  }
>  
> -/*
> - * Grab an empty transaction so that we can re-grab locked buffers if
> - * one of our btrees turns out to be cyclic.
> - */
> -static inline int
> -xfs_scrub_trans_alloc(
> -	struct xfs_scrub_metadata	*sm,
> -	struct xfs_mount		*mp,
> -	struct xfs_trans		**tpp)
> -{
> -	return xfs_trans_alloc_empty(mp, tpp);
> -}
> -
> +int xfs_scrub_trans_alloc(struct xfs_scrub_context *sc);
>  bool xfs_scrub_process_error(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
>  		xfs_agblock_t bno, int *error);
>  bool xfs_scrub_fblock_process_error(struct xfs_scrub_context *sc, int whichfork,
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index df14930..e15b1bc 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -55,7 +55,6 @@ xfs_scrub_setup_inode(
>  	struct xfs_scrub_context	*sc,
>  	struct xfs_inode		*ip)
>  {
> -	struct xfs_mount		*mp = sc->mp;
>  	int				error;
>  
>  	/*
> @@ -68,7 +67,7 @@ xfs_scrub_setup_inode(
>  		break;
>  	case -EFSCORRUPTED:
>  	case -EFSBADCRC:
> -		return xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +		return xfs_scrub_trans_alloc(sc);
>  	default:
>  		return error;
>  	}
> @@ -76,7 +75,7 @@ xfs_scrub_setup_inode(
>  	/* Got the inode, lock it and we're ready to go. */
>  	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	xfs_ilock(sc->ip, sc->ilock_flags);
> -	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	error = xfs_scrub_trans_alloc(sc);
>  	if (error)
>  		goto out;
>  	sc->ilock_flags |= XFS_ILOCK_EXCL;
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-18  2:40 ` [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers Darrick J. Wong
@ 2018-04-19 12:56   ` Brian Foster
  2018-04-19 17:33     ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-19 12:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 17, 2018 at 07:40:44PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In normal operation, the XFS convention is to take an inode's iolock
> and then allocate a transaction.  However, when scrubbing parent inodes
> this is inverted -- we allocated the transaction to do the scrub, and
> now we're trying to grab the parent's iolock.  This can lead to ABBA
> deadlocks: some thread grabbed the parent's iolock and is waiting for
> space for a transaction while our parent scrubber is sitting on a
> transaction trying to get the parent's iolock.
> 

Is that really an issue if the scrub transaction doesn't acquire a log
reservation (or does it in certain circumstances)..?

Brian

> Therefore, convert all iolock attempts to use trylock; if that fails,
> they can use the existing mechanisms to back off and try again.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
>  fs/xfs/scrub/common.h |    2 ++
>  fs/xfs/scrub/parent.c |   16 ++++++++++++++--
>  3 files changed, 38 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index f5e281a..93f9e7d 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
>  	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
>  	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
>  }
> +
> +/*
> + * Try to lock an inode in violation of the usual locking order rules.  For
> + * example, trying to get the IOLOCK while in transaction context, or just
> + * plain breaking AG-order or inode-order inode locking rules.  Either way,
> + * the only way to avoid an ABBA deadlock is to use trylock and back off if
> + * we can't.
> + */
> +int
> +xfs_scrub_ilock_inverted(
> +	struct xfs_inode	*ip,
> +	uint			lock_mode)
> +{
> +	int			i;
> +
> +	for (i = 0; i < 20; i++) {
> +		if (xfs_ilock_nowait(ip, lock_mode))
> +			return 0;
> +		delay(1);
> +	}
> +	return -EDEADLOCK;
> +}
> diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> index 8296873..191c369 100644
> --- a/fs/xfs/scrub/common.h
> +++ b/fs/xfs/scrub/common.h
> @@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
>  			       XFS_SCRUB_OFLAG_XCORRUPT);
>  }
>  
> +int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
> +
>  #endif	/* __XFS_SCRUB_COMMON_H__ */
> diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> index 1fb88c1..19cd54d 100644
> --- a/fs/xfs/scrub/parent.c
> +++ b/fs/xfs/scrub/parent.c
> @@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
>  	 */
>  	xfs_iunlock(sc->ip, sc->ilock_flags);
>  	sc->ilock_flags = 0;
> -	xfs_ilock(dp, XFS_IOLOCK_SHARED);
> +	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
> +	if (error)
> +		goto out_rele;
>  
>  	/* Go looking for our dentry. */
>  	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
> @@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
>  
>  	/* Drop the parent lock, relock this inode. */
>  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> +	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
> +	if (error)
> +		goto out_rele;
>  	sc->ilock_flags = XFS_IOLOCK_EXCL;
> -	xfs_ilock(sc->ip, sc->ilock_flags);
>  
>  	/*
>  	 * If we're an unlinked directory, the parent /won't/ have a link
> @@ -323,5 +327,13 @@ xfs_scrub_parent(
>  	if (try_again && tries == 20)
>  		xfs_scrub_set_incomplete(sc);
>  out:
> +	/*
> +	 * If we failed to lock the parent inode even after a retry, just mark
> +	 * this scrub incomplete and return.
> +	 */
> +	if (sc->try_harder && error == -EDEADLOCK) {
> +		error = 0;
> +		xfs_scrub_set_incomplete(sc);
> +	}
>  	return error;
>  }
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-19 12:55   ` Brian Foster
@ 2018-04-19 17:25     ` Darrick J. Wong
  2018-04-19 18:57       ` Brian Foster
  2018-04-20  0:05       ` Dave Chinner
  0 siblings, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-19 17:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 08:55:47AM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > We've never cached buffers when reading or writing superblocks, so we
> > need to change scrub to do likewise or risk screwing up the uncached sb
> > buffer usage everywhere else.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Hmm, so this goes back to this[1] thread, right?

Yep.

> IIRC, we wanted to end up with an lru-bypassing uncached buffer lookup
> mechanism to provide uncached behavior for resource-saving purposes
> but without introducing serialization issues between multiple users of
> uncached buffers.

Right.

> On a quick look back, growfs currently uses cached buffers for secondary
> superblocks and the associated patch was looking to change that to
> something like the above. Don't we have the same requirement here (since
> growfs currently still uses cached buffers)?

Correct, this patch is contingent on landing Dave's "tableise growfs"
series that converts growfs to use uncached buffers for writing out the
new secondary supers.  If I manage to land Dave's growfs thing before
repair then I'll keep this; if repair goes first I'll defer it to that
series.

I've never done a contingent-patch in the middle of a series, so I'm not
sure exactly how to present such a {monstr,bog}osity.

--D

> Brian
> 
> [1] https://marc.info/?l=linux-xfs&m=151746733326282&w=2
> 
> >  fs/xfs/scrub/agheader.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > index 018aabbd..aacbc3f 100644
> > --- a/fs/xfs/scrub/agheader.c
> > +++ b/fs/xfs/scrub/agheader.c
> > @@ -157,7 +157,7 @@ xfs_scrub_superblock(
> >  	if (agno == 0)
> >  		return 0;
> >  
> > -	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > +	error = xfs_buf_read_uncached(mp->m_ddev_targp,
> >  		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> >  		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
> >  	/*
> > @@ -421,6 +421,7 @@ xfs_scrub_superblock(
> >  		xfs_scrub_block_set_corrupt(sc, bp);
> >  
> >  	xfs_scrub_superblock_xref(sc, bp);
> > +	xfs_buf_relse(bp);
> >  
> >  	return error;
> >  }
> > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-19 12:56   ` Brian Foster
@ 2018-04-19 17:33     ` Darrick J. Wong
  2018-04-19 18:58       ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-19 17:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 08:56:07AM -0400, Brian Foster wrote:
> On Tue, Apr 17, 2018 at 07:40:44PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In normal operation, the XFS convention is to take an inode's iolock
> > and then allocate a transaction.  However, when scrubbing parent inodes
> > this is inverted -- we allocated the transaction to do the scrub, and
> > now we're trying to grab the parent's iolock.  This can lead to ABBA
> > deadlocks: some thread grabbed the parent's iolock and is waiting for
> > space for a transaction while our parent scrubber is sitting on a
> > transaction trying to get the parent's iolock.
> > 
> 
> Is that really an issue if the scrub transaction doesn't acquire a log
> reservation (or does it in certain circumstances)..?

Once we get to the repair series the transactions will have reservations
for logging metadata changes from the metadata rebuilds.

For a non-repair scrub invocation it's pretty simple:
1. Allocate zero-reservation (empty) transaction
2. Iterate metadata, check stuff
3. Cancel transaction, exit to userland

For a repair it's much more complicated:
1. Allocate a big permanent-reservation transaction
2. Iterate metadata, check stuff (same as #2 above)
3. If the metadata is ok, cancel and exit to userland
4. Create set of records that metadata is supposed to have
5. Zap metadata root
6. Insert record, roll transaction, repeat...
7. Commit transaction
8. Run non-repair scrub to see if we fixed it.

So this patch is more of a cleanup to prepare for the circumstances
changing later. :)
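
To make the hazard concrete once repair is holding a real log
reservation, the interleaving we're worried about is roughly (schematic
only):

	scrub/repair thread			some other thread
	-------------------			-----------------
	xfs_trans_alloc()   <- takes log space
						xfs_ilock(dp, XFS_IOLOCK_EXCL)
	xfs_ilock(dp, XFS_IOLOCK_SHARED)
	  ...blocks on the iolock
						xfs_trans_alloc()
						  ...blocks on the log space
						  the scrubber is holding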

--D

> Brian
> 
> > Therefore, convert all iolock attempts to use trylock; if that fails,
> > they can use the existing mechanisms to back off and try again.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
> >  fs/xfs/scrub/common.h |    2 ++
> >  fs/xfs/scrub/parent.c |   16 ++++++++++++++--
> >  3 files changed, 38 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > index f5e281a..93f9e7d 100644
> > --- a/fs/xfs/scrub/common.c
> > +++ b/fs/xfs/scrub/common.c
> > @@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
> >  	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> >  	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
> >  }
> > +
> > +/*
> > + * Try to lock an inode in violation of the usual locking order rules.  For
> > + * example, trying to get the IOLOCK while in transaction context, or just
> > + * plain breaking AG-order or inode-order inode locking rules.  Either way,
> > + * the only way to avoid an ABBA deadlock is to use trylock and back off if
> > + * we can't.
> > + */
> > +int
> > +xfs_scrub_ilock_inverted(
> > +	struct xfs_inode	*ip,
> > +	uint			lock_mode)
> > +{
> > +	int			i;
> > +
> > +	for (i = 0; i < 20; i++) {
> > +		if (xfs_ilock_nowait(ip, lock_mode))
> > +			return 0;
> > +		delay(1);
> > +	}
> > +	return -EDEADLOCK;
> > +}
> > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > index 8296873..191c369 100644
> > --- a/fs/xfs/scrub/common.h
> > +++ b/fs/xfs/scrub/common.h
> > @@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> >  			       XFS_SCRUB_OFLAG_XCORRUPT);
> >  }
> >  
> > +int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
> > +
> >  #endif	/* __XFS_SCRUB_COMMON_H__ */
> > diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> > index 1fb88c1..19cd54d 100644
> > --- a/fs/xfs/scrub/parent.c
> > +++ b/fs/xfs/scrub/parent.c
> > @@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
> >  	 */
> >  	xfs_iunlock(sc->ip, sc->ilock_flags);
> >  	sc->ilock_flags = 0;
> > -	xfs_ilock(dp, XFS_IOLOCK_SHARED);
> > +	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
> > +	if (error)
> > +		goto out_rele;
> >  
> >  	/* Go looking for our dentry. */
> >  	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
> > @@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
> >  
> >  	/* Drop the parent lock, relock this inode. */
> >  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> > +	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
> > +	if (error)
> > +		goto out_rele;
> >  	sc->ilock_flags = XFS_IOLOCK_EXCL;
> > -	xfs_ilock(sc->ip, sc->ilock_flags);
> >  
> >  	/*
> >  	 * If we're an unlinked directory, the parent /won't/ have a link
> > @@ -323,5 +327,13 @@ xfs_scrub_parent(
> >  	if (try_again && tries == 20)
> >  		xfs_scrub_set_incomplete(sc);
> >  out:
> > +	/*
> > +	 * If we failed to lock the parent inode even after a retry, just mark
> > +	 * this scrub incomplete and return.
> > +	 */
> > +	if (sc->try_harder && error == -EDEADLOCK) {
> > +		error = 0;
> > +		xfs_scrub_set_incomplete(sc);
> > +	}
> >  	return error;
> >  }
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-19 17:25     ` Darrick J. Wong
@ 2018-04-19 18:57       ` Brian Foster
  2018-04-20  0:07         ` Dave Chinner
  2018-04-20  0:05       ` Dave Chinner
  1 sibling, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-19 18:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 10:25:48AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 19, 2018 at 08:55:47AM -0400, Brian Foster wrote:
> > On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > We've never cached buffers when reading or writing superblocks, so we
> > > need to change scrub to do likewise or risk screwing up the uncached sb
> > > buffer usage everywhere else.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Hmm, so this goes back to this[1] thread, right?
> 
> Yep.
> 
> > IIRC, we wanted to end up with an lru-bypassing uncached buffer lookup
> > mechanism to provide uncached behavior for resource-saving purposes
> > but without introducing serialization issues between multiple users of
> > uncached buffers.
> 
> Right.
> 
> > On a quick look back, growfs currently uses cached buffers for secondary
> > superblocks and the associated patch was looking to change that to
> > something like the above. Don't we have the same requirement here (since
> > growfs currently still uses cached buffers)?
> 
> Correct, this patch is contingent on landing Dave's "tableise growfs"
> series that converts growfs to use uncached buffers for writing out the
> new secondary supers.  If I manage to land Dave's growfs thing before
> repair then I'll keep this; if repair goes first I'll defer it to that
> series.
> 

Ok, though I'm not sure I see the need for the higher level series
dependency. Doesn't one of these patches just need to fold in a helper[1]
to set the lru reference appropriately and start to use it?

Brian

[1] I.e., untested:

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d9b94bd5f689..19ffe2819bb2 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -972,3 +972,24 @@ xfs_fs_geometry(
 
 	return 0;
 }
+
+int
+xfs_read_secondary_sb(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_buf		*bp;
+	int			error;
+
+	ASSERT(agno != NULLAGNUMBER);
+	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
+	if (error)
+		return error;
+	xfs_buf_set_ref(bp, XFS_SSB_REF);
+	*bpp = bp;
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index d0b84da0cb1e..d23f33c02f64 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -127,6 +127,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
 #define	XFS_REFC_BTREE_REF	1
+#define	XFS_SSB_REF		0
 
 /*
  * Flags for xfs_trans_ichgtime().
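
The scrub side would then just collapse to something like this (equally
untested) in xfs_scrub_superblock(), replacing the xfs_trans_read_buf()
call and leaving the rest of the error handling there alone:

	error = xfs_read_secondary_sb(mp, sc->tp, agno, &bp);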

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-19 17:33     ` Darrick J. Wong
@ 2018-04-19 18:58       ` Brian Foster
  2018-04-19 19:06         ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2018-04-19 18:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 10:33:40AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 19, 2018 at 08:56:07AM -0400, Brian Foster wrote:
> > On Tue, Apr 17, 2018 at 07:40:44PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > In normal operation, the XFS convention is to take an inode's iolock
> > > and then allocate a transaction.  However, when scrubbing parent inodes
> > > this is inverted -- we allocated the transaction to do the scrub, and
> > > now we're trying to grab the parent's iolock.  This can lead to ABBA
> > > deadlocks: some thread grabbed the parent's iolock and is waiting for
> > > space for a transaction while our parent scrubber is sitting on a
> > > transaction trying to get the parent's iolock.
> > > 
> > 
> > Is that really an issue if the scrub transaction doesn't acquire a log
> > reservation (or does it in certain circumstances)..?
> 
> Once we get to the repair series the transactions will have reservations
> for logging metadata changes from the metadata rebuilds.
> 
> For a non-repair scrub invocation it's pretty simple:
> 1. Allocate zero-reservation (empty) transaction
> 2. Iterate metadata, check stuff
> 3. Cancel transaction, exit to userland
> 
> For a repair it's much more complicated:
> 1. Allocate a big permanent-reservation transaction
> 2. Iterate metadata, check stuff (same as #2 above)
> 3. If the metadata is ok, cancel and exit to userland
> 4. Create set of records that metadata is supposed to have
> 5. Zap metadata root
> 6. Insert record, roll transaction, repeat...
> 7. Commit transaction
> 8. Run non-repair scrub to see if we fixed it.
> 
> So this patch is more of a cleanup to prepare for the circumstances
> changing later. :)
> 

Ok, so all that really matters wrt this patch is that the repair mode
will eventually reserve log space for the transaction. Care to add some
context to the commit log? Otherwise it's kind of hard to surmise the
purpose. ;) Perhaps better yet would be to just bundle this with the
repair code that depends on it..?

Brian

> --D
> 
> > Brian
> > 
> > > Therefore, convert all iolock attempts to use trylock; if that fails,
> > > they can use the existing mechanisms to back off and try again.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
> > >  fs/xfs/scrub/common.h |    2 ++
> > >  fs/xfs/scrub/parent.c |   16 ++++++++++++++--
> > >  3 files changed, 38 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > index f5e281a..93f9e7d 100644
> > > --- a/fs/xfs/scrub/common.c
> > > +++ b/fs/xfs/scrub/common.c
> > > @@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
> > >  	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > >  	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
> > >  }
> > > +
> > > +/*
> > > + * Try to lock an inode in violation of the usual locking order rules.  For
> > > + * example, trying to get the IOLOCK while in transaction context, or just
> > > + * plain breaking AG-order or inode-order inode locking rules.  Either way,
> > > + * the only way to avoid an ABBA deadlock is to use trylock and back off if
> > > + * we can't.
> > > + */
> > > +int
> > > +xfs_scrub_ilock_inverted(
> > > +	struct xfs_inode	*ip,
> > > +	uint			lock_mode)
> > > +{
> > > +	int			i;
> > > +
> > > +	for (i = 0; i < 20; i++) {
> > > +		if (xfs_ilock_nowait(ip, lock_mode))
> > > +			return 0;
> > > +		delay(1);
> > > +	}
> > > +	return -EDEADLOCK;
> > > +}
> > > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > > index 8296873..191c369 100644
> > > --- a/fs/xfs/scrub/common.h
> > > +++ b/fs/xfs/scrub/common.h
> > > @@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> > >  			       XFS_SCRUB_OFLAG_XCORRUPT);
> > >  }
> > >  
> > > +int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
> > > +
> > >  #endif	/* __XFS_SCRUB_COMMON_H__ */
> > > diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> > > index 1fb88c1..19cd54d 100644
> > > --- a/fs/xfs/scrub/parent.c
> > > +++ b/fs/xfs/scrub/parent.c
> > > @@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
> > >  	 */
> > >  	xfs_iunlock(sc->ip, sc->ilock_flags);
> > >  	sc->ilock_flags = 0;
> > > -	xfs_ilock(dp, XFS_IOLOCK_SHARED);
> > > +	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
> > > +	if (error)
> > > +		goto out_rele;
> > >  
> > >  	/* Go looking for our dentry. */
> > >  	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
> > > @@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
> > >  
> > >  	/* Drop the parent lock, relock this inode. */
> > >  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> > > +	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
> > > +	if (error)
> > > +		goto out_rele;
> > >  	sc->ilock_flags = XFS_IOLOCK_EXCL;
> > > -	xfs_ilock(sc->ip, sc->ilock_flags);
> > >  
> > >  	/*
> > >  	 * If we're an unlinked directory, the parent /won't/ have a link
> > > @@ -323,5 +327,13 @@ xfs_scrub_parent(
> > >  	if (try_again && tries == 20)
> > >  		xfs_scrub_set_incomplete(sc);
> > >  out:
> > > +	/*
> > > +	 * If we failed to lock the parent inode even after a retry, just mark
> > > +	 * this scrub incomplete and return.
> > > +	 */
> > > +	if (sc->try_harder && error == -EDEADLOCK) {
> > > +		error = 0;
> > > +		xfs_scrub_set_incomplete(sc);
> > > +	}
> > >  	return error;
> > >  }
> > > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-19 18:58       ` Brian Foster
@ 2018-04-19 19:06         ` Darrick J. Wong
  2018-04-21  0:31           ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-19 19:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 02:58:12PM -0400, Brian Foster wrote:
> On Thu, Apr 19, 2018 at 10:33:40AM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 19, 2018 at 08:56:07AM -0400, Brian Foster wrote:
> > > On Tue, Apr 17, 2018 at 07:40:44PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In normal operation, the XFS convention is to take an inode's iolock
> > > > and then allocate a transaction.  However, when scrubbing parent inodes
> > > > this is inverted -- we allocated the transaction to do the scrub, and
> > > > now we're trying to grab the parent's iolock.  This can lead to ABBA
> > > > deadlocks: some thread grabbed the parent's iolock and is waiting for
> > > > space for a transaction while our parent scrubber is sitting on a
> > > > transaction trying to get the parent's iolock.
> > > > 
> > > 
> > > Is that really an issue if the scrub transaction doesn't acquire a log
> > > reservation (or does it in certain circumstances)..?
> > 
> > Once we get to the repair series the transactions will have reservations
> > for logging metadata changes from the metadata rebuilds.
> > 
> > For a non-repair scrub invocation it's pretty simple:
> > 1. Allocate zero-reservation (empty) transaction
> > 2. Iterate metadata, check stuff
> > 3. Cancel transaction, exit to userland
> > 
> > For a repair it's much more complicated:
> > 1. Allocate a big permanent-reservation transaction
> > 2. Iterate metadata, check stuff (same as #2 above)
> > 3. If the metadata is ok, cancel and exit to userland
> > 4. Create set of records that metadata is supposed to have
> > 5. Zap metadata root
> > 6. Insert record, roll transaction, repeat...
> > 7. Commit transaction
> > 8. Run non-repair scrub to see if we fixed it.
> > 
> > So this patch is more of a cleanup to prepare for the circumstances
> > changing later. :)
> > 
> 
> > Ok, so all that really matters wrt this patch is that the repair mode
> will eventually reserve log space for the transaction. Care to add some
> context to the commit log? Otherwise it's kind of hard to surmise the
> purpose. ;) Perhaps better yet would be to just bundle this with the
> repair code that depends on it..?

Ok, I'll update the commit message to note that we're preparing for
repair using transactions with nonzero reservations.

Though TBH this patch {c,sh}ould have just been the first one of the
series that I posted immediately after it.

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > Therefore, convert all iolock attempts to use trylock; if that fails,
> > > > they can use the existing mechanisms to back off and try again.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
> > > >  fs/xfs/scrub/common.h |    2 ++
> > > >  fs/xfs/scrub/parent.c |   16 ++++++++++++++--
> > > >  3 files changed, 38 insertions(+), 2 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > > index f5e281a..93f9e7d 100644
> > > > --- a/fs/xfs/scrub/common.c
> > > > +++ b/fs/xfs/scrub/common.c
> > > > @@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
> > > >  	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > > >  	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
> > > >  }
> > > > +
> > > > +/*
> > > > + * Try to lock an inode in violation of the usual locking order rules.  For
> > > > + * example, trying to get the IOLOCK while in transaction context, or just
> > > > + * plain breaking AG-order or inode-order inode locking rules.  Either way,
> > > > + * the only way to avoid an ABBA deadlock is to use trylock and back off if
> > > > + * we can't.
> > > > + */
> > > > +int
> > > > +xfs_scrub_ilock_inverted(
> > > > +	struct xfs_inode	*ip,
> > > > +	uint			lock_mode)
> > > > +{
> > > > +	int			i;
> > > > +
> > > > +	for (i = 0; i < 20; i++) {
> > > > +		if (xfs_ilock_nowait(ip, lock_mode))
> > > > +			return 0;
> > > > +		delay(1);
> > > > +	}
> > > > +	return -EDEADLOCK;
> > > > +}
> > > > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > > > index 8296873..191c369 100644
> > > > --- a/fs/xfs/scrub/common.h
> > > > +++ b/fs/xfs/scrub/common.h
> > > > @@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> > > >  			       XFS_SCRUB_OFLAG_XCORRUPT);
> > > >  }
> > > >  
> > > > +int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
> > > > +
> > > >  #endif	/* __XFS_SCRUB_COMMON_H__ */
> > > > diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> > > > index 1fb88c1..19cd54d 100644
> > > > --- a/fs/xfs/scrub/parent.c
> > > > +++ b/fs/xfs/scrub/parent.c
> > > > @@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
> > > >  	 */
> > > >  	xfs_iunlock(sc->ip, sc->ilock_flags);
> > > >  	sc->ilock_flags = 0;
> > > > -	xfs_ilock(dp, XFS_IOLOCK_SHARED);
> > > > +	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
> > > > +	if (error)
> > > > +		goto out_rele;
> > > >  
> > > >  	/* Go looking for our dentry. */
> > > >  	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
> > > > @@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
> > > >  
> > > >  	/* Drop the parent lock, relock this inode. */
> > > >  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> > > > +	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
> > > > +	if (error)
> > > > +		goto out_rele;
> > > >  	sc->ilock_flags = XFS_IOLOCK_EXCL;
> > > > -	xfs_ilock(sc->ip, sc->ilock_flags);
> > > >  
> > > >  	/*
> > > >  	 * If we're an unlinked directory, the parent /won't/ have a link
> > > > @@ -323,5 +327,13 @@ xfs_scrub_parent(
> > > >  	if (try_again && tries == 20)
> > > >  		xfs_scrub_set_incomplete(sc);
> > > >  out:
> > > > +	/*
> > > > +	 * If we failed to lock the parent inode even after a retry, just mark
> > > > +	 * this scrub incomplete and return.
> > > > +	 */
> > > > +	if (sc->try_harder && error == -EDEADLOCK) {
> > > > +		error = 0;
> > > > +		xfs_scrub_set_incomplete(sc);
> > > > +	}
> > > >  	return error;
> > > >  }
> > > > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-19 17:25     ` Darrick J. Wong
  2018-04-19 18:57       ` Brian Foster
@ 2018-04-20  0:05       ` Dave Chinner
  1 sibling, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2018-04-20  0:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Thu, Apr 19, 2018 at 10:25:48AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 19, 2018 at 08:55:47AM -0400, Brian Foster wrote:
> > On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > We've never cached buffers when reading or writing superblocks, so we
> > > need to change scrub to do likewise or risk screwing up the uncached sb
> > > buffer usage everywhere else.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Hmm, so this goes back to this[1] thread, right?
> 
> Yep.
> 
> > IIRC, we wanted to end up with an lru-bypassing uncached buffer lookup
> > mechanism to provide uncached behavior for resource-saving purposes
> > but without introducing serialization issues between multiple users of
> > uncached buffers.
> 
> Right.
> 
> > On a quick look back, growfs currently uses cached buffers for secondary
> > superblocks and the associated patch was looking to change that to
> > something like the above. Don't we have the same requirement here (since
> > growfs currently still uses cached buffers)?
> 
> Correct, this patch is contingent on landing Dave's "tableise growfs"
> series that converts growfs to use uncached buffers for writing out the
> new secondary supers.  If I manage to land Dave's growfs thing before
> repair then I'll keep this; if repair goes first I'll defer it to that
> series.

I plan on changing that patch set to use single-use cached buffers
rather than uncached buffers. So I'd just drop this patch, or
replace it with a patch that adds this:

#define XFS_SINGLE_USE_REF	0

and this after the xfs_trans_read_buf() call:

	xfs_buf_set_ref(bp, XFS_SINGLE_USE_REF);

Which will result in it being reclaimed when it is released rather
than being put on the LRU.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-19 18:57       ` Brian Foster
@ 2018-04-20  0:07         ` Dave Chinner
  2018-04-21  0:29           ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2018-04-20  0:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Apr 19, 2018 at 02:57:23PM -0400, Brian Foster wrote:
> On Thu, Apr 19, 2018 at 10:25:48AM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 19, 2018 at 08:55:47AM -0400, Brian Foster wrote:
> > > On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > We've never cached buffers when reading or writing superblocks, so we
> > > > need to change scrub to do likewise or risk screwing up the uncached sb
> > > > buffer usage everywhere else.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Hmm, so this goes back to this[1] thread, right?
> > 
> > Yep.
> > 
> > > IIRC, we wanted to end up with an lru-bypassing uncached buffer lookup
> > > mechanism to provide uncached behavior for resource-saving purposes
> > > but without introducing serialization issues between multiple users of
> > > uncached buffers.
> > 
> > Right.
> > 
> > > On a quick look back, growfs currently uses cached buffers for secondary
> > > superblocks and the associated patch was looking to change that to
> > > something like the above. Don't we have the same requirement here (since
> > > growfs currently still uses cached buffers)?
> > 
> > Correct, this patch is contingent on landing Dave's "tableise growfs"
> > series that converts growfs to use uncached buffers for writing out the
> > new secondary supers.  If I manage to land Dave's growfs thing before
> > repair then I'll keep this; if repair goes first I'll defer it to that
> > series.
> > 
> 
> Ok, though I'm not sure I see the need for the higher level series
> dependency. Doesn't one of these patches just need to fold in a helper[1]
> to set the lru reference appropriately and start to use it?

Yes, that would work. :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] xfs: superblock scrub should use uncached buffers
  2018-04-20  0:07         ` Dave Chinner
@ 2018-04-21  0:29           ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-21  0:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Apr 20, 2018 at 10:07:16AM +1000, Dave Chinner wrote:
> On Thu, Apr 19, 2018 at 02:57:23PM -0400, Brian Foster wrote:
> > On Thu, Apr 19, 2018 at 10:25:48AM -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 19, 2018 at 08:55:47AM -0400, Brian Foster wrote:
> > > > On Tue, Apr 17, 2018 at 07:40:20PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > We've never cached buffers when reading or writing superblocks, so we
> > > > > need to change scrub to do likewise or risk screwing up the uncached sb
> > > > > buffer usage everywhere else.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > Hmm, so this goes back to this[1] thread, right?
> > > 
> > > Yep.
> > > 
> > > > IIRC, we wanted to end up with an lru-bypassing uncached buffer lookup
> > > > mechanism to provide uncached behavior for resource-saving purposes
> > > > but without introducing serialization issues between multiple users of
> > > > uncached buffers.
> > > 
> > > Right.
> > > 
> > > > On a quick look back, growfs currently uses cached buffers for secondary
> > > > superblocks and the associated patch was looking to change that to
> > > > something like the above. Don't we have the same requirement here (since
> > > > growfs currently still uses cached buffers)?
> > > 
> > > Correct, this patch is contingent on landing Dave's "tableise growfs"
> > > series that converts growfs to use uncached buffers for writing out the
> > > new secondary supers.  If I manage to land Dave's growfs thing before
> > > repair then I'll keep this; if repair goes first I'll defer it to that
> > > series.
> > > 
> > 
> > Ok, though I'm not sure I see the need for the higher level series
> > dependency. Doesn't one of these patches just need to fold in a helper[1]
> > to set the lru reference appropriately and start to use it?
> 
> Yes, that would work. :P

Ok, I'll fold that into my patch and I'll fix things up depending on who
wins the race.

I changed the ASSERT to trip on agno == 0, btw.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers
  2018-04-19 19:06         ` Darrick J. Wong
@ 2018-04-21  0:31           ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-21  0:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 12:06:32PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 19, 2018 at 02:58:12PM -0400, Brian Foster wrote:
> > On Thu, Apr 19, 2018 at 10:33:40AM -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 19, 2018 at 08:56:07AM -0400, Brian Foster wrote:
> > > > On Tue, Apr 17, 2018 at 07:40:44PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > In normal operation, the XFS convention is to take an inode's iolock
> > > > > and then allocate a transaction.  However, when scrubbing parent inodes
> > > > > this is inverted -- we allocated the transaction to do the scrub, and
> > > > > now we're trying to grab the parent's iolock.  This can lead to ABBA
> > > > > deadlocks: some thread grabbed the parent's iolock and is waiting for
> > > > > space for a transaction while our parent scrubber is sitting on a
> > > > > transaction trying to get the parent's iolock.
> > > > > 
> > > > 
> > > > Is that really an issue if the scrub transaction doesn't acquire a log
> > > > reservation (or does it in certain circumstances)..?
> > > 
> > > Once we get to the repair series the transactions will have reservations
> > > for logging metadata changes from the metadata rebuilds.
> > > 
> > > For a non-repair scrub invocation it's pretty simple:
> > > 1. Allocate zero-reservation (empty) transaction
> > > 2. Iterate metadata, check stuff
> > > 3. Cancel transaction, exit to userland
> > > 
> > > For a repair it's much more complicated:
> > > 1. Allocate a big permanent-reservation transaction
> > > 2. Iterate metadata, check stuff (same as #2 above)
> > > 3. If the metadata is ok, cancel and exit to userland
> > > 4. Create set of records that metadata is supposed to have
> > > 5. Zap metadata root
> > > 6. Insert record, roll transaction, repeat...
> > > 7. Commit transaction
> > > 8. Run non-repair scrub to see if we fixed it.
> > > 
> > > So this patch is more of a cleanup to prepare for the circumstances
> > > changing later. :)
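> > > 
> > > To make that concrete, the repair-mode flow (steps 1-8) will look
> > > roughly like the sketch below.  The xfs_repair_* helpers and the record
> > > iterator are made-up names for illustration, and error handling is
> > > omitted:
> > > 
> > > 	error = xfs_trans_alloc(mp, resv, blocks, 0, 0, &sc->tp); /* 1 */
> > > 	error = xfs_scrub_whatever(sc);            /* 2: check metadata */
> > > 	if (!xfs_scrub_found_corruption(sc->sm)) {
> > > 		xfs_trans_cancel(sc->tp);          /* 3: nothing to fix */
> > > 		return 0;
> > > 	}
> > > 	error = xfs_repair_whatever_gather(sc, &records);   /* 4 */
> > > 	error = xfs_repair_whatever_zap_root(sc);           /* 5 */
> > > 	for_each_repair_record(records, rec) {              /* 6 */
> > > 		error = xfs_repair_whatever_insert(sc, rec);
> > > 		error = xfs_trans_roll(&sc->tp);
> > > 	}
> > > 	error = xfs_trans_commit(sc->tp);                   /* 7 */
> > > 	error = xfs_scrub_whatever(sc);                     /* 8: recheck */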
> > > 
> > 
> > Ok, so all that really matters wrt this patch is that the repair mode
> > will eventually reserve log space for the transaction. Care to add some
> > context to the commit log? Otherwise it's kind of hard to surmise the
> > purpose. ;) Perhaps better yet would be to just bundle this with the
> > repair code that depends on it..?
> 
> Ok, I'll update the commit message to note that we're preparing for
> repair using transactions with nonzero reservations.
> 
> Though TBH this patch {c,sh}ould have just been the first one of the
> series that I posted immediately after it.

Just to keep the list abreast of the IRC discussion: no, it should stay
where it is -- iolock-then-get-stuck-waiting-for-reservation can happen
anywhere.
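
Concretely, once repair transactions carry a real log reservation, the
inversion we're guarding against looks like this (an illustrative
interleaving, not taken from an actual trace):

  normal operation                        parent scrubber (repair)
  ----------------                        ------------------------
  xfs_ilock(dp, XFS_IOLOCK_EXCL)
                                          xfs_trans_alloc()
                                            <- takes log space
  xfs_trans_alloc()
    <- blocks waiting for log space
                                          xfs_ilock(dp, XFS_IOLOCK_SHARED)
                                            <- blocks on the iolock: ABBA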

--D

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > Therefore, convert all iolock attempts to use trylock; if that fails,
> > > > > they can use the existing mechanisms to back off and try again.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/scrub/common.c |   22 ++++++++++++++++++++++
> > > > >  fs/xfs/scrub/common.h |    2 ++
> > > > >  fs/xfs/scrub/parent.c |   16 ++++++++++++++--
> > > > >  3 files changed, 38 insertions(+), 2 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > > > index f5e281a..93f9e7d 100644
> > > > > --- a/fs/xfs/scrub/common.c
> > > > > +++ b/fs/xfs/scrub/common.c
> > > > > @@ -787,3 +787,25 @@ xfs_scrub_buffer_recheck(
> > > > >  	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > > > >  	trace_xfs_scrub_block_error(sc, bp->b_bn, fa);
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * Try to lock an inode in violation of the usual locking order rules.  For
> > > > > + * example, trying to get the IOLOCK while in transaction context, or just
> > > > > + * plain breaking AG-order or inode-order inode locking rules.  Either way,
> > > > > + * the only way to avoid an ABBA deadlock is to use trylock and back off if
> > > > > + * we can't.
> > > > > + */
> > > > > +int
> > > > > +xfs_scrub_ilock_inverted(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	uint			lock_mode)
> > > > > +{
> > > > > +	int			i;
> > > > > +
> > > > > +	for (i = 0; i < 20; i++) {
> > > > > +		if (xfs_ilock_nowait(ip, lock_mode))
> > > > > +			return 0;
> > > > > +		delay(1);
> > > > > +	}
> > > > > +	return -EDEADLOCK;
> > > > > +}
> > > > > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > > > > index 8296873..191c369 100644
> > > > > --- a/fs/xfs/scrub/common.h
> > > > > +++ b/fs/xfs/scrub/common.h
> > > > > @@ -151,4 +151,6 @@ static inline bool xfs_scrub_found_corruption(struct xfs_scrub_metadata *sm)
> > > > >  			       XFS_SCRUB_OFLAG_XCORRUPT);
> > > > >  }
> > > > >  
> > > > > +int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
> > > > > +
> > > > >  #endif	/* __XFS_SCRUB_COMMON_H__ */
> > > > > diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> > > > > index 1fb88c1..19cd54d 100644
> > > > > --- a/fs/xfs/scrub/parent.c
> > > > > +++ b/fs/xfs/scrub/parent.c
> > > > > @@ -211,7 +211,9 @@ xfs_scrub_parent_validate(
> > > > >  	 */
> > > > >  	xfs_iunlock(sc->ip, sc->ilock_flags);
> > > > >  	sc->ilock_flags = 0;
> > > > > -	xfs_ilock(dp, XFS_IOLOCK_SHARED);
> > > > > +	error = xfs_scrub_ilock_inverted(dp, XFS_IOLOCK_SHARED);
> > > > > +	if (error)
> > > > > +		goto out_rele;
> > > > >  
> > > > >  	/* Go looking for our dentry. */
> > > > >  	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nlink);
> > > > > @@ -220,8 +222,10 @@ xfs_scrub_parent_validate(
> > > > >  
> > > > >  	/* Drop the parent lock, relock this inode. */
> > > > >  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> > > > > +	error = xfs_scrub_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
> > > > > +	if (error)
> > > > > +		goto out_rele;
> > > > >  	sc->ilock_flags = XFS_IOLOCK_EXCL;
> > > > > -	xfs_ilock(sc->ip, sc->ilock_flags);
> > > > >  
> > > > >  	/*
> > > > >  	 * If we're an unlinked directory, the parent /won't/ have a link
> > > > > @@ -323,5 +327,13 @@ xfs_scrub_parent(
> > > > >  	if (try_again && tries == 20)
> > > > >  		xfs_scrub_set_incomplete(sc);
> > > > >  out:
> > > > > +	/*
> > > > > +	 * If we failed to lock the parent inode even after a retry, just mark
> > > > > +	 * this scrub incomplete and return.
> > > > > +	 */
> > > > > +	if (sc->try_harder && error == -EDEADLOCK) {
> > > > > +		error = 0;
> > > > > +		xfs_scrub_set_incomplete(sc);
> > > > > +	}
> > > > >  	return error;
> > > > >  }
> > > > > 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag
  2018-04-19  8:32   ` Christoph Hellwig
@ 2018-04-21 18:42     ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2018-04-21 18:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Apr 19, 2018 at 01:32:24AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 17, 2018 at 07:39:39PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a new QMOPT flag to signal that we've already locked the quota
> > inode.  This will be used by the quota scrub code refactoring to avoid
> > dropping the quota ip lock during scrub.  The flag is incompatible with
> > the DQALLOC flag because dquot allocation creates a transaction, and we
> > shouldn't be doing that with the ILOCK held.
> 
> Locked flags are always a sign of a code smell.  And this already is
> pretty smelly code before that flag is added.  I think someone (i.e.
> either you or me :)) needs to do a major refactoring here first.

Bleh... ok, I marched through that swamp and refactored all that goopy
dqget crud into a ton of smaller helper functions that only do one thing,
killed off QMOPT_NEXT, and renamed DQALLOC to make it obvious it's for
dqget only.  So now we have...

xfs_qm_dqget (get dquot based on id/type, can take XFS_DQGET_DQALLOC to
allocate if not present)

xfs_qm_dqget_inode (get dquot based on inode/type, confines all the
locking and looping mess to one function, can take _DQALLOC)

xfs_qm_dqget_next (get this or the next higher dquot, no DQALLOC allowed
here, confines all the NEXT looping mess to this function)

xfs_qm_dqread (read dquot from disk and return incore dquot, do not
check in radix tree, no longer takes DQALLOC)

xfs_qm_dqiterate (iterate all the dquots for the given quota type)
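
In case it helps with review, the prototypes I'm imagining look roughly
like this -- treat the argument lists as sketches, since they may still
shift before the revised series goes out:

int xfs_qm_dqget(struct xfs_mount *mp, xfs_dqid_t id, uint type,
		 uint flags, struct xfs_dquot **dqpp); /* XFS_DQGET_DQALLOC */
int xfs_qm_dqget_inode(struct xfs_inode *ip, uint type, uint flags,
		       struct xfs_dquot **dqpp);
int xfs_qm_dqget_next(struct xfs_mount *mp, xfs_dqid_t id, uint type,
		      struct xfs_dquot **dqpp);
int xfs_qm_dqread(struct xfs_mount *mp, xfs_dqid_t id, uint type,
		  struct xfs_dquot **dqpp);
typedef int (*xfs_qm_dqiterate_fn)(struct xfs_dquot *dq, uint dqtype,
				   void *priv);
int xfs_qm_dqiterate(struct xfs_mount *mp, uint dqtype,
		     xfs_qm_dqiterate_fn iter_fn, void *priv);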

I also cleaned out a bunch of unnecessary parameters and other junk,
and refactored mount time quotacheck so that it doesn't need to
ILOCK_EXCL every inode in the system.  In theory that should be fine
because we only needed ILOCK_EXCL to make sure nobody can chown/chproj
the inode on us, which shouldn't be happening during mount.

As a bonus I realized that scrub and repair don't actually need to
maintain the ilock once they're done fixing all of the things that can
cause dqget to fail, so the ILOCKED flag goes away entirely.

Will have a revised megaseries out on the list ideally some time before
LSF starts... though it did add another 12 patches to the review queue. :P

--D

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-04-21 18:43 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-18  2:39 [PATCH 00/11] xfs-4.18: online scrub fixes Darrick J. Wong
2018-04-18  2:39 ` [PATCH 01/11] xfs: skip scrub xref if corruption already noted Darrick J. Wong
2018-04-18 15:03   ` Brian Foster
2018-04-18 16:02     ` Darrick J. Wong
2018-04-18  2:39 ` [PATCH 02/11] xfs: create the XFS_QMOPT_QUOTIP_LOCKED flag Darrick J. Wong
2018-04-18 15:33   ` Brian Foster
2018-04-18 16:55     ` Darrick J. Wong
2018-04-18 17:09       ` Brian Foster
2018-04-19  8:32   ` Christoph Hellwig
2018-04-21 18:42     ` Darrick J. Wong
2018-04-18  2:39 ` [PATCH 03/11] xfs: report failing address when dquot verifier fails Darrick J. Wong
2018-04-18 18:33   ` Brian Foster
2018-04-18  2:40 ` [PATCH 04/11] xfs: refactor dquot iteration Darrick J. Wong
2018-04-18 18:34   ` Brian Foster
2018-04-18 22:20     ` Darrick J. Wong
2018-04-18  2:40 ` [PATCH 05/11] xfs: avoid ilock games in the quota scrubber Darrick J. Wong
2018-04-18 18:34   ` Brian Foster
2018-04-18  2:40 ` [PATCH 06/11] xfs: quota scrub should use bmapbtd scrubber Darrick J. Wong
2018-04-18 18:34   ` Brian Foster
2018-04-18 20:00     ` Darrick J. Wong
2018-04-19 11:20       ` Brian Foster
2018-04-18  2:40 ` [PATCH 07/11] xfs: superblock scrub should use uncached buffers Darrick J. Wong
2018-04-19 12:55   ` Brian Foster
2018-04-19 17:25     ` Darrick J. Wong
2018-04-19 18:57       ` Brian Foster
2018-04-20  0:07         ` Dave Chinner
2018-04-21  0:29           ` Darrick J. Wong
2018-04-20  0:05       ` Dave Chinner
2018-04-18  2:40 ` [PATCH 08/11] xfs: clean up scrub usage of KM_NOFS Darrick J. Wong
2018-04-19 12:55   ` Brian Foster
2018-04-18  2:40 ` [PATCH 09/11] xfs: btree scrub should check minrecs Darrick J. Wong
2018-04-19 12:55   ` Brian Foster
2018-04-18  2:40 ` [PATCH 10/11] xfs: refactor scrub transaction allocation function Darrick J. Wong
2018-04-19 12:56   ` Brian Foster
2018-04-18  2:40 ` [PATCH 11/11] xfs: avoid ABBA deadlock when scrubbing parent pointers Darrick J. Wong
2018-04-19 12:56   ` Brian Foster
2018-04-19 17:33     ` Darrick J. Wong
2018-04-19 18:58       ` Brian Foster
2018-04-19 19:06         ` Darrick J. Wong
2018-04-21  0:31           ` Darrick J. Wong
