All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: rework quotaoff to avoid log deadlock
@ 2020-10-01 15:03 Brian Foster
  2020-10-01 15:03 ` [PATCH 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Brian Foster @ 2020-10-01 15:03 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's a proper v1 of the quotaoff logging rework. Changes from the RFC
are fairly straightforward and listed below. The one outstanding bit of
feedback from the RFC is the lock / memory ordering question around the
quota flags update. My understanding is that the percpu rwsem provides
the required consistency through a combination of explicit memory
barriers and RCU, so I opted to drop the unnecessary wrapper functions
to make the locking more clear (along with comment updates) and also
eliminate freeze/unfreeze naming confusion.

Thoughts, reviews, flames appreciated.

Brian

v1:
- Replace patch 2 with a proper internal quiesce mechanism.
- Remove unnecessary freeze/unfreeze helpers.
- Relocate quotaoff end transaction to commit inside quiesce window. 
- Clean up comments and document new algorithm.
rfc: https://lore.kernel.org/linux-xfs/20200929141228.108688-1-bfoster@redhat.com/

Brian Foster (3):
  xfs: skip dquot reservations if quota is inactive
  xfs: transaction subsystem quiesce mechanism
  xfs: rework quotaoff logging to avoid log deadlock on active fs

 fs/xfs/xfs_aops.c        |   2 +
 fs/xfs/xfs_mount.h       |   3 +
 fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
 fs/xfs/xfs_super.c       |   8 +++
 fs/xfs/xfs_trans.c       |   4 +-
 fs/xfs/xfs_trans.h       |  20 ++++++
 fs/xfs/xfs_trans_dquot.c |  22 +++----
 7 files changed, 111 insertions(+), 81 deletions(-)

-- 
2.25.4


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] xfs: skip dquot reservations if quota is inactive
  2020-10-01 15:03 [PATCH 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
@ 2020-10-01 15:03 ` Brian Foster
  2020-10-07 22:09   ` Darrick J. Wong
  2020-10-01 15:03 ` [PATCH 2/3] xfs: transaction subsystem quiesce mechanism Brian Foster
  2020-10-01 15:03 ` [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
  2 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2020-10-01 15:03 UTC (permalink / raw)
  To: linux-xfs

The dquot reservation helper currently performs the associated
reservation for any provided dquots. The dquots could have been
acquired from inode references or explicit dquot allocation
requests. Some reservation callers may have already checked that the
associated quota subsystem is active (xfs_qm_dqget() returns an
error otherwise), while others might not have checked at all
(xfs_trans_reserve_quota_nblks() passes the inode references).
Further, subsequent dquot modifications do actually check that the
associated quota is active before making transactional changes
(xfs_trans_mod_dquot_byino()).

Given all of that, the behavior to unconditionally perform
reservation on any provided dquots is somewhat ad hoc. While it is
currently harmless, it is not without side effect. If the quota is
inactive by the time a transaction attempts a quota reservation, the
dquot will be attached to the transaction and subsequently logged,
even though no dquot modifications are ultimately made.

This is a problem for upcoming quotaoff changes that intend to
implement a strict transactional barrier for logging dquots during a
quotaoff operation. If a dquot is logged after the subsystem
deactivated and the barrier released, a subsequent log recovery can
incorrectly replay dquot changes into the filesystem.

Therefore, update the dquot reservation path to also check that a
particular quota mode is active before associating a dquot with a
transaction. This should have no noticeable impact on the current
code that already accommodates checking active quota state at points
before and after quota reservations are made.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_dquot.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 133fc6fc3edd..547ba824542e 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -39,14 +39,12 @@ xfs_trans_dqjoin(
 }
 
 /*
- * This is called to mark the dquot as needing
- * to be logged when the transaction is committed.  The dquot must
- * already be associated with the given transaction.
- * Note that it marks the entire transaction as dirty. In the ordinary
- * case, this gets called via xfs_trans_commit, after the transaction
- * is already dirty. However, there's nothing stop this from getting
- * called directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY
- * flag.
+ * This is called to mark the dquot as needing to be logged when the transaction
+ * is committed. The dquot must already be associated with the given
+ * transaction. Note that it marks the entire transaction as dirty. In the
+ * ordinary case, this gets called via xfs_trans_commit, after the transaction
+ * is already dirty. However, there's nothing stop this from getting called
+ * directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY flag.
  */
 void
 xfs_trans_log_dquot(
@@ -770,19 +768,19 @@ xfs_trans_reserve_quota_bydquots(
 
 	ASSERT(flags & XFS_QMOPT_RESBLK_MASK);
 
-	if (udqp) {
+	if (XFS_IS_UQUOTA_ON(mp) && udqp) {
 		error = xfs_trans_dqresv(tp, mp, udqp, nblks, ninos, flags);
 		if (error)
 			return error;
 	}
 
-	if (gdqp) {
+	if (XFS_IS_GQUOTA_ON(mp) && gdqp) {
 		error = xfs_trans_dqresv(tp, mp, gdqp, nblks, ninos, flags);
 		if (error)
 			goto unwind_usr;
 	}
 
-	if (pdqp) {
+	if (XFS_IS_PQUOTA_ON(mp) && pdqp) {
 		error = xfs_trans_dqresv(tp, mp, pdqp, nblks, ninos, flags);
 		if (error)
 			goto unwind_grp;
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] xfs: transaction subsystem quiesce mechanism
  2020-10-01 15:03 [PATCH 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
  2020-10-01 15:03 ` [PATCH 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
@ 2020-10-01 15:03 ` Brian Foster
  2020-10-07 22:13   ` Darrick J. Wong
  2020-10-01 15:03 ` [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
  2 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2020-10-01 15:03 UTC (permalink / raw)
  To: linux-xfs

The updated quotaoff logging algorithm depends on a runtime quiesce
of the transaction subsystem to guarantee all transactions after a
certain point detect quota subsystem changes. Implement this
mechanism using an internal lock, similar to the external filesystem
freeze mechanism. This is also somewhat analogous to the old percpu
transaction counter mechanism, but we don't actually need a counter.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c  |  2 ++
 fs/xfs/xfs_mount.h |  3 +++
 fs/xfs/xfs_super.c |  8 ++++++++
 fs/xfs/xfs_trans.c |  4 ++--
 fs/xfs/xfs_trans.h | 20 ++++++++++++++++++++
 5 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..214310c94de5 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -58,6 +58,7 @@ xfs_setfilesize_trans_alloc(
 	 * we released it.
 	 */
 	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
+	percpu_rwsem_release(&mp->m_trans_rwsem, true, _THIS_IP_);
 	/*
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
@@ -127,6 +128,7 @@ xfs_setfilesize_ioend(
 	 */
 	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
+	percpu_rwsem_acquire(&ip->i_mount->m_trans_rwsem, true, _THIS_IP_);
 
 	/* we abort the update if there was an IO error */
 	if (error) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index dfa429b77ee2..f1083a9ce1f8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -171,6 +171,9 @@ typedef struct xfs_mount {
 	 */
 	struct percpu_counter	m_delalloc_blks;
 
+	/* lock for transaction quiesce (used by quotaoff) */
+	struct percpu_rw_semaphore	m_trans_rwsem;
+
 	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
 	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
 	uint64_t		m_resblks;	/* total reserved blocks */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index baf5de30eebb..ff3ad5392e21 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1029,8 +1029,15 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	/* not a counter, but close enough... */
+	error = percpu_init_rwsem(&mp->m_trans_rwsem);
+	if (error)
+		goto free_delalloc;
+
 	return 0;
 
+free_delalloc:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1053,6 +1060,7 @@ static void
 xfs_destroy_percpu_counters(
 	struct xfs_mount	*mp)
 {
+	percpu_free_rwsem(&mp->m_trans_rwsem);
 	percpu_counter_destroy(&mp->m_icount);
 	percpu_counter_destroy(&mp->m_ifree);
 	percpu_counter_destroy(&mp->m_fdblocks);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ca18a040336a..c07fa036549a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -69,7 +69,7 @@ xfs_trans_free(
 
 	trace_xfs_trans_free(tp, _RET_IP_);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
-		sb_end_intwrite(tp->t_mountp->m_super);
+		xfs_trans_end(tp->t_mountp);
 	xfs_trans_free_dqinfo(tp);
 	kmem_cache_free(xfs_trans_zone, tp);
 }
@@ -265,7 +265,7 @@ xfs_trans_alloc(
 	 */
 	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
-		sb_start_intwrite(mp->m_super);
+		xfs_trans_start(mp);
 
 	/*
 	 * Zero-reservation ("empty") transactions can't modify anything, so
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f46534b75236..af54c17a22c0 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -209,6 +209,26 @@ xfs_trans_read_buf(
 				      flags, bpp, ops);
 }
 
+/*
+ * Context tracking helpers for external (i.e. fs freeze) and internal
+ * transaction quiesce.
+ */
+static inline void
+xfs_trans_start(
+	struct xfs_mount	*mp)
+{
+	sb_start_intwrite(mp->m_super);
+	percpu_down_read(&mp->m_trans_rwsem);
+}
+
+static inline void
+xfs_trans_end(
+	struct xfs_mount	*mp)
+{
+	percpu_up_read(&mp->m_trans_rwsem);
+	sb_end_intwrite(mp->m_super);
+}
+
 struct xfs_buf	*xfs_trans_getsb(struct xfs_trans *);
 
 void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-10-01 15:03 [PATCH 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
  2020-10-01 15:03 ` [PATCH 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
  2020-10-01 15:03 ` [PATCH 2/3] xfs: transaction subsystem quiesce mechanism Brian Foster
@ 2020-10-01 15:03 ` Brian Foster
  2020-10-07 22:24   ` Darrick J. Wong
  2 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2020-10-01 15:03 UTC (permalink / raw)
  To: linux-xfs

The quotaoff operation logs two log items. The start item is
committed first, followed by the bulk of the in-core quotaoff
processing, and then the quotaoff end item is committed to release
the start item from the log. The problem with this mechanism is that
quite a bit of processing can be required to release dquots from all
in-core inodes and subsequently flush/purge all dquots in the
system. This processing work doesn't generally generate much log
traffic itself, but the start item pins the tail of the log. If an
external workload consumes the remaining log space before the
transaction for the end item is allocated, a log deadlock can occur.

The purpose of the separate start and end log items is primarily to
ensure that log recovery does not incorrectly recover dquot data
after an fs crash where a quotaoff was in progress. If we only
logged a single quotaoff item, for example, it could fall behind the
tail of the log before the last dquot modification was made and
incorrectly replay dquot changes that might have occurred after the
start item committed but before quotaoff deactivated the quota.

With that context, we can make some small changes to the quotaoff
algorithm to provide the same general log ordering guarantee without
such a large window to create a log deadlock vector. Rather than
place a start item in the log for the duration of quotaoff
processing, we can quiesce the transaction subsystem up front to
guarantee that no further dquots are logged from that point forward.
IOW, we pause the transaction subsystem, commit the quotaoff start
and end items, deactivate the associated quota such that subsequent
transactions no longer modify associated dquots, and resume the
transaction subsystem. The transaction pause is somewhat of a heavy
weight operation, but quotaoff is already a rare, slow and
performance disruptive operation and the quiesce is only required
for two small transactions.

Altogether, this means that the dquot rele/purge sequence occurs
after the quotaoff end item has committed and thus can technically
fall off the end of the log. This is safe because the remaining
processing is in-core work that doesn't involve logging dquots and
we've guaranteed that no further dquots are modified by external
transactions. This allows quotaoff to complete without risking log
deadlock regardless of how much dquot processing is required.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
 fs/xfs/xfs_trans_dquot.c |   2 +
 2 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index ca1b57d291dc..b8e55f4947bd 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
 	int			error;
 	struct xfs_qoff_logitem	*qoffi;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
+				XFS_TRANS_NO_WRITECOUNT, &tp);
 	if (error)
 		goto out;
 
@@ -67,7 +68,8 @@ xfs_qm_log_quotaoff_end(
 	int			error;
 	struct xfs_qoff_logitem	*qoffi;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0,
+				XFS_TRANS_NO_WRITECOUNT, &tp);
 	if (error)
 		return error;
 
@@ -106,8 +108,8 @@ xfs_qm_scall_quotaoff(
 
 	/*
 	 * No file system can have quotas enabled on disk but not in core.
-	 * Note that quota utilities (like quotaoff) _expect_
-	 * errno == -EEXIST here.
+	 * Note that quota utilities (like quotaoff) _expect_ errno == -EEXIST
+	 * here.
 	 */
 	if ((mp->m_qflags & flags) == 0)
 		return -EEXIST;
@@ -116,17 +118,14 @@ xfs_qm_scall_quotaoff(
 	flags &= (XFS_ALL_QUOTA_ACCT | XFS_ALL_QUOTA_ENFD);
 
 	/*
-	 * We don't want to deal with two quotaoffs messing up each other,
-	 * so we're going to serialize it. quotaoff isn't exactly a performance
+	 * We don't want to deal with two quotaoffs messing up each other, so
+	 * we're going to serialize it. quotaoff isn't exactly a performance
 	 * critical thing.
-	 * If quotaoff, then we must be dealing with the root filesystem.
 	 */
 	ASSERT(q);
 	mutex_lock(&q->qi_quotaofflock);
 
-	/*
-	 * If we're just turning off quota enforcement, change mp and go.
-	 */
+	/* if we're just turning off quota enforcement, change mp and go */
 	if ((flags & XFS_ALL_QUOTA_ACCT) == 0) {
 		mp->m_qflags &= ~(flags);
 
@@ -142,9 +141,9 @@ xfs_qm_scall_quotaoff(
 	dqtype = 0;
 	inactivate_flags = 0;
 	/*
-	 * If accounting is off, we must turn enforcement off, clear the
-	 * quota 'CHKD' certificate to make it known that we have to
-	 * do a quotacheck the next time this quota is turned on.
+	 * If accounting is off, we must turn enforcement off, clear the quota
+	 * 'CHKD' certificate to make it known that we have to do a quotacheck
+	 * the next time this quota is turned on.
 	 */
 	if (flags & XFS_UQUOTA_ACCT) {
 		dqtype |= XFS_QMOPT_UQUOTA;
@@ -163,89 +162,87 @@ xfs_qm_scall_quotaoff(
 	}
 
 	/*
-	 * Nothing to do?  Don't complain. This happens when we're just
-	 * turning off quota enforcement.
+	 * Nothing to do? Don't complain. This happens when we're just turning
+	 * off quota enforcement.
 	 */
 	if ((mp->m_qflags & flags) == 0)
 		goto out_unlock;
 
 	/*
-	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
-	 * and synchronously. If we fail to write, we should abort the
-	 * operation as it cannot be recovered safely if we crash.
+	 * Quotaoff must deactivate the associated quota mode(s), release dquots
+	 * from inodes and purge them from the system all while the filesystem
+	 * remains active. We have two quotaoff log records that traditionally
+	 * bound the start and end of this sequence. This guarantees that no
+	 * dquots are modified after the end item hits the log, but quotaoff can
+	 * be time consuming and thus prone to deadlock because the start item
+	 * pins the tail the of log in the meantime (and we can't hold the end
+	 * transaction open across the dqrele scan).
+	 *
+	 * The critical aspect of correctly logging quotaoff is that no dquots
+	 * are modified after the quotaoff end item hits the on-disk log.
+	 * Otherwise the quotaoff can fall off the tail and log recovery can
+	 * replay incorrect data. Instead of letting the start item sit in the
+	 * log while quotaoff completes, we can provide the same guarantee via a
+	 * runtime barrier for dquot modifications. Specifically, we pause all
+	 * transactions on the system via the transaction subsystem lock, log
+	 * both start and end items (via sync transactions, which drains the
+	 * CIL), deactivate the quota, and then resume the transaction subsystem
+	 * while quotaoff completes.
+	 *
+	 * This is safe because the remaining quotaoff work is in-core cleanup
+	 * and all subsequent transactions should see the updated quota state
+	 * due to memory ordering provided by the lock. We also avoid deadlock
+	 * by committing both items sequentially with near exclusive access to
+	 * the transaction subsystem.
 	 */
+	percpu_down_write(&mp->m_trans_rwsem);
+
 	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
-	if (error)
+	if (error) {
+		percpu_up_write(&mp->m_trans_rwsem);
 		goto out_unlock;
+	}
 
-	/*
-	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
-	 * to take care of the race between dqget and quotaoff. We don't take
-	 * any special locks to reset these bits. All processes need to check
-	 * these bits *after* taking inode lock(s) to see if the particular
-	 * quota type is in the process of being turned off. If *ACTIVE, it is
-	 * guaranteed that all dquot structures and all quotainode ptrs will all
-	 * stay valid as long as that inode is kept locked.
-	 *
-	 * There is no turning back after this.
-	 */
 	mp->m_qflags &= ~inactivate_flags;
 
+	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
+	if (error) {
+		percpu_up_write(&mp->m_trans_rwsem);
+		/* We're screwed now. Shutdown is the only option. */
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		goto out_unlock;
+	}
+
+	percpu_up_write(&mp->m_trans_rwsem);
+
 	/*
-	 * Give back all the dquot reference(s) held by inodes.
-	 * Here we go thru every single incore inode in this file system, and
-	 * do a dqrele on the i_udquot/i_gdquot that it may have.
-	 * Essentially, as long as somebody has an inode locked, this guarantees
-	 * that quotas will not be turned off. This is handy because in a
-	 * transaction once we lock the inode(s) and check for quotaon, we can
-	 * depend on the quota inodes (and other things) being valid as long as
-	 * we keep the lock(s).
+	 * Release dquot references held by inodes. Technically some contexts
+	 * might not pick up the quota state change until the inode lock is
+	 * cycled if there is no transaction. We don't care about that above
+	 * because a dquot can't be logged without a transaction and we can't
+	 * release/purge a dquot here until we've cycled the locks of all inodes
+	 * that reference it.
 	 */
 	xfs_qm_dqrele_all_inodes(mp, flags);
 
 	/*
 	 * Next we make the changes in the quota flag in the mount struct.
-	 * This isn't protected by a particular lock directly, because we
-	 * don't want to take a mrlock every time we depend on quotas being on.
+	 * This isn't protected by a particular lock directly, because we don't
+	 * want to take a mrlock every time we depend on quotas being on.
 	 */
 	mp->m_qflags &= ~flags;
 
-	/*
-	 * Go through all the dquots of this file system and purge them,
-	 * according to what was turned off.
-	 */
+	/* purge all deactivated dquots from the filesystem */
 	xfs_qm_dqpurge_all(mp, dqtype);
 
-	/*
-	 * Transactions that had started before ACTIVE state bit was cleared
-	 * could have logged many dquots, so they'd have higher LSNs than
-	 * the first QUOTAOFF log record does. If we happen to crash when
-	 * the tail of the log has gone past the QUOTAOFF record, but
-	 * before the last dquot modification, those dquots __will__
-	 * recover, and that's not good.
-	 *
-	 * So, we have QUOTAOFF start and end logitems; the start
-	 * logitem won't get overwritten until the end logitem appears...
-	 */
-	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
-	if (error) {
-		/* We're screwed now. Shutdown is the only option. */
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		goto out_unlock;
-	}
-
-	/*
-	 * If all quotas are completely turned off, close shop.
-	 */
+	/* if all quotas are completely turned off, close shop */
 	if (mp->m_qflags == 0) {
 		mutex_unlock(&q->qi_quotaofflock);
 		xfs_qm_destroy_quotainfo(mp);
 		return 0;
 	}
 
-	/*
-	 * Release our quotainode references if we don't need them anymore.
-	 */
+	/* release our quotainode references if we don't need them anymore */
 	if ((dqtype & XFS_QMOPT_UQUOTA) && q->qi_uquotaip) {
 		xfs_irele(q->qi_uquotaip);
 		q->qi_uquotaip = NULL;
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 547ba824542e..9839b83e732a 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -52,6 +52,8 @@ xfs_trans_log_dquot(
 	struct xfs_dquot	*dqp)
 {
 	ASSERT(XFS_DQ_IS_LOCKED(dqp));
+	/* quotaoff expects no dquots logged after deactivation */
+	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));
 
 	/* Upgrade the dquot to bigtime format if possible. */
 	if (dqp->q_id != 0 &&
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] xfs: skip dquot reservations if quota is inactive
  2020-10-01 15:03 ` [PATCH 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
@ 2020-10-07 22:09   ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2020-10-07 22:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Oct 01, 2020 at 11:03:08AM -0400, Brian Foster wrote:
> The dquot reservation helper currently performs the associated
> reservation for any provided dquots. The dquots could have been
> acquired from inode references or explicit dquot allocation
> requests. Some reservation callers may have already checked that the
> associated quota subsystem is active (xfs_qm_dqget() returns an
> error otherwise), while others might not have checked at all
> (xfs_trans_reserve_quota_nblks() passes the inode references).
> Further, subsequent dquot modifications do actually check that the
> associated quota is active before making transactional changes
> (xfs_trans_mod_dquot_byino()).
> 
> Given all of that, the behavior to unconditionally perform
> reservation on any provided dquots is somewhat ad hoc. While it is
> currently harmless, it is not without side effect. If the quota is
> inactive by the time a transaction attempts a quota reservation, the
> dquot will be attached to the transaction and subsequently logged,
> even though no dquot modifications are ultimately made.
> 
> This is a problem for upcoming quotaoff changes that intend to
> implement a strict transactional barrier for logging dquots during a
> quotaoff operation. If a dquot is logged after the subsystem
> deactivated and the barrier released, a subsequent log recovery can
> incorrectly replay dquot changes into the filesystem.
> 
> Therefore, update the dquot reservation path to also check that a
> particular quota mode is active before associating a dquot with a
> transaction. This should have no noticeable impact on the current
> code that already accommodates checking active quota state at points
> before and after quota reservations are made.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Seems reasonable not to bother with the dqresv step if the quota type
isn't enabled.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_trans_dquot.c | 20 +++++++++-----------
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index 133fc6fc3edd..547ba824542e 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -39,14 +39,12 @@ xfs_trans_dqjoin(
>  }
>  
>  /*
> - * This is called to mark the dquot as needing
> - * to be logged when the transaction is committed.  The dquot must
> - * already be associated with the given transaction.
> - * Note that it marks the entire transaction as dirty. In the ordinary
> - * case, this gets called via xfs_trans_commit, after the transaction
> - * is already dirty. However, there's nothing stop this from getting
> - * called directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY
> - * flag.
> + * This is called to mark the dquot as needing to be logged when the transaction
> + * is committed. The dquot must already be associated with the given
> + * transaction. Note that it marks the entire transaction as dirty. In the
> + * ordinary case, this gets called via xfs_trans_commit, after the transaction
> + * is already dirty. However, there's nothing stop this from getting called
> + * directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY flag.
>   */
>  void
>  xfs_trans_log_dquot(
> @@ -770,19 +768,19 @@ xfs_trans_reserve_quota_bydquots(
>  
>  	ASSERT(flags & XFS_QMOPT_RESBLK_MASK);
>  
> -	if (udqp) {
> +	if (XFS_IS_UQUOTA_ON(mp) && udqp) {
>  		error = xfs_trans_dqresv(tp, mp, udqp, nblks, ninos, flags);
>  		if (error)
>  			return error;
>  	}
>  
> -	if (gdqp) {
> +	if (XFS_IS_GQUOTA_ON(mp) && gdqp) {
>  		error = xfs_trans_dqresv(tp, mp, gdqp, nblks, ninos, flags);
>  		if (error)
>  			goto unwind_usr;
>  	}
>  
> -	if (pdqp) {
> +	if (XFS_IS_PQUOTA_ON(mp) && pdqp) {
>  		error = xfs_trans_dqresv(tp, mp, pdqp, nblks, ninos, flags);
>  		if (error)
>  			goto unwind_grp;
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] xfs: transaction subsystem quiesce mechanism
  2020-10-01 15:03 ` [PATCH 2/3] xfs: transaction subsystem quiesce mechanism Brian Foster
@ 2020-10-07 22:13   ` Darrick J. Wong
  2020-10-08 11:26     ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2020-10-07 22:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Oct 01, 2020 at 11:03:09AM -0400, Brian Foster wrote:
> The updated quotaoff logging algorithm depends on a runtime quiesce
> of the transaction subsystem to guarantee all transactions after a
> certain point detect quota subsystem changes. Implement this
> mechanism using an internal lock, similar to the external filesystem
> freeze mechanism. This is also somewhat analogous to the old percpu
> transaction counter mechanism, but we don't actually need a counter.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  |  2 ++
>  fs/xfs/xfs_mount.h |  3 +++
>  fs/xfs/xfs_super.c |  8 ++++++++
>  fs/xfs/xfs_trans.c |  4 ++--
>  fs/xfs/xfs_trans.h | 20 ++++++++++++++++++++
>  5 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b35611882ff9..214310c94de5 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -58,6 +58,7 @@ xfs_setfilesize_trans_alloc(
>  	 * we released it.
>  	 */
>  	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> +	percpu_rwsem_release(&mp->m_trans_rwsem, true, _THIS_IP_);
>  	/*
>  	 * We hand off the transaction to the completion thread now, so
>  	 * clear the flag here.
> @@ -127,6 +128,7 @@ xfs_setfilesize_ioend(
>  	 */
>  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> +	percpu_rwsem_acquire(&ip->i_mount->m_trans_rwsem, true, _THIS_IP_);
>  
>  	/* we abort the update if there was an IO error */
>  	if (error) {
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index dfa429b77ee2..f1083a9ce1f8 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -171,6 +171,9 @@ typedef struct xfs_mount {
>  	 */
>  	struct percpu_counter	m_delalloc_blks;
>  
> +	/* lock for transaction quiesce (used by quotaoff) */
> +	struct percpu_rw_semaphore	m_trans_rwsem;
> +
>  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
>  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
>  	uint64_t		m_resblks;	/* total reserved blocks */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index baf5de30eebb..ff3ad5392e21 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1029,8 +1029,15 @@ xfs_init_percpu_counters(
>  	if (error)
>  		goto free_fdblocks;
>  
> +	/* not a counter, but close enough... */
> +	error = percpu_init_rwsem(&mp->m_trans_rwsem);
> +	if (error)
> +		goto free_delalloc;
> +
>  	return 0;
>  
> +free_delalloc:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1053,6 +1060,7 @@ static void
>  xfs_destroy_percpu_counters(
>  	struct xfs_mount	*mp)
>  {
> +	percpu_free_rwsem(&mp->m_trans_rwsem);
>  	percpu_counter_destroy(&mp->m_icount);
>  	percpu_counter_destroy(&mp->m_ifree);
>  	percpu_counter_destroy(&mp->m_fdblocks);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index ca18a040336a..c07fa036549a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -69,7 +69,7 @@ xfs_trans_free(
>  
>  	trace_xfs_trans_free(tp, _RET_IP_);
>  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
> -		sb_end_intwrite(tp->t_mountp->m_super);
> +		xfs_trans_end(tp->t_mountp);
>  	xfs_trans_free_dqinfo(tp);
>  	kmem_cache_free(xfs_trans_zone, tp);
>  }
> @@ -265,7 +265,7 @@ xfs_trans_alloc(
>  	 */
>  	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
> -		sb_start_intwrite(mp->m_super);
> +		xfs_trans_start(mp);
>  
>  	/*
>  	 * Zero-reservation ("empty") transactions can't modify anything, so
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f46534b75236..af54c17a22c0 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -209,6 +209,26 @@ xfs_trans_read_buf(
>  				      flags, bpp, ops);
>  }
>  
> +/*
> + * Context tracking helpers for external (i.e. fs freeze) and internal
> + * transaction quiesce.
> + */
> +static inline void
> +xfs_trans_start(
> +	struct xfs_mount	*mp)
> +{
> +	sb_start_intwrite(mp->m_super);
> +	percpu_down_read(&mp->m_trans_rwsem);

/me wonders, have you noticed any extra cpu overhead with this?

So far it looks ok to me, though I wonder if we could skip all this if
CONFIG_XFS_QUOTA=n...

--D

> +}
> +
> +static inline void
> +xfs_trans_end(
> +	struct xfs_mount	*mp)
> +{
> +	percpu_up_read(&mp->m_trans_rwsem);
> +	sb_end_intwrite(mp->m_super);
> +}
> +
>  struct xfs_buf	*xfs_trans_getsb(struct xfs_trans *);
>  
>  void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-10-01 15:03 ` [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
@ 2020-10-07 22:24   ` Darrick J. Wong
  2020-10-08 11:29     ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2020-10-07 22:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Oct 01, 2020 at 11:03:10AM -0400, Brian Foster wrote:
> The quotaoff operation logs two log items. The start item is
> committed first, followed by the bulk of the in-core quotaoff
> processing, and then the quotaoff end item is committed to release
> the start item from the log. The problem with this mechanism is that
> quite a bit of processing can be required to release dquots from all
> in-core inodes and subsequently flush/purge all dquots in the
> system. This processing work doesn't generally generate much log
> traffic itself, but the start item pins the tail of the log. If an
> external workload consumes the remaining log space before the
> transaction for the end item is allocated, a log deadlock can occur.
> 
> The purpose of the separate start and end log items is primarily to
> ensure that log recovery does not incorrectly recover dquot data
> after an fs crash where a quotaoff was in progress. If we only
> logged a single quotaoff item, for example, it could fall behind the
> tail of the log before the last dquot modification was made and
> incorrectly replay dquot changes that might have occurred after the
> start item committed but before quotaoff deactivated the quota.
> 
> With that context, we can make some small changes to the quotaoff
> algorithm to provide the same general log ordering guarantee without
> such a large window to create a log deadlock vector. Rather than
> place a start item in the log for the duration of quotaoff
> processing, we can quiesce the transaction subsystem up front to
> guarantee that no further dquots are logged from that point forward.
> IOW, we pause the transaction subsystem, commit the quotaoff start
> and end items, deactivate the associated quota such that subsequent
> transactions no longer modify associated dquots, and resume the
> transaction subsystem. The transaction pause is somewhat of a heavy
> weight operation, but quotaoff is already a rare, slow and
> performance disruptive operation and the quiesce is only required
> for two small transactions.
> 
> Altogether, this means that the dquot rele/purge sequence occurs
> after the quotaoff end item has committed and thus can technically
> fall off the end of the log. This is safe because the remaining
> processing is in-core work that doesn't involve logging dquots and
> we've guaranteed that no further dquots are modified by external
> transactions. This allows quotaoff to complete without risking log
> deadlock regardless of how much dquot processing is required.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
>  fs/xfs/xfs_trans_dquot.c |   2 +
>  2 files changed, 67 insertions(+), 68 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index ca1b57d291dc..b8e55f4947bd 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> +				XFS_TRANS_NO_WRITECOUNT, &tp);
>  	if (error)
>  		goto out;
>  
> @@ -67,7 +68,8 @@ xfs_qm_log_quotaoff_end(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0,
> +				XFS_TRANS_NO_WRITECOUNT, &tp);
>  	if (error)
>  		return error;
>  
> @@ -106,8 +108,8 @@ xfs_qm_scall_quotaoff(
>  
>  	/*
>  	 * No file system can have quotas enabled on disk but not in core.
> -	 * Note that quota utilities (like quotaoff) _expect_
> -	 * errno == -EEXIST here.
> +	 * Note that quota utilities (like quotaoff) _expect_ errno == -EEXIST
> +	 * here.
>  	 */
>  	if ((mp->m_qflags & flags) == 0)
>  		return -EEXIST;
> @@ -116,17 +118,14 @@ xfs_qm_scall_quotaoff(
>  	flags &= (XFS_ALL_QUOTA_ACCT | XFS_ALL_QUOTA_ENFD);
>  
>  	/*
> -	 * We don't want to deal with two quotaoffs messing up each other,
> -	 * so we're going to serialize it. quotaoff isn't exactly a performance
> +	 * We don't want to deal with two quotaoffs messing up each other, so
> +	 * we're going to serialize it. quotaoff isn't exactly a performance
>  	 * critical thing.
> -	 * If quotaoff, then we must be dealing with the root filesystem.
>  	 */
>  	ASSERT(q);
>  	mutex_lock(&q->qi_quotaofflock);
>  
> -	/*
> -	 * If we're just turning off quota enforcement, change mp and go.
> -	 */
> +	/* if we're just turning off quota enforcement, change mp and go */
>  	if ((flags & XFS_ALL_QUOTA_ACCT) == 0) {
>  		mp->m_qflags &= ~(flags);
>  
> @@ -142,9 +141,9 @@ xfs_qm_scall_quotaoff(
>  	dqtype = 0;
>  	inactivate_flags = 0;
>  	/*
> -	 * If accounting is off, we must turn enforcement off, clear the
> -	 * quota 'CHKD' certificate to make it known that we have to
> -	 * do a quotacheck the next time this quota is turned on.
> +	 * If accounting is off, we must turn enforcement off, clear the quota
> +	 * 'CHKD' certificate to make it known that we have to do a quotacheck
> +	 * the next time this quota is turned on.
>  	 */
>  	if (flags & XFS_UQUOTA_ACCT) {
>  		dqtype |= XFS_QMOPT_UQUOTA;
> @@ -163,89 +162,87 @@ xfs_qm_scall_quotaoff(
>  	}
>  
>  	/*
> -	 * Nothing to do?  Don't complain. This happens when we're just
> -	 * turning off quota enforcement.
> +	 * Nothing to do? Don't complain. This happens when we're just turning
> +	 * off quota enforcement.
>  	 */
>  	if ((mp->m_qflags & flags) == 0)
>  		goto out_unlock;
>  
>  	/*
> -	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> -	 * and synchronously. If we fail to write, we should abort the
> -	 * operation as it cannot be recovered safely if we crash.
> +	 * Quotaoff must deactivate the associated quota mode(s), release dquots
> +	 * from inodes and purge them from the system all while the filesystem
> +	 * remains active. We have two quotaoff log records that traditionally
> +	 * bound the start and end of this sequence. This guarantees that no
> +	 * dquots are modified after the end item hits the log, but quotaoff can
> +	 * be time consuming and thus prone to deadlock because the start item
> +	 * pins the tail the of log in the meantime (and we can't hold the end
> +	 * transaction open across the dqrele scan).
> +	 *
> +	 * The critical aspect of correctly logging quotaoff is that no dquots
> +	 * are modified after the quotaoff end item hits the on-disk log.
> +	 * Otherwise the quotaoff can fall off the tail and log recovery can
> +	 * replay incorrect data. Instead of letting the start item sit in the
> +	 * log while quotaoff completes, we can provide the same guarantee via a
> +	 * runtime barrier for dquot modifications. Specifically, we pause all
> +	 * transactions on the system via the transaction subsystem lock, log
> +	 * both start and end items (via sync transactions, which drains the
> +	 * CIL), deactivate the quota, and then resume the transaction subsystem
> +	 * while quotaoff completes.
> +	 *
> +	 * This is safe because the remaining quotaoff work is in-core cleanup
> +	 * and all subsequent transactions should see the updated quota state
> +	 * due to memory ordering provided by the lock. We also avoid deadlock
> +	 * by committing both items sequentially with near exclusive access to
> +	 * the transaction subsystem.
>  	 */
> +	percpu_down_write(&mp->m_trans_rwsem);
> +
>  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> -	if (error)
> +	if (error) {
> +		percpu_up_write(&mp->m_trans_rwsem);
>  		goto out_unlock;
> +	}
>  
> -	/*
> -	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> -	 * to take care of the race between dqget and quotaoff. We don't take
> -	 * any special locks to reset these bits. All processes need to check
> -	 * these bits *after* taking inode lock(s) to see if the particular
> -	 * quota type is in the process of being turned off. If *ACTIVE, it is
> -	 * guaranteed that all dquot structures and all quotainode ptrs will all
> -	 * stay valid as long as that inode is kept locked.
> -	 *
> -	 * There is no turning back after this.
> -	 */
>  	mp->m_qflags &= ~inactivate_flags;
>  
> +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> +	if (error) {
> +		percpu_up_write(&mp->m_trans_rwsem);
> +		/* We're screwed now. Shutdown is the only option. */
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		goto out_unlock;
> +	}
> +
> +	percpu_up_write(&mp->m_trans_rwsem);

Hmm, so if I read this correctly, you're changing what gets written to
the log from:

	<quotaoff intent>
	<transactions that maybe log dquots but we don't care>
	...now the qflags get cleared...
	<transactions that maybe log dquots but we don't care>
	...run around purging dquots...
	<transactions that maybe log dquots but we don't care>
	<quotaoff done>

to something that looks more like:

	<transactions that log dquots>
	<quotaoff intent, all other transactions locked out>
	...now the qflags get cleared; no other transactions...
	<quotaoff done, turn off the lockout>
	<transactions that wont log dquots>
	...run around purging dquots...

Is that right?  I guess that makes sense, though it's sort of a pity
that we now make every transaction grab a read lock.  Though I guess
there's not much that can be done about that; it's better than pinning
the log tail.  I don't even think there's a good way to relog that
quotaoff-intent item, is there?  Since you'd have to, I dunno, do all
the checks that xfs_defer_relog() does to decide if it should relog an
intent item?

--D

> +
>  	/*
> -	 * Give back all the dquot reference(s) held by inodes.
> -	 * Here we go thru every single incore inode in this file system, and
> -	 * do a dqrele on the i_udquot/i_gdquot that it may have.
> -	 * Essentially, as long as somebody has an inode locked, this guarantees
> -	 * that quotas will not be turned off. This is handy because in a
> -	 * transaction once we lock the inode(s) and check for quotaon, we can
> -	 * depend on the quota inodes (and other things) being valid as long as
> -	 * we keep the lock(s).
> +	 * Release dquot references held by inodes. Technically some contexts
> +	 * might not pick up the quota state change until the inode lock is
> +	 * cycled if there is no transaction. We don't care about that above
> +	 * because a dquot can't be logged without a transaction and we can't
> +	 * release/purge a dquot here until we've cycled the locks of all inodes
> +	 * that reference it.
>  	 */
>  	xfs_qm_dqrele_all_inodes(mp, flags);
>  
>  	/*
>  	 * Next we make the changes in the quota flag in the mount struct.
> -	 * This isn't protected by a particular lock directly, because we
> -	 * don't want to take a mrlock every time we depend on quotas being on.
> +	 * This isn't protected by a particular lock directly, because we don't
> +	 * want to take a mrlock every time we depend on quotas being on.
>  	 */
>  	mp->m_qflags &= ~flags;
>  
> -	/*
> -	 * Go through all the dquots of this file system and purge them,
> -	 * according to what was turned off.
> -	 */
> +	/* purge all deactivated dquots from the filesystem */
>  	xfs_qm_dqpurge_all(mp, dqtype);
>  
> -	/*
> -	 * Transactions that had started before ACTIVE state bit was cleared
> -	 * could have logged many dquots, so they'd have higher LSNs than
> -	 * the first QUOTAOFF log record does. If we happen to crash when
> -	 * the tail of the log has gone past the QUOTAOFF record, but
> -	 * before the last dquot modification, those dquots __will__
> -	 * recover, and that's not good.
> -	 *
> -	 * So, we have QUOTAOFF start and end logitems; the start
> -	 * logitem won't get overwritten until the end logitem appears...
> -	 */
> -	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> -	if (error) {
> -		/* We're screwed now. Shutdown is the only option. */
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -		goto out_unlock;
> -	}
> -
> -	/*
> -	 * If all quotas are completely turned off, close shop.
> -	 */
> +	/* if all quotas are completely turned off, close shop */
>  	if (mp->m_qflags == 0) {
>  		mutex_unlock(&q->qi_quotaofflock);
>  		xfs_qm_destroy_quotainfo(mp);
>  		return 0;
>  	}
>  
> -	/*
> -	 * Release our quotainode references if we don't need them anymore.
> -	 */
> +	/* release our quotainode references if we don't need them anymore */
>  	if ((dqtype & XFS_QMOPT_UQUOTA) && q->qi_uquotaip) {
>  		xfs_irele(q->qi_uquotaip);
>  		q->qi_uquotaip = NULL;
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index 547ba824542e..9839b83e732a 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -52,6 +52,8 @@ xfs_trans_log_dquot(
>  	struct xfs_dquot	*dqp)
>  {
>  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
> +	/* quotaoff expects no dquots logged after deactivation */
> +	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));
>  
>  	/* Upgrade the dquot to bigtime format if possible. */
>  	if (dqp->q_id != 0 &&
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] xfs: transaction subsystem quiesce mechanism
  2020-10-07 22:13   ` Darrick J. Wong
@ 2020-10-08 11:26     ` Brian Foster
  2020-10-08 18:27       ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2020-10-08 11:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 07, 2020 at 03:13:10PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 01, 2020 at 11:03:09AM -0400, Brian Foster wrote:
> > The updated quotaoff logging algorithm depends on a runtime quiesce
> > of the transaction subsystem to guarantee all transactions after a
> > certain point detect quota subsystem changes. Implement this
> > mechanism using an internal lock, similar to the external filesystem
> > freeze mechanism. This is also somewhat analogous to the old percpu
> > transaction counter mechanism, but we don't actually need a counter.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c  |  2 ++
> >  fs/xfs/xfs_mount.h |  3 +++
> >  fs/xfs/xfs_super.c |  8 ++++++++
> >  fs/xfs/xfs_trans.c |  4 ++--
> >  fs/xfs/xfs_trans.h | 20 ++++++++++++++++++++
> >  5 files changed, 35 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index b35611882ff9..214310c94de5 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -58,6 +58,7 @@ xfs_setfilesize_trans_alloc(
> >  	 * we released it.
> >  	 */
> >  	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> > +	percpu_rwsem_release(&mp->m_trans_rwsem, true, _THIS_IP_);
> >  	/*
> >  	 * We hand off the transaction to the completion thread now, so
> >  	 * clear the flag here.
> > @@ -127,6 +128,7 @@ xfs_setfilesize_ioend(
> >  	 */
> >  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> > +	percpu_rwsem_acquire(&ip->i_mount->m_trans_rwsem, true, _THIS_IP_);
> >  
> >  	/* we abort the update if there was an IO error */
> >  	if (error) {
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index dfa429b77ee2..f1083a9ce1f8 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -171,6 +171,9 @@ typedef struct xfs_mount {
> >  	 */
> >  	struct percpu_counter	m_delalloc_blks;
> >  
> > +	/* lock for transaction quiesce (used by quotaoff) */
> > +	struct percpu_rw_semaphore	m_trans_rwsem;
> > +
> >  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> >  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> >  	uint64_t		m_resblks;	/* total reserved blocks */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index baf5de30eebb..ff3ad5392e21 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1029,8 +1029,15 @@ xfs_init_percpu_counters(
> >  	if (error)
> >  		goto free_fdblocks;
> >  
> > +	/* not a counter, but close enough... */
> > +	error = percpu_init_rwsem(&mp->m_trans_rwsem);
> > +	if (error)
> > +		goto free_delalloc;
> > +
> >  	return 0;
> >  
> > +free_delalloc:
> > +	percpu_counter_destroy(&mp->m_delalloc_blks);
> >  free_fdblocks:
> >  	percpu_counter_destroy(&mp->m_fdblocks);
> >  free_ifree:
> > @@ -1053,6 +1060,7 @@ static void
> >  xfs_destroy_percpu_counters(
> >  	struct xfs_mount	*mp)
> >  {
> > +	percpu_free_rwsem(&mp->m_trans_rwsem);
> >  	percpu_counter_destroy(&mp->m_icount);
> >  	percpu_counter_destroy(&mp->m_ifree);
> >  	percpu_counter_destroy(&mp->m_fdblocks);
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index ca18a040336a..c07fa036549a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -69,7 +69,7 @@ xfs_trans_free(
> >  
> >  	trace_xfs_trans_free(tp, _RET_IP_);
> >  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
> > -		sb_end_intwrite(tp->t_mountp->m_super);
> > +		xfs_trans_end(tp->t_mountp);
> >  	xfs_trans_free_dqinfo(tp);
> >  	kmem_cache_free(xfs_trans_zone, tp);
> >  }
> > @@ -265,7 +265,7 @@ xfs_trans_alloc(
> >  	 */
> >  	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
> >  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
> > -		sb_start_intwrite(mp->m_super);
> > +		xfs_trans_start(mp);
> >  
> >  	/*
> >  	 * Zero-reservation ("empty") transactions can't modify anything, so
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index f46534b75236..af54c17a22c0 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -209,6 +209,26 @@ xfs_trans_read_buf(
> >  				      flags, bpp, ops);
> >  }
> >  
> > +/*
> > + * Context tracking helpers for external (i.e. fs freeze) and internal
> > + * transaction quiesce.
> > + */
> > +static inline void
> > +xfs_trans_start(
> > +	struct xfs_mount	*mp)
> > +{
> > +	sb_start_intwrite(mp->m_super);
> > +	percpu_down_read(&mp->m_trans_rwsem);
> 
> /me wonders, have you noticed any extra cpu overhead with this?
> 

Not that I've noticed, but I haven't explicitly profiled it or anything.
My understanding is the percpu variant of the rwsem should make read
acquisition extremely cheap at the cost of more expensive write
acquisition (which pretty much fits use cases like freeze, quotaoff,
etc.). I can try a file create/remove workload or something on a high
cpu count box and see if there's a noticeable effect..

> So far it looks ok to me, though I wonder if we could skip all this if
> CONFIG_XFS_QUOTA=n...
> 

We could, though I'd probably prefer to use something like
XFS_IS_QUOTA_RUNNING() as a conditional rather than an ifdef. Of course,
that means we'd have to be careful that turning quotaoff doesn't
actually break the transaction tracking. Regardless, I'm not really sure
it's worth sprinkling that logic around to avoid a per-transaction
percpu read lock.

Brian

> --D
> 
> > +}
> > +
> > +static inline void
> > +xfs_trans_end(
> > +	struct xfs_mount	*mp)
> > +{
> > +	percpu_up_read(&mp->m_trans_rwsem);
> > +	sb_end_intwrite(mp->m_super);
> > +}
> > +
> >  struct xfs_buf	*xfs_trans_getsb(struct xfs_trans *);
> >  
> >  void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
> > -- 
> > 2.25.4
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-10-07 22:24   ` Darrick J. Wong
@ 2020-10-08 11:29     ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2020-10-08 11:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 07, 2020 at 03:24:16PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 01, 2020 at 11:03:10AM -0400, Brian Foster wrote:
> > The quotaoff operation logs two log items. The start item is
> > committed first, followed by the bulk of the in-core quotaoff
> > processing, and then the quotaoff end item is committed to release
> > the start item from the log. The problem with this mechanism is that
> > quite a bit of processing can be required to release dquots from all
> > in-core inodes and subsequently flush/purge all dquots in the
> > system. This processing work doesn't generally generate much log
> > traffic itself, but the start item pins the tail of the log. If an
> > external workload consumes the remaining log space before the
> > transaction for the end item is allocated, a log deadlock can occur.
> > 
> > The purpose of the separate start and end log items is primarily to
> > ensure that log recovery does not incorrectly recover dquot data
> > after an fs crash where a quotaoff was in progress. If we only
> > logged a single quotaoff item, for example, it could fall behind the
> > tail of the log before the last dquot modification was made and
> > incorrectly replay dquot changes that might have occurred after the
> > start item committed but before quotaoff deactivated the quota.
> > 
> > With that context, we can make some small changes to the quotaoff
> > algorithm to provide the same general log ordering guarantee without
> > such a large window to create a log deadlock vector. Rather than
> > place a start item in the log for the duration of quotaoff
> > processing, we can quiesce the transaction subsystem up front to
> > guarantee that no further dquots are logged from that point forward.
> > IOW, we pause the transaction subsystem, commit the quotaoff start
> > and end items, deactivate the associated quota such that subsequent
> > transactions no longer modify associated dquots, and resume the
> > transaction subsystem. The transaction pause is somewhat of a heavy
> > weight operation, but quotaoff is already a rare, slow and
> > performance disruptive operation and the quiesce is only required
> > for two small transactions.
> > 
> > Altogether, this means that the dquot rele/purge sequence occurs
> > after the quotaoff end item has committed and thus can technically
> > fall off the end of the log. This is safe because the remaining
> > processing is in-core work that doesn't involve logging dquots and
> > we've guaranteed that no further dquots are modified by external
> > transactions. This allows quotaoff to complete without risking log
> > deadlock regardless of how much dquot processing is required.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
> >  fs/xfs/xfs_trans_dquot.c |   2 +
> >  2 files changed, 67 insertions(+), 68 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index ca1b57d291dc..b8e55f4947bd 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
...
> > @@ -163,89 +162,87 @@ xfs_qm_scall_quotaoff(
> >  	}
> >  
> >  	/*
> > -	 * Nothing to do?  Don't complain. This happens when we're just
> > -	 * turning off quota enforcement.
> > +	 * Nothing to do? Don't complain. This happens when we're just turning
> > +	 * off quota enforcement.
> >  	 */
> >  	if ((mp->m_qflags & flags) == 0)
> >  		goto out_unlock;
> >  
> >  	/*
> > -	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> > -	 * and synchronously. If we fail to write, we should abort the
> > -	 * operation as it cannot be recovered safely if we crash.
> > +	 * Quotaoff must deactivate the associated quota mode(s), release dquots
> > +	 * from inodes and purge them from the system all while the filesystem
> > +	 * remains active. We have two quotaoff log records that traditionally
> > +	 * bound the start and end of this sequence. This guarantees that no
> > +	 * dquots are modified after the end item hits the log, but quotaoff can
> > +	 * be time consuming and thus prone to deadlock because the start item
> > +	 * pins the tail the of log in the meantime (and we can't hold the end
> > +	 * transaction open across the dqrele scan).
> > +	 *
> > +	 * The critical aspect of correctly logging quotaoff is that no dquots
> > +	 * are modified after the quotaoff end item hits the on-disk log.
> > +	 * Otherwise the quotaoff can fall off the tail and log recovery can
> > +	 * replay incorrect data. Instead of letting the start item sit in the
> > +	 * log while quotaoff completes, we can provide the same guarantee via a
> > +	 * runtime barrier for dquot modifications. Specifically, we pause all
> > +	 * transactions on the system via the transaction subsystem lock, log
> > +	 * both start and end items (via sync transactions, which drains the
> > +	 * CIL), deactivate the quota, and then resume the transaction subsystem
> > +	 * while quotaoff completes.
> > +	 *
> > +	 * This is safe because the remaining quotaoff work is in-core cleanup
> > +	 * and all subsequent transactions should see the updated quota state
> > +	 * due to memory ordering provided by the lock. We also avoid deadlock
> > +	 * by committing both items sequentially with near exclusive access to
> > +	 * the transaction subsystem.
> >  	 */
> > +	percpu_down_write(&mp->m_trans_rwsem);
> > +
> >  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > -	if (error)
> > +	if (error) {
> > +		percpu_up_write(&mp->m_trans_rwsem);
> >  		goto out_unlock;
> > +	}
> >  
> > -	/*
> > -	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> > -	 * to take care of the race between dqget and quotaoff. We don't take
> > -	 * any special locks to reset these bits. All processes need to check
> > -	 * these bits *after* taking inode lock(s) to see if the particular
> > -	 * quota type is in the process of being turned off. If *ACTIVE, it is
> > -	 * guaranteed that all dquot structures and all quotainode ptrs will all
> > -	 * stay valid as long as that inode is kept locked.
> > -	 *
> > -	 * There is no turning back after this.
> > -	 */
> >  	mp->m_qflags &= ~inactivate_flags;
> >  
> > +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > +	if (error) {
> > +		percpu_up_write(&mp->m_trans_rwsem);
> > +		/* We're screwed now. Shutdown is the only option. */
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +		goto out_unlock;
> > +	}
> > +
> > +	percpu_up_write(&mp->m_trans_rwsem);
> 
> Hmm, so if I read this correctly, you're changing what gets written to
> the log from:
> 
> 	<quotaoff intent>
> 	<transactions that maybe log dquots but we don't care>
> 	...now the qflags get cleared...
> 	<transactions that maybe log dquots but we don't care>
> 	...run around purging dquots...
> 	<transactions that maybe log dquots but we don't care>
> 	<quotaoff done>
> 
> to something that looks more like:
> 
> 	<transactions that log dquots>
> 	<quotaoff intent, all other transactions locked out>
> 	...now the qflags get cleared; no other transactions...
> 	<quotaoff done, turn off the lockout>
> 	<transactions that wont log dquots>
> 	...run around purging dquots...
> 

Yep, pretty much. The idea is to preserve the behavior that no dquots
are logged after the quotaoff end item to guarantee the log can never
hold a modified dquot after a previous quotaoff falls off the tail.
Rather than doing that by bounding the quotaoff operation with the two
log items, we introduce a runtime quiesce that deterministically
disables dquot activity and log the items after that point.

> Is that right?  I guess that makes sense, though it's sort of a pity
> that we now make every transaction grab a read lock.  Though I guess
> there's not much that can be done about that; it's better than pinning
> the log tail.  I don't even think there's a good way to relog that
> quotaoff-intent item, is there?  Since you'd have to, I dunno, do all
> the checks that xfs_defer_relog() does to decide if it should relog an
> intent item?
> 

Yeah, well that was a purpose of the previous automatic relog bits.
Short of that, I'm not aware of a great way to keep the start intent
moving. We can't hold a transaction open across the scan(s), so that
also rules out turning quotaoff into a dfop, for example. We could try
to relog the intent at random points, but that looks to me like a bunch
of busy work to break up the scans for no guarantee in return that the
relog transaction wouldn't still deadlock anyways.

Brian

> --D
> 
> > +
> >  	/*
> > -	 * Give back all the dquot reference(s) held by inodes.
> > -	 * Here we go thru every single incore inode in this file system, and
> > -	 * do a dqrele on the i_udquot/i_gdquot that it may have.
> > -	 * Essentially, as long as somebody has an inode locked, this guarantees
> > -	 * that quotas will not be turned off. This is handy because in a
> > -	 * transaction once we lock the inode(s) and check for quotaon, we can
> > -	 * depend on the quota inodes (and other things) being valid as long as
> > -	 * we keep the lock(s).
> > +	 * Release dquot references held by inodes. Technically some contexts
> > +	 * might not pick up the quota state change until the inode lock is
> > +	 * cycled if there is no transaction. We don't care about that above
> > +	 * because a dquot can't be logged without a transaction and we can't
> > +	 * release/purge a dquot here until we've cycled the locks of all inodes
> > +	 * that reference it.
> >  	 */
> >  	xfs_qm_dqrele_all_inodes(mp, flags);
> >  
> >  	/*
> >  	 * Next we make the changes in the quota flag in the mount struct.
> > -	 * This isn't protected by a particular lock directly, because we
> > -	 * don't want to take a mrlock every time we depend on quotas being on.
> > +	 * This isn't protected by a particular lock directly, because we don't
> > +	 * want to take a mrlock every time we depend on quotas being on.
> >  	 */
> >  	mp->m_qflags &= ~flags;
> >  
> > -	/*
> > -	 * Go through all the dquots of this file system and purge them,
> > -	 * according to what was turned off.
> > -	 */
> > +	/* purge all deactivated dquots from the filesystem */
> >  	xfs_qm_dqpurge_all(mp, dqtype);
> >  
> > -	/*
> > -	 * Transactions that had started before ACTIVE state bit was cleared
> > -	 * could have logged many dquots, so they'd have higher LSNs than
> > -	 * the first QUOTAOFF log record does. If we happen to crash when
> > -	 * the tail of the log has gone past the QUOTAOFF record, but
> > -	 * before the last dquot modification, those dquots __will__
> > -	 * recover, and that's not good.
> > -	 *
> > -	 * So, we have QUOTAOFF start and end logitems; the start
> > -	 * logitem won't get overwritten until the end logitem appears...
> > -	 */
> > -	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > -	if (error) {
> > -		/* We're screwed now. Shutdown is the only option. */
> > -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > -		goto out_unlock;
> > -	}
> > -
> > -	/*
> > -	 * If all quotas are completely turned off, close shop.
> > -	 */
> > +	/* if all quotas are completely turned off, close shop */
> >  	if (mp->m_qflags == 0) {
> >  		mutex_unlock(&q->qi_quotaofflock);
> >  		xfs_qm_destroy_quotainfo(mp);
> >  		return 0;
> >  	}
> >  
> > -	/*
> > -	 * Release our quotainode references if we don't need them anymore.
> > -	 */
> > +	/* release our quotainode references if we don't need them anymore */
> >  	if ((dqtype & XFS_QMOPT_UQUOTA) && q->qi_uquotaip) {
> >  		xfs_irele(q->qi_uquotaip);
> >  		q->qi_uquotaip = NULL;
> > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > index 547ba824542e..9839b83e732a 100644
> > --- a/fs/xfs/xfs_trans_dquot.c
> > +++ b/fs/xfs/xfs_trans_dquot.c
> > @@ -52,6 +52,8 @@ xfs_trans_log_dquot(
> >  	struct xfs_dquot	*dqp)
> >  {
> >  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
> > +	/* quotaoff expects no dquots logged after deactivation */
> > +	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));
> >  
> >  	/* Upgrade the dquot to bigtime format if possible. */
> >  	if (dqp->q_id != 0 &&
> > -- 
> > 2.25.4
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] xfs: transaction subsystem quiesce mechanism
  2020-10-08 11:26     ` Brian Foster
@ 2020-10-08 18:27       ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2020-10-08 18:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 08, 2020 at 07:26:26AM -0400, Brian Foster wrote:
> On Wed, Oct 07, 2020 at 03:13:10PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 01, 2020 at 11:03:09AM -0400, Brian Foster wrote:
> > > The updated quotaoff logging algorithm depends on a runtime quiesce
> > > of the transaction subsystem to guarantee all transactions after a
> > > certain point detect quota subsystem changes. Implement this
> > > mechanism using an internal lock, similar to the external filesystem
> > > freeze mechanism. This is also somewhat analogous to the old percpu
> > > transaction counter mechanism, but we don't actually need a counter.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c  |  2 ++
> > >  fs/xfs/xfs_mount.h |  3 +++
> > >  fs/xfs/xfs_super.c |  8 ++++++++
> > >  fs/xfs/xfs_trans.c |  4 ++--
> > >  fs/xfs/xfs_trans.h | 20 ++++++++++++++++++++
> > >  5 files changed, 35 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index b35611882ff9..214310c94de5 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -58,6 +58,7 @@ xfs_setfilesize_trans_alloc(
> > >  	 * we released it.
> > >  	 */
> > >  	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> > > +	percpu_rwsem_release(&mp->m_trans_rwsem, true, _THIS_IP_);
> > >  	/*
> > >  	 * We hand off the transaction to the completion thread now, so
> > >  	 * clear the flag here.
> > > @@ -127,6 +128,7 @@ xfs_setfilesize_ioend(
> > >  	 */
> > >  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> > >  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> > > +	percpu_rwsem_acquire(&ip->i_mount->m_trans_rwsem, true, _THIS_IP_);
> > >  
> > >  	/* we abort the update if there was an IO error */
> > >  	if (error) {
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index dfa429b77ee2..f1083a9ce1f8 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -171,6 +171,9 @@ typedef struct xfs_mount {
> > >  	 */
> > >  	struct percpu_counter	m_delalloc_blks;
> > >  
> > > +	/* lock for transaction quiesce (used by quotaoff) */
> > > +	struct percpu_rw_semaphore	m_trans_rwsem;
> > > +
> > >  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> > >  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> > >  	uint64_t		m_resblks;	/* total reserved blocks */
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index baf5de30eebb..ff3ad5392e21 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1029,8 +1029,15 @@ xfs_init_percpu_counters(
> > >  	if (error)
> > >  		goto free_fdblocks;
> > >  
> > > +	/* not a counter, but close enough... */
> > > +	error = percpu_init_rwsem(&mp->m_trans_rwsem);
> > > +	if (error)
> > > +		goto free_delalloc;
> > > +
> > >  	return 0;
> > >  
> > > +free_delalloc:
> > > +	percpu_counter_destroy(&mp->m_delalloc_blks);
> > >  free_fdblocks:
> > >  	percpu_counter_destroy(&mp->m_fdblocks);
> > >  free_ifree:
> > > @@ -1053,6 +1060,7 @@ static void
> > >  xfs_destroy_percpu_counters(
> > >  	struct xfs_mount	*mp)
> > >  {
> > > +	percpu_free_rwsem(&mp->m_trans_rwsem);
> > >  	percpu_counter_destroy(&mp->m_icount);
> > >  	percpu_counter_destroy(&mp->m_ifree);
> > >  	percpu_counter_destroy(&mp->m_fdblocks);
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index ca18a040336a..c07fa036549a 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -69,7 +69,7 @@ xfs_trans_free(
> > >  
> > >  	trace_xfs_trans_free(tp, _RET_IP_);
> > >  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
> > > -		sb_end_intwrite(tp->t_mountp->m_super);
> > > +		xfs_trans_end(tp->t_mountp);
> > >  	xfs_trans_free_dqinfo(tp);
> > >  	kmem_cache_free(xfs_trans_zone, tp);
> > >  }
> > > @@ -265,7 +265,7 @@ xfs_trans_alloc(
> > >  	 */
> > >  	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
> > >  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
> > > -		sb_start_intwrite(mp->m_super);
> > > +		xfs_trans_start(mp);
> > >  
> > >  	/*
> > >  	 * Zero-reservation ("empty") transactions can't modify anything, so
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index f46534b75236..af54c17a22c0 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -209,6 +209,26 @@ xfs_trans_read_buf(
> > >  				      flags, bpp, ops);
> > >  }
> > >  
> > > +/*
> > > + * Context tracking helpers for external (i.e. fs freeze) and internal
> > > + * transaction quiesce.
> > > + */
> > > +static inline void
> > > +xfs_trans_start(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	sb_start_intwrite(mp->m_super);
> > > +	percpu_down_read(&mp->m_trans_rwsem);
> > 
> > /me wonders, have you noticed any extra cpu overhead with this?
> > 
> 
> Not that I've noticed, but I haven't explicitly profiled it or anything.
> My understanding is the percpu variant of the rwsem should make read
> acquisition extremely cheap at the cost of more expensive write
> acquisition (which pretty much fits use cases like freeze, quotaoff,
> etc.). I can try a file create/remove workload or something on a high
> cpu count box and see if there's a noticeable effect..
> 

Following up, I don't see a noticeable difference on a 64xcpu 200m empty
file creation (fs_mark) workload. The numbers jump around a bit between
iterations so it's hard to see fine details, but the tests land in the
same general ranges and complete within 30s (out of ~28m) of eachother
(with the test kernel actually completing before the baseline kernel).
All in all it kind of looks like "in the noise" to me, but I'm happy to
try any other tests people might want to see.

Brian

> > So far it looks ok to me, though I wonder if we could skip all this if
> > CONFIG_XFS_QUOTA=n...
> > 
> 
> We could, though I'd probably prefer to use something like
> XFS_IS_QUOTA_RUNNING() as a conditional rather than an ifdef. Of course,
> that means we'd have to be careful that turning quotaoff doesn't
> actually break the transaction tracking. Regardless, I'm not really sure
> it's worth sprinkling that logic around to avoid a per-transaction
> percpu read lock.
> 
> Brian
> 
> > --D
> > 
> > > +}
> > > +
> > > +static inline void
> > > +xfs_trans_end(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	percpu_up_read(&mp->m_trans_rwsem);
> > > +	sb_end_intwrite(mp->m_super);
> > > +}
> > > +
> > >  struct xfs_buf	*xfs_trans_getsb(struct xfs_trans *);
> > >  
> > >  void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
> > > -- 
> > > 2.25.4
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-10-08 18:27 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-01 15:03 [PATCH 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
2020-10-01 15:03 ` [PATCH 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
2020-10-07 22:09   ` Darrick J. Wong
2020-10-01 15:03 ` [PATCH 2/3] xfs: transaction subsystem quiesce mechanism Brian Foster
2020-10-07 22:13   ` Darrick J. Wong
2020-10-08 11:26     ` Brian Foster
2020-10-08 18:27       ` Brian Foster
2020-10-01 15:03 ` [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
2020-10-07 22:24   ` Darrick J. Wong
2020-10-08 11:29     ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.