* [PATCH 00/10] xfs: automatic relogging
@ 2020-07-01 16:51 Brian Foster
  2020-07-01 16:51 ` [PATCH 01/10] xfs: automatic relogging item management Brian Foster
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's a v1 (non-RFC) version of the automatic relogging functionality.
Note that the buffer relogging bits (patches 8-10) are still RFC, as I've
had to hack around some things to utilize them for testing. I include them
here mostly for reference/discussion. Most of the effort from the last
rfc post has gone into testing and solidifying the functionality. This
now survives a traditional fstests regression run as well as a test run
with random buffer relogging enabled on every test/scratch device mount
that occurs throughout the fstests cycle. The quotaoff use case is
additionally tested by artificially delaying completion of the
quotaoff while many fsstress worker threads run in parallel.

The hacks/workarounds to support the random buffer relogging enabled
fstests run are not included here because they are not associated with
core functionality, but rather are side effects of randomly relogging
arbitrary buffers, etc. I can work them into the buffer relogging
patches if desired, but I'd like to get the core functionality and use
case worked out before getting too far into the testing code. I also
know Darrick was interested in the ->iop_relog() callback for some form
of generic feedback into active dfops processing, so it might be worth
exploring that further.

Thoughts, reviews, flames appreciated.

Brian

v1:
- Rebased to latest for-next.
- Push handling logic tweaks.
- Rework and document the relog reservation calculation.
rfcv6: https://lore.kernel.org/linux-xfs/20200406123632.20873-1-bfoster@redhat.com/
- Rework relog reservation model.
- Drop unnecessary log ticket t_task fix.
- Use ->iop_relog() callback unconditionally.
- Rudimentary freeze handling for random buffer relogging.
- Various other fixes, tweaks and cleanups.
rfcv5: https://lore.kernel.org/linux-xfs/20200227134321.7238-1-bfoster@redhat.com/
- More fleshed out design to prevent log reservation deadlock and
  locking problems.
- Split out core patches between pre-reservation management, relog item
  state management and relog mechanism.
- Added experimental buffer relogging capability.
rfcv4: https://lore.kernel.org/linux-xfs/20191205175037.52529-1-bfoster@redhat.com/
- AIL based approach.
rfcv3: https://lore.kernel.org/linux-xfs/20191125185523.47556-1-bfoster@redhat.com/
- CIL based approach.
rfcv2: https://lore.kernel.org/linux-xfs/20191122181927.32870-1-bfoster@redhat.com/
- Different approach based on workqueue and transaction rolling.
rfc: https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/

Brian Foster (10):
  xfs: automatic relogging item management
  xfs: create helper for ticket-less log res ungrant
  xfs: extra runtime reservation overhead for relog transactions
  xfs: relog log reservation stealing and accounting
  xfs: automatic log item relog mechanism
  xfs: automatically relog the quotaoff start intent
  xfs: prevent fs freeze with outstanding relog items
  xfs: buffer relogging support prototype
  xfs: create an error tag for random relog reservation
  xfs: relog random buffers based on errortag

 fs/xfs/libxfs/xfs_errortag.h |   4 +-
 fs/xfs/libxfs/xfs_shared.h   |   1 +
 fs/xfs/xfs_buf.c             |   4 +
 fs/xfs/xfs_buf_item.c        |  61 ++++++++++++--
 fs/xfs/xfs_dquot_item.c      |  26 +++++-
 fs/xfs/xfs_error.c           |   3 +
 fs/xfs/xfs_log.c             |  35 ++++++--
 fs/xfs/xfs_log.h             |   4 +-
 fs/xfs/xfs_log_cil.c         |   2 +-
 fs/xfs/xfs_log_priv.h        |   1 +
 fs/xfs/xfs_qm_syscalls.c     |  12 ++-
 fs/xfs/xfs_super.c           |   4 +
 fs/xfs/xfs_trace.h           |   4 +
 fs/xfs/xfs_trans.c           |  75 +++++++++++++++--
 fs/xfs/xfs_trans.h           |  44 +++++++++-
 fs/xfs/xfs_trans_ail.c       | 152 ++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans_buf.c       |  80 ++++++++++++++++++
 fs/xfs/xfs_trans_priv.h      |  28 +++++++
 18 files changed, 512 insertions(+), 28 deletions(-)

-- 
2.21.3



* [PATCH 01/10] xfs: automatic relogging item management
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH 02/10] xfs: create helper for ticket-less log res ungrant Brian Foster
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Add a log item flag to track relog state and a couple helpers to set
and clear the flag. The flag will be set on any log item that is to
be automatically relogged by log tail pressure.
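
As a usage sketch (illustrative only; "lip" stands for any reloggable
log item and "tp" for the enabling transaction, neither taken from
this series), a long-running operation would bracket its work with the
new helpers:

	/* request automatic relogging of the item under tail pressure */
	xfs_trans_relog_item(tp, lip);

	/* ... long-running, multi-transaction work ... */

	/* drop the request; the item can be written back normally again */
	xfs_trans_relog_item_cancel(tp, lip);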

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_trace.h      |  2 ++
 fs/xfs/xfs_trans.c      | 20 ++++++++++++++++++++
 fs/xfs/xfs_trans.h      |  4 +++-
 fs/xfs/xfs_trans_priv.h |  2 ++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 460136628a79..f6fd598c3912 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1068,6 +1068,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
+DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3c94e5ff4316..5190b792cc68 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -651,6 +651,26 @@ xfs_trans_del_item(
 	list_del_init(&lip->li_trans);
 }
 
+void
+xfs_trans_relog_item(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	if (test_and_set_bit(XFS_LI_RELOG, &lip->li_flags))
+		return;
+	trace_xfs_relog_item(lip);
+}
+
+void
+xfs_trans_relog_item_cancel(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
+		return;
+	trace_xfs_relog_item_cancel(lip);
+}
+
 /* Detach and unlock all of the items in a transaction */
 static void
 xfs_trans_free_items(
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e40..6349e78af002 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -60,13 +60,15 @@ struct xfs_log_item {
 #define	XFS_LI_FAILED	2
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
 #define	XFS_LI_RECOVERED 4	/* log intent item has been recovered */
+#define	XFS_LI_RELOG	5	/* automatically relog item */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
 	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
 	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
-	{ (1 << XFS_LI_RECOVERED),	"RECOVERED" }
+	{ (1 << XFS_LI_RECOVERED),	"RECOVERED" }, \
+	{ (1 << XFS_LI_RELOG),		"RELOG" }
 
 struct xfs_item_ops {
 	unsigned flags;
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 3004aeac9110..64965a861346 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -16,6 +16,8 @@ struct xfs_log_vec;
 void	xfs_trans_init(struct xfs_mount *);
 void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
+void	xfs_trans_relog_item(struct xfs_trans *, struct xfs_log_item *);
+void	xfs_trans_relog_item_cancel(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
 void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
-- 
2.21.3



* [PATCH 02/10] xfs: create helper for ticket-less log res ungrant
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
  2020-07-01 16:51 ` [PATCH 01/10] xfs: automatic relogging item management Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH 03/10] xfs: extra runtime reservation overhead for relog transactions Brian Foster
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Log reservation is currently acquired and released via log tickets.
The relog mechanism introduces behavior where relog reservation is
transferred between transaction log tickets and an external pool of
relog reservation for active relog items. Certain contexts will be
able to release outstanding relog reservation without the need for a
log ticket. Factor out a helper to allow byte-granularity log
reservation ungrant.
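
A toy userspace model of the new helper's effect (illustrative only;
none of these names exist in the kernel) is below. Both grant heads
track granted log space, so returning reservation must credit both
before waking any waiters:

	#include <stdio.h>

	struct toy_log {
		long	reserve_grant;	/* space granted to reservations */
		long	write_grant;	/* space granted to log writes */
	};

	/* models xfs_log_ungrant_bytes() */
	static void toy_ungrant_bytes(struct toy_log *log, long bytes)
	{
		log->reserve_grant -= bytes;
		log->write_grant -= bytes;
		/* the kernel wakes blocked reservers here */
	}

	int main(void)
	{
		struct toy_log log = { 4096, 4096 };

		toy_ungrant_bytes(&log, 1024);
		printf("reserve=%ld write=%ld\n", log.reserve_grant,
		       log.write_grant);
		return 0;
	}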

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_log.c | 20 ++++++++++++++++----
 fs/xfs/xfs_log.h |  1 +
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00fda2e8e738..d6b63490a78b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2980,6 +2980,21 @@ xfs_log_ticket_regrant(
 	xfs_log_ticket_put(ticket);
 }
 
+/*
+ * Restore log reservation directly to the grant heads.
+ */
+void
+xfs_log_ungrant_bytes(
+	struct xfs_mount	*mp,
+	int			bytes)
+{
+	struct xlog		*log = mp->m_log;
+
+	xlog_grant_sub_space(log, &log->l_reserve_head.grant, bytes);
+	xlog_grant_sub_space(log, &log->l_write_head.grant, bytes);
+	xfs_log_space_wake(mp);
+}
+
 /*
  * Give back the space left from a reservation.
  *
@@ -3018,12 +3033,9 @@ xfs_log_ticket_ungrant(
 		bytes += ticket->t_unit_res*ticket->t_cnt;
 	}
 
-	xlog_grant_sub_space(log, &log->l_reserve_head.grant, bytes);
-	xlog_grant_sub_space(log, &log->l_write_head.grant, bytes);
-
+	xfs_log_ungrant_bytes(log->l_mp, bytes);
 	trace_xfs_log_ticket_ungrant_exit(log, ticket);
 
-	xfs_log_space_wake(log->l_mp);
 	xfs_log_ticket_put(ticket);
 }
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1412d6993f1e..6d2f30f42245 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -125,6 +125,7 @@ int	  xfs_log_reserve(struct xfs_mount *mp,
 			  uint8_t		   clientid,
 			  bool		   permanent);
 int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
+void	  xfs_log_ungrant_bytes(struct xfs_mount *mp, int bytes);
 void      xfs_log_unmount(struct xfs_mount *mp);
 int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
 
-- 
2.21.3



* [PATCH 03/10] xfs: extra runtime reservation overhead for relog transactions
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
  2020-07-01 16:51 ` [PATCH 01/10] xfs: automatic relogging item management Brian Foster
  2020-07-01 16:51 ` [PATCH 02/10] xfs: create helper for ticket-less log res ungrant Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH 04/10] xfs: relog log reservation stealing and accounting Brian Foster
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Every transaction reservation includes runtime overhead on top of
the reservation calculated in the struct xfs_trans_res. This
overhead is required for things like the CIL context ticket, log
headers, etc., that are stolen from individual transactions. Since
reservation for the relog transaction is entirely contributed by
regular transactions, this runtime reservation overhead must be
contributed as well. This means that a transaction that relogs one
or more items must include overhead for the current transaction as
well as for the relog transaction.

Define a new transaction flag to indicate that a transaction is
relog enabled. Plumb this state down to the log ticket allocation
and use it to bump the worst case overhead included in the
transaction. The overhead will eventually be transferred to the
relog system as needed for individual log items.
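
As a worked example (made-up numbers): if a transaction's calculated
reservation is unit_bytes = 10000 and xfs_log_calc_unit_res() returns
unit_res = 13000 once op headers, the CIL ticket and iclog split
overhead are included, then a relog-enabled ticket reserves:

	unit_res += (unit_res - unit_bytes);	/* 13000 + 3000 = 16000 */

i.e. the runtime overhead is counted twice, once for the current
transaction and once for the eventual relog transaction.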

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_shared.h |  1 +
 fs/xfs/xfs_log.c           | 12 +++++++++---
 fs/xfs/xfs_log.h           |  3 ++-
 fs/xfs/xfs_log_cil.c       |  2 +-
 fs/xfs/xfs_log_priv.h      |  1 +
 fs/xfs/xfs_trans.c         |  3 ++-
 6 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c45acbd3add9..1ede1e720a5c 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -65,6 +65,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
 #define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
+#define XFS_TRANS_RELOG		0x80	/* requires extra relog overhead */
 /*
  * LOWMODE is used by the allocator to activate the lowspace algorithm - when
  * free space is running low the extent allocator may choose to allocate an
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index d6b63490a78b..b55abde6c142 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -418,7 +418,8 @@ xfs_log_reserve(
 	int		 	cnt,
 	struct xlog_ticket	**ticp,
 	uint8_t		 	client,
-	bool			permanent)
+	bool			permanent,
+	bool			relog)
 {
 	struct xlog		*log = mp->m_log;
 	struct xlog_ticket	*tic;
@@ -433,7 +434,8 @@ xfs_log_reserve(
 	XFS_STATS_INC(mp, xs_try_logspace);
 
 	ASSERT(*ticp == NULL);
-	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0);
+	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, relog,
+				0);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
@@ -831,7 +833,7 @@ xlog_unmount_write(
 	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
 
-	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
+	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, false, false);
 	if (error)
 		goto out_err;
 
@@ -3421,6 +3423,7 @@ xlog_ticket_alloc(
 	int			cnt,
 	char			client,
 	bool			permanent,
+	bool			relog,
 	xfs_km_flags_t		alloc_flags)
 {
 	struct xlog_ticket	*tic;
@@ -3431,6 +3434,9 @@ xlog_ticket_alloc(
 		return NULL;
 
 	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
+	/* double the overhead for the relog transaction */
+	if (relog)
+		unit_res += (unit_res - unit_bytes);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 6d2f30f42245..f1089a4b299c 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -123,7 +123,8 @@ int	  xfs_log_reserve(struct xfs_mount *mp,
 			  int		   count,
 			  struct xlog_ticket **ticket,
 			  uint8_t		   clientid,
-			  bool		   permanent);
+			  bool		   permanent,
+			  bool		   relog);
 int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
 void	  xfs_log_ungrant_bytes(struct xfs_mount *mp, int bytes);
 void      xfs_log_unmount(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 9ed90368ab31..dfa25370f8af 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
 {
 	struct xlog_ticket *tic;
 
-	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0,
+	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, false, false,
 				KM_NOFS);
 
 	/*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 75a62870b63a..bcc3d7a9c2c9 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -465,6 +465,7 @@ xlog_ticket_alloc(
 	int		count,
 	char		client,
 	bool		permanent,
+	bool		relog,
 	xfs_km_flags_t	alloc_flags);
 
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 5190b792cc68..cfa9915523e1 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -174,6 +174,7 @@ xfs_trans_reserve(
 	 */
 	if (resp->tr_logres > 0) {
 		bool	permanent = false;
+		bool	relog	  = (tp->t_flags & XFS_TRANS_RELOG);
 
 		ASSERT(tp->t_log_res == 0 ||
 		       tp->t_log_res == resp->tr_logres);
@@ -196,7 +197,7 @@ xfs_trans_reserve(
 						resp->tr_logres,
 						resp->tr_logcount,
 						&tp->t_ticket, XFS_TRANSACTION,
-						permanent);
+						permanent, relog);
 		}
 
 		if (error)
-- 
2.21.3



* [PATCH 04/10] xfs: relog log reservation stealing and accounting
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (2 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH 03/10] xfs: extra runtime reservation overhead for relog transactions Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH 05/10] xfs: automatic log item relog mechanism Brian Foster
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

The transaction that eventually commits relog-enabled log items
requires log reservation like any other transaction. It is not safe
to acquire this reservation on demand because relogged items aren't
processed until they are likely at the tail of the log and must be
moved in order to free up log space. As such, a relog transaction
that blocks on log reservation is a likely deadlock vector.

To address this problem, implement a model where relog reservation
is contributed by the transaction that enables relogging on a
particular item. Update the relog helper to transfer reservation
from the transaction to the relog pool. The relog pool holds
outstanding reservation such that it can be used to commit the item
in an otherwise empty transaction. The upcoming relog mechanism is
responsible for replenishing the relog reservation as items are
relogged. When relog is cancelled on a log item, transfer the
outstanding relog reservation to the current transaction (if
provided) for eventual release or otherwise release it directly to
the grant heads.

Note that this approach has several caveats:

- Log reservation calculations for transactions that relog items
  must be increased accordingly.
- The current per-transaction overhead reservation (i.e. for
  things like the CIL ticket) must be included for each reloggable
  item because said items can be relogged in arbitrary combinations.
- Relog reservation must be based on the worst case requirement for
  a log item. This is not a concern for fixed size log items, such
  as most intents. Items with more granular logging capability, such
  as buffers, can have additional ranges dirtied after relogging has
  been enabled, and the relog subsystem must have enough reservation
  to accommodate the worst case.
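
A toy userspace model of the reservation handoff (illustrative only;
none of these names exist in XFS) shows the accounting invariant:
bytes leave the enabling transaction's ticket when relog is enabled,
and return to a ticket, or to the grant heads, when it is cancelled:

	#include <assert.h>

	struct toy_ticket { long curr_res; };
	struct toy_item   { long relog_res; };

	/* models xfs_trans_relog_item() stealing reservation */
	static void toy_relog_enable(struct toy_ticket *tic,
				     struct toy_item *ip, long nbytes)
	{
		tic->curr_res -= nbytes;
		ip->relog_res += nbytes;
	}

	/* models xfs_trans_relog_item_cancel() returning it */
	static void toy_relog_cancel(struct toy_ticket *tic,
				     struct toy_item *ip)
	{
		if (tic)
			tic->curr_res += ip->relog_res;
		/* else: release directly to the grant heads */
		ip->relog_res = 0;
	}

	int main(void)
	{
		struct toy_ticket tic = { 8192 };
		struct toy_item item = { 0 };

		toy_relog_enable(&tic, &item, 2048);
		assert(tic.curr_res == 6144 && item.relog_res == 2048);
		toy_relog_cancel(&tic, &item);
		assert(tic.curr_res == 8192 && item.relog_res == 0);
		return 0;
	}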

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log.c        |  3 +++
 fs/xfs/xfs_trans.c      | 20 ++++++++++++++++++++
 fs/xfs/xfs_trans.h      | 31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_ail.c  |  2 ++
 fs/xfs/xfs_trans_priv.h | 14 ++++++++++++++
 5 files changed, 70 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b55abde6c142..940e5bb9786c 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -983,6 +983,9 @@ xfs_log_item_init(
 	item->li_type = type;
 	item->li_ops = ops;
 	item->li_lv = NULL;
+#ifdef DEBUG
+	atomic64_set(&item->li_relog_res, 0);
+#endif
 
 	INIT_LIST_HEAD(&item->li_ail);
 	INIT_LIST_HEAD(&item->li_cil);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index cfa9915523e1..ba2540d8a6c9 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -20,6 +20,7 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
+#include "xfs_log_priv.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -657,9 +658,19 @@ xfs_trans_relog_item(
 	struct xfs_trans	*tp,
 	struct xfs_log_item	*lip)
 {
+	int			nbytes;
+
+	ASSERT(tp->t_flags & XFS_TRANS_RELOG);
+
 	if (test_and_set_bit(XFS_LI_RELOG, &lip->li_flags))
 		return;
 	trace_xfs_relog_item(lip);
+
+	nbytes = xfs_relog_calc_res(lip);
+
+	tp->t_ticket->t_curr_res -= nbytes;
+	xfs_relog_res_account(lip, nbytes);
+	tp->t_flags |= XFS_TRANS_DIRTY;
 }
 
 void
@@ -667,9 +678,18 @@ xfs_trans_relog_item_cancel(
 	struct xfs_trans	*tp,
 	struct xfs_log_item	*lip)
 {
+	int			res;
+
 	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
 		return;
 	trace_xfs_relog_item_cancel(lip);
+
+	res = xfs_relog_calc_res(lip);
+	if (tp)
+		tp->t_ticket->t_curr_res += res;
+	else
+		xfs_log_ungrant_bytes(lip->li_mountp, res);
+	xfs_relog_res_account(lip, -res);
 }
 
 /* Detach and unlock all of the items in a transaction */
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 6349e78af002..70373e2b8f6d 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -48,6 +48,9 @@ struct xfs_log_item {
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
 	xfs_lsn_t			li_seq;		/* CIL commit seq */
+#ifdef DEBUG
+	atomic64_t			li_relog_res;	/* automatic relog log res */
+#endif
 };
 
 /*
@@ -216,6 +219,34 @@ xfs_trans_read_buf(
 				      flags, bpp, ops);
 }
 
+/*
+ * Calculate the log reservation required to enable relogging of a log item.
+ */
+static inline int
+xfs_relog_calc_res(
+	struct xfs_log_item	*lip)
+{
+	int			niovecs = 0;
+	int			nbytes = 0;
+
+	/*
+	 * The reservation consumed by a transaction at commit time consists of
+	 * the total size of the formatted log vectors of the items dirtied by
+	 * the transaction, an op header for each iovec in the log vectors, the
+	 * unit reservation of the CIL context ticket, and extra iclog and op
+	 * headers if the CIL context spans multiple iclogs (i.e. split
+	 * reservation). The CIL ticket and split reservation are included by
+	 * xfs_log_calc_unit_res().
+	 */
+	lip->li_ops->iop_size(lip, &niovecs, &nbytes);
+	ASSERT(niovecs == 1);
+
+	nbytes += niovecs * sizeof(xlog_op_header_t);
+	nbytes = xfs_log_calc_unit_res(lip->li_mountp, nbytes);
+
+	return nbytes;
+}
+
 struct xfs_buf	*xfs_trans_getsb(xfs_trans_t *, struct xfs_mount *);
 
 void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac5019361a13..5c862821171f 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -894,6 +894,7 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	atomic64_set(&ailp->ail_relog_res, 0);
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_super->s_id);
@@ -914,6 +915,7 @@ xfs_trans_ail_destroy(
 {
 	struct xfs_ail	*ailp = mp->m_ail;
 
+	ASSERT(atomic64_read(&ailp->ail_relog_res) == 0);
 	kthread_stop(ailp->ail_task);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 64965a861346..d923e79676af 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -63,6 +63,7 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	atomic64_t		ail_relog_res;
 };
 
 /*
@@ -169,4 +170,17 @@ xfs_set_li_failed(
 	}
 }
 
+static inline int64_t
+xfs_relog_res_account(
+	struct xfs_log_item	*lip,
+	int64_t			bytes)
+{
+#ifdef DEBUG
+	int64_t			res;
+
+	res = atomic64_add_return(bytes, &lip->li_relog_res);
+	ASSERT(res == bytes || (bytes < 0 && res == 0));
+#endif
+	return atomic64_add_return(bytes, &lip->li_ailp->ail_relog_res);
+}
 #endif	/* __XFS_TRANS_PRIV_H__ */
-- 
2.21.3



* [PATCH 05/10] xfs: automatic log item relog mechanism
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (3 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH 04/10] xfs: relog log reservation stealing and accounting Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-03  6:08   ` Dave Chinner
  2020-07-01 16:51 ` [PATCH 06/10] xfs: automatically relog the quotaoff start intent Brian Foster
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Now that relog reservation is available and relog state tracking is
in place, all that remains to automatically relog items is the relog
mechanism itself. An item with relogging enabled is effectively
pinned and excluded from writeback until relog is disabled. Instead
of being written back, the item must be periodically committed in a
new transaction to move it forward in the physical log. The purpose of
moving the item is to avoid long term tail pinning and thus avoid
log deadlocks for long running operations.

The ideal time to relog an item is in response to tail pushing
pressure. This accommodates the current workload at any given time
as opposed to a fixed time interval or log reservation heuristic,
which risks performance regression. This is essentially the same
heuristic that drives metadata writeback. XFS already implements
various log tail pushing heuristics that attempt to keep the log
progressing on an active filesystem under various workloads.

The act of relogging an item simply requires adding it to a
transaction and committing. This pushes the already dirty item into a
subsequent log checkpoint and frees up its previous location in the
on-disk log. Joining an item to a transaction of course requires
locking the item first, which means we have to be aware of
type-specific locks and lock ordering wherever the relog takes
place.
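
In code terms, a single relog boils down to something like the
following (illustrative sketch using helpers that exist in this
series; the real work happens via the ->iop_relog() callbacks and the
relog transaction introduced below, with error handling elided):

	/* lock the item per its type-specific rules first */
	xfs_trans_add_item(tp, lip);
	set_bit(XFS_LI_DIRTY, &lip->li_flags);
	tp->t_flags |= XFS_TRANS_DIRTY;
	error = xfs_trans_roll(&tp);	/* commit into a new checkpoint */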

Fundamentally, this points to xfsaild as the ideal location to
process relog enabled items. xfsaild already processes log resident
items, is driven by log tail pushing pressure, processes arbitrary
log item types through callbacks, and is sensitive to type-specific
locking rules by design. The fact that automatic relogging
essentially diverts items between writeback and relog also suggests
xfsaild as an ideal location to process items one way or the other.

Of course, we don't want xfsaild to process transactions as it is a
critical component of the log subsystem for driving metadata
writeback and freeing up log space. Therefore, similar to how
xfsaild builds up a writeback queue of dirty items and queues writes
asynchronously, make xfsaild responsible only for directing pending
relog items into an appropriate queue and create an async
(workqueue) context for processing the queue. The workqueue context
utilizes the pre-reserved log reservation to drain the queue by
rolling a permanent transaction.

Update the AIL pushing infrastructure to support a new RELOG item
state. If a log item push returns the relog state, queue the item
for relog instead of writeback. On completion of a push cycle,
schedule the relog task at the same point metadata buffer I/O is
submitted. This allows items to be relogged automatically under the
same locking rules and pressure heuristics that govern metadata
writeback.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |   2 +
 fs/xfs/xfs_trans.c      |  13 ++++-
 fs/xfs/xfs_trans.h      |   6 ++-
 fs/xfs/xfs_trans_ail.c  | 117 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans_priv.h |  14 ++++-
 5 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f6fd598c3912..1f81a47c7f6d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1068,6 +1068,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog_queue);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ba2540d8a6c9..310beaccbc4c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -676,7 +676,8 @@ xfs_trans_relog_item(
 void
 xfs_trans_relog_item_cancel(
 	struct xfs_trans	*tp,
-	struct xfs_log_item	*lip)
+	struct xfs_log_item	*lip,
+	bool			wait)
 {
 	int			res;
 
@@ -684,6 +685,15 @@ xfs_trans_relog_item_cancel(
 		return;
 	trace_xfs_relog_item_cancel(lip);
 
+	/*
+	 * Must wait on active relog to complete before reclaiming reservation.
+	 * Currently a big hammer because the QUEUED state isn't cleared until
+	 * AIL (re)insertion. A separate state might be warranted.
+	 */
+	while (wait && wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOG_QUEUED,
+					   TASK_UNINTERRUPTIBLE, HZ))
+		xfs_log_force(lip->li_mountp, XFS_LOG_SYNC);
+
 	res = xfs_relog_calc_res(lip);
 	if (tp)
 		tp->t_ticket->t_curr_res += res;
@@ -777,6 +787,7 @@ xfs_trans_committed_bulk(
 
 		if (aborted)
 			set_bit(XFS_LI_ABORTED, &lip->li_flags);
+		clear_and_wake_up_bit(XFS_LI_RELOG_QUEUED, &lip->li_flags);
 
 		if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
 			lip->li_ops->iop_release(lip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 70373e2b8f6d..7f409b0d456a 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -64,6 +64,7 @@ struct xfs_log_item {
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
 #define	XFS_LI_RECOVERED 4	/* log intent item has been recovered */
 #define	XFS_LI_RELOG	5	/* automatically relog item */
+#define XFS_LI_RELOG_QUEUED 6	/* queued for relog */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
@@ -71,7 +72,8 @@ struct xfs_log_item {
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
 	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
 	{ (1 << XFS_LI_RECOVERED),	"RECOVERED" }, \
-	{ (1 << XFS_LI_RELOG),		"RELOG" }
+	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
+	{ (1 << XFS_LI_RELOG_QUEUED),	"RELOG_QUEUED" }
 
 struct xfs_item_ops {
 	unsigned flags;
@@ -86,6 +88,7 @@ struct xfs_item_ops {
 	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
 	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
+	void (*iop_relog)(struct xfs_log_item *, struct xfs_trans *);
 };
 
 /*
@@ -104,6 +107,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED		1
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
+#define XFS_ITEM_RELOG		4
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 5c862821171f..6c4d219801a6 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -152,6 +153,88 @@ xfs_ail_max_lsn(
 	return lsn;
 }
 
+/*
+ * Relog log items on the AIL relog queue.
+ *
+ * Note that relog is incompatible with filesystem freeze due to the
+ * multi-transaction nature of its users. The freeze sequence blocks all
+ * transactions and expects to drain the AIL. Allowing the relog transaction to
+ * proceed while freeze is in progress is not sufficient because it is not
+ * responsible for cancellation of relog state. The higher level operations must
+ * be guaranteed to progress to completion before the AIL can be drained of
+ * relog enabled items. This is currently accomplished by holding
+ * ->s_umount (quotaoff) or superblock write references (scrub) across the high
+ * level operations that depend on relog.
+ */
+static void
+xfs_ail_relog(
+	struct work_struct	*work)
+{
+	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
+						     ail_relog_work);
+	struct xfs_mount	*mp = ailp->ail_mount;
+	struct xfs_trans_res	tres = {};
+	struct xfs_trans	*tp;
+	struct xfs_log_item	*lip, *lipp;
+	int			error;
+	LIST_HEAD(relog_list);
+
+	/*
+	 * Open code allocation of an empty transaction and log ticket. The
+	 * ticket requires no initial reservation because the all outstanding
+	 * ticket requires no initial reservation because all outstanding
+	 */
+	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
+	if (error)
+		goto out;
+	ASSERT(tp && !tp->t_ticket);
+	tp->t_flags |= XFS_TRANS_PERM_LOG_RES;
+	tp->t_log_count = 1;
+	tp->t_ticket = xlog_ticket_alloc(mp->m_log, 0, 1, XFS_TRANSACTION,
+					 true, false, 0);
+	/* reset to zero to undo res overhead calculation on ticket alloc */
+	tp->t_ticket->t_curr_res = 0;
+	tp->t_ticket->t_unit_res = 0;
+	tp->t_log_res = 0;
+
+	spin_lock(&ailp->ail_lock);
+	while (!list_empty(&ailp->ail_relog_list)) {
+		list_splice_init(&ailp->ail_relog_list, &relog_list);
+		spin_unlock(&ailp->ail_lock);
+
+		list_for_each_entry_safe(lip, lipp, &relog_list, li_trans) {
+			list_del_init(&lip->li_trans);
+
+			trace_xfs_ail_relog(lip);
+			ASSERT(lip->li_ops->iop_relog);
+			if (lip->li_ops->iop_relog)
+				lip->li_ops->iop_relog(lip, tp);
+		}
+
+		error = xfs_trans_roll(&tp);
+		if (error) {
+			xfs_trans_cancel(tp);
+			goto out;
+		}
+
+		/*
+		 * Now that the transaction has rolled, reset the ticket to
+		 * zero to reflect that the log reservation held by the
+		 * attached items has been replenished.
+		 */
+		tp->t_ticket->t_curr_res = 0;
+		tp->t_ticket->t_unit_res = 0;
+		tp->t_log_res = 0;
+
+		spin_lock(&ailp->ail_lock);
+	}
+	spin_unlock(&ailp->ail_lock);
+	xfs_trans_cancel(tp);
+
+out:
+	ASSERT(!error || XFS_FORCED_SHUTDOWN(mp));
+}
+
 /*
  * The cursor keeps track of where our current traversal is up to by tracking
  * the next item in the list for us. However, for this to be safe, removing an
@@ -413,7 +496,7 @@ static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
 {
-	xfs_mount_t		*mp = ailp->ail_mount;
+	struct xfs_mount	*mp = ailp->ail_mount;
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	xfs_lsn_t		lsn;
@@ -475,6 +558,23 @@ xfsaild_push(
 			ailp->ail_last_pushed_lsn = lsn;
 			break;
 
+		case XFS_ITEM_RELOG:
+			/*
+			 * The item requires a relog. Add to the relog queue
+			 * and set a bit to prevent further relog requests
+			 * until AIL reinsertion.
+			 */
+			if (test_and_set_bit(XFS_LI_RELOG_QUEUED,
+					     &lip->li_flags)) {
+				ailp->ail_log_flush++;
+				break;
+			}
+
+			trace_xfs_ail_relog_queue(lip);
+			ASSERT(list_empty(&lip->li_trans));
+			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
+			break;
+
 		case XFS_ITEM_FLUSHING:
 			/*
 			 * The item or its backing buffer is already being
@@ -541,6 +641,9 @@ xfsaild_push(
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;
 
+	if (!list_empty(&ailp->ail_relog_list))
+		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
+
 	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
 		/*
@@ -894,16 +997,25 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	INIT_LIST_HEAD(&ailp->ail_relog_list);
+	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
 	atomic64_set(&ailp->ail_relog_res, 0);
 
+	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
+					     mp->m_super->s_id);
+	if (!ailp->ail_relog_wq)
+		goto out_free_ailp;
+
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_super->s_id);
 	if (IS_ERR(ailp->ail_task))
-		goto out_free_ailp;
+		goto out_destroy_wq;
 
 	mp->m_ail = ailp;
 	return 0;
 
+out_destroy_wq:
+	destroy_workqueue(ailp->ail_relog_wq);
 out_free_ailp:
 	kmem_free(ailp);
 	return -ENOMEM;
@@ -917,5 +1029,6 @@ xfs_trans_ail_destroy(
 
 	ASSERT(atomic64_read(&ailp->ail_relog_res) == 0);
 	kthread_stop(ailp->ail_task);
+	destroy_workqueue(ailp->ail_relog_wq);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d923e79676af..6c15a4f39b68 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -17,7 +17,8 @@ void	xfs_trans_init(struct xfs_mount *);
 void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
 void	xfs_trans_relog_item(struct xfs_trans *, struct xfs_log_item *);
-void	xfs_trans_relog_item_cancel(struct xfs_trans *, struct xfs_log_item *);
+void	xfs_trans_relog_item_cancel(struct xfs_trans *, struct xfs_log_item *,
+				    bool wait);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
 void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
@@ -64,6 +65,9 @@ struct xfs_ail {
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
 	atomic64_t		ail_relog_res;
+	struct work_struct	ail_relog_work;
+	struct list_head	ail_relog_list;
+	struct workqueue_struct	*ail_relog_wq;
 };
 
 /*
@@ -85,6 +89,14 @@ xfs_ail_min(
 					li_ail);
 }
 
+static inline bool
+xfs_item_needs_relog(
+	struct xfs_log_item	*lip)
+{
+	return test_bit(XFS_LI_RELOG, &lip->li_flags) &&
+	       !test_bit(XFS_LI_RELOG_QUEUED, &lip->li_flags);
+}
+
 static inline void
 xfs_trans_ail_update(
 	struct xfs_ail		*ailp,
-- 
2.21.3



* [PATCH 06/10] xfs: automatically relog the quotaoff start intent
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (4 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH 05/10] xfs: automatic log item relog mechanism Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH 07/10] xfs: prevent fs freeze with outstanding relog items Brian Foster
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

The quotaoff operation has a rare but longstanding deadlock vector
in terms of how the operation is logged. A quotaoff start intent is
logged (synchronously) at the onset to ensure recovery can handle
the operation if interrupted before in-core changes are made. This
quotaoff intent pins the log tail while the quotaoff sequence scans
and purges dquots from all in-core inodes. While this operation
generally doesn't generate much log traffic on its own, it can be
time consuming. If unrelated, concurrent filesystem activity
consumes remaining log space before quotaoff is able to acquire log
reservation for the quotaoff end intent, the filesystem locks up
indefinitely.

quotaoff cannot allocate the end intent before the scan because the
latter can result in transaction allocation itself in certain
indirect cases (releasing an inode, for example). Further, rolling
the original transaction is difficult because the scanning work
occurs multiple layers down where caller context is lost and not
much information is available to determine how often to roll the
transaction.

To address this problem, enable automatic relogging of the quotaoff
start intent. This automatically relogs the intent whenever AIL
pushing finds the item at the tail of the log. When quotaoff
completes, wait for relogging to complete as the end intent expects
to be able to permanently remove the start intent from the log
subsystem. This ensures that the log tail is kept moving during a
particularly long quotaoff operation and avoids the log reservation
deadlock.

Note that the quotaoff reservation calculation does not need to be
updated for relog as it already (incorrectly) accounts for two
quotaoff intents.
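
Condensed, the resulting start/end flow looks like the following
(illustrative sketch stitched together from the hunks below, with
error handling elided and "startqoff" naming the start intent carried
between the two transactions):

	/* start: log the intent and enable relogging on it */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
				XFS_TRANS_RELOG, &tp);
	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
	xfs_trans_log_quotaoff_item(tp, qoffi);
	xfs_trans_relog_item(tp, &qoffi->qql_item);
	/* ... commit; the long dquot scan/purge runs with the tail moving ... */

	/* end: drain relog activity, then log the end intent */
	xfs_trans_relog_item_cancel(tp, &startqoff->qql_item, true);
	qoffi = xfs_trans_get_qoff_item(tp, startqoff, flags & XFS_ALL_QUOTA_ACCT);
	xfs_trans_log_quotaoff_item(tp, qoffi);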

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_dquot_item.c  | 26 ++++++++++++++++++++++++--
 fs/xfs/xfs_qm_syscalls.c | 12 +++++++++++-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 349c92d26570..86dcb6932aab 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -17,6 +17,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_qm.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 static inline struct xfs_dq_logitem *DQUOT_ITEM(struct xfs_log_item *lip)
 {
@@ -275,14 +276,17 @@ xfs_qm_qoff_logitem_format(
 }
 
 /*
- * There isn't much you can do to push a quotaoff item.  It is simply
- * stuck waiting for the log to be flushed to disk.
+ * The quotaoff log item is stuck in the log until quotaoff completes. Either
+ * relog it to keep the tail moving or consider it locked.
  */
 STATIC uint
 xfs_qm_qoff_logitem_push(
 	struct xfs_log_item	*lip,
 	struct list_head	*buffer_list)
 {
+
+	if (xfs_item_needs_relog(lip))
+		return XFS_ITEM_RELOG;
 	return XFS_ITEM_LOCKED;
 }
 
@@ -314,6 +318,23 @@ xfs_qm_qoff_logitem_release(
 	}
 }
 
+STATIC void
+xfs_qm_qoff_logitem_relog(
+	struct xfs_log_item	*lip,
+	struct xfs_trans	*tp)
+{
+	int			res;
+
+	res = xfs_relog_calc_res(lip);
+
+	xfs_trans_add_item(tp, lip);
+	tp->t_ticket->t_curr_res += res;
+	tp->t_ticket->t_unit_res += res;
+	tp->t_log_res += res;
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &lip->li_flags);
+}
+
 static const struct xfs_item_ops xfs_qm_qoffend_logitem_ops = {
 	.iop_size	= xfs_qm_qoff_logitem_size,
 	.iop_format	= xfs_qm_qoff_logitem_format,
@@ -327,6 +348,7 @@ static const struct xfs_item_ops xfs_qm_qoff_logitem_ops = {
 	.iop_format	= xfs_qm_qoff_logitem_format,
 	.iop_push	= xfs_qm_qoff_logitem_push,
 	.iop_release	= xfs_qm_qoff_logitem_release,
+	.iop_relog	= xfs_qm_qoff_logitem_relog,
 };
 
 /*
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 7effd7a28136..5602ed2b7e8d 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -18,6 +18,7 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_icache.h"
+#include "xfs_trans_priv.h"
 
 STATIC int
 xfs_qm_log_quotaoff(
@@ -29,12 +30,14 @@ xfs_qm_log_quotaoff(
 	int			error;
 	struct xfs_qoff_logitem	*qoffi;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
+				XFS_TRANS_RELOG, &tp);
 	if (error)
 		goto out;
 
 	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
 	xfs_trans_log_quotaoff_item(tp, qoffi);
+	xfs_trans_relog_item(tp, &qoffi->qql_item);
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
@@ -71,6 +74,13 @@ xfs_qm_log_quotaoff_end(
 	if (error)
 		return error;
 
+	/*
+	 * startqoff must be in the AIL and not the CIL when the end intent
+	 * commits to ensure it is not readded to the AIL out of order. Wait on
+	 * relog activity to drain to isolate startqoff to the AIL.
+	 */
+	xfs_trans_relog_item_cancel(tp, &(*startqoff)->qql_item, true);
+
 	qoffi = xfs_trans_get_qoff_item(tp, *startqoff,
 					flags & XFS_ALL_QUOTA_ACCT);
 	xfs_trans_log_quotaoff_item(tp, qoffi);
-- 
2.21.3



* [PATCH 07/10] xfs: prevent fs freeze with outstanding relog items
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (5 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH 06/10] xfs: automatically relog the quotaoff start intent Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH RFC 08/10] xfs: buffer relogging support prototype Brian Foster
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

The automatic relog mechanism is currently incompatible with
filesystem freeze in a generic sense. Freeze protection is currently
implemented via locks that cannot be held on return to userspace,
which means we can't hold a superblock write reference for the
duration that relogging is enabled on an item. It's too late to block
when the freeze sequence calls into the filesystem because the
transaction subsystem has already begun to be frozen. Not only can
this block the relog transaction, but blocking any unrelated
transaction essentially prevents a particular operation from
progressing to the point where it can disable relogging on an item.
Therefore marking the relog transaction as "nowrite" does not solve
the problem.

This is not a problem in practice because the two primary use cases
already exclude freeze via other means. quotaoff holds ->s_umount
read locked across the operation and scrub explicitly takes a
superblock write reference, both of which block freeze of the
transaction subsystem for the duration of relog enabled items.

As a fallback for future use cases and the upcoming random buffer
relogging test code, fail fs freeze attempts when the global relog
reservation counter is elevated to prevent deadlock. This is a
partial punt of the problem as compared to teaching freeze to wait
on relogged items because the only current dependency is test code.
In other words, this patch prevents deadlock if a user happens to
issue a freeze in conjunction with random buffer relog injection.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
---
 fs/xfs/xfs_super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 379cbff438bc..f77af5298a80 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,7 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_trans_priv.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -914,6 +915,9 @@ xfs_fs_freeze(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	if (WARN_ON_ONCE(atomic64_read(&mp->m_ail->ail_relog_res)))
+		return -EAGAIN;
+
 	xfs_stop_block_reaping(mp);
 	xfs_save_resvblks(mp);
 	xfs_quiesce_attr(mp);
-- 
2.21.3



* [PATCH RFC 08/10] xfs: buffer relogging support prototype
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (6 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH 07/10] xfs: prevent fs freeze with outstanding relog items Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH RFC 09/10] xfs: create an error tag for random relog reservation Brian Foster
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Implement buffer relogging support. There is currently no use case
for buffer relogging. This is for testing and experimental purposes
and serves as an example to demonstrate the ability to relog
arbitrary items in the future, if necessary.

Add helpers to manage relogged buffers, update the buffer log item
push handler to support relogged BLIs and add a log item relog
callback to properly join buffers to the relog transaction. Note
that buffers associated with higher level log items (i.e., inodes
and dquots) are skipped.
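
A hypothetical caller (not part of this series) that holds a locked,
dirty buffer in an XFS_TRANS_RELOG transaction would use the new
helpers roughly as follows:

	if (xfs_trans_relog_buf(tp, bp)) {
		/* bp is pinned in core and relogged by the AIL from now on */
	}

	/* ... later, once the buffer may be written back normally ... */
	xfs_trans_relog_buf_cancel(tp, bp);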

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c       |  4 +++
 fs/xfs/xfs_buf_item.c  | 60 ++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_trans.h     |  5 +++-
 fs/xfs/xfs_trans_buf.c | 66 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 20b748f7e186..eec482204336 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -16,6 +16,8 @@
 #include "xfs_log.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
+#include "xfs_trans.h"
+#include "xfs_buf_item.h"
 
 static kmem_zone_t *xfs_buf_zone;
 
@@ -1500,6 +1502,8 @@ __xfs_buf_submit(
 	trace_xfs_buf_submit(bp, _RET_IP_);
 
 	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+	ASSERT(!bp->b_log_item ||
+	       !test_bit(XFS_LI_RELOG, &bp->b_log_item->bli_item.li_flags));
 
 	/* on shutdown we stale and complete the buffer immediately */
 	if (XFS_FORCED_SHUTDOWN(bp->b_mount)) {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9e75e8d6042e..eb827a31b47f 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -16,7 +16,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
-
+#include "xfs_log_priv.h"
 
 kmem_zone_t	*xfs_buf_item_zone;
 
@@ -141,7 +141,6 @@ xfs_buf_item_size(
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	int			i;
 
-	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
 		 * The buffer is stale, so all we need to log
@@ -157,7 +156,7 @@ xfs_buf_item_size(
 		return;
 	}
 
-	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
+	ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
@@ -418,6 +417,10 @@ xfs_buf_item_unpin(
 
 	trace_xfs_buf_item_unpin(bip);
 
+	/* cancel relogging on abort before we drop the bli reference */
+	if (remove)
+		xfs_trans_relog_buf_cancel(NULL, bp);
+
 	freed = atomic_dec_and_test(&bip->bli_refcount);
 
 	if (atomic_dec_and_test(&bp->b_pin_count))
@@ -462,6 +465,13 @@ xfs_buf_item_unpin(
 			list_del_init(&bp->b_li_list);
 			bp->b_iodone = NULL;
 		} else {
+			/* racy */
+			ASSERT(!test_bit(XFS_LI_RELOG_QUEUED, &lip->li_flags));
+			if (test_bit(XFS_LI_RELOG, &lip->li_flags)) {
+				atomic_dec(&bp->b_pin_count);
+				xfs_trans_relog_item_cancel(NULL, lip, true);
+			}
+
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
 			ASSERT(bp->b_log_item == NULL);
@@ -488,8 +498,6 @@ xfs_buf_item_push(
 	struct xfs_buf		*bp = bip->bli_buf;
 	uint			rval = XFS_ITEM_SUCCESS;
 
-	if (xfs_buf_ispinned(bp))
-		return XFS_ITEM_PINNED;
 	if (!xfs_buf_trylock(bp)) {
 		/*
 		 * If we have just raced with a buffer being pinned and it has
@@ -503,6 +511,15 @@ xfs_buf_item_push(
 		return XFS_ITEM_LOCKED;
 	}
 
+	/* relog bufs are pinned so check relog state first */
+	if (xfs_item_needs_relog(lip))
+		return XFS_ITEM_RELOG;
+
+	if (xfs_buf_ispinned(bp)) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_PINNED;
+	}
+
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
 
 	trace_xfs_buf_item_push(bip);
@@ -532,6 +549,7 @@ xfs_buf_item_put(
 	struct xfs_buf_log_item	*bip)
 {
 	struct xfs_log_item	*lip = &bip->bli_item;
+	struct xfs_buf		*bp = bip->bli_buf;
 	bool			aborted;
 	bool			dirty;
 
@@ -557,8 +575,10 @@ xfs_buf_item_put(
 	 * transaction that invalidated a dirty bli and cleared the dirty
 	 * state.
 	 */
-	if (aborted)
+	if (aborted) {
+		xfs_trans_relog_buf_cancel(NULL, bp);
 		xfs_trans_ail_delete(lip, 0);
+	}
 	xfs_buf_item_relse(bip->bli_buf);
 	return true;
 }
@@ -668,6 +688,28 @@ xfs_buf_item_committed(
 	return lsn;
 }
 
+STATIC void
+xfs_buf_item_relog(
+	struct xfs_log_item	*lip,
+	struct xfs_trans	*tp)
+{
+	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	int			res;
+
+	/*
+	 * Grab a reference to the buffer for the transaction before we join
+	 * and dirty it.
+	 */
+	xfs_buf_hold(bip->bli_buf);
+	xfs_trans_bjoin(tp, bip->bli_buf);
+	xfs_trans_dirty_buf(tp, bip->bli_buf);
+
+	res = xfs_relog_calc_res(lip);
+	tp->t_ticket->t_curr_res += res;
+	tp->t_ticket->t_unit_res += res;
+	tp->t_log_res += res;
+}
+
 static const struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_size	= xfs_buf_item_size,
 	.iop_format	= xfs_buf_item_format,
@@ -677,6 +719,7 @@ static const struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_committing	= xfs_buf_item_committing,
 	.iop_committed	= xfs_buf_item_committed,
 	.iop_push	= xfs_buf_item_push,
+	.iop_relog	= xfs_buf_item_relog,
 };
 
 STATIC void
@@ -930,6 +973,11 @@ STATIC void
 xfs_buf_item_free(
 	struct xfs_buf_log_item	*bip)
 {
+	ASSERT(!test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags));
+#ifdef DEBUG
+	ASSERT(!atomic64_read(&bip->bli_item.li_relog_res));
+#endif
+
 	xfs_buf_item_free_format(bip);
 	kmem_free(bip->bli_item.li_lv_shadow);
 	kmem_cache_free(xfs_buf_item_zone, bip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 7f409b0d456a..0262a883969f 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -243,7 +243,7 @@ xfs_relog_calc_res(
 	 * xfs_log_calc_unit_res().
 	 */
 	lip->li_ops->iop_size(lip, &niovecs, &nbytes);
-	ASSERT(niovecs == 1);
+	ASSERT(niovecs == 1 || lip->li_type == XFS_LI_BUF);
 
 	nbytes += niovecs * sizeof(xlog_op_header_t);
 	nbytes = xfs_log_calc_unit_res(lip->li_mountp, nbytes);
@@ -262,6 +262,9 @@ void		xfs_trans_inode_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_stale_inode_buf(xfs_trans_t *, struct xfs_buf *);
 bool		xfs_trans_ordered_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_dquot_buf(xfs_trans_t *, struct xfs_buf *, uint);
+bool		xfs_trans_relog_buf(struct xfs_trans *, struct xfs_buf *);
+void		xfs_trans_relog_buf_cancel(struct xfs_trans *,
+					   struct xfs_buf *);
 void		xfs_trans_inode_alloc_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_ichgtime(struct xfs_trans *, struct xfs_inode *, int);
 void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 08174ffa2118..b5b552a4bcfb 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -588,6 +588,8 @@ xfs_trans_binval(
 		return;
 	}
 
+	/* return relog res before we reset dirty state */
+	xfs_trans_relog_buf_cancel(tp, bp);
 	xfs_buf_stale(bp);
 
 	bip->bli_flags |= XFS_BLI_STALE;
@@ -787,3 +789,67 @@ xfs_trans_dquot_buf(
 
 	xfs_trans_buf_set_type(tp, bp, type);
 }
+
+/*
+ * Enable automatic relogging on a buffer. This essentially pins a dirty buffer
+ * in-core until relogging is disabled.
+ */
+bool
+xfs_trans_relog_buf(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+	enum xfs_blft		blft;
+
+	ASSERT(xfs_buf_islocked(bp));
+
+	if (bip->bli_flags & (XFS_BLI_ORDERED|XFS_BLI_STALE))
+		return false;
+	/*
+	 * Don't bother with queued buffers since we're about to pin them for an
+	 * indeterminate amount of time and we don't want the responsibility of
+	 * failing them if an abort happens to remove them from the AIL.
+	 */
+	if (bp->b_flags & _XBF_DELWRI_Q)
+		return false;
+
+	/*
+	 * Skip buffers with higher level log items. Those items must be
+	 * relogged directly to move in the log.
+	 */
+	blft = xfs_blft_from_flags(&bip->__bli_format);
+	switch (blft) {
+	case XFS_BLFT_DINO_BUF:
+	case XFS_BLFT_UDQUOT_BUF:
+	case XFS_BLFT_PDQUOT_BUF:
+	case XFS_BLFT_GDQUOT_BUF:
+		return false;
+	default:
+		break;
+	}
+
+	/*
+	 * Relog expects a worst case reservation from ->iop_size. Hack that in
+	 * here by logging the entire buffer in this transaction. Also grab a
+	 * buffer pin to prevent it from being written out.
+	 */
+	xfs_buf_item_log(bip, 0, BBTOB(bp->b_length) - 1);
+	atomic_inc(&bp->b_pin_count);
+	xfs_trans_relog_item(tp, &bip->bli_item);
+	return true;
+}
+
+void
+xfs_trans_relog_buf_cancel(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	if (!test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags))
+		return;
+
+	atomic_dec(&bp->b_pin_count);
+	xfs_trans_relog_item_cancel(tp, &bip->bli_item, false);
+}
-- 
2.21.3



* [PATCH RFC 09/10] xfs: create an error tag for random relog reservation
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (7 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH RFC 08/10] xfs: buffer relogging support prototype Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-01 16:51 ` [PATCH RFC 10/10] xfs: relog random buffers based on errortag Brian Foster
  2020-07-02 11:51 ` [PATCH 00/10] xfs: automatic relogging Dave Chinner
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Create an errortag to enable relogging on random transactions. Since
relogging requires extra transaction reservation, artificially bump
the reservation on selected transactions and tag them with the relog
flag such that the requisite reservation overhead is added by the
ticket allocation code. This allows subsequent random buffer relog
events to target transactions where reservation is included. This is
necessary to avoid transaction reservation overruns on non-relog
transactions.

Note that this does not yet enable relogging of any particular
items. The tag will be reused in a subsequent patch to enable random
buffer relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_errortag.h |  4 +++-
 fs/xfs/xfs_error.c           |  3 +++
 fs/xfs/xfs_trans.c           | 21 ++++++++++++++++-----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 53b305dea381..8f360cfc666c 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -56,7 +56,8 @@
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
 #define XFS_ERRTAG_BUF_IOERROR				35
-#define XFS_ERRTAG_MAX					36
+#define XFS_ERRTAG_RELOG				36
+#define XFS_ERRTAG_MAX					37
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -97,5 +98,6 @@
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 #define XFS_RANDOM_BUF_IOERROR				XFS_RANDOM_DEFAULT
+#define XFS_RANDOM_RELOG				XFS_RANDOM_DEFAULT
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 7f6e20899473..562e00f7dcf5 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -54,6 +54,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
 	XFS_RANDOM_IUNLINK_FALLBACK,
 	XFS_RANDOM_BUF_IOERROR,
+	XFS_RANDOM_RELOG,
 };
 
 struct xfs_errortag_attr {
@@ -164,6 +165,7 @@ XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
+XFS_ERRORTAG_ATTR_RW(relog,		XFS_ERRTAG_RELOG);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -202,6 +204,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
+	XFS_ERRORTAG_ATTR_LIST(relog),
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 310beaccbc4c..df94ca45c7c8 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -21,6 +21,7 @@
 #include "xfs_error.h"
 #include "xfs_defer.h"
 #include "xfs_log_priv.h"
+#include "xfs_errortag.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -176,9 +177,10 @@ xfs_trans_reserve(
 	if (resp->tr_logres > 0) {
 		bool	permanent = false;
 		bool	relog	  = (tp->t_flags & XFS_TRANS_RELOG);
+		int	logres = resp->tr_logres;
 
 		ASSERT(tp->t_log_res == 0 ||
-		       tp->t_log_res == resp->tr_logres);
+		       tp->t_log_res == logres);
 		ASSERT(tp->t_log_count == 0 ||
 		       tp->t_log_count == resp->tr_logcount);
 
@@ -194,9 +196,18 @@ xfs_trans_reserve(
 			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
 			error = xfs_log_regrant(mp, tp->t_ticket);
 		} else {
-			error = xfs_log_reserve(mp,
-						resp->tr_logres,
-						resp->tr_logcount,
+			/*
+			 * Enable relog overhead on random transactions to support
+			 * random item relogging.
+			 */
+			if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RELOG) &&
+			    !relog) {
+				tp->t_flags |= XFS_TRANS_RELOG;
+				relog = true;
+				logres <<= 1;
+			}
+
+			error = xfs_log_reserve(mp, logres, resp->tr_logcount,
 						&tp->t_ticket, XFS_TRANSACTION,
 						permanent, relog);
 		}
@@ -204,7 +215,7 @@ xfs_trans_reserve(
 		if (error)
 			goto undo_blocks;
 
-		tp->t_log_res = resp->tr_logres;
+		tp->t_log_res = logres;
 		tp->t_log_count = resp->tr_logcount;
 	}
 
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 10/10] xfs: relog random buffers based on errortag
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (8 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH RFC 09/10] xfs: create an error tag for random relog reservation Brian Foster
@ 2020-07-01 16:51 ` Brian Foster
  2020-07-02 11:51 ` [PATCH 00/10] xfs: automatic relogging Dave Chinner
  10 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-01 16:51 UTC (permalink / raw)
  To: linux-xfs

Since there is currently no specific use case for buffer relogging,
add some hacky and experimental code to relog random buffers when
the associated errortag is enabled. Use fixed termination logic
regardless of the user-specified error rate to help ensure that the
relog queue doesn't grow indefinitely.

Note that this patch was useful in causing log reservation deadlocks
on an fsstress workload if the relog mechanism code is modified to
acquire its own log reservation rather than rely on the
pre-reservation mechanism. In other words, this helps prove that the
relog reservation management code effectively avoids log reservation
deadlocks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c  |  1 +
 fs/xfs/xfs_trans.h     |  4 +++-
 fs/xfs/xfs_trans_ail.c | 33 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_buf.c | 14 ++++++++++++++
 4 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index eb827a31b47f..fb277187a2cf 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -469,6 +469,7 @@ xfs_buf_item_unpin(
 			ASSERT(!test_bit(XFS_LI_RELOG_QUEUED, &lip->li_flags));
 			if (test_bit(XFS_LI_RELOG, &lip->li_flags)) {
 				atomic_dec(&bp->b_pin_count);
+				clear_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
 				xfs_trans_relog_item_cancel(NULL, lip, true);
 			}
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 0262a883969f..18714e6af476 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -65,6 +65,7 @@ struct xfs_log_item {
 #define	XFS_LI_RECOVERED 4	/* log intent item has been recovered */
 #define	XFS_LI_RELOG	5	/* automatically relog item */
 #define XFS_LI_RELOG_QUEUED 6	/* queued for relog */
+#define XFS_LI_RELOG_RAND   7	/* randomly relogged via errortag */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
@@ -73,7 +74,8 @@ struct xfs_log_item {
 	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
 	{ (1 << XFS_LI_RECOVERED),	"RECOVERED" }, \
 	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
-	{ (1 << XFS_LI_RELOG_QUEUED),	"RELOG_QUEUED" }
+	{ (1 << XFS_LI_RELOG_QUEUED),	"RELOG_QUEUED" }, \
+	{ (1 << XFS_LI_RELOG_RAND),	"RELOG_RAND" }
 
 struct xfs_item_ops {
 	unsigned flags;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 6c4d219801a6..3a8a1abc6c4c 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -18,6 +18,7 @@
 #include "xfs_error.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
+#include "xfs_buf_item.h"
 
 #ifdef DEBUG
 /*
@@ -176,6 +177,7 @@ xfs_ail_relog(
 	struct xfs_trans_res	tres = {};
 	struct xfs_trans	*tp;
 	struct xfs_log_item	*lip, *lipp;
+	int			cancelres;
 	int			error;
 	LIST_HEAD(relog_list);
 
@@ -209,6 +211,37 @@ xfs_ail_relog(
 			ASSERT(lip->li_ops->iop_relog);
 			if (lip->li_ops->iop_relog)
 				lip->li_ops->iop_relog(lip, tp);
+
+			/*
+			 * Cancel random buffer relogs at a fixed rate to
+			 * prevent too much buildup.
+			 */
+			if (test_bit(XFS_LI_RELOG_RAND, &lip->li_flags) &&
+			    ((prandom_u32() & 1) ||
+			     (mp->m_flags & XFS_MOUNT_UNMOUNTING))) {
+				struct xfs_buf_log_item	*bli;
+				bli = container_of(lip, struct xfs_buf_log_item,
+						   bli_item);
+				xfs_trans_relog_buf_cancel(tp, bli->bli_buf);
+			}
+		}
+
+		/*
+		 * Cancelling relog reservation in the same transaction as
+		 * consuming it means the current transaction over releases
+		 * reservation on commit and the next transaction reservation
+		 * restores the grant heads to even. To avoid this behavior,
+		 * remove surplus reservation (->t_curr_res) from the committing
+		 * transaction and replace it with a reduction in the
+		 * reservation requirement (->t_unit_res) for the next. This has
+		 * no net effect on reservation accounting, but ensures we don't
+		 * cause problems elsewhere with odd reservation behavior.
+		 */
+		cancelres = tp->t_ticket->t_curr_res - tp->t_ticket->t_unit_res;
+		if (cancelres) {
+			tp->t_ticket->t_curr_res -= cancelres;
+			tp->t_ticket->t_unit_res -= cancelres;
+			tp->t_log_res -= cancelres;
 		}
 
 		error = xfs_trans_roll(&tp);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index b5b552a4bcfb..565386912e4d 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -14,6 +14,8 @@
 #include "xfs_buf_item.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 
 /*
  * Check to see if a buffer matching the given parameters is already
@@ -527,6 +529,17 @@ xfs_trans_log_buf(
 
 	trace_xfs_trans_log_buf(bip);
 	xfs_buf_item_log(bip, first, last);
+
+	/*
+	 * Relog random buffers so long as the transaction is relog enabled and
+	 * the buffer wasn't already relogged explicitly.
+	 */
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_RELOG) &&
+	    (tp->t_flags & XFS_TRANS_RELOG) &&
+	    !test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags)) {
+		if (xfs_trans_relog_buf(tp, bp))
+			set_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
+	}
 }
 
 
@@ -852,4 +865,5 @@ xfs_trans_relog_buf_cancel(
 
 	atomic_dec(&bp->b_pin_count);
 	xfs_trans_relog_item_cancel(tp, &bip->bli_item, false);
+	clear_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
 }
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
                   ` (9 preceding siblings ...)
  2020-07-01 16:51 ` [PATCH RFC 10/10] xfs: relog random buffers based on errortag Brian Foster
@ 2020-07-02 11:51 ` Dave Chinner
  2020-07-02 18:52   ` Brian Foster
  10 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-07-02 11:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> Hi all,
> 
> Here's a v1 (non-RFC) version of the automatic relogging functionality.
> Note that the buffer relogging bits (patches 8-10) are still RFC as I've
> had to hack around some things to utilize it for testing. I include them
> here mostly for reference/discussion. Most of the effort from the last
> rfc post has gone into testing and solidifying the functionality. This
> now survives a traditional fstests regression run as well as a test run
> with random buffer relogging enabled on every test/scratch device mount
> that occurs throughout the fstests cycle. The quotaoff use case is
> additionally tested independently by artificially delaying completion of
> the quotaoff in parallel with many fsstress worker threads.
> 
> The hacks/workarounds to support the random buffer relogging enabled
> fstests run are not included here because they are not associated with
> core functionality, but rather are side effects of randomly relogging
> arbitrary buffers, etc. I can work them into the buffer relogging
> patches if desired, but I'd like to get the core functionality and use
> case worked out before getting too far into the testing code. I also
> know Darrick was interested in the ->iop_relog() callback for some form
> of generic feedback into active dfops processing, so it might be worth
> exploring that further.
> 
> Thoughts, reviews, flames appreciated.

Ok I've looked through the code again, and again I've had to pause,
stop and think hard about it because the feeling I've had right from
the start about the automatic relogging concept is stronger than
ever.

I think the most constructive way to say what I'm feeling is that I
think this is the wrong approach to solve the quota off problem.
However, I've never been able to come up with an alternative that
also solved the quotaoff problem so I've tried to help make this
relogging concept work.

It's a very interesting experiment, but I've always had a nagging
doubt about putting transaction reservations both above and below
the AIL. In reading this version, I'm having trouble following and
understanding the transaction reservation juggling and
recalculation complexity that's been introduced to facilitate
the stealing that is being done. Yes, I know that I suggested the
dynamic stealing approach - it's certainly better than past
versions, but it hasn't really addressed my underlying doubt about
the relogging concept in general...

I have been spending some time recently in the quota code, so I have
a better grip on what it is doing now than I did last time I looked
at this relogging code. I never really questioned why the quota code
needed two transactions for quota-off, and I'm guessing that nobody
else has either. So I spent some time this morning understanding
what problem it was actually solving and trying to find an alternate
solution to that problem.

The reason we have the two quota-off transactions is that active
dquot modifications at the time quotaoff is started leak past the
first quota off transaction that hits the journal. Hence to avoid
incorrect replay of those modifications in the journal if we crash
after the quota-off item passes out of the journal, we pin the
quota-off item in the journal. It gets unpinned by the commit of the
second quota-off transaction at completion time, hence defining the
window in journal where quota-off is being processed and dquot
modifications should be ignored. i.e. there is no window where
recovery will replay dquot modifications incorrectly.
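
To picture the window (an illustration, not from the code):

  log tail                                                  log head
   ... | QOFF start | dquot mods that leaked past it | ... | QOFF end |
         ^ pinned until the second transaction commits, so recovery
           always finds it and knows to ignore those dquot changes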

However, if the second transaction is left too long, the reservation
will fail to find journal space because of the pinned quota-off item.

The relogging infrastructure is designed to allow the initial
quota-off intent to keep moving forward in the log so it never pins
the tail of the log before the second quota-off transaction is run.
This tries to avoid the recovery issue because there's always an
active quota off item in the log, but I think there may be a flaw
here.  When the quotaoff item gets relogged, it jumps all the dquots
in the log that were modified after the quota-off started. Hence if
we crash after the relogging but while the dquots are still in the
log before the relogged quotaoff item, then they will be replayed,
possibly incorrectly. i.e. the relogged quota-off item no longer
prevents replay of those items.

So while relogging prevents the tail pinning deadlock, I think it
may actually result in incorrect recovery behaviour in that items
that should be cancelled and not replayed can end up getting
replayed.  I'm not sure that this matters for dquots, but for a
general mechanism I think the transactional ordering violations it
can result in reduce its usefulness significantly.

But back to quota-off: What I've realised is that the only dquot
modifications we need to protect against being recovered are the
ones that are running at the time the first quota-off is committed
to the journal. That is, once the DQACTIVE flags are clear,
transactions will not modify those dquots anymore. Hence by the time
that the quota off item pins the tail of the log, the transactions
that were actively dirtying inodes when it was committed have also
committed and are in the journal and there are no actively modified
dquots left in memory.

IOWs, we don't actually need to wait until we've released and purged
all the dquots from memory before we log the second quota off item;
all we need to wait for is for all the transactions with dirty
dquots to have committed. These transactions already have log
reservations, so completing them will free unused reservation space
for the second quota off transaction. Once they are committed, then
we can log the second item. i.e. we don't have to wait until we've
cleaned up the dquots to close out the quota-off transaction in the
journal.

To make it even more robust, if we stop all the transactions that
may dirty dquots and drain the active ones before we log the first
quota-off item, we can log the second item immediately afterwards
because it is known that there are no dquot modifications in flight
when the first item is logged. We can probably even log both items
in the same transaction.

Recovery will behave correctly on all kernels, old and new, because
we've prevented dquots from landing in the journal after the first
quota-off item. Hence it looks to recovery simply like the fs was
idle when the quota-off was run...
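
Condensed to its core, the sequence is (a sketch reusing the names
from the patch at the end of this mail; error handling and locking
elided):

	/* barrier: no new dquot-modifying transactions may start */
	set_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);

	/* drain transactions already in flight that may dirty dquots */
	while (percpu_counter_sum(&q->qi_active_trans) > 0)
		msleep(250);

	/* get all committed dquot modifications into the journal */
	xfs_log_force(mp, XFS_LOG_SYNC);

	/* log start and end items back to back - nothing can intervene */
	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
	mp->m_qflags &= ~(inactivate_flags | flags);
	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);

	/* lift the barrier, then reap the now-unused in-memory dquots */
	clear_and_wake_up_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);
	xfs_qm_dqrele_all_inodes(mp, flags);
	xfs_qm_dqpurge_all(mp, dqtype);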

So, putting my money where my mouth is, the patch below does this.
It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
of -g quota with all quotas enabled and is currently running a full
auto cycle with all quotas enabled. It hasn't let the smoke out
after about 4 hours of testing now....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


[RFC] xfs: rework quota off process to avoid deadlocks

From: Dave Chinner <dchinner@redhat.com>

The quotaoff operation has a rare but longstanding deadlock vector
in terms of how the operation is logged. A quotaoff start intent is
logged (synchronously) at the onset to ensure recovery can handle
the operation if interrupted before in-core changes are made. This
quotaoff intent pins the log tail while the quotaoff sequence scans
and purges dquots from all in-core inodes. While this operation
generally doesn't generate much log traffic on its own, it can be
time consuming. If unrelated, concurrent filesystem activity
consumes remaining log space before quotaoff is able to acquire log
reservation for the quotaoff end intent, the filesystem locks up
indefinitely.

The problem that the quota-off intent/done logging setup is designed
to deal with is that when the quota-off intent is first logged,
there can be active transactions with modified dquots in them. These
cannot be aborted, and the result is that modified dquots end up in
the journal after the quota-off item has been recorded. Hence the
proposal to automatically relog the quota off intent so it doesn't
fall out of the log before the done intent can be written.

We can fix this another way: all we need to do is prevent dquots
from being logged to the journal after we log the quota off intent.
If we do this, then we can just log the quota off intent and the
intent done items in the same transaction and then tear down the
in memory dquots.

The current code prevents new dquots from being logged once the
quota-off has been logged simply by clearing the DQ_ACTIVE flags for
the quota types being turned off. Removing these flags means all future
transactions will see that the specific quota type is no longer
active and skip over it completely. This, however, requires
co-ordination with inode locks for it to behave correctly, and
hence this mechanism cannot be used to block transactions while we
wait for other transactions to drain out.

Hence add a new "quota off running" flag to the mount quota flags.
This bit can be set atomically, and we can use bit ops to wait on it
being cleared. ANd when we clear the bit we can issue a wakeups on
it. Hence adding a new bit gives us the mechanism to block
operations while quota-off is running.

To achieve this draining, we need to keep a count of the number of
active transactions that are modifying dquots. To do that, let's tag
transaction reservation structures for transactions that may modify
dquots with a new flag. We can look at that flag in
xfs_trans_alloc() - where it is safe to block - and do
all our quota-off run checks, blocking and accounting here.
We can use a single per-cpu counter to track dquot modifying
transactions in flight, so after setting the "quota off running"
flag all we need to do is wait for the counter to run to zero.

At this point, we can log the quota off items, then clear the
mount quota flags, and then detach and reap all the dquots that
remain in memory that will no longer be used. This can be done after
the quota-off end item is logged because nothing is going to be
modifying the dquots we need to reap at this point.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_quota_defs.h |   3 ++
 fs/xfs/libxfs/xfs_shared.h     |   2 +
 fs/xfs/libxfs/xfs_trans_resv.c |  28 +++++------
 fs/xfs/xfs_mount.h             |   2 +-
 fs/xfs/xfs_qm.c                |  24 ++++-----
 fs/xfs/xfs_qm.h                |   4 ++
 fs/xfs/xfs_qm_syscalls.c       |  95 ++++++++++++++++++------------------
 fs/xfs/xfs_trans.c             |  18 ++++++-
 fs/xfs/xfs_trans_dquot.c       | 108 ++++++++++++++++++++++++++++++-----------
 9 files changed, 180 insertions(+), 104 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
index 56d9dd787e7b..84e5d568cf1f 100644
--- a/fs/xfs/libxfs/xfs_quota_defs.h
+++ b/fs/xfs/libxfs/xfs_quota_defs.h
@@ -79,6 +79,9 @@ typedef uint16_t	xfs_qwarncnt_t;
 #define XFS_ALL_QUOTA_ACTIVE	\
 	(XFS_UQUOTA_ACTIVE | XFS_GQUOTA_ACTIVE | XFS_PQUOTA_ACTIVE)
 
+#define XFS_QUOTA_OFF_RUNNING_BIT 15  /* Quotas are being turned off */
+#define XFS_QUOTA_OFF_RUNNING	(1 << XFS_QUOTA_OFF_RUNNING_BIT)
+
 /*
  * Checking XFS_IS_*QUOTA_ON() while holding any inode lock guarantees
  * quota will be not be switched off as long as that inode lock is held.
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c45acbd3add9..d4a6156dba31 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -65,6 +65,8 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
 #define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
+#define XFS_TRANS_QUOTA		0x80	/* xact manipulates dquot info */
+
 /*
  * LOWMODE is used by the allocator to activate the lowspace algorithm - when
  * free space is running low the extent allocator may choose to allocate an
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d1a0848cb52e..2310d1634898 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -846,7 +846,7 @@ xfs_trans_resv_calc(
 		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
 	else
 		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
-	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
 	if (xfs_sb_version_hasreflink(&mp->m_sb))
@@ -854,56 +854,56 @@ xfs_trans_resv_calc(
 				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
 	else
 		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
-	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
 	resp->tr_rename.tr_logcount = XFS_RENAME_LOG_COUNT;
-	resp->tr_rename.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_rename.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_link.tr_logres = xfs_calc_link_reservation(mp);
 	resp->tr_link.tr_logcount = XFS_LINK_LOG_COUNT;
-	resp->tr_link.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_link.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_remove.tr_logres = xfs_calc_remove_reservation(mp);
 	resp->tr_remove.tr_logcount = XFS_REMOVE_LOG_COUNT;
-	resp->tr_remove.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_remove.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_symlink.tr_logres = xfs_calc_symlink_reservation(mp);
 	resp->tr_symlink.tr_logcount = XFS_SYMLINK_LOG_COUNT;
-	resp->tr_symlink.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_symlink.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_create.tr_logres = xfs_calc_icreate_reservation(mp);
 	resp->tr_create.tr_logcount = XFS_CREATE_LOG_COUNT;
-	resp->tr_create.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_create.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_create_tmpfile.tr_logres =
 			xfs_calc_create_tmpfile_reservation(mp);
 	resp->tr_create_tmpfile.tr_logcount = XFS_CREATE_TMPFILE_LOG_COUNT;
-	resp->tr_create_tmpfile.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_create_tmpfile.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_mkdir.tr_logres = xfs_calc_mkdir_reservation(mp);
 	resp->tr_mkdir.tr_logcount = XFS_MKDIR_LOG_COUNT;
-	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_ifree.tr_logres = xfs_calc_ifree_reservation(mp);
 	resp->tr_ifree.tr_logcount = XFS_INACTIVE_LOG_COUNT;
-	resp->tr_ifree.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_ifree.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_addafork.tr_logres = xfs_calc_addafork_reservation(mp);
 	resp->tr_addafork.tr_logcount = XFS_ADDAFORK_LOG_COUNT;
-	resp->tr_addafork.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_addafork.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_attrinval.tr_logres = xfs_calc_attrinval_reservation(mp);
 	resp->tr_attrinval.tr_logcount = XFS_ATTRINVAL_LOG_COUNT;
-	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_attrsetm.tr_logres = xfs_calc_attrsetm_reservation(mp);
 	resp->tr_attrsetm.tr_logcount = XFS_ATTRSET_LOG_COUNT;
-	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_attrrm.tr_logres = xfs_calc_attrrm_reservation(mp);
 	resp->tr_attrrm.tr_logcount = XFS_ATTRRM_LOG_COUNT;
-	resp->tr_attrrm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_attrrm.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
 
 	resp->tr_growrtalloc.tr_logres = xfs_calc_growrtalloc_reservation(mp);
 	resp->tr_growrtalloc.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 6d3e78c35883..5eabc555a532 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -129,7 +129,7 @@ typedef struct xfs_mount {
 	uint			m_rsumlevels;	/* rt summary levels */
 	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint			m_qflags;	/* quota status flags */
+	unsigned long		m_qflags;	/* quota status flags */
 	uint64_t		m_flags;	/* global mount flags */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 938023dd8ce5..3f59291d7fbc 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -299,9 +299,7 @@ xfs_qm_need_dqattach(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp))
-		return false;
-	if (!XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(ip->i_mount))
 		return false;
 	if (!XFS_NOT_DQATTACHED(mp, ip))
 		return false;
@@ -648,13 +646,16 @@ xfs_qm_init_quotainfo(
 	if (error)
 		goto out_free_qinf;
 
+	error = percpu_counter_init(&qinf->qi_active_trans, 0, GFP_NOFS);
+	if (error)
+		goto out_free_lru;
 	/*
 	 * See if quotainodes are setup, and if not, allocate them,
 	 * and change the superblock accordingly.
 	 */
 	error = xfs_qm_init_quotainos(mp);
 	if (error)
-		goto out_free_lru;
+		goto out_free_counter;
 
 	INIT_RADIX_TREE(&qinf->qi_uquota_tree, GFP_NOFS);
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
@@ -696,6 +697,8 @@ xfs_qm_init_quotainfo(
 	mutex_destroy(&qinf->qi_quotaofflock);
 	mutex_destroy(&qinf->qi_tree_lock);
 	xfs_qm_destroy_quotainos(qinf);
+out_free_counter:
+	percpu_counter_destroy(&qinf->qi_active_trans);
 out_free_lru:
 	list_lru_destroy(&qinf->qi_lru);
 out_free_qinf:
@@ -713,18 +716,17 @@ void
 xfs_qm_destroy_quotainfo(
 	struct xfs_mount	*mp)
 {
-	struct xfs_quotainfo	*qi;
+	struct xfs_quotainfo	*qi = mp->m_quotainfo;
 
-	qi = mp->m_quotainfo;
-	ASSERT(qi != NULL);
+	mp->m_quotainfo = NULL;
 
 	unregister_shrinker(&qi->qi_shrinker);
 	list_lru_destroy(&qi->qi_lru);
+	percpu_counter_destroy(&qi->qi_active_trans);
 	xfs_qm_destroy_quotainos(qi);
 	mutex_destroy(&qi->qi_tree_lock);
 	mutex_destroy(&qi->qi_quotaofflock);
 	kmem_free(qi);
-	mp->m_quotainfo = NULL;
 }
 
 /*
@@ -1640,7 +1642,7 @@ xfs_qm_vop_dqalloc(
 	int			error;
 	uint			lockflags;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(mp))
 		return 0;
 
 	lockflags = XFS_ILOCK_EXCL;
@@ -1891,7 +1893,7 @@ xfs_qm_vop_rename_dqattach(
 	struct xfs_mount	*mp = i_tab[0]->i_mount;
 	int			i;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(mp))
 		return 0;
 
 	for (i = 0; (i < 4 && i_tab[i]); i++) {
@@ -1922,7 +1924,7 @@ xfs_qm_vop_create_dqattach(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(mp))
 		return;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 7b0e771fcbce..754b05024cf1 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -78,6 +78,7 @@ struct xfs_quotainfo {
 	struct xfs_def_quota	qi_grp_default;
 	struct xfs_def_quota	qi_prj_default;
 	struct shrinker		qi_shrinker;
+	struct percpu_counter	qi_active_trans;
 };
 
 static inline struct radix_tree_root *
@@ -125,6 +126,9 @@ xfs_dquot_type(struct xfs_dquot *dqp)
 	return XFS_DQ_PROJ;
 }
 
+bool xfs_trans_quota_running(struct xfs_mount *mp);
+bool xfs_trans_quota_enabled(struct xfs_mount *mp);
+
 extern void	xfs_trans_mod_dquot(struct xfs_trans *tp, struct xfs_dquot *dqp,
 				    uint field, int64_t delta);
 extern void	xfs_trans_dqjoin(struct xfs_trans *, struct xfs_dquot *);
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 7effd7a28136..bc08d7127ef3 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -18,6 +18,7 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_icache.h"
+#include "xfs_log.h"
 
 STATIC int
 xfs_qm_log_quotaoff(
@@ -169,6 +170,27 @@ xfs_qm_scall_quotaoff(
 	if ((mp->m_qflags & flags) == 0)
 		goto out_unlock;
 
+	/*
+	 * We are about to start the quota off operation. At this point we stop
+	 * new transactions from making quota modifications, but we want
+	 * currently running transactions to drain out. Setting the
+	 * XFS_QUOTA_OFF_RUNNING_BIT will place a new modification barrier in
+	 * the dquot code, and then we need to wait until the active quota
+	 * transaction counter falls to zero.
+	 */
+	set_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);
+	while (percpu_counter_sum(&q->qi_active_trans) > 0) {
+		/* sleep for a short while before checking again */
+		msleep(250);
+	}
+
+	/*
+	 * Now there are no quota modifications taking place, force the log to
+	 * get all quota modifications into the log before we log the quota off
+	 * items to indicate quota has been turned off.
+	 */
+	xfs_log_force(mp, XFS_LOG_SYNC);
+
 	/*
 	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
 	 * and synchronously. If we fail to write, we should abort the
@@ -179,61 +201,38 @@ xfs_qm_scall_quotaoff(
 		goto out_unlock;
 
 	/*
-	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
-	 * to take care of the race between dqget and quotaoff. We don't take
-	 * any special locks to reset these bits. All processes need to check
-	 * these bits *after* taking inode lock(s) to see if the particular
-	 * quota type is in the process of being turned off. If *ACTIVE, it is
-	 * guaranteed that all dquot structures and all quotainode ptrs will all
-	 * stay valid as long as that inode is kept locked.
-	 *
-	 * There is no turning back after this.
-	 */
-	mp->m_qflags &= ~inactivate_flags;
-
-	/*
-	 * Give back all the dquot reference(s) held by inodes.
-	 * Here we go thru every single incore inode in this file system, and
-	 * do a dqrele on the i_udquot/i_gdquot that it may have.
-	 * Essentially, as long as somebody has an inode locked, this guarantees
-	 * that quotas will not be turned off. This is handy because in a
-	 * transaction once we lock the inode(s) and check for quotaon, we can
-	 * depend on the quota inodes (and other things) being valid as long as
-	 * we keep the lock(s).
+	 * Clear all the quota flags now that we've logged the initial quota-off
+	 * intent. Once these flags are cleared, we can log the quota-off end
+	 * intent knowing that no further modifications to the type of dquots
+	 * we just turned off will occur.
 	 */
-	xfs_qm_dqrele_all_inodes(mp, flags);
+	mp->m_qflags &= ~(inactivate_flags | flags);
+	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
+	if (error) {
+		/*
+		 * We're screwed now. Shutdown is the only option, but we
+		 * continue the quotaoff dquot cleanup as that has to
+		 * be done regardless of whether we shutdown or not.
+		 */
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
 
 	/*
-	 * Next we make the changes in the quota flag in the mount struct.
-	 * This isn't protected by a particular lock directly, because we
-	 * don't want to take a mrlock every time we depend on quotas being on.
+	 * The quota-off transactions are now on stable storage and quotas are
+	 * turned off in memory. We still have dquots that we need to reclaim,
+	 * but they will no longer be a part of ongoing modifications so we
+	 * can do that after we release all the transactions on the quota-off
+	 * barrier.
 	 */
-	mp->m_qflags &= ~flags;
+	clear_and_wake_up_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);
 
 	/*
-	 * Go through all the dquots of this file system and purge them,
-	 * according to what was turned off.
+	 * Give back all the dquot references held by inodes and purge the
+	 * dquots from memory.
 	 */
+	xfs_qm_dqrele_all_inodes(mp, flags);
 	xfs_qm_dqpurge_all(mp, dqtype);
 
-	/*
-	 * Transactions that had started before ACTIVE state bit was cleared
-	 * could have logged many dquots, so they'd have higher LSNs than
-	 * the first QUOTAOFF log record does. If we happen to crash when
-	 * the tail of the log has gone past the QUOTAOFF record, but
-	 * before the last dquot modification, those dquots __will__
-	 * recover, and that's not good.
-	 *
-	 * So, we have QUOTAOFF start and end logitems; the start
-	 * logitem won't get overwritten until the end logitem appears...
-	 */
-	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
-	if (error) {
-		/* We're screwed now. Shutdown is the only option. */
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		goto out_unlock;
-	}
-
 	/*
 	 * If all quotas are completely turned off, close shop.
 	 */
@@ -323,7 +322,7 @@ xfs_qm_scall_trunc_qfiles(
 
 	if (!xfs_sb_version_hasquota(&mp->m_sb) || flags == 0 ||
 	    (flags & ~XFS_DQ_ALLTYPES)) {
-		xfs_debug(mp, "%s: flags=%x m_qflags=%x",
+		xfs_debug(mp, "%s: flags=%x m_qflags=%lx",
 			__func__, flags, mp->m_qflags);
 		return -EINVAL;
 	}
@@ -364,7 +363,7 @@ xfs_qm_scall_quotaon(
 	flags &= XFS_ALL_QUOTA_ENFD;
 
 	if (flags == 0) {
-		xfs_debug(mp, "%s: zero flags, m_qflags=%x",
+		xfs_debug(mp, "%s: zero flags, m_qflags=%lx",
 			__func__, mp->m_qflags);
 		return -EINVAL;
 	}
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 6f350490f84b..abeddaed4cd6 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -20,6 +20,8 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_qm.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -266,6 +268,20 @@ xfs_trans_alloc(
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);
 
+	/*
+	 * If we might be manipulating quota, we need to block here if quotas
+	 * are being turned off. This allows active transactions to drain so
+	 * that we aren't modifying dquots after the quota has been turned
+	 * off. We do this here because we aren't holding any locks that
+	 * the transaction draining could block on. If quotas are enabled
+	 * while the transaction runs, track that in the transaction flags
+	 * so that we can tell the quota subsystem when the transaction is done.
+	 */
+	if (resp->tr_logflags & XFS_TRANS_QUOTA) {
+		if (xfs_trans_quota_enabled(mp))
+			tp->t_flags |= XFS_TRANS_QUOTA;
+	}
+
 	/*
 	 * Zero-reservation ("empty") transactions can't modify anything, so
 	 * they're allowed to run while we're frozen.
@@ -274,7 +290,7 @@ xfs_trans_alloc(
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
-	tp->t_flags = flags;
+	tp->t_flags |= flags;
 	tp->t_mountp = mp;
 	INIT_LIST_HEAD(&tp->t_items);
 	INIT_LIST_HEAD(&tp->t_busy);
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index c0f73b82c055..6d8d5543eb94 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -16,7 +16,66 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 
-STATIC void	xfs_trans_alloc_dqinfo(xfs_trans_t *);
+static void
+xfs_trans_alloc_dqinfo(
+	struct xfs_trans	*tp)
+{
+	if (!tp || tp->t_dqinfo)
+		return;
+	tp->t_dqinfo = kmem_zone_zalloc(xfs_qm_dqtrxzone, 0);
+}
+
+void
+xfs_trans_free_dqinfo(
+	struct xfs_trans	*tp)
+{
+	if (tp->t_flags & XFS_TRANS_QUOTA)
+		percpu_counter_dec(&tp->t_mountp->m_quotainfo->qi_active_trans);
+
+	if (!tp->t_dqinfo)
+		return;
+
+	kmem_cache_free(xfs_qm_dqtrxzone, tp->t_dqinfo);
+	tp->t_dqinfo = NULL;
+}
+
+bool
+xfs_trans_quota_running(
+	struct xfs_mount	*mp)
+{
+	if (!XFS_IS_QUOTA_RUNNING(mp))
+		return false;
+	if (!XFS_IS_QUOTA_ON(mp))
+		return false;
+	return true;
+}
+
+bool
+xfs_trans_quota_enabled(
+	struct xfs_mount	*mp)
+{
+	bool			waited;
+
+	do {
+		if (!xfs_trans_quota_running(mp))
+			return false;
+
+		/*
+		 * Don't start new quota modifications while quota off is
+		 * running. If we waited on a quota off, we need to recheck
+		 * if quota is enabled.
+		 */
+		waited = false;
+		while (test_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags)) {
+			wait_on_bit(&mp->m_qflags, XFS_QUOTA_OFF_RUNNING_BIT,
+						TASK_UNINTERRUPTIBLE);
+			waited = true;
+		}
+	} while (waited == true);
+
+	percpu_counter_inc(&mp->m_quotainfo->qi_active_trans);
+	return true;
+}
 
 /*
  * Add the locked dquot to the transaction.
@@ -67,11 +126,17 @@ xfs_trans_dup_dqinfo(
 	struct xfs_trans	*otp,
 	struct xfs_trans	*ntp)
 {
+	struct xfs_mount	*mp = otp->t_mountp;
 	struct xfs_dqtrx	*oq, *nq;
 	int			i, j;
 	struct xfs_dqtrx	*oqa, *nqa;
 	uint64_t		blk_res_used;
 
+	if (otp->t_flags & XFS_TRANS_QUOTA) {
+		ntp->t_flags |= XFS_TRANS_QUOTA;
+		percpu_counter_inc(&mp->m_quotainfo->qi_active_trans);
+	}
+
 	if (!otp->t_dqinfo)
 		return;
 
@@ -131,13 +196,12 @@ xfs_trans_mod_dquot_byino(
 {
 	xfs_mount_t	*mp = tp->t_mountp;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) ||
-	    !XFS_IS_QUOTA_ON(mp) ||
-	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
+	if (!xfs_trans_quota_running(mp))
+		return;
+	if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
 		return;
 
-	if (tp->t_dqinfo == NULL)
-		xfs_trans_alloc_dqinfo(tp);
+	xfs_trans_alloc_dqinfo(tp);
 
 	if (XFS_IS_UQUOTA_ON(mp) && ip->i_udquot)
 		(void) xfs_trans_mod_dquot(tp, ip->i_udquot, field, delta);
@@ -192,8 +256,11 @@ xfs_trans_mod_dquot(
 	ASSERT(XFS_IS_QUOTA_RUNNING(tp->t_mountp));
 	qtrx = NULL;
 
-	if (tp->t_dqinfo == NULL)
-		xfs_trans_alloc_dqinfo(tp);
+	if (!xfs_trans_quota_running(tp->t_mountp))
+		return;
+
+	xfs_trans_alloc_dqinfo(tp);
+
 	/*
 	 * Find either the first free slot or the slot that belongs
 	 * to this dquot.
@@ -742,11 +809,9 @@ xfs_trans_reserve_quota_bydquots(
 {
 	int		error;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(mp))
 		return 0;
-
-	if (tp && tp->t_dqinfo == NULL)
-		xfs_trans_alloc_dqinfo(tp);
+	xfs_trans_alloc_dqinfo(tp);
 
 	ASSERT(flags & XFS_QMOPT_RESBLK_MASK);
 
@@ -800,8 +865,9 @@ xfs_trans_reserve_quota_nblks(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
-	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+	if (!xfs_trans_quota_running(mp))
 		return 0;
+	xfs_trans_alloc_dqinfo(tp);
 
 	ASSERT(!xfs_is_quota_inode(&mp->m_sb, ip->i_ino));
 
@@ -856,19 +922,3 @@ xfs_trans_log_quotaoff_item(
 	set_bit(XFS_LI_DIRTY, &qlp->qql_item.li_flags);
 }
 
-STATIC void
-xfs_trans_alloc_dqinfo(
-	xfs_trans_t	*tp)
-{
-	tp->t_dqinfo = kmem_zone_zalloc(xfs_qm_dqtrxzone, 0);
-}
-
-void
-xfs_trans_free_dqinfo(
-	xfs_trans_t	*tp)
-{
-	if (!tp->t_dqinfo)
-		return;
-	kmem_cache_free(xfs_qm_dqtrxzone, tp->t_dqinfo);
-	tp->t_dqinfo = NULL;
-}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-02 11:51 ` [PATCH 00/10] xfs: automatic relogging Dave Chinner
@ 2020-07-02 18:52   ` Brian Foster
  2020-07-03  0:49     ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-07-02 18:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > Here's a v1 (non-RFC) version of the automatic relogging functionality.
> > Note that the buffer relogging bits (patches 8-10) are still RFC as I've
> > had to hack around some things to utilize it for testing. I include them
> > here mostly for reference/discussion. Most of the effort from the last
> > rfc post has gone into testing and solidifying the functionality. This
> > now survives a traditional fstests regression run as well as a test run
> > with random buffer relogging enabled on every test/scratch device mount
> > that occurs throughout the fstests cycle. The quotaoff use case is
> > additionally tested independently by artificially delaying completion of
> > the quotaoff in parallel with many fsstress worker threads.
> > 
> > The hacks/workarounds to support the random buffer relogging enabled
> > fstests run are not included here because they are not associated with
> > core functionality, but rather are side effects of randomly relogging
> > arbitrary buffers, etc. I can work them into the buffer relogging
> > patches if desired, but I'd like to get the core functionality and use
> > case worked out before getting too far into the testing code. I also
> > know Darrick was interested in the ->iop_relog() callback for some form
> > of generic feedback into active dfops processing, so it might be worth
> > exploring that further.
> > 
> > Thoughts, reviews, flames appreciated.
> 
> Ok I've looked through the code again, and again I've had to pause,
> stop and think hard about it because the feeling I've had right from
> the start about the automatic relogging concept is stronger than
> ever.
> 
> I think the most constructive way to say what I'm feeling is that I
> think this is the wrong approach to solve the quota off problem.
> However, I've never been able to come up with an alternative that
> also solved the quotaoff problem so I've tried to help make this
> relogging concept work.
> 

I actually agree that this mechanism is overkill for quotaoff. I
probably wouldn't have invested this much time in the first place if
that was the only use case. Note that the original relogging concept
came out of discussion with Darrick about online btree repair because
IIRC the technique we landed on required dropping EFIs (which still have
open issues wrt relogging) in the log for a non-deterministic amount
of time on an otherwise active fs. We came up with the concept and I
remembered quotaoff had a similar unresolved problem, so simply decided
to use that as a vector for the POC because the use case is much
simpler.

> It's a very interesting experiment, but I've always had a nagging
> doubt about putting transaction reservations both above and below
> the AIL. In reading this version, I'm having trouble following and
> understanding the transaction reservation juggling and
> recalculation complexity that's been introduced to facilitate
> the stealing that is being done. Yes, I know that I suggested the
> dynamic stealing approach - it's certainly better than past
> versions, but it hasn't really addressed my underlying doubt about
> the relogging concept in general...
> 

I think we need to separate discussion around the best solution for the
quotaoff problem from general doubts about the relog mechanism here. In
my mind, changing how we address quotaoff doesn't really impact the
existence of this mechanism because it was never the primary use case.
It just changes the timeline/dependency/requirements a bit.

However, you're mentioning "nagging doubts" about the fundamentals of
how it works, etc., so that suggests there are still concerns around the
mechanism itself independent from quotaoff. I've sent 5 or so RFCs to
try and elicit general feedback and address fundamental concerns before
putting in the effort to solidify the implementation, which was notably
more time consuming than reworking the RFC. It's quite frustrating to
see negative feedback broaden at this stage in a manner/pattern that
suggests the mechanism is not generally acceptable.

All that being what it is, I'd obviously rather not expend even more
time if this is going to be met with vague/general reluctance. Do we
need to go back to the drawing board on the repair use case? If so,
should we reconsider the approach repair is using to release blocks?
Perhaps alter this mechanism to address some tangible concerns? Try and
come up with something else entirely..?

Moving on to quotaoff...

> I have been spending some time recently in the quota code, so I have
> a better grip on what it is doing now than I did last time I looked
> at this relogging code. I never really questioned why the quota code
> needed two transactions for quota-off, and I'm guessing that nobody
> else has either. So I spent some time this morning understanding
> what problem it was actually solving and trying to find an alternate
> solution to that problem.
> 

Indeed, I hadn't looked into that.

> The reason we have the two quota-off transactions is that active
> dquot modifications at the time quotaoff is started leak past the
> first quota off transaction that hits the journal. Hence to avoid
> incorrect replay of those modifications in the journal if we crash
> after the quota-off item passes out of the journal, we pin the
> quota-off item in the journal. It gets unpinned by the commit of the
> second quota-off transaction at completion time, hence defining the
> window in journal where quota-off is being processed and dquot
> modifications should be ignored. i.e. there is no window where
> recovery will replay dquot modifications incorrectly.
> 

Ok.

> However, if the second transaction is left too long, the reservation
> will fail to find journal space because of the pinned quota-off item.
> 

Right.

> The relogging infrastructure is designed to allow the initial
> quota-off intent to keep moving forward in the log so it never pins
> the tail of the log before the second quota-off transaction is run.
> This tries to avoid the recovery issue because there's always an
> active quota off item in the log, but I think there may be a flaw
> here.  When the quotaoff item gets relogged, it jumps all the dquots
> in the log that were modified after the quota-off started. Hence if
> we crash after the relogging but while the dquots are still in the
> log before the relogged quotaoff item, then they will be replayed,
> possibly incorrectly. i.e. the relogged quota-off item no longer
> prevents replay of those items.
> 
> So while relogging prevents the tail pinning deadlock, I think it
> may actually result in incorrect recovery behaviour in that items
> that should be cancelled and not replayed can end up getting
> replayed.  I'm not sure that this matters for dquots, but for a
> general mechanism I think the transactional ordering violations it
> can result in reduce its usefulness significantly.
> 

Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
the early RFCs. Log recovery processes the quotaoff intent in pass 1 and
dquot updates in pass 2, which I thought was intended to handle this
kind of problem.

If I follow correctly, the recovery issue that warrants pinning the
quotaoff in the log is not so much an ordering issue as that, if the
quotaoff item happens to fall out of the log before the last of the dquot
modifications, recovery could see dquot changes after having lost the
fact that a quotaoff had occurred at all. The current implementation
presumably handles this by pinning the quotaoff until all dquots are
completely purged from existence. The relog mechanism just allows the
item to move while it effectively remains pinned, so I don't see how it
introduces recovery issues.
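
For reference, the pass ordering I'm thinking of looks roughly like
this (simplified from the existing xlog_recover_quotaoff_pass1() and
xlog_recover_dquot_pass2() code):

	/* pass 1: note that a quota-off item exists anywhere in the log */
	if (qoff_f->qf_flags & XFS_UQUOTA_ACCT)
		log->l_quotaoffs_flag |= XFS_DQ_USER;

	/*
	 * pass 2: dquot replay of that type is then skipped, regardless
	 * of where the dquot sits in the log relative to the quotaoff.
	 */
	if (log->l_quotaoffs_flag & type)
		return 0;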

> But back to quota-off: What I've realised is that the only dquot
> modifications we need to protect against being recovered are the
> ones that are running at the time the first quota-off is committed
> to the journal. That is, once the DQACTIVE flags are clear,
> transactions will not modify those dquots anymore. Hence by the time
> that the quota off item pins the tail of the log, the transactions
> that were actively dirtying inodes when it was committed have also
> committed and are in the journal and there are no actively modified
> dquots left in memory.
> 

I'm not sure how the (sync) commit of the quotaoff guarantees some other
transaction running in parallel hadn't modified a dquot and committed
after the quotaoff, but I think I see where you're going in general...

> IOWs, we don't actually need to wait until we've released and purged
> all the dquots from memory before we log the second quota off item;
> all we need to wait for is for all the transactions with dirty
> dquots to have committed. These transactions already have log
> reservations, so completing them will free unused reservation space
> for the second quota off transaction. Once they are committed, then
> we can log the second item. i.e. we don't have to wait until we've
> cleaned up the dquots to close out the quota-off transaction in the
> journal.
> 

Ok, so we can deterministically shorten the window with a runtime
barrier (i.e. disable -> drain) on quota modifying transactions rather
than relying on the full dquot purge to provide this ordering.

> To make it even more robust, if we stop all the transactions that
> may dirty dquots and drain the active ones before we log the first
> quota-off item, we can log the second item immediately afterwards
> because it is known that there are no dquot modifications in flight
> when the first item is logged. We can probably even log both items
> in the same transaction.
> 

I was going to ask why we'd even need two items if this approach is
generally viable. ISTM that recovery doesn't care whether there's one or
two in the log since the first quotaoff it sees takes effect (and the
superblock changes are logged in the same transaction).

> Recovery will behave correctly on all kernels, old and new, because
> we've prevented dquots from landing in the journal after the first
> quota-off item. Hence it looks to recovery simply like the fs was
> idle when the quota-off was run...
> 
> So, putting my money where my mouth is, the patch below does this.
> It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> of -g quota with all quotas enabled and is currently running a full
> auto cycle with all quotas enabled. It hasn't let the smoke out
> after about 4 hours of testing now....
> 

Thanks for the patch. First, I like the idea and agree that it's more
simple than the relogging approach. I do still need to stare at it some
more to grok it and convince myself it's safe.

The thing that sticks out to me is tagging all of the transactions that
modify quotas. Is there any reason we can't just quiesce the transaction
subsystem entirely as a first step? It's not like quotaoff is common or
performance sensitive. For example:

1. stop all transactions, wait to drain, force log
2. log the sb/quotaoff synchronously (punching through via something
   like NO_WRITECOUNT)
3. clear the xfs_mount quota active flags
4. restart the transaction subsystem (no more dquot mods)
5. complete quotaoff via the dquot release and purge sequence

I think it could be worth the tradeoff for the simplicity of not having
to maintain the transaction reservation tags or the special quota
waiting infrastructure vs. something like the more generic (recently
removed) transaction counter. We might even be able to abstract the
whole thing behind a transaction flag. E.g.:

	/*
	 * A barrier transaction locks out further transactions and waits on
	 * outstanding transactions to drain (i.e. commit) before returning.
	 * Everything unlocks when the transaction commits.
	 */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
			XFS_TRANS_BARRIER, &tp);
	...

	/* qoff disappears when committed to on-disk log */
	xfs_trans_log_quotaoff_item(tp, qoffi);
	xfs_log_sb();
	...

	/*
	 * The transaction subsystem is currently frozen. This transaction is
	 * either going to commit synchronously and we'll proceed with the quotaoff
	 * or abort and shutdown the fs. Clear the active quota state now to
	 * guarantee that no further dquot changes hit the log after the quotaoff.
	 */
	mp->m_qflags &= ~(inactivate_flags | flags);
	
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);
	...

Thoughts on either?

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> [RFC] xfs: rework quota off process to avoid deadlocks
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The quotaoff operation has a rare but longstanding deadlock vector
> in terms of how the operation is logged. A quotaoff start intent is
> logged (synchronously) at the onset to ensure recovery can handle
> the operation if interrupted before in-core changes are made. This
> quotaoff intent pins the log tail while the quotaoff sequence scans
> and purges dquots from all in-core inodes. While this operation
> generally doesn't generate much log traffic on its own, it can be
> time consuming. If unrelated, concurrent filesystem activity
> consumes remaining log space before quotaoff is able to acquire log
> reservation for the quotaoff end intent, the filesystem locks up
> indefinitely.
> 
> The problem that the quota-off intent/done logging setup is designed
> to deal with is that when the quota-off intent is first logged,
> there can be active transactions with modified dquots in them. These
> cannot be aborted, and the result is that modified dquots end up in
> the journal after the quota-off item has been recorded. Hence the
> proposal to automatically relog the quota off intent so it doesn't
> fall out of the log before the done intent can be written.
> 
> We can fix this another way: all we need to do is prevent dquots
> from being logged to the journal after we log the quota off intent.
> If we do this, then we can just log the quota off intent and the
> intent done items in the same transaction and then tear down the
> in memory dquots.
> 
> The current code prevents new dquots from being logged once the
> quota-off has been logged simply by clearing the DQ_ACTIVE flags for
> the quota types being turned off. Removing these flags means all future
> transactions will see that the specific quota type is no longer
> active and skip over it completely. This, however, requires
> co-ordination with inode locks for it to behave correctly, and
> hence this mechanism cannot be used to block transactions while we
> wait for other transactions to drain out.
> 
> Hence add a new "quota off running" flag to the mount quota flags.
> This bit can be set atomically, and we can use bit ops to wait on it
> being cleared. And when we clear the bit we can issue a wakeup on
> it. Hence adding a new bit gives us the mechanism to block
> operations while quota-off is running.
> 
> To achieve this draining, we need to keep a count of the number of
> active transactions that are modifying dquots. To do that, let's tag
> transaction reservation structures for transactions that may modify
> dquots with a new flag. We can look at that flag in
> xfs_trans_alloc() - where it is safe to block - and do
> all our quota-off run checks, blocking and accounting here.
> We can use a single per-cpu counter to track dquot modifying
> transactions in flight, so after setting the "quota off running"
> flag all we need to do is wait for the counter to run to zero.
> 
> At this point, we can log the quota off items, then clear the
> mount quota flags, and then detach and reap all the dquots that
> remain in memory that will no longer be used. This can be done after
> the quota-off end item is logged because nothing is going to be
> modifying the dquots we need to reap at this point.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_quota_defs.h |   3 ++
>  fs/xfs/libxfs/xfs_shared.h     |   2 +
>  fs/xfs/libxfs/xfs_trans_resv.c |  28 +++++------
>  fs/xfs/xfs_mount.h             |   2 +-
>  fs/xfs/xfs_qm.c                |  24 ++++-----
>  fs/xfs/xfs_qm.h                |   4 ++
>  fs/xfs/xfs_qm_syscalls.c       |  95 ++++++++++++++++++------------------
>  fs/xfs/xfs_trans.c             |  18 ++++++-
>  fs/xfs/xfs_trans_dquot.c       | 108 ++++++++++++++++++++++++++++++-----------
>  9 files changed, 180 insertions(+), 104 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
> index 56d9dd787e7b..84e5d568cf1f 100644
> --- a/fs/xfs/libxfs/xfs_quota_defs.h
> +++ b/fs/xfs/libxfs/xfs_quota_defs.h
> @@ -79,6 +79,9 @@ typedef uint16_t	xfs_qwarncnt_t;
>  #define XFS_ALL_QUOTA_ACTIVE	\
>  	(XFS_UQUOTA_ACTIVE | XFS_GQUOTA_ACTIVE | XFS_PQUOTA_ACTIVE)
>  
> +#define XFS_QUOTA_OFF_RUNNING_BIT 15  /* Quotas are being turned off */
> +#define XFS_QUOTA_OFF_RUNNING	(1 << XFS_QUOTA_OFF_RUNNING_BIT)
> +
>  /*
>   * Checking XFS_IS_*QUOTA_ON() while holding any inode lock guarantees
>   * quota will be not be switched off as long as that inode lock is held.
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..d4a6156dba31 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -65,6 +65,8 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>  #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
>  #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
>  #define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
> +#define XFS_TRANS_QUOTA		0x80	/* xact manipulates dquot info */
> +
>  /*
>   * LOWMODE is used by the allocator to activate the lowspace algorithm - when
>   * free space is running low the extent allocator may choose to allocate an
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index d1a0848cb52e..2310d1634898 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -846,7 +846,7 @@ xfs_trans_resv_calc(
>  		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
>  	else
>  		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> -	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
>  	if (xfs_sb_version_hasreflink(&mp->m_sb))
> @@ -854,56 +854,56 @@ xfs_trans_resv_calc(
>  				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
>  	else
>  		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> -	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
>  	resp->tr_rename.tr_logcount = XFS_RENAME_LOG_COUNT;
> -	resp->tr_rename.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_rename.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_link.tr_logres = xfs_calc_link_reservation(mp);
>  	resp->tr_link.tr_logcount = XFS_LINK_LOG_COUNT;
> -	resp->tr_link.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_link.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_remove.tr_logres = xfs_calc_remove_reservation(mp);
>  	resp->tr_remove.tr_logcount = XFS_REMOVE_LOG_COUNT;
> -	resp->tr_remove.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_remove.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_symlink.tr_logres = xfs_calc_symlink_reservation(mp);
>  	resp->tr_symlink.tr_logcount = XFS_SYMLINK_LOG_COUNT;
> -	resp->tr_symlink.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_symlink.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_create.tr_logres = xfs_calc_icreate_reservation(mp);
>  	resp->tr_create.tr_logcount = XFS_CREATE_LOG_COUNT;
> -	resp->tr_create.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_create.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_create_tmpfile.tr_logres =
>  			xfs_calc_create_tmpfile_reservation(mp);
>  	resp->tr_create_tmpfile.tr_logcount = XFS_CREATE_TMPFILE_LOG_COUNT;
> -	resp->tr_create_tmpfile.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_create_tmpfile.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_mkdir.tr_logres = xfs_calc_mkdir_reservation(mp);
>  	resp->tr_mkdir.tr_logcount = XFS_MKDIR_LOG_COUNT;
> -	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_ifree.tr_logres = xfs_calc_ifree_reservation(mp);
>  	resp->tr_ifree.tr_logcount = XFS_INACTIVE_LOG_COUNT;
> -	resp->tr_ifree.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_ifree.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_addafork.tr_logres = xfs_calc_addafork_reservation(mp);
>  	resp->tr_addafork.tr_logcount = XFS_ADDAFORK_LOG_COUNT;
> -	resp->tr_addafork.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_addafork.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_attrinval.tr_logres = xfs_calc_attrinval_reservation(mp);
>  	resp->tr_attrinval.tr_logcount = XFS_ATTRINVAL_LOG_COUNT;
> -	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_attrsetm.tr_logres = xfs_calc_attrsetm_reservation(mp);
>  	resp->tr_attrsetm.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> -	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_attrrm.tr_logres = xfs_calc_attrrm_reservation(mp);
>  	resp->tr_attrrm.tr_logcount = XFS_ATTRRM_LOG_COUNT;
> -	resp->tr_attrrm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_attrrm.tr_logflags |= XFS_TRANS_PERM_LOG_RES | XFS_TRANS_QUOTA;
>  
>  	resp->tr_growrtalloc.tr_logres = xfs_calc_growrtalloc_reservation(mp);
>  	resp->tr_growrtalloc.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 6d3e78c35883..5eabc555a532 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -129,7 +129,7 @@ typedef struct xfs_mount {
>  	uint			m_rsumlevels;	/* rt summary levels */
>  	uint			m_rsumsize;	/* size of rt summary, bytes */
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
> -	uint			m_qflags;	/* quota status flags */
> +	unsigned long		m_qflags;	/* quota status flags */
>  	uint64_t		m_flags;	/* global mount flags */
>  	int64_t			m_low_space[XFS_LOWSP_MAX];
>  	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 938023dd8ce5..3f59291d7fbc 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -299,9 +299,7 @@ xfs_qm_need_dqattach(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp))
> -		return false;
> -	if (!XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(ip->i_mount))
>  		return false;
>  	if (!XFS_NOT_DQATTACHED(mp, ip))
>  		return false;
> @@ -648,13 +646,16 @@ xfs_qm_init_quotainfo(
>  	if (error)
>  		goto out_free_qinf;
>  
> +	error = percpu_counter_init(&qinf->qi_active_trans, 0, GFP_NOFS);
> +	if (error)
> +		goto out_free_lru;
>  	/*
>  	 * See if quotainodes are setup, and if not, allocate them,
>  	 * and change the superblock accordingly.
>  	 */
>  	error = xfs_qm_init_quotainos(mp);
>  	if (error)
> -		goto out_free_lru;
> +		goto out_free_counter;
>  
>  	INIT_RADIX_TREE(&qinf->qi_uquota_tree, GFP_NOFS);
>  	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
> @@ -696,6 +697,8 @@ xfs_qm_init_quotainfo(
>  	mutex_destroy(&qinf->qi_quotaofflock);
>  	mutex_destroy(&qinf->qi_tree_lock);
>  	xfs_qm_destroy_quotainos(qinf);
> +out_free_counter:
> +	percpu_counter_destroy(&qinf->qi_active_trans);
>  out_free_lru:
>  	list_lru_destroy(&qinf->qi_lru);
>  out_free_qinf:
> @@ -713,18 +716,17 @@ void
>  xfs_qm_destroy_quotainfo(
>  	struct xfs_mount	*mp)
>  {
> -	struct xfs_quotainfo	*qi;
> +	struct xfs_quotainfo	*qi = mp->m_quotainfo;
>  
> -	qi = mp->m_quotainfo;
> -	ASSERT(qi != NULL);
> +	mp->m_quotainfo = NULL;
>  
>  	unregister_shrinker(&qi->qi_shrinker);
>  	list_lru_destroy(&qi->qi_lru);
> +	percpu_counter_destroy(&qi->qi_active_trans);
>  	xfs_qm_destroy_quotainos(qi);
>  	mutex_destroy(&qi->qi_tree_lock);
>  	mutex_destroy(&qi->qi_quotaofflock);
>  	kmem_free(qi);
> -	mp->m_quotainfo = NULL;
>  }
>  
>  /*
> @@ -1640,7 +1642,7 @@ xfs_qm_vop_dqalloc(
>  	int			error;
>  	uint			lockflags;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(mp))
>  		return 0;
>  
>  	lockflags = XFS_ILOCK_EXCL;
> @@ -1891,7 +1893,7 @@ xfs_qm_vop_rename_dqattach(
>  	struct xfs_mount	*mp = i_tab[0]->i_mount;
>  	int			i;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(mp))
>  		return 0;
>  
>  	for (i = 0; (i < 4 && i_tab[i]); i++) {
> @@ -1922,7 +1924,7 @@ xfs_qm_vop_create_dqattach(
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(mp))
>  		return;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
> index 7b0e771fcbce..754b05024cf1 100644
> --- a/fs/xfs/xfs_qm.h
> +++ b/fs/xfs/xfs_qm.h
> @@ -78,6 +78,7 @@ struct xfs_quotainfo {
>  	struct xfs_def_quota	qi_grp_default;
>  	struct xfs_def_quota	qi_prj_default;
>  	struct shrinker		qi_shrinker;
> +	struct percpu_counter	qi_active_trans;
>  };
>  
>  static inline struct radix_tree_root *
> @@ -125,6 +126,9 @@ xfs_dquot_type(struct xfs_dquot *dqp)
>  	return XFS_DQ_PROJ;
>  }
>  
> +bool xfs_trans_quota_running(struct xfs_mount *mp);
> +bool xfs_trans_quota_enabled(struct xfs_mount *mp);
> +
>  extern void	xfs_trans_mod_dquot(struct xfs_trans *tp, struct xfs_dquot *dqp,
>  				    uint field, int64_t delta);
>  extern void	xfs_trans_dqjoin(struct xfs_trans *, struct xfs_dquot *);
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 7effd7a28136..bc08d7127ef3 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -18,6 +18,7 @@
>  #include "xfs_quota.h"
>  #include "xfs_qm.h"
>  #include "xfs_icache.h"
> +#include "xfs_log.h"
>  
>  STATIC int
>  xfs_qm_log_quotaoff(
> @@ -169,6 +170,27 @@ xfs_qm_scall_quotaoff(
>  	if ((mp->m_qflags & flags) == 0)
>  		goto out_unlock;
>  
> +	/*
> +	 * We are about to start the quota off operation. At this point we stop
> +	 * new transactions from making quota modifications, but we want
> +	 * currently running transactions to drain out. Setting the
> +	 * XFS_QUOTA_OFF_RUNNING_BIT will place a new modification barrier in
> +	 * the dquot code, and then we need to wait until the active quota
> +	 * transaction counter falls to zero.
> +	 */
> +	set_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);
> +	while (percpu_counter_sum(&q->qi_active_trans) > 0) {
> +		/* sleep for a short while before checking again */
> +		msleep(250);
> +	}
> +
> +	/*
> +	 * Now there are no quota modifications taking place, force the log to
> +	 * get all quota modifications into the log before we log the quota off
> +	 * items to indicate quota has been turned off.
> +	 */
> +	xfs_log_force(mp, XFS_LOG_SYNC);
> +
>  	/*
>  	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
>  	 * and synchronously. If we fail to write, we should abort the
> @@ -179,61 +201,38 @@ xfs_qm_scall_quotaoff(
>  		goto out_unlock;
>  
>  	/*
> -	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> -	 * to take care of the race between dqget and quotaoff. We don't take
> -	 * any special locks to reset these bits. All processes need to check
> -	 * these bits *after* taking inode lock(s) to see if the particular
> -	 * quota type is in the process of being turned off. If *ACTIVE, it is
> -	 * guaranteed that all dquot structures and all quotainode ptrs will all
> -	 * stay valid as long as that inode is kept locked.
> -	 *
> -	 * There is no turning back after this.
> -	 */
> -	mp->m_qflags &= ~inactivate_flags;
> -
> -	/*
> -	 * Give back all the dquot reference(s) held by inodes.
> -	 * Here we go thru every single incore inode in this file system, and
> -	 * do a dqrele on the i_udquot/i_gdquot that it may have.
> -	 * Essentially, as long as somebody has an inode locked, this guarantees
> -	 * that quotas will not be turned off. This is handy because in a
> -	 * transaction once we lock the inode(s) and check for quotaon, we can
> -	 * depend on the quota inodes (and other things) being valid as long as
> -	 * we keep the lock(s).
> +	 * Clear all the quota flags now that we've logged the initial quota-off
> +	 * intent. Once these flags are cleared, we can log the quota-off end
> +	 * intent knowing that no further modifications to the type of dquots
> +	 * we just turned off will occur.
>  	 */
> -	xfs_qm_dqrele_all_inodes(mp, flags);
> +	mp->m_qflags &= ~(inactivate_flags | flags);
> +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> +	if (error) {
> +		/*
> +		 * We're screwed now. Shutdown is the only option, but we
> +		 * continue the quotaoff dquot cleanup as that has to
> +		 * be done regardless of whether we shutdown or not.
> +		 */
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	}
>  
>  	/*
> -	 * Next we make the changes in the quota flag in the mount struct.
> -	 * This isn't protected by a particular lock directly, because we
> -	 * don't want to take a mrlock every time we depend on quotas being on.
> +	 * The quota-off transactions are now on stable storage and quotas are
> +	 * turned off in memory. We still have dquots that we need to reclaim,
> +	 * but they will no longer be a part of ongoing modifications so we
> +	 * can do that after we release all the transactions on the quota-off
> +	 * barrier.
>  	 */
> -	mp->m_qflags &= ~flags;
> +	clear_and_wake_up_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags);
>  
>  	/*
> -	 * Go through all the dquots of this file system and purge them,
> -	 * according to what was turned off.
> +	 * Give back all the dquot references held by inodes and purge the
> +	 * dquots from memory.
>  	 */
> +	xfs_qm_dqrele_all_inodes(mp, flags);
>  	xfs_qm_dqpurge_all(mp, dqtype);
>  
> -	/*
> -	 * Transactions that had started before ACTIVE state bit was cleared
> -	 * could have logged many dquots, so they'd have higher LSNs than
> -	 * the first QUOTAOFF log record does. If we happen to crash when
> -	 * the tail of the log has gone past the QUOTAOFF record, but
> -	 * before the last dquot modification, those dquots __will__
> -	 * recover, and that's not good.
> -	 *
> -	 * So, we have QUOTAOFF start and end logitems; the start
> -	 * logitem won't get overwritten until the end logitem appears...
> -	 */
> -	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> -	if (error) {
> -		/* We're screwed now. Shutdown is the only option. */
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -		goto out_unlock;
> -	}
> -
>  	/*
>  	 * If all quotas are completely turned off, close shop.
>  	 */
> @@ -323,7 +322,7 @@ xfs_qm_scall_trunc_qfiles(
>  
>  	if (!xfs_sb_version_hasquota(&mp->m_sb) || flags == 0 ||
>  	    (flags & ~XFS_DQ_ALLTYPES)) {
> -		xfs_debug(mp, "%s: flags=%x m_qflags=%x",
> +		xfs_debug(mp, "%s: flags=%x m_qflags=%lx",
>  			__func__, flags, mp->m_qflags);
>  		return -EINVAL;
>  	}
> @@ -364,7 +363,7 @@ xfs_qm_scall_quotaon(
>  	flags &= XFS_ALL_QUOTA_ENFD;
>  
>  	if (flags == 0) {
> -		xfs_debug(mp, "%s: zero flags, m_qflags=%x",
> +		xfs_debug(mp, "%s: zero flags, m_qflags=%lx",
>  			__func__, mp->m_qflags);
>  		return -EINVAL;
>  	}
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 6f350490f84b..abeddaed4cd6 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -20,6 +20,8 @@
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_defer.h"
> +#include "xfs_inode.h"
> +#include "xfs_qm.h"
>  
>  kmem_zone_t	*xfs_trans_zone;
>  
> @@ -266,6 +268,20 @@ xfs_trans_alloc(
>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_start_intwrite(mp->m_super);
>  
> +	/*
> +	 * If we might be manipulating quota, we need to block here if quotas
> +	 * are being turned off. This allows active transactions to drain so
> +	 * that we aren't modifying dquots after the quota has been turned
> +	 * off. We do this here because we aren't holding any locks that
> +	 * the transaction draining could block on. If quotas are enabled
> +	 * while the transaction runs, track that in the transaction flags
> +	 * so that we can tell the quota subsystem when the transaction is done.
> +	 */
> +	if (resp->tr_logflags & XFS_TRANS_QUOTA) {
> +		if (xfs_trans_quota_enabled(mp))
> +			tp->t_flags |= XFS_TRANS_QUOTA;
> +	}
> +
>  	/*
>  	 * Zero-reservation ("empty") transactions can't modify anything, so
>  	 * they're allowed to run while we're frozen.
> @@ -274,7 +290,7 @@ xfs_trans_alloc(
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
>  
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> -	tp->t_flags = flags;
> +	tp->t_flags |= flags;
>  	tp->t_mountp = mp;
>  	INIT_LIST_HEAD(&tp->t_items);
>  	INIT_LIST_HEAD(&tp->t_busy);
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index c0f73b82c055..6d8d5543eb94 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -16,7 +16,66 @@
>  #include "xfs_quota.h"
>  #include "xfs_qm.h"
>  
> -STATIC void	xfs_trans_alloc_dqinfo(xfs_trans_t *);
> +static void
> +xfs_trans_alloc_dqinfo(
> +	struct xfs_trans	*tp)
> +{
> +	if (!tp || tp->t_dqinfo)
> +		return;
> +	tp->t_dqinfo = kmem_zone_zalloc(xfs_qm_dqtrxzone, 0);
> +}
> +
> +void
> +xfs_trans_free_dqinfo(
> +	struct xfs_trans	*tp)
> +{
> +	if (tp->t_flags & XFS_TRANS_QUOTA)
> +		percpu_counter_dec(&tp->t_mountp->m_quotainfo->qi_active_trans);
> +
> +	if (!tp->t_dqinfo)
> +		return;
> +
> +	kmem_cache_free(xfs_qm_dqtrxzone, tp->t_dqinfo);
> +	tp->t_dqinfo = NULL;
> +}
> +
> +bool
> +xfs_trans_quota_running(
> +	struct xfs_mount	*mp)
> +{
> +	if (!XFS_IS_QUOTA_RUNNING(mp))
> +		return false;
> +	if (!XFS_IS_QUOTA_ON(mp))
> +		return false;
> +	return true;
> +}
> +
> +bool
> +xfs_trans_quota_enabled(
> +	struct xfs_mount	*mp)
> +{
> +	bool			waited;
> +
> +	do {
> +		if (!xfs_trans_quota_running(mp))
> +			return false;
> +
> +		/*
> +		 * Don't start new quota modifications while quota off is
> +		 * running. If we waited on a quota off, we need to recheck
> +		 * if quota is enabled.
> +		 */
> +		waited = false;
> +		while (test_bit(XFS_QUOTA_OFF_RUNNING_BIT, &mp->m_qflags)) {
> +			wait_on_bit(&mp->m_qflags, XFS_QUOTA_OFF_RUNNING_BIT,
> +						TASK_UNINTERRUPTIBLE);
> +			waited = true;
> +		}
> +	} while (waited == true);
> +
> +	percpu_counter_inc(&mp->m_quotainfo->qi_active_trans);
> +	return true;
> +}
>  
>  /*
>   * Add the locked dquot to the transaction.
> @@ -67,11 +126,17 @@ xfs_trans_dup_dqinfo(
>  	struct xfs_trans	*otp,
>  	struct xfs_trans	*ntp)
>  {
> +	struct xfs_mount	*mp = otp->t_mountp;
>  	struct xfs_dqtrx	*oq, *nq;
>  	int			i, j;
>  	struct xfs_dqtrx	*oqa, *nqa;
>  	uint64_t		blk_res_used;
>  
> +	if (otp->t_flags & XFS_TRANS_QUOTA) {
> +		ntp->t_flags |= XFS_TRANS_QUOTA;
> +		percpu_counter_inc(&mp->m_quotainfo->qi_active_trans);
> +	}
> +
>  	if (!otp->t_dqinfo)
>  		return;
>  
> @@ -131,13 +196,12 @@ xfs_trans_mod_dquot_byino(
>  {
>  	xfs_mount_t	*mp = tp->t_mountp;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> -	    !XFS_IS_QUOTA_ON(mp) ||
> -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> +	if (!xfs_trans_quota_running(mp))
> +		return;
> +	if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
>  		return;
>  
> -	if (tp->t_dqinfo == NULL)
> -		xfs_trans_alloc_dqinfo(tp);
> +	xfs_trans_alloc_dqinfo(tp);
>  
>  	if (XFS_IS_UQUOTA_ON(mp) && ip->i_udquot)
>  		(void) xfs_trans_mod_dquot(tp, ip->i_udquot, field, delta);
> @@ -192,8 +256,11 @@ xfs_trans_mod_dquot(
>  	ASSERT(XFS_IS_QUOTA_RUNNING(tp->t_mountp));
>  	qtrx = NULL;
>  
> -	if (tp->t_dqinfo == NULL)
> -		xfs_trans_alloc_dqinfo(tp);
> +	if (!xfs_trans_quota_running(tp->t_mountp))
> +		return;
> +
> +	xfs_trans_alloc_dqinfo(tp);
> +
>  	/*
>  	 * Find either the first free slot or the slot that belongs
>  	 * to this dquot.
> @@ -742,11 +809,9 @@ xfs_trans_reserve_quota_bydquots(
>  {
>  	int		error;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(mp))
>  		return 0;
> -
> -	if (tp && tp->t_dqinfo == NULL)
> -		xfs_trans_alloc_dqinfo(tp);
> +	xfs_trans_alloc_dqinfo(tp);
>  
>  	ASSERT(flags & XFS_QMOPT_RESBLK_MASK);
>  
> @@ -800,8 +865,9 @@ xfs_trans_reserve_quota_nblks(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  
> -	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +	if (!xfs_trans_quota_running(mp))
>  		return 0;
> +	xfs_trans_alloc_dqinfo(tp);
>  
>  	ASSERT(!xfs_is_quota_inode(&mp->m_sb, ip->i_ino));
>  
> @@ -856,19 +922,3 @@ xfs_trans_log_quotaoff_item(
>  	set_bit(XFS_LI_DIRTY, &qlp->qql_item.li_flags);
>  }
>  
> -STATIC void
> -xfs_trans_alloc_dqinfo(
> -	xfs_trans_t	*tp)
> -{
> -	tp->t_dqinfo = kmem_zone_zalloc(xfs_qm_dqtrxzone, 0);
> -}
> -
> -void
> -xfs_trans_free_dqinfo(
> -	xfs_trans_t	*tp)
> -{
> -	if (!tp->t_dqinfo)
> -		return;
> -	kmem_cache_free(xfs_qm_dqtrxzone, tp->t_dqinfo);
> -	tp->t_dqinfo = NULL;
> -}
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-02 18:52   ` Brian Foster
@ 2020-07-03  0:49     ` Dave Chinner
  2020-07-06 16:03       ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-07-03  0:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > Here's a v1 (non-RFC) version of the automatic relogging functionality.
> > > Note that the buffer relogging bits (patches 8-10) are still RFC as I've
> > > had to hack around some things to utilize it for testing. I include them
> > > here mostly for reference/discussion. Most of the effort from the last
> > > rfc post has gone into testing and solidifying the functionality. This
> > > now survives a traditional fstests regression run as well as a test run
> > > with random buffer relogging enabled on every test/scratch device mount
> > > that occurs throughout the fstests cycle. The quotaoff use case is
> > > additionally tested independently by artificially delaying completion of
> > > the quotaoff in parallel with many fsstress worker threads.
> > > 
> > > The hacks/workarounds to support the random buffer relogging enabled
> > > fstests run are not included here because they are not associated with
> > > core functionality, but rather are side effects of randomly relogging
> > > arbitrary buffers, etc. I can work them into the buffer relogging
> > > patches if desired, but I'd like to get the core functionality and use
> > > case worked out before getting too far into the testing code. I also
> > > know Darrick was interested in the ->iop_relog() callback for some form
> > > of generic feedback into active dfops processing, so it might be worth
> > > exploring that further.
> > > 
> > > Thoughts, reviews, flames appreciated.
> > 
> > Ok I've looked through the code again, and again I've had to pause,
> > stop and think hard about it because the feeling I've had right from
> > the start about the automatic relogging concept is stronger than
> > ever.
> > 
> > I think the most constructive way to say what I'm feeling is that I
> > think this is the wrong approach to solve the quota off problem.
> > However, I've never been able to come up with an alternative that
> > also solved the quotaoff problem so I've tried to help make this
> > relogging concept work.
> > 
> 
> I actually agree that this mechanism is overkill for quotaoff. I
> probably wouldn't have invested this much time in the first place if
> that was the only use case. Note that the original relogging concept
> came about around discussion with Darrick on online btree repair because
> IIRC the technique we landed on required dropping EFIs (which still have
> open issues wrt to relogging) in the log for a non-deterministic amount
> of time on an otherwise active fs. We came up with the concept and I
> remembered quotaoff had a similar unresolved problem, so simply decided
> to use that as a vector for the POC because the use case is much
> simpler.

Yes, I know the history. That didn't make it any easier for me to
write what I did, because I know how much time you've put into this
already.

w.r.t. EFIs, that comes back to the problem of the relogged items
jumping over things that have been logged that should appear between
the EFI and EFD - moving the EFI forward past such dependent items
is going to be a problem - those changes are going to be replayed
regardless of whether the EFI needs replaying or not, and hence
replaying the EFI that got relogged will be out of order with other
operations that occurred after the EFI was originally logged.

> > It's a very interesting experiment, but I've always had a nagging
> > doubt about putting transaction reservations both above and below
> > the AIL. In reading this version, I'm having trouble following and
> > understanding the transaction reservation juggling and
> > recalculation complexity that's been introduced to facilitate
> > the stealing that is being done. Yes, I know that I suggested the
> > dynamic stealing approach - it's certainly better than past
> > versions, but it hasn't really addressed my underlying doubt about
> > the relogging concept in general...
> > 
> 
> I think we need to separate discussion around the best solution for the
> quotaoff problem from general doubts about the relog mechanism here. In
> my mind, changing how we address quotaoff doesn't really impact the
> existence of this mechanism because it was never the primary use case.
> It just changes the timeline/dependency/requirements a bit.
> 
> However, you're mentioning "nagging doubts" about the fundamentals of
> how it works, etc., so that suggests there are still concerns around the
> mechanism itself independent from quotaoff. I've sent 5 or so RFCs to
> try and elicit general feedback and address fundamental concerns before
> putting in the effort to solidify the implementation, which was notably
> more time consuming than reworking the RFC. It's quite frustrating to
> see negative feedback broaden at this stage in a manner/pattern that
> suggests the mechanism is not generally acceptable.

Well, my initial response to the very first RFC was:

| [...] I can see how appealing the concept of automatically
| relogging is, but I'm unconvinced that we can make it work,
| especially when there aren't sufficient reservations to relog
| the items that need relogging.

https://lore.kernel.org/linux-xfs/20191024224308.GD4614@dread.disaster.area/

To RFC v4, which was the next version I had time to look at:

| [long list of potential issues]
|
| Given this, I'm skeptical this can be made into a useful, reliable
| generic async relogging mechanism.

https://lore.kernel.org/linux-xfs/20191205210211.GP2695@dread.disaster.area/

Maybe general comments that "I remain unconvinced this will work"
got drowned out by all the other comments I made trying to help you
understand the code and hence make it work.

Don't get me wrong - I really like the idea, but everything I know
is telling me that, as it stands, I don't think it's going to work.
A large part of that doubt is the absence of application level code
that needs it to work in anger....

This is the nature of XFS development, especially in the log. I did
this three times with development of delayed logging - I threw away
prototypes I'd put similar effort into when it became obvious that
there was a fundamental assumption I'd missed deep in the guts of
the code and so the approach I was taking just wouldn't work. At the
time, I had nobody to tell me that my approach might have problems
before I found them out myself - all the deep XFS knowledge and
expertise had been lost over the previous 10 years of brain drain as
SGI flamed out and crashed.

So I know all too well what it feels like to get this far and then
have to start again from the point of having to come up with a
completely different design premise....

> All that being what it is, I'd obviously rather not expend even more
> time if this is going to be met with vague/general reluctance. Do we
> need to go back to the drawing board on the repair use case? If so,
> should we reconsider the approach repair is using to release blocks?
> Perhaps alter this mechanism to address some tangible concerns? Try and
> come up with something else entirely..?

Well, like I said originally: I think relogging really needs to be
done from the perspective of the owner of the logged item so that we
can avoid things like ordering violations in the journal and other
similar issues. i.e. relogging is designed around it being a
function of the high order change algorithms and not something that
can be used to work around high level change algorithms that don't
follow the rules properly...

I really haven't put any thought into how to solve the online-repair
issue. I simply don't have the time to dive into every problem and
come up with potential solutions to them. However, given the context
of this discussion, we can already relog EFIs in a way that
online-repair can use.

Consider that a single transaction that contains an EFD for the
original EFI, and a new EFI for the same extent is effectively
"relogging the EFI". It does so by atomically cancelling the
original EFI in the log and creating a new EFI.

Now, an EFI is defined on disk as:

typedef struct xfs_efi_log_format {
        uint16_t                efi_type;       /* efi log item type */
        uint16_t                efi_size;       /* size of this item */
        uint32_t                efi_nextents;   /* # extents to free */
        uint64_t                efi_id;         /* efi identifier */
        xfs_extent_t            efi_extents[1]; /* array of extents to free */
} xfs_efi_log_format_t;

Which means it can hold up to 2^16-1 individual extents that we
intend to free. We currently only use one extent per EFI, but if we
go back in history, they were dynamically sized structures and
could track arbitrary numbers of extents.

So, repair needs to track multiple nested EFIs?

We cancel the old EFI, log a new EFI with all the old extents and
the new extent in it. We now have a single EFI in the journal
containing N+1 extents in it.
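
As a minimal sketch, that cancel-and-extend step could look something
like the following. This is illustration only: xfs_relog_efi_extend()
is a hypothetical helper, and xfs_trans_get_efi()/xfs_trans_get_efd()
stand in for whatever the current log item interfaces provide:

/*
 * Atomically replace old_efi with a larger EFI in a single
 * transaction. The EFD cancels old_efi in the journal; the new EFI
 * carries all of the old extents plus the one we now intend to free.
 */
STATIC struct xfs_efi_log_item *
xfs_relog_efi_extend(
	struct xfs_trans	*tp,
	struct xfs_efi_log_item	*old_efi,
	xfs_fsblock_t		start,
	xfs_extlen_t		len)
{
	uint			next = old_efi->efi_format.efi_nextents;
	struct xfs_efi_log_item	*new_efi;

	/* log an EFD to cancel the old EFI */
	xfs_trans_get_efd(tp, old_efi, next);

	/* log a new EFI holding the old extents plus the new one */
	new_efi = xfs_trans_get_efi(tp, next + 1);
	memcpy(new_efi->efi_format.efi_extents,
	       old_efi->efi_format.efi_extents,
	       next * sizeof(struct xfs_extent));
	new_efi->efi_format.efi_extents[next].ext_start = start;
	new_efi->efi_format.efi_extents[next].ext_len = len;
	return new_efi;
}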

Further, an EFD with multiple extents in it is -intended to be
relogged- multiple times. Every time we free an extent in the EFI,
we add it to the EFD and relog the EFD. This tells log recovery
that this extent has now been freed, and that it should not replay
it, even though it is still in the EFI.

And to prevent the big EFI from pinning the tail of the log while
EFDs are being processed, we can relog the EFI along with the EFD
each time the EFD is updated, hence we drag the EFI forwards in
every high level transaction roll when we are actually freeing the
extents.

The key to this is that the EFI/EFD relogging must be done entirely
under a single rolling transaction, so there is -always- space
available in the log for both the EFI and the EFDs to be relogged as
the long running operation is performed.
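
Roughly, the rolling pattern described above would look like this
(a sketch only: error handling is elided, and the two relog helpers
are placeholder names for however the EFD update and EFI relog would
be expressed, not existing interfaces):

/* free every extent in the EFI under one rolling transaction */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
for (i = 0; i < efi->efi_format.efi_nextents; i++) {
	struct xfs_extent *ext = &efi->efi_format.efi_extents[i];

	/* free the extent and record its completion in the EFD */
	error = xfs_free_extent(tp, ext->ext_start, ext->ext_len,
				&XFS_RMAP_OINFO_ANY_OWNER, XFS_AG_RESV_NONE);
	xfs_relog_efd_extent(tp, efd, ext);	/* placeholder */

	/* relog the EFI alongside the EFD so it keeps moving forward */
	xfs_relog_efi(tp, efi);			/* placeholder */

	/* roll: commits this step, regrants log space for the next */
	error = xfs_trans_roll(&tp);
}
error = xfs_trans_commit(tp);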

IOWs, the EFI/EFD structures support relogging of the intents at a
design level, and it is intended that this process is entirely
driven from a single rolling transaction context. I strongly suspect
that all the recent EFI/EFD and deferred ops reworking has lost a
lot of this context from the historical EFI/EFD implementation...

So before we go down the path of implementing generic automatic
relogging infrastructure, we first should have been writing the
application code that needs to relog intents and use a mechanism
like the above to cancel and reinsert intents further down the log.
Once we have code that is using these techniques to do bulk
operations, then we can look to optimise/genericise the
infrastructure they use.

> Moving on to quotaoff...
> 
> > I have been spending some time recently in the quota code, so I have
> > a better grip on what it is doing now than I did last time I looked
> > at this relogging code. I never really questioned why the quota code
> > needed two transactions for quota-off, and I'm guessing that nobody
> > else has either. So I spent some time this morning understanding
> > what problem it was actually solving and trying to find an alternate
> > solution to that problem.
> 
> Indeed, I hadn't looked into that.
> 
> > The reason we have the two quota-off transactions is that active
> > dquot modifications at the time quotaoff is started leak past the
> > first quota off transaction that hits the journal. Hence to avoid
> > incorrect replay of those modifications in the journal if we crash
> > after the quota-off item passes out of the journal, we pin the
> > quota-off item in the journal. It gets unpinned by the commit of the
> > second quota-off transaction at completion time, hence defining the
> > window in journal where quota-off is being processed and dquot
> > modifications should be ignored. i.e. there is no window where
> > recovery will replay dquot modifications incorrectly.
> > 
> 
> Ok.
> 
> > However, if the second transaction is left too long, the reservation
> > will fail to find journal space because of the pinned quota-off item.
> > 
> 
> Right.
> 
> > The relogging infrastructure is designed to allow the inital
> > quota-off intent to keep moving forward in the log so it never pins
> > the tail of the log before the second quota-off transaction is run.
> > This tries to avoid the recovery issue because there's always an
> > active quota off item in the log, but I think there may be a flaw
> > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > in the log that were modified after the quota-off started. Hence if
> > we crash after the relogging but while the dquots are still in the
> > log before the relogged quotaoff item, then they will be replayed,
> > possibly incorrectly. i.e. the relogged quota-off item no longer
> > prevents replay of those items.
> > 
> > So while relogging prevents the tail pinning deadlock, I think it
> > may actually result in incorrect recovery behaviour in that items
> > that should be cancelled and not replayed can end up getting
> > replayed.  I'm not sure that this matters for dquots, but for a
> > general mechanism I think the transactional ordering violations it
> > can result in reduce it's usefulness significantly.
> > 
> 
> Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> the early RFCs.

We might have, but I don't recall that. And it would appear nobody
looked at this code in any detail if we did discuss it, so I'd say
the discussion was largely uninformed...

> Log recovery processes the quotaoff intent in pass 1 and
> dquot updates in pass 2, which I thought was intended to handle this
> kind of problem.

Right, it does handle it, but only because there are two quota-off
items in the log. i.e. there are two recovery situations in play here
- 1) quota off in progress and 2) quota off done.

In the first case, only the initial quota-off item is in the log, so
it is needed to be detect to stop replay of relevant dquots that
have been logged after the quota off was started.

The second case has to be broken down into two situations: a) both
quota-off items are active in the log, or b) only the second item is
active in the log as the tail has moved forwards past the first item.

In the case of 2a), it doesn't matter which item recovery sees, it
will cancel the dquot updates correctly. In the case of 2b), the
second quota off item is absolutely necessary to prevent replay of
the dquots in the log before it.

Hence if dquot modifications can leak past the first quota-off item
in the log, then the second item is absolutely necessary to catch
the 2b) case to prevent incorrect replay of dquot buffers.

> If I follow correctly, the recovery issue that warrants pinning the
> quotaoff in the log is not so much an ordering issue, but if the latter
> happens to fall off the end of the log before the last of the dquot
> modifications, recovery could see dquot changes after having lost the
> fact that a quotaoff had occurred at all. The current implementation
> presumably handles this by pinning the quotaoff until all dquots are
> completely purged from existence. The relog mechanism just allows the
> item to move while it effectively remains pinned, so I don't see how it
> introduces recovery issues.

As I said, it may not affect the specific quota-off usage, but we
can't just change the order of items in the physical journal without
care because the journal is supposed to be -strictly ordered-.

Reordering intents in the log automatically without regard to higher
level transactional ordering dependencies of the log items may
violate the ordering rules for journalling and recovery of metadata.
This is why I said automatic relogging may not be useful as generic
infrastructure - if there are dependent log items, then they need to
relogged as an atomic change set that maintains the ordering
dependencies between objects. That's where this automatic mechanism
completely falls down - the ordering dependencies are known only by
the code running the original transaction, not the log items...

> > But back to quota-off: What I've realised is that the only dquot
> > modifications we need to protect against being recovered are the
> > ones that are running at the time the first quota-off is committed
> > to the journal. That is, once the DQACTIVE flags are clear,
> > transactions will not modify those dquots anymore. Hence by the time
> > that the quota off item pins the tail of the log, the transactions
> > that were actively dirtying inodes when it was committed have also
> > committed and are in the journal and there are no actively modified
> > dquots left in memory.
> > 
> 
> I'm not sure how the (sync) commit of the quotaoff guarantees some other
> transaction running in parallel hadn't modified a dquot and committed
> after the quotaoff, but I think I see where you're going in general...

We drained out all the transactions that can be modifying quotas
before we log the quotaoff items. So, by definition, this cannot
happen.

> > IOWs, we don't actually need to wait until we've released and purged
> > all the dquots from memory before we log the second quota off item;
> > all we need to wait for is for all the transactions with dirty
> > dquots to have committed. These transactions already have log
> > reservations, so completing them will free unused reservation space
> > for the second quota off transaction. Once they are committed, then
> > we can log the second item. i.e. we don't have to wait until we've
> > cleaned up the dquots to close out the quota-off transaction in the
> > journal.
> > 
> 
> Ok, so we can deterministically shorten the window with a runtime
> barrier (i.e. disable -> drain) on quota modifying transactions rather
> than relying on the full dquot purge to provide this ordering.

Yup.

> > To make it even more robust, if we stop all the transactions that
> > may dirty dquots and drain the active ones before we log the first
> > quota-off item, we can log the second item immediately afterwards
> > because it is known that there are no dquot modifications in flight
> > when the first item is logged. We can probably even log both items
> > in the same transaction.
> > 
> 
> I was going to ask why we'd even need two items if this approach is
> generally viable.

Because I don't want to change the in-journal appearance of
quota-off to older kernels. Changing how things appear on disk is
dangerous and likely going to bite us in unexpected ways.

> > So, putting my money where my mouth is, the patch below does this.
> > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > of -g quota with all quotas enabled and is currently running a full
> > auto cycle with all quotas enabled. It hasn't let the smoke out
> > after about 4 hours of testing now....
> > 
> 
> Thanks for the patch. First, I like the idea and agree that it's more
> simple than the relogging approach. I do still need to stare at it some
> more to grok it and convince myself it's safe.
> 
> The thing that sticks out to me is tagging all of the transactions that
> modify quotas. Is there any reason we can't just quiesce the transaction
> subsystem entirely as a first step? It's not like quotaoff is common or
> performance sensitive. For example:
>
> 1. stop all transactions, wait to drain, force log
> 2. log the sb/quotaoff synchronously (punching through via something
>    like NO_WRITECOUNT)
> 3. clear the xfs_mount quota active flags
> 4. restart the transaction subsystem (no more dquot mods)
> 5. complete quotaoff via the dquot release and purge sequence

Yup, as I said on #xfs a short while ago:

[3/7/20 01:15] <djwong> qi_active_trans?
[3/7/20 01:15] <djwong> man, we just killed off m_active_trans
[3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
[3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
[3/7/20 08:48] <dchinner> that's a simple, brute force solution
[3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
[3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
[3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
[3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
[3/7/20 09:01] <djwong> oh
[3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons

So, we used to have exactly the functionality I needed in XFS as
general infrastructure, but we've removed it over the past few years
as the VFS has slowly been brought up to feature parity with XFS. I
just implemented what I needed to block/halt quota modifications
because I didn't want to perturb anything else while exploring if my
hypothesis was correct.

The only outstanding thing I haven't checked out fully is the
delayed allocation reservations that aren't done in transaction
contexts. I -think- these are OK because they are in memory only,
and they will be serialised on the inode lock when detaching dquots
(i.e. the existing dquot purging ordering mechanisms) after quotas
are turned off. Hence I think these are fine, but more investigation
will be needed there to confirm behaviour is correct.

> I think it could be worth the tradeoff for the simplicity of not having
> to maintain the transaction reservation tags or the special quota
> waiting infrastructure vs. something like the more generic (recently
> removed) transaction counter. We might even be able to abstract the
> whole thing behind a transaction flag. E.g.:
> 
> 	/*
> 	 * A barrier transaction locks out further transactions and waits on
> 	 * outstanding transactions to drain (i.e. commit) before returning.
> 	 * Everything unlocks when the transaction commits.
> 	 */
> 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> 			XFS_TRANS_BARRIER, &tp);
> 	...

Yup, if we decide that we want to track all active transactions again
rather than just when quota is active, it would make a lot of
sense to make it a formal function of the xfs_trans_alloc() API.
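
For illustration, here's a minimal sketch of such a formalised
barrier. XFS_TRANS_BARRIER, the barrier bit and the m_active_trans
counter are all hypothetical names here (the counter echoing the one
we recently removed), and the flags word would need to be an
unsigned long for the bit ops, as the patch above does for m_qflags:

int
xfs_trans_alloc(
	struct xfs_mount	*mp,
	struct xfs_trans_res	*resp,
	uint			blocks,
	uint			rtextents,
	uint			flags,
	struct xfs_trans	**tpp)
{
	...
	if (flags & XFS_TRANS_BARRIER) {
		/* lock out new transactions, then drain the active ones */
		set_bit(XFS_MOUNT_BARRIER_BIT, &mp->m_flags);
		while (percpu_counter_sum(&mp->m_active_trans) > 0)
			msleep(250);
	} else {
		/* block while a barrier transaction is in progress */
		while (test_bit(XFS_MOUNT_BARRIER_BIT, &mp->m_flags))
			wait_on_bit(&mp->m_flags, XFS_MOUNT_BARRIER_BIT,
					TASK_UNINTERRUPTIBLE);
		percpu_counter_inc(&mp->m_active_trans);
	}
	...
}

Everything would then unlock when the barrier transaction commits,
by clearing the bit with clear_and_wake_up_bit().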

Really, though, I've got so many other things on my plate right now
I don't have the time to take on yet another infrastructure
reworking. I spent the time to write the patch because if I was
going to say I didn't like relogging then it was absolutely
necessary for me to provide an alternative solution to the problem,
but I'm really hoping that it is sufficient for someone else to be
able to pick it up and run with it....

Cheers,

Dave.

PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
the past and run with it, I'm more than happy for you to do so. I've
got way more ideas and prototypes than I've got time to turn into
full production features. I also don't care about "ownership" of the
work; it's better to have someone actively working on the code than
having it sit around waiting for me to find time to get back to
it...

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/10] xfs: automatic log item relog mechanism
  2020-07-01 16:51 ` [PATCH 05/10] xfs: automatic log item relog mechanism Brian Foster
@ 2020-07-03  6:08   ` Dave Chinner
  2020-07-06 16:06     ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-07-03  6:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 12:51:11PM -0400, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it forward in the physical log. The purpose of
> moving the item is to avoid long term tail pinning and thus avoid
> log deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active filesystem under various workloads.
> 
> The act of relogging an item simply requires adding it to a
> transaction and committing it. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items between writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved log reservation to drain the queue by
> rolling a permanent transaction.
> 
> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

A note while it's still fresh in my mind: memory reclaim is going to
force relogging of items whether they need it or not. The inode
shrinker pushes the AIL to its highest current LSN, which means the
first shrinker invocation will relog the items. Sustained memory
pressure will result in this sort of behaviour:

	AIL				AIL relog workqueue
cycle 1:
	relog item
		-> move to relog queue
	relog item
		-> move to relog queue
	....
	relog item
		-> move to relog queue

	queue work to AIL relog workqueue
	<sleep 20ms>

					iterates relog items
					  ->relog
					  commit

cycle 2:
	relog item
		already queued
		marks AIL for log force
	relog item
		already queued
		marks AIL for log force
	....
	relog item
		-> move to relog queue

	<sleep 20ms>

cycle 3:
	xfs_log_force(XFS_LOG_SYNC)
	-> CIL flush
	   log io
	   log IO completes
	   relogged items reinserted in AIL
	....
	relog item
		-> move to relog queue
	relog item
		-> move to relog queue
	....
	relog item
		-> move to relog queue

	queue work to AIL relog workqueue
	<sleep 20ms>

					iterates relog items
					  ->relog
					  commit
<repeat>

So it looks like when there is memory pressure we are going to
trigger a relog every second AIL push cycle, and a synchronous log
force every other log cycle.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-03  0:49     ` Dave Chinner
@ 2020-07-06 16:03       ` Brian Foster
  2020-07-06 17:42         ` Darrick J. Wong
  2020-07-10  4:09         ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-06 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > Here's a v1 (non-RFC) version of the automatic relogging functionality.
> > > > Note that the buffer relogging bits (patches 8-10) are still RFC as I've
> > > > had to hack around some things to utilize it for testing. I include them
> > > > here mostly for reference/discussion. Most of the effort from the last
> > > > rfc post has gone into testing and solidifying the functionality. This
> > > > now survives a traditional fstests regression run as well as a test run
> > > > with random buffer relogging enabled on every test/scratch device mount
> > > > that occurs throughout the fstests cycle. The quotaoff use case is
> > > > additionally tested independently by artificially delaying completion of
> > > > the quotaoff in parallel with many fsstress worker threads.
> > > > 
> > > > The hacks/workarounds to support the random buffer relogging enabled
> > > > fstests run are not included here because they are not associated with
> > > > core functionality, but rather are side effects of randomly relogging
> > > > arbitrary buffers, etc. I can work them into the buffer relogging
> > > > patches if desired, but I'd like to get the core functionality and use
> > > > case worked out before getting too far into the testing code. I also
> > > > know Darrick was interested in the ->iop_relog() callback for some form
> > > > of generic feedback into active dfops processing, so it might be worth
> > > > exploring that further.
> > > > 
> > > > Thoughts, reviews, flames appreciated.
> > > 
> > > Ok I've looked through the code again, and again I've had to pause,
> > > stop and think hard about it because the feeling I've had right from
> > > the start about the automatic relogging concept is stronger than
> > > ever.
> > > 
> > > I think the most constructive way to say what I'm feeling is that I
> > > think this is the wrong approach to solve the quota off problem.
> > > However, I've never been able to come up with an alternative that
> > > also solved the quotaoff problem so I've tried to help make this
> > > relogging concept work.
> > > 
> > 
> > I actually agree that this mechanism is overkill for quotaoff. I
> > probably wouldn't have invested this much time in the first place if
> > that was the only use case. Note that the original relogging concept
> > came about around discussion with Darrick on online btree repair because
> > IIRC the technique we landed on required dropping EFIs (which still have
> > open issues wrt to relogging) in the log for a non-deterministic amount
> > of time on an otherwise active fs. We came up with the concept and I
> > remembered quotaoff had a similar unresolved problem, so simply decided
> > to use that as a vector for the POC because the use case is much
> > simpler.
> 
> Yes, I know the history. That didn't make it any easier for me to
> write what I did, because I know how much time you've put into this
> already.
> 
> w.r.t. EFIs, that comes back to the problem of the relogged items
> jumping over things that have been logged that should appear between
> the EFI and EFD - moving the EFI forward past such dependent items
> is going to be a problem - those changes are going to be replayed
> regardless of whether the EFI needs replaying or not, and hence
> replaying the EFI that got relogged will be out of order with other
> operations that occurred after the EFI was originally logged.
> 
> > > It's a very interesting experiment, but I've always had a nagging
> > > doubt about putting transaction reservations both above and below
> > > the AIL. In reading this version, I'm having trouble following and
> > > understanding the transaction reservation juggling and
> > > recalculation complexity that's been introduced to facilitate
> > > the stealing that is being done. Yes, I know that I suggested the
> > > dynamic stealing approach - it's certainly better than past
> > > versions, but it hasn't really addressed my underlying doubt about
> > > the relogging concept in general...
> > > 
> > 
> > I think we need to separate discussion around the best solution for the
> > quotaoff problem from general doubts about the relog mechanism here. In
> > my mind, changing how we address quotaoff doesn't really impact the
> > existence of this mechanism because it was never the primary use case.
> > It just changes the timeline/dependency/requirements a bit.
> > 
> > However, you're mentioning "nagging doubts" about the fundamentals of
> > how it works, etc., so that suggests there are still concerns around the
> > mechanism itself independent from quotaoff. I've sent 5 or so RFCs to
> > try and elicit general feedback and address fundamental concerns before
> > putting in the effort to solidify the implementation, which was notably
> > more time consuming than reworking the RFC. It's quite frustrating to
> > see negative feedback broaden at this stage in a manner/pattern that
> > suggests the mechanism is not generally acceptable.
> 
> Well, my initial response to the very first RFC was:
> 
> | [...] I can see how appealing the concept of automatically
> | relogging is, but I'm unconvinced that we can make it work,
> | especially when there aren't sufficient reservations to relog
> | the items that need relogging.
> 
> https://lore.kernel.org/linux-xfs/20191024224308.GD4614@dread.disaster.area/
> 
> To RFC v4, which was the next version I had time to look at:
> 
> | [long list of potential issues]
> |
> | Given this, I'm skeptical this can be made into a useful, reliable
> | generic async relogging mechanism.
> 
> https://lore.kernel.org/linux-xfs/20191205210211.GP2695@dread.disaster.area/
> 
> Maybe general comments that "I remain unconvinced this will work"
> got drowned out by all the other comments I made trying to help you
> understand the code and hence make it work.
> 

I explicitly worked through those issues to the point where, as best
I can tell, the mechanism works.

> Don't get me wrong - I really like the idea, but everything I know
> is telling me that, as it stands, I don't think it's going to work.
> A large part of that doubt is the absence of application level code
> that needs it to work in anger....
> 
> This is the nature of XFS development, especially in the log. I did
> this three times with development of delayed logging - I threw away
> prototypes I'd put similar effort into when it became obvious that
> there was a fundamental assumption I'd missed deep in the guts of
> the code and so the approach I was taking just wouldn't work. At the
> time, I had nobody to tell me that my approach might have problems
> before I found them out myself - all the deep XFS knowledge and
> expertise had been lost over the previous 10 years of brain drain as
> SGI flamed out and crashed.
> 
> So I know all too well what it feels like to get this far and then
> have to start again from the point of having to come up with a
> completely different design premise....
> 

I think you're misinterpreting my response. I don't mind having to move
on from or reinvent a design premise of an RFC, even after six versions,
because those have intentionally avoided the productization effort
(i.e., broad testing, stabilization, considering error conditions, code
documentation, etc.) that is so time consuming. I don't want to get too
far into the weeds on this topic given that I'm going to table this
series, but I'll try to summarize my thoughts for the benefit of next
time...

If the approach of some feature is generally not acceptable (as in "I'm
not comfortable with the approach" or "I think it should be done another
way"), that is potentially subjective but certainly valid feedback. I
might or might not debate that feedback, but that's at least an honest
debate where stances are clear. I'm certainly not going to try and
stabilize something I know that one or more key upstream contributors do
not agree with (unless I can convince them otherwise). If the feedback
is "I'm skeptical it works because of items 1, 2, 3," that means the
developer is likely to look through those issues and try to prove or
disprove whether the mechanism works based on that insight.

It appears to me that the issue here is not really whether the mechanism
works or not, but rather for one reason or another, you aren't
comfortable with the approach. That's fair enough, but that feedback
would have been far more useful on the previous RFC. It's not really
different from this version from a design perspective. To be fair, I
could have pinged that version since there wasn't much feedback and I
know folks are busy. I'll probably request some kind of informal design
ack in the future to help avoid this miscommunication...

> > All that being what it is, I'd obviously rather not expend even more
> > time if this is going to be met with vague/general reluctance. Do we
> > need to go back to the drawing board on the repair use case? If so,
> > should we reconsider the approach repair is using to release blocks?
> > Perhaps alter this mechanism to address some tangible concerns? Try and
> > come up with something else entirely..?
> 
> Well, like I said originally: I think relogging really needs to be
> done from the perspective of the owner of the logged item so that we
> can avoid things like ordering violations in the journal and other
> similar issues. i.e. relogging is designed around it being a
> function of the high order change algorithms and not something that
> can be used to work around high level change algorithms that don't
> follow the rules properly...
> 

I'm curious if you have thoughts around what that might look like.
Perhaps using quotaoff just as an example..? (Obviously we'd not
implement that over the current proposal..).

> I really haven't put any thought into how to solve the online-repair
> issue. I simply don't have the time to dive into every problem and
> come up with potential solutions to them. However, given the context
> of this discussion, we can already relog EFIs in a way that
> online-repair can use.
> 

I wasn't necessarily asking for a solution, but rather trying to figure
out where things stand on both relogging and repair given a large part
of the original feedback was focused on quotaoff. It sounds like
relogging is still an option in general terms, but in a manner that is
more intertwined with the originating transaction context/reservation.

> Consider that a single transaction that contains an EFD for the
> original EFI, and a new EFI for the same extent is effectively
> "relogging the EFI". It does so by atomically cancelling the
> original EFI in the log and creating a new EFI.
> 

Right. This is how dfops currently works IIRC.

> Now, an EFI is defined on disk as:
> 
> typedef struct xfs_efi_log_format {
>         uint16_t                efi_type;       /* efi log item type */
>         uint16_t                efi_size;       /* size of this item */
>         uint32_t                efi_nextents;   /* # extents to free */
>         uint64_t                efi_id;         /* efi identifier */
>         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> } xfs_efi_log_format_t;
> 
> Which means it can hold up to 2^16-1 individual extents that we
> intend to free. We currently only use one extent per EFI, but if we
> go back in history, they were dynamically sized structures and
> could track arbitrary numbers of extents.
> 
> So, repair needs to track multiple nested EFIs?
> 
> We cancel the old EFI, log a new EFI with all the old extents and
> the new extent in it. We now have a single EFI in the journal
> containing N+1 extents in it.
> 

That's an interesting optimization.
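
For my own reference, here's a sketch of roughly how I understand that
cancel/reinsert step working in a single transaction. The example_*()
helpers are made up for illustration; only the EFI/EFD log item types
and xfs_trans_commit() are real:

	/*
	 * Cancel the old EFI with an EFD and log a replacement EFI
	 * that carries the old extents plus one new one. Everything
	 * prefixed example_ is hypothetical.
	 */
	int
	example_relog_efi(
		struct xfs_trans	*tp,
		struct xfs_efi_log_item	*old_efi,
		xfs_fsblock_t		new_start,
		xfs_extlen_t		new_len)
	{
		struct xfs_efi_log_item	*new_efi;
		uint			nextents = old_efi->efi_format.efi_nextents;
		uint			i;

		/* the EFD atomically cancels the old EFI in the log... */
		example_trans_log_efd(tp, old_efi);

		/* ...and the new EFI holds all N+1 extents */
		new_efi = example_trans_get_efi(tp, nextents + 1);
		for (i = 0; i < nextents; i++)
			example_trans_log_efi_extent(tp, new_efi,
				old_efi->efi_format.efi_extents[i].ext_start,
				old_efi->efi_format.efi_extents[i].ext_len);
		example_trans_log_efi_extent(tp, new_efi, new_start, new_len);

		return xfs_trans_commit(tp);
	}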

> Further, an EFD with multiple extents in it is -intended to be
> relogged- multiple times. Every time we free an extent in the EFI,
> we remove it from the EFD and relog the EFD. This tells log recovery
> that this extent has now been freed, and that it should not replay
> it, even though it is still in the EFI.
> 
> And to prevent the big EFI from pinning the tail of the log while
> EFDs are being processed, we can relog the EFI along with the EFD
> each time the EFD is updated, hence we drag the EFI forwards in
> every high level transaction roll when we are actually freeing the
> extents.
> 

Hmm.. I'm not sure that addresses the deadlock problem for repair. That
assumes that EFD updates come at regular enough intervals to keep the
tail moving, but IIRC the bulk loading infrastructure will essentially
log a bunch of EFIs, spend a non-deterministic amount of time doing
work, then log the associated EFDs. So there's still a period of time in
there where we might need to relog intents that aren't otherwise being
updated.

Darrick might want to chime in here in case I'm missing something...
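
If I'm following the model correctly, the rolling loop would look
something like the sketch below (example_*() helpers hypothetical
again; real code would also need to re-join both items to the new
transaction after each roll):

	/*
	 * One EFI covering N extents; each iteration frees an
	 * extent, records it in the EFD, drags the EFI forward and
	 * rolls so neither item can pin the tail of the log.
	 */
	int
	example_free_extents(
		struct xfs_trans	**tpp,
		struct xfs_efi_log_item	*efi,
		struct xfs_efd_log_item	*efd)
	{
		uint	i;
		int	error;

		for (i = 0; i < efi->efi_format.efi_nextents; i++) {
			error = example_free_one_extent(*tpp,
					&efi->efi_format.efi_extents[i]);
			if (error)
				return error;

			/* tell recovery this extent is done... */
			example_trans_log_efd_extent(*tpp, efd, i);
			/* ...and relog the EFI alongside the EFD */
			example_trans_relog_efi(*tpp, efi);

			error = xfs_trans_roll(tpp);
			if (error)
				return error;
		}
		return xfs_trans_commit(*tpp);
	}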

> The key to this is that the EFI/EFD relogging must be done entirely
> under a single rolling transaction, so there is -always- space
> available in the log for both the EFI and the EFDs to be relogged as
> the long running operation is performed.
> 
> IOWs, the EFI/EFD structures support relogging of the intents at a
> design level, and it is intended that this process is entirely
> driven from a single rolling transaction context. I strongly suspect
> that all the recent EFI/EFD and deferred ops reworking has lost a
> lot of this context from the historical EFI/EFD implementation...
> 
> So before we go down the path of implementing generic automatic
> relogging infrastructure, we first should have been writing the
> application code that needs to relog intents and use a mechanism
> like the above to cancel and reinsert intents further down the log.
> Once we have code that is using these techniques to do bulk
> operations, then we can look to optimise/genericise the
> infrastructure they use.
> 
> > Moving on to quotaoff...
> > 
> > > I have been spending some time recently in the quota code, so I have
> > > a better grip on what it is doing now than I did last time I looked
> > > at this relogging code. I never really questioned why the quota code
> > > needed two transactions for quota-off, and I'm guessing that nobody
> > > else has either. So I spent some time this morning understanding
> > > what problem it was actually solving and trying to find an alternate
> > > solution to that problem.
> > 
> > Indeed, I hadn't looked into that.
> > 
> > > The reason we have the two quota-off transactions is that active
> > > dquot modifications at the time quotaoff is started leak past the
> > > first quota off transaction that hits the journal. Hence to avoid
> > > incorrect replay of those modifications in the journal if we crash
> > > after the quota-off item passes out of the journal, we pin the
> > > quota-off item in the journal. It gets unpinned by the commit of the
> > > second quota-off transaction at completion time, hence defining the
> > > window in journal where quota-off is being processed and dquot
> > > modifications should be ignored. i.e. there is no window where
> > > recovery will replay dquot modifications incorrectly.
> > > 
> > 
> > Ok.
> > 
> > > However, if the second transaction is left too long, the reservation
> > > will fail to find journal space because of the pinned quota-off item.
> > > 
> > 
> > Right.
> > 
> > > The relogging infrastructure is designed to allow the initial
> > > quota-off intent to keep moving forward in the log so it never pins
> > > the tail of the log before the second quota-off transaction is run.
> > > This tries to avoid the recovery issue because there's always an
> > > active quota off item in the log, but I think there may be a flaw
> > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > in the log that were modified after the quota-off started. Hence if
> > > we crash after the relogging but while the dquots are still in the
> > > log before the relogged quotaoff item, then they will be replayed,
> > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > prevents replay of those items.
> > > 
> > > So while relogging prevents the tail pinning deadlock, I think it
> > > may actually result in incorrect recovery behaviour in that items
> > > that should be cancelled and not replayed can end up getting
> > > replayed.  I'm not sure that this matters for dquots, but for a
> > > general mechanism I think the transactional ordering violations it
> > > can result in reduce its usefulness significantly.
> > > 
> > 
> > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > the early RFCs.
> 
> We might have, but I don't recall that. And it would appear nobody
> looked at this code in any detail if we did discuss it, so I'd say
> the discussion was largely uninformed...
> 
> > Log recovery processes the quotaoff intent in pass 1 and
> > dquot updates in pass 2, which I thought was intended to handle this
> > kind of problem.
> 
> Right, it does handle it, but only because there are two quota-off
> items in the log. i.e.  There's two recovery situations in play here
> - 1) quota off in progress and 2) quota off done.
> 
> In the first case, only the initial quota-off item is in the log, so
> it needs to be detected to stop replay of relevant dquots that
> have been logged after the quota off was started.
> 
> The second case has to be broken down into two situations: a) both quota-off items
> are active in the log, or b) only the second item is active in the log
> as the tail has moved forwards past the first item.
> 
> In the case of 2a), it doesn't matter which item recovery sees, it
> will cancel the dquot updates correctly. In the case of 2b), the
> second quota off item is absolutely necessary to prevent replay of
> the dquots in the log before it.
> 
> Hence if dquot modifications can leak past the first quota-off item
> in the log, then the second item is absolutely necessary to catch
> the 2b) case to prevent incorrect replay of dquot buffers.
> 

Ok, but we're talking specifically about log recovery after quotaoff has
completed but before both intents have fallen off of the log. Relogging
of the initial intent (re: the original comment above about incorrect
recovery behavior) has no impact on this general ordering between the
start/end intents or dquot changes and the end intent.

> > If I follow correctly, the recovery issue that warrants pinning the
> > quotaoff in the log is not so much an ordering issue, but if the latter
> > happens to fall off the end of the log before the last of the dquot
> > modifications, recovery could see dquot changes after having lost the
> > fact that a quotaoff had occurred at all. The current implementation
> > presumably handles this by pinning the quotaoff until all dquots are
> > completely purged from existence. The relog mechanism just allows the
> > item to move while it effectively remains pinned, so I don't see how it
> > introduces recovery issues.
> 
> As I said, it may not affect the specific quota-off usage, but we
> can't just change the order of items in the physical journal without
> care because the journal is supposed to be -strictly ordered-.
> 

The mechanism itself is intended to target specific instances of log
items. Each use case should be evaluated for correctness on its own,
just like one would with ordered buffers or some other internal low
level construct that changes behavior.

> Reordering intents in the log automatically without regard to higher
> level transactional ordering dependencies of the log items may
> violate the ordering rules for journalling and recovery of metadata.
> This is why I said automatic relogging may not be useful as generic
> infrastructure - if there are dependent log items, then they need to
> relogged as an atomic change set that maintains the ordering
> dependencies between objects. That's where this automatic mechanism
> completely falls down - the ordering dependencies are known only by
> the code running the original transaction, not the log items...
> 

This and the above suggest to me that you're treating automatic relogging
as though it would just be enabled by default on all intents, reordering
things arbitrarily. That is not the case as things would certainly
break, just like what would happen if ordered buffers were enabled by
default. The mechanism is per log item and context specific. It is
"generic" in the sense that there are (were) multiple use cases for it,
not that it should be used arbitrarily or "without care."

Use cases that have very particular ordering requirements across certain
sets of items should probably not enable this mechanism on those items
or otherwise verify that relogging a particular item is safe. The
potential example of this ordering problem being cited is quotaoff, but
we've already gone through this example multiple times and established
that relogging the quotaoff start item is safe.

All that said, extending a relogging notification somehow to a
particular context has always been a consideration because 1.) direct
EFI relogging would require log recovery changes and 2.) there was yet
another potential use case where dfops needed to know whether to relog a
particular intent in a long running chain to avoid some issue (the
details of which escape me). I think issue #1 is not complicated to
address, but creates a backwards incompatibility for log recovery. Issue
#2 would potentially separate out relogging as a notification mechanism
from the reservation management bits, but it's still not clear to me
what that notification mechanism would look like for a transaction that
has already been committed by some caller context.

I think Darrick was looking at repurposing ->iop_relog() for that one so
I'd be curious to know what that is looking like in general...
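
For reference, the rough shape I'd imagine for such a notification is
something like the below. The signature and the list linkage are
speculative and not what this series currently implements:

	/*
	 * Imagine an optional ->iop_relog(lip, tp) in xfs_item_ops:
	 * the caller supplies a donor transaction and each item's
	 * own code decides how to relog itself into it. The list
	 * walk below hand-waves how the items are tracked.
	 */
	static void
	example_relog_items(
		struct xfs_trans	*tp,
		struct list_head	*relog_list)
	{
		struct xfs_log_item	*lip;

		list_for_each_entry(lip, relog_list, li_trans)
			lip->li_ops->iop_relog(lip, tp);
	}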

> > > But back to quota-off: What I've realised is that the only dquot
> > > modifications we need to protect against being recovered are the
> > > ones that are running at the time the first quota-off is committed
> > > to the journal. That is, once the DQACTIVE flags are clear,
> > > transactions will not modify those dquots anymore. Hence by the time
> > > that the quota off item pins the tail of the log, the transactions
> > > that were actively dirtying inodes when it was committed have also
> > > committed and are in the journal and there are no actively modified
> > > dquots left in memory.
> > > 
> > 
> > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > transaction running in parallel hadn't modified a dquot and committed
> > after the quotaoff, but I think I see where you're going in general...
> 
> We drained out all the transactions that can be modifying quotas
> before we log the quotaoff items. So, by definition, this cannot
> happen.
> 
> > > IOWs, we don't actually need to wait until we've released and purged
> > > all the dquots from memory before we log the second quota off item;
> > > all we need to wait for is for all the transactions with dirty
> > > dquots to have committed. These transactions already have log
> > > reservations, so completing them will free unused reservation space
> > > for the second quota off transaction. Once they are committed, then
> > > we can log the second item. i.e. we don't have to wait until we've
> > > cleaned up the dquots to close out the quota-off transaction in the
> > > journal.
> > > 
> > 
> > Ok, so we can deterministically shorten the window with a runtime
> > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > than relying on the full dquot purge to provide this ordering.
> 
> Yup.
> 
> > > To make it even more robust, if we stop all the transactions that
> > > may dirty dquots and drain the active ones before we log the first
> > > quota-off item, we can log the second item immediately afterwards
> > > because it is known that there are no dquot modifications in flight
> > > when the first item is logged. We can probably even log both items
> > > in the same transaction.
> > > 
> > 
> > I was going to ask why we'd even need two items if this approach is
> > generally viable.
> 
> Because I don't want to change the in-journal appearance of
> quota-off to older kernels. Changing how things appear on disk is
> dangerous and likely going to bite us in unexpected ways.
> 

Well combining them into a single transaction doesn't guarantee ordering
of the two, right? So it might not be worth doing that either if we're
concerned about log appearance. Regardless, those potential steps can be
evaluated independently on top of the core runtime fixes.

> > > So, putting my money where my mouth is, the patch below does this.
> > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > of -g quota with all quotas enabled and is currently running a full
> > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > after about 4 hours of testing now....
> > > 
> > 
> > Thanks for the patch. First, I like the idea and agree that it's more
> > simple than the relogging approach. I do still need to stare at it some
> > more to grok it and convince myself it's safe.
> > 
> > The thing that sticks out to me is tagging all of the transactions that
> > modify quotas. Is there any reason we can't just quiesce the transaction
> > subsystem entirely as a first step? It's not like quotaoff is common or
> > performance sensitive. For example:
> >
> > 1. stop all transactions, wait to drain, force log
> > 2. log the sb/quotaoff synchronously (punching through via something
> >    like NO_WRITECOUNT)
> > 3. clear the xfs_mount quota active flags
> > 4. restart the transaction subsystem (no more dquot mods)
> > 5. complete quotaoff via the dquot release and purge sequence
> 
> Yup, as I said on #xfs a short while ago:
> 
> [3/7/20 01:15] <djwong> qi_active_trans?
> [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> [3/7/20 09:01] <djwong> oh
> [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> 
> So, we used to have exactly the functionality I needed in XFS as
> general infrastructure, but we've removed it over the past few years
> as the VFS has slowly been brought up to feature parity with XFS. I
> just implemented what I needed to block/halt quota modifications
> because I didn't want to perturb anything else while exploring if my
> hypothesis was correct.
> 

Ok.

> The only outstanding thing I haven't checked out fully is the
> delayed allocation reservations that aren't done in transaction
> contexts. I -think- these are OK because they are in memory only,
> and they will be serialised on the inode lock when detaching dquots
> (i.e. the existing dquot purging ordering mechanisms) after quotas
> are turned off. Hence I think these are fine, but more investigation
> will be needed there to confirm behaviour is correct.
> 

Yep.

> > I think it could be worth the tradeoff for the simplicity of not having
> > to maintain the transaction reservation tags or the special quota
> > waiting infrastructure vs. something like the more generic (recently
> > removed) transaction counter. We might even be able to abstract the
> > whole thing behind a transaction flag. E.g.:
> > 
> > 	/*
> > 	 * A barrier transaction locks out further transactions and waits on
> > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > 	 * Everything unlocks when the transaction commits.
> > 	 */
> > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > 			XFS_TRANS_BARRIER, &tp);
> > 	...
> 
> Yup, if we decide that we want to track all active transactions again
> rather than just when quota is active, it would make a lot of
> sense to make it a formal function of the xfs_trans_alloc() API.
> 
> Really, though, I've got so many other things on my plate right now
> I don't have the time to take on yet another infrastructure
> reworking. I spent the time to write the patch because if I was
> going to say I didn't like relogging then it was absolutely
> necessary for me to provide an alternative solution to the problem,
> but I'm really hoping that it is sufficient for someone else to be
> able to pick it up and run with it....
> 

Ok, I can take a look at this since I need to step back and rethink this
particular feature anyways.
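
FWIW, the barrier flag could probably be built on the same primitive
the VFS freeze path uses. A minimal sketch, assuming a hypothetical
percpu rwsem wrapped around transaction allocation (none of these
names exist in the tree):

	/* ordinary transactions are readers, the barrier is the writer */
	static DEFINE_STATIC_PERCPU_RWSEM(example_trans_sem);

	static void example_trans_start(void)
	{
		/* cheap in the fast path, blocks while a barrier is held */
		percpu_down_read(&example_trans_sem);
	}

	static void example_trans_end(void)
	{
		percpu_up_read(&example_trans_sem);
	}

	static void example_trans_barrier(void)
	{
		/* locks out new transactions and drains active ones */
		percpu_down_write(&example_trans_sem);
		/* ... e.g. log the sb/quotaoff items here ... */
		percpu_up_write(&example_trans_sem);
	}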

Brian

> Cheers,
> 
> Dave.
> 
> PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> the past and run with it, I'm more than happy for you to do so. I've
> got way more ideas and prototypes than I've got time to turn into
> full production features. I also don't care about "ownership" of the
> work; it's better to have someone actively working on the code than
> having it sit around waiting for me to find time to get back to
> it...
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 05/10] xfs: automatic log item relog mechanism
  2020-07-03  6:08   ` Dave Chinner
@ 2020-07-06 16:06     ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-06 16:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 03, 2020 at 04:08:23PM +1000, Dave Chinner wrote:
> On Wed, Jul 01, 2020 at 12:51:11PM -0400, Brian Foster wrote:
> > Now that relog reservation is available and relog state tracking is
> > in place, all that remains to automatically relog items is the relog
> > mechanism itself. An item with relogging enabled is basically pinned
> > from writeback until relog is disabled. Instead of being written
> > back, the item must instead be periodically committed in a new
> > transaction to move it forward in the physical log. The purpose of
> > moving the item is to avoid long term tail pinning and thus avoid
> > log deadlocks for long running operations.
> > 
> > The ideal time to relog an item is in response to tail pushing
> > pressure. This accommodates the current workload at any given time
> > as opposed to a fixed time interval or log reservation heuristic,
> > which risks performance regression. This is essentially the same
> > heuristic that drives metadata writeback. XFS already implements
> > various log tail pushing heuristics that attempt to keep the log
> > progressing on an active filesystem under various workloads.
> > 
> > The act of relogging an item simply requires adding it to a
> > transaction and committing. This pushes the already dirty item into a
> > subsequent log checkpoint and frees up its previous location in the
> > on-disk log. Joining an item to a transaction of course requires
> > locking the item first, which means we have to be aware of
> > type-specific locks and lock ordering wherever the relog takes
> > place.
> > 
> > Fundamentally, this points to xfsaild as the ideal location to
> > process relog enabled items. xfsaild already processes log resident
> > items, is driven by log tail pushing pressure, processes arbitrary
> > log item types through callbacks, and is sensitive to type-specific
> > locking rules by design. The fact that automatic relogging
> > essentially diverts items between writeback or relog also suggests
> > xfsaild as an ideal location to process items one way or the other.
> > 
> > Of course, we don't want xfsaild to process transactions as it is a
> > critical component of the log subsystem for driving metadata
> > writeback and freeing up log space. Therefore, similar to how
> > xfsaild builds up a writeback queue of dirty items and queues writes
> > asynchronously, make xfsaild responsible only for directing pending
> > relog items into an appropriate queue and create an async
> > (workqueue) context for processing the queue. The workqueue context
> > utilizes the pre-reserved log reservation to drain the queue by
> > rolling a permanent transaction.
> > 
> > Update the AIL pushing infrastructure to support a new RELOG item
> > state. If a log item push returns the relog state, queue the item
> > for relog instead of writeback. On completion of a push cycle,
> > schedule the relog task at the same point metadata buffer I/O is
> > submitted. This allows items to be relogged automatically under the
> > same locking rules and pressure heuristics that govern metadata
> > writeback.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> A note while it's still fresh in my mind: memory reclaim is going to
> force relogging of items whether they need it or not. The inode
> shrinker pushes the AIL to its highest current LSN, which means the
> first shrinker invocation will relog the items. Sustained memory
> pressure will result in this sort of behaviour
> 
...
> 
> So it looks like when there is memory pressure we are going to
> trigger a relog every second AIL push cycle, and a synchronous log
> force every other log cycle.
> 

Indeed. I went back and forth on how to report the status of already
relogged items, so that bit is somewhat accidental. I could probably
remove the log force increment from the RELOG_QUEUED condition and let
the item fall back to whatever state is most appropriate, since I
didn't have an explicit reason for it other than trying to preserve
behavior from previous versions.

Ultimately even though much of this code is implemented around the AIL,
the AIL fundamentally serves as a notification mechanism to identify
when to relog items. All we really need is some indication that an item
is being pushed due to reservation pressure, whether it be a log item
state bit or a function callout, etc., and much of the rest of the
implementation could be lifted out into a separate mechanism. IOW, if
the push frequency of an item is too crude to drive relogs by itself,
that could probably be addressed by filtering the feedback mechanism to
exclude non head/tail related pressure. For example, the push code could
consider the heuristic implemented in xlog_grant_push_ail() to determine
whether to relog items at the tail, whether to force the log for already
relogged items, or just fall back to a traditional state.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-06 16:03       ` Brian Foster
@ 2020-07-06 17:42         ` Darrick J. Wong
  2020-07-07 11:37           ` Brian Foster
  2020-07-10  4:09         ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2020-07-06 17:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > > > > Hi all,
> > > > > 
> > > > > Here's a v1 (non-RFC) version of the automatic relogging functionality.
> > > > > Note that the buffer relogging bits (patches 8-10) are still RFC as I've
> > > > > had to hack around some things to utilize it for testing. I include them
> > > > > here mostly for reference/discussion. Most of the effort from the last
> > > > > rfc post has gone into testing and solidifying the functionality. This
> > > > > now survives a traditional fstests regression run as well as a test run
> > > > > with random buffer relogging enabled on every test/scratch device mount
> > > > > that occurs throughout the fstests cycle. The quotaoff use case is
> > > > > additionally tested independently by artificially delaying completion of
> > > > > the quotaoff in parallel with many fsstress worker threads.
> > > > > 
> > > > > The hacks/workarounds to support the random buffer relogging enabled
> > > > > fstests run are not included here because they are not associated with
> > > > > core functionality, but rather are side effects of randomly relogging
> > > > > arbitrary buffers, etc. I can work them into the buffer relogging
> > > > > patches if desired, but I'd like to get the core functionality and use
> > > > > case worked out before getting too far into the testing code. I also
> > > > > know Darrick was interested in the ->iop_relog() callback for some form
> > > > > of generic feedback into active dfops processing, so it might be worth
> > > > > exploring that further.
> > > > > 
> > > > > Thoughts, reviews, flames appreciated.
> > > > 
> > > > Ok I've looked through the code again, and again I've had to pause,
> > > > stop and think hard about it because the feeling I've had right from
> > > > the start about the automatic relogging concept is stronger than
> > > > ever.
> > > > 
> > > > I think the most constructive way to say what I'm feeling is that I
> > > > think this is the wrong approach to solve the quota off problem.
> > > > However, I've never been able to come up with an alternative that
> > > > also solved the quotaoff problem so I've tried to help make this
> > > > relogging concept work.
> > > > 
> > > 
> > > I actually agree that this mechanism is overkill for quotaoff. I
> > > probably wouldn't have invested this much time in the first place if
> > > that was the only use case. Note that the original relogging concept
> > > came about around discussion with Darrick on online btree repair because
> > > IIRC the technique we landed on required dropping EFIs (which still have
> > > open issues wrt to relogging) in the log for a non-deterministic amount
> > > of time on an otherwise active fs. We came up with the concept and I
> > > remembered quotaoff had a similar unresolved problem, so simply decided
> > > to use that as a vector for the POC because the use case is much
> > > simpler.
> > 
> > Yes, I know the history. That didn't make it any easier for me to
> > write what I did, because I know how much time you've put into this
> > already.
> > 
> > w.r.t. EFIs, that comes back to the problem of the relogged items
> > jumping over things that have been logged that should appear between
> > the EFI and EFD - moving the EFI forward past such dependent items
> > is going to be a problem - those changes are going to replayed
> > regardless of whether the EFI needs replaying or not, and hence
> > replaying the EFI that got relogged will be out of order with other
> > operations that occurred after the EFI was originally logged.
> > 
> > > > It's a very interesting experiment, but I've always had a nagging
> > > > doubt about putting transaction reservations both above and below
> > > > the AIL. In reading this version, I'm having trouble following and
> > > > understanding the transaction reservation juggling and
> > > > recalculation complexity that's been introduced to facilitate
> > > > the stealing that is being done. Yes, I know that I suggested the
> > > > dynamic stealing approach - it's certainly better than past
> > > > versions, but it hasn't really addressed my underlying doubt about
> > > > the relogging concept in general...
> > > > 
> > > 
> > > I think we need to separate discussion around the best solution for the
> > > quotaoff problem from general doubts about the relog mechanism here. In
> > > my mind, changing how we address quotaoff doesn't really impact the
> > > existence of this mechanism because it was never the primary use case.
> > > It just changes the timeline/dependency/requirements a bit.
> > > 
> > > However, you're mentioning "nagging doubts" about the fundamentals of
> > > how it works, etc., so that suggests there are still concerns around the
> > > mechanism itself independent from quotaoff. I've sent 5 or so RFCs to
> > > try and elicit general feedback and address fundamental concerns before
> > > putting in the effort to solidify the implementation, which was notably
> > > more time consuming than reworking the RFC. It's quite frustrating to
> > > see negative feedback broaden at this stage in a manner/pattern that
> > > suggests the mechanism is not generally acceptable.
> > 
> > Well, my initial response to the very first RFC was:
> > 
> > | [...] I can see how appealing the concept of automatically
> > | relogging is, but I'm unconvinced that we can make it work,
> > | especially when there aren't sufficient reservations to relog
> > | the items that need relogging.
> > 
> > https://lore.kernel.org/linux-xfs/20191024224308.GD4614@dread.disaster.area/
> > 
> > To RFC v4, which was the next version I had time to look at:
> > 
> > | [long list of potential issues]
> > |
> > | Given this, I'm skeptical this can be made into a useful, reliable
> > | generic async relogging mechanism.
> > 
> > https://lore.kernel.org/linux-xfs/20191205210211.GP2695@dread.disaster.area/
> > 
> > Maybe general comments that "I remain unconvinced this will work"
> > got drowned out by all the other comments I made trying to help you
> > understand the code and hence make it work.
> > 
> 
> I explicitly worked through those issues to the point where, as best I
> can tell, the mechanism works.
> 
> > Don't get me wrong - I really like the idea, but everything I know
> > is telling me that, as it stands, I don't think it's going to work.
> > A large part of that doubt is the absence of application level code
> > that needs it to work in anger....
> > 
> > This is the nature of XFS development, especially in the log. I did
> > this three times with development of delayed logging - I threw away
> > prototypes I'd put similar effort into when it became obvious that
> > there was a fundamental assumption I'd missed deep in the guts of
> > the code and so the approach I was taking just wouldn't work. At the
> > time, I had nobody to tell me that my approach might have problems
> > before I found them out myself - all the deep XFS knowledge and
> > expertise had been lost over the previous 10 years of brain drain as
> > SGI flamed out and crashed.
> > 
> > So I know all too well what it feels like to get this far and then
> > have to start again from the point of having to come up with a
> > completely different design premise....
> > 
> 
> I think you're misinterpreting my response. I don't mind having to move
> on from or reinvent a design premise of an RFC, even after six versions,
> because those have intentionally avoided the productization effort
> (i.e., broad testing, stabilization, considering error conditions, code
> documentation, etc.) that is so time consuming. I don't want to get too
> far into the weeds on this topic given that I'm going to table this
> series, but I'll try to summarize my thoughts for the benefit of next
> time...
> 
> If the approach of some feature is generally not acceptable (as in "I'm
> not comfortable with the approach" or "I think it should be done another
> way"), that is potentially subjective but certainly valid feedback. I
> might or might not debate that feedback, but that's at least an honest
> debate where stances are clear. I'm certainly not going to try and
> stabilize something I know that one or more key upstream contributors do
> not agree with (unless I can convince them otherwise). If the feedback
> is "I'm skeptical it works because of items 1, 2, 3," that means the
> developer is likely to look through those issues and try to prove or
> disprove whether the mechanism works based on that insight.
> 
> It appears to me that the issue here is not really whether the mechanism
> works or not, but rather for one reason or another, you aren't
> comfortable with the approach. That's fair enough, but that feedback
> would have been far more useful on the previous RFC. It's not really
> different from this version from a design perspective. To be fair, I
> could have pinged that version since there wasn't much feedback and I
> know folks are busy. I'll probably request some kind of informal design
> ack in the future to help avoid this miscommunication...
> 
> > > All that being what it is, I'd obviously rather not expend even more
> > > time if this is going to be met with vague/general reluctance. Do we
> > > need to go back to the drawing board on the repair use case? If so,
> > > should we reconsider the approach repair is using to release blocks?
> > > Perhaps alter this mechanism to address some tangible concerns? Try and
> > > come up with something else entirely..?
> > 
> > Well, like I said originally: I think relogging really needs to be
> > done from the perspective of the owner of the logged item so that we
> > can avoid things like ordering violations in the journal and other
> > similar issues. i.e. relogging is designed around it being a
> > function of the high order change algorithms and not something that
> > can be used to work around high level change algorithms that don't
> > follow the rules properly...
> > 
> 
> I'm curious if you have thoughts around what that might look like.
> Perhaps using quotaoff just as an example..? (Obviously we'd not
> implement that over the current proposal..).
> 
> > I really haven't put any thought into how to solve the online-repair
> > issue. I simply don't have the time to dive into every problem and
> > come up with potential solutions to them. However, given the context
> > of this discussion, we can already relog EFIs in a way that
> > online-repair can use.
> > 
> 
> I wasn't necessarily asking for a solution, but rather trying to figure
> out where things stand on both relogging and repair given a large part
> of the original feedback was focused on quotaoff. It sounds like
> relogging is still an option in general terms, but in a manner that is
> more intertwined with the originating transaction context/reservation.
> 
> > Consider that a single transaction that contains an EFD for the
> > original EFI, and a new EFI for the same extent is effectively
> > "relogging the EFI". It does so by atomically cancelling the
> > original EFI in the log and creating a new EFI.
> > 
> 
> Right. This is how dfops currently works IIRC.
> 
> > Now, an EFI is defined on disk as:
> > 
> > typedef struct xfs_efi_log_format {
> >         uint16_t                efi_type;       /* efi log item type */
> >         uint16_t                efi_size;       /* size of this item */
> >         uint32_t                efi_nextents;   /* # extents to free */
> >         uint64_t                efi_id;         /* efi identifier */
> >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > } xfs_efi_log_format_t;
> > 
> > Which means it can hold up to 2^16-1 individual extents that we
> > intend to free. We currently only use one extent per EFI, but if we
> > go back in history, they were dynamically sized structures and
> > could track arbitrary numbers of extents.
> > 
> > So, repair needs to track multiple nested EFIs?
> > 
> > We cancel the old EFI, log a new EFI with all the old extents and
> > the new extent in it. We now have a single EFI in the journal
> > containing N+1 extents in it.
> > 
> 
> That's an interesting optimization.

Hmm, I hadn't thought about amortizing the cost of maintaining an EFI
across as many of the btree block allocations as possible.  That would
make the current scheme (which I'll get into below) less scary.

> > Further, an EFD with multiple extents in it is -intended to be
> > relogged- multiple times. Every time we free an extent in the EFI,
> > we remove it from the EFD and relog the EFD. This tells log recovery
> > that this extent has now been freed, and that it should not replay
> > it, even though it is still in the EFI.
> > 
> > And to prevent the big EFI from pinning the tail of the log while
> > EFDs are being processed, we can relog the EFI along with the EFD
> > each time the EFD is updated, hence we drag the EFI forwards in
> > every high level transaction roll when we are actually freeing the
> > extents.
> > 
> 
> Hmm.. I'm not sure that addresses the deadlock problem for repair. That
> assumes that EFD updates come at regular enough intervals to keep the
> tail moving, but IIRC the bulk loading infrastructure will essentially
> log a bunch of EFIs, spend a non-deterministic amount of time doing
> work, then log the associated EFDs. So there's still a period of time in
> there where we might need to relog intents that aren't otherwise being
> updated.
> 
> Darrick might want to chime in here in case I'm missing something...

I changed the ->claim_block function in the online repair code[3] to
relog all of the EFIs (using the strategy Dave outlined above), with the
claimed block not present in the new EFI.  I think this means we can't
pin the log tail longer than it takes to memcpy a bunch of records into
a block and put it on a delwri list, which should be fast enough.
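
In rough (made-up) form, since the real code lives in [3], the claim
step does something like this, with example_*() standing in for the
actual helpers:

	int
	example_claim_block(
		struct xfs_trans	*tp,
		struct xfs_efi_log_item	*old_efi,
		xfs_fsblock_t		claimed)
	{
		struct xfs_efi_log_item	*new_efi;
		struct xfs_extent	*ext;
		uint			i;

		/* EFD cancels the old EFI... */
		example_trans_log_efd(tp, old_efi);

		/* ...and the replacement EFI omits the claimed block */
		new_efi = example_trans_get_efi(tp,
				old_efi->efi_format.efi_nextents - 1);
		for (i = 0; i < old_efi->efi_format.efi_nextents; i++) {
			ext = &old_efi->efi_format.efi_extents[i];
			if (ext->ext_start == claimed)
				continue;
			example_trans_log_efi_extent(tp, new_efi,
					ext->ext_start, ext->ext_len);
		}
		return 0;	/* caller rolls/commits the transaction */
	}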

> > The key to this is that the EFI/EFD relogging must be done entirely
> > under a single rolling transaction, so there is -always- space
> > available in the log for both the EFI and the EFDs to be relogged as
> > the long running operation is performed.
> > 
> > IOWs, the EFI/EFD structures support relogging of the intents at a
> > design level, and it is intended that this process is entirely
> > driven from a single rolling transaction context. I strongly suspect
> > that all the recent EFI/EFD and deferred ops reworking has lost a
> > lot of this context from the historical EFI/EFD implementation...

I don't agree with this, since the BUI/CUI items actively relog
themselves for another go-around if they decide that they still have
work to do, and the atomic extent swap item I proposed also takes
advantage of this design property.

Though, I concede that I don't think any of us were watching the dfops
manager carefully enough to spot the occasional need to relog all the
attached intent items if the chain gets long enough.

> > So before we go down the path of implementing generic automatic
> > relogging infrastructure, we first should have been writing the
> > application code that needs to relog intents and use a mechanism
> > like the above to cancel and reinsert intents further down the log.

Already done.  The reason why Brian and I are stirring up this hornet's
nest again is that I started posting patches to fix various deficiencies
that were exposed by generic/52[12] shakedowns of the atomic swap code. ;)

I guess I should go post the latest version of the defer freezer code
since it takes steps to minimize the dfops chain lengths, and relogs the
entire dfops chain every few rolls to keep the associated log items
moving forward...

> > Once we have code that is using these techniques to do bulk
> > operations, then we can look to optimise/genericise the
> > infrastructure they use.
> > 
> > > Moving on to quotaoff...
> > > 
> > > > I have been spending some time recently in the quota code, so I have
> > > > a better grip on what it is doing now than I did last time I looked
> > > > at this relogging code. I never really questioned why the quota code
> > > > needed two transactions for quota-off, and I'm guessing that nobody
> > > > else has either. So I spent some time this morning understanding
> > > > what problem it was actually solving and trying to find an alternate
> > > > solution to that problem.
> > > 
> > > Indeed, I hadn't looked into that.
> > > 
> > > > The reason we have the two quota-off transactions is that active
> > > > dquot modifications at the time quotaoff is started leak past the
> > > > first quota off transaction that hits the journal. Hence to avoid
> > > > incorrect replay of those modifications in the journal if we crash
> > > > after the quota-off item passes out of the journal, we pin the
> > > > quota-off item in the journal. It gets unpinned by the commit of the
> > > > second quota-off transaction at completion time, hence defining the
> > > > window in journal where quota-off is being processed and dquot
> > > > modifications should be ignored. i.e. there is no window where
> > > > recovery will replay dquot modifications incorrectly.
> > > > 
> > > 
> > > Ok.
> > > 
> > > > However, if the second transaction is left too long, the reservation
> > > > will fail to find journal space because of the pinned quota-off item.
> > > > 
> > > 
> > > Right.
> > > 
> > > > The relogging infrastructure is designed to allow the initial
> > > > quota-off intent to keep moving forward in the log so it never pins
> > > > the tail of the log before the second quota-off transaction is run.
> > > > This tries to avoid the recovery issue because there's always an
> > > > active quota off item in the log, but I think there may be a flaw
> > > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > > in the log that were modified after the quota-off started. Hence if
> > > > we crash after the relogging but while the dquots are still in the
> > > > log before the relogged quotaoff item, then they will be replayed,
> > > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > > prevents replay of those items.
> > > > 
> > > > So while relogging prevents the tail pinning deadlock, I think it
> > > > may actually result in incorrect recovery behaviour in that items
> > > > that should be cancelled and not replayed can end up getting
> > > > replayed.  I'm not sure that this matters for dquots, but for a
> > > > general mechanism I think the transactional ordering violations it
> > > > can result in reduce its usefulness significantly.
> > > > 
> > > 
> > > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > > the early RFCs.
> > 
> > We might have, but I don't recall that. And it would appear nobody
> > looked at this code in any detail if we did discuss it, so I'd say
> > the discussion was largely uninformed...
> > 
> > > Log recovery processes the quotaoff intent in pass 1 and
> > > dquot updates in pass 2, which I thought was intended to handle this
> > > kind of problem.
> > 
> > Right, it does handle it, but only because there are two quota-off
> > items in the log. i.e.  There's two recovery situations in play here
> > - 1) quota off in progress and 2) quota off done.
> > 
> > In the first case, only the initial quota-off item is in the log, so
> > it needs to be detected to stop replay of relevant dquots that
> > have been logged after the quota off was started.
> > 
> > The second case has to be broken down into two situations: a) both quota-off items
> > are active in the log, or b) only the second item is active in the log
> > as the tail has moved forwards past the first item.
> > 
> > In the case of 2a), it doesn't matter which item recovery sees, it
> > will cancel the dquot updates correctly. In the case of 2b), the
> > second quota off item is absolutely necessary to prevent replay of
> > the dquots in the log before it.
> > 
> > Hence if dquot modifications can leak past the first quota-off item
> > in the log, then the second item is absolutely necessary to catch
> > the 2b) case to prevent incorrect replay of dquot buffers.
> > 
> 
> Ok, but we're talking specifically about log recovery after quotaoff has
> completed but before both intents have fallen off of the log. Relogging
> of the initial intent (re: the original comment above about incorrect
> recovery behavior) has no impact on this general ordering between the
> start/end intents or dquot changes and the end intent.
> 
> > > If I follow correctly, the recovery issue that warrants pinning the
> > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > happens to fall off the end of the log before the last of the dquot
> > > modifications, recovery could see dquot changes after having lost the
> > > fact that a quotaoff had occurred at all. The current implementation
> > > presumably handles this by pinning the quotaoff until all dquots are
> > > completely purged from existence. The relog mechanism just allows the
> > > item to move while it effectively remains pinned, so I don't see how it
> > > introduces recovery issues.
> > 
> > As I said, it may not affect the specific quota-off usage, but we
> > can't just change the order of items in the physical journal without
> > care because the journal is supposed to be -strictly ordered-.
> > 
> 
> The mechanism itself is intended to target specific instances of log
> items. Each use case should be evaluated for correctness on its own,
> just like one would with ordered buffers or some other internal low
> level construct that changes behavior.
> 
> > Reordering intents in the log automatically without regard to higher
> > level transactional ordering dependencies of the log items may
> > violate the ordering rules for journalling and recovery of metadata.
> > This is why I said automatic relogging may not be useful as generic
> > infrastructure - if there are dependent log items, then they need to
> > relogged as an atomic change set that maintains the ordering
> > dependencies between objects. That's where this automatic mechanism
> > completely falls down - the ordering dependencies are known only by
> > the code running the original transaction, not the log items...
> > 
> 
> This and the above suggest to me that you're treating automatic relogging
> as though it would just be enabled by default on all intents, reordering
> things arbitrarily. That is not the case as things would certainly
> break, just like what would happen if ordered buffers were enabled by
> default. The mechanism is per log item and context specific. It is
> "generic" in the sense that there are (were) multiple use cases for it,
> not that it should be used arbitrarily or "without care."
> 
> Use cases that have very particular ordering requirements across certain
> sets of items should probably not enable this mechanism on those items
> or otherwise verify that relogging a particular item is safe. The
> potential example of this ordering problem being cited is quotaoff, but
> we've already gone through this example multiple times and established
> that relogging the quotaoff start item is safe.
> 
> All that said, extending a relogging notification somehow to a
> particular context has always been a consideration because 1.) direct
> EFI relogging would require log recovery changes and 2.) there was yet
> another potential use case where dfops needed to know whether to relog a
> particular intent in a long running chain to avoid some issue (the
> details of which escape me). I think issue #1 is not complicated to
> address, but creates a backwards incompatibility for log recovery. Issue
> #2 would potentially separate out relogging as a notification mechanism
> from the reservation management bits, but it's still not clear to me
> what that notification mechanism would look like for a transaction that
> has already been committed by some caller context.
> 
> I think Darrick was looking at repurposing ->iop_relog() for that one so
> I'd be curious to know what that is looking like in general...

...it's graduated to the point that I'm willing/crazy enough to run it
on my development workstations, and it hasn't let out the magic smoke.

I added ->iop_relog handlers to all the major intent items (EFI, RUI,
CUI, BUI, SXI), then taught xfs_defer_finish_noroll to relog everything
on dop_pending every 7 transaction rolls[1].  Initially this caused log
reservation overflow problems with transactions that log intents whose
->finish_item functions themselves log dozens more intents, but then I
realized that the second patch[2] I had written took care of this
problem.

That patch, of course, is one that I posted a while ago that makes it so
that if a transaction owner logs items A and B and then commits (or
calls defer_roll), dfops will now finish all the items created by A's
->finish_item before it moves on to trying to finish B.

I had put /that/ patch aside after Brian pointed out that on its own,
that patch merely substituted pinning the tail on some sub-item of A
with pinning the tail on B, but I think with both patches applied I have
solved both problems.
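
The check itself is pretty simple; a hand-wavy sketch of the shape
(the threshold constant and example_relog_intent() are illustrative,
not the code in [1]):

	#define EXAMPLE_RELOG_NR	7	/* rolls between relogs */

	static void
	example_defer_maybe_relog(
		struct xfs_trans	*tp,
		unsigned int		*nr_rolls)
	{
		struct xfs_defer_pending	*dfp;

		if (++(*nr_rolls) % EXAMPLE_RELOG_NR)
			return;

		/* relog every pending intent so none can pin the tail */
		list_for_each_entry(dfp, &tp->t_dfops, dfp_list)
			dfp->dfp_intent = example_relog_intent(tp,
							dfp->dfp_intent);
	}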

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=515cc4e637bf4e9afcfbaeb39b13f85b27923916
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=8e63f8a7af12d673feb5400d09179502632854c4
[3] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=ddfeca6f1f1862c3f162db8b8bdbfc5149f5e5c5

> > > > But back to quota-off: What I've realised is that the only dquot
> > > > modifications we need to protect against being recovered are the
> > > > ones that are running at the time the first quota-off is committed
> > > > to the journal. That is, once the DQACTIVE flags are clear,
> > > > transactions will not modify those dquots anymore. Hence by the time
> > > > that the quota off item pins the tail of the log, the transactions
> > > > that were actively dirtying inodes when it was committed have also
> > > > committed and are in the journal and there are no actively modified
> > > > dquots left in memory.
> > > > 
> > > 
> > > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > > transaction running in parallel hadn't modified a dquot and committed
> > > after the quotaoff, but I think I see where you're going in general...
> > 
> > We drained out all the transactions that can be modifying quotas
> > before we log the quotaoff items. So, by definition, this cannot
> > happen.
> > 
> > > > IOWs, we don't actually need to wait until we've released and purged
> > > > all the dquots from memory before we log the second quota off item;
> > > > all we need to wait for is for all the transactions with dirty
> > > > dquots to have committed. These transactions already have log
> > > > reservations, so completing them will free unused reservation space
> > > > for the second quota off transaction. Once they are committed, then
> > > > we can log the second item. i.e. we don't have to wait until we've
> > > > cleaned up the dquots to close out the quota-off transaction in the
> > > > journal.
> > > > 
> > > 
> > > Ok, so we can deterministically shorten the window with a runtime
> > > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > > than relying on the full dquot purge to provide this ordering.
> > 
> > Yup.
> > 
> > > > To make it even more robust, if we stop all the transactions that
> > > > may dirty dquots and drain the active ones before we log the first
> > > > quota-off item, we can log the second item immediately afterwards
> > > > because it is known that there are no dquot modifications in flight
> > > > when the first item is logged. We can probably even log both items
> > > > in the same transaction.
> > > > 
> > > 
> > > I was going to ask why we'd even need two items if this approach is
> > > generally viable.
> > 
> > Because I don't want to change the in-journal appearance of
> > quota-off to older kernels. Changing how things appear on disk is
> > dangerous and likely going to bite us in unexpected ways.
> > 
> 
> Well combining them into a single transaction doesn't guarantee ordering
> of the two, right? So it might not be worth doing that either if we're
> concerned about log appearance. Regardless, those potential steps can be
> evaluated independently on top of the core runtime fixes.
> 
> > > > So, putting my money where my mouth is, the patch below does this.
> > > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > > of -g quota with all quotas enabled and is currently running a full
> > > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > > after about 4 hours of testing now....
> > > > 
> > > 
> > > Thanks for the patch. First, I like the idea and agree that it's
> > > simpler than the relogging approach. I do still need to stare at it some
> > > more to grok it and convince myself it's safe.
> > > 
> > > The thing that sticks out to me is tagging all of the transactions that
> > > modify quotas. Is there any reason we can't just quiesce the transaction
> > > subsystem entirely as a first step? It's not like quotaoff is common or
> > > performance sensitive. For example:
> > >
> > > 1. stop all transactions, wait to drain, force log
> > > 2. log the sb/quotaoff synchronously (punching through via something
> > >    like NO_WRITECOUNT)
> > > 3. clear the xfs_mount quota active flags
> > > 4. restart the transaction subsystem (no more dquot mods)
> > > 5. complete quotaoff via the dquot release and purge sequence
> > 
> > Yup, as I said on #xfs a short while ago:
> > 
> > [3/7/20 01:15] <djwong> qi_active_trans?
> > [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> > [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> > [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> > [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> > [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> > [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> > [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> > [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> > [3/7/20 09:01] <djwong> oh
> > [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> > 
> > So, we used to have exactly the functionality I needed in XFS as
> > general infrastructure, but we've removed it over the past few years
> > as the VFS has slowly been brought up to feature parity with XFS. I
> > just implemented what I needed to block/halt quota modifications
> > because I didn't want to perturb anything else while exploring if my
> > hypothesis was correct.
> > 
> 
> Ok.
> 
> > The only outstanding thing I haven't checked out fully is the
> > delayed allocation reservations that aren't done in transaction
> > contexts. I -think- these are OK because they are in memory only,
> > and they will be serialised on the inode lock when detaching dquots
> > (i.e. the existing dquot purging ordering mechanisms) after quotas
> > are turned off. Hence I think these are fine, but more investigation
> > will be needed there to confirm behaviour is correct.
> > 
> 
> Yep.
> 
> > > I think it could be worth the tradeoff for the simplicity of not having
> > > to maintain the transaction reservation tags or the special quota
> > > waiting infrastructure vs. something like the more generic (recently
> > > removed) transaction counter. We might even be able to abstract the
> > > whole thing behind a transaction flag. E.g.:
> > > 
> > > 	/*
> > > 	 * A barrier transaction locks out further transactions and waits on
> > > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > > 	 * Everything unlocks when the transaction commits.
> > > 	 */
> > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > 			XFS_TRANS_BARRIER, &tp);
> > > 	...
> > 
> > Yup, if we decide that we want to track all active transactions again
> > rather than just when quota is active, it would make a lot of
> > sense to make it a formal function of the xfs_trans_alloc() API.
> > 
> > Really, though, I've got so many other things on my plate right now
> > I don't have the time to take on yet another infrastructure
> > reworking. I spent the time to write the patch because if I was
> > going to say I didn't like relogging then it was absolutely
> > necessary for me to provide an alternative solution to the problem,
> > but I'm really hoping that it is sufficient for someone else to be
> > able to pick it up and run with it....
> > 
> 
> Ok, I can take a look at this since I need to step back and rethink this
> particular feature anyways.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > 
> > PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> > the past and run with it, I'm more than happy for you to do so. I've
> > got way more ideas and prototypes than I've got time to turn into
> > full production features. I also don't care about "ownership" of the
> > work; it's better to have someone actively working on the code than
> > having it sit around waiting for me to find time to get back to
> > it...
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-06 17:42         ` Darrick J. Wong
@ 2020-07-07 11:37           ` Brian Foster
  2020-07-08 16:44             ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-07-07 11:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Jul 06, 2020 at 10:42:57AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> > On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
...
> > 
> > > Consider that a single transaction that contains an EFD for the
> > > original EFI, and a new EFI for the same extent is effectively
> > > "relogging the EFI". It does so by atomically cancelling the
> > > original EFI in the log and creating a new EFI.
> > > 
> > 
> > Right. This is how dfops currently works IIRC.
> > 
> > > Now, an EFI is defined on disk as:
> > > 
> > > typedef struct xfs_efi_log_format {
> > >         uint16_t                efi_type;       /* efi log item type */
> > >         uint16_t                efi_size;       /* size of this item */
> > >         uint32_t                efi_nextents;   /* # extents to free */
> > >         uint64_t                efi_id;         /* efi identifier */
> > >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > > } xfs_efi_log_format_t;
> > > 
> > > Which means it can hold up to 2^16-1 individual extents that we
> > > intend to free. We currently only use one extent per EFI, but if we
> > > go back in history, they were dynamically sized structures and
> > > could track arbitrary numbers of extents.
> > > 
> > > So, repair needs to track multiple nested EFIs?
> > > 
> > > We cancel the old EFI, log a new EFI with all the old extents and
> > > the new extent in it. We now have a single EFI in the journal
> > > containing N+1 extents in it.
> > > 
> > 
> > That's an interesting optimization.
> 
> Hmm, I hadn't thought about amortizing the cost of maintaining an EFI
> across as many of the btree block allocations as possible.  That would
> make the current scheme (which I'll get into below) less scary.
> 
> > > Further, an EFD with multiple extents in it is -intended to be
> > > relogged- multiple times. Every time we free an extent in the EFI,
> > > we remove it from the EFD and relog the EFD. This tells log recovery
> > > that this extent has now been freed, and that it should not replay
> > > it, even though it is still in the EFI.
> > > 
> > > And to prevent the big EFI from pinning the tail of the log while
> > > EFDs are being processed, we can relog the EFI along with the EFD
> > > each time the EFD is updated, hence we drag the EFI forwards in
> > > every high level transaction roll when we are actually freeing the
> > > extents.
> > > 
> > 
> > Hmm.. I'm not sure that addresses the deadlock problem for repair. That
> > assumes that EFD updates come at regular enough intervals to keep the
> > tail moving, but IIRC the bulk loading infrastructure will essentially
> > log a bunch of EFIs, spend a non-deterministic amount of time doing
> > work, then log the associated EFDs. So there's still a period of time in
> > there where we might need to relog intents that aren't otherwise being
> > updated.
> > 
> > Darrick might want to chime in here in case I'm missing something...
> 
> I changed the ->claim_block function in the online repair code[3] to
> relog all of the EFIs (using the strategy Dave outlined above), with the
> claimed block not present in the new EFI.  I think this means we can't
> pin the log tail longer than it takes to memcpy a bunch of records into
> a block and put it on a delwri list, which should be fast enough.
> 

Ah, interesting. So each filled block would relog the intents associated
with the outstanding reservation for the rebuild. That certainly will
keep things moving; I'd suspect it relogs far more frequently than
necessary, but correctness first. :)

One thing I'm curious about in the broader context of repair is what
happens if the filesystem crashes mid-rebuild? IIRC the rebuilt tree is
fake-rooted and the above seems to imply we'd have EFDs logged for the
blocks consumed by the tree. Are those blocks restored to the fs somehow
or other during crash recovery?

> > > The key to this is that the EFI/EFD relogging must be done entirely
> > > under a single rolling transaction, so there is -always- space
> > > available in the log for both the EFI and the EFDs to be relogged as
> > > the long running operation is performed.
> > > 
> > > IOWs, the EFI/EFD structures support relogging of the intents at a
> > > design level, and it is intended that this process is entirely
> > > driven from a single rolling transaction context. I strongly suspect
> > > that all the recent EFI/EFD and deferred ops reworking has lost a
> > > lot of this context from the historical EFI/EFD implementation...
> 
> I don't agree with this, since the BUI/CUI items actively relog
> themselves for another go-around if they decide that they still have
> work to do, and the atomic extent swap item I proposed also takes
> advantage of this design property.
> 
> Though, I concede that I don't think any of us were watching the dfops
> manager carefully enough to spot the occasional need to relog all the
> attached intent items if the chain gets long enough.
> 
> > > So before we go down the path of implementing generic automatic
> > > relogging infrastructure, we first should have been writing the
> > > application code that needs to relog intents and use a mechanism
> > > like the above to cancel and reinsert intents further down the log.
> 
> Already done.  The reason why Brian and I are stirring up this hornet's
> nest again is that I started posting patches to fix various deficiencies
> that were exposed by generic/52[12] shakedowns of the atomic swap code. ;)
> 
> I guess I should go post the latest version of the defer freezer code
> since it takes steps to minimize the dfops chain lengths, and relogs the
> entire dfops chain every few rolls to keep the associated log items
> moving forward...
> 
> > > Once we have code that is using these techniques to do bulk
> > > operations, then we can look to optimise/genericise the
> > > infrastructure they use.
> > > 
> > > > Moving on to quotaoff...
> > > > 
> > > > > I have been spending some time recently in the quota code, so I have
> > > > > a better grip on what it is doing now than I did last time I looked
> > > > > at this relogging code. I never really questioned why the quota code
> > > > > needed two transactions for quota-off, and I'm guessing that nobody
> > > > > else has either. So I spent some time this morning understanding
> > > > > what problem it was actually solving and trying to find an alternate
> > > > > solution to that problem.
> > > > 
> > > > Indeed, I hadn't looked into that.
> > > > 
> > > > > The reason we have the two quota-off transactions is that active
> > > > > dquot modifications at the time quotaoff is started leak past the
> > > > > first quota off transaction that hits the journal. Hence to avoid
> > > > > incorrect replay of those modifications in the journal if we crash
> > > > > after the quota-off item passes out of the journal, we pin the
> > > > > quota-off item in the journal. It gets unpinned by the commit of the
> > > > > second quota-off transaction at completion time, hence defining the
> > > > > window in journal where quota-off is being processed and dquot
> > > > > modifications should be ignored. i.e. there is no window where
> > > > > recovery will replay dquot modifications incorrectly.
> > > > > 
> > > > 
> > > > Ok.
> > > > 
> > > > > However, if the second transaction is left too long, the reservation
> > > > > will fail to find journal space because of the pinned quota-off item.
> > > > > 
> > > > 
> > > > Right.
> > > > 
> > > > > The relogging infrastructure is designed to allow the initial
> > > > > quota-off intent to keep moving forward in the log so it never pins
> > > > > the tail of the log before the second quota-off transaction is run.
> > > > > This tries to avoid the recovery issue because there's always an
> > > > > active quota off item in the log, but I think there may be a flaw
> > > > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > > > in the log that were modified after the quota-off started. Hence if
> > > > > we crash after the relogging but while the dquots are still in the
> > > > > log before the relogged quotaoff item, then they will be replayed,
> > > > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > > > prevents replay of those items.
> > > > > 
> > > > > So while relogging prevents the tail pinning deadlock, I think it
> > > > > may actually result in incorrect recovery behaviour in that items
> > > > > that should be cancelled and not replayed can end up getting
> > > > > replayed.  I'm not sure that this matters for dquots, but for a
> > > > > general mechanism I think the transactional ordering violations it
> > > > > can result in reduce its usefulness significantly.
> > > > > 
> > > > 
> > > > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > > > the early RFCs.
> > > 
> > > We might have, but I don't recall that. And it would appear nobody
> > > looked at this code in any detail if we did discuss it, so I'd say
> > > the discussion was largely uninformed...
> > > 
> > > > Log recovery processes the quotaoff intent in pass 1 and
> > > > dquot updates in pass 2, which I thought was intended to handle this
> > > > kind of problem.
> > > 
> > > Right, it does handle it, but only because there are two quota-off
> > > items in the log. i.e.  There's two recovery situations in play here
> > > - 1) quota off in progress and 2) quota off done.
> > > 
> > > In the first case, only the initial quota-off item is in the log, so
> > > it needs to be detected to stop replay of relevant dquots that
> > > have been logged after the quota off was started.
> > > 
> > > The second case has to be broken down into two situations: a) both quota-off items
> > > are active in the log, or b) only the second item is active in the log
> > > as the tail has moved forwards past the first item.
> > > 
> > > In the case of 2a), it doesn't matter which item recovery sees, it
> > > will cancel the dquot updates correctly. In the case of 2b), the
> > > second quota off item is absolutely necessary to prevent replay of
> > > the dquots in the log before it.
> > > 
> > > Hence if dquot modifications can leak past the first quota-off item
> > > in the log, then the second item is absolutely necessary to catch
> > > the 2b) case to prevent incorrect replay of dquot buffers.
> > > 
> > 
> > Ok, but we're talking specifically about log recovery after quotaoff has
> > completed but before both intents have fallen off of the log. Relogging
> > of the initial intent (re: the original comment above about incorrect
> > recovery behavior) has no impact on this general ordering between the
> > start/end intents or dquot changes and the end intent.
> > 
> > > > If I follow correctly, the recovery issue that warrants pinning the
> > > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > > happens to fall off the end of the log before the last of the dquot
> > > > modifications, recovery could see dquot changes after having lost the
> > > > fact that a quotaoff had occurred at all. The current implementation
> > > > presumably handles this by pinning the quotaoff until all dquots are
> > > > completely purged from existence. The relog mechanism just allows the
> > > > item to move while it effectively remains pinned, so I don't see how it
> > > > introduces recovery issues.
> > > 
> > > As I said, it may not affect the specific quota-off usage, but we
> > > can't just change the order of items in the physical journal without
> > > care because the journal is supposed to be -strictly ordered-.
> > > 
> > 
> > The mechanism itself is intended to target specific instances of log
> > items. Each use case should be evaluated for correctness on its own,
> > just like one would with ordered buffers or some other internal low
> > level construct that changes behavior.
> > 
> > > Reordering intents in the log automatically without regard to higher
> > > level transactional ordering dependencies of the log items may
> > > violate the ordering rules for journalling and recovery of metadata.
> > > This is why I said automatic relogging may not be useful as generic
> > > infrastructure - if there are dependent log items, then they need to
> > > be relogged as an atomic change set that maintains the ordering
> > > dependencies between objects. That's where this automatic mechanism
> > > completely falls down - the ordering dependencies are known only by
> > > the code running the original transaction, not the log items...
> > > 
> > 
> > This and the above sound to me like you're treating automatic relogging
> > like it would just be enabled by default on all intents, reordering
> > things arbitrarily. That is not the case as things would certainly
> > break, just like what would happen if ordered buffers were enabled by
> > default. The mechanism is per log item and context specific. It is
> > "generic" in the sense that there are (were) multiple use cases for it,
> > not that it should be used arbitrarily or "without care."
> > 
> > Use cases that have very particular ordering requirements across certain
> > sets of items should probably not enable this mechanism on those items,
> > or should otherwise verify that relogging a particular item is safe. The
> > potential example of this ordering problem being cited is quotaoff, but
> > we've already gone through this example multiple times and established
> > that relogging the quotaoff start item is safe.
> > 
> > All that said, extending a relogging notification somehow to a
> > particular context has always been a consideration because 1.) direct
> > EFI relogging would require log recovery changes and 2.) there was yet
> > another potential use case where dfops needed to know whether to relog a
> > particular intent in a long running chain to avoid some issue (the
> > details of which escape me). I think issue #1 is not complicated to
> > address, but creates a backwards incompatibility for log recovery. Issue
> > #2 would potentially separate out relogging as a notification mechanism
> > from the reservation management bits, but it's still not clear to me
> > what that notification mechanism would look like for a transaction that
> > has already been committed by some caller context.
> > 
> > I think Darrick was looking at repurposing ->iop_relog() for that one so
> > I'd be curious to know what that is looking like in general...
> 
> ...it's graduated to the point that I'm willing/crazy enough to run it
> on my development workstations, and it hasn't let out the magic smoke.
> 

Heh.

> I added ->iop_relog handlers to all the major intent items (EFI, RUI,
> CUI, BUI, SXI), then taught xfs_defer_finish_noroll to relog everything
> on dop_pending every 7 transaction rolls[1].  Initially this caused log
> reservation overflow problems with transactions that log intents whose
> ->finish_item functions themselves log dozens more intents, but then I
> realized that the second patch[2] I had written took care of this
> problem.
> 

Thanks. I think I get the general idea. We're reworking the
->iop_relog() handler to complete and replace the current intent (rather
than just relog the original intent, which is what this series did for
the quotaoff case) in the current dfops transaction and allow the dfops
code to update its reference to the item. The part that's obviously
missing is some way to determine when we actually need to relog the
outstanding intents vs. using a fixed roll count.
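
To make sure we're talking about the same thing, here's a rough sketch
of how I'd imagine the dfops side of that looks. The ->iop_relog()
signature, the XFS_DEFER_RELOG_NR constant and the log item type of
->dfp_intent are assumptions for illustration here, not necessarily
what's in your branch:

/* Sketch: relog all pending intents every XFS_DEFER_RELOG_NR rolls. */
#define XFS_DEFER_RELOG_NR	7

static void
xfs_defer_relog_intents(
	struct xfs_trans	*tp)
{
	struct xfs_defer_pending	*dfp;

	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
		struct xfs_log_item	*lip = dfp->dfp_intent;

		if (!lip || !lip->li_ops->iop_relog)
			continue;

		/*
		 * Complete the existing intent and log a replacement in
		 * the current transaction, then track the new intent in
		 * its place.
		 */
		dfp->dfp_intent = lip->li_ops->iop_relog(lip, tp);
	}
}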

I suppose we could do something like I was mentioning in my other reply
on the AIL pushing issue Dave pointed out, where we'd set a bit on
items that are pinning the tail and in need of a relog. That sounds
like overkill given this use case is currently self-contained to dfops.
Perhaps the other idea of factoring out the threshold determination
logic from xlog_grant_push_ail() might be useful.

For example, if the current free reservation is below the calculated
threshold (with need_bytes == 0), return a threshold LSN based on the
current tail. Instead of using that to push the AIL, compare it to
->li_lsn of each intent and relog any that are inside the threshold LSN
(which will probably be all of them in practice since they are part of
the same transaction). We'd probably need to identify intents that have
been recently relogged so the process doesn't repeat until the CIL
drains and the li_lsn eventually changes. Hmm.. I did have an
XFS_LI_IN_CIL state tracking patch around somewhere for debugging
purposes that might actually be sufficient for that. We could also
consider stashing a "relog push" LSN somewhere (similar to the way AIL
pushing works) and perhaps use that to avoid repeated relogs on a chain,
but it's not immediately clear to me how well that would fit into the
dfops mechanism...
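
As a rough sketch of that threshold idea, assuming a hypothetical
xlog_grant_push_threshold() helper factored out of xlog_grant_push_ail()
that only computes the push target (returning NULLCOMMITLSN when
reservation space is fine) rather than waking the AIL:

/*
 * Decide whether an intent sits inside the tail pushing threshold and
 * therefore should be relogged. Sketch only; the helper below doesn't
 * exist yet.
 */
static bool
xfs_defer_intent_needs_relog(
	struct xlog		*log,
	struct xfs_log_item	*lip)
{
	xfs_lsn_t		threshold_lsn;

	/* need_bytes == 0: just compute the current push target */
	threshold_lsn = xlog_grant_push_threshold(log, 0);
	if (threshold_lsn == NULLCOMMITLSN)
		return false;

	/* Relog anything at or behind the push target. */
	return XFS_LSN_CMP(lip->li_lsn, threshold_lsn) <= 0;
}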

Brian

> That patch, of course, is one that I posted a while ago that makes it so
> that if a transaction owner logs items A and B and then commits (or
> calls defer_roll), dfops will now finish all the items created by A's
> ->finish_item before it moves on to trying to finish B.
> 
> I had put /that/ patch aside after Brian pointed out that on its own,
> that patch merely substituted pinning the tail on some sub-item of A
> with pinning the tail on B, but I think with both patches applied I have
> solved both problems.
> 
> --D
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=515cc4e637bf4e9afcfbaeb39b13f85b27923916
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=8e63f8a7af12d673feb5400d09179502632854c4
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=ddfeca6f1f1862c3f162db8b8bdbfc5149f5e5c5
> 
> > > > > But back to quota-off: What I've realised is that the only dquot
> > > > > modifications we need to protect against being recovered are the
> > > > > ones that are running at the time the first quota-off is committed
> > > > > to the journal. That is, once the DQACTIVE flags are clear,
> > > > > transactions will not modify those dquots anymore. Hence by the time
> > > > > that the quota off item pins the tail of the log, the transactions
> > > > > that were actively dirtying inodes when it was committed have also
> > > > > committed and are in the journal and there are no actively modified
> > > > > dquots left in memory.
> > > > > 
> > > > 
> > > > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > > > transaction running in parallel hadn't modified a dquot and committed
> > > > after the quotaoff, but I think I see where you're going in general...
> > > 
> > > We drained out all the transactions that can be modifying quotas
> > > before we log the quotaoff items. So, by definition, this cannot
> > > happen.
> > > 
> > > > > IOWs, we don't actually need to wait until we've released and purged
> > > > > all the dquots from memory before we log the second quota off item;
> > > > > all we need to wait for is for all the transactions with dirty
> > > > > dquots to have committed. These transactions already have log
> > > > > reservations, so completing them will free unused reservation space
> > > > > for the second quota off transaction. Once they are committed, then
> > > > > we can log the second item. i.e. we don't have to wait until we've
> > > > > cleaned up the dquots to close out the quota-off transaction in the
> > > > > journal.
> > > > > 
> > > > 
> > > > Ok, so we can deterministically shorten the window with a runtime
> > > > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > > > than relying on the full dquot purge to provide this ordering.
> > > 
> > > Yup.
> > > 
> > > > > To make it even more robust, if we stop all the transactions that
> > > > > may dirty dquots and drain the active ones before we log the first
> > > > > quota-off item, we can log the second item immediately afterwards
> > > > > because it is known that there are no dquot modifications in flight
> > > > > when the first item is logged. We can probably even log both items
> > > > > in the same transaction.
> > > > > 
> > > > 
> > > > I was going to ask why we'd even need two items if this approach is
> > > > generally viable.
> > > 
> > > Because I don't want to change the in-journal appearance of
> > > quota-off to older kernels. Changing how things appear on disk is
> > > dangerous and likely going to bite us in unexpected ways.
> > > 
> > 
> > Well, combining them into a single transaction doesn't guarantee ordering
> > of the two, right? So it might not be worth doing that either if we're
> > concerned about log appearance. Regardless, those potential steps can be
> > evaluated independently on top of the core runtime fixes.
> > 
> > > > > So, putting my money where my mouth is, the patch below does this.
> > > > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > > > of -g quota with all quotas enabled and is currently running a full
> > > > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > > > after about 4 hours of testing now....
> > > > > 
> > > > 
> > > > Thanks for the patch. First, I like the idea and agree that it's
> > > > simpler than the relogging approach. I do still need to stare at it some
> > > > more to grok it and convince myself it's safe.
> > > > 
> > > > The thing that sticks out to me is tagging all of the transactions that
> > > > modify quotas. Is there any reason we can't just quiesce the transaction
> > > > subsystem entirely as a first step? It's not like quotaoff is common or
> > > > performance sensitive. For example:
> > > >
> > > > 1. stop all transactions, wait to drain, force log
> > > > 2. log the sb/quotaoff synchronously (punching through via something
> > > >    like NO_WRITECOUNT)
> > > > 3. clear the xfs_mount quota active flags
> > > > 4. restart the transaction subsystem (no more dquot mods)
> > > > 5. complete quotaoff via the dquot release and purge sequence
> > > 
> > > Yup, as I said on #xfs a short while ago:
> > > 
> > > [3/7/20 01:15] <djwong> qi_active_trans?
> > > [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> > > [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> > > [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> > > [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> > > [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> > > [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> > > [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> > > [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> > > [3/7/20 09:01] <djwong> oh
> > > [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> > > 
> > > So, we used to have exactly the functionality I needed in XFS as
> > > general infrastructure, but we've removed it over the past few years
> > > as the VFS has slowly been brought up to feature parity with XFS. I
> > > just implemented what I needed to block/halt quota modifications
> > > because I didn't want to perturb anything else while exploring if my
> > > hypothesis was correct.
> > > 
> > 
> > Ok.
> > 
> > > The only outstanding thing I haven't checked out fully is the
> > > delayed allocation reservations that aren't done in transaction
> > > contexts. I -think- these are OK because they are in memory only,
> > > and they will be serialised on the inode lock when detaching dquots
> > > (i.e. the existing dquot purging ordering mechanisms) after quotas
> > > are turned off. Hence I think these are fine, but more investigation
> > > will be needed there to confirm behaviour is correct.
> > > 
> > 
> > Yep.
> > 
> > > > I think it could be worth the tradeoff for the simplicity of not having
> > > > to maintain the transaction reservation tags or the special quota
> > > > waiting infrastructure vs. something like the more generic (recently
> > > > removed) transaction counter. We might even be able to abstract the
> > > > whole thing behind a transaction flag. E.g.:
> > > > 
> > > > 	/*
> > > > 	 * A barrier transaction locks out further transactions and waits on
> > > > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > > > 	 * Everything unlocks when the transaction commits.
> > > > 	 */
> > > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > > 			XFS_TRANS_BARRIER, &tp);
> > > > 	...
> > > 
> > > Yup, if we decide that we want to track all active transactions again
> > > rather than just when quota is active, it would make a lot of
> > > sense to make it a formal function of the xfs_trans_alloc() API.
> > > 
> > > Really, though, I've got so many other things on my plate right now
> > > I don't have the time to take on yet another infrastructure
> > > reworking. I spent the time to write the patch because if I was
> > > going to say I didn't like relogging then it was absolutely
> > > necessary for me to provide an alternative solution to the problem,
> > > but I'm really hoping that it is sufficient for someone else to be
> > > able to pick it up and run with it....
> > > 
> > 
> > Ok, I can take a look at this since I need to step back and rethink this
> > particular feature anyways.
> > 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > 
> > > PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> > > the past and run with it, I'm more than happy for you to do so. I've
> > > got way more ideas and prototypes than I've got time to turn into
> > > full production features. I also don't care about "ownership" of the
> > > work; it's better to have someone actively working on the code than
> > > having it sit around waiting for me to find time to get back to
> > > it...
> > > 
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-07 11:37           ` Brian Foster
@ 2020-07-08 16:44             ` Darrick J. Wong
  2020-07-09 12:15               ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2020-07-08 16:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> On Mon, Jul 06, 2020 at 10:42:57AM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> > > On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> ...
> > > 
> > > > Consider that a single transaction that contains an EFD for the
> > > > original EFI, and a new EFI for the same extent is effectively
> > > > "relogging the EFI". It does so by atomically cancelling the
> > > > original EFI in the log and creating a new EFI.
> > > > 
> > > 
> > > Right. This is how dfops currently works IIRC.
> > > 
> > > > Now, an EFI is defined on disk as:
> > > > 
> > > > typedef struct xfs_efi_log_format {
> > > >         uint16_t                efi_type;       /* efi log item type */
> > > >         uint16_t                efi_size;       /* size of this item */
> > > >         uint32_t                efi_nextents;   /* # extents to free */
> > > >         uint64_t                efi_id;         /* efi identifier */
> > > >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > > > } xfs_efi_log_format_t;
> > > > 
> > > > Which means it can hold up to 2^16-1 individual extents that we
> > > > intend to free. We currently only use one extent per EFI, but if we
> > > > go back in history, they were dynamically sized structures and
> > > > could track arbitrary numbers of extents.
> > > > 
> > > > So, repair needs to track multiple nested EFIs?
> > > > 
> > > > We cancel the old EFI, log a new EFI with all the old extents and
> > > > the new extent in it. We now have a single EFI in the journal
> > > > containing N+1 extents in it.
> > > > 
> > > 
> > > That's an interesting optimization.
> > 
> > Hmm, I hadn't thought about amortizing the cost of maintaining an EFI
> > across as many of the btree block allocations as possible.  That would
> > make the current scheme (which I'll get into below) less scary.
> > 
> > > > Further, an EFD with multiple extents in it is -intended to be
> > > > relogged- multiple times. Every time we free an extent in the EFI,
> > > > we remove it from the EFD and relog the EFD. This tells log recovery
> > > > that this extent has now been freed, and that it should not replay
> > > > it, even though it is still in the EFI.
> > > > 
> > > > And to prevent the big EFI from pinning the tail of the log while
> > > > EFDs are being processed, we can relog the EFI along with the EFD
> > > > each time the EFD is updated, hence we drag the EFI forwards in
> > > > every high level transaction roll when we are actually freeing the
> > > > extents.
> > > > 
> > > 
> > > Hmm.. I'm not sure that addresses the deadlock problem for repair. That
> > > assumes that EFD updates come at regular enough intervals to keep the
> > > tail moving, but IIRC the bulk loading infrastructure will essentially
> > > log a bunch of EFIs, spend a non-deterministic amount of time doing
> > > work, then log the associated EFDs. So there's still a period of time in
> > > there where we might need to relog intents that aren't otherwise being
> > > updated.
> > > 
> > > Darrick might want to chime in here in case I'm missing something...
> > 
> > I changed the ->claim_block function in the online repair code[3] to
> > relog all of the EFIs (using the strategy Dave outlined above), with the
> > claimed block not present in the new EFI.  I think this means we can't
> > pin the log tail longer than it takes to memcpy a bunch of records into
> > a block and put it on a delwri list, which should be fast enough.
> > 
> 
> Ah, interesting. So each filled block would relog the intents associated
> with the outstanding reservation for the rebuild. That certainly will
> keep things moving; I'd suspect it relogs far more frequently than
> necessary, but correctness first. :)

<nod> It's even less bad if you've maximized the number of free extent
records per EFI.

> One thing I'm curious about in the broader context of repair is what
> happens if the filesystem crashes mid-rebuild? IIRC the rebuilt tree is
> fake-rooted and the above seems to imply we'd have EFDs logged for the
> blocks consumed by the tree. Are those blocks restored to the fs somehow
> or other during crash recovery?

Each time we relog the EFIs, we log exactly the same records as before,
which means that all the blocks in the new btree remain the target of
EFIs until the repair completes.  If we crash before the end, log
recovery will free the entire half-built structure.

If we reach the end of the repair, we'll log EFDs for all the EFIs in
the btree bulkload reservation; log the new btree root; and queue a
bunch of extfree_items (which themselves log more EFIs) for the old
btree blocks.  Then we let dfops finish the extfree items, which deletes
the old btree.  If we crash during that part, log recovery will roll us
forward to the point of having a freshly rebuilt btree and no leaked
blocks.

> > > > The key to this is that the EFI/EFD relogging must be done entirely
> > > > under a single rolling transaction, so there is -always- space
> > > > available in the log for both the EFI and the EFDs to be relogged as
> > > > the long running operation is performed.
> > > > 
> > > > IOWs, the EFI/EFD structures support relogging of the intents at a
> > > > design level, and it is intended that this process is entirely
> > > > driven from a single rolling transaction context. I strongly suspect
> > > > that all the recent EFI/EFD and deferred ops reworking has lost a
> > > > lot of this context from the historical EFI/EFD implementation...
> > 
> > I don't agree with this, since the BUI/CUI items actively relog
> > themselves for another go-around if they decide that they still have
> > work to do, and the atomic extent swap item I proposed also takes
> > advantage of this design property.
> > 
> > Though, I concede that I don't think any of us were watching the dfops
> > manager carefully enough to spot the occasional need to relog all the
> > attached intent items if the chain gets long enough.
> > 
> > > > So before we go down the path of implementing generic automatic
> > > > relogging infrastructure, we first should have been writing the
> > > > application code that needs to relog intents and use a mechanism
> > > > like the above to cancel and reinsert intents further down the log.
> > 
> > Already done.  The reason why Brian and I are stirring up this hornet's
> > nest again is that I started posting patches to fix various deficiencies
> > that were exposed by generic/52[12] shakedowns of the atomic swap code. ;)
> > 
> > I guess I should go post the latest version of the defer freezer code
> > since it takes steps to minimize the dfops chain lengths, and relogs the
> > entire dfops chain every few rolls to keep the associated log items
> > moving forward...
> > 
> > > > Once we have code that is using these techniques to do bulk
> > > > operations, then we can look to optimise/genericise the
> > > > infrastructure they use.
> > > > 
> > > > > Moving on to quotaoff...
> > > > > 
> > > > > > I have been spending some time recently in the quota code, so I have
> > > > > > a better grip on what it is doing now than I did last time I looked
> > > > > > at this relogging code. I never really questioned why the quota code
> > > > > > needed two transactions for quota-off, and I'm guessing that nobody
> > > > > > else has either. So I spent some time this morning understanding
> > > > > > what problem it was actually solving and trying to find an alternate
> > > > > > solution to that problem.
> > > > > 
> > > > > Indeed, I hadn't looked into that.
> > > > > 
> > > > > > The reason we have the two quota-off transactions is that active
> > > > > > dquot modifications at the time quotaoff is started leak past the
> > > > > > first quota off transaction that hits the journal. Hence to avoid
> > > > > > incorrect replay of those modifications in the journal if we crash
> > > > > > after the quota-off item passes out of the journal, we pin the
> > > > > > quota-off item in the journal. It gets unpinned by the commit of the
> > > > > > second quota-off transaction at completion time, hence defining the
> > > > > > window in journal where quota-off is being processed and dquot
> > > > > > modifications should be ignored. i.e. there is no window where
> > > > > > recovery will replay dquot modifications incorrectly.
> > > > > > 
> > > > > 
> > > > > Ok.
> > > > > 
> > > > > > However, if the second transaction is left too long, the reservation
> > > > > > will fail to find journal space because of the pinned quota-off item.
> > > > > > 
> > > > > 
> > > > > Right.
> > > > > 
> > > > > > The relogging infrastructure is designed to allow the initial
> > > > > > quota-off intent to keep moving forward in the log so it never pins
> > > > > > the tail of the log before the second quota-off transaction is run.
> > > > > > This tries to avoid the recovery issue because there's always an
> > > > > > active quota off item in the log, but I think there may be a flaw
> > > > > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > > > > in the log that were modified after the quota-off started. Hence if
> > > > > > we crash after the relogging but while the dquots are still in the
> > > > > > log before the relogged quotaoff item, then they will be replayed,
> > > > > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > > > > prevents replay of those items.
> > > > > > 
> > > > > > So while relogging prevents the tail pinning deadlock, I think it
> > > > > > may actually result in incorrect recovery behaviour in that items
> > > > > > that should be cancelled and not replayed can end up getting
> > > > > > replayed.  I'm not sure that this matters for dquots, but for a
> > > > > > general mechanism I think the transactional ordering violations it
> > > > > > can result in reduce its usefulness significantly.
> > > > > > 
> > > > > 
> > > > > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > > > > the early RFCs.
> > > > 
> > > > We might have, but I don't recall that. And it would appear nobody
> > > > looked at this code in any detail if we did discuss it, so I'd say
> > > > the discussion was largely uninformed...
> > > > 
> > > > > Log recovery processes the quotaoff intent in pass 1 and
> > > > > dquot updates in pass 2, which I thought was intended to handle this
> > > > > kind of problem.
> > > > 
> > > > Right, it does handle it, but only because there are two quota-off
> > > > items in the log. i.e.  There's two recovery situations in play here
> > > > - 1) quota off in progress and 2) quota off done.
> > > > 
> > > > In the first case, only the initial quota-off item is in the log, so
> > > > it needs to be detected to stop replay of relevant dquots that
> > > > have been logged after the quota off was started.
> > > > 
> > > > The second case has to be broken down into two situations: a) both quota-off items
> > > > are active in the log, or b) only the second item is active in the log
> > > > as the tail has moved forwards past the first item.
> > > > 
> > > > In the case of 2a), it doesn't matter which item recovery sees, it
> > > > will cancel the dquot updates correctly. In the case of 2b), the
> > > > second quota off item is absolutely necessary to prevent replay of
> > > > the dquots in the log before it.
> > > > 
> > > > Hence if dquot modifications can leak past the first quota-off item
> > > > in the log, then the second item is absolutely necessary to catch
> > > > the 2b) case to prevent incorrect replay of dquot buffers.
> > > > 
> > > 
> > > Ok, but we're talking specifically about log recovery after quotaoff has
> > > completed but before both intents have fallen off of the log. Relogging
> > > of the initial intent (re: the original comment above about incorrect
> > > recovery behavior) has no impact on this general ordering between the
> > > start/end intents or dquot changes and the end intent.
> > > 
> > > > > If I follow correctly, the recovery issue that warrants pinning the
> > > > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > > > happens to fall off the end of the log before the last of the dquot
> > > > > modifications, recovery could see dquot changes after having lost the
> > > > > fact that a quotaoff had occurred at all. The current implementation
> > > > > presumably handles this by pinning the quotaoff until all dquots are
> > > > > completely purged from existence. The relog mechanism just allows the
> > > > > item to move while it effectively remains pinned, so I don't see how it
> > > > > introduces recovery issues.
> > > > 
> > > > As I said, it may not affect the specific quota-off usage, but we
> > > > can't just change the order of items in the physical journal without
> > > > care because the journal is supposed to be -strictly ordered-.
> > > > 
> > > 
> > > The mechanism itself is intended to target specific instances of log
> > > items. Each use case should be evaluated for correctness on its own,
> > > just like one would with ordered buffers or some other internal low
> > > level construct that changes behavior.
> > > 
> > > > Reordering intents in the log automatically without regard to higher
> > > > level transactional ordering dependencies of the log items may
> > > > violate the ordering rules for journalling and recovery of metadata.
> > > > This is why I said automatic relogging may not be useful as generic
> > > > infrastructure - if there are dependent log items, then they need to
> > > > be relogged as an atomic change set that maintains the ordering
> > > > dependencies between objects. That's where this automatic mechanism
> > > > completely falls down - the ordering dependencies are known only by
> > > > the code running the original transaction, not the log items...
> > > > 
> > > 
> > > This and the above sound to me like you're treating automatic relogging
> > > like it would just be enabled by default on all intents, reordering
> > > things arbitrarily. That is not the case as things would certainly
> > > break, just like what would happen if ordered buffers were enabled by
> > > default. The mechanism is per log item and context specific. It is
> > > "generic" in the sense that there are (were) multiple use cases for it,
> > > not that it should be used arbitrarily or "without care."
> > > 
> > > Use cases that have very particular ordering requirements across certain
> > > sets of items should probably not enable this mechanism on those items,
> > > or should otherwise verify that relogging a particular item is safe. The
> > > potential example of this ordering problem being cited is quotaoff, but
> > > we've already gone through this example multiple times and established
> > > that relogging the quotaoff start item is safe.
> > > 
> > > All that said, extending a relogging notification somehow to a
> > > particular context has always been a consideration because 1.) direct
> > > EFI relogging would require log recovery changes and 2.) there was yet
> > > another potential use case where dfops needed to know whether to relog a
> > > particular intent in a long running chain to avoid some issue (the
> > > details of which escape me). I think issue #1 is not complicated to
> > > address, but creates a backwards incompatibility for log recovery. Issue
> > > #2 would potentially separate out relogging as a notification mechanism
> > > from the reservation management bits, but it's still not clear to me
> > > what that notification mechanism would look like for a transaction that
> > > has already been committed by some caller context.
> > > 
> > > I think Darrick was looking at repurposing ->iop_relog() for that one so
> > > I'd be curious to know what that is looking like in general...
> > 
> > ...it's graduated to the point that I'm willing/crazy enough to run it
> > on my development workstations, and it hasn't let out the magic smoke.
> > 
> 
> Heh.
> 
> > I added ->iop_relog handlers to all the major intent items (EFI, RUI,
> > CUI, BUI, SXI), then taught xfs_defer_finish_noroll to relog everything
> > on dop_pending every 7 transaction rolls[1].  Initially this caused log
> > reservation overflow problems with transactions that log intents whose
> > ->finish_item functions themselves log dozens more intents, but then I
> > realized that the second patch[2] I had written took care of this
> > problem.
> > 
> 
> Thanks. I think I get the general idea. We're reworking the
> ->iop_relog() handler to complete and replace the current intent (rather
> than just relog the original intent, which is what this series did for
> the quotaoff case) in the current dfops transaction and allow the dfops
> code to update its reference to the item. The part that's obviously
> missing is some way to determine when we actually need to relog the
> outstanding intents vs. using a fixed roll count.

<nod>  I don't consider myself sufficiently AIL-smart to know how to do
that part. :)

> I suppose we could do something like I was mentioning in my other reply
> on the AIL pushing issue Dave pointed out, where we'd set a bit on
> items that are pinning the tail and in need of a relog. That sounds
> like overkill given this use case is currently self-contained to dfops.

That might be a useful optimization -- every time defer_finish rolls the
transaction, check the items to see if any of them have
XFS_LI_RELOGMEPLEASE set; if any do, or if we hit our (now probably
higher than 7) fixed roll count, we'll relog as desired to keep the log
moving forward.
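
Something like this, maybe?  Sketch only: the XFS_LI_RELOGMEPLEASE flag
bit and the roll count constant don't exist yet, and I'm assuming
->dfp_intent is a log item pointer here:

static bool
xfs_defer_relog_wanted(
	struct xfs_trans	*tp,
	unsigned int		nr_rolls)
{
	struct xfs_defer_pending	*dfp;

	/* Fall back to the fixed roll count... */
	if (nr_rolls >= XFS_DEFER_RELOG_NR)
		return true;

	/* ...or relog early if the AIL flagged one of our intents. */
	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
		struct xfs_log_item	*lip = dfp->dfp_intent;

		if (lip && test_bit(XFS_LI_RELOGMEPLEASE, &lip->li_flags))
			return true;
	}
	return false;
}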

> Perhaps the other idea of factoring out the threshold determination
> logic from xlog_grant_push_ail() might be useful.
> 
> For example, if the current free reservation is below the calculated
> threshold (with need_bytes == 0), return a threshold LSN based on the
> current tail. Instead of using that to push the AIL, compare it to
> ->li_lsn of each intent and relog any that are inside the threshold LSN
> (which will probably be all of them in practice since they are part of
> the same transaction). We'd probably need to identify intents that have
> been recently relogged so the process doesn't repeat until the CIL
> drains and the li_lsn eventually changes. Hmm.. I did have an
> XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> purposes that might actually be sufficient for that. We could also
> consider stashing a "relog push" LSN somewhere (similar to the way AIL
> pushing works) and perhaps use that to avoid repeated relogs on a chain,
> but it's not immediately clear to me how well that would fit into the
> dfops mechanism...

...is there a sane way for dfops to query the threshold LSN so that it
could compare against the li_lsn of each item it holds?

--D

> Brian
> 
> > That patch, of course, is one that I posted a while ago that makes it so
> > that if a transaction owner logs items A and B and then commits (or
> > calls defer_roll), dfops will now finish all the items created by A's
> > ->finish_item before it moves on to trying to finish B.
> > 
> > I had put /that/ patch aside after Brian pointed out that on its own,
> > that patch merely substituted pinning the tail on some sub-item of A
> > with pinning the tail on B, but I think with both patches applied I have
> > solved both problems.
> > 
> > --D
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=515cc4e637bf4e9afcfbaeb39b13f85b27923916
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=8e63f8a7af12d673feb5400d09179502632854c4
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=ddfeca6f1f1862c3f162db8b8bdbfc5149f5e5c5
> > 
> > > > > > But back to quota-off: What I've realised is that the only dquot
> > > > > > modifications we need to protect against being recovered are the
> > > > > > ones that are running at the time the first quota-off is committed
> > > > > > to the journal. That is, once the DQACTIVE flags are clear,
> > > > > > transactions will not modify those dquots anymore. Hence by the time
> > > > > > that the quota off item pins the tail of the log, the transactions
> > > > > > that were actively dirtying inodes when it was committed have also
> > > > > > committed and are in the journal and there are no actively modified
> > > > > > dquots left in memory.
> > > > > > 
> > > > > 
> > > > > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > > > > transaction running in parallel hadn't modified a dquot and committed
> > > > > after the quotaoff, but I think I see where you're going in general...
> > > > 
> > > > We drained out all the transactions that can be modifying quotas
> > > > before we log the quotaoff items. So, by definition, this cannot
> > > > happen.
> > > > 
> > > > > > IOWs, we don't actually need to wait until we've released and purged
> > > > > > all the dquots from memory before we log the second quota off item;
> > > > > > all we need to wait for is for all the transactions with dirty
> > > > > > dquots to have committed. These transactions already have log
> > > > > > reservations, so completing them will free unused reservation space
> > > > > > for the second quota off transaction. Once they are committed, then
> > > > > > we can log the second item. i.e. we don't have to wait until we've
> > > > > > cleaned up the dquots to close out the quota-off transaction in the
> > > > > > journal.
> > > > > > 
> > > > > 
> > > > > Ok, so we can deterministically shorten the window with a runtime
> > > > > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > > > > than relying on the full dquot purge to provide this ordering.
> > > > 
> > > > Yup.
> > > > 
> > > > > > To make it even more robust, if we stop all the transactions that
> > > > > > may dirty dquots and drain the active ones before we log the first
> > > > > > quota-off item, we can log the second item immediately afterwards
> > > > > > because it is known that there are no dquot modifications in flight
> > > > > > when the first item is logged. We can probably even log both items
> > > > > > in the same transaction.
> > > > > > 
> > > > > 
> > > > > I was going to ask why we'd even need two items if this approach is
> > > > > generally viable.
> > > > 
> > > > Because I don't want to change the in-journal appearance of
> > > > quota-off to older kernels. Changing how things appear on disk is
> > > > dangerous and likely going to bite us in unexpected ways.
> > > > 
> > > 
> > > Well, combining them into a single transaction doesn't guarantee ordering
> > > of the two, right? So it might not be worth doing that either if we're
> > > concerned about log appearance. Regardless, those potential steps can be
> > > evaluated independently on top of the core runtime fixes.
> > > 
> > > > > > So, putting my money where my mouth is, the patch below does this.
> > > > > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > > > > of -g quota with all quotas enabled and is currently running a full
> > > > > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > > > > after about 4 hours of testing now....
> > > > > > 
> > > > > 
> > > > > Thanks for the patch. First, I like the idea and agree that it's
> > > > > simpler than the relogging approach. I do still need to stare at it some
> > > > > more to grok it and convince myself it's safe.
> > > > > 
> > > > > The thing that sticks out to me is tagging all of the transactions that
> > > > > modify quotas. Is there any reason we can't just quiesce the transaction
> > > > > subsystem entirely as a first step? It's not like quotaoff is common or
> > > > > performance sensitive. For example:
> > > > >
> > > > > 1. stop all transactions, wait to drain, force log
> > > > > 2. log the sb/quotaoff synchronously (punching through via something
> > > > >    like NO_WRITECOUNT)
> > > > > 3. clear the xfs_mount quota active flags
> > > > > 4. restart the transaction subsystem (no more dquot mods)
> > > > > 5. complete quotaoff via the dquot release and purge sequence
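> > > > >
> > > > > Something like the following completely untested sketch, where the
> > > > > barrier helpers (xfs_trans_barrier_start/end) are invented names for
> > > > > whatever ends up blocking and draining the transaction subsystem:
> > > > >
> > > > > 	STATIC int
> > > > > 	xfs_qm_quotaoff_quiesce(
> > > > > 		struct xfs_mount	*mp,
> > > > > 		uint			flags)
> > > > > 	{
> > > > > 		int			error;
> > > > >
> > > > > 		/* 1. lock out new transactions, drain active ones */
> > > > > 		xfs_trans_barrier_start(mp);
> > > > > 		xfs_log_force(mp, XFS_LOG_SYNC);
> > > > >
> > > > > 		/* 2. log the quotaoff item/sb change synchronously */
> > > > > 		error = xfs_qm_log_quotaoff(mp, flags);	/* args hand-waved */
> > > > > 		if (!error)
> > > > > 			/* 3. new transactions now skip the dquots */
> > > > > 			mp->m_qflags &= ~flags;
> > > > >
> > > > > 		/* 4. restart the transaction subsystem */
> > > > > 		xfs_trans_barrier_end(mp);
> > > > >
> > > > > 		/* 5. caller releases/purges dquots as it does today */
> > > > > 		return error;
> > > > > 	}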
> > > > 
> > > > Yup, as I said on #xfs a short while ago:
> > > > 
> > > > [3/7/20 01:15] <djwong> qi_active_trans?
> > > > [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> > > > [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> > > > [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> > > > [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> > > > [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> > > > [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> > > > [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> > > > [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> > > > [3/7/20 09:01] <djwong> oh
> > > > [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> > > > 
> > > > So, we used to have exactly the functionality I needed in XFS as
> > > > general infrastructure, but we've removed it over the past few years
> > > > as the VFS has slowly been brought up to feature parity with XFS. I
> > > > just implemented what I needed to block/halt quota modifications
> > > > because I didn't want to perturb anything else while exploring if my
> > > > hypothesis was correct.
> > > > 
> > > 
> > > Ok.
> > > 
> > > > The only outstanding thing I haven't checked out fully is the
> > > > delayed allocation reservations that aren't done in transaction
> > > > contexts. I -think- these are OK because they are in memory only,
> > > > and they will be serialised on the inode lock when detaching dquots
> > > > (i.e. the existing dquot purging ordering mechanisms) after quotas
> > > > are turned off. Hence I think these are fine, but more investigation
> > > > will be needed there to confirm behaviour is correct.
> > > > 
> > > 
> > > Yep.
> > > 
> > > > > I think it could be worth the tradeoff for the simplicity of not having
> > > > > to maintain the transaction reservation tags or the special quota
> > > > > waiting infrastructure vs. something like the more generic (recently
> > > > > removed) transaction counter. We might even be able to abstract the
> > > > > whole thing behind a transaction flag. E.g.:
> > > > > 
> > > > > 	/*
> > > > > 	 * A barrier transaction locks out further transactions and waits on
> > > > > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > > > > 	 * Everything unlocks when the transaction commits.
> > > > > 	 */
> > > > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > > > 			XFS_TRANS_BARRIER, &tp);
> > > > > 	...
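> > > > >
> > > > > Internally the flag might boil down to something as simple as a
> > > > > percpu rwsem in the xfs_mount (m_trans_rwsem is invented here):
> > > > >
> > > > > 	/*
> > > > > 	 * In xfs_trans_alloc(): normal transactions take the lock
> > > > > 	 * shared; a barrier transaction takes it exclusive, which
> > > > > 	 * waits for all outstanding transactions to commit or cancel.
> > > > > 	 */
> > > > > 	if (flags & XFS_TRANS_BARRIER)
> > > > > 		percpu_down_write(&mp->m_trans_rwsem);
> > > > > 	else
> > > > > 		percpu_down_read(&mp->m_trans_rwsem);
> > > > >
> > > > > 	/* with the matching release in xfs_trans_free() */
> > > > > 	if (tp->t_flags & XFS_TRANS_BARRIER)
> > > > > 		percpu_up_write(&mp->m_trans_rwsem);
> > > > > 	else
> > > > > 		percpu_up_read(&mp->m_trans_rwsem);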
> > > > 
> > > > Yup, if we decide that we want to track all active transactions again
> > > > rather than just when quota is active, it would make a lot of
> > > > sense to make it a formal function of the xfs_trans_alloc() API.
> > > > 
> > > > Really, though, I've got so many other things on my plate right now
> > > > I don't have the time to take on yet another infrastructure
> > > > reworking. I spent the time to write the patch because if I was
> > > > going to say I didn't like relogging then it was absolutely
> > > > necessary for me to provide an alternative solution to the problem,
> > > > but I'm really hoping that it is sufficient for someone else to be
> > > > able to pick it up and run with it....
> > > > 
> > > 
> > > Ok, I can take a look at this since I need to step back and rethink this
> > > particular feature anyways.
> > > 
> > > Brian
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > 
> > > > PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> > > > the past and run with it, I'm more than happy for you to do so. I've
> > > > got way more ideas and prototypes than I've got time to turn into
> > > > full production features. I also don't care about "ownership" of the
> > > > work; it's better to have someone actively working on the code than
> > > > having it sit around waiting for me to find time to get back to
> > > > it...
> > > > 
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > > 
> > > 
> > 
> 


* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-08 16:44             ` Darrick J. Wong
@ 2020-07-09 12:15               ` Brian Foster
  2020-07-09 16:32                 ` Darrick J. Wong
  2020-07-20  3:58                 ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Brian Foster @ 2020-07-09 12:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > On Mon, Jul 06, 2020 at 10:42:57AM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> > > > On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > > > > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > > > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > ...
> > > > 
> > > > > Consider that a single transaction containing an EFD for the
> > > > > original EFI and a new EFI for the same extent is effectively
> > > > > "relogging the EFI". It does so by atomically cancelling the
> > > > > original EFI in the log and creating a new EFI.
> > > > > 
> > > > 
> > > > Right. This is how dfops currently works IIRC.
> > > > 
> > > > > Now, an EFI is defined on disk as:
> > > > > 
> > > > > typedef struct xfs_efi_log_format {
> > > > >         uint16_t                efi_type;       /* efi log item type */
> > > > >         uint16_t                efi_size;       /* size of this item */
> > > > >         uint32_t                efi_nextents;   /* # extents to free */
> > > > >         uint64_t                efi_id;         /* efi identifier */
> > > > >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > > > > } xfs_efi_log_format_t;
> > > > > 
> > > > > Which means it can hold up to 2^16-1 individual extents that we
> > > > > intend to free. We currently only use one extent per EFI, but if we
> > > > > go back in history, they were dynamically sized structures and
> > > > > could track arbitrary numbers of extents.
> > > > > 
> > > > > So, repair needs to track multiple nested EFIs?
> > > > > 
> > > > > We cancel the old EFI, log a new EFI with all the old extents and
> > > > > the new extent in it. We now have a single EFI in the journal
> > > > > containing N+1 extents in it.
> > > > > 
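> > > > > In pseudo-code, with the helper names hand-waved rather than taken
> > > > > from the current API, the single transaction looks something like:
> > > > >
> > > > > 	/* the EFD cancels the original EFI in the log... */
> > > > > 	efdp = xfs_trans_get_efd(tp, old_efi,
> > > > > 			old_efi->efi_format.efi_nextents);
> > > > >
> > > > > 	/* ...and a new EFI carries the old extents plus one more */
> > > > > 	new_efi = xfs_trans_get_efi(tp,
> > > > > 			old_efi->efi_format.efi_nextents + 1);
> > > > > 	for (i = 0; i < old_efi->efi_format.efi_nextents; i++)
> > > > > 		xfs_efi_add_extent(new_efi,
> > > > > 				&old_efi->efi_format.efi_extents[i]);
> > > > > 	xfs_efi_add_extent(new_efi, &new_extent);
> > > > >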
> > > > 
> > > > That's an interesting optimization.
> > > 
> > > Hmm, I hadn't thought about amortizing the cost of maintaining an EFI
> > > across as many of the btree block allocations as possible.  That would
> > > make the current scheme (which I'll get into below) less scary.
> > > 
> > > > > Further, an EFD with multiple extents in it is -intended to be
> > > > > relogged- multiple times. Every time we free an extent in the EFI,
> > > > > we remove it from the EFD and relog the EFD. This tells log recovery
> > > > > that this extent has now been freed, and that it should not replay
> > > > > it, even though it is still in the EFI.
> > > > > 
> > > > > And to prevent the big EFI from pinning the tail of the log while
> > > > > EFDs are being processed, we can relog the EFI along with the EFD
> > > > > each time the EFD is updated, hence we drag the EFI forwards in
> > > > > every high level transaction roll when we are actually freeing the
> > > > > extents.
> > > > > 
> > > > 
> > > > Hmm.. I'm not sure that addresses the deadlock problem for repair. That
> > > > assumes that EFD updates come at regular enough intervals to keep the
> > > > tail moving, but IIRC the bulk loading infrastructure will essentially
> > > > log a bunch of EFIs, spend a non-deterministic amount of time doing
> > > > work, then log the associated EFDs. So there's still a period of time in
> > > > there where we might need to relog intents that aren't otherwise being
> > > > updated.
> > > > 
> > > > Darrick might want to chime in here in case I'm missing something...
> > > 
> > > I changed the ->claim_block function in the online repair code[3] to
> > > relog all of the EFIs (using the strategy Dave outlined above), with the
> > > claimed block not present in the new EFI.  I think this means we can't
> > > pin the log tail longer than it takes to memcpy a bunch of records into
> > > a block and put it on a delwri list, which should be fast enough.
> > > 
> > 
> > Ah, interesting. So each filled block would relog the intents associated
> > with the outstanding reservation for the rebuild. That certainly will
> > keep things moving; I'd suspect relogging far more frequently than
> > necessary, but correctness first. :)
> 
> <nod> It's even less bad if you've maximized the number of free extent
> records per EFI.
> 
> > One thing I'm curious about in the broader context of repair is what
> > happens if the filesystem crashes mid-rebuild? IIRC the rebuilt tree is
> > fake rooted and the above seems to imply we'd have EFDs logged for the
> > blocks consumed by the tree. Are those blocks restored to the fs somehow
> > or another on a crash recovery?
> 
> Each time we relog the EFIs, we log exactly the same records as before,
> which means that all the blocks in the new btree remain the target of
> EFIs until the repair completes.  If we crash before the end, log
> recovery will free the entire half-built structure.
> 

I'm a little confused by how we log EFIs for exactly the same records as
before while the above description says we leave off the claimed block
in the new EFI when it rolls. Does the roll in this scenario complete
the previous EFI(s) with an EFD and create a new EFI without the claimed
block? For example, suppose we do our bulk reservation and log the
necessary EFIs, start the rebuild, claim the first block, relog the
outstanding EFI without the claimed block, and then the system crashes.
What happens with the claimed block on subsequent recovery?

> If we reach the end of the repair, we'll log EFDs for all the EFIs in
> the btree bulkload reservation; log the new btree root; and queue a
> bunch of extfree_items (which themselves log more EFIs) for the old
> btree blocks.  Then we let dfops finish the extfree items, which deletes
> the old btree.  If we crash during that part, log recovery will roll us
> forward to the point of having a freshly rebuilt btree and no leaked
> blocks.
> 
> > > > > The key to this is that the EFI/EFD relogging must be done entirely
> > > > > under a single rolling transaction, so there is -always- space
> > > > > available in the log for both the EFI and the EFDs to be relogged as
> > > > > the long running operation is performed.
> > > > > 
> > > > > IOWs, the EFI/EFD structures support relogging of the intents at a
> > > > > design level, and it is intended that this process is entirely
> > > > > driven from a single rolling transaction context. I strongly suspect
> > > > > that all the recent EFI/EFD and deferred ops reworking has lost a
> > > > > lot of this context from the historical EFI/EFD implementation...
> > > 
> > > I don't agree with this, since the BUI/CUI items actively relog
> > > themselves for another go-around if they decide that they still have
> > > work to do, and the atomic extent swap item I proposed also takes
> > > advantage of this design property.
> > > 
> > > Though, I concede that I don't think any of us were watching the
> > > dfops manager carefully enough to spot the occasional need to relog
> > > all the attached intent items if the chain gets long enough.
> > > 
> > > > > So before we go down the path of implementing generic automatic
> > > > > relogging infrastructure, we first should have been writing the
> > > > > application code that needs to relog intents and use a mechanism
> > > > > like the above to cancel and reinsert intents further down the log.
> > > 
> > > Already done.  The reason why Brian and I are stirring up this hornet
> > > nest again is that I started posting patches to fix various deficiencies
> > > that were exposed by generic/52[12] shakedowns of the atomic swap code. ;)
> > > 
> > > I guess I should go post the latest version of the defer freezer code
> > > since it takes steps to minimize the dfops chain lengths, and relogs the
> > > entire dfops chain every few rolls to keep the associated log items
> > > moving forward...
> > > 
> > > > > Once we have code that is using these techniques to do bulk
> > > > > operations, then we can look to optimise/genericise the
> > > > > infrastructure they use.
> > > > > 
> > > > > > Moving on to quotaoff...
> > > > > > 
> > > > > > > I have been spending some time recently in the quota code, so I have
> > > > > > > a better grip on what it is doing now than I did last time I looked
> > > > > > > at this relogging code. I never really questioned why the quota code
> > > > > > > needed two transactions for quota-off, and I'm guessing that nobody
> > > > > > > else has either. So I spent some time this morning understanding
> > > > > > > what problem it was actually solving and trying to find an alternate
> > > > > > > solution to that problem.
> > > > > > 
> > > > > > Indeed, I hadn't looked into that.
> > > > > > 
> > > > > > > The reason we have the two quota-off transactions is that active
> > > > > > > dquot modifications at the time quotaoff is started leak past the
> > > > > > > first quota off transaction that hits the journal. Hence to avoid
> > > > > > > incorrect replay of those modifications in the journal if we crash
> > > > > > > after the quota-off item passes out of the journal, we pin the
> > > > > > > quota-off item in the journal. It gets unpinned by the commit of the
> > > > > > > second quota-off transaction at completion time, hence defining the
> > > > > > > window in the journal where quota-off is being processed and dquot
> > > > > > > modifications should be ignored. i.e. there is no window where
> > > > > > > recovery will replay dquot modifications incorrectly.
> > > > > > > 
> > > > > > 
> > > > > > Ok.
> > > > > > 
> > > > > > > However, if the second transaction is left too long, the reservation
> > > > > > > will fail to find journal space because of the pinned quota-off item.
> > > > > > > 
> > > > > > 
> > > > > > Right.
> > > > > > 
> > > > > > > The relogging infrastructure is designed to allow the initial
> > > > > > > quota-off intent to keep moving forward in the log so it never pins
> > > > > > > the tail of the log before the second quota-off transaction is run.
> > > > > > > This tries to avoid the recovery issue because there's always an
> > > > > > > active quota off item in the log, but I think there may be a flaw
> > > > > > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > > > > > in the log that were modified after the quota-off started. Hence if
> > > > > > > we crash after the relogging but while the dquots are still in the
> > > > > > > log before the relogged quotaoff item, then they will be replayed,
> > > > > > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > > > > > prevents replay of those items.
> > > > > > > 
> > > > > > > So while relogging prevents the tail pinning deadlock, I think it
> > > > > > > may actually result in incorrect recovery behaviour in that items
> > > > > > > that should be cancelled and not replayed can end up getting
> > > > > > > replayed.  I'm not sure that this matters for dquots, but for a
> > > > > > > general mechanism I think the transactional ordering violations it
> > > > > > > can result in reduce its usefulness significantly.
> > > > > > > 
> > > > > > 
> > > > > > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > > > > > the early RFCs.
> > > > > 
> > > > > We might have, but I don't recall that. And it would appear nobody
> > > > > looked at this code in any detail if we did discuss it, so I'd say
> > > > > the discussion was largely uninformed...
> > > > > 
> > > > > > Log recovery processes the quotaoff intent in pass 1 and
> > > > > > dquot updates in pass 2, which I thought was intended to handle this
> > > > > > kind of problem.
> > > > > 
> > > > > Right, it does handle it, but only because there are two quota-off
> > > > > items in the log. i.e.  There's two recovery situations in play here
> > > > > - 1) quota off in progress and 2) quota off done.
> > > > > 
> > > > > In the first case, only the initial quota-off item is in the log, so
> > > > > it needs to be detected to stop replay of relevant dquots that
> > > > > have been logged after the quota off was started.
> > > > > 
> > > > > The second case has to be broken down into two situations: a) both quota-off items
> > > > > are active in the log, or b) only the second item is active in the log
> > > > > as the tail has moved forwards past the first item.
> > > > > 
> > > > > In the case of 2a), it doesn't matter which item recovery sees, it
> > > > > will cancel the dquot updates correctly. In the case of 2b), the
> > > > > second quota off item is absolutely necessary to prevent replay of
> > > > > the dquots in the log before it.
> > > > > 
> > > > > Hence if dquot modifications can leak past the first quota-off item
> > > > > in the log, then the second item is absolutely necessary to catch
> > > > > the 2b) case to prevent incorrect replay of dquot buffers.
> > > > > 
> > > > 
> > > > Ok, but we're talking specifically about log recovery after quotaoff has
> > > > completed but before both intents have fallen off of the log. Relogging
> > > > of the initial intent (re: the original comment above about incorrect
> > > > recovery behavior) has no impact on this general ordering between the
> > > > start/end intents or dquot changes and the end intent.
> > > > 
> > > > > > If I follow correctly, the recovery issue that warrants pinning the
> > > > > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > > > > happens to fall off the end of the log before the last of the dquot
> > > > > > modifications, recovery could see dquot changes after having lost the
> > > > > > fact that a quotaoff had occurred at all. The current implementation
> > > > > > presumably handles this by pinning the quotaoff until all dquots are
> > > > > > completely purged from existence. The relog mechanism just allows the
> > > > > > item to move while it effectively remains pinned, so I don't see how it
> > > > > > introduces recovery issues.
> > > > > 
> > > > > As I said, it may not affect the specific quota-off usage, but we
> > > > > can't just change the order of items in the physical journal without
> > > > > care because the journal is supposed to be -strictly ordered-.
> > > > > 
> > > > 
> > > > The mechanism itself is intended to target specific instances of log
> > > > items. Each use case should be evaluated for correctness on its own,
> > > > just like one would with ordered buffers or some other internal low
> > > > level construct that changes behavior.
> > > > 
> > > > > Reordering intents in the log automatically without regard to higher
> > > > > level transactional ordering dependencies of the log items may
> > > > > violate the ordering rules for journalling and recovery of metadata.
> > > > > This is why I said automatic relogging may not be useful as generic
> > > > > infrastructure - if there are dependent log items, then they need to
> > > > > be relogged as an atomic change set that maintains the ordering
> > > > > dependencies between objects. That's where this automatic mechanism
> > > > > completely falls down - the ordering dependencies are known only by
> > > > > the code running the original transaction, not the log items...
> > > > > 
> > > > 
> > > > This and the above sound to me like you're treating automatic relogging
> > > > like it would just be enabled by default on all intents, reordering
> > > > things arbitrarily. That is not the case as things would certainly
> > > > break, just like what would happen if ordered buffers were enabled by
> > > > default. The mechanism is per log item and context specific. It is
> > > > "generic" in the sense that there are (were) multiple use cases for it,
> > > > not that it should be used arbitrarily or "without care."
> > > > 
> > > > Use cases that have very particular ordering requirements across certain
> > > > sets of items should probably not enable this mechanism on those items
> > > > or otherwise verify that relogging a particular item is safe. The
> > > > potential example of this ordering problem being cited is quotaoff, but
> > > > we've already gone through this example multiple times and established
> > > > that relogging the quotaoff start item is safe.
> > > > 
> > > > All that said, extending a relogging notification somehow to a
> > > > particular context has always been a consideration because 1.) direct
> > > > EFI relogging would require log recovery changes and 2.) there was yet
> > > > another potential use case where dfops needed to know whether to relog a
> > > > particular intent in a long running chain to avoid some issue (the
> > > > details of which escape me). I think issue #1 is not complicated to
> > > > address, but creates a backwards incompatibility for log recovery. Issue
> > > > #2 would potentially separate out relogging as a notification mechanism
> > > > from the reservation management bits, but it's still not clear to me
> > > > what that notification mechanism would look like for a transaction that
> > > > has already been committed by some caller context.
> > > > 
> > > > I think Darrick was looking at repurposing ->iop_relog() for that one so
> > > > I'd be curious to know what that is looking like in general...
> > > 
> > > ...it's graduated to the point that I'm willing/crazy enough to run it
> > > on my development workstations, and it hasn't let out the magic smoke.
> > > 
> > 
> > Heh.
> > 
> > > I added ->iop_relog handlers to all the major intent items (EFI, RUI,
> > > CUI, BUI, SXI), then taught xfs_defer_finish_noroll to relog everything
> > > on dop_pending every 7 transaction rolls[1].  Initially this caused log
> > > reservation overflow problems with transactions that log intents whose
> > > ->finish_item functions themselves log dozens more intents, but then
> > > realized that the second patch[2] I had written took care of this
> > > problem.
> > > 
> > 
> > Thanks. I think I get the general idea. We're reworking the
> > ->iop_relog() handler to complete and replace the current intent (rather
> > than just relog the original intent, which is what this series did for
> > the quotaoff case) in the current dfops transaction and allow the dfops
> > code to update its reference to the item. The part that's obviously
> > missing is some kind of determination on when we actually need to relog
> > the outstanding intents vs. using a fixed roll count.
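> > 
> > Roughly, as I read it (types and the callback signature approximated
> > from the patches, not verbatim):
> > 
> > 	static void
> > 	xfs_defer_relog_intents(
> > 		struct xfs_trans		*tp)
> > 	{
> > 		struct xfs_defer_pending	*dfp;
> > 
> > 		list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
> > 			if (!dfp->dfp_intent)
> > 				continue;
> > 			/*
> > 			 * ->iop_relog() logs a done item against the old
> > 			 * intent, logs a replacement intent and hands it
> > 			 * back; dfops just swaps its reference over.
> > 			 */
> > 			dfp->dfp_intent = dfp->dfp_intent->li_ops->iop_relog(
> > 					dfp->dfp_intent, tp);
> > 		}
> > 	}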
> 
> <nod>  I don't consider myself sufficiently ail-smart to know how to do
> that part. :)
> 
> > I suppose we could do something like what I was mentioning in my other
> > reply on the AIL pushing issue Dave pointed out, where we'd set a bit on
> > certain items that are tail pinned and in need of relog. That sounds
> > like overkill given this use case is currently self-contained to dfops.
> 
> That might be a useful optimization -- every time defer_finish rolls the
> transaction, check the items to see if any of them have
> XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> probably higher than 7) fixed roll count, we'll relog as desired to keep
> the log moving forward.
> 
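> i.e. something like this after each roll, with XFS_DEFER_RELOG_NR
> standing in for whatever count we settle on:
> 
> 	bool relog = ++nr_rolls >= XFS_DEFER_RELOG_NR;	/* fixed fallback */
> 
> 	list_for_each_entry(dfp, &tp->t_dfops, dfp_list)
> 		if (dfp->dfp_intent &&
> 		    test_bit(XFS_LI_RELOGMEPLEASE,
> 			     &dfp->dfp_intent->li_flags))
> 			relog = true;
> 	if (relog)
> 		xfs_defer_relog_intents(tp);	/* cancel + replace each */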

It's an optimization in some sense to prevent unnecessary relogs, but
the intent would be to avoid the need for a fixed count by delivering
the relog notification to a transaction that is guaranteed to have the
reservation necessary to act on it. I'm not sure it's worth the
complexity if for some reason we still needed to fall back to a hard
count.

> > Perhaps the other idea of factoring out the threshold determination
> > logic from xlog_grant_push_ail() might be useful.
> > 
> > For example, if the current free reservation is below the calculated
> > threshold (with need_bytes == 0), return a threshold LSN based on the
> > current tail. Instead of using that to push the AIL, compare it to
> > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > (which will probably be all of them in practice since they are part of
> > the same transaction). We'd probably need to identify intents that have
> > been recently relogged so the process doesn't repeat until the CIL
> > drains and the li_lsn eventually changes. Hmm.. I did have an
> > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > purposes that might actually be sufficient for that. We could also
> > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > but it's not immediately clear to me how well that would fit into the
> > dfops mechanism...
> 
> ...is there a sane way for dfops to query the threshold LSN so that it
> could compare against the li_lsn of each item it holds?
> 

I'd start with just trying to reuse the logic in xlog_grant_push_ail()
(i.e. just factor out the AIL push). That function starts with a check
on available log reservation to filter out unnecessary pushes, then
calculates the AIL push LSN by adding the amount of log space we need to
free up to the current log tail. In this case we're not pushing the AIL,
but I think we'd be able to use the same threshold calculation logic to
determine when to relog intents that happen to reside within the range
from the current tail to the calculated threshold.
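
As a rough sketch (the factored helper and the relog call are invented
names here, not existing functions):

	/*
	 * Factored out of xlog_grant_push_ail(); returns NULLCOMMITLSN
	 * when there's already sufficient free reservation.
	 */
	xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);

	/* dfops side: relog any intent inside the threshold window */
	threshold_lsn = xlog_grant_push_threshold(log, 0);
	if (threshold_lsn == NULLCOMMITLSN)
		return;

	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
		struct xfs_log_item	*lip = dfp->dfp_intent;

		if (lip && lip->li_lsn &&
		    XFS_LSN_CMP(lip->li_lsn, threshold_lsn) <= 0)
			xfs_trans_item_relog(lip, tp);	/* hypothetical */
	}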

Brian

> --D
> 
> > Brian
> > 
> > > That patch, of course, is one that I posted a while ago that makes it so
> > > that if a transaction owner logs items A and B and then commits (or
> > > calls defer_roll), dfops will now finish all the items created by A's
> > > ->finish_item before it moves on to trying to finish B.
> > > 
> > > I had put /that/ patch aside after Brian pointed out that on its own,
> > > that patch merely substituted pinning the tail on some sub-item of A
> > > with pinning the tail on B, but I think with both patches applied I have
> > > solved both problems.
> > > 
> > > --D
> > > 
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=515cc4e637bf4e9afcfbaeb39b13f85b27923916
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=8e63f8a7af12d673feb5400d09179502632854c4
> > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=ddfeca6f1f1862c3f162db8b8bdbfc5149f5e5c5
> > > 
> > > > > > > But back to quota-off: What I've realised is that the only dquot
> > > > > > > modifications we need to protect against being recovered are the
> > > > > > > ones that are running at the time the first quota-off is committed
> > > > > > > to the journal. That is, once the DQACTIVE flags are clear,
> > > > > > > transactions will not modify those dquots anymore. Hence by the time
> > > > > > > that the quota off item pins the tail of the log, the transactions
> > > > > > > that were actively dirtying inodes when it was committed have also
> > > > > > > committed and are in the journal and there are no actively modified
> > > > > > > dquots left in memory.
> > > > > > > 
> > > > > > 
> > > > > > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > > > > > transaction running in parallel hadn't modified a dquot and committed
> > > > > > after the quotaoff, but I think I see where you're going in general...
> > > > > 
> > > > > We drained out all the transactions that can be modifying quotas
> > > > > before we log the quotaoff items. So, by definition, this cannot
> > > > > happen.
> > > > > 
> > > > > > > IOWs, we don't actually need to wait until we've released and purged
> > > > > > > all the dquots from memory before we log the second quota off item;
> > > > > > > all we need to wait for is for all the transactions with dirty
> > > > > > > dquots to have committed. These transactions already have log
> > > > > > > reservations, so completing them will free unused reservation space
> > > > > > > for the second quota off transaction. Once they are committed, then
> > > > > > > we can log the second item. i.e. we don't have to wait until we've
> > > > > > > cleaned up the dquots to close out the quota-off transaction in the
> > > > > > > journal.
> > > > > > > 
> > > > > > 
> > > > > > Ok, so we can deterministically shorten the window with a runtime
> > > > > > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > > > > > than relying on the full dquot purge to provide this ordering.
> > > > > 
> > > > > Yup.
> > > > > 
> > > > > > > To make it even more robust, if we stop all the transactions that
> > > > > > > may dirty dquots and drain the active ones before we log the first
> > > > > > > quota-off item, we can log the second item immediately afterwards
> > > > > > > because it is known that there are no dquot modifications in flight
> > > > > > > when the first item is logged. We can probably even log both items
> > > > > > > in the same transaction.
> > > > > > > 
> > > > > > 
> > > > > > I was going to ask why we'd even need two items if this approach is
> > > > > > generally viable.
> > > > > 
> > > > > Because I don't want to change the in-journal appearance of
> > > > > quota-off to older kernels. Changing how things appear on disk is
> > > > > dangerous and likely going to bite us in unexpected ways.
> > > > > 
> > > > 
> > > > Well combining them into a single transaction doesn't guarantee ordering
> > > > of the two, right? So it might not be worth doing that either if we're
> > > > concerned about log appearance. Regardless, those potential steps can be
> > > > evaluated independently on top of the core runtime fixes.
> > > > 
> > > > > > > So, putting my money where my mouth is, the patch below does this.
> > > > > > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > > > > > of -g quota with all quotas enabled and is currently running a full
> > > > > > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > > > > > after about 4 hours of testing now....
> > > > > > > 
> > > > > > 
> > > > > > Thanks for the patch. First, I like the idea and agree that it's
> > > > > > simpler than the relogging approach. I do still need to stare at it some
> > > > > > more to grok it and convince myself it's safe.
> > > > > > 
> > > > > > The thing that sticks out to me is tagging all of the transactions that
> > > > > > modify quotas. Is there any reason we can't just quiesce the transaction
> > > > > > subsystem entirely as a first step? It's not like quotaoff is common or
> > > > > > performance sensitive. For example:
> > > > > >
> > > > > > 1. stop all transactions, wait to drain, force log
> > > > > > 2. log the sb/quotaoff synchronously (punching through via something
> > > > > >    like NO_WRITECOUNT)
> > > > > > 3. clear the xfs_mount quota active flags
> > > > > > 4. restart the transaction subsystem (no more dquot mods)
> > > > > > 5. complete quotaoff via the dquot release and purge sequence
> > > > > 
> > > > > Yup, as I said on #xfs a short while ago:
> > > > > 
> > > > > [3/7/20 01:15] <djwong> qi_active_trans?
> > > > > [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> > > > > [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> > > > > [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> > > > > [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> > > > > [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> > > > > [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> > > > > [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> > > > > [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> > > > > [3/7/20 09:01] <djwong> oh
> > > > > [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> > > > > 
> > > > > So, we used to have exactly the functionality I needed in XFS as
> > > > > general infrastructure, but we've removed it over the past few years
> > > > > as the VFS has slowly been brought up to feature parity with XFS. I
> > > > > just implemented what I needed to block/halt quota modifications
> > > > > because I didn't want to perturb anything else while exploring if my
> > > > > hypothesis was correct.
> > > > > 
> > > > 
> > > > Ok.
> > > > 
> > > > > The only outstanding thing I haven't checked out fully is the
> > > > > delayed allocation reservations that aren't done in transaction
> > > > > contexts. I -think- these are OK because they are in memory only,
> > > > > and they will be serialised on the inode lock when detaching dquots
> > > > > (i.e. the existing dquot purging ordering mechanisms) after quotas
> > > > > are turned off. Hence I think these are fine, but more investigation
> > > > > will be needed there to confirm behaviour is correct.
> > > > > 
> > > > 
> > > > Yep.
> > > > 
> > > > > > I think it could be worth the tradeoff for the simplicity of not having
> > > > > > to maintain the transaction reservation tags or the special quota
> > > > > > waiting infrastructure vs. something like the more generic (recently
> > > > > > removed) transaction counter. We might even be able to abstract the
> > > > > > whole thing behind a transaction flag. E.g.:
> > > > > > 
> > > > > > 	/*
> > > > > > 	 * A barrier transaction locks out further transactions and waits on
> > > > > > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > > > > > 	 * Everything unlocks when the transaction commits.
> > > > > > 	 */
> > > > > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > > > > 			XFS_TRANS_BARRIER, &tp);
> > > > > > 	...
> > > > > 
> > > > > Yup, if we decide that we want to track all active transactions again
> > > > > rather than just when quota is active, it would make a lot of
> > > > > sense to make it a formal function of the xfs_trans_alloc() API.
> > > > > 
> > > > > Really, though, I've got so many other things on my plate right now
> > > > > I don't have the time to take on yet another infrastructure
> > > > > reworking. I spent the time to write the patch because if I was
> > > > > going to say I didn't like relogging then it was absolutely
> > > > > necessary for me to provide an alternative solution to the problem,
> > > > > but I'm really hoping that it is sufficient for someone else to be
> > > > > able to pick it up and run with it....
> > > > > 
> > > > 
> > > > Ok, I can take a look at this since I need to step back and rethink this
> > > > particular feature anyways.
> > > > 
> > > > Brian
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > 
> > > > > PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> > > > > the past and run with it, I'm more than happy for you to do so. I've
> > > > > got way more ideas and prototypes than I've got time to turn into
> > > > > full production features. I also don't care about "ownership" of the
> > > > > work; it's better to have someone actively working on the code than
> > > > > having it sit around waiting for me to find time to get back to
> > > > > it...
> > > > > 
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
> > > > > 
> > > > 
> > > 
> > 
> 



* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-09 12:15               ` Brian Foster
@ 2020-07-09 16:32                 ` Darrick J. Wong
  2020-07-20  3:58                 ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2020-07-09 16:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, Jul 09, 2020 at 08:15:30AM -0400, Brian Foster wrote:
> On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > > On Mon, Jul 06, 2020 at 10:42:57AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> > > > > On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > > > > > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > > > > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > > > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:
> > > ...
> > > > > 
> > > > > > Consider that a single transaction containing an EFD for the
> > > > > > original EFI and a new EFI for the same extent is effectively
> > > > > > "relogging the EFI". It does so by atomically cancelling the
> > > > > > original EFI in the log and creating a new EFI.
> > > > > > 
> > > > > 
> > > > > Right. This is how dfops currently works IIRC.
> > > > > 
> > > > > > Now, an EFI is defined on disk as:
> > > > > > 
> > > > > > typedef struct xfs_efi_log_format {
> > > > > >         uint16_t                efi_type;       /* efi log item type */
> > > > > >         uint16_t                efi_size;       /* size of this item */
> > > > > >         uint32_t                efi_nextents;   /* # extents to free */
> > > > > >         uint64_t                efi_id;         /* efi identifier */
> > > > > >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > > > > > } xfs_efi_log_format_t;
> > > > > > 
> > > > > > Which means it can hold up to 2^16-1 individual extents that we
> > > > > > intend to free. We currently only use one extent per EFI, but if we
> > > > > > go back in history, they were dynamically sized structures and
> > > > > > could track arbitrary numbers of extents.
> > > > > > 
> > > > > > So, repair needs to track multiple nested EFIs?
> > > > > > 
> > > > > > We cancel the old EFI, log a new EFI with all the old extents and
> > > > > > the new extent in it. We now have a single EFI in the journal
> > > > > > containing N+1 extents in it.
> > > > > > 
> > > > > 
> > > > > That's an interesting optimization.
> > > > 
> > > > Hmm, I hadn't thought about amortizing the cost of maintaining an EFI
> > > > across as many of the btree block allocations as possible.  That would
> > > > make the current scheme (which I'll get into below) less scary.
> > > > 
> > > > > > Further, an EFD with multiple extents in it is -intended to be
> > > > > > relogged- multiple times. Every time we free an extent in the EFI,
> > > > > > we remove it from the EFD and relog the EFD. This tells log recovery
> > > > > > that this extent has now been freed, and that it should not replay
> > > > > > it, even though it is still in the EFI.
> > > > > > 
> > > > > > And to prevent the big EFI from pinning the tail of the log while
> > > > > > EFDs are being processed, we can relog the EFI along with the EFD
> > > > > > each time the EFD is updated, hence we drag the EFI forwards in
> > > > > > every high level transaction roll when we are actually freeing the
> > > > > > extents.
> > > > > > 
> > > > > 
> > > > > Hmm.. I'm not sure that addresses the deadlock problem for repair. That
> > > > > assumes that EFD updates come at regular enough intervals to keep the
> > > > > tail moving, but IIRC the bulk loading infrastructure will essentially
> > > > > log a bunch of EFIs, spend a non-deterministic amount of time doing
> > > > > work, then log the associated EFDs. So there's still a period of time in
> > > > > there where we might need to relog intents that aren't otherwise being
> > > > > updated.
> > > > > 
> > > > > Darrick might want to chime in here in case I'm missing something...
> > > > 
> > > > I changed the ->claim_block function in the online repair code[3] to
> > > > relog all of the EFIs (using the strategy Dave outlined above), with the
> > > > claimed block not present in the new EFI.  I think this means we can't
> > > > pin the log tail longer than it takes to memcpy a bunch of records into
> > > > a block and put it on a delwri list, which should be fast enough.
> > > > 
> > > 
> > > Ah, interesting. So each filled block would relog the intents associated
> > > with the outstanding reservation for the rebuild. That certainly will
> > > keep things moving; I'd suspect relogging far more frequently than
> > > necessary, but correctness first. :)
> > 
> > <nod> It's even less bad if you've maximized the number of free extent
> > records per EFI.
> > 
> > > One thing I'm curious about in the broader context of repair is what
> > > happens if the filesystem crashes mid-rebuild? IIRC the rebuilt tree is
> > > fake rooted and the above seems to imply we'd have EFDs logged for the
> > > blocks consumed by the tree. Are those blocks restored to the fs somehow
> > > or another on a crash recovery?
> > 
> > Each time we relog the EFIs, we log exactly the same records as before,
> > which means that all the blocks in the new btree remain the target of
> > EFIs until the repair completes.  If we crash before the end, log
> > recovery will free the entire half-built structure.
> > 
> 
> I'm a little confused by how we log EFIs for exactly the same records as
> before while the above description says we leave off the claimed block
> in the new EFI when it rolls. Does the roll in this scenario complete
> the previous EFI(s) with an EFD and create a new EFI without the claimed
> block? For example, suppose we do our bulk reservation and log the
> necessary EFIs, start the rebuild, claim the first block, relog the
> outstanding EFI without the claimed block, and then the system crashes.
> What happens with the claimed block on subsequent recovery?

Sorry, I must have zapped the first sentence that said (more or less,
since I also accidentally deleted an earlier draft of this reply):

"I misspoke earlier-- the ->claim_block relogging code does /not/ leave
off the claimed block; it has to preserve everything.  Each time we
relog the EFIs..."

> > If we reach the end of the repair, we'll log EFDs for all the EFIs in
> > the btree bulkload reservation; log the new btree root; and queue a
> > bunch of extfree_items (which themselves log more EFIs) for the old
> > btree blocks.  Then we let dfops finish the extfree items, which deletes
> > the old btree.  If we crash during that part, log recovery will roll us
> > forward to the point of having a freshly rebuilt btree and no leaked
> > blocks.
> > 
> > > > > > The key to this is that the EFI/EFD relogging must be done entirely
> > > > > > under a single rolling transaction, so there is -always- space
> > > > > > available in the log for both the EFI and the EFDs to be relogged as
> > > > > > the long running operation is performed.
> > > > > > 
> > > > > > IOWs, the EFI/EFD structures support relogging of the intents at a
> > > > > > design level, and it is intended that this process is entirely
> > > > > > driven from a single rolling transaction context. I strongly suspect
> > > > > > that all the recent EFI/EFD and deferred ops reworking has lost a
> > > > > > lot of this context from the historical EFI/EFD implementation...
> > > > 
> > > > I don't agree with this, since the BUI/CUI items actively relog
> > > > themselves for another go-around if they decide that they still have
> > > > work to do, and the atomic extent swap item I proposed also takes
> > > > advantage of this design property.
> > > > 
> > > > Though, I concede that I don't think any of us were watching the
> > > > dfops manager carefully enough to spot the occasional need to relog
> > > > all the attached intent items if the chain gets long enough.
> > > > 
> > > > > > So before we go down the path of implementing generic automatic
> > > > > > relogging infrastructure, we first should have been writing the
> > > > > > application code that needs to relog intents and use a mechanism
> > > > > > like the above to cancel and reinsert intents further down the log.
> > > > 
> > > > Already done.  The reason why Brian and I are stirring up this hornet
> > > > nest again is that I started posting patches to fix various deficiencies
> > > > that were exposed by generic/52[12] shakedowns of the atomic swap code. ;)
> > > > 
> > > > I guess I should go post the latest version of the defer freezer code
> > > > since it takes steps to minimize the dfops chain lengths, and relogs the
> > > > entire dfops chain every few rolls to keep the associated log items
> > > > moving forward...
> > > > 
> > > > > > Once we have code that is using these techniques to do bulk
> > > > > > operations, then we can look to optimise/genericise the
> > > > > > infrastructure they use.
> > > > > > 
> > > > > > > Moving on to quotaoff...
> > > > > > > 
> > > > > > > > I have been spending some time recently in the quota code, so I have
> > > > > > > > a better grip on what it is doing now than I did last time I looked
> > > > > > > > at this relogging code. I never really questioned why the quota code
> > > > > > > > needed two transactions for quota-off, and I'm guessing that nobody
> > > > > > > > else has either. So I spent some time this morning understanding
> > > > > > > > what problem it was actually solving and trying to find an alternate
> > > > > > > > solution to that problem.
> > > > > > > 
> > > > > > > Indeed, I hadn't looked into that.
> > > > > > > 
> > > > > > > > The reason we have the two quota-off transactions is that active
> > > > > > > > dquot modifications at the time quotaoff is started leak past the
> > > > > > > > first quota off transaction that hits the journal. Hence to avoid
> > > > > > > > incorrect replay of those modifications in the journal if we crash
> > > > > > > > after the quota-off item passes out of the journal, we pin the
> > > > > > > > quota-off item in the journal. It gets unpinned by the commit of the
> > > > > > > > second quota-off transaction at completion time, hence defining the
> > > > > > > > window in the journal where quota-off is being processed and dquot
> > > > > > > > modifications should be ignored. i.e. there is no window where
> > > > > > > > recovery will replay dquot modifications incorrectly.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok.
> > > > > > > 
> > > > > > > > However, if the second transaction is left too long, the reservation
> > > > > > > > will fail to find journal space because of the pinned quota-off item.
> > > > > > > > 
> > > > > > > 
> > > > > > > Right.
> > > > > > > 
> > > > > > > > The relogging infrastructure is designed to allow the initial
> > > > > > > > quota-off intent to keep moving forward in the log so it never pins
> > > > > > > > the tail of the log before the second quota-off transaction is run.
> > > > > > > > This tries to avoid the recovery issue because there's always an
> > > > > > > > active quota off item in the log, but I think there may be a flaw
> > > > > > > > here.  When the quotaoff item gets relogged, it jumps all the dquots
> > > > > > > > in the log that were modified after the quota-off started. Hence if
> > > > > > > > we crash after the relogging but while the dquots are still in the
> > > > > > > > log before the relogged quotaoff item, then they will be replayed,
> > > > > > > > possibly incorrectly. i.e. the relogged quota-off item no longer
> > > > > > > > prevents replay of those items.
> > > > > > > > 
> > > > > > > > So while relogging prevents the tail pinning deadlock, I think it
> > > > > > > > may actually result in incorrect recovery behaviour in that items
> > > > > > > > that should be cancelled and not replayed can end up getting
> > > > > > > > replayed.  I'm not sure that this matters for dquots, but for a
> > > > > > > > general mechanism I think the transactional ordering violations it
> > > > > > > > can result in reduce its usefulness significantly.
> > > > > > > > 
> > > > > > > 
> > > > > > > Hmm.. I could be mistaken, but I thought we reasoned about this a bit on
> > > > > > > the early RFCs.
> > > > > > 
> > > > > > We might have, but I don't recall that. And it would appear nobody
> > > > > > looked at this code in any detail if we did discuss it, so I'd say
> > > > > > the discussion was largely uninformed...
> > > > > > 
> > > > > > > Log recovery processes the quotaoff intent in pass 1 and
> > > > > > > dquot updates in pass 2, which I thought was intended to handle this
> > > > > > > kind of problem.
> > > > > > 
> > > > > > Right, it does handle it, but only because there are two quota-off
> > > > > > items in the log. i.e.  There's two recovery situations in play here
> > > > > > - 1) quota off in progress and 2) quota off done.
> > > > > > 
> > > > > > In the first case, only the initial quota-off item is in the log, so
> > > > > > it needs to be detected to stop replay of relevant dquots that
> > > > > > have been logged after the quota off was started.
> > > > > > 
> > > > > > The second case has to be broken down into two situations: a) both quota-off items
> > > > > > are active in the log, or b) only the second item is active in the log
> > > > > > as the tail has moved forwards past the first item.
> > > > > > 
> > > > > > In the case of 2a), it doesn't matter which item recovery sees, it
> > > > > > will cancel the dquot updates correctly. In the case of 2b), the
> > > > > > second quota off item is absolutely necessary to prevent replay of
> > > > > > the dquots in the log before it.
> > > > > > 
> > > > > > Hence if dquot modifications can leak past the first quota-off item
> > > > > > in the log, then the second item is absolutely necessary to catch
> > > > > > the 2b) case to prevent incorrect replay of dquot buffers.
> > > > > > 
> > > > > 
> > > > > Ok, but we're talking specifically about log recovery after quotaoff has
> > > > > completed but before both intents have fallen off of the log. Relogging
> > > > > of the initial intent (re: the original comment above about incorrect
> > > > > recovery behavior) has no impact on this general ordering between the
> > > > > start/end intents or dquot changes and the end intent.
> > > > > 
> > > > > > > If I follow correctly, the recovery issue that warrants pinning the
> > > > > > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > > > > > happens to fall off the end of the log before the last of the dquot
> > > > > > > modifications, recovery could see dquot changes after having lost the
> > > > > > > fact that a quotaoff had occurred at all. The current implementation
> > > > > > > presumably handles this by pinning the quotaoff until all dquots are
> > > > > > > completely purged from existence. The relog mechanism just allows the
> > > > > > > item to move while it effectively remains pinned, so I don't see how it
> > > > > > > introduces recovery issues.
> > > > > > 
> > > > > > As I said, it may not affect the specific quota-off usage, but we
> > > > > > can't just change the order of items in the physical journal without
> > > > > > care because the journal is supposed to be -strictly ordered-.
> > > > > > 
> > > > > 
> > > > > The mechanism itself is intended to target specific instances of log
> > > > > items. Each use case should be evaluated for correctness on its own,
> > > > > just like one would with ordered buffers or some other internal low
> > > > > level construct that changes behavior.
> > > > > 
> > > > > > Reordering intents in the log automatically without regard to higher
> > > > > > level transactional ordering dependencies of the log items may
> > > > > > violate the ordering rules for journalling and recovery of metadata.
> > > > > > This is why I said automatic relogging may not be useful as generic
> > > > > > infrastructure - if there are dependent log items, then they need to
> > > > > > be relogged as an atomic change set that maintains the ordering
> > > > > > dependencies between objects. That's where this automatic mechanism
> > > > > > completely falls down - the ordering dependencies are known only by
> > > > > > the code running the original transaction, not the log items...
> > > > > > 
> > > > > 
> > > > > This and the above sound to me like you're treating automatic relogging
> > > > > like it would just be enabled by default on all intents, reordering
> > > > > things arbitrarily. That is not the case as things would certainly
> > > > > break, just like what would happen if ordered buffers were enabled by
> > > > > default. The mechanism is per log item and context specific. It is
> > > > > "generic" in the sense that there are (were) multiple use cases for it,
> > > > > not that it should be used arbitrarily or "without care."
> > > > > 
> > > > > Use cases that have very particular ordering requirements across certain
> > > > > sets of items should probably not enable this mechanism on those items
> > > > > or otherwise verify that relogging a particular item is safe. The
> > > > > potential example of this ordering problem being cited is quotaoff, but
> > > > > we've already gone through this example multiple times and established
> > > > > that relogging the quotaoff start item is safe.
> > > > > 
> > > > > All that said, extending a relogging notification somehow to a
> > > > > particular context has always been a consideration because 1.) direct
> > > > > EFI relogging would require log recovery changes and 2.) there was yet
> > > > > another potential use case where dfops needed to know whether to relog a
> > > > > particular intent in a long running chain to avoid some issue (the
> > > > > details of which escape me). I think issue #1 is not complicated to
> > > > > address, but creates a backwards incompatibility for log recovery. Issue
> > > > > #2 would potentially separate out relogging as a notification mechanism
> > > > > from the reservation management bits, but it's still not clear to me
> > > > > what that notification mechanism would look like for a transaction that
> > > > > has already been committed by some caller context.
> > > > > 
> > > > > I think Darrick was looking at repurposing ->iop_relog() for that one so
> > > > > I'd be curious to know what that is looking like in general...
> > > > 
> > > > ...it's graduated to the point that I'm willing/crazy enough to run it
> > > > on my development workstations, and it hasn't let out the magic smoke.
> > > > 
> > > 
> > > Heh.
> > > 
> > > > I added ->iop_relog handlers to all the major intent items (EFI, RUI,
> > > > CUI, BUI, SXI), then taught xfs_defer_finish_noroll to relog everything
> > > > on dop_pending every 7 transaction rolls[1].  Initially this caused log
> > > > reservation overflow problems with transactions that log intents whose
> > > > ->finish_item functions themselves log dozens more intents, but then
> > > > I realized that the second patch[2] I had written took care of this
> > > > problem.
> > > > 
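> > > > The relog pass itself ends up pretty simple -- roughly this shape
> > > > (helper and field names are from my dev branch and may well change,
> > > > so treat it as a sketch):
> > > > 
> > > > 	/* relog every pending intent so none of them pin the tail */
> > > > 	static int
> > > > 	xfs_defer_relog_intents(
> > > > 		struct xfs_trans	**tpp,
> > > > 		struct list_head	*dop_pending)
> > > > 	{
> > > > 		struct xfs_defer_pending	*dfp;
> > > > 
> > > > 		list_for_each_entry(dfp, dop_pending, dfp_list) {
> > > > 			/*
> > > > 			 * ->iop_relog logs a done item for the old
> > > > 			 * intent and a replacement intent in the
> > > > 			 * current transaction.
> > > > 			 */
> > > > 			dfp->dfp_intent = xfs_trans_item_relog(
> > > > 						dfp->dfp_intent, *tpp);
> > > > 		}
> > > > 		/* roll to commit the replacement intents */
> > > > 		return xfs_defer_trans_roll(tpp);
> > > > 	}
> > > > 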
> > > 
> > > Thanks. I think I get the general idea. We're reworking the
> > > ->iop_relog() handler to complete and replace the current intent (rather
> > > than just relog the original intent, which is what this series did for
> > > the quotaoff case) in the current dfops transaction and allow the dfops
> > > code to update its reference to the item. The part that's obviously
> > > missing is some kind of determination on when we actually need to relog
> > > the outstanding intents vs. using a fixed roll count.
> > 
> > <nod>  I don't consider myself sufficiently ail-smart to know how to do
> > that part. :)
> > 
> > > I suppose we could do something like I was mentioning in my other reply
> > > on the AIL pushing issue Dave pointed out where we'd set a bit on
> > > certain items that are tail pinned and in need of relog. That sounds
> > > like overkill given this use case is currently self-contained to dfops.
> > 
> > That might be a useful optimization -- every time defer_finish rolls the
> > transaction, check the items to see if any of them have
> > XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> > probably higher than 7) fixed roll count, we'll relog as desired to keep
> > the log moving forward.
> > 
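> > e.g. something like this ahead of the roll (the flag name is obviously
> > a placeholder, so sketch only):
> > 
> > 	static bool
> > 	xfs_defer_want_relog(
> > 		struct list_head	*dop_pending,
> > 		unsigned int		nr_rolls)
> > 	{
> > 		struct xfs_defer_pending	*dfp;
> > 
> > 		/* hard fallback (e.g. the current 7, probably higher) */
> > 		if (nr_rolls >= XFS_DEFER_RELOG_ROLLS)
> > 			return true;
> > 		list_for_each_entry(dfp, dop_pending, dfp_list) {
> > 			struct xfs_log_item	*lip = dfp->dfp_intent;
> > 
> > 			/* the AIL flagged this intent as pinning the tail */
> > 			if (test_bit(XFS_LI_RELOGMEPLEASE, &lip->li_flags))
> > 				return true;
> > 		}
> > 		return false;
> > 	}
> > 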
> 
> It's an optimization in some sense to prevent unnecessary relogs, but
> the intent would be to avoid the need for a fixed count by notifying
> when a relog is needed to a transaction that should be guaranteed to
> have the reservation necessary to do so. I'm not sure it's worth the
> complexity if there were some reason we still needed to fall back to a
> hard count.

<nod> I might just be clinging to the current implementation. :)

> > > Perhaps the other idea of factoring out the threshold determination
> > > logic from xlog_grant_push_ail() might be useful.
> > > 
> > > For example, if the current free reservation is below the calculated
> > > threshold (with need_bytes == 0), return a threshold LSN based on the
> > > current tail. Instead of using that to push the AIL, compare it to
> > > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > > (which will probably be all of them in practice since they are part of
> > > the same transaction). We'd probably need to identify intents that have
> > > been recently relogged so the process doesn't repeat until the CIL
> > > drains and the li_lsn eventually changes. Hmm.. I did have an
> > > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > > purposes that might actually be sufficient for that. We could also
> > > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > > but it's not immediately clear to me how well that would fit into the
> > > dfops mechanism...
> > 
> > ...is there a sane way for dfops to query the threshold LSN so that it
> > could compare against the li_lsn of each item it holds?
> > 
> 
> I'd start with just trying to reuse the logic in xlog_grant_push_ail()
> (i.e. just factor out the AIL push). That function starts with a check
> on available log reservation to filter out unnecessary pushes, then
> calculates the AIL push LSN by adding the amount of log space we need to
> free up to the current log tail. In this case we're not pushing the AIL,
> but I think we'd be able to use the same threshold calculation logic to
> determine when to relog intents that happen to reside within the range
> from the current tail to the calculated threshold.
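> 
> i.e. something like the below, where xlog_grant_push_threshold() is the
> hypothetical factored-out helper (sketch only):
> 
> 	xfs_lsn_t	threshold_lsn;
> 
> 	/* returns NULLCOMMITLSN if reservation isn't low */
> 	threshold_lsn = xlog_grant_push_threshold(log, 0);
> 	if (threshold_lsn == NULLCOMMITLSN)
> 		return false;
> 
> 	list_for_each_entry(dfp, &dop_pending, dfp_list) {
> 		struct xfs_log_item	*lip = dfp->dfp_intent;
> 
> 		/* relog anything between the tail and the threshold */
> 		if (XFS_LSN_CMP(lip->li_lsn, threshold_lsn) <= 0)
> 			return true;
> 	}
> 	return false;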

<nod> Ok, I see how that would work.  I'll give it a try.

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > That patch, of course, is one that I posted a while ago that makes it so
> > > > that if a transaction owner logs items A and B and then commits (or
> > > > calls defer_roll), dfops will now finish all the items created by A's
> > > > ->finish_item before it moves on to trying to finish B.
> > > > 
> > > > I had put /that/ patch aside after Brian pointed out that on its own,
> > > > that patch merely substituted pinning the tail on some sub-item of A
> > > > with pinning the tail on B, but I think with both patches applied I have
> > > > solved both problems.
> > > > 
> > > > --D
> > > > 
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=515cc4e637bf4e9afcfbaeb39b13f85b27923916
> > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=8e63f8a7af12d673feb5400d09179502632854c4
> > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=ddfeca6f1f1862c3f162db8b8bdbfc5149f5e5c5
> > > > 
> > > > > > > > But back to quota-off: What I've realised is that the only dquot
> > > > > > > > modifications we need to protect against being recovered are the
> > > > > > > > ones that are running at the time the first quota-off is committed
> > > > > > > > to the journal. That is, once the DQACTIVE flags are clear,
> > > > > > > > transactions will not modify those dquots anymore. Hence by the time
> > > > > > > > that the quota off item pins the tail of the log, the transactions
> > > > > > > > that were actively dirtying inodes when it was committed have also
> > > > > > > > committed and are in the journal and there are no actively modified
> > > > > > > > dquots left in memory.
> > > > > > > > 
> > > > > > > 
> > > > > > > I'm not sure how the (sync) commit of the quotaoff guarantees some other
> > > > > > > transaction running in parallel hadn't modified a dquot and committed
> > > > > > > after the quotaoff, but I think I see where you're going in general...
> > > > > > 
> > > > > > We drained out all the transactions that can be modifying quotas
> > > > > > before we log the quotaoff items. So, by definition, this cannot
> > > > > > happen.
> > > > > > 
> > > > > > > > IOWs, we don't actually need to wait until we've released and purged
> > > > > > > > all the dquots from memory before we log the second quota off item;
> > > > > > > > all we need to wait for is for all the transactions with dirty
> > > > > > > > dquots to have committed. These transactions already have log
> > > > > > > > reservations, so completing them will free unused reservation space
> > > > > > > > for the second quota off transaction. Once they are committed, then
> > > > > > > > we can log the second item. i.e. we don't have to wait until we've
> > > > > > > > cleaned up the dquots to close out the quota-off transaction in the
> > > > > > > > journal.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok, so we can deterministically shorten the window with a runtime
> > > > > > > barrier (i.e. disable -> drain) on quota modifying transactions rather
> > > > > > > than relying on the full dquot purge to provide this ordering.
> > > > > > 
> > > > > > Yup.
> > > > > > 
> > > > > > > > To make it even more robust, if we stop all the transactions that
> > > > > > > > may dirty dquots and drain the active ones before we log the first
> > > > > > > > quota-off item, we can log the second item immediately afterwards
> > > > > > > > because it is known that there are no dquot modifications in flight
> > > > > > > > when the first item is logged. We can probably even log both items
> > > > > > > > in the same transaction.
> > > > > > > > 
> > > > > > > 
> > > > > > > I was going to ask why we'd even need two items if this approach is
> > > > > > > generally viable.
> > > > > > 
> > > > > > Because I don't want to change the in-journal appearance of
> > > > > > quota-off to older kernels. Changing how things appear on disk is
> > > > > > dangerous and likely going to bite us in unexpected ways.
> > > > > > 
> > > > > 
> > > > > Well combining them into a single transaction doesn't guarantee ordering
> > > > > of the two, right? So it might not be worth doing that either if we're
> > > > > concerned about log appearance. Regardless, those potential steps can be
> > > > > evaluated independently on top of the core runtime fixes.
> > > > > 
> > > > > > > > So, putting my money where my mouth is, the patch below does this.
> > > > > > > > It's survived 100 cycles of xfs/305 (qoff vs fsstress) and 10 cycles
> > > > > > > > of -g quota with all quotas enabled and is currently running a full
> > > > > > > > auto cycle with all quotas enabled. It hasn't let the smoke out
> > > > > > > > after about 4 hours of testing now....
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks for the patch. First, I like the idea and agree that it's more
> > > > > > > simple than the relogging approach. I do still need to stare at it some
> > > > > > > more to grok it and convince myself it's safe.
> > > > > > > 
> > > > > > > The thing that sticks out to me is tagging all of the transactions that
> > > > > > > modify quotas. Is there any reason we can't just quiesce the transaction
> > > > > > > subsystem entirely as a first step? It's not like quotaoff is common or
> > > > > > > performance sensitive. For example:
> > > > > > >
> > > > > > > 1. stop all transactions, wait to drain, force log
> > > > > > > 2. log the sb/quotaoff synchronously (punching through via something
> > > > > > >    like NO_WRITECOUNT)
> > > > > > > 3. clear the xfs_mount quota active flags
> > > > > > > 4. restart the transaction subsystem (no more dquot mods)
> > > > > > > 5. complete quotaoff via the dquot release and purge sequence
> > > > > > 
> > > > > > Yup, as I said on #xfs a short while ago:
> > > > > > 
> > > > > > [3/7/20 01:15] <djwong> qi_active_trans?
> > > > > > [3/7/20 01:15] <djwong> man, we just killed off m_active_trans
> > > > > > [3/7/20 08:47] <dchinner> djwong: I know we just killed off that atomic counter, it was used for doing exactly what I needed for quota-off, but freeze didn't need it anymore
> > > > > > [3/7/20 08:48] <dchinner> I mean, we could just make quota-off freeze the filesystem, do quota-off, then unfreeze....
> > > > > > [3/7/20 08:48] <dchinner> that's a simple, brute force solution
> > > > > > [3/7/20 08:49] <dchinner> but it's also overkill in that it forces lots of unnecessary data writeback...
> > > > > > [3/7/20 08:52] * djwong sometimes wonders if we just need a "run XXXX with exclusive access" thing
> > > > > > [3/7/20 08:58] <dchinner> djwong: that's kinda what xfs_quiesce_attr() was originally intended for
> > > > > > [3/7/20 08:59] <dchinner> but as all the code slowly got moved up into the VFS freeze layers, it stopped being able to be used for that sort of operation....
> > > > > > [3/7/20 09:01] <djwong> oh
> > > > > > [3/7/20 09:03] <dchinner> and so just after we remove the last remaining fragment of that original functionality, we find that maybe we actually still need to be able to quiesce the filesystem for internal synchronisation reasons
> > > > > > 
> > > > > > So, we used to have exactly the functionality I needed in XFS as
> > > > > > general infrastructure, but we've removed it over the past few years
> > > > > > as the VFS has slowly been brought up to feature parity with XFS. I
> > > > > > just implemented what I needed to block/halt quota modifications
> > > > > > because I didn't want to perturb anything else while exploring if my
> > > > > > hypothesis was correct.
> > > > > > 
> > > > > 
> > > > > Ok.
> > > > > 
> > > > > > The only outstanding thing I haven't checked out fully is the
> > > > > > delayed allocation reservations that aren't done in transaction
> > > > > > contexts. I -think- these are OK because they are in memory only,
> > > > > > and they will be serialised on the inode lock when detaching dquots
> > > > > > (i.e. the existing dquot purging ordering mechanisms) after quotas
> > > > > > are turned off. Hence I think these are fine, but more investigation
> > > > > > will be needed there to confirm behaviour is correct.
> > > > > > 
> > > > > 
> > > > > Yep.
> > > > > 
> > > > > > > I think it could be worth the tradeoff for the simplicity of not having
> > > > > > > to maintain the transaction reservation tags or the special quota
> > > > > > > waiting infrastructure vs. something like the more generic (recently
> > > > > > > removed) transaction counter. We might even be able to abstract the
> > > > > > > whole thing behind a transaction flag. E.g.:
> > > > > > > 
> > > > > > > 	/*
> > > > > > > 	 * A barrier transaction locks out further transactions and waits on
> > > > > > > 	 * outstanding transactions to drain (i.e. commit) before returning.
> > > > > > > 	 * Everything unlocks when the transaction commits.
> > > > > > > 	 */
> > > > > > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > > > > > 			XFS_TRANS_BARRIER, &tp);
> > > > > > > 	...
> > > > > > 
> > > > > > Yup, if we decide that we want to track all active transactions again
> > > > > > rather than just when quota is active, it would make a lot of
> > > > > > sense to make it a formal function of the xfs_trans_alloc() API.
> > > > > > 
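> > > > > > Even something as simple as an rwsem on the xfs_mount would give
> > > > > > us the drain semantics for a first pass (the field name here is
> > > > > > invented, sketch only):
> > > > > > 
> > > > > > 	/* in xfs_trans_alloc(), before taking log reservation */
> > > > > > 	if (flags & XFS_TRANS_BARRIER)
> > > > > > 		down_write(&mp->m_trans_barrier); /* drain + lock out */
> > > > > > 	else
> > > > > > 		down_read(&mp->m_trans_barrier);  /* normal trans */
> > > > > > 
> > > > > > 	/* ...dropped again when the transaction commits or cancels;
> > > > > > 	 * a percpu_rw_semaphore would keep the read side cheap */
> > > > > > 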
> > > > > > Really, though, I've got so many other things on my plate right now
> > > > > > I don't have the time to take on yet another infrastructure
> > > > > > reworking. I spent the time to write the patch because if I was
> > > > > > going to say I didn't like relogging then it was absolutely
> > > > > > necessary for me to provide an alternative solution to the problem,
> > > > > > but I'm really hoping that it is sufficient for someone else to be
> > > > > > able to pick it up and run with it....
> > > > > > 
> > > > > 
> > > > > Ok, I can take a look at this since I need to step back and rethink this
> > > > > particular feature anyways.
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dave.
> > > > > > 
> > > > > > PS. FWIW, if anyone wants to pick up any RFC patchset I've posted in
> > > > > > the past and run with it, I'm more than happy for you to do so. I've
> > > > > > got way more ideas and prototypes than I've got time to turn into
> > > > > > full production features. I also don't care about "ownership" of the
> > > > > > work; it's better to have someone actively working on the code than
> > > > > > having it sit around waiting for me to find time to get back to
> > > > > > it...
> > > > > > 
> > > > > > -- 
> > > > > > Dave Chinner
> > > > > > david@fromorbit.com
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-06 16:03       ` Brian Foster
  2020-07-06 17:42         ` Darrick J. Wong
@ 2020-07-10  4:09         ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-07-10  4:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jul 06, 2020 at 12:03:06PM -0400, Brian Foster wrote:
> On Fri, Jul 03, 2020 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Thu, Jul 02, 2020 at 02:52:09PM -0400, Brian Foster wrote:
> > > On Thu, Jul 02, 2020 at 09:51:44PM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 01, 2020 at 12:51:06PM -0400, Brian Foster wrote:

[....]
> > > how it works, etc., so that suggests there are still concerns around the
> > > mechanism itself independent from quotaoff. I've sent 5 or so RFCs to
> > > try and elicit general feedback and address fundamental concerns before
> > > putting in the effort to solidify the implementation, which was notably
> > > more time consuming than reworking the RFC. It's quite frustrating to
> > > see negative feedback broaden at this stage in a manner/pattern that
> > > suggests the mechanism is not generally acceptable.
> > 
> > Well, my initial response to the very first RFC was:
> > 
> > | [...] I can see how appealing the concept of automatically
> > | relogging is, but I'm unconvinced that we can make it work,
> > | especially when there aren't sufficient reservations to relog
> > | the items that need relogging.
> > 
> > https://lore.kernel.org/linux-xfs/20191024224308.GD4614@dread.disaster.area/
> > 
> > To RFC v4, which was the next version I had time to look at:
> > 
> > | [long list of potential issues]
> > |
> > | Given this, I'm skeptical this can be made into a useful, reliable
> > | generic async relogging mechanism.
> > 
> > https://lore.kernel.org/linux-xfs/20191205210211.GP2695@dread.disaster.area/
> > 
> > Maybe general comments that "I remain unconvinced this will work"
> > got drowned out by all the other comments I made trying to help you
> > understand the code and hence make it work.
> 
> I explicitly worked through those issues to the point where to the best
> that I can tell, the mechanism works.

Yes, but that didn't remove my underlying concern: it requires the
subsystem that guarantees transactions can make forwards progress
to itself use transactions to guarantee the forwards progress of
transactions...

That layering inversion/catch-22 is the structure I am deeply
uncomfortable with. That's what I want to run away from screaming -
so much is reliant on the AIL making forwards progress (e.g. memory
reclaim!) that requiring something as complex as a CIL commit for
flushing to make progress is just a step too far for me.

[....]

> If the approach of some feature is generally not acceptable (as in "I'm
> not comfortable with the approach" or "I think it should be done another
> way"), that is potentially subjective but certainly valid feedback. I
> might or might not debate that feedback, but that's at least an honest
> debate where stances are clear. I'm certainly not going to try and
> stabilize something I know that one or more key upstream contributors do
> not agree with (unless I can convince them otherwise). If the feedback
> is "I'm skeptical it works because of items 1, 2, 3," that means the
> developer is likely to look through those issues and try to prove or
> disprove whether the mechanism works based on that insight.

The reviewer might be looking for insight, too, and whether
addressing the issues they raise alleviates their concerns.

Which, in this case for me, it hasn't.

> > > All that being what it is, I'd obviously rather not expend even more
> > > time if this is going to be met with vague/general reluctance. Do we
> > > need to go back to the drawing board on the repair use case? If so,
> > > should we reconsider the approach repair is using to release blocks?
> > > Perhaps alter this mechanism to address some tangible concerns? Try and
> > > come up with something else entirely..?
> > 
> > Well, like I said originally: I think relogging really needs to be
> > done from the perspective of the owner of the logged item so that we
> > can avoid things like ordering violations in the journal and other
> > similar issues. i.e. relogging is designed around it being a
> > function of the high order change algorithms and not something that
> > can be used to work around high level change algorithms that don't
> > follow the rules properly...
> 
> I'm curious if you have thoughts around what that might look like.
> Perhaps using quotaoff just as an example..? (Obviously we'd not
> implement that over the current proposal..).

I gave you one with the EFI relogging example below. The intents
need a permanent transaction context to be relogged in, and the high
level code treats the intents like we do inodes and buffers and
relogs them on each transaction roll to ensure they keep moving
forward in the log....

> > Consider that a single transaction that contains an EFD for the
> > original EFI, and a new EFI for the same extent is effectively
> > "relogging the EFI". It does so by atomically cancelling the
> > original EFI in the log and creating a new EFI.
> 
> Right. This is how dfops currently works IIRC.

Well, not really. dfops is currently a mechanism for maintaining
atomic operations via linked intent chains; it is not (currently)
used as a mechanism for relogging intents that are already part of
the linked intent chain. It can likely be made to do this, but only
if the intent recovery code in older kernels can handle cancelling
and relogging without breaking...

> > Now, and EFI is defined on disk as:
> > 
> > typedef struct xfs_efi_log_format {
> >         uint16_t                efi_type;       /* efi log item type */
> >         uint16_t                efi_size;       /* size of this item */
> >         uint32_t                efi_nextents;   /* # extents to free */
> >         uint64_t                efi_id;         /* efi identifier */
> >         xfs_extent_t            efi_extents[1]; /* array of extents to free */
> > } xfs_efi_log_format_t;
> > 
> > Which means it can hold up to 2^16-1 individual extents that we
> > intend to free. We currently only use one extent per EFI, but if we
> > go back in history, they were dynamically sized structures and
> > could track arbitrary numbers of extents.
> > 
> > So, repair needs to track multiple nested EFIs?
> > 
> > We cancel the old EFI, log a new EFI with all the old extents and
> > the new extent in it. We now have a single EFI in the journal
> > containing N+1 extents in it.
> > 
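> > In rough pseudo-code -- the helpers here approximate the old extfree
> > API, so don't take the names literally:
> > 
> > 	/* one transaction: cancel the old intent, log the superset */
> > 	efdp = xfs_trans_get_efd(tp, old_efip,
> > 				 old_efip->efi_format.efi_nextents);
> > 	/* ...mark all of the old extents freed in the EFD... */
> > 
> > 	new_efip = xfs_trans_get_efi(tp,
> > 				 old_efip->efi_format.efi_nextents + 1);
> > 	/* ...copy the old extents across, append the new extent... */
> > 
> > 	xfs_trans_commit(tp);
> > 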
> 
> That's an interesting optimization.

It's not really an optimisation, it's largely reflective of how
extent freeing used to work a couple of decades ago where
xfs_itruncate_extents could free up to 4 data extents per
transaction.

That was problematic, though - over time we found all sorts of data
integrity issues as a result of race conditions with the
multi-extent operations based on cached extent maps. To solve these
problems we effectively reduced all the extent modification
operations down to a single extent at a time and that meant the
extent freeing loops that were wrapped by EFI/EFDs effectively
collapsed to "single extent only" operations...

> > We might have, but I don't recall that. And it would appear nobody
> > looked at this code in any detail if we did discuss it, so I'd say
> > the discussion was largely uninformed...
> > 
> > > Log recovery processes the quotaoff intent in pass 1 and
> > > dquot updates in pass 2, which I thought was intended to handle this
> > > kind of problem.
> > 
> > Right, it does handle it, but only because there are two quota-off
> > items in the log. i.e.  There's two recovery situations in play here
> > - 1) quota off in progress and 2) quota off done.
> > 
> > In the first case, only the initial quota-off item is in the log, so
> > it needs to be detected to stop replay of relevant dquots that
> > have been logged after the quota off was started.
> > 
> > The second case has to be broken down into two situations: a) both quota-off items
> > are active in the log, or b) only the second item is active in the log
> > as the tail has moved forwards past the first item.
> > 
> > In the case of 2a), it doesn't matter which item recovery sees, it
> > will cancel the dquot updates correctly. In the case of 2b), the
> > second quota off item is absolutely necessary to prevent replay of
> > the dquots in the log before it.
> > 
> > Hence if dquot modifications can leak past the first quota-off item
> > in the log, then the second item is absolutely necessary to catch
> > the 2b) case to prevent incorrect replay of dquot buffers.
> > 
> 
> Ok, but we're talking specifically about log recovery after quotaoff has
> completed but before both intents have fallen off of the log. Relogging
> of the initial intent (re: the original comment above about incorrect
> recovery behavior) has no impact on this general ordering between the
> start/end intents or dquot changes and the end intent.

Sure, I'm trying to explain why the intents were considered
necessary in the first place, not what impact relogging has on this
algorithm (which is none!).

> > > If I follow correctly, the recovery issue that warrants pinning the
> > > quotaoff in the log is not so much an ordering issue, but if the latter
> > > happens to fall off the end of the log before the last of the dquot
> > > modifications, recovery could see dquot changes after having lost the
> > > fact that a quotaoff had occurred at all. The current implementation
> > > presumably handles this by pinning the quotaoff until all dquots are
> > > completely purged from existence. The relog mechanism just allows the
> > > item to move while it effectively remains pinned, so I don't see how it
> > > introduces recovery issues.
> > 
> > As I said, it may not affect the specific quota-off usage, but we
> > can't just change the order of items in the physical journal without
> > care because the journal is supposed to be -strictly ordered-.
> > 
> 
> The mechanism itself is intended to target specific instances of log
> items. Each use case should be evaluated for correctness on its own,
> just like one would with ordered buffers or some other internal low
> level construct that changes behavior.

Yes, I know this. It's one of the things about this approach that
concerns me - the level of knowledge encoded into the specific
operations that are critical for correct operation.

> > Reordering intents in the log automatically without regard to higher
> > level transactional ordering dependencies of the log items may
> > violate the ordering rules for journalling and recovery of metadata.
> > This is why I said automatic relogging may not be useful as generic
> > infrastructure - if there are dependent log items, then they need to be
> > relogged as an atomic change set that maintains the ordering
> > dependencies between objects. That's where this automatic mechanism
> > completely falls down - the ordering dependencies are known only by
> > the code running the original transaction, not the log items...
> > 
> 
> This and the above sound to me as though you're treating automatic relogging
> like it would just be enabled by default on all intents, reordering
> things arbitrarily. That is not the case as things would certainly
> break, just like what would happen if ordered buffers were enabled by
> default. The mechanism is per log item and context specific. It is
> "generic" in the sense that there are (were) multiple use cases for it,
> not that it should be used arbitrarily or "without care."

No, I'm treating it as "generic infrastructure" that various
things will use if they need to, and then extrapolating the problems
I see from there. I don't expect this to be applied to everything,
because that -will break everything-.

> Use cases that have very particular ordering requirements across certain
> sets of items should probably not enable this mechanism on those items
> or otherwise verify that relogging a particular item is safe. The
> potential example of this ordering problem being cited is quotaoff, but
> we've already gone through this example multiple times and established
> that relogging the quotaoff start item is safe.

Yes, but I've already proven that quotaoff -does not need relogging
at all- so whether relogging is safe for it or not is irrelevant to
the discussion of automatic relogging...

But that's precisely my concern about this: every relogging use case
has its own complex set of requirements that need to be proven to
be safe, and every change to said code will have to repeat that proof
so that it doesn't get broken.

The repeated, ongoing validation requirement is what killed
soft-updates as a viable filesystem technology.....

> > > > To make it even more robust, if we stop all the transactions that
> > > > may dirty dquots and drain the active ones before we log the first
> > > > quota-off item, we can log the second item immediately afterwards
> > > > because it is known that there are no dquot modifications in flight
> > > > when the first item is logged. We can probably even log both items
> > > > in the same transaction.
> > > > 
> > > 
> > > I was going to ask why we'd even need two items if this approach is
> > > generally viable.
> > 
> > Because I don't want to change the in-journal appearance of
> > quota-off to older kernels. Changing how things appear on disk is
> > dangerous and likely going to bite us in unexpected ways.
> > 
> 
> Well combining them into a single transaction doesn't guarantee ordering
> of the two, right?

Sure it does. If they are combined into the one transaction, we can
_combine them_ into a single log item and guarantee that the two
quota off records are always formatted into the log in the correct
order.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-09 12:15               ` Brian Foster
  2020-07-09 16:32                 ` Darrick J. Wong
@ 2020-07-20  3:58                 ` Dave Chinner
  2020-08-26 12:17                   ` Brian Foster
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-07-20  3:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jul 09, 2020 at 08:15:30AM -0400, Brian Foster wrote:
> On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > > > 
> > > 
> > > Thanks. I think I get the general idea. We're reworking the
> > > ->iop_relog() handler to complete and replace the current intent (rather
> > > than just relog the original intent, which is what this series did for
> > > the quotaoff case) in the current dfops transaction and allow the dfops
> > > code to update its reference to the item. The part that's obviously
> > > missing is some kind of determination on when we actually need to relog
> > > the outstanding intents vs. using a fixed roll count.
> > 
> > <nod>  I don't consider myself sufficiently ail-smart to know how to do
> > that part. :)
> > 
> > > I suppose we could do something like I was mentioning in my other reply
> > > on the AIL pushing issue Dave pointed out where we'd set a bit on
> > > certain items that are tail pinned and in need of relog. That sounds
> > > like overkill given this use case is currently self-contained to dfops.
> > 
> > That might be a useful optimization -- every time defer_finish rolls the
> > transaction, check the items to see if any of them have
> > XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> > probably higher than 7) fixed roll count, we'll relog as desired to keep
> > the log moving forward.
> > 
> 
> It's an optimization in some sense to prevent unnecessary relogs, but
> the intent would be to avoid the need for a fixed count by notifying
> when a relog is needed to a transaction that should be guaranteed to
> have the reservation necessary to do so. I'm not sure it's worth the
> complexity if there were some reason we still needed to fall back to a
> hard count.

FWIW, relogging would only ever be necessary if
xfs_log_item_in_current_chkpt() returned false for an item we are
considering relogging. Otherwise, it's already queued for the next
journal checkpoint and there's no need to relog it until the
checkpoint commits....
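
i.e. the relog decision could start with something as simple as
(sketch):

	static bool
	xfs_defer_item_needs_relog(
		struct xfs_log_item	*lip)
	{
		/*
		 * Already queued for the next checkpoint - relogging
		 * again before that commits just burns log space.
		 */
		if (xfs_log_item_in_current_chkpt(lip))
			return false;
		return true;
	}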

> > > Perhaps the other idea of factoring out the threshold determination
> > > logic from xlog_grant_push_ail() might be useful.
> > > 
> > > For example, if the current free reservation is below the calculated
> > > threshold (with need_bytes == 0), return a threshold LSN based on the
> > > current tail. Instead of using that to push the AIL, compare it to
> > > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > > (which will probably be all of them in practice since they are part of
> > > the same transaction). We'd probably need to identify intents that have
> > > been recently relogged so the process doesn't repeat until the CIL
> > > drains and the li_lsn eventually changes. Hmm.. I did have an
> > > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > > purposes that might actually be sufficient for that. We could also
> > > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > > but it's not immediately clear to me how well that would fit into the
> > > dfops mechanism...
> > 
> > ...is there a sane way for dfops to query the threshold LSN so that it
> > could compare against the li_lsn of each item it holds?
> > 
> 
> I'd start with just trying to reuse the logic in xlog_grant_push_ail()
> (i.e. just factor out the AIL push). That function starts with a check
> on available log reservation to filter out unnecessary pushes, then
> calculates the AIL push LSN by adding the amount of log space we need to
> free up to the current log tail. In this case we're not pushing the AIL,
> but I think we'd be able to use the same threshold calculation logic to
> determine when to relog intents that happen to reside within the range
> from the current tail to the calculated threshold.

Hmmmm. I'm kinda wanting to pull the AIL away from the demand-based
tail pushing that xlog_grant_push_ail() does. Distance from the tail
doesn't really tell us how quickly that distance will be consumed
by ongoing operations....

One of the problems we have at the moment is that the AIL will sit
at under 75% full and do nothing at all (because the
xlog_grant_push_ail() call does nothing at <75% full) until we get
memory pressure or the log worker comes along every 30s and calls
xfs_ail_push_all().

The result is that when we are under bursty workloads, we don't
keep pushing metadata out to free up log space - our working space
to soak up bursts is only 25% of the journal, which we can fill in a
couple of seconds, even on a 2GB log. So under sustained bursty
workloads, we really only have 25% of the log available at most,
rather than continuing to write back past the 75% threshold so the
next burst can have, say, 50% of the log space available to soak up
the burst.

i.e. if we have frequent bursts and the IO subsystem can soak up the
metadata writeback rate, we should be writing back faster so that
the bursts can hit the transaction reservation fast path for
longer...

IOWs, I'd like to see the AIL move more towards a mechanism that
balances a time-averaged rate of inserts (journal checkpoint
completions) with a time-averaged rate of removals (metadata IO
completions) rather than working to a fixed free space target.
If the workload is sustained, then this effectively ends up the same
as we have now with transactions waiting for log reservation space,
but we should end up draining the AIL down further than we currently
do when incoming work tails off.
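
Hand-waving with invented names, but the sort of balance I mean is
(both rates in bytes/sec):

	struct xfs_ail_rates {
		uint64_t	insert_rate;	/* checkpoint completion */
		uint64_t	remove_rate;	/* metadata IO completion */
	};

	static bool
	xfs_ail_should_push(struct xfs_ail_rates *r)
	{
		/* push whenever insertion outpaces removal, no matter
		 * how much of the log currently happens to be free */
		return r->insert_rate > r->remove_rate;
	}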

I suspect that we could work into this a "need to be relogged within
X seconds" trigger for active reloggable items in the AIL, such that
if the top layer deferops sees the flag set on any item in the
processing of the deferops it relogs all the reloggable items in the
current set...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] xfs: automatic relogging
  2020-07-20  3:58                 ` Dave Chinner
@ 2020-08-26 12:17                   ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-08-26 12:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Mon, Jul 20, 2020 at 01:58:40PM +1000, Dave Chinner wrote:
> On Thu, Jul 09, 2020 at 08:15:30AM -0400, Brian Foster wrote:
> > On Wed, Jul 08, 2020 at 09:44:28AM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 07, 2020 at 07:37:43AM -0400, Brian Foster wrote:
> > > > > 
> > > > 
> > > > Thanks. I think I get the general idea. We're reworking the
> > > > ->iop_relog() handler to complete and replace the current intent (rather
> > > > than just relog the original intent, which is what this series did for
> > > > the quotaoff case) in the current dfops transaction and allow the dfops
> > > > code to update its reference to the item. The part that's obviously
> > > > missing is some kind of determination on when we actually need to relog
> > > > the outstanding intents vs. using a fixed roll count.
> > > 
> > > <nod>  I don't consider myself sufficiently ail-smart to know how to do
> > > that part. :)
> > > 
> > > > I suppose we could do something like I was mentioning in my other reply
> > > > on the AIL pushing issue Dave pointed out where we'd set a bit on
> > > > certain items that are tail pinned and in need of relog. That sounds
> > > > like overkill given this use case is currently self-contained to dfops.
> > > 
> > > That might be a useful optimization -- every time defer_finish rolls the
> > > transaction, check the items to see if any of them have
> > > XFS_LI_RELOGMEPLEASE set, and if any of them do, or we hit our (now
> > > probably higher than 7) fixed roll count, we'll relog as desired to keep
> > > the log moving forward.
> > > 
> > 
> > It's an optimization in some sense to prevent unnecessary relogs, but
> > the intent would be to avoid the need for a fixed count by notifying
> > when a relog is needed to a transaction that should be guaranteed to
> > have the reservation necessary to do so. I'm not sure it's worth the
> > complexity if there were some reason we still needed to fall back to a
> > hard count.
> 
> FWIW, relogging would only ever be necessary if
> xfs_log_item_in_current_chkpt() returned false for an item we are
> considering relogging. Otherwise, it's already queued for the next
> journal checkpoint and there's no need to relog it until the
> checkpoint commits....
> 
> > > > Perhaps the other idea of factoring out the threshold determination
> > > > logic from xlog_grant_push_ail() might be useful.
> > > > 
> > > > For example, if the current free reservation is below the calculated
> > > > threshold (with need_bytes == 0), return a threshold LSN based on the
> > > > current tail. Instead of using that to push the AIL, compare it to
> > > > ->li_lsn of each intent and relog any that are inside the threshold LSN
> > > > (which will probably be all of them in practice since they are part of
> > > > the same transaction). We'd probably need to identify intents that have
> > > > been recently relogged so the process doesn't repeat until the CIL
> > > > drains and the li_lsn eventually changes. Hmm.. I did have an
> > > > XFS_LI_IN_CIL state tracking patch around somewhere for debugging
> > > > purposes that might actually be sufficient for that. We could also
> > > > consider stashing a "relog push" LSN somewhere (similar to the way AIL
> > > > pushing works) and perhaps use that to avoid repeated relogs on a chain,
> > > > but it's not immediately clear to me how well that would fit into the
> > > > dfops mechanism...
> > > 
> > > ...is there a sane way for dfops to query the threshold LSN so that it
> > > could compare against the li_lsn of each item it holds?
> > > 
> > 
> > I'd start with just trying to reuse the logic in xlog_grant_push_ail()
> > (i.e. just factor out the AIL push). That function starts with a check
> > on available log reservation to filter out unnecessary pushes, then
> > calculates the AIL push LSN by adding the amount of log space we need to
> > free up to the current log tail. In this case we're not pushing the AIL,
> > but I think we'd be able to use the same threshold calculation logic to
> > determine when to relog intents that happen to reside within the range
> > from the current tail to the calculated threshold.
> 
> Hmmmm. I'm kinda wanting to pull the AIL away from the demand-based
> tail pushing that xlog_grant_push_ail() does. Distance from the tail
> doesn't really tell us how quickly that distance will be consumed
> by ongoing operations....
> 
> One of the problems we have at the moment is that the AIL will sit
> at under 75% full and do nothing at all (because the
> xlog_grant_push_ail() call does nothing at <75% full) until we get
> memory pressure or the log worker comes along every 30s and calls
> xfs_ail_push_all().
> 
> The result is that when we are under bursty workloads, we don't
> keep pushing metadata out to free up log space - our working space
> to soak up bursts is only 25% of the journal, which we can fill in a
> couple of seconds, even on a 2GB log. So under sustained bursty
> workloads, we really only have 25% of the log available at most,
> rather than continuing to write back past the 75% threshold so the
> next burst can have, say, 50% of the log space available to soak up
> the burst.
> 
> i.e. if we have frequent bursts and the IO subsystem can soak up the
> metadata writeback rate, we should be writing back faster so that
> the bursts can hit the transaction reservation fast path for
> longer...
> 

(Sorry for the delayed response. I ended up on leave before getting to
this and had to catch back up).

Sure, this is all characteristic of the current AIL pushing mechanism.
The objective above was just to identify how an external component
(dfops) could determine when to efficiently relog a particular item
pinned by the AIL without having to resort to guessing or crude roll
counts, etc. The solution is to essentially reuse AIL logic (flaws and
all) to reasonably determine whether the associated item is pinning the
tail against push pressure. If there's reason to significantly rework
the fundamental AIL pushing mechanism, this use case can most likely
pick up and continue to reuse whatever logic results from that without
much trouble.

> IOWs, I'd like to see the AIL move more towards a mechanism that
> balances a time-averaged rate of inserts (journal checkpoint
> completions) with a time-averaged rate of removals (metadata IO
> completions) rather than working to a fixed free space target.
> If the workload is sustained, then this effectively ends up the same
> as we have now with transactions waiting for log reservation space,
> but we should end up draining the AIL down further than we currently
> do when incoming work tails off.
> 

I'm not quite sure I follow... what this basically sounds like is we'd
have some insert-side tracking that calculates and stores a running rate
of something like 'items inserted per second' somewhere in the xfs_ail, and
then rather than simply receiving a push target, xfsaild becomes a bit
more self-deterministic and uses that current insert rate to set or
adjust the push target itself. Or perhaps does so based on an analogous
tail moving rate.. hm?
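
To make that concrete, I'm imagining something like the below, where
the ail fields are entirely hypothetical:

	/* sampled periodically from xfsaild */
	time64_t	now = ktime_get_seconds();
	time64_t	delta = now - ailp->ail_last_sample;

	if (delta > 0) {
		/* crude smoothing of the insert rate (items/sec) */
		uint64_t	rate = ailp->ail_ins_count / delta;

		ailp->ail_ins_rate = (ailp->ail_ins_rate + rate) / 2;
		ailp->ail_ins_count = 0;
		ailp->ail_last_sample = now;
	}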

I guess I can see how the mechanics of something like that would work,
but I'm not totally clear on what the rate and translation to push
target would look like. For one, items and checkpoints are arbitrary
sizes, so it seems that something based on consumed log space (i.e.
bytes) might be more accurate. Inserts can also move (i.e. relog) items
that are already AIL resident, so that should be considered somehow
(does it affect the rate?). The current scheme also incorporates log
reservation, which seems like a critical factor since it depends on free
log space. It's not clear to me how an I/O based insert rate alone might
encapsulate all of that, but I'm sure I'm missing much of the bigger
picture..

> I suspect that we could work into this a "need to be relogged within
> X seconds" trigger for active reloggable items in the AIL, such that
> if the top layer deferops sees the flag set on any item in the
> processing of the deferops it relogs all the reloggable items in the
> current set...
> 

Right.. I haven't gone back and reviewed the earlier discussion, but I'm
pretty sure we considered such a flag for the current approach. That it
wasn't necessary atm was just an implementation detail because the
current push logic operates with fixed LSNs and the log item knows its
place in the log...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-08-26 12:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01 16:51 [PATCH 00/10] xfs: automatic relogging Brian Foster
2020-07-01 16:51 ` [PATCH 01/10] xfs: automatic relogging item management Brian Foster
2020-07-01 16:51 ` [PATCH 02/10] xfs: create helper for ticket-less log res ungrant Brian Foster
2020-07-01 16:51 ` [PATCH 03/10] xfs: extra runtime reservation overhead for relog transactions Brian Foster
2020-07-01 16:51 ` [PATCH 04/10] xfs: relog log reservation stealing and accounting Brian Foster
2020-07-01 16:51 ` [PATCH 05/10] xfs: automatic log item relog mechanism Brian Foster
2020-07-03  6:08   ` Dave Chinner
2020-07-06 16:06     ` Brian Foster
2020-07-01 16:51 ` [PATCH 06/10] xfs: automatically relog the quotaoff start intent Brian Foster
2020-07-01 16:51 ` [PATCH 07/10] xfs: prevent fs freeze with outstanding relog items Brian Foster
2020-07-01 16:51 ` [PATCH RFC 08/10] xfs: buffer relogging support prototype Brian Foster
2020-07-01 16:51 ` [PATCH RFC 09/10] xfs: create an error tag for random relog reservation Brian Foster
2020-07-01 16:51 ` [PATCH RFC 10/10] xfs: relog random buffers based on errortag Brian Foster
2020-07-02 11:51 ` [PATCH 00/10] xfs: automatic relogging Dave Chinner
2020-07-02 18:52   ` Brian Foster
2020-07-03  0:49     ` Dave Chinner
2020-07-06 16:03       ` Brian Foster
2020-07-06 17:42         ` Darrick J. Wong
2020-07-07 11:37           ` Brian Foster
2020-07-08 16:44             ` Darrick J. Wong
2020-07-09 12:15               ` Brian Foster
2020-07-09 16:32                 ` Darrick J. Wong
2020-07-20  3:58                 ` Dave Chinner
2020-08-26 12:17                   ` Brian Foster
2020-07-10  4:09         ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).