* [RFC v5 PATCH 0/9] xfs: automatic relogging experiment
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Hi all,

Here's a v5 RFC of the automatic item relogging experiment. Firstly,
note that this is still proof-of-concept, experimental code with
various quirks. Some are documented in the code, others might not be
(such as abusing the AIL lock, etc.). The primary purpose of this
series is still to express and review a fundamental design. Based on
discussion of the last version, there is specific focus on
addressing the log reservation and pre-item locking deadlock
vectors. While the code is still quite hacky, I believe this design
addresses both of those fundamental issues.
Further details on the design and approach are documented in the
individual commit logs.

In addition, the final few patches introduce buffer relogging
capability and test infrastructure. Buffer relogging currently has
no use case other than to demonstrate development flexibility and
the ability to support arbitrary log items in the future, if ever
desired. If this approach is taken forward, the current use cases
remain centered around intent items, such as the quotaoff use case
and the extent freeing use case defined by online repair of free
space trees.

On somewhat of a tangent, another intent-oriented use case idea
crossed my mind recently, related to the long-standing writeback
stale data exposure problem (i.e., if we crash after a delalloc
extent is converted but before writeback fully completes on the
extent). The obvious approach of using unwritten extents has been
rebuffed due to performance concerns over extent conversion. I
wonder whether, if we had the ability to log a "writeback pending"
intent at some reasonable level of granularity (i.e., something
between a block and an extent), we could use that to allow log
recovery to zero (or convert) such extents in the event of a crash.
This is a whole separate design discussion, however, as it involves
tracking outstanding writeback, etc. In this context it simply
serves as a prospective use case for relogging, as such intents
would otherwise risk log subsystem deadlocks similar to those of the
quotaoff use case.

Thoughts, reviews, flames appreciated.

Brian

rfcv5:
- More fleshed out design to prevent log reservation deadlock and
  locking problems.
- Split out core patches between pre-reservation management, relog item
  state management and relog mechanism.
- Added experimental buffer relogging capability.
rfcv4: https://lore.kernel.org/linux-xfs/20191205175037.52529-1-bfoster@redhat.com/
- AIL based approach.
rfcv3: https://lore.kernel.org/linux-xfs/20191125185523.47556-1-bfoster@redhat.com/
- CIL based approach.
rfcv2: https://lore.kernel.org/linux-xfs/20191122181927.32870-1-bfoster@redhat.com/
- Different approach based on workqueue and transaction rolling.
rfc: https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/

Brian Foster (9):
  xfs: set t_task at wait time instead of alloc time
  xfs: introduce ->tr_relog transaction
  xfs: automatic relogging reservation management
  xfs: automatic relogging item management
  xfs: automatic log item relog mechanism
  xfs: automatically relog the quotaoff start intent
  xfs: buffer relogging support prototype
  xfs: create an error tag for random relog reservation
  xfs: relog random buffers based on errortag

 fs/xfs/libxfs/xfs_errortag.h   |   4 +-
 fs/xfs/libxfs/xfs_shared.h     |   1 +
 fs/xfs/libxfs/xfs_trans_resv.c |  24 +++-
 fs/xfs/libxfs/xfs_trans_resv.h |   1 +
 fs/xfs/xfs_buf_item.c          |   5 +
 fs/xfs/xfs_dquot_item.c        |   7 ++
 fs/xfs/xfs_error.c             |   3 +
 fs/xfs/xfs_log.c               |   2 +-
 fs/xfs/xfs_qm_syscalls.c       |  12 +-
 fs/xfs/xfs_trace.h             |   3 +
 fs/xfs/xfs_trans.c             |  79 +++++++++++-
 fs/xfs/xfs_trans.h             |  13 +-
 fs/xfs/xfs_trans_ail.c         | 216 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans_buf.c         |  35 ++++++
 fs/xfs/xfs_trans_priv.h        |   6 +
 15 files changed, 399 insertions(+), 12 deletions(-)

-- 
2.21.1


* [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

The xlog_ticket structure contains a task reference to support
blocking for available log reservation. This reference is assigned
at ticket allocation time, which assumes that the transaction
allocator will acquire reservation in the same context. This is
normally true, but will not always be the case with automatic
relogging.

There is otherwise no fundamental reason log space cannot be
reserved for a ticket from a context different from the allocating
context. Move the task assignment to the log reservation blocking
code where it is used.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f6006d94a581..df60942a9804 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -262,6 +262,7 @@ xlog_grant_head_wait(
 	int			need_bytes) __releases(&head->lock)
 					    __acquires(&head->lock)
 {
+	tic->t_task = current;
 	list_add_tail(&tic->t_queue, &head->waiters);
 
 	do {
@@ -3601,7 +3602,6 @@ xlog_ticket_alloc(
 	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
 
 	atomic_set(&tic->t_ref, 1);
-	tic->t_task		= current;
 	INIT_LIST_HEAD(&tic->t_queue);
 	tic->t_unit_res		= unit_res;
 	tic->t_curr_res		= unit_res;
-- 
2.21.1


* [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Create a transaction reservation specifically for relog
transactions. For now it only supports the quotaoff intent, so use
the associated reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 15 +++++++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 7a9c04920505..1f5c9e6e1afc 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -832,6 +832,17 @@ xfs_calc_sb_reservation(
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize);
 }
 
+/*
+ * Internal relog transaction.
+ *   quotaoff intent
+ */
+STATIC uint
+xfs_calc_relog_reservation(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_qm_quotaoff_reservation(mp);
+}
+
 void
 xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
@@ -946,4 +957,8 @@ xfs_trans_resv_calc(
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
+
+	resp->tr_relog.tr_logres = xfs_calc_relog_reservation(mp);
+	resp->tr_relog.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
+	resp->tr_relog.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7241ab28cf84..b723979cad09 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -50,6 +50,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_relog;	/* internal relog transaction */
 };
 
 /* shorthand way of accessing reservation structure */
-- 
2.21.1


* [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Automatic item relogging will occur from xfsaild context. xfsaild
cannot acquire log reservation itself because it is also responsible
for writeback, and thus for making used log reservation available
again. Since there is no guarantee that log reservation is available
by the time a relogged item reaches the AIL, this is prone to
deadlock.

To guarantee log reservation for automatic relogging, implement a
reservation management scheme where a transaction that is capable of
enabling relogging of an item must contribute the necessary
reservation to the relog mechanism up front. Use reference counting
to associate the lifetime of pending relog reservation to the
lifetime of in-core log items with relogging enabled.

The basic log reservation sequence for a relog enabled transaction
is as follows:

- A transaction that uses relogging specifies XFS_TRANS_RELOG at
  allocation time.
- Once initialized, RELOG transactions check for the existence of
  the global relog log ticket. If it exists, grab a reference and
  return. If not, allocate an empty ticket and install into the relog
  subsystem. Seed the relog ticket from reservation of the current
  transaction. Roll the current transaction to replenish its
  reservation and return to the caller.
- The transaction is used as normal. If an item is relogged in the
  transaction, that item acquires a reference on the global relog
  ticket currently held open by the transaction. The item's reference
  persists until relogging is disabled on the item.
- The RELOG transaction commits and releases its reference to the
  global relog ticket. The global relog ticket is released once its
  reference count drops to zero.

This provides a central relog log ticket that guarantees reservation
availability for relogged items, avoids log reservation deadlocks
and is allocated and released on demand.
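
For illustration, a minimal caller-side sketch of this sequence,
using the interfaces introduced by this patch (the quotaoff patch
later in the series follows the same pattern; mp is the usual
struct xfs_mount pointer):

	struct xfs_trans	*tp;
	int			error;

	/* allocate a permanent, relog enabled transaction */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
				XFS_TRANS_RELOG, &tp);
	if (error)
		return error;

	/* ... modify and join items, enable relogging as needed ... */

	/* commit drops this transaction's relog ticket reference */
	error = xfs_trans_commit(tp);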

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |  1 +
 fs/xfs/xfs_trans.c         | 37 +++++++++++++---
 fs/xfs/xfs_trans.h         |  3 ++
 fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_priv.h    |  1 +
 5 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c45acbd3add9..0a10ca0853ab 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
  * made then this algorithm will eventually find all the space it needs.
  */
 #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
+#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
 
 /*
  * Field values for xfs_trans_mod_sb.
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..8ac05ed8deda 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -107,9 +107,14 @@ xfs_trans_dup(
 
 	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
 		       (tp->t_flags & XFS_TRANS_RESERVE) |
-		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
-	/* We gave our writer reference to the new transaction */
+		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
+		       (tp->t_flags & XFS_TRANS_RELOG);
+	/*
+	 * The writer reference and relog reference transfer to the new
+	 * transaction.
+	 */
 	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
+	tp->t_flags &= ~XFS_TRANS_RELOG;
 	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
 
 	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
@@ -284,15 +289,25 @@ xfs_trans_alloc(
 	tp->t_firstblock = NULLFSBLOCK;
 
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
-	if (error) {
-		xfs_trans_cancel(tp);
-		return error;
+	if (error)
+		goto error;
+
+	if (flags & XFS_TRANS_RELOG) {
+		error = xfs_trans_ail_relog_reserve(&tp);
+		if (error)
+			goto error;
 	}
 
 	trace_xfs_trans_alloc(tp, _RET_IP_);
 
 	*tpp = tp;
 	return 0;
+
+error:
+	/* clear relog flag if we haven't acquired a ref */
+	tp->t_flags &= ~XFS_TRANS_RELOG;
+	xfs_trans_cancel(tp);
+	return error;
 }
 
 /*
@@ -973,6 +988,10 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
+	/* release the relog ticket reference if this transaction holds one */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
+
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
@@ -1004,6 +1023,10 @@ __xfs_trans_commit(
 			error = -EIO;
 		tp->t_ticket = NULL;
 	}
+	/* release the relog ticket reference if this transaction holds one */
+	/* XXX: handle RELOG items on transaction abort */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
@@ -1064,6 +1087,10 @@ xfs_trans_cancel(
 		tp->t_ticket = NULL;
 	}
 
+	/* release the relog ticket reference if this transaction holds one */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
+
 	/* mark this thread as no longer being in a transaction */
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 752c7fef9de7..a032989943bd 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
 void		xfs_trans_cancel(xfs_trans_t *);
 int		xfs_trans_ail_init(struct xfs_mount *);
 void		xfs_trans_ail_destroy(struct xfs_mount *);
+int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
+bool		xfs_trans_ail_relog_get(struct xfs_mount *);
+int		xfs_trans_ail_relog_put(struct xfs_mount *);
 
 void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
 				       enum xfs_blft);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00cc5b8734be..a3fb64275baa 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -818,6 +819,93 @@ xfs_trans_ail_delete(
 		xfs_log_space_wake(ailp->ail_mount);
 }
 
+bool
+xfs_trans_ail_relog_get(
+	struct xfs_mount	*mp)
+{
+	struct xfs_ail		*ailp = mp->m_ail;
+	bool			ret = false;
+
+	spin_lock(&ailp->ail_lock);
+	if (ailp->ail_relog_tic) {
+		xfs_log_ticket_get(ailp->ail_relog_tic);
+		ret = true;
+	}
+	spin_unlock(&ailp->ail_lock);
+	return ret;
+}
+
+/*
+ * Reserve log space for the automatic relogging ->tr_relog ticket. This
+ * requires a clean, permanent transaction from the caller. Pull reservation
+ * for the relog ticket and roll the caller's transaction back to its fully
+ * reserved state. If the AIL relog ticket is already initialized, grab a
+ * reference and return.
+ */
+int
+xfs_trans_ail_relog_reserve(
+	struct xfs_trans	**tpp)
+{
+	struct xfs_trans	*tp = *tpp;
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_ail		*ailp = mp->m_ail;
+	struct xlog_ticket	*tic;
+	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
+
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	if (xfs_trans_ail_relog_get(mp))
+		return 0;
+
+	/* no active ticket, fall into slow path to allocate one.. */
+	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
+	if (!tic)
+		return -ENOMEM;
+	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
+
+	/* check again since we dropped the lock for the allocation */
+	spin_lock(&ailp->ail_lock);
+	if (ailp->ail_relog_tic) {
+		xfs_log_ticket_get(ailp->ail_relog_tic);
+		spin_unlock(&ailp->ail_lock);
+		xfs_log_ticket_put(tic);
+		return 0;
+	}
+
+	/* attach and reserve space for the ->tr_relog ticket */
+	ailp->ail_relog_tic = tic;
+	tp->t_ticket->t_curr_res -= tic->t_curr_res;
+	spin_unlock(&ailp->ail_lock);
+
+	return xfs_trans_roll(tpp);
+}
+
+/*
+ * Release a reference to the relog ticket.
+ */
+int
+xfs_trans_ail_relog_put(
+	struct xfs_mount	*mp)
+{
+	struct xfs_ail		*ailp = mp->m_ail;
+	struct xlog_ticket	*tic;
+
+	spin_lock(&ailp->ail_lock);
+	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
+		spin_unlock(&ailp->ail_lock);
+		return 0;
+	}
+
+	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
+	tic = ailp->ail_relog_tic;
+	ailp->ail_relog_tic = NULL;
+	spin_unlock(&ailp->ail_lock);
+
+	xfs_log_done(mp, tic, NULL, false);
+	return 0;
+}
+
 int
 xfs_trans_ail_init(
 	xfs_mount_t	*mp)
@@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
 {
 	struct xfs_ail	*ailp = mp->m_ail;
 
+	ASSERT(ailp->ail_relog_tic == NULL);
 	kthread_stop(ailp->ail_task);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..839df6559b9f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,6 +61,7 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	struct xlog_ticket	*ail_relog_tic;
 };
 
 /*
-- 
2.21.1


* [RFC v5 PATCH 4/9] xfs: automatic relogging item management
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

As implemented by the previous patch, relogging can be enabled on
any item via a relog enabled transaction (which holds a reference to
an active relog ticket). Add a couple of log item flags to track the
relog state of an arbitrary log item. The item holds a reference to
the global relog ticket while relogging is enabled and releases the
reference when relogging is disabled.
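
As a rough usage sketch (this matches the quotaoff conversion later
in the series), a caller enables relogging on an item that is dirty
in a relog enabled transaction and cancels it once the item no
longer needs to be held in the log:

	/* enable relogging; takes a relog ticket reference */
	xfs_trans_relog_item(lip);

	/* ... long running operation; xfsaild relogs as needed ... */

	/* disable relogging and wait for pending relogs to drain */
	xfs_trans_relog_item_cancel(lip, true);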

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |  2 ++
 fs/xfs/xfs_trans.c      | 36 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h      |  6 +++++-
 fs/xfs/xfs_trans_priv.h |  2 ++
 4 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a86be7f807ee..a066617ec54d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
+DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 8ac05ed8deda..f7f2411ead4e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -778,6 +778,41 @@ xfs_trans_del_item(
 	list_del_init(&lip->li_trans);
 }
 
+void
+xfs_trans_relog_item(
+	struct xfs_log_item	*lip)
+{
+	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
+		xfs_trans_ail_relog_get(lip->li_mountp);
+		trace_xfs_relog_item(lip);
+	}
+}
+
+void
+xfs_trans_relog_item_cancel(
+	struct xfs_log_item	*lip,
+	bool			drain) /* wait for relogging to cease */
+{
+	struct xfs_mount	*mp = lip->li_mountp;
+
+	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
+		return;
+	xfs_trans_ail_relog_put(lip->li_mountp);
+	trace_xfs_relog_item_cancel(lip);
+
+	if (!drain)
+		return;
+
+	/*
+	 * Some operations might require relog activity to cease before they can
+	 * proceed. For example, an operation must wait before including a
+	 * non-lockable log item (i.e. intent) in another transaction.
+	 */
+	while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
+				   TASK_UNINTERRUPTIBLE, HZ))
+		xfs_log_force(mp, XFS_LOG_SYNC);
+}
+
 /* Detach and unlock all of the items in a transaction */
 static void
 xfs_trans_free_items(
@@ -863,6 +898,7 @@ xfs_trans_committed_bulk(
 
 		if (aborted)
 			set_bit(XFS_LI_ABORTED, &lip->li_flags);
+		clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);
 
 		if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
 			lip->li_ops->iop_release(lip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index a032989943bd..fc4c25b6eee4 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -59,12 +59,16 @@ struct xfs_log_item {
 #define	XFS_LI_ABORTED	1
 #define	XFS_LI_FAILED	2
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
+#define	XFS_LI_RELOG	4	/* automatically relog item */
+#define	XFS_LI_RELOGGED	5	/* item relogged (not committed) */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
 	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
-	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
+	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
+	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
+	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
 
 struct xfs_item_ops {
 	unsigned flags;
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 839df6559b9f..d1edec1cb8ad 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -16,6 +16,8 @@ struct xfs_log_vec;
 void	xfs_trans_init(struct xfs_mount *);
 void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
+void	xfs_trans_relog_item(struct xfs_log_item *);
+void	xfs_trans_relog_item_cancel(struct xfs_log_item *, bool);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
 void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
-- 
2.21.1


* [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Now that relog reservation is available and relog state tracking is
in place, all that remains to automatically relog items is the relog
mechanism itself. An item with relogging enabled is effectively
pinned from writeback until relogging is disabled. Instead of being
written back, the item must be periodically committed in a new
transaction to move it forward in the physical log. The purpose of
moving the item is to avoid long term tail pinning and thus avoid
log deadlocks for long running operations.

The ideal time to relog an item is in response to tail pushing
pressure. This accommodates the current workload at any given time
as opposed to a fixed time interval or log reservation heuristic,
which risks performance regression. This is essentially the same
heuristic that drives metadata writeback. XFS already implements
various log tail pushing heuristics that attempt to keep the log
progressing on an active filesystem under various workloads.

The act of relogging an item simply requires adding it to a
transaction and committing. This pushes the already dirty item into a
subsequent log checkpoint and frees up its previous location in the
on-disk log. Joining an item to a transaction of course requires
locking the item first, which means we have to be aware of
type-specific locks and lock ordering wherever the relog takes
place.

Fundamentally, this points to xfsaild as the ideal location to
process relog enabled items. xfsaild already processes log resident
items, is driven by log tail pushing pressure, processes arbitrary
log item types through callbacks, and is sensitive to type-specific
locking rules by design. The fact that automatic relogging
essentially diverts items between writeback or relog also suggests
xfsaild as an ideal location to process items one way or the other.

Of course, we don't want xfsaild to process transactions as it is a
critical component of the log subsystem for driving metadata
writeback and freeing up log space. Therefore, similar to how
xfsaild builds up a writeback queue of dirty items and queues writes
asynchronously, make xfsaild responsible only for directing pending
relog items into an appropriate queue and create an async
(workqueue) context for processing the queue. The workqueue context
utilizes the pre-reserved relog ticket to drain the queue by rolling
a permanent transaction.

Update the AIL pushing infrastructure to support a new RELOG item
state. If a log item push returns the relog state, queue the item
for relog instead of writeback. On completion of a push cycle,
schedule the relog task at the same point metadata buffer I/O is
submitted. This allows items to be relogged automatically under the
same locking rules and pressure heuristics that govern metadata
writeback.
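
For reference, a sketch of what a relog-aware ->iop_push() handler
looks like. The item type here is hypothetical, but the logic
mirrors the quotaoff handler added later in the series:

	STATIC uint
	xfs_foo_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);

		/*
		 * Request a relog if relogging is enabled on the item,
		 * a relog is not already pending and the item pins the
		 * log tail.
		 */
		if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
		    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
		    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
			return XFS_ITEM_RELOG;

		return XFS_ITEM_LOCKED;
	}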

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |   1 +
 fs/xfs/xfs_trans.h      |   1 +
 fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans_priv.h |   3 ++
 4 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a066617ec54d..df0114ec66f1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index fc4c25b6eee4..1637df32c64c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED		1
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
+#define XFS_ITEM_RELOG		4
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index a3fb64275baa..71a47faeaae8 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -144,6 +144,75 @@ xfs_ail_max_lsn(
 	return lsn;
 }
 
+/*
+ * Relog log items on the AIL relog queue.
+ */
+static void
+xfs_ail_relog(
+	struct work_struct	*work)
+{
+	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
+						     ail_relog_work);
+	struct xfs_mount	*mp = ailp->ail_mount;
+	struct xfs_trans_res	tres = {};
+	struct xfs_trans	*tp;
+	struct xfs_log_item	*lip;
+	int			error;
+
+	/*
+	 * The first transaction to submit a relog item contributed relog
+	 * reservation to the relog ticket before committing. Create an empty
+	 * transaction and manually associate the relog ticket.
+	 */
+	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
+	ASSERT(!error);
+	if (error)
+		return;
+	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
+	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
+	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
+	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
+
+	spin_lock(&ailp->ail_lock);
+	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
+					       struct xfs_log_item,
+					       li_trans)) != NULL) {
+		/*
+		 * Drop the AIL processing ticket reference once the relog list
+		 * is emptied. At this point it's possible for our transaction
+		 * to hold the only reference.
+		 */
+		list_del_init(&lip->li_trans);
+		if (list_empty(&ailp->ail_relog_list))
+			xfs_log_ticket_put(ailp->ail_relog_tic);
+		spin_unlock(&ailp->ail_lock);
+
+		xfs_trans_add_item(tp, lip);
+		set_bit(XFS_LI_DIRTY, &lip->li_flags);
+		tp->t_flags |= XFS_TRANS_DIRTY;
+		/* XXX: include ticket owner task fix */
+		error = xfs_trans_roll(&tp);
+		ASSERT(!error);
+		if (error)
+			goto out;
+		spin_lock(&ailp->ail_lock);
+	}
+	spin_unlock(&ailp->ail_lock);
+
+out:
+	/* XXX: handle shutdown scenario */
+	/*
+	 * Drop the relog reference owned by the transaction separately because
+	 * we don't want the cancel to release reservation if this isn't the
+	 * final reference. The relog ticket and associated reservation needs
+	 * to persist so long as relog items are active in the log subsystem.
+	 */
+	xfs_trans_ail_relog_put(mp);
+
+	tp->t_ticket = NULL;
+	xfs_trans_cancel(tp);
+}
+
 /*
  * The cursor keeps track of where our current traversal is up to by tracking
  * the next item in the list for us. However, for this to be safe, removing an
@@ -364,7 +433,7 @@ static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
 {
-	xfs_mount_t		*mp = ailp->ail_mount;
+	struct xfs_mount	*mp = ailp->ail_mount;
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	xfs_lsn_t		lsn;
@@ -426,6 +495,23 @@ xfsaild_push(
 			ailp->ail_last_pushed_lsn = lsn;
 			break;
 
+		case XFS_ITEM_RELOG:
+			/*
+			 * The item requires a relog. Add to the pending relog
+			 * list and set the relogged bit to prevent further
+			 * relog requests. The relog bit and ticket reference
+			 * can be dropped from the item at any point, so hold a
+			 * relog ticket reference for the pending relog list to
+			 * ensure the ticket stays around.
+			 */
+			trace_xfs_ail_relog(lip);
+			ASSERT(list_empty(&lip->li_trans));
+			if (list_empty(&ailp->ail_relog_list))
+				xfs_log_ticket_get(ailp->ail_relog_tic);
+			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
+			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
+			break;
+
 		case XFS_ITEM_FLUSHING:
 			/*
 			 * The item or its backing buffer is already being
@@ -492,6 +578,9 @@ xfsaild_push(
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;
 
+	if (!list_empty(&ailp->ail_relog_list))
+		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
+
 	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
 		/*
@@ -922,15 +1011,24 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	INIT_LIST_HEAD(&ailp->ail_relog_list);
+	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
+
+	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
+					     mp->m_super->s_id);
+	if (!ailp->ail_relog_wq)
+		goto out_free_ailp;
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_super->s_id);
 	if (IS_ERR(ailp->ail_task))
-		goto out_free_ailp;
+		goto out_destroy_wq;
 
 	mp->m_ail = ailp;
 	return 0;
 
+out_destroy_wq:
+	destroy_workqueue(ailp->ail_relog_wq);
 out_free_ailp:
 	kmem_free(ailp);
 	return -ENOMEM;
@@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
 
 	ASSERT(ailp->ail_relog_tic == NULL);
 	kthread_stop(ailp->ail_task);
+	destroy_workqueue(ailp->ail_relog_wq);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d1edec1cb8ad..33a724534869 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -63,6 +63,9 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	struct work_struct	ail_relog_work;
+	struct list_head	ail_relog_list;
+	struct workqueue_struct	*ail_relog_wq;
 	struct xlog_ticket	*ail_relog_tic;
 };
 
-- 
2.21.1


* [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

The quotaoff operation has a rare but long-standing deadlock vector
in terms of how the operation is logged. A quotaoff start intent is
logged (synchronously) at the onset to ensure recovery can handle
the operation if interrupted before in-core changes are made. This
quotaoff intent pins the log tail while the quotaoff sequence scans
and purges dquots from all in-core inodes. While this operation
generally doesn't generate much log traffic on its own, it can be
time consuming. If unrelated concurrent filesystem activity consumes
the remaining log space before quotaoff is able to acquire log
reservation for the quotaoff end intent, the filesystem locks up
indefinitely.

quotaoff cannot allocate the end intent before the scan because the
latter can result in transaction allocation itself in certain
indirect cases (releasing an inode, for example). Further, rolling
the original transaction is difficult because the scanning work
occurs multiple layers down where caller context is lost and not
much information is available to determine how often to roll the
transaction.

To address this problem, enable automatic relogging of the quotaoff
start intent. This automatically relogs the intent whenever AIL
pushing finds the item at the tail of the log. When quotaoff
completes, wait for relogging to complete as the end intent expects
to be able to permanently remove the start intent from the log
subsystem. This ensures that the log tail is kept moving during a
particularly long quotaoff operation and avoids the log reservation
deadlock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
 fs/xfs/xfs_dquot_item.c        |  7 +++++++
 fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 1f5c9e6e1afc..f49b20c9ca33 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -935,7 +935,8 @@ xfs_trans_resv_calc(
 	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
 
 	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
-	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
+	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
+	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_qm_equotaoff.tr_logres =
 		xfs_calc_qm_quotaoff_end_reservation();
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index d60647d7197b..ea5123678466 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
 	struct xfs_log_item	*lip,
 	struct list_head	*buffer_list)
 {
+	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
+
+	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
+	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
+	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
+		return XFS_ITEM_RELOG;
+
 	return XFS_ITEM_LOCKED;
 }
 
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 1ea82764bf89..7b48d34da0f4 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -18,6 +18,7 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_icache.h"
+#include "xfs_trans_priv.h"
 
 STATIC int
 xfs_qm_log_quotaoff(
@@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
 
 	*qoffstartp = NULL;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
+				XFS_TRANS_RELOG, &tp);
 	if (error)
 		goto out;
 
 	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
 	xfs_trans_log_quotaoff_item(tp, qoffi);
+	xfs_trans_relog_item(&qoffi->qql_item);
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
@@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
 	int			error;
 	struct xfs_qoff_logitem	*qoffi;
 
+	/*
+	 * startqoff must be in the AIL and not the CIL when the end intent
+	 * commits to ensure it is not readded to the AIL out of order. Wait on
+	 * relog activity to drain to isolate startqoff to the AIL.
+	 */
+	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
+
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
 	if (error)
 		return error;
-- 
2.21.1


* [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Add a quick and dirty implementation of buffer relogging support.
There is currently no use case for buffer relogging. This is for
experimental use only and serves as an example to demonstrate the
ability to relog arbitrary items in the future, if necessary.

Add a hook to enable relogging a buffer in a transaction, update
the buffer log item handlers to support relogged BLIs, and update
the relog handler to join the relogged buffer to the relog
transaction.
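
A rough caller-side sketch, assuming a relog enabled transaction
with a dirty, locked buffer joined to it (the errortag patches later
in the series exercise this path):

	/* returns false if the buffer is already queued for writeback */
	if (xfs_trans_relog_buf(tp, bp)) {
		/* bp is now periodically relogged instead of written back */
	}

	/* sometime later, drop relog state so the buffer can write back */
	xfs_trans_relog_item_cancel(&bp->b_log_item->bli_item, false);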

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c  |  5 +++++
 fs/xfs/xfs_trans.h     |  1 +
 fs/xfs/xfs_trans_ail.c | 19 ++++++++++++++++---
 fs/xfs/xfs_trans_buf.c | 22 ++++++++++++++++++++++
 4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 663810e6cd59..4ef2725fa8ce 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -463,6 +463,7 @@ xfs_buf_item_unpin(
 			list_del_init(&bp->b_li_list);
 			bp->b_iodone = NULL;
 		} else {
+			xfs_trans_relog_item_cancel(lip, false);
 			spin_lock(&ailp->ail_lock);
 			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -528,6 +529,9 @@ xfs_buf_item_push(
 		return XFS_ITEM_LOCKED;
 	}
 
+	if (test_bit(XFS_LI_RELOG, &lip->li_flags))
+		return XFS_ITEM_RELOG;
+
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
 
 	trace_xfs_buf_item_push(bip);
@@ -956,6 +960,7 @@ STATIC void
 xfs_buf_item_free(
 	struct xfs_buf_log_item	*bip)
 {
+	ASSERT(!test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags));
 	xfs_buf_item_free_format(bip);
 	kmem_free(bip->bli_item.li_lv_shadow);
 	kmem_cache_free(xfs_buf_item_zone, bip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 1637df32c64c..81cb42f552d9 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -226,6 +226,7 @@ void		xfs_trans_inode_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_stale_inode_buf(xfs_trans_t *, struct xfs_buf *);
 bool		xfs_trans_ordered_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_dquot_buf(xfs_trans_t *, struct xfs_buf *, uint);
+bool		xfs_trans_relog_buf(struct xfs_trans *, struct xfs_buf *);
 void		xfs_trans_inode_alloc_buf(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_ichgtime(struct xfs_trans *, struct xfs_inode *, int);
 void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 71a47faeaae8..103ab62e61be 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -18,6 +18,7 @@
 #include "xfs_error.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
+#include "xfs_buf_item.h"
 
 #ifdef DEBUG
 /*
@@ -187,9 +188,21 @@ xfs_ail_relog(
 			xfs_log_ticket_put(ailp->ail_relog_tic);
 		spin_unlock(&ailp->ail_lock);
 
-		xfs_trans_add_item(tp, lip);
-		set_bit(XFS_LI_DIRTY, &lip->li_flags);
-		tp->t_flags |= XFS_TRANS_DIRTY;
+		/*
+		 * TODO: Ideally, relog transaction management would be pushed
+		 * down into the ->iop_push() callbacks rather than playing
+		 * games with ->li_trans and looking at log item types here.
+		 */
+		if (lip->li_type == XFS_LI_BUF) {
+			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
+			xfs_buf_hold(bli->bli_buf);
+			xfs_trans_bjoin(tp, bli->bli_buf);
+			xfs_trans_dirty_buf(tp, bli->bli_buf);
+		} else {
+			xfs_trans_add_item(tp, lip);
+			set_bit(XFS_LI_DIRTY, &lip->li_flags);
+			tp->t_flags |= XFS_TRANS_DIRTY;
+		}
 		/* XXX: include ticket owner task fix */
 		error = xfs_trans_roll(&tp);
 		ASSERT(!error);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 08174ffa2118..e17715ac23fc 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -787,3 +787,25 @@ xfs_trans_dquot_buf(
 
 	xfs_trans_buf_set_type(tp, bp, type);
 }
+
+/*
+ * Enable automatic relogging on a buffer. This essentially pins a dirty buffer
+ * in-core until relogging is disabled. Note that the buffer must not already be
+ * queued for writeback.
+ */
+bool
+xfs_trans_relog_buf(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	ASSERT(tp->t_flags & XFS_TRANS_RELOG);
+	ASSERT(xfs_buf_islocked(bp));
+
+	if (bp->b_flags & _XBF_DELWRI_Q)
+		return false;
+
+	xfs_trans_relog_item(&bip->bli_item);
+	return true;
+}
-- 
2.21.1


* [RFC v5 PATCH 8/9] xfs: create an error tag for random relog reservation
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Create an errortag to randomly enable relogging on permanent
transactions. This only stresses relog reservation management and
does not enable relogging of any particular items. The tag will be
reused in a subsequent patch to enable random item relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_errortag.h | 4 +++-
 fs/xfs/xfs_error.c           | 3 +++
 fs/xfs/xfs_trans.c           | 6 ++++++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 79e6c4fb1d8a..ca7bcadb9455 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -55,7 +55,8 @@
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
-#define XFS_ERRTAG_MAX					35
+#define XFS_ERRTAG_RELOG				35
+#define XFS_ERRTAG_MAX					36
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -95,5 +96,6 @@
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
+#define XFS_RANDOM_RELOG				XFS_RANDOM_DEFAULT
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 331765afc53e..2838b909287e 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -53,6 +53,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
 	XFS_RANDOM_IUNLINK_FALLBACK,
+	XFS_RANDOM_RELOG,
 };
 
 struct xfs_errortag_attr {
@@ -162,6 +163,7 @@ XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
+XFS_ERRORTAG_ATTR_RW(relog,		XFS_ERRTAG_RELOG);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -199,6 +201,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
+	XFS_ERRORTAG_ATTR_LIST(relog),
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f7f2411ead4e..24e0208b74b8 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -19,6 +19,7 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
+#include "xfs_errortag.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -263,6 +264,11 @@ xfs_trans_alloc(
 	struct xfs_trans	*tp;
 	int			error;
 
+	/* relogging requires permanent transactions */
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RELOG) &&
+	    resp->tr_logflags & XFS_TRANS_PERM_LOG_RES)
+		flags |= XFS_TRANS_RELOG;
+
 	/*
 	 * Allocate the handle before we do our freeze accounting and setting up
 	 * GFP_NOFS allocation context so that we avoid lockdep false positives
-- 
2.21.1


* [RFC v5 PATCH 9/9] xfs: relog random buffers based on errortag
From: Brian Foster @ 2020-02-27 13:43 UTC
  To: linux-xfs

Since there is currently no specific use case for buffer relogging,
add some hacky and experimental code to relog random buffers when
the associated errortag is enabled. Update the relog reservation
calculation appropriately and use fixed termination logic to help
ensure that the relog queue doesn't grow indefinitely.

Note that this patch was useful in causing log reservation deadlocks
on an fsstress workload if the relog mechanism code is modified to
acquire its own log reservation rather than rely on the relog
pre-reservation mechanism. In other words, this helps prove that the
relog reservation management code effectively avoids log reservation
deadlocks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |  8 +++++++-
 fs/xfs/xfs_trans.h             |  4 +++-
 fs/xfs/xfs_trans_ail.c         | 11 +++++++++++
 fs/xfs/xfs_trans_buf.c         | 13 +++++++++++++
 4 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index f49b20c9ca33..59a328a0dec6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -840,7 +840,13 @@ STATIC uint
 xfs_calc_relog_reservation(
 	struct xfs_mount	*mp)
 {
-	return xfs_calc_qm_quotaoff_reservation(mp);
+	uint			res;
+
+	res = xfs_calc_qm_quotaoff_reservation(mp);
+#ifdef DEBUG
+	res = max(res, xfs_calc_buf_res(4, XFS_FSB_TO_B(mp, 1)));
+#endif
+	return res;
 }
 
 void
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 81cb42f552d9..1783441f6d03 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -61,6 +61,7 @@ struct xfs_log_item {
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
 #define	XFS_LI_RELOG	4	/* automatically relog item */
 #define	XFS_LI_RELOGGED	5	/* item relogged (not committed) */
+#define	XFS_LI_RELOG_RAND 6
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
@@ -68,7 +69,8 @@ struct xfs_log_item {
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
 	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
 	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
-	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
+	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }, \
+	{ (1 << XFS_LI_RELOG_RAND),	"RELOG_RAND" }
 
 struct xfs_item_ops {
 	unsigned flags;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 103ab62e61be..9b1d7c8df6d8 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -188,6 +188,17 @@ xfs_ail_relog(
 			xfs_log_ticket_put(ailp->ail_relog_tic);
 		spin_unlock(&ailp->ail_lock);
 
+		/*
+		 * Terminate random/debug relogs at a fixed, aggressive rate to
+		 * avoid building up too much relog activity.
+		 */
+		if (test_bit(XFS_LI_RELOG_RAND, &lip->li_flags) &&
+		    ((prandom_u32() & 1) ||
+		     (mp->m_flags & XFS_MOUNT_UNMOUNTING))) {
+			clear_bit(XFS_LI_RELOG_RAND, &lip->li_flags);
+			xfs_trans_relog_item_cancel(lip, false);
+		}
+
 		/*
 		 * TODO: Ideally, relog transaction management would be pushed
 		 * down into the ->iop_push() callbacks rather than playing
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index e17715ac23fc..de7b9a68fe38 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -14,6 +14,8 @@
 #include "xfs_buf_item.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 
 /*
  * Check to see if a buffer matching the given parameters is already
@@ -527,6 +529,17 @@ xfs_trans_log_buf(
 
 	trace_xfs_trans_log_buf(bip);
 	xfs_buf_item_log(bip, first, last);
+
+	/*
+	 * Relog random buffers so long as the transaction is relog enabled and
+	 * the buffer wasn't already relogged explicitly.
+	 */
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_RELOG) &&
+	    (tp->t_flags & XFS_TRANS_RELOG) &&
+	    !test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags)) {
+		if (xfs_trans_relog_buf(tp, bp))
+			set_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
+	}
 }
 
 
-- 
2.21.1


* Re: [RFC v5 PATCH 0/9] xfs: automatic relogging experiment
From: Darrick J. Wong @ 2020-02-27 15:09 UTC
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:12AM -0500, Brian Foster wrote:
> Hi all,
> 
> Here's a v5 RFC of the automatic item relogging experiment. Firstly,
> note that this is still a POC and experimental code with various quirks.

Heh, funny, I was going to ask you if you might have time next week to
review the latest iteration of the btree bulk loading series so that I
could get closer to merging the rest of online repair and/or refactoring
offline repair.  I'll take a closer look at this after I read through
everything else that came in overnight.

--D


* Re: [RFC v5 PATCH 0/9] xfs: automatic relogging experiment
From: Brian Foster @ 2020-02-27 15:18 UTC
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 07:09:36AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:12AM -0500, Brian Foster wrote:
> > Hi all,
> > 
> > Here's a v5 RFC of the automatic item relogging experiment. Firstly,
> > note that this is still a POC and experimental code with various quirks.
> 
> Heh, funny, I was going to ask you if you might have time next week to
> review the latest iteration of the btree bulk loading series so that I
> could get closer to merging the rest of online repair and/or refactoring
> offline repair.  I'll take a closer look at this after I read through
> everything else that came in overnight.
> 

Sure.. I can put that next on the list. Is the latest release pending a
post or already posted? Being out for over a month (effectively closer
to two when considering proximity to the holidays) caused me to pretty
much clear everything in my mailbox for obvious reasons. ;) As a result,
anything that might have been on my radar prior to that timeframe has
most likely dropped completely off it. :P

Brian

> --D
> 
> > [...]



* Re: [RFC v5 PATCH 0/9] xfs: automatic relogging experiment
  2020-02-27 15:18   ` Brian Foster
@ 2020-02-27 15:22     ` Darrick J. Wong
  0 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-27 15:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 10:18:14AM -0500, Brian Foster wrote:
> On Thu, Feb 27, 2020 at 07:09:36AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 27, 2020 at 08:43:12AM -0500, Brian Foster wrote:
> > > Hi all,
> > > 
> > > Here's a v5 RFC of the automatic item relogging experiment. Firstly,
> > > note that this is still a POC and experimental code with various quirks.
> > 
> > Heh, funny, I was going to ask you if you might have time next week to
> > review the latest iteration of the btree bulk loading series so that I
> > could get closer to merging the rest of online repair and/or refactoring
> > offline repair.  I'll take a closer look at this after I read through
> > everything else that came in overnight.
> > 
> 
> Sure.. I can put that next on the list. Is the latest release pending a
> post or already posted? Being out for over a month (effectively closer
> to two when considering proximity to the holidays) caused me to pretty
> much clear everything in my mailbox for obvious reasons. ;) As a result,
> anything that might have been on my radar prior to that timeframe has
> most likely dropped completely off it. :P

Pending.  The patches themselves haven't changed much since the end of
October when I fixed all the things we talked about at the beginning of
that month, but you might as well wait for a new version rebased off
5.6. :)

(If you get really really bored and/or I get bogged down in something
else, the NYE patchbomb version is pretty close to what's in my tree
now...)

--D

> Brian
> 
> > --D
> > 
> > > [...]


* Re: [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time
  2020-02-27 13:43 ` [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time Brian Foster
@ 2020-02-27 20:48   ` Allison Collins
  2020-02-27 23:28   ` Darrick J. Wong
  1 sibling, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 20:48 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> The xlog_ticket structure contains a task reference to support
> blocking for available log reservation. This reference is assigned
> at ticket allocation time, which assumes that the transaction
> allocator will acquire reservation in the same context. This is
> normally true, but will not always be the case with automatic
> relogging.
> 
> There is otherwise no fundamental reason log space cannot be
> reserved for a ticket from a context different from the allocating
> context. Move the task assignment to the log reservation blocking
> code where it is used.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Ok, looks ok to me
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f6006d94a581..df60942a9804 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -262,6 +262,7 @@ xlog_grant_head_wait(
>   	int			need_bytes) __releases(&head->lock)
>   					    __acquires(&head->lock)
>   {
> +	tic->t_task = current;
>   	list_add_tail(&tic->t_queue, &head->waiters);
>   
>   	do {
> @@ -3601,7 +3602,6 @@ xlog_ticket_alloc(
>   	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
>   
>   	atomic_set(&tic->t_ref, 1);
> -	tic->t_task		= current;
>   	INIT_LIST_HEAD(&tic->t_queue);
>   	tic->t_unit_res		= unit_res;
>   	tic->t_curr_res		= unit_res;
> 


* Re: [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction
  2020-02-27 13:43 ` [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction Brian Foster
@ 2020-02-27 20:49   ` Allison Collins
  2020-02-27 23:31   ` Darrick J. Wong
  1 sibling, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 20:49 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> Create a transaction reservation specifically for relog
> transactions. For now it only supports the quotaoff intent, so use
> the associated reservation.
> 

Alrighty, looks ok
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_trans_resv.c | 15 +++++++++++++++
>   fs/xfs/libxfs/xfs_trans_resv.h |  1 +
>   2 files changed, 16 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 7a9c04920505..1f5c9e6e1afc 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -832,6 +832,17 @@ xfs_calc_sb_reservation(
>   	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize);
>   }
>   
> +/*
> + * Internal relog transaction.
> + *   quotaoff intent
> + */
> +STATIC uint
> +xfs_calc_relog_reservation(
> +	struct xfs_mount	*mp)
> +{
> +	return xfs_calc_qm_quotaoff_reservation(mp);
> +}
> +
>   void
>   xfs_trans_resv_calc(
>   	struct xfs_mount	*mp,
> @@ -946,4 +957,8 @@ xfs_trans_resv_calc(
>   	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
>   	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>   	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> +
> +	resp->tr_relog.tr_logres = xfs_calc_relog_reservation(mp);
> +	resp->tr_relog.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> +	resp->tr_relog.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>   }
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7241ab28cf84..b723979cad09 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -50,6 +50,7 @@ struct xfs_trans_resv {
>   	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
>   	struct xfs_trans_res	tr_sb;		/* modify superblock */
>   	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> +	struct xfs_trans_res	tr_relog;	/* internal relog transaction */
>   };
>   
>   /* shorthand way of accessing reservation structure */
> 


* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-02-27 13:43 ` [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management Brian Foster
@ 2020-02-27 20:49   ` Allison Collins
  2020-02-28  0:02   ` Darrick J. Wong
  2020-03-02  3:07   ` Dave Chinner
  2 siblings, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 20:49 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front. Use reference counting
> to tie the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.
> 
> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>    allocation time.
> - Once initialized, RELOG transactions check for the existence of
>    the global relog log ticket. If it exists, grab a reference and
>    return. If not, allocate an empty ticket and install into the relog
>    subsystem. Seed the relog ticket from reservation of the current
>    transaction. Roll the current transaction to replenish its
>    reservation and return to the caller.
> - The transaction is used as normal. If an item is relogged in the
>    transaction, that item acquires a reference on the global relog
>    ticket currently held open by the transaction. The item's reference
>    persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>    global relog ticket. The global relog ticket is released once its
>    reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.
> 
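For illustration, the sequence above from a caller's point of view -- a
minimal sketch against the interfaces this series introduces (hypothetical
item pointer "lip"; error handling trimmed):

	struct xfs_trans	*tp;
	int			error;

	/* seeds or references the global relog ticket */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
				XFS_TRANS_RELOG, &tp);
	if (error)
		return error;

	/* ... log the intent item, then enable relogging on it ... */
	xfs_trans_relog_item(lip);

	/*
	 * Commit drops the transaction's relog ticket reference; the
	 * item's reference persists until relogging is cancelled.
	 */
	error = xfs_trans_commit(tp);
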
Ok, I followed it through and didn't see any obvious errors
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_shared.h |  1 +
>   fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>   fs/xfs/xfs_trans.h         |  3 ++
>   fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_trans_priv.h    |  1 +
>   5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>    * made then this algorithm will eventually find all the space it needs.
>    */
>   #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>   
>   /*
>    * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>   
>   	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>   		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>   	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>   	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>   
>   	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>   	tp->t_firstblock = NULLFSBLOCK;
>   
>   	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>   	}
>   
>   	trace_xfs_trans_alloc(tp, _RET_IP_);
>   
>   	*tpp = tp;
>   	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;
>   }
>   
>   /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>   
>   	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>   
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   	xfs_trans_free(tp);
>   
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>   			error = -EIO;
>   		tp->t_ticket = NULL;
>   	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   	xfs_trans_free_items(tp, !!error);
>   	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>   		tp->t_ticket = NULL;
>   	}
>   
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>   	/* mark this thread as no longer being in a transaction */
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 752c7fef9de7..a032989943bd 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
>   void		xfs_trans_cancel(xfs_trans_t *);
>   int		xfs_trans_ail_init(struct xfs_mount *);
>   void		xfs_trans_ail_destroy(struct xfs_mount *);
> +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> +int		xfs_trans_ail_relog_put(struct xfs_mount *);
>   
>   void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
>   				       enum xfs_blft);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be..a3fb64275baa 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>   #include "xfs_errortag.h"
>   #include "xfs_error.h"
>   #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>   
>   #ifdef DEBUG
>   /*
> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>   		xfs_log_space_wake(ailp->ail_mount);
>   }
>   
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}
> +
> +/*
> + * Release a reference to the relog ticket.
> + */
> +int
> +xfs_trans_ail_relog_put(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> +		spin_unlock(&ailp->ail_lock);
> +		return 0;
> +	}
> +
> +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> +	tic = ailp->ail_relog_tic;
> +	ailp->ail_relog_tic = NULL;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	xfs_log_done(mp, tic, NULL, false);
> +	return 0;
> +}
> +
>   int
>   xfs_trans_ail_init(
>   	xfs_mount_t	*mp)
> @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
>   {
>   	struct xfs_ail	*ailp = mp->m_ail;
>   
> +	ASSERT(ailp->ail_relog_tic == NULL);
>   	kthread_stop(ailp->ail_task);
>   	kmem_free(ailp);
>   }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..839df6559b9f 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>   	int			ail_log_flush;
>   	struct list_head	ail_buf_list;
>   	wait_queue_head_t	ail_empty;
> +	struct xlog_ticket	*ail_relog_tic;
>   };
>   
>   /*
> 


* Re: [RFC v5 PATCH 4/9] xfs: automatic relogging item management
  2020-02-27 13:43 ` [RFC v5 PATCH 4/9] xfs: automatic relogging item management Brian Foster
@ 2020-02-27 21:18   ` Allison Collins
  2020-03-02  5:58   ` Dave Chinner
  1 sibling, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 21:18 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 2/27/20 6:43 AM, Brian Foster wrote:
> As implemented by the previous patch, relogging can be enabled on
> any item via a relog enabled transaction (which holds a reference to
> an active relog ticket). Add a couple log item flags to track relog
> state of an arbitrary log item. The item holds a reference to the
> global relog ticket when relogging is enabled and releases the
> reference when relogging is disabled.
> 
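For illustration, the enable/cancel pairing this adds, sketched for a
long-running operation on some intent item (hypothetical caller; the real
user is the quotaoff path later in the series):

	/* set XFS_LI_RELOG and take a relog ticket reference */
	xfs_trans_relog_item(lip);

	/* ... item is periodically recommitted instead of written back ... */

	/*
	 * Clear XFS_LI_RELOG and drop the ticket reference; with drain ==
	 * true, also wait for any pending relog of the item to complete.
	 */
	xfs_trans_relog_item_cancel(lip, true);
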
Alrighty, I think it's ok
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_trace.h      |  2 ++
>   fs/xfs/xfs_trans.c      | 36 ++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_trans.h      |  6 +++++-
>   fs/xfs/xfs_trans_priv.h |  2 ++
>   4 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a86be7f807ee..a066617ec54d 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> +DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>   
>   DECLARE_EVENT_CLASS(xfs_ail_class,
>   	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 8ac05ed8deda..f7f2411ead4e 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -778,6 +778,41 @@ xfs_trans_del_item(
>   	list_del_init(&lip->li_trans);
>   }
>   
> +void
> +xfs_trans_relog_item(
> +	struct xfs_log_item	*lip)
> +{
> +	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
> +		xfs_trans_ail_relog_get(lip->li_mountp);
> +		trace_xfs_relog_item(lip);
> +	}
> +}
> +
> +void
> +xfs_trans_relog_item_cancel(
> +	struct xfs_log_item	*lip,
> +	bool			drain) /* wait for relogging to cease */
> +{
> +	struct xfs_mount	*mp = lip->li_mountp;
> +
> +	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
> +		return;
> +	xfs_trans_ail_relog_put(lip->li_mountp);
> +	trace_xfs_relog_item_cancel(lip);
> +
> +	if (!drain)
> +		return;
> +
> +	/*
> +	 * Some operations might require relog activity to cease before they can
> +	 * proceed. For example, an operation must wait before including a
> +	 * non-lockable log item (i.e. intent) in another transaction.
> +	 */
> +	while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
> +				   TASK_UNINTERRUPTIBLE, HZ))
> +		xfs_log_force(mp, XFS_LOG_SYNC);
> +}
> +
>   /* Detach and unlock all of the items in a transaction */
>   static void
>   xfs_trans_free_items(
> @@ -863,6 +898,7 @@ xfs_trans_committed_bulk(
>   
>   		if (aborted)
>   			set_bit(XFS_LI_ABORTED, &lip->li_flags);
> +		clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);
>   
>   		if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
>   			lip->li_ops->iop_release(lip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index a032989943bd..fc4c25b6eee4 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -59,12 +59,16 @@ struct xfs_log_item {
>   #define	XFS_LI_ABORTED	1
>   #define	XFS_LI_FAILED	2
>   #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_RELOG	4	/* automatically relog item */
> +#define	XFS_LI_RELOGGED	5	/* item relogged (not committed) */
>   
>   #define XFS_LI_FLAGS \
>   	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
>   	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
>   	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
> -	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
> +	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
>   
>   struct xfs_item_ops {
>   	unsigned flags;
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 839df6559b9f..d1edec1cb8ad 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -16,6 +16,8 @@ struct xfs_log_vec;
>   void	xfs_trans_init(struct xfs_mount *);
>   void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
>   void	xfs_trans_del_item(struct xfs_log_item *);
> +void	xfs_trans_relog_item(struct xfs_log_item *);
> +void	xfs_trans_relog_item_cancel(struct xfs_log_item *, bool);
>   void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
>   
>   void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
> 


* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-02-27 13:43 ` [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism Brian Foster
@ 2020-02-27 22:54   ` Allison Collins
  2020-02-28  0:13   ` Darrick J. Wong
  2020-03-02  7:18   ` Dave Chinner
  2 siblings, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 22:54 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active filesystem under various workloads.
> 
> The act of relogging an item simply requires adding it to a
> transaction and committing. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items to either writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.
> 
> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
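For illustration, how an item type opts in from its ->iop_push() handler --
a sketch modeled on the quotaoff push handler later in the series
(hypothetical item type; the AIL then queues the item and the relog worker
recommits it):

	STATIC uint
	xfs_foo_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		/* divert to the relog queue instead of writeback */
		if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
		    !test_bit(XFS_LI_RELOGGED, &lip->li_flags))
			return XFS_ITEM_RELOG;

		/* ... normal push processing otherwise ... */
		return XFS_ITEM_LOCKED;
	}
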
Ok, I followed it through, and didn't notice any logic errors

Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_trace.h      |   1 +
>   fs/xfs/xfs_trans.h      |   1 +
>   fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>   fs/xfs/xfs_trans_priv.h |   3 ++
>   4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>   DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>   DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>   
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>   #define XFS_ITEM_PINNED		1
>   #define XFS_ITEM_LOCKED		2
>   #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>   
>   /*
>    * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>   	return lsn;
>   }
>   
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {
> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */
> +		error = xfs_trans_roll(&tp);
> +		ASSERT(!error);
> +		if (error)
> +			goto out;
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +out:
> +	/* XXX: handle shutdown scenario */
> +	/*
> +	 * Drop the relog reference owned by the transaction separately because
> +	 * we don't want the cancel to release reservation if this isn't the
> +	 * final reference. The relog ticket and associated reservation needs
> +	 * to persist so long as relog items are active in the log subsystem.
> +	 */
> +	xfs_trans_ail_relog_put(mp);
> +
> +	tp->t_ticket = NULL;
> +	xfs_trans_cancel(tp);
> +}
> +
>   /*
>    * The cursor keeps track of where our current traversal is up to by tracking
>    * the next item in the list for us. However, for this to be safe, removing an
> @@ -364,7 +433,7 @@ static long
>   xfsaild_push(
>   	struct xfs_ail		*ailp)
>   {
> -	xfs_mount_t		*mp = ailp->ail_mount;
> +	struct xfs_mount	*mp = ailp->ail_mount;
>   	struct xfs_ail_cursor	cur;
>   	struct xfs_log_item	*lip;
>   	xfs_lsn_t		lsn;
> @@ -426,6 +495,23 @@ xfsaild_push(
>   			ailp->ail_last_pushed_lsn = lsn;
>   			break;
>   
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;
> +
>   		case XFS_ITEM_FLUSHING:
>   			/*
>   			 * The item or its backing buffer is already being
> @@ -492,6 +578,9 @@ xfsaild_push(
>   	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>   		ailp->ail_log_flush++;
>   
> +	if (!list_empty(&ailp->ail_relog_list))
> +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +
>   	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>   out_done:
>   		/*
> @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
>   	spin_lock_init(&ailp->ail_lock);
>   	INIT_LIST_HEAD(&ailp->ail_buf_list);
>   	init_waitqueue_head(&ailp->ail_empty);
> +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +					     mp->m_super->s_id);
> +	if (!ailp->ail_relog_wq)
> +		goto out_free_ailp;
>   
>   	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>   			ailp->ail_mount->m_super->s_id);
>   	if (IS_ERR(ailp->ail_task))
> -		goto out_free_ailp;
> +		goto out_destroy_wq;
>   
>   	mp->m_ail = ailp;
>   	return 0;
>   
> +out_destroy_wq:
> +	destroy_workqueue(ailp->ail_relog_wq);
>   out_free_ailp:
>   	kmem_free(ailp);
>   	return -ENOMEM;
> @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
>   
>   	ASSERT(ailp->ail_relog_tic == NULL);
>   	kthread_stop(ailp->ail_task);
> +	destroy_workqueue(ailp->ail_relog_wq);
>   	kmem_free(ailp);
>   }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index d1edec1cb8ad..33a724534869 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -63,6 +63,9 @@ struct xfs_ail {
>   	int			ail_log_flush;
>   	struct list_head	ail_buf_list;
>   	wait_queue_head_t	ail_empty;
> +	struct work_struct	ail_relog_work;
> +	struct list_head	ail_relog_list;
> +	struct workqueue_struct	*ail_relog_wq;
>   	struct xlog_ticket	*ail_relog_tic;
>   };
>   
> 


* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-27 13:43 ` [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent Brian Foster
@ 2020-02-27 23:19   ` Allison Collins
  2020-02-28 14:03     ` Brian Foster
  2020-02-28  1:16   ` Darrick J. Wong
  1 sibling, 1 reply; 59+ messages in thread
From: Allison Collins @ 2020-02-27 23:19 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> The quotaoff operation has a rare but longstanding deadlock vector
> rooted in how the operation is logged. A quotaoff start intent is
> logged (synchronously) at the onset to ensure recovery can handle
> the operation if interrupted before in-core changes are made. This
> quotaoff intent pins the log tail while the quotaoff sequence scans
> and purges dquots from all in-core inodes. While this operation
> generally doesn't generate much log traffic on its own, it can be
> time consuming. If unrelated, concurrent filesystem activity
> consumes remaining log space before quotaoff is able to acquire log
> reservation for the quotaoff end intent, the filesystem locks up
> indefinitely.
> 
> quotaoff cannot allocate the end intent before the scan because the
> latter can result in transaction allocation itself in certain
> indirect cases (releasing an inode, for example). Further, rolling
> the original transaction is difficult because the scanning work
> occurs multiple layers down where caller context is lost and not
> much information is available to determine how often to roll the
> transaction.
> 
> To address this problem, enable automatic relogging of the quotaoff
> start intent. This automatically relogs the intent whenever AIL
> pushing finds the item at the tail of the log. When quotaoff
> completes, wait for relogging to complete as the end intent expects
> to be able to permanently remove the start intent from the log
> subsystem. This ensures that the log tail is kept moving during a
> particularly long quotaoff operation and avoids the log reservation
> deadlock.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
>   fs/xfs/xfs_dquot_item.c        |  7 +++++++
>   fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
>   3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 1f5c9e6e1afc..f49b20c9ca33 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
>   	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
>   
>   	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
What's the reason for the log count change here?  Otherwise looks ok.

Allison
>   
>   	resp->tr_qm_equotaoff.tr_logres =
>   		xfs_calc_qm_quotaoff_end_reservation();
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index d60647d7197b..ea5123678466 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
>   	struct xfs_log_item	*lip,
>   	struct list_head	*buffer_list)
>   {
> +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> +
> +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> +		return XFS_ITEM_RELOG;
> +
>   	return XFS_ITEM_LOCKED;
>   }
>   
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 1ea82764bf89..7b48d34da0f4 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -18,6 +18,7 @@
>   #include "xfs_quota.h"
>   #include "xfs_qm.h"
>   #include "xfs_icache.h"
> +#include "xfs_trans_priv.h"
>   
>   STATIC int
>   xfs_qm_log_quotaoff(
> @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
>   
>   	*qoffstartp = NULL;
>   
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> +				XFS_TRANS_RELOG, &tp);
>   	if (error)
>   		goto out;
>   
>   	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
>   	xfs_trans_log_quotaoff_item(tp, qoffi);
> +	xfs_trans_relog_item(&qoffi->qql_item);
>   
>   	spin_lock(&mp->m_sb_lock);
>   	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
>   	int			error;
>   	struct xfs_qoff_logitem	*qoffi;
>   
> +	/*
> +	 * startqoff must be in the AIL and not the CIL when the end intent
> +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> +	 * relog activity to drain to isolate startqoff to the AIL.
> +	 */
> +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> +
>   	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
>   	if (error)
>   		return error;
> 


* Re: [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time
  2020-02-27 13:43 ` [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time Brian Foster
  2020-02-27 20:48   ` Allison Collins
@ 2020-02-27 23:28   ` Darrick J. Wong
  2020-02-28  0:10     ` Dave Chinner
  1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-27 23:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:13AM -0500, Brian Foster wrote:
> The xlog_ticket structure contains a task reference to support
> blocking for available log reservation. This reference is assigned
> at ticket allocation time, which assumes that the transaction
> allocator will acquire reservation in the same context. This is
> normally true, but will not always be the case with automatic
> relogging.
> 
> There is otherwise no fundamental reason log space cannot be
> reserved for a ticket from a context different from the allocating
> context. Move the task assignment to the log reservation blocking
> code where it is used.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f6006d94a581..df60942a9804 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -262,6 +262,7 @@ xlog_grant_head_wait(
>  	int			need_bytes) __releases(&head->lock)
>  					    __acquires(&head->lock)
>  {
> +	tic->t_task = current;
>  	list_add_tail(&tic->t_queue, &head->waiters);
>  
>  	do {
> @@ -3601,7 +3602,6 @@ xlog_ticket_alloc(
>  	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
>  
>  	atomic_set(&tic->t_ref, 1);
> -	tic->t_task		= current;

Hm.  So this leaves t_task set to NULL in the ticket constructor in
favor of setting it in xlog_grant_head_wait.  I guess this implies that
some future piece will be able to transfer a ticket to another process
as part of a regrant or something?

I've been wondering lately if you could transfer a dirty permanent
transaction to a different task so that the front end could return to
userspace as soon as the first transaction (with the intent items)
commits, and then you could reduce the latency of front-end system
calls.  That's probably a huge fantasy since you'd also have to transfer
a whole ton of state to that worker and whatever you locked to do the
operation remains locked...

--D

>  	INIT_LIST_HEAD(&tic->t_queue);
>  	tic->t_unit_res		= unit_res;
>  	tic->t_curr_res		= unit_res;
> -- 
> 2.21.1
> 


* Re: [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction
  2020-02-27 13:43 ` [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction Brian Foster
  2020-02-27 20:49   ` Allison Collins
@ 2020-02-27 23:31   ` Darrick J. Wong
  2020-02-28 13:52     ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-27 23:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:14AM -0500, Brian Foster wrote:
> Create a transaction reservation specifically for relog
> transactions. For now it only supports the quotaoff intent, so use
> the associated reservation.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 15 +++++++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h |  1 +
>  2 files changed, 16 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 7a9c04920505..1f5c9e6e1afc 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -832,6 +832,17 @@ xfs_calc_sb_reservation(
>  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize);
>  }
>  
> +/*
> + * Internal relog transaction.
> + *   quotaoff intent
> + */
> +STATIC uint
> +xfs_calc_relog_reservation(
> +	struct xfs_mount	*mp)
> +{
> +	return xfs_calc_qm_quotaoff_reservation(mp);

So when we add the next reloggable intent item, this will turn this
into an n-way max(sizeof(type0), sizeof(type1), ...sizeof(typeN)); ?

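(Presumably something like the following, with hypothetical per-type
helpers, just to illustrate the shape:)

	STATIC uint
	xfs_calc_relog_reservation(
		struct xfs_mount	*mp)
	{
		uint	res = xfs_calc_qm_quotaoff_reservation(mp);

		/* max over every reloggable intent type */
		res = max(res, xfs_calc_other_intent_reservation(mp));
		return res;
	}
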
> +}
> +
>  void
>  xfs_trans_resv_calc(
>  	struct xfs_mount	*mp,
> @@ -946,4 +957,8 @@ xfs_trans_resv_calc(
>  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
>  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> +
> +	resp->tr_relog.tr_logres = xfs_calc_relog_reservation(mp);
> +	resp->tr_relog.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;

Relog operations can roll?  I would have figured that you'd simply log
the old item(s) in a new transaction and commit it, along with some
magic to let the log tail move forward.  I guess I'll see what happens
in the next 7 patches. :)

--D

> +	resp->tr_relog.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  }
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7241ab28cf84..b723979cad09 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -50,6 +50,7 @@ struct xfs_trans_resv {
>  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
>  	struct xfs_trans_res	tr_sb;		/* modify superblock */
>  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> +	struct xfs_trans_res	tr_relog;	/* internal relog transaction */
>  };
>  
>  /* shorthand way of accessing reservation structure */
> -- 
> 2.21.1
> 


* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-02-27 13:43 ` [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype Brian Foster
@ 2020-02-27 23:33   ` Allison Collins
  2020-02-28 14:04     ` Brian Foster
  2020-03-02  7:47   ` Dave Chinner
  1 sibling, 1 reply; 59+ messages in thread
From: Allison Collins @ 2020-02-27 23:33 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 2/27/20 6:43 AM, Brian Foster wrote:
> Add a quick and dirty implementation of buffer relogging support.
> There is currently no use case for buffer relogging. This is for
> experimental use only and serves as an example to demonstrate the
> ability to relog arbitrary items in the future, if necessary.
> 
> Add a hook to enable relogging a buffer in a transaction, update the
> buffer log item handlers to support relogged BLIs and update the
> relog handler to join the relogged buffer to the relog transaction.
> 
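For illustration, what a hypothetical caller of the hook below might look
like, assuming a RELOG-enabled transaction and a locked, dirty buffer:

	/* returns false if the buffer is already queued for writeback */
	if (!xfs_trans_relog_buf(tp, bp))
		return -EAGAIN;	/* hypothetical fallback */

	/*
	 * bp now stays pinned in-core and is periodically recommitted
	 * from the AIL until relogging is cancelled on its log item.
	 */
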
Alrighty, thanks for the example!  It sounds like it's meant more to be 
a demo than to really be applied though?

Allison

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_buf_item.c  |  5 +++++
>   fs/xfs/xfs_trans.h     |  1 +
>   fs/xfs/xfs_trans_ail.c | 19 ++++++++++++++++---
>   fs/xfs/xfs_trans_buf.c | 22 ++++++++++++++++++++++
>   4 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 663810e6cd59..4ef2725fa8ce 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -463,6 +463,7 @@ xfs_buf_item_unpin(
>   			list_del_init(&bp->b_li_list);
>   			bp->b_iodone = NULL;
>   		} else {
> +			xfs_trans_relog_item_cancel(lip, false);
>   			spin_lock(&ailp->ail_lock);
>   			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
>   			xfs_buf_item_relse(bp);
> @@ -528,6 +529,9 @@ xfs_buf_item_push(
>   		return XFS_ITEM_LOCKED;
>   	}
>   
> +	if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> +		return XFS_ITEM_RELOG;
> +
>   	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>   
>   	trace_xfs_buf_item_push(bip);
> @@ -956,6 +960,7 @@ STATIC void
>   xfs_buf_item_free(
>   	struct xfs_buf_log_item	*bip)
>   {
> +	ASSERT(!test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags));
>   	xfs_buf_item_free_format(bip);
>   	kmem_free(bip->bli_item.li_lv_shadow);
>   	kmem_cache_free(xfs_buf_item_zone, bip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 1637df32c64c..81cb42f552d9 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -226,6 +226,7 @@ void		xfs_trans_inode_buf(xfs_trans_t *, struct xfs_buf *);
>   void		xfs_trans_stale_inode_buf(xfs_trans_t *, struct xfs_buf *);
>   bool		xfs_trans_ordered_buf(xfs_trans_t *, struct xfs_buf *);
>   void		xfs_trans_dquot_buf(xfs_trans_t *, struct xfs_buf *, uint);
> +bool		xfs_trans_relog_buf(struct xfs_trans *, struct xfs_buf *);
>   void		xfs_trans_inode_alloc_buf(xfs_trans_t *, struct xfs_buf *);
>   void		xfs_trans_ichgtime(struct xfs_trans *, struct xfs_inode *, int);
>   void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 71a47faeaae8..103ab62e61be 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -18,6 +18,7 @@
>   #include "xfs_error.h"
>   #include "xfs_log.h"
>   #include "xfs_log_priv.h"
> +#include "xfs_buf_item.h"
>   
>   #ifdef DEBUG
>   /*
> @@ -187,9 +188,21 @@ xfs_ail_relog(
>   			xfs_log_ticket_put(ailp->ail_relog_tic);
>   		spin_unlock(&ailp->ail_lock);
>   
> -		xfs_trans_add_item(tp, lip);
> -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> -		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/*
> +		 * TODO: Ideally, relog transaction management would be pushed
> +		 * down into the ->iop_push() callbacks rather than playing
> +		 * games with ->li_trans and looking at log item types here.
> +		 */
> +		if (lip->li_type == XFS_LI_BUF) {
> +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> +			xfs_buf_hold(bli->bli_buf);
> +			xfs_trans_bjoin(tp, bli->bli_buf);
> +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> +		} else {
> +			xfs_trans_add_item(tp, lip);
> +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +			tp->t_flags |= XFS_TRANS_DIRTY;
> +		}
>   		/* XXX: include ticket owner task fix */
>   		error = xfs_trans_roll(&tp);
>   		ASSERT(!error);
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 08174ffa2118..e17715ac23fc 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -787,3 +787,25 @@ xfs_trans_dquot_buf(
>   
>   	xfs_trans_buf_set_type(tp, bp, type);
>   }
> +
> +/*
> + * Enable automatic relogging on a buffer. This essentially pins a dirty buffer
> + * in-core until relogging is disabled. Note that the buffer must not already be
> + * queued for writeback.
> + */
> +bool
> +xfs_trans_relog_buf(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_RELOG);
> +	ASSERT(xfs_buf_islocked(bp));
> +
> +	if (bp->b_flags & _XBF_DELWRI_Q)
> +		return false;
> +
> +	xfs_trans_relog_item(&bip->bli_item);
> +	return true;
> +}
> 


* Re: [RFC v5 PATCH 8/9] xfs: create an error tag for random relog reservation
  2020-02-27 13:43 ` [RFC v5 PATCH 8/9] xfs: create an error tag for random relog reservation Brian Foster
@ 2020-02-27 23:35   ` Allison Collins
  0 siblings, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-27 23:35 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 2/27/20 6:43 AM, Brian Foster wrote:
> Create an errortag to randomly enable relogging on permanent
> transactions. This only stresses relog reservation management and
> does not enable relogging of any particular items. The tag will be
> reused in a subsequent patch to enable random item relogging.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Alrighty, looks ok:
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/libxfs/xfs_errortag.h | 4 +++-
>   fs/xfs/xfs_error.c           | 3 +++
>   fs/xfs/xfs_trans.c           | 6 ++++++
>   3 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
> index 79e6c4fb1d8a..ca7bcadb9455 100644
> --- a/fs/xfs/libxfs/xfs_errortag.h
> +++ b/fs/xfs/libxfs/xfs_errortag.h
> @@ -55,7 +55,8 @@
>   #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
>   #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
>   #define XFS_ERRTAG_IUNLINK_FALLBACK			34
> -#define XFS_ERRTAG_MAX					35
> +#define XFS_ERRTAG_RELOG				35
> +#define XFS_ERRTAG_MAX					36
>   
>   /*
>    * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> @@ -95,5 +96,6 @@
>   #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
>   #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
>   #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
> +#define XFS_RANDOM_RELOG				XFS_RANDOM_DEFAULT
>   
>   #endif /* __XFS_ERRORTAG_H_ */
> diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> index 331765afc53e..2838b909287e 100644
> --- a/fs/xfs/xfs_error.c
> +++ b/fs/xfs/xfs_error.c
> @@ -53,6 +53,7 @@ static unsigned int xfs_errortag_random_default[] = {
>   	XFS_RANDOM_FORCE_SCRUB_REPAIR,
>   	XFS_RANDOM_FORCE_SUMMARY_RECALC,
>   	XFS_RANDOM_IUNLINK_FALLBACK,
> +	XFS_RANDOM_RELOG,
>   };
>   
>   struct xfs_errortag_attr {
> @@ -162,6 +163,7 @@ XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
>   XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
>   XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
>   XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
> +XFS_ERRORTAG_ATTR_RW(relog,		XFS_ERRTAG_RELOG);
>   
>   static struct attribute *xfs_errortag_attrs[] = {
>   	XFS_ERRORTAG_ATTR_LIST(noerror),
> @@ -199,6 +201,7 @@ static struct attribute *xfs_errortag_attrs[] = {
>   	XFS_ERRORTAG_ATTR_LIST(force_repair),
>   	XFS_ERRORTAG_ATTR_LIST(bad_summary),
>   	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
> +	XFS_ERRORTAG_ATTR_LIST(relog),
>   	NULL,
>   };
>   
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index f7f2411ead4e..24e0208b74b8 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -19,6 +19,7 @@
>   #include "xfs_trace.h"
>   #include "xfs_error.h"
>   #include "xfs_defer.h"
> +#include "xfs_errortag.h"
>   
>   kmem_zone_t	*xfs_trans_zone;
>   
> @@ -263,6 +264,11 @@ xfs_trans_alloc(
>   	struct xfs_trans	*tp;
>   	int			error;
>   
> +	/* relogging requires permanent transactions */
> +	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RELOG) &&
> +	    resp->tr_logflags & XFS_TRANS_PERM_LOG_RES)
> +		flags |= XFS_TRANS_RELOG;
> +
>   	/*
>   	 * Allocate the handle before we do our freeze accounting and setting up
>   	 * GFP_NOFS allocation context so that we avoid lockdep false positives
> 


* Re: [RFC v5 PATCH 9/9] xfs: relog random buffers based on errortag
  2020-02-27 13:43 ` [RFC v5 PATCH 9/9] xfs: relog random buffers based on errortag Brian Foster
@ 2020-02-27 23:48   ` Allison Collins
  2020-02-28 14:06     ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Allison Collins @ 2020-02-27 23:48 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 2/27/20 6:43 AM, Brian Foster wrote:
> Since there is currently no specific use case for buffer relogging,
> add some hacky and experimental code to relog random buffers when
> the associated errortag is enabled. Update the relog reservation
> calculation appropriately and use fixed termination logic to help
> ensure that the relog queue doesn't grow indefinitely.
> 
> Note that this patch was useful in causing log reservation deadlocks
> on an fsstress workload if the relog mechanism code is modified to
> acquire its own log reservation rather than rely on the relog
> pre-reservation mechanism. In other words, this helps prove that the
> relog reservation management code effectively avoids log reservation
> deadlocks.
> 

Oh I see, so the last three are sort of an internal test case.  They
look like good sandbox tools for testing, though.  I guess they
don't really get RVBs since they don't apply?  Otherwise looks good for
the purpose they are meant for.  :-)

Allison

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_trans_resv.c |  8 +++++++-
>   fs/xfs/xfs_trans.h             |  4 +++-
>   fs/xfs/xfs_trans_ail.c         | 11 +++++++++++
>   fs/xfs/xfs_trans_buf.c         | 13 +++++++++++++
>   4 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index f49b20c9ca33..59a328a0dec6 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -840,7 +840,13 @@ STATIC uint
>   xfs_calc_relog_reservation(
>   	struct xfs_mount	*mp)
>   {
> -	return xfs_calc_qm_quotaoff_reservation(mp);
> +	uint			res;
> +
> +	res = xfs_calc_qm_quotaoff_reservation(mp);
> +#ifdef DEBUG
> +	res = max(res, xfs_calc_buf_res(4, XFS_FSB_TO_B(mp, 1)));
> +#endif
> +	return res;
>   }
>   
>   void
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 81cb42f552d9..1783441f6d03 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -61,6 +61,7 @@ struct xfs_log_item {
>   #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
>   #define	XFS_LI_RELOG	4	/* automatically relog item */
>   #define	XFS_LI_RELOGGED	5	/* item relogged (not committed) */
> +#define	XFS_LI_RELOG_RAND 6
>   
>   #define XFS_LI_FLAGS \
>   	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
> @@ -68,7 +69,8 @@ struct xfs_log_item {
>   	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
>   	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
>   	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
> -	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
> +	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }, \
> +	{ (1 << XFS_LI_RELOG_RAND),	"RELOG_RAND" }
>   
>   struct xfs_item_ops {
>   	unsigned flags;
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 103ab62e61be..9b1d7c8df6d8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -188,6 +188,17 @@ xfs_ail_relog(
>   			xfs_log_ticket_put(ailp->ail_relog_tic);
>   		spin_unlock(&ailp->ail_lock);
>   
> +		/*
> +		 * Terminate random/debug relogs at a fixed, aggressive rate to
> +		 * avoid building up too much relog activity.
> +		 */
> +		if (test_bit(XFS_LI_RELOG_RAND, &lip->li_flags) &&
> +		    ((prandom_u32() & 1) ||
> +		     (mp->m_flags & XFS_MOUNT_UNMOUNTING))) {
> +			clear_bit(XFS_LI_RELOG_RAND, &lip->li_flags);
> +			xfs_trans_relog_item_cancel(lip, false);
> +		}
> +
>   		/*
>   		 * TODO: Ideally, relog transaction management would be pushed
>   		 * down into the ->iop_push() callbacks rather than playing
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index e17715ac23fc..de7b9a68fe38 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -14,6 +14,8 @@
>   #include "xfs_buf_item.h"
>   #include "xfs_trans_priv.h"
>   #include "xfs_trace.h"
> +#include "xfs_error.h"
> +#include "xfs_errortag.h"
>   
>   /*
>    * Check to see if a buffer matching the given parameters is already
> @@ -527,6 +529,17 @@ xfs_trans_log_buf(
>   
>   	trace_xfs_trans_log_buf(bip);
>   	xfs_buf_item_log(bip, first, last);
> +
> +	/*
> +	 * Relog random buffers so long as the transaction is relog enabled and
> +	 * the buffer wasn't already relogged explicitly.
> +	 */
> +	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_RELOG) &&
> +	    (tp->t_flags & XFS_TRANS_RELOG) &&
> +	    !test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags)) {
> +		if (xfs_trans_relog_buf(tp, bp))
> +			set_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
> +	}
>   }
>   
>   
> 


* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-02-27 13:43 ` [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management Brian Foster
  2020-02-27 20:49   ` Allison Collins
@ 2020-02-28  0:02   ` Darrick J. Wong
  2020-02-28 13:55     ` Brian Foster
  2020-03-02  3:07   ` Dave Chinner
  2 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-28  0:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front.

Ooooh, I had wondered where I was going to find that hook. :)

What does it mean to be capable of enabling relogging of an item?

For the quotaoff example, does this mean that all the transactions that
happen on behalf of a quotaoff operation must specify TRANS_RELOG?

What if the relog thread grinds to a halt while other non-RELOG threads
continue to push things into the log?  Can we starve and/or livelock
waiting around?  Or should the log be able to kick some higher level
thread to inject a TRANS_RELOG transaction to move things along?

> Use reference counting
> to associate the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.

Ok, so we only have to pay the relog reservation while there are
reloggable items floating around in the system.

> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>   allocation time.
> - Once initialized, RELOG transactions check for the existence of
>   the global relog log ticket. If it exists, grab a reference and
>   return. If not, allocate an empty ticket and install into the relog
>   subsystem. Seed the relog ticket from reservation of the current
>   transaction. Roll the current transaction to replenish its
>   reservation and return to the caller.

I guess we'd have to be careful that the transaction we're stealing from
actually has enough reservation to re-log a pending item, but that
shouldn't be difficult.

I worry that there might be some operation somewhere that Just Works
because tr_logcount * tr_logres is enough space for it to run without
having to get more reservation, but (tr_logcount - 1) * tr_logres isn't
enough.  Though that might not be a big issue seeing how bloated the
log reservations become when reflink and rmap are turned on. <cough>
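
(For a concrete feel, with invented numbers: an operation with
tr_logres = 256k and tr_logcount = 4 has 4 * 256k = 1M of runway; if it
actually dirties ~800k across its rolls, then (tr_logcount - 1) *
tr_logres = 768k no longer covers it, even though nothing about the
operation itself changed.)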

> - The transaction is used as normal. If an item is relogged in the
>   transaction, that item acquires a reference on the global relog
>   ticket currently held open by the transaction. The item's reference
>   persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>   global relog ticket. The global relog ticket is released once its
>   reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.

Sounds cool.  /me jumps in.

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  1 +
>  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>  fs/xfs/xfs_trans.h         |  3 ++
>  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_priv.h    |  1 +
>  5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   * made then this algorithm will eventually find all the space it needs.
>   */
>  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>  
>  /*
>   * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>  
>  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>  
>  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>  	tp->t_firstblock = NULLFSBLOCK;
>  
>  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>  	}
>  
>  	trace_xfs_trans_alloc(tp, _RET_IP_);
>  
>  	*tpp = tp;
>  	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;
>  }
>  
>  /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>  			error = -EIO;
>  		tp->t_ticket = NULL;
>  	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */

"Handle"?  Hm.  Do the reloggable items end up attached in some way to
this new transaction, or are we purely stealing the reservation so that
the ail can use it to relog the items on its own?  If it's the second,
then I wonder what handling do we need to do?

Or maybe you meant handling the relog items that the caller attached to
this relog transaction?  Won't those get cancelled the same way they do
now?

Mechanically this looks reasonable.

--D

> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, !!error);
>  	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>  		tp->t_ticket = NULL;
>  	}
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>  	/* mark this thread as no longer being in a transaction */
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 752c7fef9de7..a032989943bd 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
>  void		xfs_trans_cancel(xfs_trans_t *);
>  int		xfs_trans_ail_init(struct xfs_mount *);
>  void		xfs_trans_ail_destroy(struct xfs_mount *);
> +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> +int		xfs_trans_ail_relog_put(struct xfs_mount *);
>  
>  void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
>  				       enum xfs_blft);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be..a3fb64275baa 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>  		xfs_log_space_wake(ailp->ail_mount);
>  }
>  
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}
> +
> +/*
> + * Release a reference to the relog ticket.
> + */
> +int
> +xfs_trans_ail_relog_put(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> +		spin_unlock(&ailp->ail_lock);
> +		return 0;
> +	}
> +
> +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> +	tic = ailp->ail_relog_tic;
> +	ailp->ail_relog_tic = NULL;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	xfs_log_done(mp, tic, NULL, false);
> +	return 0;
> +}
> +
>  int
>  xfs_trans_ail_init(
>  	xfs_mount_t	*mp)
> @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
>  {
>  	struct xfs_ail	*ailp = mp->m_ail;
>  
> +	ASSERT(ailp->ail_relog_tic == NULL);
>  	kthread_stop(ailp->ail_task);
>  	kmem_free(ailp);
>  }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..839df6559b9f 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	struct xlog_ticket	*ail_relog_tic;
>  };
>  
>  /*
> -- 
> 2.21.1
> 


* Re: [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time
  2020-02-27 23:28   ` Darrick J. Wong
@ 2020-02-28  0:10     ` Dave Chinner
  2020-02-28 13:46       ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-02-28  0:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Thu, Feb 27, 2020 at 03:28:53PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:13AM -0500, Brian Foster wrote:
> > The xlog_ticket structure contains a task reference to support
> > blocking for available log reservation. This reference is assigned
> > at ticket allocation time, which assumes that the transaction
> > allocator will acquire reservation in the same context. This is
> > normally true, but will not always be the case with automatic
> > relogging.
> > 
> > There is otherwise no fundamental reason log space cannot be
> > reserved for a ticket from a context different from the allocating
> > context. Move the task assignment to the log reservation blocking
> > code where it is used.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index f6006d94a581..df60942a9804 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -262,6 +262,7 @@ xlog_grant_head_wait(
> >  	int			need_bytes) __releases(&head->lock)
> >  					    __acquires(&head->lock)
> >  {
> > +	tic->t_task = current;
> >  	list_add_tail(&tic->t_queue, &head->waiters);
> >  
> >  	do {
> > @@ -3601,7 +3602,6 @@ xlog_ticket_alloc(
> >  	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> >  
> >  	atomic_set(&tic->t_ref, 1);
> > -	tic->t_task		= current;
> 
> Hm.  So this leaves t_task set to NULL in the ticket constructor in
> favor of setting it in xlog_grant_head_wait.  I guess this implies that
> some future piece will be able to transfer a ticket to another process
> as part of a regrant or something?
> 
> I've been wondering lately if you could transfer a dirty permanent
> transaction to a different task so that the front end could return to
> userspace as soon as the first transaction (with the intent items)
> commits, and then you could reduce the latency of front-end system
> calls.  That's probably a huge fantasy since you'd also have to transfer
> a whole ton of state to that worker and whatever you locked to do the
> operation remains locked...

Yup, that's basically the idea I've raised in the past for "async
XFS" where the front end is completely detached from the back end
that does the internal work, i.e. deferred ops are the basis for
turning XFS into a huge async processing machine.

This isn't a new idea - tux3 was based around this "async back end"
concept, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-02-27 13:43 ` [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism Brian Foster
  2020-02-27 22:54   ` Allison Collins
@ 2020-02-28  0:13   ` Darrick J. Wong
  2020-02-28 14:02     ` Brian Foster
  2020-03-02  7:18   ` Dave Chinner
  2 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-28  0:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item is periodically committed in a new
> transaction to move it forward in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active filesystem under various workloads.
> 
> The act of relogging an item simply requires adding it to a
> transaction and committing. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items to either writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.

Aha!  I bet that's that workqueue I was musing about earlier.

> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_trace.h      |   1 +
>  fs/xfs/xfs_trans.h      |   1 +
>  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans_priv.h |   3 ++
>  4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>  #define XFS_ITEM_PINNED		1
>  #define XFS_ITEM_LOCKED		2
>  #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>  
>  /*
>   * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>  	return lsn;
>  }
>  
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);

Ah, and I see that the work item actually does create its own
transaction to relog the items...

> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {

...but this part really cranks up my curiosity about what happens when
there are more items to relog than there is actual reservation in this
transaction?  I think most transaction types reserve enough space that
we could attach hundreds of relogged intent items.

> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */

XXX?

--D

> +		error = xfs_trans_roll(&tp);
> +		ASSERT(!error);
> +		if (error)
> +			goto out;
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +out:
> +	/* XXX: handle shutdown scenario */
> +	/*
> +	 * Drop the relog reference owned by the transaction separately because
> +	 * we don't want the cancel to release reservation if this isn't the
> +	 * final reference. The relog ticket and associated reservation needs
> +	 * to persist so long as relog items are active in the log subsystem.
> +	 */
> +	xfs_trans_ail_relog_put(mp);
> +
> +	tp->t_ticket = NULL;
> +	xfs_trans_cancel(tp);
> +}
> +
>  /*
>   * The cursor keeps track of where our current traversal is up to by tracking
>   * the next item in the list for us. However, for this to be safe, removing an
> @@ -364,7 +433,7 @@ static long
>  xfsaild_push(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_mount_t		*mp = ailp->ail_mount;
> +	struct xfs_mount	*mp = ailp->ail_mount;
>  	struct xfs_ail_cursor	cur;
>  	struct xfs_log_item	*lip;
>  	xfs_lsn_t		lsn;
> @@ -426,6 +495,23 @@ xfsaild_push(
>  			ailp->ail_last_pushed_lsn = lsn;
>  			break;
>  
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;
> +
>  		case XFS_ITEM_FLUSHING:
>  			/*
>  			 * The item or its backing buffer is already being
> @@ -492,6 +578,9 @@ xfsaild_push(
>  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>  		ailp->ail_log_flush++;
>  
> +	if (!list_empty(&ailp->ail_relog_list))
> +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +
>  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>  out_done:
>  		/*
> @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +					     mp->m_super->s_id);
> +	if (!ailp->ail_relog_wq)
> +		goto out_free_ailp;
>  
>  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>  			ailp->ail_mount->m_super->s_id);
>  	if (IS_ERR(ailp->ail_task))
> -		goto out_free_ailp;
> +		goto out_destroy_wq;
>  
>  	mp->m_ail = ailp;
>  	return 0;
>  
> +out_destroy_wq:
> +	destroy_workqueue(ailp->ail_relog_wq);
>  out_free_ailp:
>  	kmem_free(ailp);
>  	return -ENOMEM;
> @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
>  
>  	ASSERT(ailp->ail_relog_tic == NULL);
>  	kthread_stop(ailp->ail_task);
> +	destroy_workqueue(ailp->ail_relog_wq);
>  	kmem_free(ailp);
>  }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index d1edec1cb8ad..33a724534869 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -63,6 +63,9 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	struct work_struct	ail_relog_work;
> +	struct list_head	ail_relog_list;
> +	struct workqueue_struct	*ail_relog_wq;
>  	struct xlog_ticket	*ail_relog_tic;
>  };
>  
> -- 
> 2.21.1
> 


* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-27 13:43 ` [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent Brian Foster
  2020-02-27 23:19   ` Allison Collins
@ 2020-02-28  1:16   ` Darrick J. Wong
  2020-02-28 14:04     ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-28  1:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:18AM -0500, Brian Foster wrote:
> The quotaoff operation has a rare but longstanding deadlock vector
> in terms of how the operation is logged. A quotaoff start intent is
> logged (synchronously) at the onset to ensure recovery can handle
> the operation if interrupted before in-core changes are made. This
> quotaoff intent pins the log tail while the quotaoff sequence scans
> and purges dquots from all in-core inodes. While this operation
> generally doesn't generate much log traffic on its own, it can be
> time consuming. If unrelated, concurrent filesystem activity
> consumes remaining log space before quotaoff is able to acquire log
> reservation for the quotaoff end intent, the filesystem locks up
> indefinitely.
> 
> quotaoff cannot allocate the end intent before the scan because the
> latter can result in transaction allocation itself in certain
> indirect cases (releasing an inode, for example). Further, rolling
> the original transaction is difficult because the scanning work
> occurs multiple layers down where caller context is lost and not
> much information is available to determine how often to roll the
> transaction.
> 
> To address this problem, enable automatic relogging of the quotaoff
> start intent. This automatically relogs the intent whenever AIL
> pushing finds the item at the tail of the log. When quotaoff
> completes, wait for relogging to complete as the end intent expects
> to be able to permanently remove the start intent from the log
> subsystem. This ensures that the log tail is kept moving during a
> particularly long quotaoff operation and avoids the log reservation
> deadlock.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
>  fs/xfs/xfs_dquot_item.c        |  7 +++++++
>  fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
>  3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 1f5c9e6e1afc..f49b20c9ca33 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
>  	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
>  
>  	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	resp->tr_qm_equotaoff.tr_logres =
>  		xfs_calc_qm_quotaoff_end_reservation();
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index d60647d7197b..ea5123678466 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
>  	struct xfs_log_item	*lip,
>  	struct list_head	*buffer_list)
>  {
> +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> +
> +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> +		return XFS_ITEM_RELOG;
> +
>  	return XFS_ITEM_LOCKED;
>  }
>  
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 1ea82764bf89..7b48d34da0f4 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -18,6 +18,7 @@
>  #include "xfs_quota.h"
>  #include "xfs_qm.h"
>  #include "xfs_icache.h"
> +#include "xfs_trans_priv.h"
>  
>  STATIC int
>  xfs_qm_log_quotaoff(
> @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
>  
>  	*qoffstartp = NULL;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> +				XFS_TRANS_RELOG, &tp);

Humm, maybe I don't understand how this works after all.  From what I
can tell from this patch, (1) the quotaoff transaction is created with
RELOG, so (2) the AIL steals some reservation from it for an eventual
relogging of the quotaoff item, and then (3) we log the quotaoff item.

Later, the AIL can decide to trigger the workqueue item to take the
ticket generated in step (2) to relog the item we logged in step (3) to
move the log tail forward, but what happens if there are further delays
and the AIL needs to relog again?  That ticket from (2) is now used up
and is gone, right?

I suppose some other RELOG transaction could wander in and generate a
new relog ticket, but as this is the only RELOG transaction that gets
created anywhere, that won't happen.  Is there some magic I missed? :)

--D

>  	if (error)
>  		goto out;
>  
>  	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
>  	xfs_trans_log_quotaoff_item(tp, qoffi);
> +	xfs_trans_relog_item(&qoffi->qql_item);
>  
>  	spin_lock(&mp->m_sb_lock);
>  	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>  
> +	/*
> +	 * startqoff must be in the AIL and not the CIL when the end intent
> +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> +	 * relog activity to drain to isolate startqoff to the AIL.
> +	 */
> +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> +
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
>  	if (error)
>  		return error;
> -- 
> 2.21.1
> 


* Re: [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time
  2020-02-28  0:10     ` Dave Chinner
@ 2020-02-28 13:46       ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-28 13:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 28, 2020 at 11:10:00AM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 03:28:53PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 27, 2020 at 08:43:13AM -0500, Brian Foster wrote:
> > > The xlog_ticket structure contains a task reference to support
> > > blocking for available log reservation. This reference is assigned
> > > at ticket allocation time, which assumes that the transaction
> > > allocator will acquire reservation in the same context. This is
> > > normally true, but will not always be the case with automatic
> > > relogging.
> > > 
> > > There is otherwise no fundamental reason log space cannot be
> > > reserved for a ticket from a context different from the allocating
> > > context. Move the task assignment to the log reservation blocking
> > > code where it is used.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index f6006d94a581..df60942a9804 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -262,6 +262,7 @@ xlog_grant_head_wait(
> > >  	int			need_bytes) __releases(&head->lock)
> > >  					    __acquires(&head->lock)
> > >  {
> > > +	tic->t_task = current;
> > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > >  
> > >  	do {
> > > @@ -3601,7 +3602,6 @@ xlog_ticket_alloc(
> > >  	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> > >  
> > >  	atomic_set(&tic->t_ref, 1);
> > > -	tic->t_task		= current;
> > 
> > Hm.  So this leaves t_task set to NULL in the ticket constructor in
> > favor of setting it in xlog_grant_head_wait.  I guess this implies that
> > some future piece will be able to transfer a ticket to another process
> > as part of a regrant or something?
> > 

Pretty much.. it's mostly just breaking the assumption that the task
that allocates a log ticket is necessarily the one that acquires log
reservation (or regrants it). The purpose of this change is to allow any
particular task to allocate (and reserve) a relog ticket and donate
it to the relog mechanism (a separate task) for use (i.e. to roll it).
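
Condensed from the later patches, the handoff looks roughly like this
(sketch only; locking and error handling trimmed):

	/* front end (xfs_trans_ail_relog_reserve): allocate and seed */
	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
	ailp->ail_relog_tic = tic;
	tp->t_ticket->t_curr_res -= tic->t_curr_res;

	/* relog worker (xfs_ail_relog): adopt the donated ticket */
	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
	error = xfs_trans_roll(&tp);

The roll can block for log reservation in the worker's context, which is
why t_task has to be assigned at wait time rather than at ticket
allocation time.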

> > I've been wondering lately if you could transfer a dirty permanent
> > transaction to a different task so that the front end could return to
> > userspace as soon as the first transaction (with the intent items)
> > commits, and then you could reduce the latency of front-end system
> > calls.  That's probably a huge fantasy since you'd also have to transfer
> > a whole ton of state to that worker and whatever you locked to do the
> > operation remains locked...
> 
> Yup, that's basically the idea I've raised in the past for "async
> XFS" where the front end is completely detached from the back end
> that does the internal work, i.e. deferred ops are the basis for
> turning XFS into a huge async processing machine.
> 

I think we've discussed this in the past, though I'm not clear on
whether it would rely on this sort of change. Either way, there's a big
difference in scope between the tweak made by this patch and the design
of a generic async XFS front-end. :)

Brian

> This isn't a new idea - tux3 was based around this "async back end"
> concept, too.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction
  2020-02-27 23:31   ` Darrick J. Wong
@ 2020-02-28 13:52     ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-28 13:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 03:31:53PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:14AM -0500, Brian Foster wrote:
> > Create a transaction reservation specifically for relog
> > transactions. For now it only supports the quotaoff intent, so use
> > the associated reservation.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c | 15 +++++++++++++++
> >  fs/xfs/libxfs/xfs_trans_resv.h |  1 +
> >  2 files changed, 16 insertions(+)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 7a9c04920505..1f5c9e6e1afc 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -832,6 +832,17 @@ xfs_calc_sb_reservation(
> >  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize);
> >  }
> >  
> > +/*
> > + * Internal relog transaction.
> > + *   quotaoff intent
> > + */
> > +STATIC uint
> > +xfs_calc_relog_reservation(
> > +	struct xfs_mount	*mp)
> > +{
> > +	return xfs_calc_qm_quotaoff_reservation(mp);
> 
> So when we add the next reloggable intent item, this will turn this
> into an n-way max(sizeof(type0), sizeof(type1), ...sizeof(typeN)); ?
> 

Possibly. I'm trying to keep things simple for now. So if we suppose the
near term use cases are the quotaoff intent, the scrub EFI intent and
perhaps the writeback stale data exposure zeroing intent, then I'd
probably just leave it as a max of those. We could also multiply that by
some constant factor for a simple form of batching, since the log
reservation is still likely to be on the smaller size.
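
E.g., something like the following (purely illustrative; the
non-quotaoff helpers and the batching multiplier are made up and don't
exist anywhere yet):

STATIC uint
xfs_calc_relog_reservation(
	struct xfs_mount	*mp)
{
	uint			res;

	/* max of all reloggable intent sizes (hypothetical helpers) */
	res = xfs_calc_qm_quotaoff_reservation(mp);
	res = max(res, xfs_calc_scrub_efi_reservation(mp));
	res = max(res, xfs_calc_wb_zero_reservation(mp));

	/* scale by a small constant for a crude form of batching */
	return res * XFS_RELOG_RES_FACTOR;
}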

If, longer term, we end up with relog support for a variety of item types
and the potential for a lot of concurrent relog activity, I'd be more
inclined to consider a specific calculation or to pick off the current
max transaction size or something and require batching to implement some
form of reservation use tracking (i.e., consider an
xfs_trans_add_item_try(...) interface that performed a magical size
check and failed when the transaction is full).

As it is, I don't see enough use case right now to cross that complexity
threshold from the first model to the second right away..

> > +}
> > +
> >  void
> >  xfs_trans_resv_calc(
> >  	struct xfs_mount	*mp,
> > @@ -946,4 +957,8 @@ xfs_trans_resv_calc(
> >  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
> >  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
> >  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> > +
> > +	resp->tr_relog.tr_logres = xfs_calc_relog_reservation(mp);
> > +	resp->tr_relog.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> 
> Relog operations can roll?  I would have figured that you'd simply log
> the old item(s) in a new transaction and commit it, along with some
> magic to let the log tail move forward.  I guess I'll see what happens
> in the next 7 patches. :)
> 

The current scheme is that the relog transaction rolls one item at a
time. This is, again, simplified for the purpose of a POC. For a
production iteration, I'd probably just turn that into a fixed count to
be able to batch 5 or 10 items at a time or something along those lines
(probably more depends on what the transaction size looks like and the
pressure put on by the scrub use case).
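
I.e., keep the xfs_ail_relog() loop as it is but only roll once per N
items, along these lines (sketch; XFS_RELOG_MAX_BATCH is a made-up
tunable, and the relog ticket refcounting and shutdown handling are
trimmed):

	int	count = 0;

	spin_lock(&ailp->ail_lock);
	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
					       struct xfs_log_item,
					       li_trans)) != NULL) {
		list_del_init(&lip->li_trans);
		spin_unlock(&ailp->ail_lock);

		xfs_trans_add_item(tp, lip);
		set_bit(XFS_LI_DIRTY, &lip->li_flags);
		tp->t_flags |= XFS_TRANS_DIRTY;

		/* roll once per batch instead of once per item */
		if (++count >= XFS_RELOG_MAX_BATCH) {
			error = xfs_trans_roll(&tp);
			count = 0;
		}
		spin_lock(&ailp->ail_lock);
	}
	spin_unlock(&ailp->ail_lock);

	/* commit any final partial batch */
	if (count)
		error = xfs_trans_roll(&tp);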

Brian

> --D
> 
> > +	resp->tr_relog.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > index 7241ab28cf84..b723979cad09 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > @@ -50,6 +50,7 @@ struct xfs_trans_resv {
> >  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
> >  	struct xfs_trans_res	tr_sb;		/* modify superblock */
> >  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> > +	struct xfs_trans_res	tr_relog;	/* internal relog transaction */
> >  };
> >  
> >  /* shorthand way of accessing reservation structure */
> > -- 
> > 2.21.1
> > 
> 



* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-02-28  0:02   ` Darrick J. Wong
@ 2020-02-28 13:55     ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-28 13:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 04:02:18PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > Automatic item relogging will occur from xfsaild context. xfsaild
> > cannot acquire log reservation itself because it is also responsible
> > for writeback and thus making used log reservation available again.
> > Since there is no guarantee log reservation is available by the time
> > a relogged item reaches the AIL, this is prone to deadlock.
> > 
> > To guarantee log reservation for automatic relogging, implement a
> > reservation management scheme where a transaction that is capable of
> > enabling relogging of an item must contribute the necessary
> > reservation to the relog mechanism up front.
> 
> Ooooh, I had wondered where I was going to find that hook. :)
> 
> What does it mean to be capable of enabling relogging of an item?
> 

It's basically a requirement introduced by this patch that any
transaction that wants to enable relog on an item needs to check whether
the "central relog ticket" already exists, and if not, it needs to
contribute reservation to it as part of its normal transaction
allocation process.
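
From the caller's side this is just the flag at transaction allocation
plus enabling relog on the item, as the quotaoff patch later in the
series does:

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
				XFS_TRANS_RELOG, &tp);
	if (error)
		goto out;

	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
	xfs_trans_log_quotaoff_item(tp, qoffi);
	xfs_trans_relog_item(&qoffi->qql_item);

The contribution to the central relog ticket happens inside
xfs_trans_alloc() when the flag is set, so the caller doesn't manage any
of the reservation accounting directly.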

> For the quotaoff example, does this mean that all the transactions that
> happen on behalf of a quotaoff operation must specify TRANS_RELOG?
> 

No, only the transaction that commits the quotaoff start intent.

> What if the relog thread grinds to a halt while other non-RELOG threads
> continue to push things into the log?  Can we starve and/or livelock
> waiting around?  Or should the log be able to kick some higher level
> thread to inject a TRANS_RELOG transaction to move things along?
> 

I hope not. :P At least I haven't seen such issues in my (fairly limited
and focused) testing so far.

> > Use reference counting
> > to associate the lifetime of pending relog reservation to the
> > lifetime of in-core log items with relogging enabled.
> 
> Ok, so we only have to pay the relog reservation while there are
> reloggable items floating around in the system.
> 

Yeah, I was back and forth on this because the relog reservation
tracking thing is a bit of complexity itself. It's much simpler to have
the AIL or something allocate a relog transaction up front and just keep
it around for the lifetime of the mount. My thinking was that relogs
could be quite rare (quotaoff, scrub), so it's probably an unfair use of
resources. I figured it best to try and provide some flexibility up
front and we can always back off to something more simplified later.

If we ended up with something like a reloggable "zero on recovery"
intent for writeback, that ties more into core fs functionality and the
simpler relog ticket implementation might be justified.

> > The basic log reservation sequence for a relog enabled transaction
> > is as follows:
> > 
> > - A transaction that uses relogging specifies XFS_TRANS_RELOG at
> >   allocation time.
> > - Once initialized, RELOG transactions check for the existence of
> >   the global relog log ticket. If it exists, grab a reference and
> >   return. If not, allocate an empty ticket and install into the relog
> >   subsystem. Seed the relog ticket from reservation of the current
> >   transaction. Roll the current transaction to replenish its
> >   reservation and return to the caller.
> 
> I guess we'd have to be careful that the transaction we're stealing from
> actually has enough reservation to re-log a pending item, but that
> shouldn't be difficult.
> 
> I worry that there might be some operation somewhere that Just Works
> because tr_logcount * tr_logres is enough space for it to run without
> having to get more reservation, but (tr_logcount - 1) * tr_logres isn't
> enough.  Though that might not be a big issue seeing how bloated the
> log reservations become when reflink and rmap are turned on. <cough>
> 

This ties in to the size calculation of the relog transaction and the
fact that the contributing transaction rolls before it carries on to
normal use. IOW, this approach adds zero extra reservation consumption
to existing transaction commits, so there should be no additional risk
of reservation overrun in that regard.

> > - The transaction is used as normal. If an item is relogged in the
> >   transaction, that item acquires a reference on the global relog
> >   ticket currently held open by the transaction. The item's reference
> >   persists until relogging is disabled on the item.
> > - The RELOG transaction commits and releases its reference to the
> >   global relog ticket. The global relog ticket is released once its
> >   reference count drops to zero.
> > 
> > This provides a central relog log ticket that guarantees reservation
> > availability for relogged items, avoids log reservation deadlocks
> > and is allocated and released on demand.
> 
> Sounds cool.  /me jumps in.
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_shared.h |  1 +
> >  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
> >  fs/xfs/xfs_trans.h         |  3 ++
> >  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans_priv.h    |  1 +
> >  5 files changed, 126 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index c45acbd3add9..0a10ca0853ab 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
> >   * made then this algorithm will eventually find all the space it needs.
> >   */
> >  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> > +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
> >  
> >  /*
> >   * Field values for xfs_trans_mod_sb.
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 3b208f9a865c..8ac05ed8deda 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -107,9 +107,14 @@ xfs_trans_dup(
> >  
> >  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
> >  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> > -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> > -	/* We gave our writer reference to the new transaction */
> > +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> > +		       (tp->t_flags & XFS_TRANS_RELOG);
> > +	/*
> > +	 * The writer reference and relog reference transfer to the new
> > +	 * transaction.
> > +	 */
> >  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> >  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
> >  
> >  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> >  	tp->t_firstblock = NULLFSBLOCK;
> >  
> >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		return error;
> > +	if (error)
> > +		goto error;
> > +
> > +	if (flags & XFS_TRANS_RELOG) {
> > +		error = xfs_trans_ail_relog_reserve(&tp);
> > +		if (error)
> > +			goto error;
> >  	}
> >  
> >  	trace_xfs_trans_alloc(tp, _RET_IP_);
> >  
> >  	*tpp = tp;
> >  	return 0;
> > +
> > +error:
> > +	/* clear relog flag if we haven't acquired a ref */
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> > +	xfs_trans_cancel(tp);
> > +	return error;
> >  }
> >  
> >  /*
> > @@ -973,6 +988,10 @@ __xfs_trans_commit(
> >  
> >  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free(tp);
> >  
> > @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
> >  			error = -EIO;
> >  		tp->t_ticket = NULL;
> >  	}
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	/* XXX: handle RELOG items on transaction abort */
> 
> "Handle"?  Hm.  Do the reloggable items end up attached in some way to
> this new transaction, or are we purely stealing the reservation so that
> the ail can use it to relog the items on its own?  If it's the second,
> then I wonder what handling do we need to do?
> 

More the latter...

> Or maybe you meant handling the relog items that the caller attached to
> this relog transaction?  Won't those get cancelled the same way they do
> now?
> 

This comment was more of a note to self that when putting this together
I hadn't thought through the abort/shutdown path and whether the code is
correct (i.e., should the transaction cancel relog state? I still need
to test an abort of a relog commit, etc.). That's still something I need
to work through, but I wouldn't read more into the comment than that.

Brian

> Mechanically this looks reasonable.
> 
> --D
> 
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free_items(tp, !!error);
> >  	xfs_trans_free(tp);
> > @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
> >  		tp->t_ticket = NULL;
> >  	}
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> >  	/* mark this thread as no longer being in a transaction */
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 752c7fef9de7..a032989943bd 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> >  void		xfs_trans_cancel(xfs_trans_t *);
> >  int		xfs_trans_ail_init(struct xfs_mount *);
> >  void		xfs_trans_ail_destroy(struct xfs_mount *);
> > +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> > +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> > +int		xfs_trans_ail_relog_put(struct xfs_mount *);
> >  
> >  void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
> >  				       enum xfs_blft);
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 00cc5b8734be..a3fb64275baa 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -17,6 +17,7 @@
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  #include "xfs_log.h"
> > +#include "xfs_log_priv.h"
> >  
> >  #ifdef DEBUG
> >  /*
> > @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
> >  		xfs_log_space_wake(ailp->ail_mount);
> >  }
> >  
> > +bool
> > +xfs_trans_ail_relog_get(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		ret = true;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> > + * requires a clean, permanent transaction from the caller. Pull reservation
> > + * for the relog ticket and roll the caller's transaction back to its fully
> > + * reserved state. If the AIL relog ticket is already initialized, grab a
> > + * reference and return.
> > + */
> > +int
> > +xfs_trans_ail_relog_reserve(
> > +	struct xfs_trans	**tpp)
> > +{
> > +	struct xfs_trans	*tp = *tpp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> > +
> > +	if (xfs_trans_ail_relog_get(mp))
> > +		return 0;
> > +
> > +	/* no active ticket, fall into slow path to allocate one.. */
> > +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> > +	if (!tic)
> > +		return -ENOMEM;
> > +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> > +
> > +	/* check again since we dropped the lock for the allocation */
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +		xfs_log_ticket_put(tic);
> > +		return 0;
> > +	}
> > +
> > +	/* attach and reserve space for the ->tr_relog ticket */
> > +	ailp->ail_relog_tic = tic;
> > +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	return xfs_trans_roll(tpp);
> > +}
> > +
> > +/*
> > + * Release a reference to the relog ticket.
> > + */
> > +int
> > +xfs_trans_ail_relog_put(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> > +		spin_unlock(&ailp->ail_lock);
> > +		return 0;
> > +	}
> > +
> > +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> > +	tic = ailp->ail_relog_tic;
> > +	ailp->ail_relog_tic = NULL;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	xfs_log_done(mp, tic, NULL, false);
> > +	return 0;
> > +}
> > +
> >  int
> >  xfs_trans_ail_init(
> >  	xfs_mount_t	*mp)
> > @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
> >  {
> >  	struct xfs_ail	*ailp = mp->m_ail;
> >  
> > +	ASSERT(ailp->ail_relog_tic == NULL);
> >  	kthread_stop(ailp->ail_task);
> >  	kmem_free(ailp);
> >  }
> > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> > index 2e073c1c4614..839df6559b9f 100644
> > --- a/fs/xfs/xfs_trans_priv.h
> > +++ b/fs/xfs/xfs_trans_priv.h
> > @@ -61,6 +61,7 @@ struct xfs_ail {
> >  	int			ail_log_flush;
> >  	struct list_head	ail_buf_list;
> >  	wait_queue_head_t	ail_empty;
> > +	struct xlog_ticket	*ail_relog_tic;
> >  };
> >  
> >  /*
> > -- 
> > 2.21.1
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-02-28  0:13   ` Darrick J. Wong
@ 2020-02-28 14:02     ` Brian Foster
  2020-03-02  7:32       ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-02-28 14:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 04:13:45PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > Now that relog reservation is available and relog state tracking is
> > in place, all that remains to automatically relog items is the relog
> > mechanism itself. An item with relogging enabled is basically pinned
> > from writeback until relog is disabled. Instead of being written
> > back, the item must instead be periodically committed in a new
> > transaction to move it in the physical log. The purpose of moving
> > the item is to avoid long term tail pinning and thus avoid log
> > deadlocks for long running operations.
> > 
> > The ideal time to relog an item is in response to tail pushing
> > pressure. This accommodates the current workload at any given time
> > as opposed to a fixed time interval or log reservation heuristic,
> > which risks performance regression. This is essentially the same
> > heuristic that drives metadata writeback. XFS already implements
> > various log tail pushing heuristics that attempt to keep the log
> > progressing on an active filesystem under various workloads.
> > 
> > The act of relogging an item simply requires adding it to a
> > transaction and committing. This pushes the already dirty item into a
> > subsequent log checkpoint and frees up its previous location in the
> > on-disk log. Joining an item to a transaction of course requires
> > locking the item first, which means we have to be aware of
> > type-specific locks and lock ordering wherever the relog takes
> > place.
> > 
> > Fundamentally, this points to xfsaild as the ideal location to
> > process relog enabled items. xfsaild already processes log resident
> > items, is driven by log tail pushing pressure, processes arbitrary
> > log item types through callbacks, and is sensitive to type-specific
> > locking rules by design. The fact that automatic relogging
> > essentially diverts items between writeback or relog also suggests
> > xfsaild as an ideal location to process items one way or the other.
> > 
> > Of course, we don't want xfsaild to process transactions as it is a
> > critical component of the log subsystem for driving metadata
> > writeback and freeing up log space. Therefore, similar to how
> > xfsaild builds up a writeback queue of dirty items and queues writes
> > asynchronously, make xfsaild responsible only for directing pending
> > relog items into an appropriate queue and create an async
> > (workqueue) context for processing the queue. The workqueue context
> > utilizes the pre-reserved relog ticket to drain the queue by rolling
> > a permanent transaction.
> 
> Aha!  I bet that's that workqueue I was musing about earlier.
> 

:)

> > Update the AIL pushing infrastructure to support a new RELOG item
> > state. If a log item push returns the relog state, queue the item
> > for relog instead of writeback. On completion of a push cycle,
> > schedule the relog task at the same point metadata buffer I/O is
> > submitted. This allows items to be relogged automatically under the
> > same locking rules and pressure heuristics that govern metadata
> > writeback.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_trace.h      |   1 +
> >  fs/xfs/xfs_trans.h      |   1 +
> >  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans_priv.h |   3 ++
> >  4 files changed, 106 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a066617ec54d..df0114ec66f1 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> > +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index fc4c25b6eee4..1637df32c64c 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_RELOG		4
> >  
> >  /*
> >   * Deferred operation item relogging limits.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index a3fb64275baa..71a47faeaae8 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
> >  	return lsn;
> >  }
> >  
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +						     ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans_res	tres = {};
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip;
> > +	int			error;
> > +
> > +	/*
> > +	 * The first transaction to submit a relog item contributed relog
> > +	 * reservation to the relog ticket before committing. Create an empty
> > +	 * transaction and manually associate the relog ticket.
> > +	 */
> > +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> 
> Ah, and I see that the work item actually does create its own
> transaction to relog the items...
> 

Yeah, bit of a hack to allocate a shell transaction and attach the relog
ticket managed by the earlier patch, but it ended up cleaner than having
the transaction allocated and passed into this context.
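
To make that concrete, the lifecycle of the shell transaction boils down
to something like the following (condensed from the patch above; error
handling and locking elided):

	/* allocate an empty, zero reservation transaction shell */
	struct xfs_trans_res	tres = {};
	struct xfs_trans	*tp;

	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);

	/* manually attach the AIL-owned relog ticket and its resv values */
	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);

	/* ... join queued items and roll, as in xfs_ail_relog() ... */

	/*
	 * Drop the relog ref and detach the ticket so the cancel doesn't
	 * release the reservation out from under other relog items.
	 */
	xfs_trans_ail_relog_put(mp);
	tp->t_ticket = NULL;
	xfs_trans_cancel(tp);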

> > +	ASSERT(!error);
> > +	if (error)
> > +		return;
> > +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> > +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> > +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> > +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> > +					       struct xfs_log_item,
> > +					       li_trans)) != NULL) {
> 
> ...but this part really cranks up my curiosity about what happens when
> there are more items to relog than there is actual reservation in this
> transaction?  I think most transaction types reserve enough space that
> we could attach hundreds of relogged intent items.
> 

See my earlier comment around batching. Right now we only relog one item
at a time and the relog reservation is sized to cover the largest
possible reloggable item in the fs. This needs to increase to support
some kind of batching here, but I think the prospective reloggable items
right now (i.e. 2 or 3 different intent types) allow a fixed reservation
size to work well enough for our needs.
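
If we do grow batching support, I'd expect something loosely like the
following (purely hypothetical sketch: it assumes ->iop_size() gives a
good enough space estimate and it ignores the list locking and ticket
refcounting in the current code):

	int	used = 0;

	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
					       struct xfs_log_item,
					       li_trans)) != NULL) {
		int	nvecs, nbytes;

		/* estimate the log space this item consumes */
		lip->li_ops->iop_size(lip, &nvecs, &nbytes);

		/* roll to flush the current batch if this item won't fit */
		if (used && used + nbytes > tp->t_log_res) {
			error = xfs_trans_roll(&tp);
			used = 0;
		}

		list_del_init(&lip->li_trans);
		xfs_trans_add_item(tp, lip);
		set_bit(XFS_LI_DIRTY, &lip->li_flags);
		tp->t_flags |= XFS_TRANS_DIRTY;
		used += nbytes;
	}
	if (used)
		error = xfs_trans_roll(&tp);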

Note that I think there's a whole separate ball of complexity we could
delve into if we wanted to support something like arbitrary, per-item
(set) relog tickets with different reservation values as opposed to one
global, fixed size ticket. That would require some association between
log items and tickets and perhaps other items covered by the same
ticket, etc., but would provide a much more generic mechanism. As it is,
I think that's hugely overkill for the current use cases, but maybe we
find a reason to evolve this into something like that down the road..

> > +		/*
> > +		 * Drop the AIL processing ticket reference once the relog list
> > +		 * is emptied. At this point it's possible for our transaction
> > +		 * to hold the only reference.
> > +		 */
> > +		list_del_init(&lip->li_trans);
> > +		if (list_empty(&ailp->ail_relog_list))
> > +			xfs_log_ticket_put(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/* XXX: include ticket owner task fix */
> 
> XXX?
> 

Oops, this refers to patch 1. I added the patch but forgot to remove the
comment, so this can go away..

Brian

> --D
> 
> > +		error = xfs_trans_roll(&tp);
> > +		ASSERT(!error);
> > +		if (error)
> > +			goto out;
> > +		spin_lock(&ailp->ail_lock);
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +out:
> > +	/* XXX: handle shutdown scenario */
> > +	/*
> > +	 * Drop the relog reference owned by the transaction separately because
> > +	 * we don't want the cancel to release reservation if this isn't the
> > +	 * final reference. The relog ticket and associated reservation needs
> > +	 * to persist so long as relog items are active in the log subsystem.
> > +	 */
> > +	xfs_trans_ail_relog_put(mp);
> > +
> > +	tp->t_ticket = NULL;
> > +	xfs_trans_cancel(tp);
> > +}
> > +
> >  /*
> >   * The cursor keeps track of where our current traversal is up to by tracking
> >   * the next item in the list for us. However, for this to be safe, removing an
> > @@ -364,7 +433,7 @@ static long
> >  xfsaild_push(
> >  	struct xfs_ail		*ailp)
> >  {
> > -	xfs_mount_t		*mp = ailp->ail_mount;
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> >  	struct xfs_ail_cursor	cur;
> >  	struct xfs_log_item	*lip;
> >  	xfs_lsn_t		lsn;
> > @@ -426,6 +495,23 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			/*
> > +			 * The item requires a relog. Add to the pending relog
> > +			 * list and set the relogged bit to prevent further
> > +			 * relog requests. The relog bit and ticket reference
> > +			 * can be dropped from the item at any point, so hold a
> > +			 * relog ticket reference for the pending relog list to
> > +			 * ensure the ticket stays around.
> > +			 */
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			if (list_empty(&ailp->ail_relog_list))
> > +				xfs_log_ticket_get(ailp->ail_relog_tic);
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> > +
> >  		case XFS_ITEM_FLUSHING:
> >  			/*
> >  			 * The item or its backing buffer is already being
> > @@ -492,6 +578,9 @@ xfsaild_push(
> >  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> >  		ailp->ail_log_flush++;
> >  
> > +	if (!list_empty(&ailp->ail_relog_list))
> > +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> > +
> >  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> >  out_done:
> >  		/*
> > @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
> >  	spin_lock_init(&ailp->ail_lock);
> >  	INIT_LIST_HEAD(&ailp->ail_buf_list);
> >  	init_waitqueue_head(&ailp->ail_empty);
> > +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> > +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> > +
> > +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> > +					     mp->m_super->s_id);
> > +	if (!ailp->ail_relog_wq)
> > +		goto out_free_ailp;
> >  
> >  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
> >  			ailp->ail_mount->m_super->s_id);
> >  	if (IS_ERR(ailp->ail_task))
> > -		goto out_free_ailp;
> > +		goto out_destroy_wq;
> >  
> >  	mp->m_ail = ailp;
> >  	return 0;
> >  
> > +out_destroy_wq:
> > +	destroy_workqueue(ailp->ail_relog_wq);
> >  out_free_ailp:
> >  	kmem_free(ailp);
> >  	return -ENOMEM;
> > @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
> >  
> >  	ASSERT(ailp->ail_relog_tic == NULL);
> >  	kthread_stop(ailp->ail_task);
> > +	destroy_workqueue(ailp->ail_relog_wq);
> >  	kmem_free(ailp);
> >  }
> > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> > index d1edec1cb8ad..33a724534869 100644
> > --- a/fs/xfs/xfs_trans_priv.h
> > +++ b/fs/xfs/xfs_trans_priv.h
> > @@ -63,6 +63,9 @@ struct xfs_ail {
> >  	int			ail_log_flush;
> >  	struct list_head	ail_buf_list;
> >  	wait_queue_head_t	ail_empty;
> > +	struct work_struct	ail_relog_work;
> > +	struct list_head	ail_relog_list;
> > +	struct workqueue_struct	*ail_relog_wq;
> >  	struct xlog_ticket	*ail_relog_tic;
> >  };
> >  
> > -- 
> > 2.21.1
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-27 23:19   ` Allison Collins
@ 2020-02-28 14:03     ` Brian Foster
  2020-02-28 18:55       ` Allison Collins
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-02-28 14:03 UTC (permalink / raw)
  To: Allison Collins; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 04:19:42PM -0700, Allison Collins wrote:
> 
> 
> On 2/27/20 6:43 AM, Brian Foster wrote:
> > The quotaoff operation has a rare but longstanding deadlock vector
> > in terms of how the operation is logged. A quotaoff start intent is
> > logged (synchronously) at the onset to ensure recovery can handle
> > the operation if interrupted before in-core changes are made. This
> > quotaoff intent pins the log tail while the quotaoff sequence scans
> > and purges dquots from all in-core inodes. While this operation
> > generally doesn't generate much log traffic on its own, it can be
> > time consuming. If unrelated, concurrent filesystem activity
> > consumes remaining log space before quotaoff is able to acquire log
> > reservation for the quotaoff end intent, the filesystem locks up
> > indefinitely.
> > 
> > quotaoff cannot allocate the end intent before the scan because the
> > latter can result in transaction allocation itself in certain
> > indirect cases (releasing an inode, for example). Further, rolling
> > the original transaction is difficult because the scanning work
> > occurs multiple layers down where caller context is lost and not
> > much information is available to determine how often to roll the
> > transaction.
> > 
> > To address this problem, enable automatic relogging of the quotaoff
> > start intent. This automatically relogs the intent whenever AIL
> > pushing finds the item at the tail of the log. When quotaoff
> > completes, wait for relogging to complete as the end intent expects
> > to be able to permanently remove the start intent from the log
> > subsystem. This ensures that the log tail is kept moving during a
> > particularly long quotaoff operation and avoids the log reservation
> > deadlock.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
> >   fs/xfs/xfs_dquot_item.c        |  7 +++++++
> >   fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
> >   3 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 1f5c9e6e1afc..f49b20c9ca33 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
> >   	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> >   	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> > -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> > +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> What's the reason for the log count change here?  Otherwise looks ok.
> 

Permanent transactions have a separate default log count (2 instead of
1) because they are intended to be rolled (at least once). This
basically means the initial allocation will acquire enough log
reservation for the initial transaction and a subsequent roll, similar
to how all other permanent transactions are initialized in this file.
This is required for quotaoff because the transaction now uses
XFS_TRANS_RELOG, which requires a roll up front.
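
Ignoring the extra headroom the grant code tacks on, the up front
reservation works out roughly as follows:

	/*
	 * Rough sketch only; the real math in xlog_ticket_alloc() adds
	 * per-iclog header overhead and the like.
	 */
	total_res = tr_logres * tr_logcount;

	/* XFS_DEFAULT_LOG_COUNT (1):      one transaction's worth   */
	/* XFS_DEFAULT_PERM_LOG_COUNT (2): initial commit + one roll */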

Brian

> Allison
> >   	resp->tr_qm_equotaoff.tr_logres =
> >   		xfs_calc_qm_quotaoff_end_reservation();
> > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> > index d60647d7197b..ea5123678466 100644
> > --- a/fs/xfs/xfs_dquot_item.c
> > +++ b/fs/xfs/xfs_dquot_item.c
> > @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
> >   	struct xfs_log_item	*lip,
> >   	struct list_head	*buffer_list)
> >   {
> > +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> > +
> > +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> > +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> > +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> > +		return XFS_ITEM_RELOG;
> > +
> >   	return XFS_ITEM_LOCKED;
> >   }
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index 1ea82764bf89..7b48d34da0f4 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -18,6 +18,7 @@
> >   #include "xfs_quota.h"
> >   #include "xfs_qm.h"
> >   #include "xfs_icache.h"
> > +#include "xfs_trans_priv.h"
> >   STATIC int
> >   xfs_qm_log_quotaoff(
> > @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
> >   	*qoffstartp = NULL;
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > +				XFS_TRANS_RELOG, &tp);
> >   	if (error)
> >   		goto out;
> >   	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
> >   	xfs_trans_log_quotaoff_item(tp, qoffi);
> > +	xfs_trans_relog_item(&qoffi->qql_item);
> >   	spin_lock(&mp->m_sb_lock);
> >   	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> > @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
> >   	int			error;
> >   	struct xfs_qoff_logitem	*qoffi;
> > +	/*
> > +	 * startqoff must be in the AIL and not the CIL when the end intent
> > +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> > +	 * relog activity to drain to isolate startqoff to the AIL.
> > +	 */
> > +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> > +
> >   	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> >   	if (error)
> >   		return error;
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-28  1:16   ` Darrick J. Wong
@ 2020-02-28 14:04     ` Brian Foster
  2020-02-29  5:35       ` Darrick J. Wong
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-02-28 14:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 05:16:40PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:18AM -0500, Brian Foster wrote:
> > The quotaoff operation has a rare but longstanding deadlock vector
> > in terms of how the operation is logged. A quotaoff start intent is
> > logged (synchronously) at the onset to ensure recovery can handle
> > the operation if interrupted before in-core changes are made. This
> > quotaoff intent pins the log tail while the quotaoff sequence scans
> > and purges dquots from all in-core inodes. While this operation
> > generally doesn't generate much log traffic on its own, it can be
> > time consuming. If unrelated, concurrent filesystem activity
> > consumes remaining log space before quotaoff is able to acquire log
> > reservation for the quotaoff end intent, the filesystem locks up
> > indefinitely.
> > 
> > quotaoff cannot allocate the end intent before the scan because the
> > latter can result in transaction allocation itself in certain
> > indirect cases (releasing an inode, for example). Further, rolling
> > the original transaction is difficult because the scanning work
> > occurs multiple layers down where caller context is lost and not
> > much information is available to determine how often to roll the
> > transaction.
> > 
> > To address this problem, enable automatic relogging of the quotaoff
> > start intent. This automatically relogs the intent whenever AIL
> > pushing finds the item at the tail of the log. When quotaoff
> > completes, wait for relogging to complete as the end intent expects
> > to be able to permanently remove the start intent from the log
> > subsystem. This ensures that the log tail is kept moving during a
> > particularly long quotaoff operation and avoids the log reservation
> > deadlock.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
> >  fs/xfs/xfs_dquot_item.c        |  7 +++++++
> >  fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
> >  3 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 1f5c9e6e1afc..f49b20c9ca33 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
> >  	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> >  
> >  	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> > -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> > +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> >  	resp->tr_qm_equotaoff.tr_logres =
> >  		xfs_calc_qm_quotaoff_end_reservation();
> > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> > index d60647d7197b..ea5123678466 100644
> > --- a/fs/xfs/xfs_dquot_item.c
> > +++ b/fs/xfs/xfs_dquot_item.c
> > @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
> >  	struct xfs_log_item	*lip,
> >  	struct list_head	*buffer_list)
> >  {
> > +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> > +
> > +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> > +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> > +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> > +		return XFS_ITEM_RELOG;
> > +
> >  	return XFS_ITEM_LOCKED;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index 1ea82764bf89..7b48d34da0f4 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -18,6 +18,7 @@
> >  #include "xfs_quota.h"
> >  #include "xfs_qm.h"
> >  #include "xfs_icache.h"
> > +#include "xfs_trans_priv.h"
> >  
> >  STATIC int
> >  xfs_qm_log_quotaoff(
> > @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
> >  
> >  	*qoffstartp = NULL;
> >  
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > +				XFS_TRANS_RELOG, &tp);
> 
> Humm, maybe I don't understand how this works after all.  From what I
> can tell from this patch, (1) the quotaoff transaction is created with
> RELOG, so (2) the AIL steals some reservation from it for an eventual
> relogging of the quotaoff item, and then (3) we log the quotaoff item.
> 

Yep.

> Later, the AIL can decide to trigger the workqueue item to take the
> ticket generated in step (2) to relog the item we logged in step (3) to
> move the log tail forward, but what happens if there are further delays
> and the AIL needs to relog again?  That ticket from (2) is now used up
> and is gone, right?
> 
> I suppose some other RELOG transaction could wander in and generate a
> new relog ticket, but as this is the only RELOG transaction that gets
> created anywhere, that won't happen.  Is there some magic I missed? :)
> 

xfs_ail_relog() only ever rolls its transaction, even if nothing else
happens to be queued at the time, so the relog ticket constantly
regrants. Since relogs never commit, the relog ticket always has
available relog reservation so long as XFS_LI_RELOG items exist. Once
there are no more relog items or transactions, the pending reservation
is released via xfs_trans_ail_relog_put() -> xfs_log_done().

It might be more simple to reason about the reservation model if you
factor out the dynamic relog ticket bits. This is basically equivalent
to the AIL allocating a relog transaction at mount time, constantly
rolling it with relog items when they pass through, and then cancelling
the reservation at unmount time. All of the extra XFS_TRANS_RELOG and
reference counting and ticket management stuff is purely so we only have
an active relog reservation when relogging is being used.
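
In other words, conceptually (ignoring the refcounting and dynamic
ticket setup entirely):

	/* at mount time: acquire a permanent relog reservation */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_relog, 0, 0, 0, &tp);

	/* for each relog item that passes through the AIL */
	xfs_trans_add_item(tp, lip);
	set_bit(XFS_LI_DIRTY, &lip->li_flags);
	tp->t_flags |= XFS_TRANS_DIRTY;
	error = xfs_trans_roll(&tp);	/* commit and regrant */

	/* at unmount time: release the reservation */
	xfs_trans_cancel(tp);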

Brian

> --D
> 
> >  	if (error)
> >  		goto out;
> >  
> >  	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
> >  	xfs_trans_log_quotaoff_item(tp, qoffi);
> > +	xfs_trans_relog_item(&qoffi->qql_item);
> >  
> >  	spin_lock(&mp->m_sb_lock);
> >  	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> > @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
> >  	int			error;
> >  	struct xfs_qoff_logitem	*qoffi;
> >  
> > +	/*
> > +	 * startqoff must be in the AIL and not the CIL when the end intent
> > +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> > +	 * relog activity to drain to isolate startqoff to the AIL.
> > +	 */
> > +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> > +
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> >  	if (error)
> >  		return error;
> > -- 
> > 2.21.1
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-02-27 23:33   ` Allison Collins
@ 2020-02-28 14:04     ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-28 14:04 UTC (permalink / raw)
  To: Allison Collins; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 04:33:26PM -0700, Allison Collins wrote:
> On 2/27/20 6:43 AM, Brian Foster wrote:
> > Add a quick and dirty implementation of buffer relogging support.
> > There is currently no use case for buffer relogging. This is for
> > experimental use only and serves as an example to demonstrate the
> > ability to relog arbitrary items in the future, if necessary.
> > 
> > Add a hook to enable relogging a buffer in a transaction, update the
> > buffer log item handlers to support relogged BLIs and update the
> > relog handler to join the relogged buffer to the relog transaction.
> > 
> Alrighty, thanks for the example!  It sounds like it's meant more to be a
> demo than to really be applied though?
> 

Yeah, I just wanted to include something that demonstrates how this can
be used for something other than intents, because that concern was
raised on previous versions...
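
For reference, a hypothetical caller would look something along these
lines (untested sketch; the tr_itruncate reservation is just a
placeholder and error handling is elided):

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0,
				XFS_TRANS_RELOG, &tp);

	/* bp is locked; the transaction releases it at commit */
	xfs_trans_bjoin(tp, bp);
	xfs_trans_log_buf(tp, bp, 0, BBTOB(bp->b_length) - 1);

	/* pin the buffer via relog; fails if already queued for I/O */
	if (!xfs_trans_relog_buf(tp, bp)) {
		/* fall back to normal writeback */
	}

	error = xfs_trans_commit(tp);

	/* ... later, once the buffer can be written back normally ... */
	xfs_trans_relog_item_cancel(&bp->b_log_item->bli_item, true);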

Brian

> Allison
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   fs/xfs/xfs_buf_item.c  |  5 +++++
> >   fs/xfs/xfs_trans.h     |  1 +
> >   fs/xfs/xfs_trans_ail.c | 19 ++++++++++++++++---
> >   fs/xfs/xfs_trans_buf.c | 22 ++++++++++++++++++++++
> >   4 files changed, 44 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 663810e6cd59..4ef2725fa8ce 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -463,6 +463,7 @@ xfs_buf_item_unpin(
> >   			list_del_init(&bp->b_li_list);
> >   			bp->b_iodone = NULL;
> >   		} else {
> > +			xfs_trans_relog_item_cancel(lip, false);
> >   			spin_lock(&ailp->ail_lock);
> >   			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
> >   			xfs_buf_item_relse(bp);
> > @@ -528,6 +529,9 @@ xfs_buf_item_push(
> >   		return XFS_ITEM_LOCKED;
> >   	}
> > +	if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> > +		return XFS_ITEM_RELOG;
> > +
> >   	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
> >   	trace_xfs_buf_item_push(bip);
> > @@ -956,6 +960,7 @@ STATIC void
> >   xfs_buf_item_free(
> >   	struct xfs_buf_log_item	*bip)
> >   {
> > +	ASSERT(!test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags));
> >   	xfs_buf_item_free_format(bip);
> >   	kmem_free(bip->bli_item.li_lv_shadow);
> >   	kmem_cache_free(xfs_buf_item_zone, bip);
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 1637df32c64c..81cb42f552d9 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -226,6 +226,7 @@ void		xfs_trans_inode_buf(xfs_trans_t *, struct xfs_buf *);
> >   void		xfs_trans_stale_inode_buf(xfs_trans_t *, struct xfs_buf *);
> >   bool		xfs_trans_ordered_buf(xfs_trans_t *, struct xfs_buf *);
> >   void		xfs_trans_dquot_buf(xfs_trans_t *, struct xfs_buf *, uint);
> > +bool		xfs_trans_relog_buf(struct xfs_trans *, struct xfs_buf *);
> >   void		xfs_trans_inode_alloc_buf(xfs_trans_t *, struct xfs_buf *);
> >   void		xfs_trans_ichgtime(struct xfs_trans *, struct xfs_inode *, int);
> >   void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 71a47faeaae8..103ab62e61be 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -18,6 +18,7 @@
> >   #include "xfs_error.h"
> >   #include "xfs_log.h"
> >   #include "xfs_log_priv.h"
> > +#include "xfs_buf_item.h"
> >   #ifdef DEBUG
> >   /*
> > @@ -187,9 +188,21 @@ xfs_ail_relog(
> >   			xfs_log_ticket_put(ailp->ail_relog_tic);
> >   		spin_unlock(&ailp->ail_lock);
> > -		xfs_trans_add_item(tp, lip);
> > -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > -		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/*
> > +		 * TODO: Ideally, relog transaction management would be pushed
> > +		 * down into the ->iop_push() callbacks rather than playing
> > +		 * games with ->li_trans and looking at log item types here.
> > +		 */
> > +		if (lip->li_type == XFS_LI_BUF) {
> > +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> > +			xfs_buf_hold(bli->bli_buf);
> > +			xfs_trans_bjoin(tp, bli->bli_buf);
> > +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> > +		} else {
> > +			xfs_trans_add_item(tp, lip);
> > +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +			tp->t_flags |= XFS_TRANS_DIRTY;
> > +		}
> >   		/* XXX: include ticket owner task fix */
> >   		error = xfs_trans_roll(&tp);
> >   		ASSERT(!error);
> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 08174ffa2118..e17715ac23fc 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -787,3 +787,25 @@ xfs_trans_dquot_buf(
> >   	xfs_trans_buf_set_type(tp, bp, type);
> >   }
> > +
> > +/*
> > + * Enable automatic relogging on a buffer. This essentially pins a dirty buffer
> > + * in-core until relogging is disabled. Note that the buffer must not already be
> > + * queued for writeback.
> > + */
> > +bool
> > +xfs_trans_relog_buf(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*bp)
> > +{
> > +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_RELOG);
> > +	ASSERT(xfs_buf_islocked(bp));
> > +
> > +	if (bp->b_flags & _XBF_DELWRI_Q)
> > +		return false;
> > +
> > +	xfs_trans_relog_item(&bip->bli_item);
> > +	return true;
> > +}
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 9/9] xfs: relog random buffers based on errortag
  2020-02-27 23:48   ` Allison Collins
@ 2020-02-28 14:06     ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-28 14:06 UTC (permalink / raw)
  To: Allison Collins; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 04:48:59PM -0700, Allison Collins wrote:
> 
> 
> On 2/27/20 6:43 AM, Brian Foster wrote:
> > Since there is currently no specific use case for buffer relogging,
> > add some hacky and experimental code to relog random buffers when
> > the associated errortag is enabled. Update the relog reservation
> > calculation appropriately and use fixed termination logic to help
> > ensure that the relog queue doesn't grow indefinitely.
> > 
> > Note that this patch was useful in causing log reservation deadlocks
> > on an fsstress workload if the relog mechanism code is modified to
> > acquire its own log reservation rather than rely on the relog
> > pre-reservation mechanism. In other words, this helps prove that the
> > relog reservation management code effectively avoids log reservation
> > deadlocks.
> > 
> 
> Oh I see, so the last three are sort of an internal test case.  They look
> like they are good sandbox tools for testing though.  I guess they don't
> really get RVBs since they don't apply?  Otherwise looks good for the purpose
> they are meant for.  :-)
> 

Right. I'm actually not opposed to polishing up the buffer relogging
code and packaging it with an fstest that invokes the errortag, but the
original intent is for these to stay RFC and drop off the series. I have
already cleaned up some of the code to use a new ->iop_relog() callback
and also have a patch to use buffer relogging as a solution for an
already fixed xattr buffer write verifier failure bug caused by
premature writeback. That use case is contrived (aside from being
already fixed), however, so by itself doesn't justify inclusion of the
buffer bits. I think it's more a question of whether anybody looks at
this patch and can think of any reasonable future use case ideas. If so, it
might be worth retaining and working out any kinks..
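
FWIW, the ->iop_relog() cleanup mentioned above looks something like
this (speculative sketch, not what's posted in this series):

	/* new callback in struct xfs_item_ops (assumed signature) */
	void	(*iop_relog)(struct xfs_log_item *lip, struct xfs_trans *tp);

	/*
	 * Buffer item implementation, replacing the li_type special
	 * casing currently open coded in xfs_ail_relog().
	 */
	STATIC void
	xfs_buf_item_relog(
		struct xfs_log_item	*lip,
		struct xfs_trans	*tp)
	{
		struct xfs_buf_log_item	*bip = BUF_ITEM(lip);

		xfs_buf_hold(bip->bli_buf);
		xfs_trans_bjoin(tp, bip->bli_buf);
		xfs_trans_dirty_buf(tp, bip->bli_buf);
	}

xfs_ail_relog() would then just call lip->li_ops->iop_relog(lip, tp) and
the type-specific games with ->li_trans go away.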

Brian

> Allison
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   fs/xfs/libxfs/xfs_trans_resv.c |  8 +++++++-
> >   fs/xfs/xfs_trans.h             |  4 +++-
> >   fs/xfs/xfs_trans_ail.c         | 11 +++++++++++
> >   fs/xfs/xfs_trans_buf.c         | 13 +++++++++++++
> >   4 files changed, 34 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index f49b20c9ca33..59a328a0dec6 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -840,7 +840,13 @@ STATIC uint
> >   xfs_calc_relog_reservation(
> >   	struct xfs_mount	*mp)
> >   {
> > -	return xfs_calc_qm_quotaoff_reservation(mp);
> > +	uint			res;
> > +
> > +	res = xfs_calc_qm_quotaoff_reservation(mp);
> > +#ifdef DEBUG
> > +	res = max(res, xfs_calc_buf_res(4, XFS_FSB_TO_B(mp, 1)));
> > +#endif
> > +	return res;
> >   }
> >   void
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 81cb42f552d9..1783441f6d03 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -61,6 +61,7 @@ struct xfs_log_item {
> >   #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> >   #define	XFS_LI_RELOG	4	/* automatically relog item */
> >   #define	XFS_LI_RELOGGED	5	/* item relogged (not committed) */
> > +#define	XFS_LI_RELOG_RAND 6
> >   #define XFS_LI_FLAGS \
> >   	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
> > @@ -68,7 +69,8 @@ struct xfs_log_item {
> >   	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
> >   	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
> >   	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
> > -	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
> > +	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }, \
> > +	{ (1 << XFS_LI_RELOG_RAND),	"RELOG_RAND" }
> >   struct xfs_item_ops {
> >   	unsigned flags;
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 103ab62e61be..9b1d7c8df6d8 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -188,6 +188,17 @@ xfs_ail_relog(
> >   			xfs_log_ticket_put(ailp->ail_relog_tic);
> >   		spin_unlock(&ailp->ail_lock);
> > +		/*
> > +		 * Terminate random/debug relogs at a fixed, aggressive rate to
> > +		 * avoid building up too much relog activity.
> > +		 */
> > +		if (test_bit(XFS_LI_RELOG_RAND, &lip->li_flags) &&
> > +		    ((prandom_u32() & 1) ||
> > +		     (mp->m_flags & XFS_MOUNT_UNMOUNTING))) {
> > +			clear_bit(XFS_LI_RELOG_RAND, &lip->li_flags);
> > +			xfs_trans_relog_item_cancel(lip, false);
> > +		}
> > +
> >   		/*
> >   		 * TODO: Ideally, relog transaction management would be pushed
> >   		 * down into the ->iop_push() callbacks rather than playing
> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index e17715ac23fc..de7b9a68fe38 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -14,6 +14,8 @@
> >   #include "xfs_buf_item.h"
> >   #include "xfs_trans_priv.h"
> >   #include "xfs_trace.h"
> > +#include "xfs_error.h"
> > +#include "xfs_errortag.h"
> >   /*
> >    * Check to see if a buffer matching the given parameters is already
> > @@ -527,6 +529,17 @@ xfs_trans_log_buf(
> >   	trace_xfs_trans_log_buf(bip);
> >   	xfs_buf_item_log(bip, first, last);
> > +
> > +	/*
> > +	 * Relog random buffers so long as the transaction is relog enabled and
> > +	 * the buffer wasn't already relogged explicitly.
> > +	 */
> > +	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_RELOG) &&
> > +	    (tp->t_flags & XFS_TRANS_RELOG) &&
> > +	    !test_bit(XFS_LI_RELOG, &bip->bli_item.li_flags)) {
> > +		if (xfs_trans_relog_buf(tp, bp))
> > +			set_bit(XFS_LI_RELOG_RAND, &bip->bli_item.li_flags);
> > +	}
> >   }
> > 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-28 14:03     ` Brian Foster
@ 2020-02-28 18:55       ` Allison Collins
  0 siblings, 0 replies; 59+ messages in thread
From: Allison Collins @ 2020-02-28 18:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On 2/28/20 7:03 AM, Brian Foster wrote:
> On Thu, Feb 27, 2020 at 04:19:42PM -0700, Allison Collins wrote:
>>
>>
>> On 2/27/20 6:43 AM, Brian Foster wrote:
>>> The quotaoff operation has a rare but longstanding deadlock vector
>>> in terms of how the operation is logged. A quotaoff start intent is
>>> logged (synchronously) at the onset to ensure recovery can handle
>>> the operation if interrupted before in-core changes are made. This
>>> quotaoff intent pins the log tail while the quotaoff sequence scans
>>> and purges dquots from all in-core inodes. While this operation
>>> generally doesn't generate much log traffic on its own, it can be
>>> time consuming. If unrelated, concurrent filesystem activity
>>> consumes remaining log space before quotaoff is able to acquire log
>>> reservation for the quotaoff end intent, the filesystem locks up
>>> indefinitely.
>>>
>>> quotaoff cannot allocate the end intent before the scan because the
>>> latter can result in transaction allocation itself in certain
>>> indirect cases (releasing an inode, for example). Further, rolling
>>> the original transaction is difficult because the scanning work
>>> occurs multiple layers down where caller context is lost and not
>>> much information is available to determine how often to roll the
>>> transaction.
>>>
>>> To address this problem, enable automatic relogging of the quotaoff
>>> start intent. This automatically relogs the intent whenever AIL
>>> pushing finds the item at the tail of the log. When quotaoff
>>> completes, wait for relogging to complete as the end intent expects
>>> to be able to permanently remove the start intent from the log
>>> subsystem. This ensures that the log tail is kept moving during a
>>> particularly long quotaoff operation and avoids the log reservation
>>> deadlock.
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>    fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
>>>    fs/xfs/xfs_dquot_item.c        |  7 +++++++
>>>    fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
>>>    3 files changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
>>> index 1f5c9e6e1afc..f49b20c9ca33 100644
>>> --- a/fs/xfs/libxfs/xfs_trans_resv.c
>>> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
>>> @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
>>>    	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
>>>    	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
>>> -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
>>> +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
>>> +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>> What's the reason for the log count change here?  Otherwise looks ok.
>>
> 
> Permanent transactions have a separate default log count (2 instead of
> 1) because they are intended to be rolled (at least once). This
> basically means the initial allocation will acquire enough log
> reservation for the initial transaction and a subsequent roll, similar
> to how all other permanent transactions are initialized in this file.
> This is required for quotaoff because the transaction now uses
> XFS_TRANS_RELOG, which requires a roll up front.
> 
> Brian

Ok, I see.  Thank you for the explanation!
Allison

> 
>> Allison
>>>    	resp->tr_qm_equotaoff.tr_logres =
>>>    		xfs_calc_qm_quotaoff_end_reservation();
>>> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
>>> index d60647d7197b..ea5123678466 100644
>>> --- a/fs/xfs/xfs_dquot_item.c
>>> +++ b/fs/xfs/xfs_dquot_item.c
>>> @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
>>>    	struct xfs_log_item	*lip,
>>>    	struct list_head	*buffer_list)
>>>    {
>>> +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
>>> +
>>> +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
>>> +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
>>> +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
>>> +		return XFS_ITEM_RELOG;
>>> +
>>>    	return XFS_ITEM_LOCKED;
>>>    }
>>> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
>>> index 1ea82764bf89..7b48d34da0f4 100644
>>> --- a/fs/xfs/xfs_qm_syscalls.c
>>> +++ b/fs/xfs/xfs_qm_syscalls.c
>>> @@ -18,6 +18,7 @@
>>>    #include "xfs_quota.h"
>>>    #include "xfs_qm.h"
>>>    #include "xfs_icache.h"
>>> +#include "xfs_trans_priv.h"
>>>    STATIC int
>>>    xfs_qm_log_quotaoff(
>>> @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
>>>    	*qoffstartp = NULL;
>>> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
>>> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
>>> +				XFS_TRANS_RELOG, &tp);
>>>    	if (error)
>>>    		goto out;
>>>    	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
>>>    	xfs_trans_log_quotaoff_item(tp, qoffi);
>>> +	xfs_trans_relog_item(&qoffi->qql_item);
>>>    	spin_lock(&mp->m_sb_lock);
>>>    	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
>>> @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
>>>    	int			error;
>>>    	struct xfs_qoff_logitem	*qoffi;
>>> +	/*
>>> +	 * startqoff must be in the AIL and not the CIL when the end intent
>>> +	 * commits to ensure it is not readded to the AIL out of order. Wait on
>>> +	 * relog activity to drain to isolate startqoff to the AIL.
>>> +	 */
>>> +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
>>> +
>>>    	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
>>>    	if (error)
>>>    		return error;
>>>
>>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-28 14:04     ` Brian Foster
@ 2020-02-29  5:35       ` Darrick J. Wong
  2020-02-29 12:15         ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2020-02-29  5:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Feb 28, 2020 at 09:04:13AM -0500, Brian Foster wrote:
> On Thu, Feb 27, 2020 at 05:16:40PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 27, 2020 at 08:43:18AM -0500, Brian Foster wrote:
> > > The quotaoff operation has a rare but longstanding deadlock vector
> > > in terms of how the operation is logged. A quotaoff start intent is
> > > logged (synchronously) at the onset to ensure recovery can handle
> > > the operation if interrupted before in-core changes are made. This
> > > quotaoff intent pins the log tail while the quotaoff sequence scans
> > > and purges dquots from all in-core inodes. While this operation
> > > generally doesn't generate much log traffic on its own, it can be
> > > time consuming. If unrelated, concurrent filesystem activity
> > > consumes remaining log space before quotaoff is able to acquire log
> > > reservation for the quotaoff end intent, the filesystem locks up
> > > indefinitely.
> > > 
> > > quotaoff cannot allocate the end intent before the scan because the
> > > latter can result in transaction allocation itself in certain
> > > indirect cases (releasing an inode, for example). Further, rolling
> > > the original transaction is difficult because the scanning work
> > > occurs multiple layers down where caller context is lost and not
> > > much information is available to determine how often to roll the
> > > transaction.
> > > 
> > > To address this problem, enable automatic relogging of the quotaoff
> > > start intent. This automatically relogs the intent whenever AIL
> > > pushing finds the item at the tail of the log. When quotaoff
> > > completes, wait for relogging to complete as the end intent expects
> > > to be able to permanently remove the start intent from the log
> > > subsystem. This ensures that the log tail is kept moving during a
> > > particularly long quotaoff operation and avoids the log reservation
> > > deadlock.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
> > >  fs/xfs/xfs_dquot_item.c        |  7 +++++++
> > >  fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
> > >  3 files changed, 20 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > index 1f5c9e6e1afc..f49b20c9ca33 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
> > >  	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > >  
> > >  	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> > > -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > > +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> > > +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> > >  
> > >  	resp->tr_qm_equotaoff.tr_logres =
> > >  		xfs_calc_qm_quotaoff_end_reservation();
> > > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> > > index d60647d7197b..ea5123678466 100644
> > > --- a/fs/xfs/xfs_dquot_item.c
> > > +++ b/fs/xfs/xfs_dquot_item.c
> > > @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
> > >  	struct xfs_log_item	*lip,
> > >  	struct list_head	*buffer_list)
> > >  {
> > > +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> > > +
> > > +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> > > +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> > > +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> > > +		return XFS_ITEM_RELOG;
> > > +
> > >  	return XFS_ITEM_LOCKED;
> > >  }
> > >  
> > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > index 1ea82764bf89..7b48d34da0f4 100644
> > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > @@ -18,6 +18,7 @@
> > >  #include "xfs_quota.h"
> > >  #include "xfs_qm.h"
> > >  #include "xfs_icache.h"
> > > +#include "xfs_trans_priv.h"
> > >  
> > >  STATIC int
> > >  xfs_qm_log_quotaoff(
> > > @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
> > >  
> > >  	*qoffstartp = NULL;
> > >  
> > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > +				XFS_TRANS_RELOG, &tp);
> > 
> > Humm, maybe I don't understand how this works after all.  From what I
> > can tell from this patch, (1) the quotaoff transaction is created with
> > RELOG, so (2) the AIL steals some reservation from it for an eventual
> > relogging of the quotaoff item, and then (3) we log the quotaoff item.
> > 
> 
> Yep.
> 
> > Later, the AIL can decide to trigger the workqueue item to take the
> > ticket generated in step (2) to relog the item we logged in step (3) to
> > move the log tail forward, but what happens if there are further delays
> > and the AIL needs to relog again?  That ticket from (2) is now used up
> > and is gone, right?
> > 
> > I suppose some other RELOG transaction could wander in and generate a
> > new relog ticket, but as this is the only RELOG transaction that gets
> > created anywhere, that won't happen.  Is there some magic I missed? :)
> > 
> 
> xfs_ail_relog() only ever rolls its transaction, even if nothing else
> happens to be queued at the time, so the relog ticket constantly
> regrants. Since relogs never commit, the relog ticket always has
> available relog reservation so long as XFS_LI_RELOG items exist. Once
> there are no more relog items or transactions, the pending reservation
> is released via xfs_trans_ail_relog_put() -> xfs_log_done().

Aha, that's the subtlety I didn't quite catch. :)

Now that I see how this works for the simple case, I guess I'll try to
figure out on my own what would happen if we flooded the system with a
/lot/ of reloggable items.  Though I bet you've already done that, given
our earlier speculation about closing the writeback hole.

> It might be more simple to reason about the reservation model if you
> factor out the dynamic relog ticket bits. This is basically equivalent
> to the AIL allocating a relog transaction at mount time, constantly
> rolling it with relog items when they pass through, and then cancelling
> the reservation at unmount time. All of the extra XFS_TRANS_RELOG and
> reference counting and ticket management stuff is purely so we only have
> an active relog reservation when relogging is being used.

<nod>

--D

> Brian
> 
> > --D
> > 
> > >  	if (error)
> > >  		goto out;
> > >  
> > >  	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
> > >  	xfs_trans_log_quotaoff_item(tp, qoffi);
> > > +	xfs_trans_relog_item(&qoffi->qql_item);
> > >  
> > >  	spin_lock(&mp->m_sb_lock);
> > >  	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> > > @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
> > >  	int			error;
> > >  	struct xfs_qoff_logitem	*qoffi;
> > >  
> > > +	/*
> > > +	 * startqoff must be in the AIL and not the CIL when the end intent
> > > +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> > > +	 * relog activity to drain to isolate startqoff to the AIL.
> > > +	 */
> > > +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> > > +
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> > >  	if (error)
> > >  		return error;
> > > -- 
> > > 2.21.1
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent
  2020-02-29  5:35       ` Darrick J. Wong
@ 2020-02-29 12:15         ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-02-29 12:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Feb 28, 2020 at 09:35:31PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 28, 2020 at 09:04:13AM -0500, Brian Foster wrote:
> > On Thu, Feb 27, 2020 at 05:16:40PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 27, 2020 at 08:43:18AM -0500, Brian Foster wrote:
> > > > The quotaoff operation has a rare but longstanding deadlock vector
> > > > in terms of how the operation is logged. A quotaoff start intent is
> > > > logged (synchronously) at the onset to ensure recovery can handle
> > > > the operation if interrupted before in-core changes are made. This
> > > > quotaoff intent pins the log tail while the quotaoff sequence scans
> > > > and purges dquots from all in-core inodes. While this operation
> > > > generally doesn't generate much log traffic on its own, it can be
> > > > time consuming. If unrelated, concurrent filesystem activity
> > > > consumes remaining log space before quotaoff is able to acquire log
> > > > reservation for the quotaoff end intent, the filesystem locks up
> > > > indefinitely.
> > > > 
> > > > quotaoff cannot allocate the end intent before the scan because the
> > > > latter can result in transaction allocation itself in certain
> > > > indirect cases (releasing an inode, for example). Further, rolling
> > > > the original transaction is difficult because the scanning work
> > > > occurs multiple layers down where caller context is lost and not
> > > > much information is available to determine how often to roll the
> > > > transaction.
> > > > 
> > > > To address this problem, enable automatic relogging of the quotaoff
> > > > start intent. This automatically relogs the intent whenever AIL
> > > > pushing finds the item at the tail of the log. When quotaoff
> > > > completes, wait for relogging to complete as the end intent expects
> > > > to be able to permanently remove the start intent from the log
> > > > subsystem. This ensures that the log tail is kept moving during a
> > > > particularly long quotaoff operation and avoids the log reservation
> > > > deadlock.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_trans_resv.c |  3 ++-
> > > >  fs/xfs/xfs_dquot_item.c        |  7 +++++++
> > > >  fs/xfs/xfs_qm_syscalls.c       | 12 +++++++++++-
> > > >  3 files changed, 20 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > > index 1f5c9e6e1afc..f49b20c9ca33 100644
> > > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > > @@ -935,7 +935,8 @@ xfs_trans_resv_calc(
> > > >  	resp->tr_qm_setqlim.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > > >  
> > > >  	resp->tr_qm_quotaoff.tr_logres = xfs_calc_qm_quotaoff_reservation(mp);
> > > > -	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_LOG_COUNT;
> > > > +	resp->tr_qm_quotaoff.tr_logcount = XFS_DEFAULT_PERM_LOG_COUNT;
> > > > +	resp->tr_qm_quotaoff.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> > > >  
> > > >  	resp->tr_qm_equotaoff.tr_logres =
> > > >  		xfs_calc_qm_quotaoff_end_reservation();
> > > > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> > > > index d60647d7197b..ea5123678466 100644
> > > > --- a/fs/xfs/xfs_dquot_item.c
> > > > +++ b/fs/xfs/xfs_dquot_item.c
> > > > @@ -297,6 +297,13 @@ xfs_qm_qoff_logitem_push(
> > > >  	struct xfs_log_item	*lip,
> > > >  	struct list_head	*buffer_list)
> > > >  {
> > > > +	struct xfs_log_item	*mlip = xfs_ail_min(lip->li_ailp);
> > > > +
> > > > +	if (test_bit(XFS_LI_RELOG, &lip->li_flags) &&
> > > > +	    !test_bit(XFS_LI_RELOGGED, &lip->li_flags) &&
> > > > +	    !XFS_LSN_CMP(lip->li_lsn, mlip->li_lsn))
> > > > +		return XFS_ITEM_RELOG;
> > > > +
> > > >  	return XFS_ITEM_LOCKED;
> > > >  }
> > > >  
> > > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > > index 1ea82764bf89..7b48d34da0f4 100644
> > > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include "xfs_quota.h"
> > > >  #include "xfs_qm.h"
> > > >  #include "xfs_icache.h"
> > > > +#include "xfs_trans_priv.h"
> > > >  
> > > >  STATIC int
> > > >  xfs_qm_log_quotaoff(
> > > > @@ -31,12 +32,14 @@ xfs_qm_log_quotaoff(
> > > >  
> > > >  	*qoffstartp = NULL;
> > > >  
> > > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > > +				XFS_TRANS_RELOG, &tp);
> > > 
> > > Humm, maybe I don't understand how this works after all.  From what I
> > > can tell from this patch, (1) the quotaoff transaction is created with
> > > RELOG, so (2) the AIL steals some reservation from it for an eventual
> > > relogging of the quotaoff item, and then (3) we log the quotaoff item.
> > > 
> > 
> > Yep.
> > 
> > > Later, the AIL can decide to trigger the workqueue item to take the
> > > ticket generated in step (2) to relog the item we logged in step (3) to
> > > move the log tail forward, but what happens if there are further delays
> > > and the AIL needs to relog again?  That ticket from (2) is now used up
> > > and is gone, right?
> > > 
> > > I suppose some other RELOG transaction could wander in and generate a
> > > new relog ticket, but as this is the only RELOG transaction that gets
> > > created anywhere, that won't happen.  Is there some magic I missed? :)
> > > 
> > 
> > xfs_ail_relog() only ever rolls its transaction, even if nothing else
> > happens to be queued at the time, so the relog ticket constantly
> > regrants. Since relogs never commit, the relog ticket always has
> > available relog reservation so long as XFS_LI_RELOG items exist. Once
> > there are no more relog items or transactions, the pending reservation
> > is released via xfs_trans_ail_relog_put() -> xfs_log_done().
> 
> Aha, that's the subtlety I didn't quite catch. :)
> 
> Now that I see how this works for the simple case, I guess I'll try to
> figure out on my own what would happen if we flooded the system with a
> /lot/ of reloggable items.  Though I bet you've already done that, given
> our earlier speculating about closing the writeback hole.
> 

I haven't stressed it to the breakdown max yet (from a performance
perspective). My stress testing to this point has been focused on
correctness, particularly related to the relog reservation management,
and working out any kinks. I've stressed it enough so far that I haven't
been able to reproduce log res deadlocks that are otherwise reproducible
without the regrant mechanism in place. IOW, manually replace the relog
ticket bits with a regular old xfs_trans_alloc() in xfs_ail_relog(),
enable aggressive enough buffer relogging via the errortag and throw
fsstress at it and things will likely (eventually) grind to a halt
waiting on relog ticket reservation.
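
To be concrete, the deadlock-prone variant boils down to something
like the following in the relog worker (a sketch of the test hack,
not posted code; ->tr_relog comes from this series):

	/*
	 * Naive variant for stress testing only: take a fresh log
	 * reservation from relog context instead of using the
	 * pre-reserved relog ticket. If a reloggable item pins the
	 * log tail, this blocks waiting on log space that can only
	 * be freed by relogging that item -> deadlock.
	 */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_relog, 0, 0, 0, &tp);
	if (error)
		return;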

Brian

> > It might be more simple to reason about the reservation model if you
> > factor out the dynamic relog ticket bits. This is basically equivalent
> > to the AIL allocating a relog transaction at mount time, constantly
> > rolling it with relog items when they pass through, and then cancelling
> > the reservation at unmount time. All of the extra XFS_TRANS_RELOG and
> > reference counting and ticket management stuff is purely so we only have
> > an active relog reservation when relogging is being used.
> 
> <nod>
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > >  	if (error)
> > > >  		goto out;
> > > >  
> > > >  	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
> > > >  	xfs_trans_log_quotaoff_item(tp, qoffi);
> > > > +	xfs_trans_relog_item(&qoffi->qql_item);
> > > >  
> > > >  	spin_lock(&mp->m_sb_lock);
> > > >  	mp->m_sb.sb_qflags = (mp->m_qflags & ~(flags)) & XFS_MOUNT_QUOTA_ALL;
> > > > @@ -69,6 +72,13 @@ xfs_qm_log_quotaoff_end(
> > > >  	int			error;
> > > >  	struct xfs_qoff_logitem	*qoffi;
> > > >  
> > > > +	/*
> > > > +	 * startqoff must be in the AIL and not the CIL when the end intent
> > > > +	 * commits to ensure it is not readded to the AIL out of order. Wait on
> > > > +	 * relog activity to drain to isolate startqoff to the AIL.
> > > > +	 */
> > > > +	xfs_trans_relog_item_cancel(&startqoff->qql_item, true);
> > > > +
> > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> > > >  	if (error)
> > > >  		return error;
> > > > -- 
> > > > 2.21.1
> > > > 
> > > 
> > 
> 


* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-02-27 13:43 ` [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management Brian Foster
  2020-02-27 20:49   ` Allison Collins
  2020-02-28  0:02   ` Darrick J. Wong
@ 2020-03-02  3:07   ` Dave Chinner
  2020-03-02 18:06     ` Brian Foster
  2 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-02  3:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front. Use reference counting
> to associate the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.
> 
> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>   allocation time.
> - Once initialized, RELOG transactions check for the existence of
>   the global relog log ticket. If it exists, grab a reference and
>   return. If not, allocate an empty ticket and install into the relog
>   subsystem. Seed the relog ticket from reservation of the current
>   transaction. Roll the current transaction to replenish its
>   reservation and return to the caller.
> - The transaction is used as normal. If an item is relogged in the
>   transaction, that item acquires a reference on the global relog
>   ticket currently held open by the transaction. The item's reference
>   persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>   global relog ticket. The global relog ticket is released once its
>   reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.

Hi Brian,

I've held off commenting immediately on this while I tried to get
the concept of dynamic relogging straight in my head. I couldn't put
my finger on what I thought was wrong - just a nagging feeling that
I'd gone down this path before and it ended in something that didn't
work.

It wasn't until a couple of hours ago that a big cog clunked into
place and I realised this roughly mirrored a path I went down 12 or
13 years ago trying to implement what turned into the CIL. I failed
at least 4 times over 5 years trying to implement delayed logging...

There's a couple of simple code comments below before what was in my
head seemed to gel together into something slightly more coherent
than "it seems inside out"....

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  1 +
>  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>  fs/xfs/xfs_trans.h         |  3 ++
>  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_priv.h    |  1 +
>  5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   * made then this algorithm will eventually find all the space it needs.
>   */
>  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>  
>  /*
>   * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>  
>  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>  
>  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>  	tp->t_firstblock = NULLFSBLOCK;
>  
>  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>  	}

Hmmmm. So we are putting the AIL lock directly into the transaction
reserve path? xfs_trans_reserve() goes out of its way to be
lockless in the fast paths, so if you want this to be a generic
mechanism that any transaction can use, the reservation needs to be
completely lockless.

>  
>  	trace_xfs_trans_alloc(tp, _RET_IP_);
>  
>  	*tpp = tp;
>  	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;

seems like a case of "only set the flags once you have a reference"?
Then xfs_trans_cancel() can clean up without special cases being
needed anywhere...
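
i.e. something like this (a sketch, ignoring the trans roll/cleanup
details):

	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
	if (error)
		goto error;

	if (flags & XFS_TRANS_RELOG) {
		error = xfs_trans_ail_relog_reserve(&tp);
		if (error)
			goto error;
		/* only flag the transaction once we hold a ticket ref */
		tp->t_flags |= XFS_TRANS_RELOG;
	}

and then the error path is just a plain xfs_trans_cancel() with no
flag fiddling.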

>  }
>  
>  /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +

That looks ... interesting. xfs_trans_ail_relog_put() can call
xfs_log_done(), which means it can do a log write, which means the
commit lsn for this transaction could change. Hence to make a relog
permanent as a result of a sync transaction, we'd need the
commit_lsn of the AIL relog ticket write here, not that of the
original transaction that was written.


>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>  			error = -EIO;
>  		tp->t_ticket = NULL;
>  	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);

This has the potential for log writes to be issued from a
transaction abort context, right?

>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, !!error);
>  	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>  		tp->t_ticket = NULL;
>  	}
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);

And log writes from a cancel seems possible here, too. I don't think
we do this at all right now, so this could have some interesting
unexpected side-effects during error handling. (As if that wasn't
complex enough to begin with!)

> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>  		xfs_log_space_wake(ailp->ail_mount);
>  }
>  
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}

Hmmm.

So before we've even returned from xfs_trans_alloc(), we may have
committed the transaction and burnt an entire transaction's worth of
log space reservation?

IOWs, to support relogging of a single item in a permanent
transaction, we have to increase the log count of the transaction so
that we don't end up running out of reservation space one
transaction before the end of the normal fast path behaviour of the
change being made?

And we do that up front because we're not exactly sure of how much
space the item that needs relogging is going to require, but the AIL
is going to need to hold on to it from as long as it needs to relog
the item.

<GRRRRIIINNNNDDDD>

<CLUNK>

Ahhhhh.

This is the reason why the CIL ended up using a reservation stealing
mechanism for its ticket. Every log reservation takes into account
the worst case log overhead for that single transaction - that means
there are no special up front hooks or changes to the log
reservation to support the CIL, and it also means that the CIL can
steal what it needs from the current ticket at commit time. Then the
CIL can commit at any time, knowing it has enough space to write all
the items it tracks to the log.

This kept the transaction reservation mechanism from needing to
know anything about the CIL or that it was stealing space from
committing transactions for its own private ticket. It greatly
simplified everything, and it's the reason that the CIL succeeded
where several other attempts to relog items in memory failed....

So....

Why can't the AIL steal the reservation it needs from the current
transaction when the item that needs relogging is being committed?
i.e. we add the "relog overhead" to the permanent transaction that
requires items to be relogged by the AIL, and when that item is
formatted into the CIL we also check to see if it is currently
marked as "reloggable". If it's not, we steal the relogging
reservation from the transaction (essentially the item's formatted
size) and pass it to the AIL's private relog ticket, setting the log
item to be "reloggable" so the reservation isn't stolen over and
over again as the object is relogged in the CIL.

Hence as we commit reloggable items to the CIL, the AIL ticket
reservation grows with each of those items marked as reloggable.
As we relog items to the CIL and the CIL grows its reservation via
the size delta, the AIL reservation can also be updated with the
same delta.

Hence the AIL will always know exactly how much space it needs to relog all
the items it holds for relogging, and because it's been stolen from
the original transaction it is, like the CIL ticket reservation,
considered used space in the log. Hence the log space required for
relogging items via the AIL is correctly accounted for without
needing up front static, per-item reservations.

When the item is logged the final time and the reloggable flag is
removed (e.g. when the quotaoff completes) then we can remove the
reservation from AIL ticket and add it back to the current
transaction. Hence when the final transaction is committed and the
ticket for that transaction is released via xfs_log_done(), the
space the AIL held for relogging the item is also released.

This doesn't require any modifications to the transaction
reservation subsystem, nor the transaction commit/cancel code,
no changes to the log space accounting, etc. All we need to do is
tag items when we log them as reloggable, and in the CIL formatting
of the item pass the formatted size differences to the AIL so it can
steal the reservation it needs.

Then the AIL doesn't need a dynamic relogging ticket. It can just
hold a single ticket from init to teardown by taking an extra
reference to the ticket. When it is running a relog, it can build it
as though it's just a normal permanent transaction without needing
to get a new reservation, and when it rolls the write space consumed
is regranted automatically.

AFAICT, that gets rid of the need for all this ticket reference counting,
null pointer checking, lock juggling and rechecking, etc. It does
not require modification to the transaction reservation path, and we
can account for the relogging on an item by item basis in each
individual transaction reservation....
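
In code, the CIL-side stealing might look something like this (a very
rough sketch - ail_relog_tic and li_relog_res are illustrative names
and the locking is elided):

	/*
	 * At CIL insertion time, after the item has been formatted:
	 * lv->lv_bytes is the formatted size and diff_len the growth
	 * since the last formatting. Steal the full size on the first
	 * commit of a reloggable item, then just the delta as the
	 * item is relogged.
	 */
	if (test_bit(XFS_LI_RELOG, &lip->li_flags)) {
		int	steal;

		steal = lip->li_relog_res ? diff_len : lv->lv_bytes;
		lip->li_relog_res += steal;
		tp->t_ticket->t_curr_res -= steal;
		ailp->ail_relog_tic->t_curr_res += steal;
	}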

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v5 PATCH 4/9] xfs: automatic relogging item management
  2020-02-27 13:43 ` [RFC v5 PATCH 4/9] xfs: automatic relogging item management Brian Foster
  2020-02-27 21:18   ` Allison Collins
@ 2020-03-02  5:58   ` Dave Chinner
  2020-03-02 18:08     ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-02  5:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:16AM -0500, Brian Foster wrote:
> As implemented by the previous patch, relogging can be enabled on
> any item via a relog enabled transaction (which holds a reference to
> an active relog ticket). Add a couple log item flags to track relog
> state of an arbitrary log item. The item holds a reference to the
> global relog ticket when relogging is enabled and releases the
> reference when relogging is disabled.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_trace.h      |  2 ++
>  fs/xfs/xfs_trans.c      | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans.h      |  6 +++++-
>  fs/xfs/xfs_trans_priv.h |  2 ++
>  4 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a86be7f807ee..a066617ec54d 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> +DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>  
>  DECLARE_EVENT_CLASS(xfs_ail_class,
>  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 8ac05ed8deda..f7f2411ead4e 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -778,6 +778,41 @@ xfs_trans_del_item(
>  	list_del_init(&lip->li_trans);
>  }
>  
> +void
> +xfs_trans_relog_item(
> +	struct xfs_log_item	*lip)
> +{
> +	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
> +		xfs_trans_ail_relog_get(lip->li_mountp);
> +		trace_xfs_relog_item(lip);
> +	}

What if xfs_trans_ail_relog_get() fails to get a reference here
because there is no current ail relog ticket? Isn't the transaction
it was reserved in required to be checked here for XFS_TRANS_RELOG
being set?
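
i.e. I'd expect the relationship to be explicit, something like this
(a sketch - passing the transaction in is my assumption, not what the
patch does):

	void
	xfs_trans_relog_item(
		struct xfs_trans	*tp,
		struct xfs_log_item	*lip)
	{
		/* the caller must hold a relog ticket reference */
		ASSERT(tp->t_flags & XFS_TRANS_RELOG);
		if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
			xfs_trans_ail_relog_get(lip->li_mountp);
			trace_xfs_relog_item(lip);
		}
	}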

> +}
> +
> +void
> +xfs_trans_relog_item_cancel(
> +	struct xfs_log_item	*lip,
> +	bool			drain) /* wait for relogging to cease */
> +{
> +	struct xfs_mount	*mp = lip->li_mountp;
> +
> +	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
> +		return;
> +	xfs_trans_ail_relog_put(lip->li_mountp);
> +	trace_xfs_relog_item_cancel(lip);
> +
> +	if (!drain)
> +		return;
> +
> +	/*
> +	 * Some operations might require relog activity to cease before they can
> +	 * proceed. For example, an operation must wait before including a
> +	 * non-lockable log item (i.e. intent) in another transaction.
> +	 */
> +	while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
> +				   TASK_UNINTERRUPTIBLE, HZ))
> +		xfs_log_force(mp, XFS_LOG_SYNC);
> +}

What is a "cancel" operation? Is it something you do when cancelling
a transaction (i.e. on operation failure) or is it something the
final transaction does to remove the relog item from the AIL (i.e.
part of the normal successful finish to a long running transaction)?


>  /* Detach and unlock all of the items in a transaction */
>  static void
>  xfs_trans_free_items(
> @@ -863,6 +898,7 @@ xfs_trans_committed_bulk(
>  
>  		if (aborted)
>  			set_bit(XFS_LI_ABORTED, &lip->li_flags);
> +		clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);

I don't know what the XFS_LI_RELOGGED flag really means in this
patch because I don't know what sets it. Perhaps this would be
better moved into the patch that first sets the RELOGGED flag?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-02-27 13:43 ` [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism Brian Foster
  2020-02-27 22:54   ` Allison Collins
  2020-02-28  0:13   ` Darrick J. Wong
@ 2020-03-02  7:18   ` Dave Chinner
  2020-03-02 18:52     ` Brian Foster
  2 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-02  7:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active fileystem under various workloads.
> 
> The act of relogging an item simply requires to add it to a
> transaction and commit. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items between writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.
> 
> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_trace.h      |   1 +
>  fs/xfs/xfs_trans.h      |   1 +
>  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans_priv.h |   3 ++
>  4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>  #define XFS_ITEM_PINNED		1
>  #define XFS_ITEM_LOCKED		2
>  #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>  
>  /*
>   * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>  	return lsn;
>  }
>  
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);

I suspect deadlocks on filesystems in the process of being frozen
when the log is full and relogging is required to make progress.

> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);

So this assumes you've stolen the log reservation for this ticket
from somewhere else, because otherwise the transaction log
reservation and the ticket don't match.

FWIW, I'm having real trouble keeping all the AIL relog ticket
references straight. The code seems to be arbitrarily taking and
dropping references to that ticket, and I can't see a pattern or
set of rules for usage.

Why does this specific transaction need a reference to the ticket,
when the ail_relog_list has a reference, every item that has been
marked as XFS_LI_RELOG already has a reference, etc?

> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {

I dislike the way the "swiss army knife" list macro makes this
code unreadable.

        while (!list_empty(&ailp->ail_relog_list)) {
		lip = list_first_entry(...)

is much neater and easier to read.

> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);

This seems completely arbitrary.

> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */
> +		error = xfs_trans_roll(&tp);

So the reservation for this ticket is going to be regranted over and
over again and space reserved repeatedly as we work through the
locked relog items one at a time?

Unfortunately, this violates the rule that prevents rolling
transactions from deadlocking. That is, any object that is held
locked across the transaction commit and regrant that *might pin the
tail of the log* must be relogged in the transaction to move the
item forward and prevent it from pinning the tail of the log.

IOWs, if one of the items later in the relog list pins the tail of
the log we will end up sleeping here:

  xfs_trans_roll()
    xfs_trans_reserve
      xfs_log_regrant
        xlog_grant_head_check(need_bytes)
	  xlog_grant_head_wait()

waiting for the write grant head to move. And it never will, because
we hold the lock on that item so the AIL can't push it out.
IOWs, using a rolling transaction per relog item will not work for
processing multiple relog items.

If it's a single transaction, and we join all the locked items
for relogging into it in a single transaction commit, then we are
fine - we don't try to regrant log space while holding locked items
that could pin the tail of the log.

We *can* use a rolling transaction if we do this - the AIL has a
permanent transaction (plus ticket!) allocated at mount time with
a log count of zero, and we can then steal reserve/write grant head
space into that ticket at CIL commit time as I mentioned
previously. We do a loop like above, but it's basically:

{
	LIST_HEAD(tmp);

	spin_lock(&ailp->ail_lock);
	if (list_empty(&ailp->ail_relog_list)) {
		spin_unlock(&ailp->ail_lock);
		return;
	}

	list_splice_init(&ailp->ail_relog_list, &tmp);
	spin_unlock(&ailp->ail_lock);

	xfs_ail_relog_items(ail, &tmp);
}

This allows the AIL to keep working and building up a new relog list
as it goes along. Hence we can work on our list without interruption
or needing to repeatedly take the AIL lock just to get items from the
list.

And xfs_ail_relog_items() does something like this:

{
	struct xfs_trans	*tp;

	/*
	 * Make CIL committers trying to change relog status of log
	 * items wait for us to stabilise the relog transaction
	 * again by committing the current relog list and rolling
	 * the transaction.
	 */
	down_write(&ail->ail_relog_lock);
	tp = ail->relog_trans;

	while (!list_empty(&ailp->ail_relog_list)) {
		lip = list_first_entry(&ailp->ail_relog_list,
					struct xfs_log_item, li_trans);
		list_del_init(&lip->li_trans);

		xfs_trans_add_item(tp, lip);
		set_bit(XFS_LI_DIRTY, &lip->li_flags);
		tp->t_flags |= XFS_TRANS_DIRTY;
	}

	error = xfs_trans_roll(&tp);
	if (error) {
		SHUTDOWN!
	}
	ail->relog_trans = tp;
	up_write(&ail->ail_relog_lock);
}

> @@ -426,6 +495,23 @@ xfsaild_push(
>  			ailp->ail_last_pushed_lsn = lsn;
>  			break;
>  
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;

So the XFS_LI_RELOGGED bit indicates that the item is locked on the
relog list? That means nothing else in the transaction subsystem
will set that until the item is unlocked, right? Which is when it
ends up back on the CIL. This then gets cleared by the CIL being
forced and the item moved forward in the AIL.

IOWs, a log force clears this flag until the AIL relogs it again.
Why do we need this flag to issue wakeups when the wait loop does
blocking log forces? i.e. it will have already waited for the flag
to be cleared by waiting for the log force....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-02-28 14:02     ` Brian Foster
@ 2020-03-02  7:32       ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2020-03-02  7:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 28, 2020 at 09:02:02AM -0500, Brian Foster wrote:
> See my earlier comment around batching. Right now we only relog one item
> at a time and the relog reservation is intended to be the max possible
> reloggable item in the fs. This needs to increase to support some kind
> of batching here, but I think the prospective reloggable items right now
> (i.e. 2 or 3 different intent types) allows a fixed calculation size to
> work well enough for our needs.
> 
> Note that I think there's a whole separate ball of complexity we could
> delve into if we wanted to support something like arbitrary, per-item
> (set) relog tickets with different reservation values as opposed to one
> global, fixed size ticket. That would require some association between
> log items and tickets and perhaps other items covered by the same
> ticket, etc., but would provide a much more generic mechanism. As it is,
> I think that's hugely overkill for the current use cases, but maybe we
> find a reason to evolve this into something like that down the road..

From what I'm seeing, a generic, arbitrary relogging mechanism based
on a reservation stealing concept similar to the CIL is going to be
simpler, less invasive, easier to expand in future and more robust
than this "limited scope" proposal.

I'm firmly in the "do it right the first time" camp here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-02-27 13:43 ` [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype Brian Foster
  2020-02-27 23:33   ` Allison Collins
@ 2020-03-02  7:47   ` Dave Chinner
  2020-03-02 19:00     ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-02  7:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 27, 2020 at 08:43:19AM -0500, Brian Foster wrote:
> Add a quick and dirty implementation of buffer relogging support.
> There is currently no use case for buffer relogging. This is for
> experimental use only and serves as an example to demonstrate the
> ability to relog arbitrary items in the future, if necessary.
> 
> Add a hook to enable relogging a buffer in a transaction, update the
> buffer log item handlers to support relogged BLIs and update the
> relog handler to join the relogged buffer to the relog transaction.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
.....
>  /*
> @@ -187,9 +188,21 @@ xfs_ail_relog(
>  			xfs_log_ticket_put(ailp->ail_relog_tic);
>  		spin_unlock(&ailp->ail_lock);
>  
> -		xfs_trans_add_item(tp, lip);
> -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> -		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/*
> +		 * TODO: Ideally, relog transaction management would be pushed
> +		 * down into the ->iop_push() callbacks rather than playing
> +		 * games with ->li_trans and looking at log item types here.
> +		 */
> +		if (lip->li_type == XFS_LI_BUF) {
> +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> +			xfs_buf_hold(bli->bli_buf);

What is this for? The bli already has a reference to the buffer.

> +			xfs_trans_bjoin(tp, bli->bli_buf);
> +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> +		} else {
> +			xfs_trans_add_item(tp, lip);
> +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +			tp->t_flags |= XFS_TRANS_DIRTY;
> +		}

Really, this should be a xfs_item_ops callout. i.e.

		lip->li_ops->iop_relog(lip);

And then a) it doesn't really matter where we call it from, and b)
it becomes fully generic and we can implement the callout
as future functionality requires.

However, we have to make sure that the current transaction we are
running has the correct space usage accounted to it, so I think this
callout really does need to be done in a tight loop iterating and
accounting all the relog items into the transaction without outside
interference.
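
Roughly like this (a sketch - ->iop_relog doesn't exist in this
series, and BUF_ITEM() is the existing bli container helper):

	/* hypothetical addition to struct xfs_item_ops */
	void	(*iop_relog)(struct xfs_log_item *lip,
			     struct xfs_trans *tp);

	/* buffer implementation, mirroring the hunk above minus the
	 * extra hold */
	STATIC void
	xfs_buf_item_relog(
		struct xfs_log_item	*lip,
		struct xfs_trans	*tp)
	{
		struct xfs_buf_log_item	*bip = BUF_ITEM(lip);

		xfs_trans_bjoin(tp, bip->bli_buf);
		xfs_trans_dirty_buf(tp, bip->bli_buf);
	}

Then the relog loop just calls lip->li_ops->iop_relog(lip, tp) for
every item type.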

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-02  3:07   ` Dave Chinner
@ 2020-03-02 18:06     ` Brian Foster
  2020-03-02 23:25       ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-03-02 18:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > Automatic item relogging will occur from xfsaild context. xfsaild
> > cannot acquire log reservation itself because it is also responsible
> > for writeback and thus making used log reservation available again.
> > Since there is no guarantee log reservation is available by the time
> > a relogged item reaches the AIL, this is prone to deadlock.
> > 
> > To guarantee log reservation for automatic relogging, implement a
> > reservation management scheme where a transaction that is capable of
> > enabling relogging of an item must contribute the necessary
> > reservation to the relog mechanism up front. Use reference counting
> > to associate the lifetime of pending relog reservation to the
> > lifetime of in-core log items with relogging enabled.
> > 
> > The basic log reservation sequence for a relog enabled transaction
> > is as follows:
> > 
> > - A transaction that uses relogging specifies XFS_TRANS_RELOG at
> >   allocation time.
> > - Once initialized, RELOG transactions check for the existence of
> >   the global relog log ticket. If it exists, grab a reference and
> >   return. If not, allocate an empty ticket and install into the relog
> >   subsystem. Seed the relog ticket from reservation of the current
> >   transaction. Roll the current transaction to replenish its
> >   reservation and return to the caller.
> > - The transaction is used as normal. If an item is relogged in the
> >   transaction, that item acquires a reference on the global relog
> >   ticket currently held open by the transaction. The item's reference
> >   persists until relogging is disabled on the item.
> > - The RELOG transaction commits and releases its reference to the
> >   global relog ticket. The global relog ticket is released once its
> >   reference count drops to zero.
> > 
> > This provides a central relog log ticket that guarantees reservation
> > availability for relogged items, avoids log reservation deadlocks
> > and is allocated and released on demand.
> 
> Hi Brian,
> 
> I've held off commenting immediately on this while I tried to get
> the concept of dynamic relogging straight in my head. I couldn't put
> my finger on what I thought was wrong - just a nagging feeling that
> I'd gone down this path before and it ended in something that didn't
> work.
> 

No problem..

> It wasn't until a couple of hours ago that a big cog clunked into
> place and I realised this roughly mirrored a path I went down 12 or
> 13 years ago trying to implement what turned into the CIL. I failed
> at least 4 times over 5 years trying to implement delayed logging...
> 
> There's a couple of simple code comments below before what was in my
> head seemed to gel together into something slightly more coherent
> than "it seems inside out"....
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_shared.h |  1 +
> >  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
> >  fs/xfs/xfs_trans.h         |  3 ++
> >  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans_priv.h    |  1 +
> >  5 files changed, 126 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index c45acbd3add9..0a10ca0853ab 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
> >   * made then this algorithm will eventually find all the space it needs.
> >   */
> >  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> > +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
> >  
> >  /*
> >   * Field values for xfs_trans_mod_sb.
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 3b208f9a865c..8ac05ed8deda 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -107,9 +107,14 @@ xfs_trans_dup(
> >  
> >  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
> >  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> > -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> > -	/* We gave our writer reference to the new transaction */
> > +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> > +		       (tp->t_flags & XFS_TRANS_RELOG);
> > +	/*
> > +	 * The writer reference and relog reference transfer to the new
> > +	 * transaction.
> > +	 */
> >  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> >  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
> >  
> >  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> >  	tp->t_firstblock = NULLFSBLOCK;
> >  
> >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		return error;
> > +	if (error)
> > +		goto error;
> > +
> > +	if (flags & XFS_TRANS_RELOG) {
> > +		error = xfs_trans_ail_relog_reserve(&tp);
> > +		if (error)
> > +			goto error;
> >  	}
> 
> Hmmmm. So we are putting the AIL lock directly into the transaction
> reserve path? xfs_trans_reserve() goes out of its way to be
> lockless in the fast paths, so if you want this to be a generic
> mechanism that any transaction can use, the reservation needs to be
> completely lockless.
> 

Yeah, this is one of those warts mentioned in the cover letter. I wasn't
planning to make it lockless as much as using an independent lock, but I
can take a closer look at that if it still looks like a contention point
after getting through the bigger picture feedback..

> >  
> >  	trace_xfs_trans_alloc(tp, _RET_IP_);
> >  
> >  	*tpp = tp;
> >  	return 0;
> > +
> > +error:
> > +	/* clear relog flag if we haven't acquired a ref */
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> > +	xfs_trans_cancel(tp);
> > +	return error;
> 
> seems like a case of "only set the flags once you have a reference"?
> Then xfs_trans_cancel() can clean up without special cases being
> needed anywhere...
> 

Yeah, I suppose that might be a bit more readable.

> >  }
> >  
> >  /*
> > @@ -973,6 +988,10 @@ __xfs_trans_commit(
> >  
> >  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> 
> That looks ... interesting. xfs_trans_ail_relog_put() can call
> xfs_log_done(), which means it can do a log write, which means the
> commit lsn for this transaction could change. Hence to make a relog
> permanent as a result of a sync transaction, we'd need the
> commit_lsn of the AIL relog ticket write here, not that of the
> original transaction that was written.
> 

Hmm.. xfs_log_done() is eventually called for every transaction ticket,
so I don't think it should be doing log writes based on that. The
_ail_relog_put() path is basically just intended to properly terminate
the relog ticket based on the existing transaction completion path. I'd
have to dig a bit further into the code here, but e.g. we don't pass an
iclog via this path so if we were attempting to write I'd probably have
seen assert failures (or worse) by now. Indeed, this is similar to a
cancel of a clean transaction, the difference being this is open-coded
because we only have the relog ticket in this context.
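
For reference, the final put boils down to something like this (a
sketch with the refcounting elided):

	/* last reference dropped: tear down the relog ticket */
	ailp->ail_relog_tic = NULL;
	spin_unlock(&ailp->ail_lock);
	/*
	 * The relog ticket is never written against directly (delayed
	 * logging leaves XLOG_TIC_INITED set) and no iclog is passed,
	 * so this skips the commit record and just releases the
	 * pending reservation.
	 */
	xfs_log_done(mp, tic, NULL, false);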

> 
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free(tp);
> >  
> > @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
> >  			error = -EIO;
> >  		tp->t_ticket = NULL;
> >  	}
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	/* XXX: handle RELOG items on transaction abort */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> 
> This has the potential for log writes to be issued from a
> transaction abort context, right?
> 
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free_items(tp, !!error);
> >  	xfs_trans_free(tp);
> > @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
> >  		tp->t_ticket = NULL;
> >  	}
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> 
> And log writes from a cancel seems possible here, too. I don't think
> we do this at all right now, so this could have some interesting
> unexpected side-effects during error handling. (As if that wasn't
> complex enough to begin with!)
> 

I'm not quite sure I follow your comments here, though I do still need
to work through all of the error paths and make sure everything is
correct. I've been putting details like that off in favor of getting a
high level design approach worked out first.

> > @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
> >  		xfs_log_space_wake(ailp->ail_mount);
> >  }
> >  
> > +bool
> > +xfs_trans_ail_relog_get(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		ret = true;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> > + * requires a clean, permanent transaction from the caller. Pull reservation
> > + * for the relog ticket and roll the caller's transaction back to its fully
> > + * reserved state. If the AIL relog ticket is already initialized, grab a
> > + * reference and return.
> > + */
> > +int
> > +xfs_trans_ail_relog_reserve(
> > +	struct xfs_trans	**tpp)
> > +{
> > +	struct xfs_trans	*tp = *tpp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> > +
> > +	if (xfs_trans_ail_relog_get(mp))
> > +		return 0;
> > +
> > +	/* no active ticket, fall into slow path to allocate one.. */
> > +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> > +	if (!tic)
> > +		return -ENOMEM;
> > +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> > +
> > +	/* check again since we dropped the lock for the allocation */
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +		xfs_log_ticket_put(tic);
> > +		return 0;
> > +	}
> > +
> > +	/* attach and reserve space for the ->tr_relog ticket */
> > +	ailp->ail_relog_tic = tic;
> > +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	return xfs_trans_roll(tpp);
> > +}
> 
> Hmmm.
> 
> So before we've even returned from xfs_trans_alloc(), we may have
> committed the transaction and burnt an entire transaction's worth of
> log space reservation?
> 

Yep.

> IOWs, to support relogging of a single item in a permanent
> transaction, we have to increase the log count of the transaction so
> that we don't end up running out of reservation space one
> transaction before the end of the normal fast path behaviour of the
> change being made?
> 

Right.

> And we do that up front because we're not exactly sure of how much
> space the item that needs relogging is going to require, but the AIL
> is going to need to hold on to it from as long as it needs to relog
> the item.
> 

*nod*

> <GRRRRIIINNNNDDDD>
> 
> <CLUNK>
> 
> Ahhhhh.
> 

Heh. :P

> This is the reason why the CIL ended up using a reservation stealing
> mechanism for its ticket. Every log reservation takes into account
> the worst case log overhead for that single transaction - that means
> there are no special up front hooks or changes to the log
> reservation to support the CIL, and it also means that the CIL can
> steal what it needs from the current ticket at commit time. Then the
> CIL can commit at any time, knowing it has enough space to write all
> the items it tracks to the log.
> 
> This kept the transaction reservation mechanism from needing to
> know anything about the CIL or that it was stealing space from
> committing transactions for its own private ticket. It greatly
> simplified everything, and it's the reason that the CIL succeeded
> where several other attempts to relog items in memory failed....
> 
> So....
> 
> Why can't the AIL steal the reservation it needs from the current
> transaction when the item that needs relogging is being committed?

I think we can. My first proposal[1] implemented something like this.
The broader design was much less fleshed out, so it was crudely
implemented as more of a commit time hook to relog pending items based
on the currently committing transaction as opposed to reservation
stealing for an independent relog ticket. This was actually based on the
observation that the CIL already did something similar (though I didn't
have all of the background as to why that approach was used for the
CIL).

The feedback at the time was to consider moving in a different direction
that didn't involve stealing reservation from transactions, which gave
me the impression that approach wasn't going to be acceptable. Granted,
the first few variations of this were widely different in approach and I
don't think there was any explicit objection to reservation stealing
itself, it just seemed a better development approach to work out an
ideal design based on fundamentals as opposed to limiting the scope to
contexts that might facilitate reservation stealing. Despite the fact
that this code is hairy, the fundamental approach is rather
straightforward and simplistic. The hairiness mostly comes from
attempting to make things generic and dynamic and could certainly use
some cleanup.

Now that this is at a point where I'm reasonably confident it is
fundamentally correct, I'm quite happy to revisit a reservation stealing
approach if that improves functional operation. This can serve a decent
reference/baseline implementation for that, I think.

[1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/

> i.e. we add the "relog overhead" to the permanent transaction that
> requires items to be relogged by the AIL, and when that item is
> formatted into the CIL we also check to see if it is currently
> marked as "reloggable". If it's not, we steal the relogging
> reservation from the transaction (essentially the item's formatted
> size) and pass it to the AIL's private relog ticket, setting the log
> item to be "reloggable" so the reservation isn't stolen over and
> over again as the object is relogged in the CIL.
> 

Ok, so this implies to me we'd update the per-transaction reservation
calculations to incorporate worst case relog overhead for each
transaction. E.g., the quotaoff transaction would x2 the intent size,
a transaction that might relog a buffer adds a max buffer size overhead,
etc. Right?
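
E.g., for the quotaoff case I'd expect something like this (a sketch,
not part of the current series):

	/*
	 * Include worst case overhead for one relogged copy of the
	 * quotaoff intent so the CIL can steal that much reservation
	 * for the AIL relog ticket.
	 */
	resp->tr_qm_quotaoff.tr_logres =
			2 * xfs_calc_qm_quotaoff_reservation(mp);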

> Hence as we commit reloggable items to the CIL, the AIL ticket
> reservation grows with each of those items marked as reloggable.
> As we relog items to the CIL and the CIL grows its reservation via
> the size delta, the AIL reservation can also be updated with the
> same delta.
> 
> Hence the AIL will always know exactly how much space it needs to relog all
> the items it holds for relogging, and because it's been stolen from
> the original transaction it is, like the CIL ticket reservation,
> considered used space in the log. Hence the log space required for
> relogging items via the AIL is correctly accounted for without
> needing up front static, per-item reservations.
> 

By this I assume you're referring to avoiding the need to reserve ->
donate -> roll in the current scheme. Instead, we'd acquire the larger
reservation up front and only steal when necessary, which is less
overhead because we don't need to always replenish the full transaction.

Do note that I don't consider the current approach high overhead overall
because the roll only needs to happen when (and if) the first
transaction enables the relog ticket. There is no additional overhead
from that point forward. The "relog overhead" approach sounds fine to me
as well, I'm just noting that there are still some tradeoffs to either
approach.

> When the item is logged the final time and the reloggable flag is
> removed (e.g. when the quotaoff completes) then we can remove the
> reservation from AIL ticket and add it back to the current
> transaction. Hence when the final transaction is committed and the
> ticket for that transaction is released via xfs_log_done(), the
> space the AIL held for relogging the item is also released.
> 

Makes sense.

> This doesn't require any modifications to the transaction
> reservation subsystem, nor the transaction commit/cancel code,
> no changes to the log space accounting, etc. All we need to do is
> tag items when we log them as reloggable, and in the CIL formatting
> of the item pass the formatted size differences to the AIL so it can
> steal the reservation it needs.
> 
> Then the AIL doesn't need a dynamic relogging ticket. It can just
> hold a single ticket from init to teardown by taking an extra
> reference to the ticket. When it is running a relog, it can build it
> as though it's just a normal permanent transaction without needing
> to get a new reservation, and when it rolls the write space consumed
> is regranted automatically.
> 
> AFAICT, that gets rid of the need for all this ticket reference counting,
> null pointer checking, lock juggling and rechecking, etc. It does
> not require modification to the transaction reservation path, and we
> can account for the relogging on an item by item basis in each
> individual transaction reservation....
> 

Yep, I think there's potential for some nice cleanup there. I'd be very
happy to rip out some of the guts of the current patch, particularly the
reference counting bits.

> Thoughts?
> 

<thinking out loud from here on..>

One thing that comes to mind thinking about this is dealing with
batching (relog multiple items per roll). This code doesn't handle that
yet, but I anticipate it being a requirement and it's fairly easy to
update the current scheme to support a fixed item count per relog-roll.

A stealing approach potentially complicates things when it comes to
batching because we have per-item reservation granularity to consider.
For example, consider if we had a variety of different relog item types
active at once, a subset are being relogged while another subset are
being disabled (before ever needing to be relogged). For one, we'd have
to be careful to not release reservation while it might be accounted to
an active relog transaction that is currently rolling with some other
items, etc. There's also a potential quirk in handling reservation of a
relogged item that is cancelled while it's being relogged, but that
might be more of an implementation detail.

I don't think that's a show stopper, but rather just something I'd like
to have factored into the design from the start. One option could be to
maintain a separate counter of active relog reservation aside from the
actual relog ticket. That way the relog ticket could just pull from this
relog reservation pool based on the current item(s) being relogged
asynchronously from different tasks that might add or remove reservation
from the pool for separate items. That might get a little wonky when we
consider the relog ticket needs to pull from the pool and then put
something back if the item is still reloggable after the relog
transaction rolls.
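
Roughly something like this, for the sake of discussion (all names
hypothetical, sketch only):

/* pool of stolen, not-yet-assigned relog reservation (bytes) */
atomic_t		ail_relog_res;

/* CIL commit side: steal reservation for a newly reloggable item */
static void
xfs_ail_relog_res_fill(
	struct xfs_ail		*ailp,
	int			nbytes)
{
	atomic_add(nbytes, &ailp->ail_relog_res);
}

/* relog side: move the current batch's res from the pool to the ticket */
static void
xfs_ail_relog_res_take(
	struct xfs_ail		*ailp,
	struct xlog_ticket	*tic,
	int			nbytes)
{
	atomic_sub(nbytes, &ailp->ail_relog_res);
	tic->t_curr_res += nbytes;
	tic->t_unit_res += nbytes;
}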

Another problem is that reloggable items are still otherwise usable once
they are unlocked. So for example we'd have to account for a situation
where one transaction dirties a buffer, enables relogging, commits and
then some other transaction dirties more of the buffer and commits
without caring whether the buffer was relog enabled or not. Unless
runtime relog reservation is always worst case, that's a subtle path to
reservation overrun in the relog transaction.

I'm actually wondering if a simpler approach to tracking stolen
reservation is to add a new field that tracks active relog reservation
to each supported log item. Then the initial transaction enables
relogging with a simple transfer from the transaction to the item. The
relog transaction knows how much reservation it can assign based on the
current population of items and requires no further reservation
accounting because it rolls and thus automatically reacquires relog
reservation for each associated item. The path that clears relog state
transfers reservation from the item back to a transaction simply so the
reservation can be released back to the pool of unused log space. Note
that clearing relog state doesn't require a transaction in the current
implementation, but we could easily define a helper to allocate an empty
transaction, clear relog and reclaim relog reservation and then cancel
for those contexts.
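
IOW, something like this (hypothetical field and helpers, sketch
only):

/* in struct xfs_log_item */
	int			li_relog_res;	/* res held for relogging */

/* enabling transaction donates reservation to the item */
static void
xfs_item_relog_res_set(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip,
	int			nbytes)
{
	tp->t_ticket->t_curr_res -= nbytes;
	lip->li_relog_res = nbytes;
}

/* clearing relog state hands it back so the caller can release it */
static void
xfs_item_relog_res_clear(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	tp->t_ticket->t_curr_res += lip->li_relog_res;
	lip->li_relog_res = 0;
}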

Thoughts on any of the above appreciated. I need to think about this
some more but otherwise I'll attempt some form of a reservation stealing
approach for the next iteration...

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [RFC v5 PATCH 4/9] xfs: automatic relogging item management
  2020-03-02  5:58   ` Dave Chinner
@ 2020-03-02 18:08     ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-03-02 18:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 04:58:37PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:16AM -0500, Brian Foster wrote:
> > As implemented by the previous patch, relogging can be enabled on
> > any item via a relog enabled transaction (which holds a reference to
> > an active relog ticket). Add a couple log item flags to track relog
> > state of an arbitrary log item. The item holds a reference to the
> > global relog ticket when relogging is enabled and releases the
> > reference when relogging is disabled.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_trace.h      |  2 ++
> >  fs/xfs/xfs_trans.c      | 36 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans.h      |  6 +++++-
> >  fs/xfs/xfs_trans_priv.h |  2 ++
> >  4 files changed, 45 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a86be7f807ee..a066617ec54d 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1063,6 +1063,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> > +DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> > +DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
> >  
> >  DECLARE_EVENT_CLASS(xfs_ail_class,
> >  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 8ac05ed8deda..f7f2411ead4e 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -778,6 +778,41 @@ xfs_trans_del_item(
> >  	list_del_init(&lip->li_trans);
> >  }
> >  
> > +void
> > +xfs_trans_relog_item(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
> > +		xfs_trans_ail_relog_get(lip->li_mountp);
> > +		trace_xfs_relog_item(lip);
> > +	}
> 
> What if xfs_trans_ail_relog_get() fails to get a reference here
> because there is no current ail relog ticket? Isn't the transaction
> it was reserved in required to be checked here for XFS_TRANS_RELOG
> being set?
> 

That shouldn't happen because XFS_TRANS_RELOG is required of the
transaction, as you noted. Ideally this would at least have an assert.
I'm guessing I didn't do that simply because there was no other reason
to pass the transaction into this function.

I could clean this up, but much of this might go away if the reservation
model changes such that XFS_TRANS_RELOG goes away.
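
E.g., pass the transaction in and assert on the flag (untested
sketch):

void
xfs_trans_relog_item(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	/* relogging requires a transaction with relog reservation */
	ASSERT(tp->t_flags & XFS_TRANS_RELOG);

	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
		xfs_trans_ail_relog_get(lip->li_mountp);
		trace_xfs_relog_item(lip);
	}
}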

> > +}
> > +
> > +void
> > +xfs_trans_relog_item_cancel(
> > +	struct xfs_log_item	*lip,
> > +	bool			drain) /* wait for relogging to cease */
> > +{
> > +	struct xfs_mount	*mp = lip->li_mountp;
> > +
> > +	if (!test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
> > +		return;
> > +	xfs_trans_ail_relog_put(lip->li_mountp);
> > +	trace_xfs_relog_item_cancel(lip);
> > +
> > +	if (!drain)
> > +		return;
> > +
> > +	/*
> > +	 * Some operations might require relog activity to cease before they can
> > +	 * proceed. For example, an operation must wait before including a
> > +	 * non-lockable log item (i.e. intent) in another transaction.
> > +	 */
> > +	while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
> > +				   TASK_UNINTERRUPTIBLE, HZ))
> > +		xfs_log_force(mp, XFS_LOG_SYNC);
> > +}
> 
> What is a "cancel" operation? Is it something you do when cancelling
> a transaction (i.e. on operation failure) or is it something the
> final transaction does to remove the relog item from the AIL (i.e.
> part of the normal successful finish to a long running transaction)?
> 

This just means to cancel relogging on a log item. Cancelling relogging
only requires clearing the flag, so it doesn't require a transaction at
all at the moment. The waiting bit is for callers (i.e. quotaoff) that
might want to remove the item from the AIL after relogging is disabled.
Without that, the item could still be in the CIL when the caller wants
to remove it.

> 
> >  /* Detach and unlock all of the items in a transaction */
> >  static void
> >  xfs_trans_free_items(
> > @@ -863,6 +898,7 @@ xfs_trans_committed_bulk(
> >  
> >  		if (aborted)
> >  			set_bit(XFS_LI_ABORTED, &lip->li_flags);
> > +		clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);
> 
> I don't know what the XFS_LI_RELOGGED flag really means in this
> patch because I don't know what sets it. Perhaps this would be
> better moved into the patch that first sets the RELOGGED flag?
> 

Hmm, yeah that's a bit of a wart. It basically means that an item is
queued for relog by the AIL (similar to an item added to the buffer
writeback list, but not yet submitted). Perhaps RELOG_QUEUED would be a
better name. It's included in this patch because it's used by
xfs_trans_relog_item_cancel(), but I suppose that whole functional hunk
could be added later once the flag is introduced properly.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-03-02  7:18   ` Dave Chinner
@ 2020-03-02 18:52     ` Brian Foster
  2020-03-03  0:06       ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-03-02 18:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > Now that relog reservation is available and relog state tracking is
> > in place, all that remains to automatically relog items is the relog
> > mechanism itself. An item with relogging enabled is basically pinned
> > from writeback until relog is disabled. Instead of being written
> > back, the item must instead be periodically committed in a new
> > transaction to move it in the physical log. The purpose of moving
> > the item is to avoid long term tail pinning and thus avoid log
> > deadlocks for long running operations.
> > 
> > The ideal time to relog an item is in response to tail pushing
> > pressure. This accommodates the current workload at any given time
> > as opposed to a fixed time interval or log reservation heuristic,
> > which risks performance regression. This is essentially the same
> > heuristic that drives metadata writeback. XFS already implements
> > various log tail pushing heuristics that attempt to keep the log
> > progressing on an active filesystem under various workloads.
> > 
> > The act of relogging an item simply requires adding it to a
> > transaction and committing. This pushes the already dirty item into a
> > subsequent log checkpoint and frees up its previous location in the
> > on-disk log. Joining an item to a transaction of course requires
> > locking the item first, which means we have to be aware of
> > type-specific locks and lock ordering wherever the relog takes
> > place.
> > 
> > Fundamentally, this points to xfsaild as the ideal location to
> > process relog enabled items. xfsaild already processes log resident
> > items, is driven by log tail pushing pressure, processes arbitrary
> > log item types through callbacks, and is sensitive to type-specific
> > locking rules by design. The fact that automatic relogging
> > essentially diverts items between writeback or relog also suggests
> > xfsaild as an ideal location to process items one way or the other.
> > 
> > Of course, we don't want xfsaild to process transactions as it is a
> > critical component of the log subsystem for driving metadata
> > writeback and freeing up log space. Therefore, similar to how
> > xfsaild builds up a writeback queue of dirty items and queues writes
> > asynchronously, make xfsaild responsible only for directing pending
> > relog items into an appropriate queue and create an async
> > (workqueue) context for processing the queue. The workqueue context
> > utilizes the pre-reserved relog ticket to drain the queue by rolling
> > a permanent transaction.
> > 
> > Update the AIL pushing infrastructure to support a new RELOG item
> > state. If a log item push returns the relog state, queue the item
> > for relog instead of writeback. On completion of a push cycle,
> > schedule the relog task at the same point metadata buffer I/O is
> > submitted. This allows items to be relogged automatically under the
> > same locking rules and pressure heuristics that govern metadata
> > writeback.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_trace.h      |   1 +
> >  fs/xfs/xfs_trans.h      |   1 +
> >  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans_priv.h |   3 ++
> >  4 files changed, 106 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a066617ec54d..df0114ec66f1 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> > +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index fc4c25b6eee4..1637df32c64c 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_RELOG		4
> >  
> >  /*
> >   * Deferred operation item relogging limits.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index a3fb64275baa..71a47faeaae8 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
> >  	return lsn;
> >  }
> >  
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +						     ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans_res	tres = {};
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip;
> > +	int			error;
> > +
> > +	/*
> > +	 * The first transaction to submit a relog item contributed relog
> > +	 * reservation to the relog ticket before committing. Create an empty
> > +	 * transaction and manually associate the relog ticket.
> > +	 */
> > +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> 
> I suspect deadlocks on filesystems in the process of being frozen
> when the log is full and relogging is required to make progress.
> 

I'll have to look into interactions with fs freeze as well once the
basics are nailed down.

> > +	ASSERT(!error);
> > +	if (error)
> > +		return;
> > +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> > +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> > +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> > +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> 
> So this assumes you've stolen the log reservation for this ticket
> from somewhere else, because otherwise the transaction log
> reservation and the ticket don't match.
> 
> FWIW, I'm having real trouble keeping all the ail relog ticket
> references straight. Code seems to be
> arbitrarily taking and dropping references to that ticket, and I
> can't see a pattern or set of rules for usage.
> 

The reference counting stuff is a mess. I was anticipating needing to
simplify it, perhaps abstract it from the actual log ticket count, but
I'm going to hold off on getting too deep into those weeds unless the
reservation stealing approach doesn't pan out. Otherwise, this should
presumably all go away..

> Why does this specific transaction need a reference to the ticket,
> when the ail_relog_list has a reference, every item that has been
> marked as XFS_LI_RELOG already has a reference, etc?
> 

This transaction has a reference because it uses the log ticket. It will
release a ticket reference on commit. Also note that the relog state
(and thus references) can be cleared from items at any time.

> > +	spin_lock(&ailp->ail_lock);
> > +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> > +					       struct xfs_log_item,
> > +					       li_trans)) != NULL) {
> 
> I dislike the way the "swiss army knife" list macro makes this
> code unreadable.
> 
>         while (!list_empty(&ailp->ail_relog_list)) {
> 		lip = list_first_entry(...)
> 
> is much neater and easier to read.
> 

Ok.

> > +		/*
> > +		 * Drop the AIL processing ticket reference once the relog list
> > +		 * is emptied. At this point it's possible for our transaction
> > +		 * to hold the only reference.
> > +		 */
> > +		list_del_init(&lip->li_trans);
> > +		if (list_empty(&ailp->ail_relog_list))
> > +			xfs_log_ticket_put(ailp->ail_relog_tic);
> 
> This seems completely arbitrary.
> 

The list holds a reference for when an item is queued for relog but the
relog state is cleared before the relog occurs.

> > +		spin_unlock(&ailp->ail_lock);
> > +
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/* XXX: include ticket owner task fix */
> > +		error = xfs_trans_roll(&tp);
> 
> So the reservation for this ticket is going to be regranted over and
> over again and space reserved repeatedly as we work through the
> locked relog items one at a time?
> 
> Unfortunately, this violates the rule that prevents rolling
> transactions from deadlocking. That is, any object that is held
> locked across the transaction commit and regrant that *might pin the
> tail of the log* must be relogged in the transaction to move the
> item forward and prevent it from pinning the tail of the log.
> 
> IOWs, if one of the items later in the relog list pins the tail of
> the log we will end up sleeping here:
> 
>   xfs_trans_roll()
>     xfs_trans_reserve
>       xfs_log_regrant
>         xlog_grant_head_check(need_bytes)
> 	  xlog_grant_head_wait()
> 
> waiting for the write grant head to move. And it never will, because
> we hold the lock on that item so the AIL can't push it out.
> IOWs, using a rolling transaction per relog item will not work for
> processing multiple relog items.
> 

Hm, Ok. I must be missing something about the rolling transaction
guarantees.

> If it's a single transaction, and we join all the locked items
> for relogging into it in a single transaction commit, then we are
> fine - we don't try to regrant log space while holding locked items
> that could pin the tail of the log.
> 
> We *can* use a rolling transaction if we do this - the AIL has a
> permanent transaction (plus ticket!) allocated at mount time with
> a log count of zero, we can then steal reserve/write grant head
> space into that ticket at CIL commit time as I mentioned
> previously. We do a loop like above, but it's basically:
> 
> {
> 	LIST_HEAD(tmp);
> 
> 	spin_lock(&ailp->ail_lock);
>         if (list_empty(&ailp->ail_relog_list)) {
> 		spin_unlock(&ailp->ail_lock);
> 		return;
> 	}
> 
> 	list_splice_init(&ilp->ail_relog_list, &tmp);
> 	spin_unlock(&ailp->ail_lock);
> 
> 	xfs_ail_relog_items(ail, &tmp);
> }
> 
> This allows the AIL to keep working and building up a new relog
> list as it goes along. Hence we can work on our list without interruption
> or needing to repeatedly take the AIL lock just to get items from the
> list.
> 

Yeah, I was planning to splice the list as such to avoid cycling the lock
so much regardless..

> And xfs_ail_relog_items() does something like this:
> 
> {
> 	struct xfs_trans	*tp;
> 
> 	/*
> 	 * Make CIL committers trying to change relog status of log
> 	 * items wait for us to stabilise the relog transaction
> 	 * again by committing the current relog list and rolling
> 	 * the transaction.
> 	 */
> 	down_write(&ail->ail_relog_lock);
> 	tp = ail->relog_trans;
> 
> 	while (!list_empty(&ailp->ail_relog_list)) {
> 		lip = list_first_entry(&ailp->ail_relog_list,
> 					struct xfs_log_item, li_trans);
> 		list_del_init(&lip->li_trans);
> 
> 		xfs_trans_add_item(tp, lip);
> 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> 		tp->t_flags |= XFS_TRANS_DIRTY;
> 	}
> 
> 	error = xfs_trans_roll(&tp);
> 	if (error) {
> 		SHUTDOWN!
> 	}
> 	ail->relog_trans = tp;
> 	up_write(&ail->ail_relog_lock);
> }
> 

I think I follow.. The fundamental difference here is basically that we
commit whatever we locked, right? IOW, the current approach _could_
technically be corrected, but it would have to lock one item at a time
(ugh) rather than build up a queue of locked items..?

The reservation stealing approach facilitates this batching because
instead of only having a guarantee that we can commit one max sized
relog item at a time, we can commit however many we have queued because
sufficient reservation has already been acquired.

That does raise another issue in that we presumably want some kind of
maximum transaction size and/or maximum outstanding relog reservation
with the above approach. Otherwise it could be possible for a workload
to go off the rails without any of the throttling or heuristics
incorporated in the current (mostly) fixed transaction sizes. Perhaps
it's reasonable enough to cap outstanding relog reservation to the max
transaction size and return an error to callers that attempt to exceed
it..? It's not clear to me if that would impact the prospective scrub
use case. Hmmm.. maybe the right thing to do is cap the size of the
current relog queue so we cap the size of the relog transaction without
necessarily capping the max outstanding relog reservation. Thoughts on
that?
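
E.g., the relog queue cap could be as simple as something like this in
the xfsaild push path (XFS_RELOG_MAX_ITEMS and ail_relog_count are
hypothetical, sketch only):

	case XFS_ITEM_RELOG:
		/* defer to the next push cycle once the batch is full */
		if (ailp->ail_relog_count >= XFS_RELOG_MAX_ITEMS)
			break;
		ailp->ail_relog_count++;
		list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
		set_bit(XFS_LI_RELOGGED, &lip->li_flags);
		break;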

> > @@ -426,6 +495,23 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			/*
> > +			 * The item requires a relog. Add to the pending relog
> > +			 * list and set the relogged bit to prevent further
> > +			 * relog requests. The relog bit and ticket reference
> > +			 * can be dropped from the item at any point, so hold a
> > +			 * relog ticket reference for the pending relog list to
> > +			 * ensure the ticket stays around.
> > +			 */
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			if (list_empty(&ailp->ail_relog_list))
> > +				xfs_log_ticket_get(ailp->ail_relog_tic);
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> 
> So the XFS_LI_RELOGGED bit indicates that the item is locked on the
> relog list? That means nothing else in the transaction subsystem
> will set that until the item is unlocked, right? Which is when it
> ends up back on the CIL. This then gets cleared by the CIL being
> forced and the item moved forward in the AIL.
> 
> IOWs, a log force clears this flag until the AIL relogs it again.
> Why do we need this flag to issue wakeups when the wait loop does
> blocking log forces? i.e. it will have already waited for the flag
> to be cleared by waiting for the log force....
> 

The bit tracks when a relog item has been queued. It's set when first
added to the relog queue and is cleared when the item lands back in the
on-disk log. The primary purpose of the bit is thus to prevent spurious
relogging of an item that's already been dropped back into the CIL,
meaning a move in the physical log is already pending.

The wait/wake is a secondary usage to isolate the item to the AIL once
relogging is disabled. The wait covers the case where the item is queued
on the relog list but not committed back to the log subsystem. Otherwise
the log force serves no purpose.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-03-02  7:47   ` Dave Chinner
@ 2020-03-02 19:00     ` Brian Foster
  2020-03-03  0:09       ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-03-02 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 06:47:28PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:19AM -0500, Brian Foster wrote:
> > Add a quick and dirty implementation of buffer relogging support.
> > There is currently no use case for buffer relogging. This is for
> > experimental use only and serves as an example to demonstrate the
> > ability to relog arbitrary items in the future, if necessary.
> > 
> > Add a hook to enable relogging a buffer in a transaction, update the
> > buffer log item handlers to support relogged BLIs and update the
> > relog handler to join the relogged buffer to the relog transaction.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> .....
> >  /*
> > @@ -187,9 +188,21 @@ xfs_ail_relog(
> >  			xfs_log_ticket_put(ailp->ail_relog_tic);
> >  		spin_unlock(&ailp->ail_lock);
> >  
> > -		xfs_trans_add_item(tp, lip);
> > -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > -		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/*
> > +		 * TODO: Ideally, relog transaction management would be pushed
> > +		 * down into the ->iop_push() callbacks rather than playing
> > +		 * games with ->li_trans and looking at log item types here.
> > +		 */
> > +		if (lip->li_type == XFS_LI_BUF) {
> > +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> > +			xfs_buf_hold(bli->bli_buf);
> 
> What is this for? The bli already has a reference to the buffer.
> 

The buffer reference is for the transaction. It is analogous to the
reference acquired in xfs_buf_find() via xfs_trans_[get|read]_buf(), for
example.

> > +			xfs_trans_bjoin(tp, bli->bli_buf);
> > +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> > +		} else {
> > +			xfs_trans_add_item(tp, lip);
> > +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +			tp->t_flags |= XFS_TRANS_DIRTY;
> > +		}
> 
> Really, this should be a xfs_item_ops callout. i.e.
> 
> 		lip->li_ops->iop_relog(lip);
> 

Yeah, I've already done pretty much this in my local tree. The callback
also takes the transaction because that's the code that knows how to add
a particular type of item to a transaction. I didn't require a callback
for the else case above where no special handling is required
(quotaoff), so the callback is optional, but I'm not opposed to
reworking things such that ->iop_relog() is always required if that is
preferred.
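
I.e., something along these lines for the buffer case (illustrative
sketch only):

static void
xfs_buf_item_relog(
	struct xfs_log_item	*lip,
	struct xfs_trans	*tp)
{
	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);

	/* reference for the relog transaction, released at commit */
	xfs_buf_hold(bip->bli_buf);
	xfs_trans_bjoin(tp, bip->bli_buf);
	xfs_trans_dirty_buf(tp, bip->bli_buf);
}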

> And then a) it doesn't matter really where we call it from, and b)
> it becomes fully generic and we can implement the callout
> as future functionality requires.
> 

Yep.

Brian

> However, we have to make sure that the current transaction we are
> running has the correct space usage accounted to it, so I think this
> callout really does need to be done in a tight loop iterating and
> accounting all the relog items into the transaction without outside
> interference.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-02 18:06     ` Brian Foster
@ 2020-03-02 23:25       ` Dave Chinner
  2020-03-03  4:07         ` Dave Chinner
  2020-03-03 14:13         ` Brian Foster
  0 siblings, 2 replies; 59+ messages in thread
From: Dave Chinner @ 2020-03-02 23:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > > Automatic item relogging will occur from xfsaild context. xfsaild
> > > cannot acquire log reservation itself because it is also responsible
> > > for writeback and thus making used log reservation available again.
> > > Since there is no guarantee log reservation is available by the time
> > > a relogged item reaches the AIL, this is prone to deadlock.

.....

> > > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> > >  	tp->t_firstblock = NULLFSBLOCK;
> > >  
> > >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > > -	if (error) {
> > > -		xfs_trans_cancel(tp);
> > > -		return error;
> > > +	if (error)
> > > +		goto error;
> > > +
> > > +	if (flags & XFS_TRANS_RELOG) {
> > > +		error = xfs_trans_ail_relog_reserve(&tp);
> > > +		if (error)
> > > +			goto error;
> > >  	}
> > 
> > Hmmmm. So we are putting the AIL lock directly into the transaction
> > reserve path? xfs_trans_reserve() goes out of its way to be
> > lockless in the fast paths, so if you want this to be a generic
> > mechanism that any transaction can use, the reservation needs to be
> > completely lockless.
> > 
> 
> Yeah, this is one of those warts mentioned in the cover letter. I wasn't
> planning to make it lockless so much as to use an independent lock, but I
> can take a closer look at that if it still looks like a contention point
> after getting through the bigger picture feedback..

If relogging ends up getting widely used, then a filesystem global
lock in the transaction reserve path is guaranteed to be a
contention point. :)

> > > +	/* release the relog ticket reference if this transaction holds one */
> > > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > > +		xfs_trans_ail_relog_put(mp);
> > > +
> > 
> > That looks ... interesting. xfs_trans_ail_relog_put() can call
> > xfs_log_done(), which means it can do a log write, which means the
> > commit lsn for this transaction could change. Hence to make a relog
> > permanent as a result of a sync transaction, we'd need the
> > commit_lsn of the AIL relog ticket write here, not that of the
> > original transaction that was written.
> > 
> 
> Hmm.. xfs_log_done() is eventually called for every transaction ticket,
> so I don't think it should be doing log writes based on that.

  xfs_log_done()
    xlog_commit_record()
      xlog_write(XLOG_COMMIT_TRANS)
        xlog_state_release_iclog()
	  xlog_sync()
	    xlog_write_iclog()
	      submit_bio()
>
> The
> _ail_relog_put() path is basically just intended to properly terminate
> the relog ticket based on the existing transaction completion path. I'd
> have to dig a bit further into the code here, but e.g. we don't pass an
> iclog via this path so if we were attempting to write I'd probably have
> seen assert failures (or worse) by now.

xlog_write() code handles a missing commit iclog pointer just fine.
Indeed, that's the reason xlog_write() may issue log IO itself:

....
        spin_lock(&log->l_icloglock);
        xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
        if (commit_iclog) {
                ASSERT(flags & XLOG_COMMIT_TRANS);
                *commit_iclog = iclog;
        } else {
                error = xlog_state_release_iclog(log, iclog);
        }
        spin_unlock(&log->l_icloglock);

i.e. if you don't provide xlog_write with a commit_iclog pointer
then you are saying "caller does not call xfs_log_release_iclog()
itself, so please release the iclog once the log iovec has been
written to the iclog". Hence the way you've written this code
explicitly tells xlog_write() to issue log IO on your behalf if it is
necessary....

> Indeed, this is similar to a
> cancel of a clean transaction, the difference being this is open-coded
> because we only have the relog ticket in this context.

A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will
never call xlog_commit_record() on it. XLOG_TIC_INITED is used to
indicate a start record has been written to the log for a permanent
transaction so it doesn't get written again when it is relogged.....

.... Hmmmm .....

You might be right.

It looks like delayed logging made XLOG_TIC_INITED completely
redundant. xfs_trans_commit() no longer writes directly to iclogs,
so the normal transaction tickets will never have XLOG_TIC_INITED
cleared. Hence on any kernel since delayed logging became the only
option, we will never get log writes from the xfs_trans* interfaces.

And we no longer do an xlog_write() call for every item in the
transaction after we format them into a log iovec. Instead, the CIL
just passes one big list of iovecs to a single xlog_write() call,
so all callers only make a single xlog_write call per physical log
transaction.

OK, XLOG_TIC_INITED is redundant, and should be removed. And
xfs_log_done() needs to be split into two, one for releasing the
ticket, one for completing the xlog_write() call. Compile tested
only patch below for you :P

> > <GRRRRIIINNNNDDDD>
> > 
> > <CLUNK>
> > 
> > Ahhhhh.
> > 
> 
> Heh. :P
> 
> > This is the reason why the CIL ended up using a reservation stealing
> > mechanism for it's ticket. Every log reservation takes into account
> > the worst case log overhead for that single transaction - that means
> > there are no special up front hooks or changes to the log
> > reservation to support the CIL, and it also means that the CIL can
> > steal what it needs from the current ticket at commit time. Then the
> > CIL can commit at any time, knowing it has enough space to write all
> > the items it tracks to the log.
> > 
> > This avoided the transaction reservation mechanism from needing to
> > know anything about the CIL or that it was stealing space from
> > committing transactions for its own private ticket. It greatly
> > simplified everything, and it's the reason that the CIL succeeded
> > where several other attempts to relog items in memory failed....
> > 
> > So....
> > 
> > Why can't the AIL steal the reservation it needs from the current
> > transaction when the item that needs relogging is being committed?
> 
> I think we can. My first proposal[1] implemented something like this.
> The broader design was much less fleshed out, so it was crudely
> implemented as more of a commit time hook to relog pending items based
> on the currently committing transaction as opposed to reservation
> stealing for an independent relog ticket. This was actually based on the
> observation that the CIL already did something similar (though I didn't
> have all of the background as to why that approach was used for the
> CIL).
> 
> The feedback at the time was to consider moving in a different direction
> that didn't involve stealing reservation from transactions, which gave
> me the impression that approach wasn't going to be acceptable. Granted,

I don't recall that discussion steering away from stealing
reservations; it was more about mechanics like using transaction
callbacks and other possible mechanisms for enabling relogging.

> the first few variations of this were widely different in approach and I
> don't think there was any explicit objection to reservation stealing
> itself; it just seemed a better development approach to work out an
> ideal design based on fundamentals as opposed to limiting the scope to
> contexts that might facilitate reservation stealing.

Right, it wasn't clear how to best implement this at the time,
because we were all just struggling to get our heads around the
problem scope. :)

> Despite the fact
> that this code is hairy, the fundamental approach is rather
> straightforward and simple. The hairiness mostly comes from
> attempting to make things generic and dynamic and could certainly use
> some cleanup.
> 
> Now that this is at a point where I'm reasonably confident it is
> fundamentally correct, I'm quite happy to revisit a reservation stealing
> approach if that improves functional operation. This can serve as a decent
> reference/baseline implementation for that, I think.

*nod*

> 
> [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/
> 
> > i.e. we add the "relog overhead" to the permanent transaction that
> > requires items to be relogged by the AIL, and when that item is
> > formatted into the CIL we also check to see if it is currently
> > marked as "reloggable". If it's not, we steal the relogging
> > reservation from the transaction (essentially the item's formatted
> > size) and pass it to the AIL's private relog ticket, setting the log
> > item to be "reloggable" so the reservation isn't stolen over and
> > over again as the object is relogged in the CIL.
> > 
> 
> Ok, so this implies to me we'd update the per-transaction reservation
> calculations to incorporate worst case relog overhead for each
> transaction. E.g., the quotaoff transaction would double the intent size,
> a transaction that might relog a buffer adds a max buffer size overhead,
> etc. Right?

Essentially, yes.

And the reloggable items and reservation can be clearly documented
in the per-transaction reservation explanation comments like we
already do for all the structures that are modified in a
transaction.

> > Hence as we commit reloggable items to the CIL, the AIL ticket
> > reservation grows with each of those items marked as reloggable.
> > As we relog items to the CIL and the CIL grows its reservation via
> > the size delta, the AIL reservation can also be updated with the
> > same delta.
> > 
> > Hence the AIL will always know exactly how much space it needs to relog all
> > the items it holds for relogging, and because it's been stolen from
> > the original transaction it is, like the CIL ticket reservation,
> > considered used space in the log. Hence the log space required for
> > relogging items via the AIL is correctly accounted for without
> > needing up front static, per-item reservations.
> > 
> 
> By this I assume you're referring to avoiding the need to reserve ->
> donate -> roll in the current scheme. Instead, we'd acquire the larger
> reservation up front and only steal when necessary, which is less
> overhead because we don't need to always replenish the full transaction.

Yes - we acquire the initial relog reservation by stealing it
from the current transaction. This gives the AIL ticket the reserved
space (both reserve and write grant space) it needs to relog the
item once.

When the item is relogged by the AIL, the transaction commit
immediately regrants the reservation space that was just consumed,
and the trans_roll regrants the write grant space the commit just
consumed from the ticket via its call to xfs_trans_reserve(). Hence
we only need to steal the space necessary to do the first relogging
of the item as the AIL will hold that reservation until the high
level code turns off relogging for that log item.
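
In pseudo-code, the commit time steal would be something like this
(sketch only; the fill helper is made up, and delta updates for
already funded items are handled separately as described above):

/* at CIL insertion, once the item has been formatted */
if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags)) {
	/* first time: steal the item's full formatted size */
	int	nbytes = lip->li_lv->lv_bytes;

	tp->t_ticket->t_curr_res -= nbytes;
	xfs_ail_relog_ticket_fill(log->l_ailp, nbytes);
}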

> > Thoughts?
> 
> <thinking out loud from here on..>
> 
> One thing that comes to mind thinking about this is dealing with
> batching (relog multiple items per roll). This code doesn't handle that
> yet, but I anticipate it being a requirement and it's fairly easy to
> update the current scheme to support a fixed item count per relog-roll.
> 
> A stealing approach potentially complicates things when it comes to
> batching because we have per-item reservation granularity to consider.
> For example, consider if we had a variety of different relog item types
> active at once, a subset are being relogged while another subset are
> being disabled (before ever needing to be relogged).

So concurrent "active relogging" + "start relogging" + "stop
relogging"?

> For one, we'd have
> to be careful to not release reservation while it might be accounted to
> an active relog transaction that is currently rolling with some other
> items, etc. There's also a potential quirk in handling reservation of a
> relogged item that is cancelled while it's being relogged, but that
> might be more of an implementation detail.

Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
my example relog transaction item add loop + commit function in?

i.e. while we are building and committing a relog transaction, we
hold off the transactions that are trying to add/remove their items
to/from the relog list. Hence the reservation stealing accounting in
the ticket can be serialised against the transactional use of the
ticket.

It's basically the same method we use for serialising addition to
the CIL in transaction commit against CIL pushes draining the
current list for log writes (rwsem for add/push serialisation, spin
lock for concurrent add serialisation under the rwsem).
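
i.e. the usage pattern would be roughly (sketch):

/* transaction side: adjust relog reservation for an item */
down_read(&ailp->ail_relog_lock);
/* ... steal to/return from the AIL relog ticket ... */
up_read(&ailp->ail_relog_lock);

/* AIL side: stabilise the ticket while committing a relog batch */
down_write(&ailp->ail_relog_lock);
/* ... join queued items, xfs_trans_roll(), update relog_trans ... */
up_write(&ailp->ail_relog_lock);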

> I don't think that's a show stopper, but rather just something I'd like
> to have factored into the design from the start. One option could be to

*nod*

> maintain a separate counter of active relog reservation aside from the
> actual relog ticket. That way the relog ticket could just pull from this
> relog reservation pool based on the current item(s) being relogged
> asynchronously from different tasks that might add or remove reservation
> from the pool for separate items. That might get a little wonky when we
> consider the relog ticket needs to pull from the pool and then put
> something back if the item is still reloggable after the relog
> transaction rolls.

Right, that's the whole problem that we solve via a) stealing the
initial reserve/write grant space at commit time and b) serialising
stealing vs transactional use of the ticket.

That is, after the roll, the ticket has a full reserve and grant
space reservation for all the items accounted to the relog ticket.
Every new relog item added to the ticket (or is relogged in the CIL
and uses more space) adds the full required reserve/write grant
space to the the relog ticket. Hence the relog ticket always has
current log space reserved to commit the entire set of items tagged
as reloggable. And by avoiding modifying the ticket while we are
actively processing the relog transaction, we don't screw up the
ticket accounting in the middle of the transaction....

> Another problem is that reloggable items are still otherwise usable once
> they are unlocked. So for example we'd have to account for a situation
> where one transaction dirties a buffer, enables relogging, commits and
> then some other transaction dirties more of the buffer and commits
> without caring whether the buffer was relog enabled or not.

yup, that's the delta size updates from the CIL commit. i.e. if we
relog an item to the CIL that has the XFS_LI_RELOG flag already set
on it, the change in size that we steal for the CIL ticket also
needs to be stolen for the AIL ticket. i.e. we already do almost all
the work we need to handle this.

> Unless
> runtime relog reservation is always worst case, that's a subtle path to
> reservation overrun in the relog transaction.

Yes, but it's a problem the CIL already solves for us :P

> I'm actually wondering if a simpler approach to tracking stolen
> reservation is to add a new field that tracks active relog reservation
> to each supported log item.  Then the initial transaction enables
> relogging with a simple transfer from the transaction to the item. The
> relog transaction knows how much reservation it can assign based on the
> current population of items and requires no further reservation
> accounting because it rolls and thus automatically reacquires relog
> reservation for each associated item. The path that clears relog state
> transfers reservation from the item back to a transaction simply so the
> reservation can be released back to the pool of unused log space. Note
> that clearing relog state doesn't require a transaction in the current
> implementation, but we could easily define a helper to allocate an empty
> transaction, clear relog and reclaim relog reservation and then cancel
> for those contexts.

I don't think we need any of that - the AIL ticket only needs to be
kept up to date with the changes to the formatted size of the item
marked for relogging. It's no different to the CIL ticket
reservation accounting from that perspective.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-03-02 18:52     ` Brian Foster
@ 2020-03-03  0:06       ` Dave Chinner
  2020-03-03 14:14         ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-03  0:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 01:52:52PM -0500, Brian Foster wrote:
> On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > > +		spin_unlock(&ailp->ail_lock);
> > > +
> > > +		xfs_trans_add_item(tp, lip);
> > > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > > +		/* XXX: include ticket owner task fix */
> > > +		error = xfs_trans_roll(&tp);
> > 
> > So the reservation for this ticket is going to be regranted over and
> > over again and space reserved repeatedly as we work through the
> > locked relog items one at a time?
> > 
> > Unfortunately, this violates the rule that prevents rolling
> > transactions from deadlocking. That is, any object that is held
> > locked across the transaction commit and regrant that *might pin the
> > tail of the log* must be relogged in the transaction to move the
> > item forward and prevent it from pinning the tail of the log.
> > 
> > IOWs, if one of the items later in the relog list pins the tail of
> > the log we will end up sleeping here:
> > 
> >   xfs_trans_roll()
> >     xfs_trans_reserve
> >       xfs_log_regrant
> >         xlog_grant_head_check(need_bytes)
> > 	  xlog_grant_head_wait()
> > 
> > waiting for the write grant head to move. And it never will, because
> > we hold the lock on that item so the AIL can't push it out.
> > IOWs, using a rolling transaction per relog item will not work for
> > processing multiple relog items.
> > 
> 
> Hm, Ok. I must be missing something about the rolling transaction
> guarantees.

Ok, I know you understand some of the rules, but because I don't
know quite which bit of this complex game you are missing, I'll go
over it from the start (sorry for repeating things you know!).

We're not allowed to hold locked items that are in the AIL over a
transaction reservation call because the transaction reservation may
need to push the log to free up space in the log. Writing back
metadata requires locking the item to flush it, and if we hold it
locked then it can't be flushed. Hence if it pins the tail of the log
and prevents the the reservation from making space available, we
deadlock.

Normally, the only thing that is pinned across a transaction roll is
an inode, and the inode is being logged in every transaction. Hence
it is being continually moved to the head of the log and so can't
pin the tail of the log and prevent the reservation making progress.

The problem here is that having a log reservation for a modification
doesn't guarantee you that log space is immediately available - all
it guarantees you is that the log space will be available if the log
tail is free to move forward.

That's why there are two grant heads. The reservation grant head is
the one that guarantees that you have space available in the log for
the rolling transaction. That is always immediately regranted during
transaction commit, hence we guarantee the rolling transaction will
always fit in the log. The reserve head ensures we never overcommit
the available log space.

The second grant head is the write head, which tracks the space
immediately available to physically write into the log. When your
write reservation runs out (i.e. after the number of rolls the log
count in the initial transaction reservation specifies), we have to
regrant physical space for the next transaction in the rolling
chain. If the log is physically full, we have to wait for physical
space to be made available.

The only way to increase the amount of physical space available in
the log is to have the tail move forwards. xfs_trans_reserve() does
that by setting a push target for the AIL to flush all the metadata
older than that target. It then blocks waiting for the tail of the
log to move. When the tail of the log moves, the available write
grant space increases because the log head can now physically move
forwards in the log.

Hence when the log is full and we are in a tail pushing situation,
new transactions wait on the reserve grant head to get the log space
guarantee they require. Long duration rolling transactions already
have a log space guarantee from the reserve grant head, so they
end up waiting for the physical log space they require on the write
grant head.

The tail pinning deadlock rolling transactions can trigger is
against the write grant head, not the reserve grant head. If the
tail of the log cannot move, then the write grant space never
increases and xfs_trans_reserve() blocks forever. Hence we cannot
call xfs_trans_roll() whilst holding items locked that have a high
probability of being at the tail of the log.

Given that this relogging functionality is all about preventing
items from either pinning the tail of the log or disappearing off
the tail of the log because they aren't relogged, we have to be very
careful about holding them locked over operations that require
the AIL to be able to make forwards progress....


> > If it's a single transaction, and we join all the locked items
> > for relogging into it in a single transaction commit, then we are
> > fine - we don't try to regrant log space while holding locked items
> > that could pin the tail of the log.
> > 
> > We *can* use a rolling transaction if we do this - the AIL has a
> > permanent transaction (plus ticket!) allocated at mount time with
> > a log count of zero, we can then steal reserve/write grant head
> > space into that ticket at CIL commit time as I mentioned
> > previously. We do a loop like above, but it's basically:
> > 
> > {
> > 	LIST_HEAD(tmp);
> > 
> > 	spin_lock(&ailp->ail_lock);
> >         if (list_empty(&ailp->ail_relog_list)) {
> > 		spin_unlock(&ailp->ail_lock);
> > 		return;
> > 	}
> > 
> > 	list_splice_init(&ilp->ail_relog_list, &tmp);
> > 	spin_unlock(&ailp->ail_lock);
> > 
> > 	xfs_ail_relog_items(ail, &tmp);
> > }
> > 
> > This allows the AIL to keep working and building up a new relog
> > list as it goes along. Hence we can work on our list without interruption
> > or needing to repeatedly take the AIL lock just to get items from the
> > list.
> > 
> 
> Yeah, I was planning to splice the list as such to avoid cycling the lock
> so much regardless..
> 
> > And xfs_ail_relog_items() does something like this:
> > 
> > {
> > 	struct xfs_trans	*tp;
> > 
> > 	/*
> > 	 * Make CIL committers trying to change relog status of log
> > 	 * items wait for us to stabilise the relog transaction
> > 	 * again by committing the current relog list and rolling
> > 	 * the transaction.
> > 	 */
> > 	down_write(&ail->ail_relog_lock);
> > 	tp = ail->relog_trans;
> > 
> > 	while (!list_empty(&ailp->ail_relog_list)) {
> > 		lip = list_first_entry(&ailp->ail_relog_list,
> > 					struct xfs_log_item, li_trans);
> > 		list_del_init(&lip->li_trans);
> > 
> > 		xfs_trans_add_item(tp, lip);
> > 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > 		tp->t_flags |= XFS_TRANS_DIRTY;
> > 	}
> > 
> > 	error = xfs_trans_roll(&tp);
> > 	if (error) {
> > 		SHUTDOWN!
> > 	}
> > 	ail->relog_trans = tp;
> > 	up_write(&ail->ail_relog_lock);
> > }
> > 
> 
> I think I follow.. The fundamental difference here is basically that we
> commit whatever we locked, right? IOW, the current approach _could_
> technically be corrected, but it would have to lock one item at a time
> (ugh) rather than build up a queue of locked items..?
> 
> The reservation stealing approach facilitates this batching because
> instead of only having a guarantee that we can commit one max sized
> relog item at a time, we can commit however many we have queued because
> sufficient reservation has already been acquired.

Yes.

> That does raise another issue in that we presumably want some kind of
> maximum transaction size and/or maximum outstanding relog reservation
> with the above approach. Otherwise it could be possible for a workload
> to go off the rails without any of the throttling or heuristics
> incorporated in the current (mostly) fixed transaction sizes.

Possibly, though I really don't have any intuition on how big the
relog reservation could possible grow. Right now it doesn't seem
like there's a space problem (single item!), so perhaps this is
something we can defer until we have some further understanding of
how many relog items are active at any given time?

> Perhaps
> it's reasonable enough to cap outstanding relog reservation to the max
> transaction size and return an error to callers that attempt to exceed

The max transaction size the CIL currently uses for large logs is 32MB.
That's an awful lot of relog items....

> it..? It's not clear to me if that would impact the prospective scrub
> use case. Hmmm.. maybe the right thing to do is cap the size of the
> current relog queue so we cap the size of the relog transaction without
> necessarily capping the max outstanding relog reservation. Thoughts on
> that?

Maybe.

Though this seems like an ideal candidate for a relog reservation
grant head. i.e. the transaction reserve structure we pass to
xfs_trans_reserve() has a new field that contains the relog
reservation needed for this transaction, and we use the existing
lockless grant head accounting infrastructure to throttle relogging
to an acceptable bound...
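
i.e. something like this (sketch only, the new names are
hypothetical):

/* a third grant head dedicated to outstanding relog reservation */
struct xlog {
	...
	struct xlog_grant_head	l_relog_head;
};

/* in the log reservation path, alongside the existing heads */
if (resp->tr_relogres) {
	error = xlog_grant_head_check(log, &log->l_relog_head,
				      tic, &need_bytes);
	if (error)
		return error;
	xlog_grant_add_space(log, &log->l_relog_head.grant, need_bytes);
}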

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-03-02 19:00     ` Brian Foster
@ 2020-03-03  0:09       ` Dave Chinner
  2020-03-03 14:14         ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-03  0:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 02, 2020 at 02:00:34PM -0500, Brian Foster wrote:
> On Mon, Mar 02, 2020 at 06:47:28PM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 08:43:19AM -0500, Brian Foster wrote:
> > > Add a quick and dirty implementation of buffer relogging support.
> > > There is currently no use case for buffer relogging. This is for
> > > experimental use only and serves as an example to demonstrate the
> > > ability to relog arbitrary items in the future, if necessary.
> > > 
> > > Add a hook to enable relogging a buffer in a transaction, update the
> > > buffer log item handlers to support relogged BLIs and update the
> > > relog handler to join the relogged buffer to the relog transaction.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > .....
> > >  /*
> > > @@ -187,9 +188,21 @@ xfs_ail_relog(
> > >  			xfs_log_ticket_put(ailp->ail_relog_tic);
> > >  		spin_unlock(&ailp->ail_lock);
> > >  
> > > -		xfs_trans_add_item(tp, lip);
> > > -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > -		tp->t_flags |= XFS_TRANS_DIRTY;
> > > +		/*
> > > +		 * TODO: Ideally, relog transaction management would be pushed
> > > +		 * down into the ->iop_push() callbacks rather than playing
> > > +		 * games with ->li_trans and looking at log item types here.
> > > +		 */
> > > +		if (lip->li_type == XFS_LI_BUF) {
> > > +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> > > +			xfs_buf_hold(bli->bli_buf);
> > 
> > What is this for? The bli already has a reference to the buffer.
> > 
> 
> The buffer reference is for the transaction. It is analogous to the
> reference acquired in xfs_buf_find() via xfs_trans_[get|read]_buf(), for
> example.

Ah. Comment please :P

> > > +			xfs_trans_bjoin(tp, bli->bli_buf);
> > > +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> > > +		} else {
> > > +			xfs_trans_add_item(tp, lip);
> > > +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > +			tp->t_flags |= XFS_TRANS_DIRTY;
> > > +		}
> > 
> > Really, this should be a xfs_item_ops callout. i.e.
> > 
> > 		lip->li_ops->iop_relog(lip);
> > 
> 
> Yeah, I've already done pretty much this in my local tree. The callback
> also takes the transaction because that's the code that knows how to add
> a particular type of item to a transaction. I didn't require a callback
> for the else case above where no special handling is required
> (quotaoff), so the callback is optional, but I'm not opposed to
> reworking things such that ->iop_relog() is always required if that is
> preferred.

I think I'd prefer to keep things simple right now. Making it an
unconditional callout keeps this code simple, and if there's a
common implementation, add a generic function for it that the items
use.
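
i.e. the relog loop always does (sketch based on your current else
branch; the generic helper name here is made up):

	lip->li_ops->iop_relog(lip, tp);

and items that need no special handling point ->iop_relog at
something like:

	void
	xfs_relog_item_generic(
		struct xfs_log_item	*lip,
		struct xfs_trans	*tp)
	{
		xfs_trans_add_item(tp, lip);
		set_bit(XFS_LI_DIRTY, &lip->li_flags);
		tp->t_flags |= XFS_TRANS_DIRTY;
	}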

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-02 23:25       ` Dave Chinner
@ 2020-03-03  4:07         ` Dave Chinner
  2020-03-03 15:12           ` Brian Foster
  2020-03-03 14:13         ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-03  4:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> OK, XLOG_TIC_INITED is redundant, and should be removed. And
> xfs_log_done() needs to be split into two, one for releasing the
> ticket, one for completing the xlog_write() call. Compile tested
> only patch below for you :P

And now with sample patch.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: kill XLOG_TIC_INITED

From: Dave Chinner <dchinner@redhat.com>

Delayed logging made this redundant as we never directly write
transactions to the log anymore. Hence we no longer make multiple
xlog_write() calls for a transaction as we format individual items
in a transaction, and hence don't need to keep track of whether we
should be writing a start record for every xlog_write call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 79 ++++++++++++++++++---------------------------------
 fs/xfs/xfs_log.h      |  4 ---
 fs/xfs/xfs_log_cil.c  | 13 +++++----
 fs/xfs/xfs_log_priv.h | 18 ++++++------
 fs/xfs/xfs_trans.c    | 24 ++++++++--------
 5 files changed, 55 insertions(+), 83 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f6006d94a581..a45f3eefee39 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -496,8 +496,8 @@ xfs_log_reserve(
  * This routine is called when a user of a log manager ticket is done with
  * the reservation.  If the ticket was ever used, then a commit record for
  * the associated transaction is written out as a log operation header with
- * no data.  The flag XLOG_TIC_INITED is set when the first write occurs with
- * a given ticket.  If the ticket was one with a permanent reservation, then
+ * no data. 
+ * If the ticket was one with a permanent reservation, then
  * a few operations are done differently.  Permanent reservation tickets by
  * default don't release the reservation.  They just commit the current
  * transaction with the belief that the reservation is still needed.  A flag
@@ -506,49 +506,38 @@ xfs_log_reserve(
  * the inited state again.  By doing this, a start record will be written
  * out when the next write occurs.
  */
-xfs_lsn_t
-xfs_log_done(
-	struct xfs_mount	*mp,
+int
+xlog_write_done(
+	struct xlog		*log,
 	struct xlog_ticket	*ticket,
 	struct xlog_in_core	**iclog,
-	bool			regrant)
+	xfs_lsn_t		*lsn)
 {
-	struct xlog		*log = mp->m_log;
-	xfs_lsn_t		lsn = 0;
-
-	if (XLOG_FORCED_SHUTDOWN(log) ||
-	    /*
-	     * If nothing was ever written, don't write out commit record.
-	     * If we get an error, just continue and give back the log ticket.
-	     */
-	    (((ticket->t_flags & XLOG_TIC_INITED) == 0) &&
-	     (xlog_commit_record(log, ticket, iclog, &lsn)))) {
-		lsn = (xfs_lsn_t) -1;
-		regrant = false;
-	}
+	if (XLOG_FORCED_SHUTDOWN(log))
+		return -EIO;
 
+	return xlog_commit_record(log, ticket, iclog, lsn);
+}
 
+/*
+ * Release or regrant the ticket reservation now the transaction is done with
+ * it depending on caller context. Rolling transactions need the ticket
+ * regranted, otherwise we release it completely.
+ */
+void
+xlog_ticket_done(
+	struct xlog		*log,
+	struct xlog_ticket	*ticket,
+	bool			regrant)
+{
 	if (!regrant) {
 		trace_xfs_log_done_nonperm(log, ticket);
-
-		/*
-		 * Release ticket if not permanent reservation or a specific
-		 * request has been made to release a permanent reservation.
-		 */
 		xlog_ungrant_log_space(log, ticket);
 	} else {
 		trace_xfs_log_done_perm(log, ticket);
-
 		xlog_regrant_reserve_log_space(log, ticket);
-		/* If this ticket was a permanent reservation and we aren't
-		 * trying to release it, reset the inited flags; so next time
-		 * we write, a start record will be written out.
-		 */
-		ticket->t_flags |= XLOG_TIC_INITED;
 	}
-
 	xfs_log_ticket_put(ticket);
-	return lsn;
 }
 
 static bool
@@ -2148,8 +2137,9 @@ xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector.  Each region gets
- * its own xlog_op_header_t and may need to be double word aligned.
+ * Calculate the potential space needed by the log vector.  We always write a
+ * start record, and each region gets its own xlog_op_header_t and may need to
+ * be double word aligned.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length(
 	struct xfs_log_vec	*log_vector)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = 0;
+	int			headers = 1;
 	int			len = 0;
 	int			i;
 
-	/* acct for start rec of xact */
-	if (ticket->t_flags & XLOG_TIC_INITED)
-		headers++;
-
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2195,17 +2181,11 @@ xlog_write_start_rec(
 	struct xlog_op_header	*ophdr,
 	struct xlog_ticket	*ticket)
 {
-	if (!(ticket->t_flags & XLOG_TIC_INITED))
-		return 0;
-
 	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
 	ophdr->oh_clientid = ticket->t_clientid;
 	ophdr->oh_len = 0;
 	ophdr->oh_flags = XLOG_START_TRANS;
 	ophdr->oh_res2 = 0;
-
-	ticket->t_flags &= ~XLOG_TIC_INITED;
-
 	return sizeof(struct xlog_op_header);
 }
 
@@ -2410,12 +2390,10 @@ xlog_write(
 	len = xlog_write_calc_vec_length(ticket, log_vector);
 
 	/*
-	 * Region headers and bytes are already accounted for.
-	 * We only need to take into account start records and
-	 * split regions in this function.
+	 * Region headers and bytes are already accounted for.  We only need to
+	 * take into account start records and split regions in this function.
 	 */
-	if (ticket->t_flags & XLOG_TIC_INITED)
-		ticket->t_curr_res -= sizeof(xlog_op_header_t);
+	ticket->t_curr_res -= sizeof(xlog_op_header_t);
 
 	/*
 	 * Commit record headers need to be accounted for. These
@@ -3609,7 +3587,6 @@ xlog_ticket_alloc(
 	tic->t_ocnt		= cnt;
 	tic->t_tid		= prandom_u32();
 	tic->t_clientid		= client;
-	tic->t_flags		= XLOG_TIC_INITED;
 	if (permanent)
 		tic->t_flags |= XLOG_TIC_PERM_RESERV;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 84e06805160f..85f8d0966811 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -105,10 +105,6 @@ struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
 
-xfs_lsn_t xfs_log_done(struct xfs_mount *mp,
-		       struct xlog_ticket *ticket,
-		       struct xlog_in_core **iclog,
-		       bool regrant);
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
 		int *log_forced);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 48435cf2aa16..255065d276fc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -841,10 +841,11 @@ xlog_cil_push(
 	}
 	spin_unlock(&cil->xc_push_lock);
 
-	/* xfs_log_done always frees the ticket on error. */
-	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, false);
-	if (commit_lsn == -1)
-		goto out_abort;
+	error = xlog_write_done(log, tic, &commit_iclog, &commit_lsn);
+	if (error)
+		goto out_abort_free_ticket;
+
+	xlog_ticket_done(log, tic, false);
 
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
@@ -876,7 +877,7 @@ xlog_cil_push(
 	return 0;
 
 out_abort_free_ticket:
-	xfs_log_ticket_put(tic);
+	xlog_ticket_done(log, tic, false);
 out_abort:
 	xlog_cil_committed(ctx, true);
 	return -EIO;
@@ -1017,7 +1018,7 @@ xfs_log_commit_cil(
 	if (commit_lsn)
 		*commit_lsn = xc_commit_lsn;
 
-	xfs_log_done(mp, tp->t_ticket, NULL, regrant);
+	xlog_ticket_done(log, tp->t_ticket, regrant);
 	tp->t_ticket = NULL;
 	xfs_trans_unreserve_and_mod_sb(tp);
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b192c5a9f9fd..6965d164ff45 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -53,11 +53,9 @@ enum xlog_iclog_state {
 /*
  * Flags to log ticket
  */
-#define XLOG_TIC_INITED		0x1	/* has been initialized */
 #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
 
 #define XLOG_TIC_FLAGS \
-	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
 	{ XLOG_TIC_PERM_RESERV,	"XLOG_TIC_PERM_RESERV" }
 
 /*
@@ -438,14 +436,14 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
 
 void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
-int
-xlog_write(
-	struct xlog		*log,
-	struct xfs_log_vec	*log_vector,
-	struct xlog_ticket	*tic,
-	xfs_lsn_t		*start_lsn,
-	struct xlog_in_core	**commit_iclog,
-	uint			flags);
+
+int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
+			struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
+			struct xlog_in_core **commit_iclog, uint flags);
+int xlog_write_done(struct xlog *log, struct xlog_ticket *ticket,
+			struct xlog_in_core **iclog, xfs_lsn_t *lsn);
+void xlog_ticket_done(struct xlog *log, struct xlog_ticket *ticket,
+			bool regrant);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..85ea3727878b 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,6 +9,7 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
+#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -150,8 +151,9 @@ xfs_trans_reserve(
 	uint			blocks,
 	uint			rtextents)
 {
-	int		error = 0;
-	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			error = 0;
+	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
 	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
@@ -162,7 +164,7 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (blocks > 0) {
-		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
 			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
@@ -191,9 +193,9 @@ xfs_trans_reserve(
 
 		if (tp->t_ticket != NULL) {
 			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
-			error = xfs_log_regrant(tp->t_mountp, tp->t_ticket);
+			error = xfs_log_regrant(mp, tp->t_ticket);
 		} else {
-			error = xfs_log_reserve(tp->t_mountp,
+			error = xfs_log_reserve(mp,
 						resp->tr_logres,
 						resp->tr_logcount,
 						&tp->t_ticket, XFS_TRANSACTION,
@@ -213,7 +215,7 @@ xfs_trans_reserve(
 	 * fail if the count would go below zero.
 	 */
 	if (rtextents > 0) {
-		error = xfs_mod_frextents(tp->t_mountp, -((int64_t)rtextents));
+		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
 		if (error) {
 			error = -ENOSPC;
 			goto undo_log;
@@ -229,7 +231,7 @@ xfs_trans_reserve(
 	 */
 undo_log:
 	if (resp->tr_logres > 0) {
-		xfs_log_done(tp->t_mountp, tp->t_ticket, NULL, false);
+		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
 		tp->t_ticket = NULL;
 		tp->t_log_res = 0;
 		tp->t_flags &= ~XFS_TRANS_PERM_LOG_RES;
@@ -237,7 +239,7 @@ xfs_trans_reserve(
 
 undo_blocks:
 	if (blocks > 0) {
-		xfs_mod_fdblocks(tp->t_mountp, (int64_t)blocks, rsvd);
+		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
 		tp->t_blk_res = 0;
 	}
 
@@ -999,9 +1001,7 @@ __xfs_trans_commit(
 	 */
 	xfs_trans_unreserve_and_mod_dquots(tp);
 	if (tp->t_ticket) {
-		commit_lsn = xfs_log_done(mp, tp->t_ticket, NULL, regrant);
-		if (commit_lsn == -1 && !error)
-			error = -EIO;
+		xlog_ticket_done(mp->m_log, tp->t_ticket, regrant);
 		tp->t_ticket = NULL;
 	}
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
@@ -1060,7 +1060,7 @@ xfs_trans_cancel(
 	xfs_trans_unreserve_and_mod_dquots(tp);
 
 	if (tp->t_ticket) {
-		xfs_log_done(mp, tp->t_ticket, NULL, false);
+		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
 		tp->t_ticket = NULL;
 	}
 

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-02 23:25       ` Dave Chinner
  2020-03-03  4:07         ` Dave Chinner
@ 2020-03-03 14:13         ` Brian Foster
  2020-03-03 21:26           ` Dave Chinner
  1 sibling, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-03-03 14:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > > > Automatic item relogging will occur from xfsaild context. xfsaild
> > > > cannot acquire log reservation itself because it is also responsible
> > > > for writeback and thus making used log reservation available again.
> > > > Since there is no guarantee log reservation is available by the time
> > > > a relogged item reaches the AIL, this is prone to deadlock.
> 
> .....
> 
> > > > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> > > >  	tp->t_firstblock = NULLFSBLOCK;
> > > >  
> > > >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > > > -	if (error) {
> > > > -		xfs_trans_cancel(tp);
> > > > -		return error;
> > > > +	if (error)
> > > > +		goto error;
> > > > +
> > > > +	if (flags & XFS_TRANS_RELOG) {
> > > > +		error = xfs_trans_ail_relog_reserve(&tp);
> > > > +		if (error)
> > > > +			goto error;
> > > >  	}
> > > 
> > > Hmmmm. So we are putting the AIL lock directly into the transaction
> > > reserve path? xfs_trans_reserve() goes out of its way to be
> > > lockless in the fast paths, so if you want this to be a generic
> > > mechanism that any transaction can use, the reservation needs to be
> > > completely lockless.
> > > 
> > 
> > Yeah, this is one of those warts mentioned in the cover letter. I wasn't
> > planning to make it lockless as much as using an independent lock, but I
> > can take a closer look at that if it still looks like a contention point
> > after getting through the bigger picture feedback..
> 
> If relogging ends up getting widely used, then a filesystem global
> lock in the transaction reserve path is guaranteed to be a
> contention point. :)
> 

I was thinking yesterday that I had implemented a fast path there.  I
intended to, but looking at the code I realize I hadn't got to that yet.
So... I see your point and this will need to be addressed if it doesn't
fall away via the relog res rework.

> > > > +	/* release the relog ticket reference if this transaction holds one */
> > > > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > > > +		xfs_trans_ail_relog_put(mp);
> > > > +
> > > 
> > > That looks ... interesting. xfs_trans_ail_relog_put() can call
> > > xfs_log_done(), which means it can do a log write, which means the
> > > commit lsn for this transaction could change. Hence to make a relog
> > > permanent as a result of a sync transaction, we'd need the
> > > commit_lsn of the AIL relog ticket write here, not that of the
> > > original transaction that was written.
> > > 
> > 
> > Hmm.. xfs_log_done() is eventually called for every transaction ticket,
> > so I don't think it should be doing log writes based on that.
> 
>   xfs_log_done()
>     xlog_commit_record()
>       xlog_write(XLOG_COMMIT_TRANS)
>         xlog_state_release_iclog()
> 	  xlog_sync()
> 	    xlog_write_iclog()
> 	      submit_bio()

Yeah, I know that we _can_ issue writes from this path because it's
shared between the CIL log ticket path and the per-transaction log
ticket path. I'm just not convinced that happens from the transaction
path simply because if it did, it would seem like a potentially
noticeable bug.

> >
> > The
> > _ail_relog_put() path is basically just intended to properly terminate
> > the relog ticket based on the existing transaction completion path. I'd
> > have to dig a bit further into the code here, but e.g. we don't pass an
> > iclog via this path so if we were attempting to write I'd probably have
> > seen assert failures (or worse) by now.
> 
> xlog_write() code handles a missing commit iclog pointer just fine.
> Indeed, that's the reason xlog_write() may issue log IO itself:
> 
> ....
>         spin_lock(&log->l_icloglock);
>         xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
>         if (commit_iclog) {
>                 ASSERT(flags & XLOG_COMMIT_TRANS);
>                 *commit_iclog = iclog;
>         } else {
>                 error = xlog_state_release_iclog(log, iclog);
>         }
>         spin_unlock(&log->l_icloglock);
> 
> i.e. if you don't provide xlog_write with a commit_iclog pointer
> then you are saying "caller does not call xfs_log_release_iclog()
> itself, so please release the iclog once the log iovec has been
> written to the iclog". Hence the way you've written this code
> explicitly tells xlog_write() to issue log IO on your behalf if it is
> necessary....
> 

xlog_commit_record(), which is the xlog_write() caller in this case,
expects a valid iclog and (unconditionally) asserts otherwise. I would
expect to have seen that assert had this happened.

> > Indeed, this is similar to a
> > cancel of a clean transaction, the difference being this is open-coded
> > because we only have the relog ticket in this context.
> 
> A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will
> never call xlog_commit_record() on it. XLOG_TIC_INITED is cleared
> once a start record has been written to the log for a permanent
> transaction so it doesn't get written again when it is relogged.....
> 
> .... Hmmmm .....
> 
> You might be right.
> 

I figured it was related to the _INITED flag since that obviously gates
the call to xlog_commit_record(), but when I've taken a quick look at
that code in the past I wasn't really able to make much sense of its
purpose (i.e. why the behavior differs between CIL and transaction log
tickets) and didn't dig any further than that.

> It looks like delayed logging made XLOG_TIC_INITED completely
> redundant. xfs_trans_commit() no longer writes directly to iclogs,
> so the normal transaction tickets will never have XLOG_TIC_INITED
> cleared. Hence on any kernel since delayed logging was the only
> option, we will never get log writes from the xfs_trans* interfaces.
> 
> And we no longer do an xlog_write() call for every item in the
> transaction after we format them into a log iovec. Instead, the CIL
> just passes one big list of iovecs to a single xlog_write() call,
> so all callers only make a single xlog_write call per physical log
> transaction.
> 
> OK, XLOG_TIC_INITED is redundant, and should be removed. And
> xfs_log_done() needs to be split into two, one for releasing the
> ticket, one for completing the xlog_write() call. Compile tested
> only patch below for you :P
> 

Heh, Ok. Will take a look.

> > > <GRRRRIIINNNNDDDD>
> > > 
> > > <CLUNK>
> > > 
> > > Ahhhhh.
> > > 
> > 
> > Heh. :P
> > 
> > > This is the reason why the CIL ended up using a reservation stealing
> > > mechanism for its ticket. Every log reservation takes into account
> > > the worst case log overhead for that single transaction - that means
> > > there are no special up front hooks or changes to the log
> > > reservation to support the CIL, and it also means that the CIL can
> > > steal what it needs from the current ticket at commit time. Then the
> > > CIL can commit at any time, knowing it has enough space to write all
> > > the items it tracks to the log.
> > > 
> > > This avoided the transaction reservation mechanism from needing to
> > > know anything about the CIL or that it was stealing space from
> > > committing transactions for its own private ticket. It greatly
> > > simplified everything, and it's the reason that the CIL succeeded
> > > where several other attempts to relog items in memory failed....
> > > 
> > > So....
> > > 
> > > Why can't the AIL steal the reservation it needs from the current
> > > transaction when the item that needs relogging is being committed?
> > 
> > I think we can. My first proposal[1] implemented something like this.
> > The broader design was much less fleshed out, so it was crudely
> > implemented as more of a commit time hook to relog pending items based
> > on the currently committing transaction as opposed to reservation
> > stealing for an independent relog ticket. This was actually based on the
> > observation that the CIL already did something similar (though I didn't
> > have all of the background as to why that approach was used for the
> > CIL).
> > 
> > The feedback at the time was to consider moving in a different direction
> > that didn't involve stealing reservation from transactions, which gave
> > me the impression that approach wasn't going to be acceptable. Granted,
> 
> I don't recall that discussion steering away from stealing
> reservations; it was more about mechanics like using transaction
> callbacks and other possible mechanisms for enabling relogging.
> 
> > the first few variations of this were widely different in approach and I
> > don't think there was any explicit objection to reservation stealing
> > itself, it just seemed a better development approach to work out an
> > ideal design based on fundamentals as opposed to limiting the scope to
> > contexts that might facilitate reservation stealing.
> 
> Right, it wasn't clear how to best implement this at the time,
> because we were all just struggling to get our heads around the
> problem scope. :)
> 
> > Despite the fact
> > that this code is hairy, the fundamental approach is rather
> > straightforward and simplistic. The hairiness mostly comes from
> > attempting to make things generic and dynamic and could certainly use
> > some cleanup.
> > 
> > Now that this is at a point where I'm reasonably confident it is
> > fundamentally correct, I'm quite happy to revisit a reservation stealing
> > approach if that improves functional operation. This can serve a decent
> > reference/baseline implementation for that, I think.
> 
> *nod*
> 
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/
> > 
> > > i.e. we add the "relog overhead" to the permanent transaction that
> > > requires items to be relogged by the AIL, and when that item is
> > > formatted into the CIL we also check to see if it is currently
> > > marked as "reloggable". If it's not, we steal the relogging
> > > reservation from the transaction (essentially the item's formatted
> > > size) and pass it to the AIL's private relog ticket, setting the log
> > > item to be "reloggable" so the reservation isn't stolen over and
> > > over again as the object is relogged in the CIL.
> > > 
> > 
> > Ok, so this implies to me we'd update the per-transaction reservation
> > calculations to incorporate worst case relog overhead for each
> > transaction. E.g., the quotaoff transaction would x2 the intent size,
> > a transaction that might relog a buffer adds a max buffer size overhead,
> > etc. Right?
> 
> Essentially, yes.
> 
> And the reloggable items and reservation can be clearly documented
> in the per-transaction reservation explanation comments like we
> already do for all the structures that are modified in a
> transaction.
> 

Ok.

> > > Hence as we commit reloggable items to the CIL, the AIL ticket
> > > reservation grows with each of those items marked as reloggable.
> > > As we relog items to the CIL and the CIL grows its reservation via
> > > the size delta, the AIL reservation can also be updated with the
> > > same delta.
> > > 
> > > Hence the AIL will always know exactly how much space it needs to relog all
> > > the items it holds for relogging, and because it's been stolen from
> > > the original transaction it is, like the CIL ticket reservation,
> > > considered used space in the log. Hence the log space required for
> > > relogging items via the AIL is correctly accounted for without
> > > needing up front static, per-item reservations.
> > > 
> > 
> > By this I assume you're referring to avoiding the need to reserve ->
> > donate -> roll in the current scheme. Instead, we'd acquire the larger
> > reservation up front and only steal if necessary, which is less
> > overhead because we don't need to always replenish the full transaction.
> 
> Yes - we acquire the initial relog reservation by stealing it
> from the current transaction. This gives the AIL ticket the reserved
> space (both reserve and write grant space) it needs to relog the
> item once.
> 
> When the item is relogged by the AIL, the transaction commit
> immediately regrants the reservation space that was just consumed,
> and the trans_roll regrants the write grant space the commit just
> consumed from the ticket via its call to xfs_trans_reserve(). Hence
> we only need to steal the space necessary to do the first relogging
> of the item as the AIL will hold that reservation until the high
> level code turns off relogging for that log item.
> 

Yep.

> > > Thoughts?
> > 
> > <thinking out loud from here on..>
> > 
> > One thing that comes to mind thinking about this is dealing with
> > batching (relog multiple items per roll). This code doesn't handle that
> > yet, but I anticipate it being a requirement and it's fairly easy to
> > update the current scheme to support a fixed item count per relog-roll.
> > 
> > A stealing approach potentially complicates things when it comes to
> > batching because we have per-item reservation granularity to consider.
> > For example, consider if we had a variety of different relog item types
> > active at once, a subset are being relogged while another subset are
> > being disabled (before ever needing to be relogged).
> 
> So concurrent "active relogging" + "start relogging" + "stop
> relogging"?
> 

Yeah..

> > For one, we'd have
> > to be careful to not release reservation while it might be accounted to
> > an active relog transaction that is currently rolling with some other
> > items, etc. There's also a potential quirk in handling reservation of a
> > relogged item that is cancelled while it's being relogged, but that
> > might be more of an implementation detail.
> 
> Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> my example relog transaction item add loop + commit function in?
> 

Yes, but the side effects didn't quite register..

> i.e. while we are building and committing a relog transaction, we
> hold off the transactions that are trying to add/remove their items
> to/from the relog list. Hence the reservation stealing accounting in
> the ticket can be be serialised against the transactional use of the
> ticket.
> 

Hmm, so this changes relog item/state management as well as the
reservation management. That's probably why it wasn't clear to me from
the code example.

> It's basically the same method we use for serialising addition to
> the CIL in transaction commit against CIL pushes draining the
> current list for log writes (rwsem for add/push serialisation, spin
> lock for concurrent add serialisation under the rwsem).
> 

Ok. The difference with the CIL is that in that context we're processing
a single, constantly maintained list of items. As of right now, there is
no such list for relog enabled items. If I'm following correctly, that's
still not necessary here, we're just talking about wrapping a rw lock
around the state management (+ res accounting) and the actual res
consumption such that a relog cancel doesn't steal reservation from an
active relog transaction, for example.
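
To sketch out what I think you're describing (illustrative only, not
tested):

	/* commit side: enable/cancel relog state on an item */
	down_read(&ailp->ail_relog_lock);
	/* ...transfer reservation to/from the AIL relog ticket... */
	up_read(&ailp->ail_relog_lock);

	/* AIL side: drain the relog list, commit and roll */
	down_write(&ailp->ail_relog_lock);
	/* ...add queued items to ail->relog_trans, xfs_trans_roll()... */
	up_write(&ailp->ail_relog_lock);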

> > I don't think that's a show stopper, but rather just something I'd like
> > to have factored into the design from the start. One option could be to
> 
> *nod*
> 
> > maintain a separate counter of active relog reservation aside from the
> > actual relog ticket. That way the relog ticket could just pull from this
> > relog reservation pool based on the current item(s) being relogged
> > asynchronously from different tasks that might add or remove reservation
> > from the pool for separate items. That might get a little wonky when we
> > consider the relog ticket needs to pull from the pool and then put
> > something back if the item is still reloggable after the relog
> > transaction rolls.
> 
> Right, that's the whole problem that we solve via a) stealing the
> initial reserve/write grant space at commit time and b) serialising
> stealing vs transactional use of the ticket.
> 
> That is, after the roll, the ticket has a full reserve and grant
> space reservation for all the items accounted to the relog ticket.
> Every new relog item added to the ticket (or relogged in the CIL
> using more space) adds the full required reserve/write grant
> space to the relog ticket. Hence the relog ticket always has
> current log space reserved to commit the entire set of items tagged
> as reloggable. And by avoiding modifying the ticket while we are
> actively processing the relog transaction, we don't screw up the
> ticket accounting in the middle of the transaction....
> 

Yep, Ok.

> > Another problem is that reloggable items are still otherwise usable once
> > they are unlocked. So for example we'd have to account for a situation
> > where one transaction dirties a buffer, enables relogging, commits and
> > then some other transaction dirties more of the buffer and commits
> > without caring whether the buffer was relog enabled or not.
> 
> yup, that's the delta size updates from the CIL commit. i.e. if we
> relog an item to the CIL that has the XFS_LI_RELOG flag already set
> on it, the change in size that we steal for the CIL ticket also
> needs to be stolen for the AIL ticket. i.e. we already do almost all
> the work we need to handle this.
> 
> > Unless
> > runtime relog reservation is always worst case, that's a subtle path to
> > reservation overrun in the relog transaction.
> 
> Yes, but it's a problem the CIL already solves for us :P
> 

Ok, I think we're pretty close to the same page here. I was thinking
about the worst case relog reservation being pulled off the committing
transaction unconditionally, where I think you're thinking about it as
the transaction (i.e. reservation calculation) would have the worst case
reservation, but we'd only pull off the delta as needed at commit time
(i.e. exactly how the CIL works wrt transaction reservation
consumption). Let me work through a simple example to try and
(in)validate my concern:

- Transaction A is relog enabled, dirties 50% of a buffer and enables
  auto relogging. On commit, the CIL takes buffer/2 reservation for the
  log vector and the relog mechanism takes the same amount for
  subsequent relogs.
- Transaction B is not relog enabled (so no extra relog reservation),
  dirties another 40% of the (already relog enabled) buffer and commits.
  The CIL takes 90% of the transaction buffer reservation. The relog
  ticket now needs an additional 40% (since the original 50% is
  maintained by the relog system), but afaict there is no guarantee that
  res is available if trans B is not relog enabled.
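
With rough (made up) numbers, assuming a 4k buffer:

	trans A: dirties 2k	-> CIL ticket steals 2k,
				   relog ticket steals 2k
	trans B: dirties +1.6k	-> CIL ticket steals the 1.6k delta,
				   relog ticket also needs +1.6k,
				   but B carried no relog reservation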

So IIUC, the CIL scheme works because every transaction automatically
includes worst case reservation for every possible item supported by the
transaction. Relog transactions are presumably more selective, however,
so we need to either have some rule where only relog enabled
transactions can touch relog enabled items (it's not clear to me how
realistic that is for things like buffers etc., but it sounds
potentially delicate) or we simplify the relog reservation consumption
calculation to consume worst case reservation for the relog ticket in
anticipation of unrelated transactions dirtying more of the associated
object since it was originally committed for relog. Thoughts?

Note that carrying worst case reservation also potentially simplifies
accounting between multiple dirty+relog increments and a single relog
cancel, particularly if a relog cancelling transaction also happens to
dirty more of a buffer before it commits. I'd prefer to avoid subtle
usage landmines like that as much as possible. Hmm.. even if we do
ultimately want a more granular and dynamic relog res accounting, I
might start with the simple worst-case approach since 1.) it probably
doesn't matter for intents, 2.) it's easier to reason about the isolated
reservation calculation changes in an independent patch from the rest of
the mechanism and 3.) I'd rather get the basics nailed down and solid
before potentially fighting with granular reservation accounting
accuracy bugs. ;)

Brian

> > I'm actually wondering if a simpler approach to tracking stolen
> > reservation is to add a new field that tracks active relog reservation
> > to each supported log item.  Then the initial transaction enables
> > relogging with a simple transfer from the transaction to the item. The
> > relog transaction knows how much reservation it can assign based on the
> > current population of items and requires no further reservation
> > accounting because it rolls and thus automatically reacquires relog
> > reservation for each associated item. The path that clears relog state
> > transfers res from the item back to a transaction simply so the
> > reservation can be released back to the pool of unused log space. Note
> > that clearing relog state doesn't require a transaction in the current
> > implementation, but we could easily define a helper to allocate an empty
> > transaction, clear relog and reclaim relog reservation and then cancel
> > for those contexts.
> 
> I don't think we need any of that - the AIL ticket only needs to be
> kept up to date with the changes to the formatted size of the item
> marked for relogging. It's no different to the CIL ticket
> reservation accounting from that perspective.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism
  2020-03-03  0:06       ` Dave Chinner
@ 2020-03-03 14:14         ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-03-03 14:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 11:06:43AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:52:52PM -0500, Brian Foster wrote:
> > On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > > > +		spin_unlock(&ailp->ail_lock);
> > > > +
> > > > +		xfs_trans_add_item(tp, lip);
> > > > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +		/* XXX: include ticket owner task fix */
> > > > +		error = xfs_trans_roll(&tp);
> > > 
> > > So the reservation for this ticket is going to be regranted over and
> > > over again and space reserved repeatedly as we work through the
> > > locked relog items one at a time?
> > > 
> > > Unfortunately, this violates the rule that prevents rolling
> > > transactions from deadlocking. That is, any object that is held
> > > locked across the transaction commit and regrant that *might pin the
> > > tail of the log* must be relogged in the transaction to move the
> > > item forward and prevent it from pinning the tail of the log.
> > > 
> > > IOWs, if one of the items later in the relog list pins the tail of
> > > the log we will end up sleeping here:
> > > 
> > >   xfs_trans_roll()
> > >     xfs_trans_reserve
> > >       xfs_log_regrant
> > >         xlog_grant_head_check(need_bytes)
> > > 	  xlog_grant_head_wait()
> > > 
> > > waiting for the write grant head to move. And it never will, because
> > > we hold the lock on that item so the AIL can't push it out.
> > > IOWs, using a rolling transaction per relog item will not work for
> > > processing multiple relog items.
> > > 
> > 
> > Hm, Ok. I must be missing something about the rolling transaction
> > guarantees.
> 
> Ok, I know you understand some of the rules, but because I don't
> know quite which bit of this complex game you are missing, I'll go
> over it from the start (sorry for repeating things you know!).
> 
> We're not allowed to hold locked items that are in the AIL over a
> transaction reservation call because the transaction reservation may
> need to push the log to free up space in the log. Writing back
> metadata requires locking the item to flush it, and if we hold it
> locked the it can't be flushed. Hence if it pins the tail of the log
> and prevents the the reservation from making space available, we
> deadlock.
> 
> Normally, the only thing that is pinned across a transaction roll is
> an inode, and the inode is being logged in every transaction. Hence
> it is being continually moved to the head of the log and so can't
> pin the tail of the log and prevent the reservation making progress.
> 
> The problem here is that having a log reservation for a modification
> doesn't guarantee you that log space is immediately available - all
> it guarantees you is that the log space will be available if the log
> tail is free to move forward.
> 
> That's why there are two grant heads. The reservation grant head is
> the one that guarantees that you have space available in the log for
> the rolling transaction. That is always immediately regranted during
> transaction commit, hence we guarantee the rolling transaction will
> always fit in the log. The reserve head ensures we never overcommit
> the available log space.
> 
> The second grant head is the write head, and this tracks the space
> immediately available to physically write into the log. When your
> write reservation runs out (i.e. after the number of rolls specified
> by the log count in the initial transaction reservation), we have
> to regrant physical space for the next
> transaction in the rolling chain. If the log is physically full, we
> have to wait for physical space to be made available.
> 
> The only way to increase the amount of physical space available in
> the log is to have the tail move forwards. xfs_trans_reserve() does
> that by setting a push target for the AIL to flush all the metadata
> older than that target. It then blocks waiting for the tail of the
> log to move. When the tail of the log moves, the available write
> grant space increases because the log head can now physically move
> forwards in the log.
> 
> Hence when the log is full and we are in a tail pushing situation,
> new transactions wait on the reserve grant head to get the log space
> guarantee they require. Long duration rolling transactions already
> have a log space guarantee from the reserve grant head, so they
> end up waiting for the physical log space they require on the write
> grant head.
> 
> The tail pinning deadlock rolling transactions can trigger is
> against the write grant head, not the reserve grant head. If the
> tail of the log cannot move, then the write grant space never
> increases and xfs_trans_reserve() blocks forever. Hence we cannot
> call xfs_trans_roll() whilst holding items locked that have a high
> probability of being at the tail of the log.
> 

Ok. I've stared at the reserve vs. write grant head code before and
recognized the difference in behavior between new transactions bumping
both heads and rolling transactions replenishing straight from the write
head (presumably holding a constant log reservation over the entire
process), but the side effect of that in the rolling transaction case
wasn't always clear to me. This clears up the purpose of the write head
and makes the connection to the deadlock vector. Thanks for the
explanation...
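
To restate the failure mode of the per-item roll concretely (my
words, not code from the series):

	item A locked for relog; A pins the log tail
	item B locked for relog
	relog B, then xfs_trans_roll()
	  xfs_trans_reserve()
	    xfs_log_regrant()
	      xlog_grant_head_wait()	/* needs the tail to move */

	... but the AIL can't flush A because we hold it locked, so the
	tail never moves and the write grant space never arrives.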

> Given that this relogging functionality is all about preventing
> items from either pinning the tail of the log or disappearing off
> the tail of the log because they aren't relogged, we have to be very
> careful about holding them locked over operations that require
> the AIL to be able to make forwards progress....
> 

Indeed, and we're intentionally relying on AIL pressure to defer relogs
until items are potentially at or near the tail of the log as well.

> 
> > > If it's a single transaction, and we join all the locked items
> > > for relogging into it in a single transaction commit, then we are
> > > fine - we don't try to regrant log space while holding locked items
> > > that could pin the tail of the log.
> > > 
> > > We *can* use a rolling transaction if we do this - the AIL has a
> > > permanent transaction (plus ticket!) allocated at mount time with
> > > a log count of zero, we can then steal reserve/write grant head
> > > space into that ticket at CIL commit time as I mentioned
> > > previously. We do a loop like above, but it's basically:
> > > 
> > > {
> > > 	LIST_HEAD(tmp);
> > > 
> > > 	spin_lock(&ailp->ail_lock);
> > >         if (list_empty(&ailp->ail_relog_list)) {
> > > 		spin_unlock(&ailp->ail_lock);
> > > 		return;
> > > 	}
> > > 
> > > 	list_splice_init(&ailp->ail_relog_list, &tmp);
> > > 	spin_unlock(&ailp->ail_lock);
> > > 
> > > 	xfs_ail_relog_items(ail, &tmp);
> > > }
> > > 
> > > This allows the AIL to keep working and building up a new relog
> > > as it goes along. Hence we can work on our list without interruption
> > > or needing to repeatedly take the AIL lock just to get items from the
> > > list.
> > > 
> > 
> > Yeah, I was planning to splice the list as such to avoid cycling the lock
> > so much regardless..
> > 
> > > And xfs_ail_relog_items() does something like this:
> > > 
> > > {
> > > 	struct xfs_trans	*tp;
> > > 
> > > 	/*
> > > 	 * Make CIL committers trying to change relog status of log
> > > 	 * items wait for us to stabilise the relog transaction
> > > 	 * again by committing the current relog list and rolling
> > > 	 * the transaction.
> > > 	 */
> > > 	down_write(&ail->ail_relog_lock);
> > > 	tp = ail->relog_trans;
> > > 
> > >         while (!list_empty(&ailp->ail_relog_list)) {
> > > 		lip = list_first_entry(&ailp->ail_relog_list,
> > > 					struct xfs_log_item, li_trans);
> > > 		list_del_init(&lip->li_trans);
> > > 
> > > 		xfs_trans_add_item(tp, lip);
> > > 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > 		tp->t_flags |= XFS_TRANS_DIRTY;
> > > 	}
> > > 
> > > 	error = xfs_trans_roll(&tp);
> > > 	if (error) {
> > > 		SHUTDOWN!
> > > 	}
> > > 	ail->relog_trans = tp;
> > > 	up_write(&ail->ail_relog_lock);
> > > }
> > > 
> > 
> > I think I follow.. The fundamental difference here is basically that we
> > commit whatever we locked, right? IOW, the current approach _could_
> > technically be corrected, but it would have to lock one item at a time
> > (ugh) rather than build up a queue of locked items..?
> > 
> > The reservation stealing approach facilitates this batching because
> > instead of only having a guarantee that we can commit one max sized
> > relog item at a time, we can commit however many we have queued because
> > sufficient reservation has already been acquired.
> 
> Yes.
> 
> > That does raise another issue in that we presumably want some kind of
> > maximum transaction size and/or maximum outstanding relog reservation
> > with the above approach. Otherwise it could be possible for a workload
> > to go off the rails without any kind of throttling or heuristics
> > incorporated in the current (mostly) fixed transaction sizes.
> 
> Possibly, though I really don't have any intuition on how big the
> relog reservation could possibly grow. Right now it doesn't seem
> like there's a space problem (single item!), so perhaps this is
> something we can defer until we have some further understanding of
> how many relog items are active at any given time?
> 

Sure, it's certainly not an immediate issue for the current use case.

> > Perhaps
> > it's reasonable enough to cap outstanding relog reservation to the max
> > transaction size and return an error to callers that attempt to exceed
> 
> Max transaction size the CIL currently uses for large logs is 32MB.
> That's an awful lot of relog items....
> 

Yeah. As above, it's not clear to me what the max relog requirement
might be in the worst case of online repair of a metadata btree. This
might warrant some empirical data once the mechanism is in place. The
buffer relogging test code could help with that as well.

> > it..? It's not clear to me if that would impact the prospective scrub
> > use case. Hmmm.. maybe the right thing to do is cap the size of the
> > current relog queue so we cap the size of the relog transaction without
> > necessarily capping the max outstanding relog reservation. Thoughts on
> > that?
> 
> Maybe.
> 
> Though this seems like an ideal candidate for a relog reservation
> grant head. i.e. the transaction reserve structure we pass to
> xfs_trans_reserve() has a new field that contains the relog
> reservation needed for this transaction, and we use the existing
> lockless grant head accounting infrastructure to throttle relogging
> to an acceptable bound...
> 

That's an interesting idea too. I guess this will mostly be dictated by
the requirements of the repair use case, and there's potential for
flexibility in terms of having a hard cap vs. a throttling approach
independent from the reservation tracking implementation details.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype
  2020-03-03  0:09       ` Dave Chinner
@ 2020-03-03 14:14         ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-03-03 14:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 11:09:09AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 02:00:34PM -0500, Brian Foster wrote:
> > On Mon, Mar 02, 2020 at 06:47:28PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 27, 2020 at 08:43:19AM -0500, Brian Foster wrote:
> > > > Add a quick and dirty implementation of buffer relogging support.
> > > > There is currently no use case for buffer relogging. This is for
> > > > experimental use only and serves as an example to demonstrate the
> > > > ability to relog arbitrary items in the future, if necessary.
> > > > 
> > > > Add a hook to enable relogging a buffer in a transaction, update the
> > > > buffer log item handlers to support relogged BLIs and update the
> > > > relog handler to join the relogged buffer to the relog transaction.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > .....
> > > >  /*
> > > > @@ -187,9 +188,21 @@ xfs_ail_relog(
> > > >  			xfs_log_ticket_put(ailp->ail_relog_tic);
> > > >  		spin_unlock(&ailp->ail_lock);
> > > >  
> > > > -		xfs_trans_add_item(tp, lip);
> > > > -		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > > -		tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +		/*
> > > > +		 * TODO: Ideally, relog transaction management would be pushed
> > > > +		 * down into the ->iop_push() callbacks rather than playing
> > > > +		 * games with ->li_trans and looking at log item types here.
> > > > +		 */
> > > > +		if (lip->li_type == XFS_LI_BUF) {
> > > > +			struct xfs_buf_log_item	*bli = (struct xfs_buf_log_item *) lip;
> > > > +			xfs_buf_hold(bli->bli_buf);
> > > 
> > > What is this for? The bli already has a reference to the buffer.
> > > 
> > 
> > The buffer reference is for the transaction. It is analogous to the
> > reference acquired in xfs_buf_find() via xfs_trans_[get|read]_buf(), for
> > example.
> 
> Ah. Comment please :P
> 

Sure.

> > > > +			xfs_trans_bjoin(tp, bli->bli_buf);
> > > > +			xfs_trans_dirty_buf(tp, bli->bli_buf);
> > > > +		} else {
> > > > +			xfs_trans_add_item(tp, lip);
> > > > +			set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > > +			tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +		}
> > > 
> > > Really, this should be a xfs_item_ops callout. i.e.
> > > 
> > > 		lip->li_ops->iop_relog(lip);
> > > 
> > 
> > Yeah, I've already done pretty much this in my local tree. The callback
> > also takes the transaction because that's the code that knows how to add
> > a particular type of item to a transaction. I didn't require a callback
> > for the else case above where no special handling is required
> > (quotaoff), so the callback is optional, but I'm not opposed to
> > reworking things such that ->iop_relog() is always required if that is
> > preferred.
> 
> I think I'd prefer to keep things simple right now. Making it an
> unconditional callout keeps this code simple, and if there's a
> common implementation, add a generic function for it that the items
> use.
> 

Fine by me, will fix.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-03  4:07         ` Dave Chinner
@ 2020-03-03 15:12           ` Brian Foster
  2020-03-03 21:47             ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2020-03-03 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > OK, XLOG_TIC_INITED is redundant, and should be removed. And
> > xfs_log_done() needs to be split into two, one for releasing the
> > ticket, one for completing the xlog_write() call. Compile tested
> > only patch below for you :P
> 
> And now with sample patch.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: kill XLOG_TIC_INITED
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Delayed logging made this redundant as we never directly write
> transactions to the log anymore. Hence we no longer make multiple
> xlog_write() calls for a transaction as we format individual items
> in a transaction, and hence don't need to keep track of whether we
> should be writing a start record for every xlog_write call.
> 

FWIW the commit log could use a bit more context, perhaps from your
previous description, about the original semantics of the _INITED flag.
E.g., it's always been rather vague to me, probably because it seems to
be a remnant of functionality that is no longer fully in place.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 79 ++++++++++++++++++---------------------------------
>  fs/xfs/xfs_log.h      |  4 ---
>  fs/xfs/xfs_log_cil.c  | 13 +++++----
>  fs/xfs/xfs_log_priv.h | 18 ++++++------
>  fs/xfs/xfs_trans.c    | 24 ++++++++--------
>  5 files changed, 55 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f6006d94a581..a45f3eefee39 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -496,8 +496,8 @@ xfs_log_reserve(
>   * This routine is called when a user of a log manager ticket is done with
>   * the reservation.  If the ticket was ever used, then a commit record for
>   * the associated transaction is written out as a log operation header with
> - * no data.  The flag XLOG_TIC_INITED is set when the first write occurs with
> - * a given ticket.  If the ticket was one with a permanent reservation, then
> + * no data. 

	      ^ trailing whitespace

> + * If the ticket was one with a permanent reservation, then
>   * a few operations are done differently.  Permanent reservation tickets by
>   * default don't release the reservation.  They just commit the current
>   * transaction with the belief that the reservation is still needed.  A flag
> @@ -506,49 +506,38 @@ xfs_log_reserve(
>   * the inited state again.  By doing this, a start record will be written
>   * out when the next write occurs.
>   */
> -xfs_lsn_t
> -xfs_log_done(
> -	struct xfs_mount	*mp,
> +int
> +xlog_write_done(
> +	struct xlog		*log,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	**iclog,
> -	bool			regrant)
> +	xfs_lsn_t		*lsn)
>  {
> -	struct xlog		*log = mp->m_log;
> -	xfs_lsn_t		lsn = 0;
> -
> -	if (XLOG_FORCED_SHUTDOWN(log) ||
> -	    /*
> -	     * If nothing was ever written, don't write out commit record.
> -	     * If we get an error, just continue and give back the log ticket.
> -	     */
> -	    (((ticket->t_flags & XLOG_TIC_INITED) == 0) &&
> -	     (xlog_commit_record(log, ticket, iclog, &lsn)))) {
> -		lsn = (xfs_lsn_t) -1;
> -		regrant = false;
> -	}
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		return -EIO;
>  
> +	return xlog_commit_record(log, ticket, iclog, lsn);
> +}
>  
> +/*
> + * Release or regrant the ticket reservation now the transaction is done with
> + * it depending on caller context. Rolling transactions need the ticket
> + * regranted, otherwise we release it completely.
> + */
> +void
> +xlog_ticket_done(
> +	struct xlog		*log,
> +	struct xlog_ticket	*ticket,
> +	bool			regrant)
> +{
>  	if (!regrant) {
>  		trace_xfs_log_done_nonperm(log, ticket);
> -
> -		/*
> -		 * Release ticket if not permanent reservation or a specific
> -		 * request has been made to release a permanent reservation.
> -		 */
>  		xlog_ungrant_log_space(log, ticket);
>  	} else {
>  		trace_xfs_log_done_perm(log, ticket);
> -
>  		xlog_regrant_reserve_log_space(log, ticket);
> -		/* If this ticket was a permanent reservation and we aren't
> -		 * trying to release it, reset the inited flags; so next time
> -		 * we write, a start record will be written out.
> -		 */
> -		ticket->t_flags |= XLOG_TIC_INITED;
>  	}
> -
>  	xfs_log_ticket_put(ticket);
> -	return lsn;
>  }

In general it would be nicer to split off as much refactoring as
possible into separate patches, even though it's not yet clear to me
what granularity is possible with this patch...

>  
>  static bool
> @@ -2148,8 +2137,9 @@ xlog_print_trans(
>  }
>  
>  /*
> - * Calculate the potential space needed by the log vector.  Each region gets
> - * its own xlog_op_header_t and may need to be double word aligned.
> + * Calculate the potential space needed by the log vector.  We always write a
> + * start record, and each region gets its own xlog_op_header_t and may need to
> + * be double word aligned.
>   */
>  static int
>  xlog_write_calc_vec_length(
> @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length(
>  	struct xfs_log_vec	*log_vector)
>  {
>  	struct xfs_log_vec	*lv;
> -	int			headers = 0;
> +	int			headers = 1;
>  	int			len = 0;
>  	int			i;
>  
> -	/* acct for start rec of xact */
> -	if (ticket->t_flags & XLOG_TIC_INITED)
> -		headers++;
> -
>  	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> @@ -2195,17 +2181,11 @@ xlog_write_start_rec(
>  	struct xlog_op_header	*ophdr,
>  	struct xlog_ticket	*ticket)
>  {
> -	if (!(ticket->t_flags & XLOG_TIC_INITED))
> -		return 0;
> -
>  	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
>  	ophdr->oh_clientid = ticket->t_clientid;
>  	ophdr->oh_len = 0;
>  	ophdr->oh_flags = XLOG_START_TRANS;
>  	ophdr->oh_res2 = 0;
> -
> -	ticket->t_flags &= ~XLOG_TIC_INITED;
> -
>  	return sizeof(struct xlog_op_header);
>  }

The header comment to this function refers to the "inited" state.

Also note that there's a similar reference in
xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero
so might be fine outside of the stale comment.

>  
> @@ -2410,12 +2390,10 @@ xlog_write(
>  	len = xlog_write_calc_vec_length(ticket, log_vector);
>  
>  	/*
> -	 * Region headers and bytes are already accounted for.
> -	 * We only need to take into account start records and
> -	 * split regions in this function.
> +	 * Region headers and bytes are already accounted for.  We only need to
> +	 * take into account start records and split regions in this function.
>  	 */
> -	if (ticket->t_flags & XLOG_TIC_INITED)
> -		ticket->t_curr_res -= sizeof(xlog_op_header_t);
> +	ticket->t_curr_res -= sizeof(xlog_op_header_t);
>  

So AFAICT the CIL allocates a ticket and up to this point only mucks
around with the reservation value. That means _INITED is still in place
once we get to xlog_write(). xlog_write() immediately calls
xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment
before touching ->t_flags, so those bits all seem fine.

We then get into the log vector loops, where it looks like we call
xlog_write_start_rec() for each log vector region and rely on the
_INITED flag to only write a start record once per associated ticket.
Unless I'm missing something, this looks like it would change behavior
to perhaps write a start record per-region..? Note that this might not
preclude the broader change to kill off _INITED since we're using the
same ticket throughout the call, but some initial refactoring might be
required to remove this dependency first.
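
Something like the following is what I'd envision for that initial
refactoring, i.e. write the start record once up front instead of once
per region (rough sketch only, hand-waving past the iclog space
handling in xlog_write()):

	/* once we have space in the first iclog, before the lv loop */
	xlog_write_start_rec(ptr, ticket);
	xlog_write_adv_cnt(&ptr, &len, &log_offset,
			   sizeof(struct xlog_op_header));

	for (lv = log_vector; lv; lv = lv->lv_next) {
		/* format regions as before, no per-region start record */
	}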

>  	/*
>  	 * Commit record headers need to be accounted for. These
> @@ -3609,7 +3587,6 @@ xlog_ticket_alloc(
>  	tic->t_ocnt		= cnt;
>  	tic->t_tid		= prandom_u32();
>  	tic->t_clientid		= client;
> -	tic->t_flags		= XLOG_TIC_INITED;
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 84e06805160f..85f8d0966811 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -105,10 +105,6 @@ struct xfs_log_item;
>  struct xfs_item_ops;
>  struct xfs_trans;
>  
> -xfs_lsn_t xfs_log_done(struct xfs_mount *mp,
> -		       struct xlog_ticket *ticket,
> -		       struct xlog_in_core **iclog,
> -		       bool regrant);
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
>  int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
>  		int *log_forced);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 48435cf2aa16..255065d276fc 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -841,10 +841,11 @@ xlog_cil_push(
>  	}
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	/* xfs_log_done always frees the ticket on error. */
> -	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, false);
> -	if (commit_lsn == -1)
> -		goto out_abort;
> +	error = xlog_write_done(log, tic, &commit_iclog, &commit_lsn);
> +	if (error)
> +		goto out_abort_free_ticket;
> +
> +	xlog_ticket_done(log, tic, false);
>  
>  	spin_lock(&commit_iclog->ic_callback_lock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
> @@ -876,7 +877,7 @@ xlog_cil_push(
>  	return 0;
>  
>  out_abort_free_ticket:
> -	xfs_log_ticket_put(tic);
> +	xlog_ticket_done(log, tic, false);
>  out_abort:
>  	xlog_cil_committed(ctx, true);
>  	return -EIO;
> @@ -1017,7 +1018,7 @@ xfs_log_commit_cil(
>  	if (commit_lsn)
>  		*commit_lsn = xc_commit_lsn;
>  
> -	xfs_log_done(mp, tp->t_ticket, NULL, regrant);
> +	xlog_ticket_done(log, tp->t_ticket, regrant);
>  	tp->t_ticket = NULL;
>  	xfs_trans_unreserve_and_mod_sb(tp);
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b192c5a9f9fd..6965d164ff45 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -53,11 +53,9 @@ enum xlog_iclog_state {
>  /*
>   * Flags to log ticket
>   */
> -#define XLOG_TIC_INITED		0x1	/* has been initialized */
>  #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */

These values don't end up on disk, right? If not, it might be worth
resetting the _PERM_RESERV value to 1. Otherwise the rest looks like
mostly straightforward refactoring. 
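
I.e. for the _PERM_RESERV value, assuming these flags are purely
in-memory ticket state and nothing else depends on the current values:

#define XLOG_TIC_PERM_RESERV	0x1	/* permanent reservation */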

Brian

>  
>  #define XLOG_TIC_FLAGS \
> -	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
>  	{ XLOG_TIC_PERM_RESERV,	"XLOG_TIC_PERM_RESERV" }
>  
>  /*
> @@ -438,14 +436,14 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
>  
>  void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
> -int
> -xlog_write(
> -	struct xlog		*log,
> -	struct xfs_log_vec	*log_vector,
> -	struct xlog_ticket	*tic,
> -	xfs_lsn_t		*start_lsn,
> -	struct xlog_in_core	**commit_iclog,
> -	uint			flags);
> +
> +int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
> +			struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> +			struct xlog_in_core **commit_iclog, uint flags);
> +int xlog_write_done(struct xlog *log, struct xlog_ticket *ticket,
> +			struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +void xlog_ticket_done(struct xlog *log, struct xlog_ticket *ticket,
> +			bool regrant);
>  
>  /*
>   * When we crack an atomic LSN, we sample it first so that the value will not
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..85ea3727878b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,6 +9,7 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -150,8 +151,9 @@ xfs_trans_reserve(
>  	uint			blocks,
>  	uint			rtextents)
>  {
> -	int		error = 0;
> -	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	int			error = 0;
> +	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>  	/* Mark this thread as being in a transaction */
>  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> @@ -162,7 +164,7 @@ xfs_trans_reserve(
>  	 * fail if the count would go below zero.
>  	 */
>  	if (blocks > 0) {
> -		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> +		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>  		if (error != 0) {
>  			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  			return -ENOSPC;
> @@ -191,9 +193,9 @@ xfs_trans_reserve(
>  
>  		if (tp->t_ticket != NULL) {
>  			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
> -			error = xfs_log_regrant(tp->t_mountp, tp->t_ticket);
> +			error = xfs_log_regrant(mp, tp->t_ticket);
>  		} else {
> -			error = xfs_log_reserve(tp->t_mountp,
> +			error = xfs_log_reserve(mp,
>  						resp->tr_logres,
>  						resp->tr_logcount,
>  						&tp->t_ticket, XFS_TRANSACTION,
> @@ -213,7 +215,7 @@ xfs_trans_reserve(
>  	 * fail if the count would go below zero.
>  	 */
>  	if (rtextents > 0) {
> -		error = xfs_mod_frextents(tp->t_mountp, -((int64_t)rtextents));
> +		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
>  		if (error) {
>  			error = -ENOSPC;
>  			goto undo_log;
> @@ -229,7 +231,7 @@ xfs_trans_reserve(
>  	 */
>  undo_log:
>  	if (resp->tr_logres > 0) {
> -		xfs_log_done(tp->t_mountp, tp->t_ticket, NULL, false);
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
>  		tp->t_ticket = NULL;
>  		tp->t_log_res = 0;
>  		tp->t_flags &= ~XFS_TRANS_PERM_LOG_RES;
> @@ -237,7 +239,7 @@ xfs_trans_reserve(
>  
>  undo_blocks:
>  	if (blocks > 0) {
> -		xfs_mod_fdblocks(tp->t_mountp, (int64_t)blocks, rsvd);
> +		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
>  		tp->t_blk_res = 0;
>  	}
>  
> @@ -999,9 +1001,7 @@ __xfs_trans_commit(
>  	 */
>  	xfs_trans_unreserve_and_mod_dquots(tp);
>  	if (tp->t_ticket) {
> -		commit_lsn = xfs_log_done(mp, tp->t_ticket, NULL, regrant);
> -		if (commit_lsn == -1 && !error)
> -			error = -EIO;
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, regrant);
>  		tp->t_ticket = NULL;
>  	}
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> @@ -1060,7 +1060,7 @@ xfs_trans_cancel(
>  	xfs_trans_unreserve_and_mod_dquots(tp);
>  
>  	if (tp->t_ticket) {
> -		xfs_log_done(mp, tp->t_ticket, NULL, false);
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
>  		tp->t_ticket = NULL;
>  	}
>  
> 



* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-03 14:13         ` Brian Foster
@ 2020-03-03 21:26           ` Dave Chinner
  2020-03-04 14:03             ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-03-03 21:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote:
> On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > <thinking out loud from here on..>
> > > 
> > > One thing that comes to mind thinking about this is dealing with
> > > batching (relog multiple items per roll). This code doesn't handle that
> > > yet, but I anticipate it being a requirement and it's fairly easy to
> > > update the current scheme to support a fixed item count per relog-roll.
> > > 
> > > A stealing approach potentially complicates things when it comes to
> > > batching because we have per-item reservation granularity to consider.
> > > For example, consider if we had a variety of different relog item types
> > > active at once, a subset are being relogged while another subset are
> > > being disabled (before ever needing to be relogged).
> > 
> > So concurrent "active relogging" + "start relogging" + "stop
> > relogging"?
> > 
> 
> Yeah..
> 
> > > For one, we'd have
> > > to be careful to not release reservation while it might be accounted to
> > > an active relog transaction that is currently rolling with some other
> > > items, etc. There's also a potential quirk in handling reservation of a
> > > relogged item that is cancelled while it's being relogged, but that
> > > might be more of an implementation detail.
> > 
> > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> > my example relog transaction item add loop + commit function in?
> > 
> 
> Yes, but the side effects didn't quite register..
> 
> > i.e. while we are building and committing a relog transaction, we
> > hold off the transactions that are trying to add/remove their items
> > to/from the relog list. Hence the reservation stealing accounting in
> > the ticket can be serialised against the transactional use of the
> > ticket.
> > 
> 
> Hmm, so this changes relog item/state management as well as the
> reservation management. That's probably why it wasn't clear to me from
> the code example.

Yeah, I left a lot out :P

> > It's basically the same method we use for serialising addition to
> > the CIL in transaction commit against CIL pushes draining the
> > current list for log writes (rwsem for add/push serialisation, spin
> > lock for concurrent add serialisation under the rwsem).
> > 
> 
> Ok. The difference with the CIL is that in that context we're processing
> a single, constantly maintained list of items. As of right now, there is
> no such list for relog enabled items. If I'm following correctly, that's
> still not necessary here, we're just talking about wrapping a rw lock
> around the state management (+ res accounting) and the actual res
> consumption such that a relog cancel doesn't steal reservation from an
> active relog transaction, for example.

That is correct. I don't think it actually matters because we can't
remove an item that is locked and being relogged (so its
reservation is in use). Except for the fact we can't serialise
internal ticket accounting updates the transaction might be doing
with external ticket accounting modifications any other way.
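
To make that concrete, the usage pattern I'm thinking of looks
something like this (sketch only - the spin lock and field names are
invented here, the rwsem is the ail_relog_lock from my earlier
example):

/* front end: transaction enabling/cancelling relog on an item */
down_read(&ailp->ail_relog_lock);
spin_lock(&ailp->ail_relog_res_lock);
ailp->ail_relog_tic->t_unit_res += nbytes;	/* -= nbytes on cancel */
ailp->ail_relog_tic->t_curr_res += nbytes;
spin_unlock(&ailp->ail_relog_res_lock);
up_read(&ailp->ail_relog_lock);

/* back end: xfsaild building and committing the relog transaction */
down_write(&ailp->ail_relog_lock);
/* add relog items to the tp, commit and roll; the ticket accounting
 * is stable while we hold the rwsem exclusively */
up_write(&ailp->ail_relog_lock);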

> > > I don't think that's a show stopper, but rather just something I'd like
> > > to have factored into the design from the start. One option could be to
> > 
> > *nod*
> > 
> > > maintain a separate counter of active relog reservation aside from the
> > > actual relog ticket. That way the relog ticket could just pull from this
> > > relog reservation pool based on the current item(s) being relogged
> > > asynchronously from different tasks that might add or remove reservation
> > > from the pool for separate items. That might get a little wonky when we
> > > consider the relog ticket needs to pull from the pool and then put
> > > something back if the item is still reloggable after the relog
> > > transaction rolls.
> > 
> > Right, that's the whole problem that we solve via a) stealing the
> > initial reserve/write grant space at commit time and b) serialising
> > stealing vs transactional use of the ticket.
> > 
> > That is, after the roll, the ticket has a full reserve and grant
> > space reservation for all the items accounted to the relog ticket.
> > Every new relog item added to the ticket (or is relogged in the CIL
> > and uses more space) adds the full required reserve/write grant
> > space to the relog ticket. Hence the relog ticket always has
> > current log space reserved to commit the entire set of items tagged
> > as reloggable. And by avoiding modifying the ticket while we are
> > actively processing the relog transaction, we don't screw up the
> > ticket accounting in the middle of the transaction....
> > 
> 
> Yep, Ok.
> 
> > > Another problem is that reloggable items are still otherwise usable once
> > > they are unlocked. So for example we'd have to account for a situation
> > > where one transaction dirties a buffer, enables relogging, commits and
> > > then some other transaction dirties more of the buffer and commits
> > > without caring whether the buffer was relog enabled or not.
> > 
> > yup, that's the delta size updates from the CIL commit. i.e. if we
> > relog an item to the CIL that has the XFS_LI_RELOG flag already set
> > on it, the change in size that we steal for the CIL ticket also
> > needs to be stolen for the AIL ticket. i.e. we already do almost all
> > the work we need to handle this.
> > 
> > > Unless
> > > runtime relog reservation is always worst case, that's a subtle path to
> > > reservation overrun in the relog transaction.
> > 
> > Yes, but it's a problem the CIL already solves for us :P
> 
> Ok, I think we're pretty close to the same page here. I was thinking
> about the worst case relog reservation being pulled off the committing
> transaction unconditionally, where I think you're thinking about it as
> the transaction (i.e. reservation calculation) would have the worst case
> reservation, but we'd only pull off the delta as needed at commit time
> (i.e. exactly how the CIL works wrt to transaction reservation
> consumption). Let me work through a simple example to try and
> (in)validate my concern:
> 
> - Transaction A is relog enabled, dirties 50% of a buffer and enables
>   auto relogging. On commit, the CIL takes buffer/2 reservation for the
>   log vector and the relog mechanism takes the same amount for
>   subsequent relogs.
> - Transaction B is not relog enabled (so no extra relog reservation),
>   dirties another 40% of the (already relog enabled) buffer and commits.
>   The CIL takes 90% of the transaction buffer reservation. The relog
>   ticket now needs an additional 40% (since the original 50% is
>   maintained by the relog system), but afaict there is no guarantee that
>   res is available if trans B is not relog enabled.

Yes, I can see that would be an issue - very well spotted, Brian.

Without reading your further comments: off the top of my head that
means we would probably have to declare particular types of objects
as reloggable, and explicitly include that object in each
reservation rather than use a generic "buffer" or "inode"
reservation for it. We are likely to only be relogging "container"
objects such as intents or high level structures such as
inodes, AG headers, etc., so this probably isn't a huge increase
in transaction size or complexity, and it will be largely self
documenting.

But it still adds more complexity, and we're trying to avoid that
....

.....

Oh, I'm a bit stupid only half way through my first coffee: just get
rid of delta updates altogether and steal the entire relog
reservation for the object up front.  We really don't need delta
updates, it was just something the CIL does so I didn't think past
"we can use that because it is already there"....

/me reads on...

> So IIUC, the CIL scheme works because every transaction automatically
> includes worst case reservation for every possible item supported by the
> transaction. Relog transactions are presumably more selective, however,
> so we need to either have some rule where only relog enabled
> transactions can touch relog enabled items (it's not clear to me how
> realistic that is for things like buffers etc., but it sounds
> potentially delicate) or we simplify the relog reservation consumption
> calculation to consume worst case reservation for the relog ticket in
> anticipation of unrelated transactions dirtying more of the associated
> object since it was originally committed for relog. Thoughts?

Yep, I think we've both come to the same conclusion :)

> Note that carrying worst case reservation also potentially simplifies
> accounting between multiple dirty+relog increments and a single relog
> cancel, particularly if a relog cancelling transaction also happens to
> dirty more of a buffer before it commits. I'd prefer to avoid subtle
> usage landmines like that as much as possible. Hmm.. even if we do
> ultimately want a more granular and dynamic relog res accounting, I
> might start with the simple worst-case approach since 1.) it probably
> doesn't matter for intents, 2.) it's easier to reason about the isolated
> reservation calculation changes in an independent patch from the rest of
> the mechanism and 3.) I'd rather get the basics nailed down and solid
> before potentially fighting with granular reservation accounting
> accuracy bugs. ;)

Absolutely. I don't think we'll ever need anything more dynamic
or complex. And I think just taking the whole reservation will scale
a lot better if we are constantly modifying the object - we only
need to take the relog lock when we add or remove a relog
reservation now, not on every delta change to the object....
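
i.e. relog enable becomes roughly this (sketch - the worst case size
helper is invented for illustration):

/* steal the entire worst case relog reservation up front */
nbytes = xfs_relog_res_size(lip);	/* hypothetical worst case calc */
tp->t_ticket->t_curr_res -= nbytes;
relog_ticket->t_curr_res += nbytes;
relog_ticket->t_unit_res += nbytes;

and relog cancel just hands that same worst case amount straight back,
so there's no per-delta accounting anywhere in between.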

Again, well spotted and good thinking!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-03 15:12           ` Brian Foster
@ 2020-03-03 21:47             ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2020-03-03 21:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 03, 2020 at 10:12:17AM -0500, Brian Foster wrote:
> On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > OK, XLOG_TIC_INITED is redundant, and should be removed. And
> > > xfs_log_done() needs to be split into two, one for releasing the
> > > ticket, one for completing the xlog_write() call. Compile tested
> > > only patch below for you :P
> > 
> > And now with sample patch.
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > xfs: kill XLOG_TIC_INITED
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Delayed logging made this redundant as we never directly write
> > transactions to the log anymore. Hence we no longer make multiple
> > xlog_write() calls for a transaction as we format individual items
> > in a transaction, and hence don't need to keep track of whether we
> > should be writing a start record for every xlog_write call.
> > 
> 
> FWIW the commit log could use a bit more context, perhaps from your
> previous description, about the original semantics of _INITED flag.
> E.g., it's always been rather vague to me, probably because it seems to
> be a remnant of some no longer fully in place functionality.

Yup, it was a quick "here's what it looks like" patch.

> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 79 ++++++++++++++++++---------------------------------
> >  fs/xfs/xfs_log.h      |  4 ---
> >  fs/xfs/xfs_log_cil.c  | 13 +++++----
> >  fs/xfs/xfs_log_priv.h | 18 ++++++------
> >  fs/xfs/xfs_trans.c    | 24 ++++++++--------
> >  5 files changed, 55 insertions(+), 83 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index f6006d94a581..a45f3eefee39 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -496,8 +496,8 @@ xfs_log_reserve(
> >   * This routine is called when a user of a log manager ticket is done with
> >   * the reservation.  If the ticket was ever used, then a commit record for
> >   * the associated transaction is written out as a log operation header with
> > - * no data.  The flag XLOG_TIC_INITED is set when the first write occurs with
> > - * a given ticket.  If the ticket was one with a permanent reservation, then
> > + * no data. 
> 
> 	      ^ trailing whitespace

Forgot to <ctrl-j> to join the lines back together :P

> 
> > + * If the ticket was one with a permanent reservation, then
> >   * a few operations are done differently.  Permanent reservation tickets by
> >   * default don't release the reservation.  They just commit the current
> >   * transaction with the belief that the reservation is still needed.  A flag
> > @@ -506,49 +506,38 @@ xfs_log_reserve(
> >   * the inited state again.  By doing this, a start record will be written
> >   * out when the next write occurs.
> >   */
> > -xfs_lsn_t
> > -xfs_log_done(
> > -	struct xfs_mount	*mp,
> > +int
> > +xlog_write_done(
> > +	struct xlog		*log,
> >  	struct xlog_ticket	*ticket,
> >  	struct xlog_in_core	**iclog,
> > -	bool			regrant)
> > +	xfs_lsn_t		*lsn)
> >  {
> > -	struct xlog		*log = mp->m_log;
> > -	xfs_lsn_t		lsn = 0;
> > -
> > -	if (XLOG_FORCED_SHUTDOWN(log) ||
> > -	    /*
> > -	     * If nothing was ever written, don't write out commit record.
> > -	     * If we get an error, just continue and give back the log ticket.
> > -	     */
> > -	    (((ticket->t_flags & XLOG_TIC_INITED) == 0) &&
> > -	     (xlog_commit_record(log, ticket, iclog, &lsn)))) {
> > -		lsn = (xfs_lsn_t) -1;
> > -		regrant = false;
> > -	}
> > +	if (XLOG_FORCED_SHUTDOWN(log))
> > +		return -EIO;
> >  
> > +	return xlog_commit_record(log, ticket, iclog, lsn);
> > +}
> >  
> > +/*
> > + * Release or regrant the ticket reservation now the transaction is done with
> > + * it depending on caller context. Rolling transactions need the ticket
> > + * regranted, otherwise we release it completely.
> > + */
> > +void
> > +xlog_ticket_done(
> > +	struct xlog		*log,
> > +	struct xlog_ticket	*ticket,
> > +	bool			regrant)
> > +{
> >  	if (!regrant) {
> >  		trace_xfs_log_done_nonperm(log, ticket);
> > -
> > -		/*
> > -		 * Release ticket if not permanent reservation or a specific
> > -		 * request has been made to release a permanent reservation.
> > -		 */
> >  		xlog_ungrant_log_space(log, ticket);
> >  	} else {
> >  		trace_xfs_log_done_perm(log, ticket);
> > -
> >  		xlog_regrant_reserve_log_space(log, ticket);
> > -		/* If this ticket was a permanent reservation and we aren't
> > -		 * trying to release it, reset the inited flags; so next time
> > -		 * we write, a start record will be written out.
> > -		 */
> > -		ticket->t_flags |= XLOG_TIC_INITED;
> >  	}
> > -
> >  	xfs_log_ticket_put(ticket);
> > -	return lsn;
> >  }
> 
> In general it would be nicer to split off as much refactoring as
> possible into separate patches, even though it's not yet clear to me
> what granularity is possible with this patch...

Yeah, there's heaps more cleanups that can be done as a result of
this - e.g. xlog_write_done() and xlog_commit_record() should be
merged. The one caller of xlog_write() that does not provide a
commit_iclog variable should call xlog_commit_record() itself; then
xlog_write() can just assume it always returns the last iclog,
etc....


> >  static bool
> > @@ -2148,8 +2137,9 @@ xlog_print_trans(
> >  }
> >  
> >  /*
> > - * Calculate the potential space needed by the log vector.  Each region gets
> > - * its own xlog_op_header_t and may need to be double word aligned.
> > + * Calculate the potential space needed by the log vector.  We always write a
> > + * start record, and each region gets its own xlog_op_header_t and may need to
> > + * be double word aligned.
> >   */
> >  static int
> >  xlog_write_calc_vec_length(
> > @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length(
> >  	struct xfs_log_vec	*log_vector)
> >  {
> >  	struct xfs_log_vec	*lv;
> > -	int			headers = 0;
> > +	int			headers = 1;
> >  	int			len = 0;
> >  	int			i;
> >  
> > -	/* acct for start rec of xact */
> > -	if (ticket->t_flags & XLOG_TIC_INITED)
> > -		headers++;
> > -
> >  	for (lv = log_vector; lv; lv = lv->lv_next) {
> >  		/* we don't write ordered log vectors */
> >  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> > @@ -2195,17 +2181,11 @@ xlog_write_start_rec(
> >  	struct xlog_op_header	*ophdr,
> >  	struct xlog_ticket	*ticket)
> >  {
> > -	if (!(ticket->t_flags & XLOG_TIC_INITED))
> > -		return 0;
> > -
> >  	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
> >  	ophdr->oh_clientid = ticket->t_clientid;
> >  	ophdr->oh_len = 0;
> >  	ophdr->oh_flags = XLOG_START_TRANS;
> >  	ophdr->oh_res2 = 0;
> > -
> > -	ticket->t_flags &= ~XLOG_TIC_INITED;
> > -
> >  	return sizeof(struct xlog_op_header);
> >  }
> 
> The header comment to this function refers to the "inited" state.

Missed that...

> Also note that there's a similar reference in
> xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero
> so might be fine outside of the stale comment.

More cleanups!

> > @@ -2410,12 +2390,10 @@ xlog_write(
> >  	len = xlog_write_calc_vec_length(ticket, log_vector);
> >  
> >  	/*
> > -	 * Region headers and bytes are already accounted for.
> > -	 * We only need to take into account start records and
> > -	 * split regions in this function.
> > +	 * Region headers and bytes are already accounted for.  We only need to
> > +	 * take into account start records and split regions in this function.
> >  	 */
> > -	if (ticket->t_flags & XLOG_TIC_INITED)
> > -		ticket->t_curr_res -= sizeof(xlog_op_header_t);
> > +	ticket->t_curr_res -= sizeof(xlog_op_header_t);
> >  
> 
> So AFAICT the CIL allocates a ticket and up to this point only mucks
> around with the reservation value. That means _INITED is still in place
> once we get to xlog_write(). xlog_write() immediately calls
> xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment
> before touching ->t_flags, so those bits all seem fine.
> 
> We then get into the log vector loops, where it looks like we call
> xlog_write_start_rec() for each log vector region and rely on the
> _INITED flag to only write a start record once per associated ticket.
> Unless I'm missing something, this looks like it would change behavior
> to perhaps write a start record per-region..? Note that this might not
> preclude the broader change to kill off _INITED since we're using the
> same ticket throughout the call, but some initial refactoring might be
> required to remove this dependency first.

Ah, yes, well spotted.

I need to move the call to xlog_write_start_rec() outside
the loop - it only needs to be written once per ticket, and we only
ever supply one ticket to xlog_write() now, and it is never reused
to call back into xlog_write again for the same transaction context.

I did say "compile tested only" :)

> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index b192c5a9f9fd..6965d164ff45 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -53,11 +53,9 @@ enum xlog_iclog_state {
> >  /*
> >   * Flags to log ticket
> >   */
> > -#define XLOG_TIC_INITED		0x1	/* has been initialized */
> >  #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
> 
> These values don't end up on disk, right? If not, it might be worth
> resetting the _PERM_RESERV value to 1. Otherwise the rest looks like
> mostly straightforward refactoring. 

*nod*

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management
  2020-03-03 21:26           ` Dave Chinner
@ 2020-03-04 14:03             ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2020-03-04 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 04, 2020 at 08:26:26AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote:
> > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > > <thinking out loud from here on..>
> > > > 
> > > > One thing that comes to mind thinking about this is dealing with
> > > > batching (relog multiple items per roll). This code doesn't handle that
> > > > yet, but I anticipate it being a requirement and it's fairly easy to
> > > > update the current scheme to support a fixed item count per relog-roll.
> > > > 
> > > > A stealing approach potentially complicates things when it comes to
> > > > batching because we have per-item reservation granularity to consider.
> > > > For example, consider if we had a variety of different relog item types
> > > > active at once, a subset are being relogged while another subset are
> > > > being disabled (before ever needing to be relogged).
> > > 
> > > So concurrent "active relogging" + "start relogging" + "stop
> > > relogging"?
> > > 
> > 
> > Yeah..
> > 
> > > > For one, we'd have
> > > > to be careful to not release reservation while it might be accounted to
> > > > an active relog transaction that is currently rolling with some other
> > > > items, etc. There's also a potential quirk in handling reservation of a
> > > > relogged item that is cancelled while it's being relogged, but that
> > > > might be more of an implementation detail.
> > > 
> > > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> > > my example relog transaction item add loop + commit function in?
> > > 
> > 
> > Yes, but the side effects didn't quite register..
> > 
> > > i.e. while we are building and committing a relog transaction, we
> > > hold off the transactions that are trying to add/remove their items
> > > to/from the relog list. Hence the reservation stealing accounting in
> > > the ticket can be serialised against the transactional use of the
> > > ticket.
> > > 
> > 
> > Hmm, so this changes relog item/state management as well as the
> > reservation management. That's probably why it wasn't clear to me from
> > the code example.
> 
> Yeah, I left a lot out :P
> 
> > > It's basically the same method we use for serialising addition to
> > > the CIL in transaction commit against CIL pushes draining the
> > > current list for log writes (rwsem for add/push serialisation, spin
> > > lock for concurrent add serialisation under the rwsem).
> > > 
> > 
> > Ok. The difference with the CIL is that in that context we're processing
> > a single, constantly maintained list of items. As of right now, there is
> > no such list for relog enabled items. If I'm following correctly, that's
> > still not necessary here, we're just talking about wrapping a rw lock
> > around the state management (+ res accounting) and the actual res
> > consumption such that a relog cancel doesn't steal reservation from an
> > active relog transaction, for example.
> 
> That is correct. I don't think it actually matters because we can't
> remove an item that is locked and being relogged (so its
> reservation is in use). Except for the fact we can't serialise
> internal ticket accounting updates the transaction might be doing
> with external ticket accounting modifications any other way.
> 

We do still need some form of serialization for objects that aren't
currently lockable, like certain intents. I suppose we could add locks
where necessary, but right now I'm thinking a slight alteration of
the res accounting strategy to distinguish relog enabled items currently
resident in the relog transaction (i.e. awaiting the AIL processing to
complete and commit/roll the transaction) from all other relog enabled
items might be good enough.

Without something like that (or IOW if the relog transaction is the
central holder of all outstanding relog reservation), we always
have to serialize relog cancel on an item against xfsaild populating a
relog transaction, regardless of whether the affected item is being
relogged at the time. That's again probably not something that's a huge
deal for the current use case, but it also _seems_ like something that
can be addressed via implementation tweaks without changing the
fundamental design. I guess we'll see once I have a chance to actually
play around with it... :P
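
FWIW, the rough shape I'm picturing for the cancel side is below. All
of the naming here is hypothetical (the in-flight flag, the res pool
counter and the lock don't exist yet):

/* relog cancel: only serialise against xfsaild when the item is
 * actually resident in the relog transaction */
if (test_bit(XFS_LI_RELOG_INFLIGHT, &lip->li_flags)) {
	/* wait out the pending relog commit/roll before pulling the
	 * reservation back out of the relog ticket */
	down_read(&ailp->ail_relog_lock);
	xfs_relog_ticket_cancel(ailp, lip);	/* hypothetical */
	up_read(&ailp->ail_relog_lock);
} else {
	/* not in flight: return the res directly to the outstanding
	 * relog res pool */
	spin_lock(&ailp->ail_relog_res_lock);
	ailp->ail_relog_res -= xfs_relog_res_size(lip);
	spin_unlock(&ailp->ail_relog_res_lock);
}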

> > > > I don't think that's a show stopper, but rather just something I'd like
> > > > to have factored into the design from the start. One option could be to
> > > 
> > > *nod*
> > > 
> > > > maintain a separate counter of active relog reservation aside from the
> > > > actual relog ticket. That way the relog ticket could just pull from this
> > > > relog reservation pool based on the current item(s) being relogged
> > > > asynchronously from different tasks that might add or remove reservation
> > > > from the pool for separate items. That might get a little wonky when we
> > > > consider the relog ticket needs to pull from the pool and then put
> > > > something back if the item is still reloggable after the relog
> > > > transaction rolls.
> > > 
> > > Right, that's the whole problem that we solve via a) stealing the
> > > initial reserve/write grant space at commit time and b) serialising
> > > stealing vs transactional use of the ticket.
> > > 
> > > That is, after the roll, the ticket has a full reserve and grant
> > > space reservation for all the items accounted to the relog ticket.
> > > Every new relog item added to the ticket (or is relogged in the CIL
> > > and uses more space) adds the full required reserve/write grant
> > > space to the relog ticket. Hence the relog ticket always has
> > > current log space reserved to commit the entire set of items tagged
> > > as reloggable. And by avoiding modifying the ticket while we are
> > > actively processing the relog transaction, we don't screw up the
> > > ticket accounting in the middle of the transaction....
> > > 
> > 
> > Yep, Ok.
> > 
> > > > Another problem is that reloggable items are still otherwise usable once
> > > > they are unlocked. So for example we'd have to account for a situation
> > > > where one transaction dirties a buffer, enables relogging, commits and
> > > > then some other transaction dirties more of the buffer and commits
> > > > without caring whether the buffer was relog enabled or not.
> > > 
> > > yup, that's the delta size updates from the CIL commit. i.e. if we
> > > relog an item to the CIL that has the XFS_LI_RELOG flag already set
> > > on it, the change in size that we steal for the CIL ticket also
> > > needs to be stolen for the AIL ticket. i.e. we already do almost all
> > > the work we need to handle this.
> > > 
> > > > Unless
> > > > runtime relog reservation is always worst case, that's a subtle path to
> > > > reservation overrun in the relog transaction.
> > > 
> > > Yes, but it's a problem the CIL already solves for us :P
> > 
> > Ok, I think we're pretty close to the same page here. I was thinking
> > about the worst case relog reservation being pulled off the committing
> > transaction unconditionally, where I think you're thinking about it as
> > the transaction (i.e. reservation calculation) would have the worst case
> > reservation, but we'd only pull off the delta as needed at commit time
> > (i.e. exactly how the CIL works wrt to transaction reservation
> > consumption). Let me work through a simple example to try and
> > (in)validate my concern:
> > 
> > - Transaction A is relog enabled, dirties 50% of a buffer and enables
> >   auto relogging. On commit, the CIL takes buffer/2 reservation for the
> >   log vector and the relog mechanism takes the same amount for
> >   subsequent relogs.
> > - Transaction B is not relog enabled (so no extra relog reservation),
> >   dirties another 40% of the (already relog enabled) buffer and commits.
> >   The CIL takes 90% of the transaction buffer reservation. The relog
> >   ticket now needs an additional 40% (since the original 50% is
> >   maintained by the relog system), but afaict there is no guarantee that
> >   res is available if trans B is not relog enabled.
> 
> Yes, I can see that would be an issue - very well spotted, Brian.
> 
> Without reading your further comments: off the top of my head that
> means we would probably have to declare particular types of objects
> as reloggable, and explicitly include that object in each
> reservation rather than use a generic "buffer" or "inode"
> reservation for it. We are likely to only be relogging "container"
> objects such as intents or high level structures such as
> inodes, AG headers, etc., so this probably isn't a huge increase
> in transaction size or complexity, and it will be largely self
> documenting.
> 
> But it still adds more complexity, and we're trying to avoid that
> ....
> 
> .....
> 
> Oh, I'm a bit stupid only half way through my first coffee: just get
> rid of delta updates altogether and steal the entire relog
> reservation for the object up front.  We really don't need delta
> updates, it was just something the CIL does so I didn't think past
> "we can use that because it is already there"....
> 
> /me reads on...
> 
> > So IIUC, the CIL scheme works because every transaction automatically
> > includes worst case reservation for every possible item supported by the
> > transaction. Relog transactions are presumably more selective, however,
> > so we need to either have some rule where only relog enabled
> > transactions can touch relog enabled items (it's not clear to me how
> > realistic that is for things like buffers etc., but it sounds
> > potentially delicate) or we simplify the relog reservation consumption
> > calculation to consume worst case reservation for the relog ticket in
> > anticipation of unrelated transactions dirtying more of the associated
> > object since it was originally committed for relog. Thoughts?
> 
> Yep, I think we've both come to the same conclusion :)
> 

Indeed. Account for the max res. size of the associated object that the
relog enabled transaction already acquired for us anyway and we
shouldn't have to worry about further modifications of the object.

> > Note that carrying worst case reservation also potentially simplifies
> > accounting between multiple dirty+relog increments and a single relog
> > cancel, particularly if a relog cancelling transaction also happens to
> > dirty more of a buffer before it commits. I'd prefer to avoid subtle
> > usage landmines like that as much as possible. Hmm.. even if we do
> > ultimately want a more granular and dynamic relog res accounting, I
> > might start with the simple worst-case approach since 1.) it probably
> > doesn't matter for intents, 2.) it's easier to reason about the isolated
> > reservation calculation changes in an independent patch from the rest of
> > the mechanism and 3.) I'd rather get the basics nailed down and solid
> > before potentially fighting with granular reservation accounting
> > accuracy bugs. ;)
> 
> Absolutely. I don't think we'll ever need anything more dynamic
> or complex. And I think just taking the whole reservation will scale
> a lot better if we are constantly modifying the object - we only
> need to take the relog lock when we add or remove a relog
> reservation now, not on every delta change to the object....
> 

Yep, Ok. I think there are still some potential quirks to work through
with this approach (re: the per-item granularity
accounting/serialization bits mentioned above), but at this point it's
easier for me to reason about it by hacking on it a bit. If there are
still any issues to work out, best to have something concrete on the
list to guide discussion. Thanks Dave, appreciate the design input!

Brian

> Again, well spotted and good thinking!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



end of thread, other threads:[~2020-03-04 14:03 UTC | newest]

Thread overview: 59+ messages
-- links below jump to the message on this page --
2020-02-27 13:43 [RFC v5 PATCH 0/9] xfs: automatic relogging experiment Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 1/9] xfs: set t_task at wait time instead of alloc time Brian Foster
2020-02-27 20:48   ` Allison Collins
2020-02-27 23:28   ` Darrick J. Wong
2020-02-28  0:10     ` Dave Chinner
2020-02-28 13:46       ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 2/9] xfs: introduce ->tr_relog transaction Brian Foster
2020-02-27 20:49   ` Allison Collins
2020-02-27 23:31   ` Darrick J. Wong
2020-02-28 13:52     ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 3/9] xfs: automatic relogging reservation management Brian Foster
2020-02-27 20:49   ` Allison Collins
2020-02-28  0:02   ` Darrick J. Wong
2020-02-28 13:55     ` Brian Foster
2020-03-02  3:07   ` Dave Chinner
2020-03-02 18:06     ` Brian Foster
2020-03-02 23:25       ` Dave Chinner
2020-03-03  4:07         ` Dave Chinner
2020-03-03 15:12           ` Brian Foster
2020-03-03 21:47             ` Dave Chinner
2020-03-03 14:13         ` Brian Foster
2020-03-03 21:26           ` Dave Chinner
2020-03-04 14:03             ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 4/9] xfs: automatic relogging item management Brian Foster
2020-02-27 21:18   ` Allison Collins
2020-03-02  5:58   ` Dave Chinner
2020-03-02 18:08     ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 5/9] xfs: automatic log item relog mechanism Brian Foster
2020-02-27 22:54   ` Allison Collins
2020-02-28  0:13   ` Darrick J. Wong
2020-02-28 14:02     ` Brian Foster
2020-03-02  7:32       ` Dave Chinner
2020-03-02  7:18   ` Dave Chinner
2020-03-02 18:52     ` Brian Foster
2020-03-03  0:06       ` Dave Chinner
2020-03-03 14:14         ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 6/9] xfs: automatically relog the quotaoff start intent Brian Foster
2020-02-27 23:19   ` Allison Collins
2020-02-28 14:03     ` Brian Foster
2020-02-28 18:55       ` Allison Collins
2020-02-28  1:16   ` Darrick J. Wong
2020-02-28 14:04     ` Brian Foster
2020-02-29  5:35       ` Darrick J. Wong
2020-02-29 12:15         ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 7/9] xfs: buffer relogging support prototype Brian Foster
2020-02-27 23:33   ` Allison Collins
2020-02-28 14:04     ` Brian Foster
2020-03-02  7:47   ` Dave Chinner
2020-03-02 19:00     ` Brian Foster
2020-03-03  0:09       ` Dave Chinner
2020-03-03 14:14         ` Brian Foster
2020-02-27 13:43 ` [RFC v5 PATCH 8/9] xfs: create an error tag for random relog reservation Brian Foster
2020-02-27 23:35   ` Allison Collins
2020-02-27 13:43 ` [RFC v5 PATCH 9/9] xfs: relog random buffers based on errortag Brian Foster
2020-02-27 23:48   ` Allison Collins
2020-02-28 14:06     ` Brian Foster
2020-02-27 15:09 ` [RFC v5 PATCH 0/9] xfs: automatic relogging experiment Darrick J. Wong
2020-02-27 15:18   ` Brian Foster
2020-02-27 15:22     ` Darrick J. Wong
