All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: one extent per EFI
@ 2023-04-14 22:58 Wengang Wang
  2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
  2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
  0 siblings, 2 replies; 16+ messages in thread
From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

We are hitting the deadlock described in patch 1.
This patchset doesn't want to disturb the existing block allocation
routine, that would make the allocation routime even complex. Instead,
this patch avoids doing AGFL block allocation holding busy extents in current
memory transaction.

Patch 1 fixes the IO path and Patch 2 takes care of log recovery.

Wengang Wang (2):
  xfs: IO time one extent per EFI
  xfs: log recovery stage split EFIs with multiple extents

 fs/xfs/xfs_extfree_item.c | 104 ++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_extfree_item.h |   9 +++-
 2 files changed, 101 insertions(+), 12 deletions(-)

-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang
@ 2023-04-14 22:58 ` Wengang Wang
  2023-04-19 23:55   ` Dave Chinner
  2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

At IO time, make sure an EFI contains only one extent. Transaction rolling in
xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we
avoid holding busy extents (for previously extents in the same EFI) in current
transaction when allocating blocks for AGFL where we could be otherwise stuck
waiting the busy extents held by current transaction to be flushed (thus a
deadlock).

The log changes
1) before change:

    358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2  len: 48 flags: None
    359 EFI  nextents:2 id:ffff9fef708ba940
    360 EFI id=ffff9fef708ba940 (0x21, 7)
    361 EFI id=ffff9fef708ba940 (0x18, 8)
    362 -----------------------------------------------------------------
    363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2  len: 48 flags: None
    364 EFD  nextents:2 id:ffff9fef708ba940
    365 EFD id=ffff9fef708ba940 (0x21, 7)
    366 EFD id=ffff9fef708ba940 (0x18, 8)

2) after change:

    830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f  len: 32 flags: None
    831 EFI  nextents:1 id:ffff9fef708b9b80
    832 EFI id=ffff9fef708b9b80 (0x21, 7)
    833 -----------------------------------------------------------------
    834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f  len: 32 flags: None
    835 EFI  nextents:1 id:ffff9fef708b9d38
    836 EFI id=ffff9fef708b9d38 (0x18, 8)
    837 -----------------------------------------------------------------
    838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f  len: 32 flags: None
    839 EFD  nextents:1 id:ffff9fef708b9b80
    840 EFD id=ffff9fef708b9b80 (0x21, 7)
    841 -----------------------------------------------------------------
    842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f  len: 32 flags: None
    843 EFD  nextents:1 id:ffff9fef708b9d38
    844 EFD id=ffff9fef708b9d38 (0x18, 8)

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/xfs_extfree_item.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
index da6a5afa607c..ae84d77eaf30 100644
--- a/fs/xfs/xfs_extfree_item.h
+++ b/fs/xfs/xfs_extfree_item.h
@@ -13,8 +13,15 @@ struct kmem_cache;
 
 /*
  * Max number of extents in fast allocation path.
+ *
+ * At IO time, make sure an EFI contains only one extent. Transaction rolling
+ * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By
+ * that we avoid holding busy extents (for previously extents in the same EFI)
+ * in current transaction when allocating blocks for AGFL where we could be
+ * otherwise stuck waiting the busy extents held by current transaction to be
+ * flushed (thus a deadlock).
  */
-#define	XFS_EFI_MAX_FAST_EXTENTS	16
+#define	XFS_EFI_MAX_FAST_EXTENTS	1
 
 /*
  * This is the "extent free intention" log item.  It is used to log the fact
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
  2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang
  2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
@ 2023-04-14 22:58 ` Wengang Wang
  2023-04-20  0:30   ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

At log recovery stage, we need to split EFIs with multiple extents. For each
orginal multiple-extent EFI, split it into new EFIs each including one extent
from the original EFI. By that we avoid deadlock when allocating blocks for
AGFL waiting for the held busy extents by current transaction to be flushed.

 For the original EFI, the process is
 1. Create and log new EFIs each covering one extent from the
    original EFI.
 2. Don't free extent with the original EFI.
 3. Log EFD for the original EFI.
    Make sure we log the new EFIs and original EFD in this order:
      new EFI 1
      new EFI 2
      ...
      new EFI N
      original EFD
 The original extents are freed with the new EFIs.

The example log items:

 rbbn 41572 rec_lsn: 1638833,41568 Oper 18: tid: d746ea5d  len: 48 flags: None
 EFI  nextents:2 id:ffff8b10b5a13c28        --> orginal EFI
 EFI id=ffff8b10b5a13c28 (0x5de4c42, 256)
 EFI id=ffff8b10b5a13c28 (0x5de4942, 256)

 rbbn 39041 rec_lsn: 1638834,39040 Oper 2: tid: 4e651c99  len: 32 flags: None
 EFI  nextents:1 id:ffff9fef39f4c528	    --> new EFI 1
 EFI id=ffff9fef39f4c528 (0x5de4c42, 256)
 -----------------------------------------------------------------------------
 rbbn 39041 rec_lsn: 1638834,39040 Oper 3: tid: 4e651c99  len: 32 flags: None
 EFI  nextents:1 id:ffff9fef39f4f548	    --> new EFI 2
 EFI id=ffff9fef39f4f548 (0x5de4942, 256)
 -----------------------------------------------------------------------------
 rbbn 39041 rec_lsn: 1638834,39040 Oper 4: tid: 4e651c99  len: 48 flags: None
 EFD  nextents:2 id:ffff8b10b5a13c28	    --> EFD to original EFI
 EFD id=ffff8b10b5a13c28 (0x5de4c42, 256)
 EFD id=ffff8b10b5a13c28 (0x5de4942, 256)
 -----------------------------------------------------------------------------
 rbbn 39041 rec_lsn: 1638834,39040 Oper 5: tid: 4e651c99  len: 32 flags: None
 EFD  nextents:1 id:ffff9fef39f4c528	    --> EFD to new EFI 1
 EFD id=ffff9fef39f4c528 (0x5de4c42, 256)

 ......

 rbbn 39057 rec_lsn: 1638834,39056 Oper 2: tid: e3264681  len: 32 flags: None
 EFD  nextents:1 id:ffff9fef39f4f548	    --> EFD to new EFI 2
 EFD id=ffff9fef39f4f548 (0x5de4942, 256)

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/xfs_extfree_item.c | 104 ++++++++++++++++++++++++++++++++++----
 1 file changed, 93 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 011b50469301..b00b44234397 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -595,7 +595,11 @@ xfs_efi_item_recover(
 	struct list_head		*capture_list)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
+	int				nr_ext = efip->efi_format.efi_nextents;
 	struct xfs_mount		*mp = lip->li_log->l_mp;
+	struct xfs_efi_log_item		**new_efis, *new_efip;
+	struct xfs_efd_log_item		*new_efdp;
+	struct xfs_extent_free_item	fake;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	int				i;
@@ -606,7 +610,7 @@ xfs_efi_item_recover(
 	 * EFI.  If any are bad, then assume that all are bad and
 	 * just toss the EFI.
 	 */
-	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	for (i = 0; i < nr_ext; i++) {
 		if (!xfs_efi_validate_ext(mp,
 					&efip->efi_format.efi_extents[i])) {
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
@@ -619,28 +623,106 @@ xfs_efi_item_recover(
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
 	if (error)
 		return error;
-	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
 
-	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
-		struct xfs_extent_free_item	fake = {
-			.xefi_owner		= XFS_RMAP_OWN_UNKNOWN,
-		};
+	memset(&fake, 0, sizeof(fake));
+	fake.xefi_owner = XFS_RMAP_OWN_UNKNOWN;
+
+	if (nr_ext <= 1) {
+		efdp = xfs_trans_get_efd(tp, efip,
+				efip->efi_format.efi_nextents);
+
+		for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+			struct xfs_extent		*extp;
+
+			extp = &efip->efi_format.efi_extents[i];
+
+			fake.xefi_startblock = extp->ext_start;
+			fake.xefi_blockcount = extp->ext_len;
+
+			error = xfs_trans_free_extent(tp, efdp, &fake);
+			if (error == -EFSCORRUPTED)
+				XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+						extp, sizeof(*extp));
+			if (error)
+				goto abort_error;
+
+		}
+
+		return xfs_defer_ops_capture_and_commit(tp, capture_list);
+	}
+
+	/*
+	 * Log recovery stage, we need to split a EFI into new EFIs if the
+	 * original EFI includes more than one extents. Check the change of
+	 * XFS_EFI_MAX_FAST_EXTENTS for the reason.
+	 * For the original EFI, the process is
+	 * 1. Create and log new EFIs each covering one extent from the
+	 *    original EFI.
+	 * 2. Don't free extent with the original EFI.
+	 * 3. Log EFD for the original EFI.
+	 *    Make sure we log the new EFIs and original EFD in this order:
+	 *	new EFI 1
+	 *	new EFI 2
+	 *	...
+	 *	new EFI N
+	 *	original EFD
+	 * The original extents are freed with the new EFIs.
+	 */
+	new_efis = kmem_zalloc(sizeof(*new_efis) * nr_ext, 0);
+	if (!new_efis) {
+		error = -ENOMEM;
+		goto abort_error;
+	}
+	for (i = 0; i < nr_ext; i++) {
 		struct xfs_extent		*extp;
 
+		new_efip = xfs_efi_init(mp, 1);
 		extp = &efip->efi_format.efi_extents[i];
 
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
+		xfs_trans_add_item(tp, &new_efip->efi_item);
+		xfs_extent_free_log_item(tp, new_efip, &fake);
+		new_efis[i] = new_efip;
+	}
+
+	/*
+	 * The new EFIs are in transaction now, add original EFD with
+	 * full extents.
+	 */
+	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
+	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
+	efdp->efd_next_extent = nr_ext;
+	for (i = 0; i < nr_ext; i++)
+		efdp->efd_format.efd_extents[i] =
+			efip->efi_format.efi_extents[i];
 
-		error = xfs_trans_free_extent(tp, efdp, &fake);
+	/*
+	 * Now process the new EFIs.
+	 * Current transaction is a new one, there are no defered
+	 * works attached. It's safe to use the following first
+	 * xfs_trans_roll() to commit it.
+	 */
+	for (i = 0; i < nr_ext; i++) {
+		struct xfs_extent		*extp;
+
+		new_efip = new_efis[i];
+		new_efdp = xfs_trans_get_efd(tp, new_efip, 1);
+		extp = &new_efip->efi_format.efi_extents[0];
+		fake.xefi_startblock = extp->ext_start;
+		fake.xefi_blockcount = extp->ext_len;
+		error = xfs_trans_free_extent(tp, new_efdp, &fake);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
-					extp, sizeof(*extp));
-		if (error)
+						extp, sizeof(*extp));
+		if (!error)
+			error = xfs_trans_roll(&tp);
+		if (error) {
+			kmem_free(new_efis);
 			goto abort_error;
-
+		}
 	}
-
+	kmem_free(new_efis);
 	return xfs_defer_ops_capture_and_commit(tp, capture_list);
 
 abort_error:
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
@ 2023-04-19 23:55   ` Dave Chinner
  2023-04-20 17:31     ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2023-04-19 23:55 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote:
> At IO time, make sure an EFI contains only one extent. Transaction rolling in
> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we
> avoid holding busy extents (for previously extents in the same EFI) in current
> transaction when allocating blocks for AGFL where we could be otherwise stuck
> waiting the busy extents held by current transaction to be flushed (thus a
> deadlock).
> 
> The log changes
> 1) before change:
> 
>     358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2  len: 48 flags: None
>     359 EFI  nextents:2 id:ffff9fef708ba940
>     360 EFI id=ffff9fef708ba940 (0x21, 7)
>     361 EFI id=ffff9fef708ba940 (0x18, 8)
>     362 -----------------------------------------------------------------
>     363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2  len: 48 flags: None
>     364 EFD  nextents:2 id:ffff9fef708ba940
>     365 EFD id=ffff9fef708ba940 (0x21, 7)
>     366 EFD id=ffff9fef708ba940 (0x18, 8)
> 
> 2) after change:
> 
>     830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f  len: 32 flags: None
>     831 EFI  nextents:1 id:ffff9fef708b9b80
>     832 EFI id=ffff9fef708b9b80 (0x21, 7)
>     833 -----------------------------------------------------------------
>     834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f  len: 32 flags: None
>     835 EFI  nextents:1 id:ffff9fef708b9d38
>     836 EFI id=ffff9fef708b9d38 (0x18, 8)
>     837 -----------------------------------------------------------------
>     838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f  len: 32 flags: None
>     839 EFD  nextents:1 id:ffff9fef708b9b80
>     840 EFD id=ffff9fef708b9b80 (0x21, 7)
>     841 -----------------------------------------------------------------
>     842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f  len: 32 flags: None
>     843 EFD  nextents:1 id:ffff9fef708b9d38
>     844 EFD id=ffff9fef708b9d38 (0x18, 8)
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  fs/xfs/xfs_extfree_item.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
> index da6a5afa607c..ae84d77eaf30 100644
> --- a/fs/xfs/xfs_extfree_item.h
> +++ b/fs/xfs/xfs_extfree_item.h
> @@ -13,8 +13,15 @@ struct kmem_cache;
>  
>  /*
>   * Max number of extents in fast allocation path.
> + *
> + * At IO time, make sure an EFI contains only one extent. Transaction rolling
> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By
> + * that we avoid holding busy extents (for previously extents in the same EFI)
> + * in current transaction when allocating blocks for AGFL where we could be
> + * otherwise stuck waiting the busy extents held by current transaction to be
> + * flushed (thus a deadlock).
>   */
> -#define	XFS_EFI_MAX_FAST_EXTENTS	16
> +#define	XFS_EFI_MAX_FAST_EXTENTS	1

IIRC, this doesn't have anything to do with the number of extents an
EFI can hold. All it does is control how the memory for the EFI
allocated.

Oh, at some point it got overloaded code to define the max items in
a defer ops work item. Ok, I now see why you changed this, but I
don't think this is right way to solve the problem. We can handle
processing multiple extents per EFI just fine, we just need to
update the EFD and roll the transaction on each extent we process,
yes?

In hindsight, the use of XFS_EFI_MAX_FAST_EXTENTS to limit intent
size is pretty awful. I also see the same pattern copied in every
other intent.

Darrick, if the deferops code has been limiting the number of
extents in a work item to this value, when will we ever see an
intent with more extents that .max_items in it? And if that is the
case, then why wouldn't we consider an intent with more extents than
we support in log recovery a corruption event?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
  2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
@ 2023-04-20  0:30   ` Dave Chinner
  2023-04-20 17:10     ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2023-04-20  0:30 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote:
> At log recovery stage, we need to split EFIs with multiple extents. For each
> orginal multiple-extent EFI, split it into new EFIs each including one extent
> from the original EFI. By that we avoid deadlock when allocating blocks for
> AGFL waiting for the held busy extents by current transaction to be flushed.
> 
>  For the original EFI, the process is
>  1. Create and log new EFIs each covering one extent from the
>     original EFI.
>  2. Don't free extent with the original EFI.
>  3. Log EFD for the original EFI.
>     Make sure we log the new EFIs and original EFD in this order:
>       new EFI 1
>       new EFI 2
>       ...
>       new EFI N
>       original EFD
>  The original extents are freed with the new EFIs.

We may not have the log space available during recovery to explode a
single EFI out into many EFIs like this. The EFI only had enough
space reserved for processing a single EFI, and exploding a single
EFI out like this requires an individual log reservation for each
new EFI. Hence this de-multiplexing process risks running out of log
space and deadlocking before we've been able to process anything.

Hence the only option we really have here is to replicate how CUIs
are handled.  We must process the first extent with a whole EFD and
a new EFI containing the remaining unprocessed extents as defered
operations.  i.e.

1. free the first extent in the original EFI
2. log an EFD for the original EFI
3. Add all the remaining extents in the original EFI to an xefi chain
4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from
   the xefi chain and commit the current transaction.

xfs_defer_ops_capture_and_commit() will then add a work item to the
defered list which will come back to the new EFI and process it
through the normal runtime deferred ops intent processing path.

The first patch changed that path to only create intents with a
single extent, so the continued defer ops would then do the right
thing with that change in place. However, I think that we also need
the runtime code to process a single extent per intent per commit in
the same manner as above. i.e. we process the first extent in the
intent, then relog all the remaining unprocessed extents as a single
new intent.

Note that this is similar to how we already relog intents to roll
them forward in the journal. The only difference for single extent
processing is that an intent relog duplicates the entire extent list
in the EFD and the new EFI, whilst what we want is the new EFI to
contain all the extents except the one we just processed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
  2023-04-20  0:30   ` Dave Chinner
@ 2023-04-20 17:10     ` Wengang Wang
  2023-04-20 22:54       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-20 17:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote:
>> At log recovery stage, we need to split EFIs with multiple extents. For each
>> orginal multiple-extent EFI, split it into new EFIs each including one extent
>> from the original EFI. By that we avoid deadlock when allocating blocks for
>> AGFL waiting for the held busy extents by current transaction to be flushed.
>> 
>> For the original EFI, the process is
>> 1. Create and log new EFIs each covering one extent from the
>>    original EFI.
>> 2. Don't free extent with the original EFI.
>> 3. Log EFD for the original EFI.
>>    Make sure we log the new EFIs and original EFD in this order:
>>      new EFI 1
>>      new EFI 2
>>      ...
>>      new EFI N
>>      original EFD
>> The original extents are freed with the new EFIs.
> 
> We may not have the log space available during recovery to explode a
> single EFI out into many EFIs like this. The EFI only had enough
> space reserved for processing a single EFI, and exploding a single
> EFI out like this requires an individual log reservation for each
> new EFI. Hence this de-multiplexing process risks running out of log
> space and deadlocking before we've been able to process anything.
> 

Oh, yes, got it.

> Hence the only option we really have here is to replicate how CUIs
> are handled.  We must process the first extent with a whole EFD and
> a new EFI containing the remaining unprocessed extents as defered
> operations.  i.e.
> 
> 1. free the first extent in the original EFI
> 2. log an EFD for the original EFI
> 3. Add all the remaining extents in the original EFI to an xefi chain
> 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from
>   the xefi chain and commit the current transaction.
> 
> xfs_defer_ops_capture_and_commit() will then add a work item to the
> defered list which will come back to the new EFI and process it
> through the normal runtime deferred ops intent processing path.
> 

So you meant this?

Orig EFI with extent1 extent2 extent3
free first extent1
Full EFD to orig EFI
transaction roll,
xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3

If so, I don’t think it’s safe.
Consider that case that kernel panic happened after the transaction roll,
during next log replay, the original EFI has the matching EFD, so this EFI
is ignored, but actually extent2 and extent3 are not freed.

If you didn’t mean above, but instead this:

Orig EFI with extent1 extent2 extent3
free first extent1
New EFI extent2 extent3
Full EFD to orig EFI
transaction roll,
xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3

The problem will comeback to the log space issue, are we ensured we have
the space for the new EFI? 


> The first patch changed that path to only create intents with a
> single extent, so the continued defer ops would then do the right
> thing with that change in place. However, I think that we also need
> the runtime code to process a single extent per intent per commit in
> the same manner as above. i.e. we process the first extent in the
> intent, then relog all the remaining unprocessed extents as a single
> new intent.
> 
> Note that this is similar to how we already relog intents to roll
> them forward in the journal. The only difference for single extent
> processing is that an intent relog duplicates the entire extent list
> in the EFD and the new EFI, whilst what we want is the new EFI to
> contain all the extents except the one we just processed...
> 

The problem to me is that where we place the new EFI, it can’t be after the EFD.
I explained why above.

thanks,
wengang

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-19 23:55   ` Dave Chinner
@ 2023-04-20 17:31     ` Wengang Wang
  2023-04-20 23:22       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-20 17:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote:
>> At IO time, make sure an EFI contains only one extent. Transaction rolling in
>> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we
>> avoid holding busy extents (for previously extents in the same EFI) in current
>> transaction when allocating blocks for AGFL where we could be otherwise stuck
>> waiting the busy extents held by current transaction to be flushed (thus a
>> deadlock).
>> 
>> The log changes
>> 1) before change:
>> 
>>    358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2  len: 48 flags: None
>>    359 EFI  nextents:2 id:ffff9fef708ba940
>>    360 EFI id=ffff9fef708ba940 (0x21, 7)
>>    361 EFI id=ffff9fef708ba940 (0x18, 8)
>>    362 -----------------------------------------------------------------
>>    363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2  len: 48 flags: None
>>    364 EFD  nextents:2 id:ffff9fef708ba940
>>    365 EFD id=ffff9fef708ba940 (0x21, 7)
>>    366 EFD id=ffff9fef708ba940 (0x18, 8)
>> 
>> 2) after change:
>> 
>>    830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f  len: 32 flags: None
>>    831 EFI  nextents:1 id:ffff9fef708b9b80
>>    832 EFI id=ffff9fef708b9b80 (0x21, 7)
>>    833 -----------------------------------------------------------------
>>    834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f  len: 32 flags: None
>>    835 EFI  nextents:1 id:ffff9fef708b9d38
>>    836 EFI id=ffff9fef708b9d38 (0x18, 8)
>>    837 -----------------------------------------------------------------
>>    838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f  len: 32 flags: None
>>    839 EFD  nextents:1 id:ffff9fef708b9b80
>>    840 EFD id=ffff9fef708b9b80 (0x21, 7)
>>    841 -----------------------------------------------------------------
>>    842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f  len: 32 flags: None
>>    843 EFD  nextents:1 id:ffff9fef708b9d38
>>    844 EFD id=ffff9fef708b9d38 (0x18, 8)
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> fs/xfs/xfs_extfree_item.h | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
>> index da6a5afa607c..ae84d77eaf30 100644
>> --- a/fs/xfs/xfs_extfree_item.h
>> +++ b/fs/xfs/xfs_extfree_item.h
>> @@ -13,8 +13,15 @@ struct kmem_cache;
>> 
>> /*
>>  * Max number of extents in fast allocation path.
>> + *
>> + * At IO time, make sure an EFI contains only one extent. Transaction rolling
>> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By
>> + * that we avoid holding busy extents (for previously extents in the same EFI)
>> + * in current transaction when allocating blocks for AGFL where we could be
>> + * otherwise stuck waiting the busy extents held by current transaction to be
>> + * flushed (thus a deadlock).
>>  */
>> -#define XFS_EFI_MAX_FAST_EXTENTS 16
>> +#define XFS_EFI_MAX_FAST_EXTENTS 1
> 
> IIRC, this doesn't have anything to do with the number of extents an
> EFI can hold. All it does is control how the memory for the EFI
> allocated.
> 

Yes, it ensures that one EFI contains at most one extent. And because each
deferred intent goes with one transaction roll, it would solve the AGFL allocation
deadlock (because no busy extents held by the process when it is doing the
AGFL allocation).
And yes, this would requires more log space if the EFI/EFD pair doesn’t appear
in same checkpoint.


> Oh, at some point it got overloaded code to define the max items in
> a defer ops work item. Ok, I now see why you changed this, but I
> don't think this is right way to solve the problem. We can handle
> processing multiple extents per EFI just fine, we just need to
> update the EFD and roll the transaction on each extent we process,
> yes?
> 

I am not quite sure what does “update the EFD” mean.
My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.).

Now consider your “update the EFD”, you meant the following?

EFI:  ID:  THISISID1   extent1 extent2
free extent extent1
EFD: ID: THISISID1  extent1
free extent extent2
another EFD: ID: THISISID1 (same ID as above)  extent2

Do we currently support that? ( I am thinking NO).

thanks,
wengang

> In hindsight, the use of XFS_EFI_MAX_FAST_EXTENTS to limit intent
> size is pretty awful. I also see the same pattern copied in every
> other intent.
> 
> Darrick, if the deferops code has been limiting the number of
> extents in a work item to this value, when will we ever see an
> intent with more extents that .max_items in it? And if that is the
> case, then why wouldn't we consider an intent with more extents than
> we support in log recovery a corruption event?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
  2023-04-20 17:10     ` Wengang Wang
@ 2023-04-20 22:54       ` Dave Chinner
  2023-04-21  0:32         ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2023-04-20 22:54 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Thu, Apr 20, 2023 at 05:10:42PM +0000, Wengang Wang wrote:
> 
> 
> > On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote:
> >> At log recovery stage, we need to split EFIs with multiple extents. For each
> >> orginal multiple-extent EFI, split it into new EFIs each including one extent
> >> from the original EFI. By that we avoid deadlock when allocating blocks for
> >> AGFL waiting for the held busy extents by current transaction to be flushed.
> >> 
> >> For the original EFI, the process is
> >> 1. Create and log new EFIs each covering one extent from the
> >>    original EFI.
> >> 2. Don't free extent with the original EFI.
> >> 3. Log EFD for the original EFI.
> >>    Make sure we log the new EFIs and original EFD in this order:
> >>      new EFI 1
> >>      new EFI 2
> >>      ...
> >>      new EFI N
> >>      original EFD
> >> The original extents are freed with the new EFIs.
> > 
> > We may not have the log space available during recovery to explode a
> > single EFI out into many EFIs like this. The EFI only had enough
> > space reserved for processing a single EFI, and exploding a single
> > EFI out like this requires an individual log reservation for each
> > new EFI. Hence this de-multiplexing process risks running out of log
> > space and deadlocking before we've been able to process anything.
> > 
> 
> Oh, yes, got it.
> 
> > Hence the only option we really have here is to replicate how CUIs
> > are handled.  We must process the first extent with a whole EFD and
> > a new EFI containing the remaining unprocessed extents as defered
> > operations.  i.e.
> > 
> > 1. free the first extent in the original EFI
> > 2. log an EFD for the original EFI
> > 3. Add all the remaining extents in the original EFI to an xefi chain
> > 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from
> >   the xefi chain and commit the current transaction.
> > 
> > xfs_defer_ops_capture_and_commit() will then add a work item to the
> > defered list which will come back to the new EFI and process it
> > through the normal runtime deferred ops intent processing path.
> > 
> 
> So you meant this?
> 
> Orig EFI with extent1 extent2 extent3
> free first extent1
> Full EFD to orig EFI
> transaction roll,
> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3

No. We do not need a transaction roll there if we rebuild a new
xefi list with the remaining extents from the original efi. At that
point, we call:

	<create transaction>
	<free first extent>
	<create new xefi list>
	<create and log EFD>
	xfs_defer_ops_capture_and_commit()
	  xfs_defer_ops_capture()
	    xfs_defer_create_intents()
	      for each tp->t_dfops
	        ->create intent
		  xfs_extent_free_create_intent()
		    create new EFI
		    walk each xefi and add it to the new intent
	    <captures remaining defered work>
	  xfs_trans_commit()

i.e. xfs_defer_ops_capture_and_commit() can builds new EFI for us
from the xefi list as part of defering the work that remains to be
done. Once it has done that, it queues the remaining work and
commits the transaction. Hence all we need to do in recovery of the
first extent is free it, create the xefi list and log the full EFD.
xfs_defer_ops_capture_and_commit() does the rest.

> If so, I don’t think it’s safe.
> Consider that case that kernel panic happened after the transaction roll,
> during next log replay, the original EFI has the matching EFD, so this EFI
> is ignored, but actually extent2 and extent3 are not freed.
> 
> If you didn’t mean above, but instead this:
> 
> Orig EFI with extent1 extent2 extent3
> free first extent1
> New EFI extent2 extent3
> Full EFD to orig EFI
> transaction roll,
> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3
> 
> The problem will comeback to the log space issue, are we ensured we have
> the space for the new EFI? 

Yes, because logging the full EFD cancels the original EFI and so it
no longer consumes log space. hence we can log a new EFI using the
space the original EFI consumed.

> > The first patch changed that path to only create intents with a
> > single extent, so the continued defer ops would then do the right
> > thing with that change in place. However, I think that we also need
> > the runtime code to process a single extent per intent per commit in
> > the same manner as above. i.e. we process the first extent in the
> > intent, then relog all the remaining unprocessed extents as a single
> > new intent.
> > 
> > Note that this is similar to how we already relog intents to roll
> > them forward in the journal. The only difference for single extent
> > processing is that an intent relog duplicates the entire extent list
> > in the EFD and the new EFI, whilst what we want is the new EFI to
> > contain all the extents except the one we just processed...
> > 
> 
> The problem to me is that where we place the new EFI, it can’t be after the EFD.
> I explained why above.

Yes it can. The only thing that matters is that the EFD and new EFI
are committed in the same transaction. Remember: transactions are
atomic change sets - either all the changes in the transaction are
replayed on recovery, or none of them are.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-20 17:31     ` Wengang Wang
@ 2023-04-20 23:22       ` Dave Chinner
  2023-04-21  0:24         ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2023-04-20 23:22 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Thu, Apr 20, 2023 at 05:31:14PM +0000, Wengang Wang wrote:
> 
> 
> > On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote:
> >> At IO time, make sure an EFI contains only one extent. Transaction rolling in
> >> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we
> >> avoid holding busy extents (for previously extents in the same EFI) in current
> >> transaction when allocating blocks for AGFL where we could be otherwise stuck
> >> waiting the busy extents held by current transaction to be flushed (thus a
> >> deadlock).
> >> 
> >> The log changes
> >> 1) before change:
> >> 
> >>    358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2  len: 48 flags: None
> >>    359 EFI  nextents:2 id:ffff9fef708ba940
> >>    360 EFI id=ffff9fef708ba940 (0x21, 7)
> >>    361 EFI id=ffff9fef708ba940 (0x18, 8)
> >>    362 -----------------------------------------------------------------
> >>    363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2  len: 48 flags: None
> >>    364 EFD  nextents:2 id:ffff9fef708ba940
> >>    365 EFD id=ffff9fef708ba940 (0x21, 7)
> >>    366 EFD id=ffff9fef708ba940 (0x18, 8)
> >> 
> >> 2) after change:
> >> 
> >>    830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f  len: 32 flags: None
> >>    831 EFI  nextents:1 id:ffff9fef708b9b80
> >>    832 EFI id=ffff9fef708b9b80 (0x21, 7)
> >>    833 -----------------------------------------------------------------
> >>    834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f  len: 32 flags: None
> >>    835 EFI  nextents:1 id:ffff9fef708b9d38
> >>    836 EFI id=ffff9fef708b9d38 (0x18, 8)
> >>    837 -----------------------------------------------------------------
> >>    838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f  len: 32 flags: None
> >>    839 EFD  nextents:1 id:ffff9fef708b9b80
> >>    840 EFD id=ffff9fef708b9b80 (0x21, 7)
> >>    841 -----------------------------------------------------------------
> >>    842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f  len: 32 flags: None
> >>    843 EFD  nextents:1 id:ffff9fef708b9d38
> >>    844 EFD id=ffff9fef708b9d38 (0x18, 8)
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> fs/xfs/xfs_extfree_item.h | 9 ++++++++-
> >> 1 file changed, 8 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
> >> index da6a5afa607c..ae84d77eaf30 100644
> >> --- a/fs/xfs/xfs_extfree_item.h
> >> +++ b/fs/xfs/xfs_extfree_item.h
> >> @@ -13,8 +13,15 @@ struct kmem_cache;
> >> 
> >> /*
> >>  * Max number of extents in fast allocation path.
> >> + *
> >> + * At IO time, make sure an EFI contains only one extent. Transaction rolling
> >> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By
> >> + * that we avoid holding busy extents (for previously extents in the same EFI)
> >> + * in current transaction when allocating blocks for AGFL where we could be
> >> + * otherwise stuck waiting the busy extents held by current transaction to be
> >> + * flushed (thus a deadlock).
> >>  */
> >> -#define XFS_EFI_MAX_FAST_EXTENTS 16
> >> +#define XFS_EFI_MAX_FAST_EXTENTS 1
> > 
> > IIRC, this doesn't have anything to do with the number of extents an
> > EFI can hold. All it does is control how the memory for the EFI
> > allocated.
> 
> Yes, it ensures that one EFI contains at most one extent. And because each
> deferred intent goes with one transaction roll, it would solve the AGFL allocation
> deadlock (because no busy extents held by the process when it is doing the
> AGFL allocation).
>
> > Oh, at some point it got overloaded code to define the max items in
> > a defer ops work item. Ok, I now see why you changed this, but I
> > don't think this is right way to solve the problem. We can handle
> > processing multiple extents per EFI just fine, we just need to
> > update the EFD and roll the transaction on each extent we process,
> > yes?
> > 
> 
> I am not quite sure what does “update the EFD” mean.

Historical terminology, see below.

> My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.).
> 
> Now consider your “update the EFD”, you meant the following?
> 
> EFI:  ID:  THISISID1   extent1 extent2
> free extent extent1
> EFD: ID: THISISID1  extent1
> free extent extent2
> another EFD: ID: THISISID1 (same ID as above)  extent2

Yes, that's pretty much how multi-extent EFIs used to work, except
the second and subsequent EFDs recorded all the extents that had
been freed.  That way recovery could simply find the EFD with the
highest LSN in the log to determine what part of the EFI had not
been replayed.

We don't do that anymore for partially processed multi-extent
intents anymore. Instead, we use deferred ops to chain updates. i.e.
we log a complete intent done items alongside a new intent
containing the remaining work to be done in the same transaction.
This cancels the original intent and atomically replaces it with a
new intent containing the remaining work to be done.

So when I say "update the EFD" I'm using historic terminology for
processing and recovering multi-extent intents. In modern terms,
what I mean is "update the deferred work intent chain to reflect the
work remaining to be done".

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-20 23:22       ` Dave Chinner
@ 2023-04-21  0:24         ` Wengang Wang
  2023-04-21  9:34           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-21  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Apr 20, 2023 at 05:31:14PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote:
>>>> At IO time, make sure an EFI contains only one extent. Transaction rolling in
>>>> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we
>>>> avoid holding busy extents (for previously extents in the same EFI) in current
>>>> transaction when allocating blocks for AGFL where we could be otherwise stuck
>>>> waiting the busy extents held by current transaction to be flushed (thus a
>>>> deadlock).
>>>> 
>>>> The log changes
>>>> 1) before change:
>>>> 
>>>>   358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2  len: 48 flags: None
>>>>   359 EFI  nextents:2 id:ffff9fef708ba940
>>>>   360 EFI id=ffff9fef708ba940 (0x21, 7)
>>>>   361 EFI id=ffff9fef708ba940 (0x18, 8)
>>>>   362 -----------------------------------------------------------------
>>>>   363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2  len: 48 flags: None
>>>>   364 EFD  nextents:2 id:ffff9fef708ba940
>>>>   365 EFD id=ffff9fef708ba940 (0x21, 7)
>>>>   366 EFD id=ffff9fef708ba940 (0x18, 8)
>>>> 
>>>> 2) after change:
>>>> 
>>>>   830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f  len: 32 flags: None
>>>>   831 EFI  nextents:1 id:ffff9fef708b9b80
>>>>   832 EFI id=ffff9fef708b9b80 (0x21, 7)
>>>>   833 -----------------------------------------------------------------
>>>>   834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f  len: 32 flags: None
>>>>   835 EFI  nextents:1 id:ffff9fef708b9d38
>>>>   836 EFI id=ffff9fef708b9d38 (0x18, 8)
>>>>   837 -----------------------------------------------------------------
>>>>   838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f  len: 32 flags: None
>>>>   839 EFD  nextents:1 id:ffff9fef708b9b80
>>>>   840 EFD id=ffff9fef708b9b80 (0x21, 7)
>>>>   841 -----------------------------------------------------------------
>>>>   842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f  len: 32 flags: None
>>>>   843 EFD  nextents:1 id:ffff9fef708b9d38
>>>>   844 EFD id=ffff9fef708b9d38 (0x18, 8)
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>>> ---
>>>> fs/xfs/xfs_extfree_item.h | 9 ++++++++-
>>>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
>>>> index da6a5afa607c..ae84d77eaf30 100644
>>>> --- a/fs/xfs/xfs_extfree_item.h
>>>> +++ b/fs/xfs/xfs_extfree_item.h
>>>> @@ -13,8 +13,15 @@ struct kmem_cache;
>>>> 
>>>> /*
>>>> * Max number of extents in fast allocation path.
>>>> + *
>>>> + * At IO time, make sure an EFI contains only one extent. Transaction rolling
>>>> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By
>>>> + * that we avoid holding busy extents (for previously extents in the same EFI)
>>>> + * in current transaction when allocating blocks for AGFL where we could be
>>>> + * otherwise stuck waiting the busy extents held by current transaction to be
>>>> + * flushed (thus a deadlock).
>>>> */
>>>> -#define XFS_EFI_MAX_FAST_EXTENTS 16
>>>> +#define XFS_EFI_MAX_FAST_EXTENTS 1
>>> 
>>> IIRC, this doesn't have anything to do with the number of extents an
>>> EFI can hold. All it does is control how the memory for the EFI
>>> allocated.
>> 
>> Yes, it ensures that one EFI contains at most one extent. And because each
>> deferred intent goes with one transaction roll, it would solve the AGFL allocation
>> deadlock (because no busy extents held by the process when it is doing the
>> AGFL allocation).
>> 
>>> Oh, at some point it got overloaded code to define the max items in
>>> a defer ops work item. Ok, I now see why you changed this, but I
>>> don't think this is right way to solve the problem. We can handle
>>> processing multiple extents per EFI just fine, we just need to
>>> update the EFD and roll the transaction on each extent we process,
>>> yes?
>>> 
>> 
>> I am not quite sure what does “update the EFD” mean.
> 
> Historical terminology, see below.
> 
>> My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.).
>> 
>> Now consider your “update the EFD”, you meant the following?
>> 
>> EFI:  ID:  THISISID1   extent1 extent2
>> free extent extent1
>> EFD: ID: THISISID1  extent1
>> free extent extent2
>> another EFD: ID: THISISID1 (same ID as above)  extent2
> 
> Yes, that's pretty much how multi-extent EFIs used to work, except
> the second and subsequent EFDs recorded all the extents that had
> been freed.  That way recovery could simply find the EFD with the
> highest LSN in the log to determine what part of the EFI had not
> been replayed.

Good to know it.

> 
> We don't do that anymore for partially processed multi-extent
> intents anymore. Instead, we use deferred ops to chain updates. i.e.
> we log a complete intent done items alongside a new intent
> containing the remaining work to be done in the same transaction.
> This cancels the original intent and atomically replaces it with a
> new intent containing the remaining work to be done.
> 
> So when I say "update the EFD" I'm using historic terminology for
> processing and recovering multi-extent intents. In modern terms,
> what I mean is "update the deferred work intent chain to reflect the
> work remaining to be done".

OK. so let’s see the difference between your implementation from mine.
Say, there are three extents to free.

My patch would result in:

EFI 1  with extent1
free extent1
EFD 1 with extent1
transaction roll
EFI 2 with extent2
free extent2
EFD 2 with extent2
transaction roll
EFI 3 with extent3
free extent3
EFD3 with extent3
transaction commit

The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.

Your idea yields this:

EFI 1 with extent1 extent2 extent3
free extent1
EFI 2 with extent2 extent3
EFD 1 with extent1 extent2 extent3
transaction commit
create transaction
free extent2
EFI 3 with extent3
EFD 2 with extent extent2 extent3
transaction commit
create transaction
free extent3
EFD 3 with extent3
transaction commit.


Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
So actually it’s still one extent per EFI per transaction. Though you are not changing
XFS_EFI_MAX_FAST_EXTENTS.

And your implementation may use more log space than mine in case the EFI
(and EFD pair) can’t be cancelled.  :D

The only difference if that you use transaction commit and I am using transaction roll
which is not safe as you said. 

Is my understanding correct?

One question is that if only one EFI is safe per transaction, how we ensure that there
is only one EFI per transaction in case there are more than 16
(current XFS_EFI_MAX_FAST_EXTENTS) extents to free in current code?


thanks,
wengang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
  2023-04-20 22:54       ` Dave Chinner
@ 2023-04-21  0:32         ` Wengang Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Wengang Wang @ 2023-04-21  0:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 20, 2023, at 3:54 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Apr 20, 2023 at 05:10:42PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote:
>>>> At log recovery stage, we need to split EFIs with multiple extents. For each
>>>> orginal multiple-extent EFI, split it into new EFIs each including one extent
>>>> from the original EFI. By that we avoid deadlock when allocating blocks for
>>>> AGFL waiting for the held busy extents by current transaction to be flushed.
>>>> 
>>>> For the original EFI, the process is
>>>> 1. Create and log new EFIs each covering one extent from the
>>>>   original EFI.
>>>> 2. Don't free extent with the original EFI.
>>>> 3. Log EFD for the original EFI.
>>>>   Make sure we log the new EFIs and original EFD in this order:
>>>>     new EFI 1
>>>>     new EFI 2
>>>>     ...
>>>>     new EFI N
>>>>     original EFD
>>>> The original extents are freed with the new EFIs.
>>> 
>>> We may not have the log space available during recovery to explode a
>>> single EFI out into many EFIs like this. The EFI only had enough
>>> space reserved for processing a single EFI, and exploding a single
>>> EFI out like this requires an individual log reservation for each
>>> new EFI. Hence this de-multiplexing process risks running out of log
>>> space and deadlocking before we've been able to process anything.
>>> 
>> 
>> Oh, yes, got it.
>> 
>>> Hence the only option we really have here is to replicate how CUIs
>>> are handled.  We must process the first extent with a whole EFD and
>>> a new EFI containing the remaining unprocessed extents as defered
>>> operations.  i.e.
>>> 
>>> 1. free the first extent in the original EFI
>>> 2. log an EFD for the original EFI
>>> 3. Add all the remaining extents in the original EFI to an xefi chain
>>> 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from
>>>  the xefi chain and commit the current transaction.
>>> 
>>> xfs_defer_ops_capture_and_commit() will then add a work item to the
>>> defered list which will come back to the new EFI and process it
>>> through the normal runtime deferred ops intent processing path.
>>> 
>> 
>> So you meant this?
>> 
>> Orig EFI with extent1 extent2 extent3
>> free first extent1
>> Full EFD to orig EFI
>> transaction roll,
>> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3
> 
> No. We do not need a transaction roll there if we rebuild a new
> xefi list with the remaining extents from the original efi. At that
> point, we call:
> 
> <create transaction>
> <free first extent>
> <create new xefi list>
> <create and log EFD>
> xfs_defer_ops_capture_and_commit()
>   xfs_defer_ops_capture()
>     xfs_defer_create_intents()
>       for each tp->t_dfops
>         ->create intent
>   xfs_extent_free_create_intent()
>     create new EFI
>     walk each xefi and add it to the new intent
>     <captures remaining defered work>
>   xfs_trans_commit()
> 
> i.e. xfs_defer_ops_capture_and_commit() can builds new EFI for us
> from the xefi list as part of defering the work that remains to be
> done. Once it has done that, it queues the remaining work and
> commits the transaction. Hence all we need to do in recovery of the
> first extent is free it, create the xefi list and log the full EFD.
> xfs_defer_ops_capture_and_commit() does the rest.

Yes, xfs_defer_ops_capture_and_commit will commit the transaction so we
don’t need a roll.

> 
>> If so, I don’t think it’s safe.
>> Consider that case that kernel panic happened after the transaction roll,
>> during next log replay, the original EFI has the matching EFD, so this EFI
>> is ignored, but actually extent2 and extent3 are not freed.
>> 
>> If you didn’t mean above, but instead this:
>> 
>> Orig EFI with extent1 extent2 extent3
>> free first extent1
>> New EFI extent2 extent3
>> Full EFD to orig EFI
>> transaction roll,
>> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3
>> 
>> The problem will comeback to the log space issue, are we ensured we have
>> the space for the new EFI? 
> 
> Yes, because logging the full EFD cancels the original EFI and so it
> no longer consumes log space. hence we can log a new EFI using the
> space the original EFI consumed.
> 

Make sense.

>>> The first patch changed that path to only create intents with a
>>> single extent, so the continued defer ops would then do the right
>>> thing with that change in place. However, I think that we also need
>>> the runtime code to process a single extent per intent per commit in
>>> the same manner as above. i.e. we process the first extent in the
>>> intent, then relog all the remaining unprocessed extents as a single
>>> new intent.
>>> 
>>> Note that this is similar to how we already relog intents to roll
>>> them forward in the journal. The only difference for single extent
>>> processing is that an intent relog duplicates the entire extent list
>>> in the EFD and the new EFI, whilst what we want is the new EFI to
>>> contain all the extents except the one we just processed...
>>> 
>> 
>> The problem to me is that where we place the new EFI, it can’t be after the EFD.
>> I explained why above.
> 
> Yes it can. The only thing that matters is that the EFD and new EFI
> are committed in the same transaction. Remember: transactions are
> atomic change sets - either all the changes in the transaction are
> replayed on recovery, or none of them are.

Yes, will try this way.

thanks,
wengang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-21  0:24         ` Wengang Wang
@ 2023-04-21  9:34           ` Dave Chinner
  2023-04-21 18:23             ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2023-04-21  9:34 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
> > On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
> > We don't do that anymore for partially processed multi-extent
> > intents anymore. Instead, we use deferred ops to chain updates. i.e.
> > we log a complete intent done items alongside a new intent
> > containing the remaining work to be done in the same transaction.
> > This cancels the original intent and atomically replaces it with a
> > new intent containing the remaining work to be done.
> > 
> > So when I say "update the EFD" I'm using historic terminology for
> > processing and recovering multi-extent intents. In modern terms,
> > what I mean is "update the deferred work intent chain to reflect the
> > work remaining to be done".
> 
> OK. so let’s see the difference between your implementation from mine.
> Say, there are three extents to free.
> 
> My patch would result in:
> 
> EFI 1  with extent1
> free extent1
> EFD 1 with extent1
> transaction roll
> EFI 2 with extent2
> free extent2
> EFD 2 with extent2
> transaction roll
> EFI 3 with extent3
> free extent3
> EFD3 with extent3
> transaction commit

No, it wouldn't. This isn't how the deferred work processes work
items on the transaction. A work item with multiple extents on it
would result in this:

xfs_defer_finish(tp)  # tp contains three xefi work items 
  xfs_defer_finish_noroll
    xfs_defer_create_intents()
      list_for_each_defer_pending
        xfs_defer_create_intent(dfp)
	  ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
	    xfs_extent_free_create_intent()
	      <create EFI>
	      list_for_each_xefi
	        xfs_extent_free_log_item(xefi)
		  <adds extent to current EFI>

  xfs_defer_trans_roll()
    <save>
    xfs_trans_roll()
      xfs_trans_dup()
      xfs_trans_commit()
    <restore>

At this point we have this committed to the CIL

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3

And xfs_defer_finish_noroll() continues with

  <grabs first work item>
  xfs_defer_finish_one(dfp)
    ->create_done(EFI 1)
      xfs_extent_free_create_done
	<create EFD>
    list_for_each_xefi
      ops->finish_item(tp, dfp->dfp_done, li, &state);
        xfs_extent_free_finish_item()
	  xfs_trans_free_extent
	    __xfs_free_extent
	      <adds extent to EFD>

And once the processing of the single work item is done we loop
back to the start of the xfs_defer_finish_noroll() loop. We don't
have any new intents, so xfs_defer_create_intents() returns false,
but we completed a dfp work item, so we run:

  xfs_defer_trans_roll()
    <save>
    xfs_trans_roll()
      xfs_trans_dup()
      xfs_trans_commit()
    <restore>

At this point we have this committed to the CIL

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1

Then we run xfs_defer_finish_one() on EFI 2, commit, then run
xfs_defer_finish_one() on EFI 3. At this point, we have in the log:

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1
<AGF, AGFL, free space btree block mods>
EFD 2 with extent2

But we have not committed the final extent free or EFD 3 - that's in
the last transaction context we pass back to the _xfs_trans_commit()
context for it to finalise and close off the rolling transaction
chain. At this point, we finally have this in the CIL:

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1
<AGF, AGFL, free space btree block mods>
EFD 2 with extent2
<AGF, AGFL, free space btree block mods>
EFD 3 with extent3

> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.

I *always* ignore CIL intent whiteouts when thinking about
transaction chains and intents. That is purely a journal efficiency
optimisation, not something that is necessary for correct operation.

> Your idea yields this:
> 
> EFI 1 with extent1 extent2 extent3
> free extent1
> EFI 2 with extent2 extent3
> EFD 1 with extent1 extent2 extent3
> transaction commit
> create transaction
> free extent2
> EFI 3 with extent3
> EFD 2 with extent extent2 extent3
> transaction commit
> create transaction
> free extent3
> EFD 3 with extent3
> transaction commit.

The EFI/EFD contents are correct, but the rest of it is not - I am
not suggesting open coding transaction rolling like that. Everything
I am suggesting works through the same defer ops mechanism as you
are describing. So if we start with the initial journal contents
looks like this:

EFI 1 with extent1 extent2 extent3.

Then we run xfs_defer_finish_one() on that work, 

  xfs_defer_finish_one(dfp)
    ->create_done(EFI 1)
      xfs_extent_free_create_done
	<create EFD>
    list_for_each_xefi
      ops->finish_item(tp, dfp->dfp_done, li, &state);
        xfs_extent_free_finish_item()
	  xfs_trans_free_extent
	    __xfs_free_extent
	      <adds extent to EFD>

But now we have 3 xefis on the work to process. So on success of
the first call to xfs_trans_free_extent(), we want it to return
-EAGAIN to trigger the existing relogging code to create the new
EFI. How this works is describe in the section "Requesting a
Fresh Transaction while Finishing Deferred Work" in
fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.

The result is that the deferred work infrastructure does the work
of updaing the done intent and creating the new intents for the work
remaining. Hence after the next transaction roll, we have in the CIL

EFI 1 with extent1 extent2 extent3.
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1 extent2 extent3.
EFI 2 with extent2 extent3.

And so the loop goes until there is no more work remaining.

> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
> So actually it’s still one extent per EFI per transaction. Though you are not changing
> XFS_EFI_MAX_FAST_EXTENTS.

The difference is that what I'm suggesting is that the extent free
code can decide when it needs a transaction to be rolled. It isn't
forced to always run a single free per transaction, it can decide
that it can free multiple extents per transaction if there is no
risk of deadlocks (e.g. extents are in different AGs).  Forcing
everything to be limited to a transaction per extent free even when
there is no risk of deadlocks freeing multiple extents in a single
transaction is unnecessary.

Indeed, if the second extent is in a different AG, we don't risk
busy extents causing us issues, so we could do:

EFI 1 with extent1 extent2 extent3.
<AGF 1, AGFL 1, free space btree block mods>
<AGF 2, AGFL 2, free space btree block mods>
EFD 1 with extent1 extent2 extent3.
EFI 2 with extent3.
.....

The difference is that limiting the number of xefi items per
deferred work item means we can only process a single extent per
work item regardless of the current context.

Using the existing defered work "on demand transaction rolling"
mechanism I'm suggesting we use doesn't put any artificial
constrains on how we log and process the intents. This allows us to
aggregate multiple extent frees into a single transaction when there
is no risk associated with doing so and we have sufficient
transaction reservations remaining to guarantee we can free the
extent. It's far more efficient, and the infrastructure we have in
place already supports this sort of functionality....

When we go back to the original problem of the AGFL allocation
needing to roll the transaction to get busy extents processed, then
we could have it return -EAGAIN all the way back up to the defer-ops
level and we simply then roll the transaction and retry the same
extent free operation again. i.e. where extent freeing needs to
block waiting on stuff like busy extents, we could now have these
commit the current transaction, push the current work item to the
back of the current context's queue and come back to it later.

At this point, I think the "single extent per transaction"
constraint that is needed to avoid the busy extent deadlock goes
away, and with it the need for limiting EFI processing to a single
extent goes away too....

> And your implementation may use more log space than mine in case the EFI
> (and EFD pair) can’t be cancelled.  :D

True, but it's really not a concern.  Don't conflate "intent
recovery intent amplification can cause log space deadlocks" with
"intent size is a problem". :)

> The only difference if that you use transaction commit and I am using transaction roll
> which is not safe as you said. 
> 
> Is my understanding correct?

I think I understand where we are misunderstanding each other :)
Perhaps it is now obvious to you as well from the analysis above.
If so, ignore the rest of this :)

What does "transaction roll" actually mean?

XFS has a concept of "rolling transactions". These are made up of a
series of individual transaction contexts that are linked together
by passing a single log reservation ticket between transaction
contexts.

xfs_trans_roll() effectively does:

	ntp = xfs_trans_dup(tp)
	....
	xfs_trans_commit(tp)

	return ntp;

i.e. it duplicates the current transaction handle, takes the unused
block reservation from it, grabs the log reservation ticket from it,
moves the defered ops from the old to the new transaction context,
then commits the old transaction context and returns the new one.

tl;dr: a rolling transaction is a series of linked independent
transactions that share a common set of block and log reservations.

To make a rolling transaction chain an atomic operation on a
specific object (e.g. an inode) that object has to remain locked and
be logged in every transaction in the chain of rolling transactions.
This keeps it moving forward in the log and prevents it from pinning
the tail of the log and causing log space deadlocks.

Fundamentally, though, each individual transaction in a rolling
transaction is an independent atomic change set. So when you say
"roll the transaction", what you are actually saying is "commit the
current transaction and start a new transaction using the
reservations the current transaction already holds."

When I say "transaction commit" I am only refering to the process
that closes off the current transaction. If this is in the middle of
a rolling transaction, then what I am talking about is
xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
the current transaction context.

Transaction contexts running defered operations, intent chains, etc
are *always* rolling transactions, and each time we roll the
transaction we commit it.

IOWS, when I say "transaction commit" and you say "transaction roll"
we are talking about exactly the same thing - transaction commit is
the operation that roll performs to finish off the current change
set...

Ideally, nobody should be calling xfs_trans_roll() directly anymore.
All rolling transactions should be implemented using deferred ops
nowdays - xfs_trans_roll() is the historic method of running rolling
transactions. e.g. see the recent rework of the attribute code to
use deferred ops rather than explicit calls to xfs_trans_roll() so
we can use intents for running attr operations.

These days the transaction model is based around chains of deferred
operations. We attach deferred work to the transaction, and then
when we call xfs_trans_commit() it goes off into the deferred work
infrastructure that creates intents and manages the transaction
context for processing the intents itself.

This is the primary reason we are trying to move away from high
level code using transaction rolling - we can't easily change the
way we process operations when the high level code controls
transaction contexts. Using deferred intent chains gives us fine
grained control over transaction context in the deferred ops
processing code - we can roll to a new transaction whenever we need
to....

> One question is that if only one EFI is safe per transaction, how
> we ensure that there is only one EFI per transaction in case there
> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
> free in current code?

That will handled exactly the same way it does with your change - it
will simply split up the work items so there are multiple intents
logged.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-21  9:34           ` Dave Chinner
@ 2023-04-21 18:23             ` Wengang Wang
  2023-04-22  3:22               ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-21 18:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> We don't do that anymore for partially processed multi-extent
>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>> we log a complete intent done items alongside a new intent
>>> containing the remaining work to be done in the same transaction.
>>> This cancels the original intent and atomically replaces it with a
>>> new intent containing the remaining work to be done.
>>> 
>>> So when I say "update the EFD" I'm using historic terminology for
>>> processing and recovering multi-extent intents. In modern terms,
>>> what I mean is "update the deferred work intent chain to reflect the
>>> work remaining to be done".
>> 
>> OK. so let’s see the difference between your implementation from mine.
>> Say, there are three extents to free.
>> 
>> My patch would result in:
>> 
>> EFI 1  with extent1
>> free extent1
>> EFD 1 with extent1
>> transaction roll
>> EFI 2 with extent2
>> free extent2
>> EFD 2 with extent2
>> transaction roll
>> EFI 3 with extent3
>> free extent3
>> EFD3 with extent3
>> transaction commit
> 
> No, it wouldn't. This isn't how the deferred work processes work
> items on the transaction. A work item with multiple extents on it
> would result in this:
> 
> xfs_defer_finish(tp)  # tp contains three xefi work items 
>  xfs_defer_finish_noroll
>    xfs_defer_create_intents()
>      list_for_each_defer_pending
>        xfs_defer_create_intent(dfp)
>  ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>    xfs_extent_free_create_intent()
>      <create EFI>
>      list_for_each_xefi
>        xfs_extent_free_log_item(xefi)
>  <adds extent to current EFI>
> 
>  xfs_defer_trans_roll()
>    <save>
>    xfs_trans_roll()
>      xfs_trans_dup()
>      xfs_trans_commit()
>    <restore>
> 
> At this point we have this committed to the CIL
> 
> EFI 1 with extent1
> EFI 2 with extent2
> EFI 3 with extent3
> 
> And xfs_defer_finish_noroll() continues with
> 
>  <grabs first work item>
>  xfs_defer_finish_one(dfp)
>    ->create_done(EFI 1)
>      xfs_extent_free_create_done
> <create EFD>
>    list_for_each_xefi
>      ops->finish_item(tp, dfp->dfp_done, li, &state);
>        xfs_extent_free_finish_item()
>  xfs_trans_free_extent
>    __xfs_free_extent
>      <adds extent to EFD>
> 
> And once the processing of the single work item is done we loop
> back to the start of the xfs_defer_finish_noroll() loop. We don't
> have any new intents, so xfs_defer_create_intents() returns false,
> but we completed a dfp work item, so we run:
> 
>  xfs_defer_trans_roll()
>    <save>
>    xfs_trans_roll()
>      xfs_trans_dup()
>      xfs_trans_commit()
>    <restore>
> 
> At this point we have this committed to the CIL
> 
> EFI 1 with extent1
> EFI 2 with extent2
> EFI 3 with extent3
> <AGF, AGFL, free space btree block mods>
> EFD 1 with extent1
> 
> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
> 
> EFI 1 with extent1
> EFI 2 with extent2
> EFI 3 with extent3
> <AGF, AGFL, free space btree block mods>
> EFD 1 with extent1
> <AGF, AGFL, free space btree block mods>
> EFD 2 with extent2
> 
> But we have not committed the final extent free or EFD 3 - that's in
> the last transaction context we pass back to the _xfs_trans_commit()
> context for it to finalise and close off the rolling transaction
> chain. At this point, we finally have this in the CIL:
> 
> EFI 1 with extent1
> EFI 2 with extent2
> EFI 3 with extent3
> <AGF, AGFL, free space btree block mods>
> EFD 1 with extent1
> <AGF, AGFL, free space btree block mods>
> EFD 2 with extent2
> <AGF, AGFL, free space btree block mods>
> EFD 3 with extent3


Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
thinking it only create intent for the first in tp->t_dfops.

> 
>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
> 
> I *always* ignore CIL intent whiteouts when thinking about
> transaction chains and intents. That is purely a journal efficiency
> optimisation, not something that is necessary for correct operation.

OK.

> 
>> Your idea yields this:
>> 
>> EFI 1 with extent1 extent2 extent3
>> free extent1
>> EFI 2 with extent2 extent3
>> EFD 1 with extent1 extent2 extent3
>> transaction commit
>> create transaction
>> free extent2
>> EFI 3 with extent3
>> EFD 2 with extent extent2 extent3
>> transaction commit
>> create transaction
>> free extent3
>> EFD 3 with extent3
>> transaction commit.
> 
> The EFI/EFD contents are correct, but the rest of it is not - I am
> not suggesting open coding transaction rolling like that. Everything
> I am suggesting works through the same defer ops mechanism as you
> are describing. So if we start with the initial journal contents
> looks like this:
> 
> EFI 1 with extent1 extent2 extent3.
> 
> Then we run xfs_defer_finish_one() on that work, 
> 
>  xfs_defer_finish_one(dfp)
>    ->create_done(EFI 1)
>      xfs_extent_free_create_done
> <create EFD>
>    list_for_each_xefi
>      ops->finish_item(tp, dfp->dfp_done, li, &state);
>        xfs_extent_free_finish_item()
>  xfs_trans_free_extent
>    __xfs_free_extent
>      <adds extent to EFD>
> 
> But now we have 3 xefis on the work to process. So on success of
> the first call to xfs_trans_free_extent(), we want it to return
> -EAGAIN to trigger the existing relogging code to create the new
> EFI. How this works is describe in the section "Requesting a
> Fresh Transaction while Finishing Deferred Work" in
> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
> 
> The result is that the deferred work infrastructure does the work
> of updaing the done intent and creating the new intents for the work
> remaining. Hence after the next transaction roll, we have in the CIL
> 
> EFI 1 with extent1 extent2 extent3.
> <AGF, AGFL, free space btree block mods>
> EFD 1 with extent1 extent2 extent3.
> EFI 2 with extent2 extent3.
> 

Taking transaction rolls into account (also adding up to EFD3), above would be:

EFI 1 with extent1 extent2 extent3.
transaction roll
<AGF, AGFL, free space btree block mods>  for extent 1
EFD 1 with extent1 extent2 extent3.
EFI 2 with extent2 extent3.
transaction roll
free extent 2
EFD 2 with extent2 extent3
EFI 3 with extent3
transaction roll
free extent 3
EFD 3 with extent3

Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
I got it.

> And so the loop goes until there is no more work remaining.
> 
>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>> XFS_EFI_MAX_FAST_EXTENTS.
> 
> The difference is that what I'm suggesting is that the extent free
> code can decide when it needs a transaction to be rolled. It isn't
> forced to always run a single free per transaction, it can decide
> that it can free multiple extents per transaction if there is no
> risk of deadlocks (e.g. extents are in different AGs).  Forcing
> everything to be limited to a transaction per extent free even when
> there is no risk of deadlocks freeing multiple extents in a single
> transaction is unnecessary.
> 
> Indeed, if the second extent is in a different AG, we don't risk
> busy extents causing us issues, so we could do:
> 
> EFI 1 with extent1 extent2 extent3.
> <AGF 1, AGFL 1, free space btree block mods>
> <AGF 2, AGFL 2, free space btree block mods>
> EFD 1 with extent1 extent2 extent3.
> EFI 2 with extent3.
> .....
> 

My thumb is up.

> The difference is that limiting the number of xefi items per
> deferred work item means we can only process a single extent per
> work item regardless of the current context.
> 
> Using the existing defered work "on demand transaction rolling"
> mechanism I'm suggesting we use doesn't put any artificial
> constrains on how we log and process the intents. This allows us to
> aggregate multiple extent frees into a single transaction when there
> is no risk associated with doing so and we have sufficient
> transaction reservations remaining to guarantee we can free the
> extent. It's far more efficient, and the infrastructure we have in
> place already supports this sort of functionality....
> 
> When we go back to the original problem of the AGFL allocation
> needing to roll the transaction to get busy extents processed, then
> we could have it return -EAGAIN all the way back up to the defer-ops
> level and we simply then roll the transaction and retry the same
> extent free operation again. i.e. where extent freeing needs to
> block waiting on stuff like busy extents, we could now have these
> commit the current transaction, push the current work item to the
> back of the current context's queue and come back to it later.
> 
> At this point, I think the "single extent per transaction"
> constraint that is needed to avoid the busy extent deadlock goes
> away, and with it the need for limiting EFI processing to a single
> extent goes away too....

I am pretty clear now.
> 
>> And your implementation may use more log space than mine in case the EFI
>> (and EFD pair) can’t be cancelled.  :D
> 
> True, but it's really not a concern.  Don't conflate "intent
> recovery intent amplification can cause log space deadlocks" with
> "intent size is a problem". :)
> 

Got it.
>> The only difference if that you use transaction commit and I am using transaction roll
>> which is not safe as you said. 
>> 
>> Is my understanding correct?
> 
> I think I understand where we are misunderstanding each other :)
> Perhaps it is now obvious to you as well from the analysis above.
> If so, ignore the rest of this :)
> 
> What does "transaction roll" actually mean?
> 
> XFS has a concept of "rolling transactions". These are made up of a
> series of individual transaction contexts that are linked together
> by passing a single log reservation ticket between transaction
> contexts.
> 
> xfs_trans_roll() effectively does:
> 
> ntp = xfs_trans_dup(tp)
> ....
> xfs_trans_commit(tp)
> 
> return ntp;
> 
> i.e. it duplicates the current transaction handle, takes the unused
> block reservation from it, grabs the log reservation ticket from it,
> moves the defered ops from the old to the new transaction context,
> then commits the old transaction context and returns the new one.
> 
> tl;dr: a rolling transaction is a series of linked independent
> transactions that share a common set of block and log reservations.
> 
> To make a rolling transaction chain an atomic operation on a
> specific object (e.g. an inode) that object has to remain locked and
> be logged in every transaction in the chain of rolling transactions.
> This keeps it moving forward in the log and prevents it from pinning
> the tail of the log and causing log space deadlocks.
> 
> Fundamentally, though, each individual transaction in a rolling
> transaction is an independent atomic change set. So when you say
> "roll the transaction", what you are actually saying is "commit the
> current transaction and start a new transaction using the
> reservations the current transaction already holds."
> 
> When I say "transaction commit" I am only refering to the process
> that closes off the current transaction. If this is in the middle of
> a rolling transaction, then what I am talking about is
> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
> the current transaction context.
> 
> Transaction contexts running defered operations, intent chains, etc
> are *always* rolling transactions, and each time we roll the
> transaction we commit it.
> 
> IOWS, when I say "transaction commit" and you say "transaction roll"
> we are talking about exactly the same thing - transaction commit is
> the operation that roll performs to finish off the current change
> set...
> 
> Ideally, nobody should be calling xfs_trans_roll() directly anymore.
> All rolling transactions should be implemented using deferred ops
> nowdays - xfs_trans_roll() is the historic method of running rolling
> transactions. e.g. see the recent rework of the attribute code to
> use deferred ops rather than explicit calls to xfs_trans_roll() so
> we can use intents for running attr operations.
> 
> These days the transaction model is based around chains of deferred
> operations. We attach deferred work to the transaction, and then
> when we call xfs_trans_commit() it goes off into the deferred work
> infrastructure that creates intents and manages the transaction
> context for processing the intents itself.
> 
> This is the primary reason we are trying to move away from high
> level code using transaction rolling - we can't easily change the
> way we process operations when the high level code controls
> transaction contexts. Using deferred intent chains gives us fine
> grained control over transaction context in the deferred ops
> processing code - we can roll to a new transaction whenever we need
> to....
> 

Above is really helpful.
I when I mention transaction roll, I meant
1) a new transaction is created, but its self does’t reserve any resource.
Instead, it inherits all the unused resources from the old transaction.
2) commit the old transaction.
3) don’t un-reserve unused resources from old transaction, the new transaction
will inherits them.
4) use the new transaction for further log items.

As I understand, the key is that the new transaction its self doesn’t reserve new resources.

And when I read your “transaction commit” I understood it as this:
1). commit old transaction
2) un-reserve unused resources from old transaction
3) create new transaction with all resources reserved.

Thus in my understand your “transaction commit” would have no risk of lack of log space to put
the extra EFI/EFDs.  But reading above, I’d think you meant “transaction roll” actually.

>> One question is that if only one EFI is safe per transaction, how
>> we ensure that there is only one EFI per transaction in case there
>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
>> free in current code?
> 
> That will handled exactly the same way it does with your change - it
> will simply split up the work items so there are multiple intents
> logged.

I’d like to make it more clear. You were saying that during log recover stage, there may be no enough
log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage),
seems you have no concern logging more EFI/EFDs.  So why they are different?

thanks,
wengang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-21 18:23             ` Wengang Wang
@ 2023-04-22  3:22               ` Wengang Wang
  2023-04-24 15:53                 ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-22  3:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> We don't do that anymore for partially processed multi-extent
>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>> we log a complete intent done items alongside a new intent
>>>> containing the remaining work to be done in the same transaction.
>>>> This cancels the original intent and atomically replaces it with a
>>>> new intent containing the remaining work to be done.
>>>> 
>>>> So when I say "update the EFD" I'm using historic terminology for
>>>> processing and recovering multi-extent intents. In modern terms,
>>>> what I mean is "update the deferred work intent chain to reflect the
>>>> work remaining to be done".
>>> 
>>> OK. so let’s see the difference between your implementation from mine.
>>> Say, there are three extents to free.
>>> 
>>> My patch would result in:
>>> 
>>> EFI 1  with extent1
>>> free extent1
>>> EFD 1 with extent1
>>> transaction roll
>>> EFI 2 with extent2
>>> free extent2
>>> EFD 2 with extent2
>>> transaction roll
>>> EFI 3 with extent3
>>> free extent3
>>> EFD3 with extent3
>>> transaction commit
>> 
>> No, it wouldn't. This isn't how the deferred work processes work
>> items on the transaction. A work item with multiple extents on it
>> would result in this:
>> 
>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>> xfs_defer_finish_noroll
>>   xfs_defer_create_intents()
>>     list_for_each_defer_pending
>>       xfs_defer_create_intent(dfp)
>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>   xfs_extent_free_create_intent()
>>     <create EFI>
>>     list_for_each_xefi
>>       xfs_extent_free_log_item(xefi)
>> <adds extent to current EFI>
>> 
>> xfs_defer_trans_roll()
>>   <save>
>>   xfs_trans_roll()
>>     xfs_trans_dup()
>>     xfs_trans_commit()
>>   <restore>
>> 
>> At this point we have this committed to the CIL
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> 
>> And xfs_defer_finish_noroll() continues with
>> 
>> <grabs first work item>
>> xfs_defer_finish_one(dfp)
>>   ->create_done(EFI 1)
>>     xfs_extent_free_create_done
>> <create EFD>
>>   list_for_each_xefi
>>     ops->finish_item(tp, dfp->dfp_done, li, &state);
>>       xfs_extent_free_finish_item()
>> xfs_trans_free_extent
>>   __xfs_free_extent
>>     <adds extent to EFD>
>> 
>> And once the processing of the single work item is done we loop
>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>> have any new intents, so xfs_defer_create_intents() returns false,
>> but we completed a dfp work item, so we run:
>> 
>> xfs_defer_trans_roll()
>>   <save>
>>   xfs_trans_roll()
>>     xfs_trans_dup()
>>     xfs_trans_commit()
>>   <restore>
>> 
>> At this point we have this committed to the CIL
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> 
>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> <AGF, AGFL, free space btree block mods>
>> EFD 2 with extent2
>> 
>> But we have not committed the final extent free or EFD 3 - that's in
>> the last transaction context we pass back to the _xfs_trans_commit()
>> context for it to finalise and close off the rolling transaction
>> chain. At this point, we finally have this in the CIL:
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> <AGF, AGFL, free space btree block mods>
>> EFD 2 with extent2
>> <AGF, AGFL, free space btree block mods>
>> EFD 3 with extent3
> 
> 
> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
> thinking it only create intent for the first in tp->t_dfops.
> 
>> 
>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>> 
>> I *always* ignore CIL intent whiteouts when thinking about
>> transaction chains and intents. That is purely a journal efficiency
>> optimisation, not something that is necessary for correct operation.
> 
> OK.
> 
>> 
>>> Your idea yields this:
>>> 
>>> EFI 1 with extent1 extent2 extent3
>>> free extent1
>>> EFI 2 with extent2 extent3
>>> EFD 1 with extent1 extent2 extent3
>>> transaction commit
>>> create transaction
>>> free extent2
>>> EFI 3 with extent3
>>> EFD 2 with extent extent2 extent3
>>> transaction commit
>>> create transaction
>>> free extent3
>>> EFD 3 with extent3
>>> transaction commit.
>> 
>> The EFI/EFD contents are correct, but the rest of it is not - I am
>> not suggesting open coding transaction rolling like that. Everything
>> I am suggesting works through the same defer ops mechanism as you
>> are describing. So if we start with the initial journal contents
>> looks like this:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> 
>> Then we run xfs_defer_finish_one() on that work, 
>> 
>> xfs_defer_finish_one(dfp)
>>   ->create_done(EFI 1)
>>     xfs_extent_free_create_done
>> <create EFD>
>>   list_for_each_xefi
>>     ops->finish_item(tp, dfp->dfp_done, li, &state);
>>       xfs_extent_free_finish_item()
>> xfs_trans_free_extent
>>   __xfs_free_extent
>>     <adds extent to EFD>
>> 
>> But now we have 3 xefis on the work to process. So on success of
>> the first call to xfs_trans_free_extent(), we want it to return
>> -EAGAIN to trigger the existing relogging code to create the new
>> EFI. How this works is describe in the section "Requesting a
>> Fresh Transaction while Finishing Deferred Work" in
>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>> 
>> The result is that the deferred work infrastructure does the work
>> of updaing the done intent and creating the new intents for the work
>> remaining. Hence after the next transaction roll, we have in the CIL
>> 
>> EFI 1 with extent1 extent2 extent3.
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent2 extent3.
>> 
> 
> Taking transaction rolls into account (also adding up to EFD3), above would be:
> 
> EFI 1 with extent1 extent2 extent3.
> transaction roll
> <AGF, AGFL, free space btree block mods>  for extent 1
> EFD 1 with extent1 extent2 extent3.
> EFI 2 with extent2 extent3.
> transaction roll
> free extent 2
> EFD 2 with extent2 extent3
> EFI 3 with extent3
> transaction roll
> free extent 3
> EFD 3 with extent3
> 
> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
> I got it.
> 
>> And so the loop goes until there is no more work remaining.
>> 
>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>> XFS_EFI_MAX_FAST_EXTENTS.
>> 
>> The difference is that what I'm suggesting is that the extent free
>> code can decide when it needs a transaction to be rolled. It isn't
>> forced to always run a single free per transaction, it can decide
>> that it can free multiple extents per transaction if there is no
>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>> everything to be limited to a transaction per extent free even when
>> there is no risk of deadlocks freeing multiple extents in a single
>> transaction is unnecessary.
>> 
>> Indeed, if the second extent is in a different AG, we don't risk
>> busy extents causing us issues, so we could do:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> <AGF 1, AGFL 1, free space btree block mods>
>> <AGF 2, AGFL 2, free space btree block mods>
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent3.
>> .....
>> 
> 
> My thumb is up.

Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?

Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
are stored there. For those AGs
they have pending busy extents in in-memory transactions (not committed to CIL).
We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
if it’s not there, and continue to free the extent. Otherwise if the AG number is already
there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
the busy extents are committed to the xfs_cil_ctx.
Well, wondering if the pending busy extents would remain in pending state long before
committed to CIL and that become a bottle neck for freeing extents.  

thanks,
wengang

> 
>> The difference is that limiting the number of xefi items per
>> deferred work item means we can only process a single extent per
>> work item regardless of the current context.
>> 
>> Using the existing defered work "on demand transaction rolling"
>> mechanism I'm suggesting we use doesn't put any artificial
>> constrains on how we log and process the intents. This allows us to
>> aggregate multiple extent frees into a single transaction when there
>> is no risk associated with doing so and we have sufficient
>> transaction reservations remaining to guarantee we can free the
>> extent. It's far more efficient, and the infrastructure we have in
>> place already supports this sort of functionality....
>> 
>> When we go back to the original problem of the AGFL allocation
>> needing to roll the transaction to get busy extents processed, then
>> we could have it return -EAGAIN all the way back up to the defer-ops
>> level and we simply then roll the transaction and retry the same
>> extent free operation again. i.e. where extent freeing needs to
>> block waiting on stuff like busy extents, we could now have these
>> commit the current transaction, push the current work item to the
>> back of the current context's queue and come back to it later.
>> 
>> At this point, I think the "single extent per transaction"
>> constraint that is needed to avoid the busy extent deadlock goes
>> away, and with it the need for limiting EFI processing to a single
>> extent goes away too....
> 
> I am pretty clear now.
>> 
>>> And your implementation may use more log space than mine in case the EFI
>>> (and EFD pair) can’t be cancelled.  :D
>> 
>> True, but it's really not a concern.  Don't conflate "intent
>> recovery intent amplification can cause log space deadlocks" with
>> "intent size is a problem". :)
>> 
> 
> Got it.
>>> The only difference if that you use transaction commit and I am using transaction roll
>>> which is not safe as you said. 
>>> 
>>> Is my understanding correct?
>> 
>> I think I understand where we are misunderstanding each other :)
>> Perhaps it is now obvious to you as well from the analysis above.
>> If so, ignore the rest of this :)
>> 
>> What does "transaction roll" actually mean?
>> 
>> XFS has a concept of "rolling transactions". These are made up of a
>> series of individual transaction contexts that are linked together
>> by passing a single log reservation ticket between transaction
>> contexts.
>> 
>> xfs_trans_roll() effectively does:
>> 
>> ntp = xfs_trans_dup(tp)
>> ....
>> xfs_trans_commit(tp)
>> 
>> return ntp;
>> 
>> i.e. it duplicates the current transaction handle, takes the unused
>> block reservation from it, grabs the log reservation ticket from it,
>> moves the defered ops from the old to the new transaction context,
>> then commits the old transaction context and returns the new one.
>> 
>> tl;dr: a rolling transaction is a series of linked independent
>> transactions that share a common set of block and log reservations.
>> 
>> To make a rolling transaction chain an atomic operation on a
>> specific object (e.g. an inode) that object has to remain locked and
>> be logged in every transaction in the chain of rolling transactions.
>> This keeps it moving forward in the log and prevents it from pinning
>> the tail of the log and causing log space deadlocks.
>> 
>> Fundamentally, though, each individual transaction in a rolling
>> transaction is an independent atomic change set. So when you say
>> "roll the transaction", what you are actually saying is "commit the
>> current transaction and start a new transaction using the
>> reservations the current transaction already holds."
>> 
>> When I say "transaction commit" I am only refering to the process
>> that closes off the current transaction. If this is in the middle of
>> a rolling transaction, then what I am talking about is
>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
>> the current transaction context.
>> 
>> Transaction contexts running defered operations, intent chains, etc
>> are *always* rolling transactions, and each time we roll the
>> transaction we commit it.
>> 
>> IOWS, when I say "transaction commit" and you say "transaction roll"
>> we are talking about exactly the same thing - transaction commit is
>> the operation that roll performs to finish off the current change
>> set...
>> 
>> Ideally, nobody should be calling xfs_trans_roll() directly anymore.
>> All rolling transactions should be implemented using deferred ops
>> nowdays - xfs_trans_roll() is the historic method of running rolling
>> transactions. e.g. see the recent rework of the attribute code to
>> use deferred ops rather than explicit calls to xfs_trans_roll() so
>> we can use intents for running attr operations.
>> 
>> These days the transaction model is based around chains of deferred
>> operations. We attach deferred work to the transaction, and then
>> when we call xfs_trans_commit() it goes off into the deferred work
>> infrastructure that creates intents and manages the transaction
>> context for processing the intents itself.
>> 
>> This is the primary reason we are trying to move away from high
>> level code using transaction rolling - we can't easily change the
>> way we process operations when the high level code controls
>> transaction contexts. Using deferred intent chains gives us fine
>> grained control over transaction context in the deferred ops
>> processing code - we can roll to a new transaction whenever we need
>> to....
>> 
> 
> Above is really helpful.
> I when I mention transaction roll, I meant
> 1) a new transaction is created, but its self does’t reserve any resource.
> Instead, it inherits all the unused resources from the old transaction.
> 2) commit the old transaction.
> 3) don’t un-reserve unused resources from old transaction, the new transaction
> will inherits them.
> 4) use the new transaction for further log items.
> 
> As I understand, the key is that the new transaction its self doesn’t reserve new resources.
> 
> And when I read your “transaction commit” I understood it as this:
> 1). commit old transaction
> 2) un-reserve unused resources from old transaction
> 3) create new transaction with all resources reserved.
> 
> Thus in my understand your “transaction commit” would have no risk of lack of log space to put
> the extra EFI/EFDs.  But reading above, I’d think you meant “transaction roll” actually.
> 
>>> One question is that if only one EFI is safe per transaction, how
>>> we ensure that there is only one EFI per transaction in case there
>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
>>> free in current code?
>> 
>> That will handled exactly the same way it does with your change - it
>> will simply split up the work items so there are multiple intents
>> logged.
> 
> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough
> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage),
> seems you have no concern logging more EFI/EFDs.  So why they are different?
> 
> thanks,
> wengang



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-22  3:22               ` Wengang Wang
@ 2023-04-24 15:53                 ` Wengang Wang
  2023-04-24 22:52                   ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2023-04-24 15:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> We don't do that anymore for partially processed multi-extent
>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>> we log a complete intent done items alongside a new intent
>>>>> containing the remaining work to be done in the same transaction.
>>>>> This cancels the original intent and atomically replaces it with a
>>>>> new intent containing the remaining work to be done.
>>>>> 
>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>> work remaining to be done".
>>>> 
>>>> OK. so let’s see the difference between your implementation from mine.
>>>> Say, there are three extents to free.
>>>> 
>>>> My patch would result in:
>>>> 
>>>> EFI 1  with extent1
>>>> free extent1
>>>> EFD 1 with extent1
>>>> transaction roll
>>>> EFI 2 with extent2
>>>> free extent2
>>>> EFD 2 with extent2
>>>> transaction roll
>>>> EFI 3 with extent3
>>>> free extent3
>>>> EFD3 with extent3
>>>> transaction commit
>>> 
>>> No, it wouldn't. This isn't how the deferred work processes work
>>> items on the transaction. A work item with multiple extents on it
>>> would result in this:
>>> 
>>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>>> xfs_defer_finish_noroll
>>>  xfs_defer_create_intents()
>>>    list_for_each_defer_pending
>>>      xfs_defer_create_intent(dfp)
>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>  xfs_extent_free_create_intent()
>>>    <create EFI>
>>>    list_for_each_xefi
>>>      xfs_extent_free_log_item(xefi)
>>> <adds extent to current EFI>
>>> 
>>> xfs_defer_trans_roll()
>>>  <save>
>>>  xfs_trans_roll()
>>>    xfs_trans_dup()
>>>    xfs_trans_commit()
>>>  <restore>
>>> 
>>> At this point we have this committed to the CIL
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> 
>>> And xfs_defer_finish_noroll() continues with
>>> 
>>> <grabs first work item>
>>> xfs_defer_finish_one(dfp)
>>>  ->create_done(EFI 1)
>>>    xfs_extent_free_create_done
>>> <create EFD>
>>>  list_for_each_xefi
>>>    ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>      xfs_extent_free_finish_item()
>>> xfs_trans_free_extent
>>>  __xfs_free_extent
>>>    <adds extent to EFD>
>>> 
>>> And once the processing of the single work item is done we loop
>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>> have any new intents, so xfs_defer_create_intents() returns false,
>>> but we completed a dfp work item, so we run:
>>> 
>>> xfs_defer_trans_roll()
>>>  <save>
>>>  xfs_trans_roll()
>>>    xfs_trans_dup()
>>>    xfs_trans_commit()
>>>  <restore>
>>> 
>>> At this point we have this committed to the CIL
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> 
>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 2 with extent2
>>> 
>>> But we have not committed the final extent free or EFD 3 - that's in
>>> the last transaction context we pass back to the _xfs_trans_commit()
>>> context for it to finalise and close off the rolling transaction
>>> chain. At this point, we finally have this in the CIL:
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 2 with extent2
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 3 with extent3
>> 
>> 
>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>> thinking it only create intent for the first in tp->t_dfops.
>> 
>>> 
>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>> 
>>> I *always* ignore CIL intent whiteouts when thinking about
>>> transaction chains and intents. That is purely a journal efficiency
>>> optimisation, not something that is necessary for correct operation.
>> 
>> OK.
>> 
>>> 
>>>> Your idea yields this:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3
>>>> free extent1
>>>> EFI 2 with extent2 extent3
>>>> EFD 1 with extent1 extent2 extent3
>>>> transaction commit
>>>> create transaction
>>>> free extent2
>>>> EFI 3 with extent3
>>>> EFD 2 with extent extent2 extent3
>>>> transaction commit
>>>> create transaction
>>>> free extent3
>>>> EFD 3 with extent3
>>>> transaction commit.
>>> 
>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>> not suggesting open coding transaction rolling like that. Everything
>>> I am suggesting works through the same defer ops mechanism as you
>>> are describing. So if we start with the initial journal contents
>>> looks like this:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> 
>>> Then we run xfs_defer_finish_one() on that work, 
>>> 
>>> xfs_defer_finish_one(dfp)
>>>  ->create_done(EFI 1)
>>>    xfs_extent_free_create_done
>>> <create EFD>
>>>  list_for_each_xefi
>>>    ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>      xfs_extent_free_finish_item()
>>> xfs_trans_free_extent
>>>  __xfs_free_extent
>>>    <adds extent to EFD>
>>> 
>>> But now we have 3 xefis on the work to process. So on success of
>>> the first call to xfs_trans_free_extent(), we want it to return
>>> -EAGAIN to trigger the existing relogging code to create the new
>>> EFI. How this works is describe in the section "Requesting a
>>> Fresh Transaction while Finishing Deferred Work" in
>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>> 
>>> The result is that the deferred work infrastructure does the work
>>> of updaing the done intent and creating the new intents for the work
>>> remaining. Hence after the next transaction roll, we have in the CIL
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> 
>> 
>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> transaction roll
>> <AGF, AGFL, free space btree block mods>  for extent 1
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent2 extent3.
>> transaction roll
>> free extent 2
>> EFD 2 with extent2 extent3
>> EFI 3 with extent3
>> transaction roll
>> free extent 3
>> EFD 3 with extent3
>> 
>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>> I got it.
>> 
>>> And so the loop goes until there is no more work remaining.
>>> 
>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>> 
>>> The difference is that what I'm suggesting is that the extent free
>>> code can decide when it needs a transaction to be rolled. It isn't
>>> forced to always run a single free per transaction, it can decide
>>> that it can free multiple extents per transaction if there is no
>>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>>> everything to be limited to a transaction per extent free even when
>>> there is no risk of deadlocks freeing multiple extents in a single
>>> transaction is unnecessary.
>>> 
>>> Indeed, if the second extent is in a different AG, we don't risk
>>> busy extents causing us issues, so we could do:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> <AGF 1, AGFL 1, free space btree block mods>
>>> <AGF 2, AGFL 2, free space btree block mods>
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent3.
>>> .....
>>> 
>> 
>> My thumb is up.
> 
> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
> 
> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
> are stored there. For those AGs
> they have pending busy extents in in-memory transactions (not committed to CIL).
> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
> the busy extents are committed to the xfs_cil_ctx.
> Well, wondering if the pending busy extents would remain in pending state long before
> committed to CIL and that become a bottle neck for freeing extents.  

Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
is always OK to go, there shouldn’t be a bottle neck.

thanks,
wengang

> 
> thanks,
> wengang
> 
>> 
>>> The difference is that limiting the number of xefi items per
>>> deferred work item means we can only process a single extent per
>>> work item regardless of the current context.
>>> 
>>> Using the existing defered work "on demand transaction rolling"
>>> mechanism I'm suggesting we use doesn't put any artificial
>>> constrains on how we log and process the intents. This allows us to
>>> aggregate multiple extent frees into a single transaction when there
>>> is no risk associated with doing so and we have sufficient
>>> transaction reservations remaining to guarantee we can free the
>>> extent. It's far more efficient, and the infrastructure we have in
>>> place already supports this sort of functionality....
>>> 
>>> When we go back to the original problem of the AGFL allocation
>>> needing to roll the transaction to get busy extents processed, then
>>> we could have it return -EAGAIN all the way back up to the defer-ops
>>> level and we simply then roll the transaction and retry the same
>>> extent free operation again. i.e. where extent freeing needs to
>>> block waiting on stuff like busy extents, we could now have these
>>> commit the current transaction, push the current work item to the
>>> back of the current context's queue and come back to it later.
>>> 
>>> At this point, I think the "single extent per transaction"
>>> constraint that is needed to avoid the busy extent deadlock goes
>>> away, and with it the need for limiting EFI processing to a single
>>> extent goes away too....
>> 
>> I am pretty clear now.
>>> 
>>>> And your implementation may use more log space than mine in case the EFI
>>>> (and EFD pair) can’t be cancelled.  :D
>>> 
>>> True, but it's really not a concern.  Don't conflate "intent
>>> recovery intent amplification can cause log space deadlocks" with
>>> "intent size is a problem". :)
>>> 
>> 
>> Got it.
>>>> The only difference if that you use transaction commit and I am using transaction roll
>>>> which is not safe as you said. 
>>>> 
>>>> Is my understanding correct?
>>> 
>>> I think I understand where we are misunderstanding each other :)
>>> Perhaps it is now obvious to you as well from the analysis above.
>>> If so, ignore the rest of this :)
>>> 
>>> What does "transaction roll" actually mean?
>>> 
>>> XFS has a concept of "rolling transactions". These are made up of a
>>> series of individual transaction contexts that are linked together
>>> by passing a single log reservation ticket between transaction
>>> contexts.
>>> 
>>> xfs_trans_roll() effectively does:
>>> 
>>> ntp = xfs_trans_dup(tp)
>>> ....
>>> xfs_trans_commit(tp)
>>> 
>>> return ntp;
>>> 
>>> i.e. it duplicates the current transaction handle, takes the unused
>>> block reservation from it, grabs the log reservation ticket from it,
>>> moves the defered ops from the old to the new transaction context,
>>> then commits the old transaction context and returns the new one.
>>> 
>>> tl;dr: a rolling transaction is a series of linked independent
>>> transactions that share a common set of block and log reservations.
>>> 
>>> To make a rolling transaction chain an atomic operation on a
>>> specific object (e.g. an inode) that object has to remain locked and
>>> be logged in every transaction in the chain of rolling transactions.
>>> This keeps it moving forward in the log and prevents it from pinning
>>> the tail of the log and causing log space deadlocks.
>>> 
>>> Fundamentally, though, each individual transaction in a rolling
>>> transaction is an independent atomic change set. So when you say
>>> "roll the transaction", what you are actually saying is "commit the
>>> current transaction and start a new transaction using the
>>> reservations the current transaction already holds."
>>> 
>>> When I say "transaction commit" I am only refering to the process
>>> that closes off the current transaction. If this is in the middle of
>>> a rolling transaction, then what I am talking about is
>>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
>>> the current transaction context.
>>> 
>>> Transaction contexts running defered operations, intent chains, etc
>>> are *always* rolling transactions, and each time we roll the
>>> transaction we commit it.
>>> 
>>> IOWS, when I say "transaction commit" and you say "transaction roll"
>>> we are talking about exactly the same thing - transaction commit is
>>> the operation that roll performs to finish off the current change
>>> set...
>>> 
>>> Ideally, nobody should be calling xfs_trans_roll() directly anymore.
>>> All rolling transactions should be implemented using deferred ops
>>> nowdays - xfs_trans_roll() is the historic method of running rolling
>>> transactions. e.g. see the recent rework of the attribute code to
>>> use deferred ops rather than explicit calls to xfs_trans_roll() so
>>> we can use intents for running attr operations.
>>> 
>>> These days the transaction model is based around chains of deferred
>>> operations. We attach deferred work to the transaction, and then
>>> when we call xfs_trans_commit() it goes off into the deferred work
>>> infrastructure that creates intents and manages the transaction
>>> context for processing the intents itself.
>>> 
>>> This is the primary reason we are trying to move away from high
>>> level code using transaction rolling - we can't easily change the
>>> way we process operations when the high level code controls
>>> transaction contexts. Using deferred intent chains gives us fine
>>> grained control over transaction context in the deferred ops
>>> processing code - we can roll to a new transaction whenever we need
>>> to....
>>> 
>> 
>> Above is really helpful.
>> I when I mention transaction roll, I meant
>> 1) a new transaction is created, but its self does’t reserve any resource.
>> Instead, it inherits all the unused resources from the old transaction.
>> 2) commit the old transaction.
>> 3) don’t un-reserve unused resources from old transaction, the new transaction
>> will inherits them.
>> 4) use the new transaction for further log items.
>> 
>> As I understand, the key is that the new transaction its self doesn’t reserve new resources.
>> 
>> And when I read your “transaction commit” I understood it as this:
>> 1). commit old transaction
>> 2) un-reserve unused resources from old transaction
>> 3) create new transaction with all resources reserved.
>> 
>> Thus in my understand your “transaction commit” would have no risk of lack of log space to put
>> the extra EFI/EFDs.  But reading above, I’d think you meant “transaction roll” actually.
>> 
>>>> One question is that if only one EFI is safe per transaction, how
>>>> we ensure that there is only one EFI per transaction in case there
>>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
>>>> free in current code?
>>> 
>>> That will handled exactly the same way it does with your change - it
>>> will simply split up the work items so there are multiple intents
>>> logged.
>> 
>> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough
>> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage),
>> seems you have no concern logging more EFI/EFDs.  So why they are different?
>> 
>> thanks,
>> wengang



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: IO time one extent per EFI
  2023-04-24 15:53                 ` Wengang Wang
@ 2023-04-24 22:52                   ` Wengang Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Wengang Wang @ 2023-04-24 22:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



> On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>> 
>>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>> We don't do that anymore for partially processed multi-extent
>>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>>> we log a complete intent done items alongside a new intent
>>>>>> containing the remaining work to be done in the same transaction.
>>>>>> This cancels the original intent and atomically replaces it with a
>>>>>> new intent containing the remaining work to be done.
>>>>>> 
>>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>>> work remaining to be done".
>>>>> 
>>>>> OK. so let’s see the difference between your implementation from mine.
>>>>> Say, there are three extents to free.
>>>>> 
>>>>> My patch would result in:
>>>>> 
>>>>> EFI 1  with extent1
>>>>> free extent1
>>>>> EFD 1 with extent1
>>>>> transaction roll
>>>>> EFI 2 with extent2
>>>>> free extent2
>>>>> EFD 2 with extent2
>>>>> transaction roll
>>>>> EFI 3 with extent3
>>>>> free extent3
>>>>> EFD3 with extent3
>>>>> transaction commit
>>>> 
>>>> No, it wouldn't. This isn't how the deferred work processes work
>>>> items on the transaction. A work item with multiple extents on it
>>>> would result in this:
>>>> 
>>>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>>>> xfs_defer_finish_noroll
>>>> xfs_defer_create_intents()
>>>>   list_for_each_defer_pending
>>>>     xfs_defer_create_intent(dfp)
>>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>> xfs_extent_free_create_intent()
>>>>   <create EFI>
>>>>   list_for_each_xefi
>>>>     xfs_extent_free_log_item(xefi)
>>>> <adds extent to current EFI>
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> 
>>>> And xfs_defer_finish_noroll() continues with
>>>> 
>>>> <grabs first work item>
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> And once the processing of the single work item is done we loop
>>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>>> have any new intents, so xfs_defer_create_intents() returns false,
>>>> but we completed a dfp work item, so we run:
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> 
>>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> 
>>>> But we have not committed the final extent free or EFD 3 - that's in
>>>> the last transaction context we pass back to the _xfs_trans_commit()
>>>> context for it to finalise and close off the rolling transaction
>>>> chain. At this point, we finally have this in the CIL:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 3 with extent3
>>> 
>>> 
>>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>>> thinking it only create intent for the first in tp->t_dfops.
>>> 
>>>> 
>>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>>> 
>>>> I *always* ignore CIL intent whiteouts when thinking about
>>>> transaction chains and intents. That is purely a journal efficiency
>>>> optimisation, not something that is necessary for correct operation.
>>> 
>>> OK.
>>> 
>>>> 
>>>>> Your idea yields this:
>>>>> 
>>>>> EFI 1 with extent1 extent2 extent3
>>>>> free extent1
>>>>> EFI 2 with extent2 extent3
>>>>> EFD 1 with extent1 extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent2
>>>>> EFI 3 with extent3
>>>>> EFD 2 with extent extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent3
>>>>> EFD 3 with extent3
>>>>> transaction commit.
>>>> 
>>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>>> not suggesting open coding transaction rolling like that. Everything
>>>> I am suggesting works through the same defer ops mechanism as you
>>>> are describing. So if we start with the initial journal contents
>>>> looks like this:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> 
>>>> Then we run xfs_defer_finish_one() on that work, 
>>>> 
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> But now we have 3 xefis on the work to process. So on success of
>>>> the first call to xfs_trans_free_extent(), we want it to return
>>>> -EAGAIN to trigger the existing relogging code to create the new
>>>> EFI. How this works is describe in the section "Requesting a
>>>> Fresh Transaction while Finishing Deferred Work" in
>>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>>> 
>>>> The result is that the deferred work infrastructure does the work
>>>> of updaing the done intent and creating the new intents for the work
>>>> remaining. Hence after the next transaction roll, we have in the CIL
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent2 extent3.
>>>> 
>>> 
>>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> transaction roll
>>> <AGF, AGFL, free space btree block mods>  for extent 1
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> transaction roll
>>> free extent 2
>>> EFD 2 with extent2 extent3
>>> EFI 3 with extent3
>>> transaction roll
>>> free extent 3
>>> EFD 3 with extent3
>>> 
>>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>>> I got it.
>>> 
>>>> And so the loop goes until there is no more work remaining.
>>>> 
>>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>>> 
>>>> The difference is that what I'm suggesting is that the extent free
>>>> code can decide when it needs a transaction to be rolled. It isn't
>>>> forced to always run a single free per transaction, it can decide
>>>> that it can free multiple extents per transaction if there is no
>>>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>>>> everything to be limited to a transaction per extent free even when
>>>> there is no risk of deadlocks freeing multiple extents in a single
>>>> transaction is unnecessary.
>>>> 
>>>> Indeed, if the second extent is in a different AG, we don't risk
>>>> busy extents causing us issues, so we could do:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF 1, AGFL 1, free space btree block mods>
>>>> <AGF 2, AGFL 2, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent3.
>>>> .....
>>>> 
>>> 
>>> My thumb is up.
>> 
>> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
>> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
>> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
>> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
>> 
>> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
>> are stored there. For those AGs
>> they have pending busy extents in in-memory transactions (not committed to CIL).
>> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
>> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
>> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
>> the busy extents are committed to the xfs_cil_ctx.
>> Well, wondering if the pending busy extents would remain in pending state long before
>> committed to CIL and that become a bottle neck for freeing extents.  
> 
> Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
> For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
> is always OK to go, there shouldn’t be a bottle neck.

I just sent the first attempt (this way), pls review.

thanks,
wengang


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-04-24 22:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang
2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
2023-04-19 23:55   ` Dave Chinner
2023-04-20 17:31     ` Wengang Wang
2023-04-20 23:22       ` Dave Chinner
2023-04-21  0:24         ` Wengang Wang
2023-04-21  9:34           ` Dave Chinner
2023-04-21 18:23             ` Wengang Wang
2023-04-22  3:22               ` Wengang Wang
2023-04-24 15:53                 ` Wengang Wang
2023-04-24 22:52                   ` Wengang Wang
2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
2023-04-20  0:30   ` Dave Chinner
2023-04-20 17:10     ` Wengang Wang
2023-04-20 22:54       ` Dave Chinner
2023-04-21  0:32         ` Wengang Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.