* [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path
@ 2022-05-21 10:14 Heming Zhao via Ocfs2-devel
  2022-05-21 10:14 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally Heming Zhao via Ocfs2-devel
  2022-06-02  9:34 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Joseph Qi via Ocfs2-devel
  0 siblings, 2 replies; 10+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-05-21 10:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

Running defragfs triggers a jbd2 ASSERT.

code path:

ocfs2_ioctl_move_extents
 ocfs2_move_extents
  __ocfs2_move_extents_range
   ocfs2_defrag_extent
    __ocfs2_move_extent
     + ocfs2_journal_access_di
     + ocfs2_split_extent  //[1]
     + ocfs2_journal_dirty //crash

[1]: ocfs2_split_extent calls ocfs2_extend_trans, which commits the
dirty buffers and then restarts the transaction. The changed transaction
triggers the jbd2 ASSERT.

crash stacks:

PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
 #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
 #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
 #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
 #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
 #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
 #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
 #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
    [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
    RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
    RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
    R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
    R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
 #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
 #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]

The fix is simple: ocfs2_split_extent contains three code paths, and all
of them already have their own journal access/dirty pairs. We only need
to remove the redundant journal access/dirty pair from
__ocfs2_move_extent.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/move_extents.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 192cad0662d8..6251748c695b 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -105,14 +105,6 @@ static int __ocfs2_move_extent(handle_t *handle,
 	 */
 	replace_rec.e_flags = ext_flags & ~OCFS2_EXT_REFCOUNTED;
 
-	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode),
-				      context->et.et_root_bh,
-				      OCFS2_JOURNAL_ACCESS_WRITE);
-	if (ret) {
-		mlog_errno(ret);
-		goto out;
-	}
-
 	ret = ocfs2_split_extent(handle, &context->et, path, index,
 				 &replace_rec, context->meta_ac,
 				 &context->dealloc);
@@ -121,8 +113,6 @@ static int __ocfs2_move_extent(handle_t *handle,
 		goto out;
 	}
 
-	ocfs2_journal_dirty(handle, context->et.et_root_bh);
-
 	context->new_phys_cpos = new_p_cpos;
 
 	/*
-- 
2.34.1


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


* [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-05-21 10:14 [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Heming Zhao via Ocfs2-devel
@ 2022-05-21 10:14 ` Heming Zhao via Ocfs2-devel
  2022-06-02 10:02   ` Joseph Qi via Ocfs2-devel
  2022-06-12  2:57   ` Joseph Qi via Ocfs2-devel
  2022-06-02  9:34 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Joseph Qi via Ocfs2-devel
  1 sibling, 2 replies; 10+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-05-21 10:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

When the la state is ENABLED, ocfs2_recalc_la_window restores the la
window unconditionally. This logic is wrong.

Consider the following path:

1. The la state (->local_alloc_state) is set to THROTTLED or DISABLED.

2. After about 30s (OCFS2_LA_ENABLE_INTERVAL), the delayed work fires
   and ocfs2_la_enable_worker sets the la state directly to ENABLED.

3. A write IO thread runs:

   ```
   ocfs2_write_begin
    ...
     ocfs2_lock_allocators
      ocfs2_reserve_clusters
       ocfs2_reserve_clusters_with_limit
        ocfs2_reserve_local_alloc_bits
         ocfs2_local_alloc_slide_window // [1]
          + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
          + ...
          + ocfs2_local_alloc_new_window
             ocfs2_claim_clusters // [3]
   ```

[1]: called when the la window bits are used up.
[2]: when the la state is ENABLED (e.g. after the OCFS2_LA_ENABLE_INTERVAL
     delayed work has run), this unconditionally restores the la window
     to its default value.
[3]: uses the default la window size to search for clusters. The time
     complexity is O(n^4), which costs a huge amount of time scanning
     the global bitmap and makes write IO (e.g. a user-space 'dd')
     dramatically slow.

For example, take an ocfs2 partition of 1.45TB with a 4KB cluster size
and a default la window size of 106MB. The partition is fragmented by
creating and deleting a huge number of small files.

The worst-case timing works out as follows (numbers taken from a
real-world case):
- la window size change order (size: MB):
  106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
  Only 0.8MB succeeds, and 0.8MB also triggers the la window to be
  disabled. ocfs2_local_alloc_new_window retries 8 times; the first 7
  attempts all run in the worst case.
- group chain number: 242
  ocfs2_claim_suballoc_bits runs its for-loop 242 times.
- each chain has 49 block groups
  ocfs2_search_chain runs its while-loop 49 times.
- each block group has 32256 blocks
  ocfs2_block_group_find_clear_bits runs its while-loop over 32256 bits.
  Since ocfs2_find_next_zero_bit uses ffz() to find a zero bit one
  64-bit word at a time, let's use (32256/64) for the timing calculation.

So the total loop count is: 7*242*49*(32256/64) = 41835024 (~42 million
iterations).

In the worst case, a user-space write of 100MB of data triggers 42M scan
iterations, and if the write can't finish within 30s
(OCFS2_LA_ENABLE_INTERVAL), it suffers another 42M iterations. This
keeps the ocfs2 partition at poor performance all the time.

The fix:

1. Restore the la window by doubling its size each time.

The current code halves the la window on each shrink, but restores it to
default_bits in a single step. This bounces the la window between '<1M'
and default_bits. This patch makes the restore process smoother.
E.g. the la default window is 106MB and the current la window is 13MB.
When a free action releases one block group's worth of space, la should
grow back to 26MB (13*2). If many free actions release many block
groups' space, la will smoothly grow back to the default window (106MB).

2. Introduce a new state: OCFS2_LA_RESTORE.

The current code uses OCFS2_LA_ENABLED to mark that a big contiguous
space has become available. That state overwrites OCFS2_LA_THROTTLED,
which makes the la window forget that it is already throttled.
'->local_alloc_state' should keep OCFS2_LA_THROTTLED until the la
window is restored to default_bits.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/localalloc.c | 30 +++++++++++++++++++++---------
 fs/ocfs2/ocfs2.h      | 18 +++++++++++-------
 fs/ocfs2/suballoc.c   |  2 +-
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c4426d12a2ad..28acea717d7f 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -205,20 +205,21 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
 
 static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
 {
-	return (osb->local_alloc_state == OCFS2_LA_THROTTLED ||
-		osb->local_alloc_state == OCFS2_LA_ENABLED);
+	return osb->local_alloc_state & OCFS2_LA_ACTIVE;
 }
 
 void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super *osb,
 				      unsigned int num_clusters)
 {
 	spin_lock(&osb->osb_lock);
-	if (osb->local_alloc_state == OCFS2_LA_DISABLED ||
-	    osb->local_alloc_state == OCFS2_LA_THROTTLED)
+	if (osb->local_alloc_state & (OCFS2_LA_DISABLED |
+		OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)) {
 		if (num_clusters >= osb->local_alloc_default_bits) {
 			cancel_delayed_work(&osb->la_enable_wq);
-			osb->local_alloc_state = OCFS2_LA_ENABLED;
+			osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
+			osb->local_alloc_state |= OCFS2_LA_RESTORE;
 		}
+	}
 	spin_unlock(&osb->osb_lock);
 }
 
@@ -228,7 +229,10 @@ void ocfs2_la_enable_worker(struct work_struct *work)
 		container_of(work, struct ocfs2_super,
 			     la_enable_wq.work);
 	spin_lock(&osb->osb_lock);
-	osb->local_alloc_state = OCFS2_LA_ENABLED;
+	if (osb->local_alloc_state & OCFS2_LA_DISABLED) {
+		osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
+		osb->local_alloc_state |= OCFS2_LA_ENABLED;
+	}
 	spin_unlock(&osb->osb_lock);
 }
 
@@ -1067,7 +1071,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
 			 * reason to assume the bitmap situation might
 			 * have changed.
 			 */
-			osb->local_alloc_state = OCFS2_LA_THROTTLED;
+			osb->local_alloc_state |= OCFS2_LA_THROTTLED;
 			osb->local_alloc_bits = bits;
 		} else {
 			osb->local_alloc_state = OCFS2_LA_DISABLED;
@@ -1083,8 +1087,16 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
 	 * risk bouncing around the global bitmap during periods of
 	 * low space.
 	 */
-	if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
-		osb->local_alloc_bits = osb->local_alloc_default_bits;
+	if (osb->local_alloc_state & OCFS2_LA_RESTORE) {
+		bits = osb->local_alloc_bits * 2;
+		if (bits > osb->local_alloc_default_bits) {
+			osb->local_alloc_bits = osb->local_alloc_default_bits;
+			osb->local_alloc_state = OCFS2_LA_ENABLED;
+		} else {
+			/* keep RESTORE state & set new bits */
+			osb->local_alloc_bits = bits;
+		}
+	}
 
 out_unlock:
 	state = osb->local_alloc_state;
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..1764077e3229 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -245,14 +245,18 @@ struct ocfs2_alloc_stats
 
 enum ocfs2_local_alloc_state
 {
-	OCFS2_LA_UNUSED = 0,	/* Local alloc will never be used for
-				 * this mountpoint. */
-	OCFS2_LA_ENABLED,	/* Local alloc is in use. */
-	OCFS2_LA_THROTTLED,	/* Local alloc is in use, but number
-				 * of bits has been reduced. */
-	OCFS2_LA_DISABLED	/* Local alloc has temporarily been
-				 * disabled. */
+	/* Local alloc will never be used for this mountpoint. */
+	OCFS2_LA_UNUSED = 1 << 0,
+	/* Local alloc is in use. */
+	OCFS2_LA_ENABLED = 1 << 1,
+	/* Local alloc is in use, but number of bits has been reduced. */
+	OCFS2_LA_THROTTLED = 1 << 2,
+	/* In throttle state, Local alloc meets contig big space. */
+	OCFS2_LA_RESTORE = 1 << 3,
+	/* Local alloc has temporarily been disabled. */
+	OCFS2_LA_DISABLED = 1 << 4,
 };
+#define OCFS2_LA_ACTIVE (OCFS2_LA_ENABLED | OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)
 
 enum ocfs2_mount_options
 {
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..b0df1ab2d6dd 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1530,7 +1530,7 @@ static int ocfs2_cluster_group_search(struct inode *inode,
 		 * of bits. */
 		if (min_bits <= res->sr_bits)
 			search = 0; /* success */
-		else if (res->sr_bits) {
+		if (res->sr_bits) {
 			/*
 			 * Don't show bits which we'll be returning
 			 * for allocation to the local alloc bitmap.
-- 
2.34.1




* Re: [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path
  2022-05-21 10:14 [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Heming Zhao via Ocfs2-devel
  2022-05-21 10:14 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally Heming Zhao via Ocfs2-devel
@ 2022-06-02  9:34 ` Joseph Qi via Ocfs2-devel
  2022-06-04  0:08   ` Heming Zhao via Ocfs2-devel
  1 sibling, 1 reply; 10+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-06-02  9:34 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel

Hi,

Sorry for the late response since I was busy on other things...

On 5/21/22 6:14 PM, Heming Zhao wrote:
> defragfs triggered jbd2 report ASSERT when working.
> 
> code path:
> 
> ocfs2_ioctl_move_extents
>  ocfs2_move_extents
>   __ocfs2_move_extents_range
>    ocfs2_defrag_extent
>     __ocfs2_move_extent
>      + ocfs2_journal_access_di
>      + ocfs2_split_extent  //[1]
>      + ocfs2_journal_dirty //crash
> 
> [1]: ocfs2_split_extent called ocfs2_extend_trans, which committed
> dirty buffers then restarted the transaction. The changed transaction
> triggered jbd2 ASSERT.
> 
> crash stacks:
> 
> PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
>  #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
>  #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
>  #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
>  #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
>  #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
>  #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
>  #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
>     [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
>     RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
>     RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
>     RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
>     RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
>     R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
>     R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
>  #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
>  #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]
> 
> The fix method is simple, ocfs2_split_extent includes three paths. all
> the paths already have journal access/dirty pair. We only need to
> remove journal access/dirty from __ocfs2_move_extent.
> 

I am not sure what you mean by "all three paths have journal access/
dirty pair".

It seems we can't do it your way, as each journal access has a different
ocfs2_trigger with a different offset, e.g. a dinode is different from
an extent block.

Or could we reserve enough credits to resolve this issue?

Thanks,
Joseph



* Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-05-21 10:14 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally Heming Zhao via Ocfs2-devel
@ 2022-06-02 10:02   ` Joseph Qi via Ocfs2-devel
  2022-06-12  2:57   ` Joseph Qi via Ocfs2-devel
  1 sibling, 0 replies; 10+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-06-02 10:02 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 5/21/22 6:14 PM, Heming Zhao wrote:
> When la state is ENABLE, ocfs2_recalc_la_window restores la window
> unconditionally. The logic is wrong.
> 
> Let's imagine the below path.
> 
> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
> 
> 2. About 30s (OCFS2_LA_ENABLE_INTERVAL), delayed work is triggered,
>    ocfs2_la_enable_worker set la state to ENABLED directly.
> 
> 3. a write IOs thread run:
> 
>    ```
>    ocfs2_write_begin
>     ...
>      ocfs2_lock_allocators
>       ocfs2_reserve_clusters
>        ocfs2_reserve_clusters_with_limit
>         ocfs2_reserve_local_alloc_bits
>          ocfs2_local_alloc_slide_window // [1]
>           + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
>           + ...
>           + ocfs2_local_alloc_new_window
>              ocfs2_claim_clusters // [3]
>    ```
> 
> [1]: will be called when la window bits used up.
> [2]: under la state is ENABLED (eg OCFS2_LA_ENABLE_INTERVAL delayed work
>      happened), it unconditionally restores la window to default value.
> [3]: will use default la window size to search clusters. IMO the timing
>      is O(n^4). The timing O(n^4) will cost huge time to scan global
>      bitmap. It makes write IOs (eg user space 'dd') become dramatically
>      slow.
> 
> i.e.
> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
> la window default size: 106MB.
> The partition is fragmentation by creating & deleting huge mount of
> small file.
> 
> the timing should be (the number got from real world):
> - la window size change order (size: MB):
>   106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>   only 0.8MB succeed, 0.8MB also triggers la window to disable.
>   ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>   runs in worst case.
> - group chain number: 242
>   ocfs2_claim_suballoc_bits calls for-loop 242 times
> - each chain has 49 block group
>   ocfs2_search_chain calls while-loop 49 times
> - each bg has 32256 blocks
>   ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>   for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>      (32256/64) for timing calculation.
> 
> So the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
> 
> In the worst case, user space writes 100MB data will trigger 42M scanning
> times, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
> the write IO will suffer another 42M scanning times. It makes the ocfs2
> partition keep poor performance all the time.
> 

The scenario makes sense.
I have to spend more time to dig into the code and then get back to you.

Thanks,
Joseph



* Re: [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path
  2022-06-02  9:34 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Joseph Qi via Ocfs2-devel
@ 2022-06-04  0:08   ` Heming Zhao via Ocfs2-devel
  2022-06-10  7:46     ` Joseph Qi via Ocfs2-devel
  0 siblings, 1 reply; 10+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-06-04  0:08 UTC (permalink / raw)
  To: Joseph Qi, ocfs2-devel

On Thu, Jun 02, 2022 at 05:34:18PM +0800, Joseph Qi wrote:
> Hi,
> 
> Sorry for the late response since I was busy on other things...
> 
> On 5/21/22 6:14 PM, Heming Zhao wrote:
> > defragfs triggered jbd2 report ASSERT when working.
> > 
> > code path:
> > 
> > ocfs2_ioctl_move_extents
> >  ocfs2_move_extents
> >   __ocfs2_move_extents_range
> >    ocfs2_defrag_extent
> >     __ocfs2_move_extent
> >      + ocfs2_journal_access_di
> >      + ocfs2_split_extent  //[1]
> >      + ocfs2_journal_dirty //crash
> > 
> > [1]: ocfs2_split_extent called ocfs2_extend_trans, which committed
> > dirty buffers then restarted the transaction. The changed transaction
> > triggered jbd2 ASSERT.
> > 
> > crash stacks:
> > 
> > PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
> >  #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
> >  #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
> >  #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
> >  #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
> >  #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
> >  #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
> >  #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
> >     [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
> >     RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
> >     RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
> >     RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
> >     RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
> >     R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
> >     R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >  #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
> >  #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
> >  #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]
> > 
> > The fix method is simple, ocfs2_split_extent includes three paths. all
> > the paths already have journal access/dirty pair. We only need to
> > remove journal access/dirty from __ocfs2_move_extent.
> > 
> 
> I am not sure what you mean by "all three paths have journal access/
> dirty pair".

I am a newbie to ocfs2 & jbd2, so I can't be sure this [1/2] patch is
correct. Below is my analysis.

All three paths (below 1.[123]):

__ocfs2_move_extent
 + ocfs2_journal_access_di
 |
 + ocfs2_split_extent           //[1]
 |  + ocfs2_replace_extent_rec  //[1.1]
 |  + ocfs2_split_and_insert    //[1.2]
 |  + ocfs2_try_to_merge_extent //[1.3]
 |
 + + ocfs2_journal_dirty       //crash

All three paths have their own journal access/dirty pairs.
So the access/dirty pair in the caller __ocfs2_move_extent is unnecessary.

> 
> It seems we can't do it in your way, as journal access has different
> ocfs2_trigger with different offset, e.g. dinode is different from
> extent block.

There are 3 ocfs2_split_extent callers:
 fs/ocfs2/alloc.c <<ocfs2_change_extent_flag>>
 fs/ocfs2/move_extents.c <<__ocfs2_move_extent>>
 fs/ocfs2/refcounttree.c <<ocfs2_clear_ext_refcount>>

We can see that, except for the defrag flow (caller __ocfs2_move_extent),
the other two don't wrap ocfs2_split_extent with a jbd2 access/dirty pair.

In my view, in the defrag code flow, the struct ocfs2_triggers is
changed many times. The outermost ocfs2_triggers (belonging to
__ocfs2_move_extent) must get overwritten in the subroutine
ocfs2_split_extent.

From another perspective: why don't ocfs2_change_extent_flag &
ocfs2_clear_ext_refcount wrap ocfs2_split_extent with a journal
access/dirty pair?

> 
> Or we may reserve enough credits to resolve this issue? 
> 

In my view, credits are not the key to this crash. The crash is
triggered by the transaction being changed:

The crashing code (in jbd2_journal_dirty_metadata):

```
if (data_race(jh->b_transaction != transaction &&
		jh->b_next_transaction != transaction)) {
	spin_lock(&jh->b_state_lock);
	J_ASSERT_JH(jh, jh->b_transaction == transaction ||
		jh->b_next_transaction == transaction);
	spin_unlock(&jh->b_state_lock);
}
```

Why is the transaction changed?

ocfs2_split_extent
 ocfs2_split_and_insert
  ocfs2_do_insert_extent
    ocfs2_insert_path
       ocfs2_extend_trans // changes the transaction; pls note this func's comments

Thanks,
Heming




* Re: [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path
  2022-06-04  0:08   ` Heming Zhao via Ocfs2-devel
@ 2022-06-10  7:46     ` Joseph Qi via Ocfs2-devel
  0 siblings, 0 replies; 10+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-06-10  7:46 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 6/4/22 8:08 AM, Heming Zhao wrote:
> On Thu, Jun 02, 2022 at 05:34:18PM +0800, Joseph Qi wrote:
>> Hi,
>>
>> Sorry for the late response since I was busy on other things...
>>
>> On 5/21/22 6:14 PM, Heming Zhao wrote:
>>> defragfs triggered jbd2 report ASSERT when working.
>>>
>>> code path:
>>>
>>> ocfs2_ioctl_move_extents
>>>  ocfs2_move_extents
>>>   __ocfs2_move_extents_range
>>>    ocfs2_defrag_extent
>>>     __ocfs2_move_extent
>>>      + ocfs2_journal_access_di
>>>      + ocfs2_split_extent  //[1]
>>>      + ocfs2_journal_dirty //crash
>>>
>>> [1]: ocfs2_split_extent called ocfs2_extend_trans, which committed
>>> dirty buffers then restarted the transaction. The changed transaction
>>> triggered jbd2 ASSERT.
>>>
>>> crash stacks:
>>>
>>> PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
>>>  #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
>>>  #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
>>>  #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
>>>  #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
>>>  #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
>>>  #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
>>>  #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
>>>     [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
>>>     RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
>>>     RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
>>>     RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
>>>     RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
>>>     R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
>>>     R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>  #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
>>>  #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
>>>  #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]
>>>
>>> The fix method is simple, ocfs2_split_extent includes three paths. all
>>> the paths already have journal access/dirty pair. We only need to
>>> remove journal access/dirty from __ocfs2_move_extent.
>>>
>>
>> I am not sure what you mean by "all three paths have journal access/
>> dirty pair".
> 
> I am a newbie for ocfs2 & jbd2, I can't make sure this [1/2] patch is correct.
> Below is my analysis.
> 
> All three paths (below 1.[123]):
> 
> __ocfs2_move_extent
>  + ocfs2_journal_access_di
>  |
>  + ocfs2_split_extent           //[1]
>  |  + ocfs2_replace_extent_rec  //[1.1]
>  |  + ocfs2_split_and_insert    //[1.2]
>  |  + ocfs2_try_to_merge_extent //[1.3]
>  |
>  + + ocfs2_journal_dirty       //crash
> 
> All three paths have their own journal access/dirty pairs.
> So the access/dirty pair in caller __ocfs2_move_extent is unnecessary.
> 

ocfs2_journal_access_xxx() notifies jbd2 to do a specific operation on a
bh. As I said in my last mail, they are for different bhs.

>>
>> It seems we can't do it in your way, as journal access has different
>> ocfs2_trigger with different offset, e.g. dinode is different from
>> extent block.
> 
> There are 3 ocfs2_split_extent callers:
>  fs/ocfs2/alloc.c <<ocfs2_change_extent_flag>>
>  fs/ocfs2/move_extents.c <<__ocfs2_move_extent>>
>  fs/ocfs2/refcounttree.c <<ocfs2_clear_ext_refcount>>
> 
> We can see, except defrag flow (caller __ocfs2_move_extent), other two
> don't use jbd2 access/dirty pair around ocfs2_split_extent.
> 

Not exactly, you have to look at the whole flow.

Thanks,
Joseph



* Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-05-21 10:14 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally Heming Zhao via Ocfs2-devel
  2022-06-02 10:02   ` Joseph Qi via Ocfs2-devel
@ 2022-06-12  2:57   ` Joseph Qi via Ocfs2-devel
  2022-06-12  7:45     ` heming.zhao--- via Ocfs2-devel
  1 sibling, 1 reply; 10+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-06-12  2:57 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 5/21/22 6:14 PM, Heming Zhao wrote:
> When la state is ENABLE, ocfs2_recalc_la_window restores la window
> unconditionally. The logic is wrong.
> 
> Let's imagine the below path.
> 
> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
> 
> 2. About 30s (OCFS2_LA_ENABLE_INTERVAL), delayed work is triggered,
>    ocfs2_la_enable_worker set la state to ENABLED directly.
> 
> 3. a write IOs thread run:
> 
>    ```
>    ocfs2_write_begin
>     ...
>      ocfs2_lock_allocators
>       ocfs2_reserve_clusters
>        ocfs2_reserve_clusters_with_limit
>         ocfs2_reserve_local_alloc_bits
>          ocfs2_local_alloc_slide_window // [1]
>           + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
>           + ...
>           + ocfs2_local_alloc_new_window
>              ocfs2_claim_clusters // [3]
>    ```
> 
> [1]: will be called when la window bits used up.
> [2]: under la state is ENABLED (eg OCFS2_LA_ENABLE_INTERVAL delayed work
>      happened), it unconditionally restores la window to default value.
> [3]: will use default la window size to search clusters. IMO the timing
>      is O(n^4). The timing O(n^4) will cost huge time to scan global
>      bitmap. It makes write IOs (eg user space 'dd') become dramatically
>      slow.
> 
> i.e.
> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
> la window default size: 106MB.
> The partition is fragmentation by creating & deleting huge mount of
> small file.
> 
> the timing should be (the number got from real world):
> - la window size change order (size: MB):
>   106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>   only 0.8MB succeed, 0.8MB also triggers la window to disable.
>   ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>   runs in worst case.
> - group chain number: 242
>   ocfs2_claim_suballoc_bits calls for-loop 242 times
> - each chain has 49 block group
>   ocfs2_search_chain calls while-loop 49 times
> - each bg has 32256 blocks
>   ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>   for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>      (32256/64) for timing calculation.
> 
> So the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
> 
> In the worst case, user space writes 100MB data will trigger 42M scanning
> times, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
> the write IO will suffer another 42M scanning times. It makes the ocfs2
> partition keep poor performance all the time.
> 
> The fix method:
> 
> 1.  la restores double la size once.
> 
> current code logic decrease la window with half size once, but directly
> restores default_bits one time. It bounces the la window between '<1M'
> and default_bits. This patch makes restoring process more smoothly.
> eg.
> la default window is 106MB, current la window is 13MB.
> when there is a free action to release one block group space, la should
> roll back la size to 26MB (by 13*2).
> if there are many free actions to release many block group space, la
> will smoothly roll back to default window (106MB).
> 
> 2. introduced a new state: OCFS2_LA_RESTORE.
> 
> Current code uses OCFS2_LA_ENABLED to mark a new big space available.
> the state overwrite OCFS2_LA_THROTTLED, it makes la window forget
> it's already in throttled status.
> '->local_alloc_state' should keep OCFS2_LA_THROTTLED until la window
> restore to default_bits.

Since we now have enough free space, why not restore to the default la
window?

This issue happens in a corner case, where the problem is that the
restored window is too large. I agree with your method of doubling the
la size on each restore instead of jumping directly to the default. So
why not just change the logic of ocfs2_recalc_la_window() to do this?

Thanks,
Joseph

> 
> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
>  fs/ocfs2/localalloc.c | 30 +++++++++++++++++++++---------
>  fs/ocfs2/ocfs2.h      | 18 +++++++++++-------
>  fs/ocfs2/suballoc.c   |  2 +-
>  3 files changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
> index c4426d12a2ad..28acea717d7f 100644
> --- a/fs/ocfs2/localalloc.c
> +++ b/fs/ocfs2/localalloc.c
> @@ -205,20 +205,21 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
>  
>  static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
>  {
> -	return (osb->local_alloc_state == OCFS2_LA_THROTTLED ||
> -		osb->local_alloc_state == OCFS2_LA_ENABLED);
> +	return osb->local_alloc_state & OCFS2_LA_ACTIVE;
>  }
>  
>  void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super *osb,
>  				      unsigned int num_clusters)
>  {
>  	spin_lock(&osb->osb_lock);
> -	if (osb->local_alloc_state == OCFS2_LA_DISABLED ||
> -	    osb->local_alloc_state == OCFS2_LA_THROTTLED)
> +	if (osb->local_alloc_state & (OCFS2_LA_DISABLED |
> +		OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)) {
>  		if (num_clusters >= osb->local_alloc_default_bits) {
>  			cancel_delayed_work(&osb->la_enable_wq);
> -			osb->local_alloc_state = OCFS2_LA_ENABLED;
> +			osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
> +			osb->local_alloc_state |= OCFS2_LA_RESTORE;
>  		}
> +	}
>  	spin_unlock(&osb->osb_lock);
>  }
>  
> @@ -228,7 +229,10 @@ void ocfs2_la_enable_worker(struct work_struct *work)
>  		container_of(work, struct ocfs2_super,
>  			     la_enable_wq.work);
>  	spin_lock(&osb->osb_lock);
> -	osb->local_alloc_state = OCFS2_LA_ENABLED;
> +	if (osb->local_alloc_state & OCFS2_LA_DISABLED) {
> +		osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
> +		osb->local_alloc_state |= OCFS2_LA_ENABLED;
> +	}
>  	spin_unlock(&osb->osb_lock);
>  }
>  
> @@ -1067,7 +1071,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
>  			 * reason to assume the bitmap situation might
>  			 * have changed.
>  			 */
> -			osb->local_alloc_state = OCFS2_LA_THROTTLED;
> +			osb->local_alloc_state |= OCFS2_LA_THROTTLED;
>  			osb->local_alloc_bits = bits;
>  		} else {
>  			osb->local_alloc_state = OCFS2_LA_DISABLED;
> @@ -1083,8 +1087,16 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
>  	 * risk bouncing around the global bitmap during periods of
>  	 * low space.
>  	 */
> -	if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
> -		osb->local_alloc_bits = osb->local_alloc_default_bits;
> +	if (osb->local_alloc_state & OCFS2_LA_RESTORE) {
> +		bits = osb->local_alloc_bits * 2;
> +		if (bits > osb->local_alloc_default_bits) {
> +			osb->local_alloc_bits = osb->local_alloc_default_bits;
> +			osb->local_alloc_state = OCFS2_LA_ENABLED;
> +		} else {
> +			/* keep RESTORE state & set new bits */
> +			osb->local_alloc_bits = bits;
> +		}
> +	}
>  
>  out_unlock:
>  	state = osb->local_alloc_state;
> diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
> index 337527571461..1764077e3229 100644
> --- a/fs/ocfs2/ocfs2.h
> +++ b/fs/ocfs2/ocfs2.h
> @@ -245,14 +245,18 @@ struct ocfs2_alloc_stats
>  
>  enum ocfs2_local_alloc_state
>  {
> -	OCFS2_LA_UNUSED = 0,	/* Local alloc will never be used for
> -				 * this mountpoint. */
> -	OCFS2_LA_ENABLED,	/* Local alloc is in use. */
> -	OCFS2_LA_THROTTLED,	/* Local alloc is in use, but number
> -				 * of bits has been reduced. */
> -	OCFS2_LA_DISABLED	/* Local alloc has temporarily been
> -				 * disabled. */
> +	/* Local alloc will never be used for this mountpoint. */
> +	OCFS2_LA_UNUSED = 1 << 0,
> +	/* Local alloc is in use. */
> +	OCFS2_LA_ENABLED = 1 << 1,
> +	/* Local alloc is in use, but number of bits has been reduced. */
> +	OCFS2_LA_THROTTLED = 1 << 2,
> +	/* In throttle state, Local alloc meets contig big space. */
> +	OCFS2_LA_RESTORE = 1 << 3,
> +	/* Local alloc has temporarily been disabled. */
> +	OCFS2_LA_DISABLED = 1 << 4,
>  };
> +#define OCFS2_LA_ACTIVE (OCFS2_LA_ENABLED | OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)
>  
>  enum ocfs2_mount_options
>  {
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..b0df1ab2d6dd 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -1530,7 +1530,7 @@ static int ocfs2_cluster_group_search(struct inode *inode,
>  		 * of bits. */
>  		if (min_bits <= res->sr_bits)
>  			search = 0; /* success */
> -		else if (res->sr_bits) {
> +		if (res->sr_bits) {
>  			/*
>  			 * Don't show bits which we'll be returning
>  			 * for allocation to the local alloc bitmap.

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


* Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-06-12  2:57   ` Joseph Qi via Ocfs2-devel
@ 2022-06-12  7:45     ` heming.zhao--- via Ocfs2-devel
  2022-06-12 12:38       ` Joseph Qi via Ocfs2-devel
  0 siblings, 1 reply; 10+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-06-12  7:45 UTC (permalink / raw)
  To: Joseph Qi, ocfs2-devel

On 6/12/22 10:57, Joseph Qi wrote:
> 
> 
> On 5/21/22 6:14 PM, Heming Zhao wrote:
>> When the la state is ENABLED, ocfs2_recalc_la_window restores the la
>> window unconditionally. This logic is wrong.
>>
>> Let's examine the path below.
>>
>> 1. la state (->local_alloc_state) is set to THROTTLED or DISABLED.
>>
>> 2. After about 30s (OCFS2_LA_ENABLE_INTERVAL), the delayed work is
>>     triggered, and ocfs2_la_enable_worker sets the la state to ENABLED
>>     directly.
>>
>> 3. a write IO thread runs:
>>
>>     ```
>>     ocfs2_write_begin
>>      ...
>>       ocfs2_lock_allocators
>>        ocfs2_reserve_clusters
>>         ocfs2_reserve_clusters_with_limit
>>          ocfs2_reserve_local_alloc_bits
>>           ocfs2_local_alloc_slide_window // [1]
>>            + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
>>            + ...
>>            + ocfs2_local_alloc_new_window
>>               ocfs2_claim_clusters // [3]
>>     ```
>>
>> [1]: called when the la window bits are used up.
>> [2]: when the la state is ENABLED (e.g. after the
>>       OCFS2_LA_ENABLE_INTERVAL delayed work ran), it unconditionally
>>       restores the la window to the default value.
>> [3]: uses the default la window size to search for clusters. IMO the
>>       timing is O(n^4), which costs huge time scanning the global
>>       bitmap and makes write IOs (e.g. user-space 'dd') dramatically
>>       slow.
>>
>> e.g.
>> an ocfs2 partition of size 1.45TB, cluster size 4KB,
>> la window default size 106MB.
>> The partition is fragmented by creating & deleting a huge amount of
>> small files.
>>
>> The timing should be (the numbers come from the real world):
>> - la window size change order (size: MB):
>>    106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>    only 0.8MB succeeds, and 0.8MB also triggers disabling the la
>>    window. ocfs2_local_alloc_new_window retries 8 times; the first 7
>>    all run in the worst case.
>> - group chain number: 242
>>    ocfs2_claim_suballoc_bits runs its for-loop 242 times
>> - each chain has 49 block groups
>>    ocfs2_search_chain runs its while-loop 49 times
>> - each bg has 32256 blocks
>>    ocfs2_block_group_find_clear_bits runs its while-loop over 32256
>>    bits. ocfs2_find_next_zero_bit uses ffz() to find a zero bit, so
>>    let's use (32256/64) for the timing calculation.
>>
>> So the loop count is: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>
>> In the worst case, a user-space write of 100MB data triggers 42M scan
>> iterations, and if the write can't finish within 30s
>> (OCFS2_LA_ENABLE_INTERVAL), the write IO suffers another 42M scan
>> iterations. It makes the ocfs2 partition keep poor performance all the
>> time.
>>
>> The fix method:
>>
>> 1. Restore the la window by doubling its size each time.
>>
>> The current code halves the la window on each decrease, but restores it
>> to default_bits in a single step. That bounces the la window between
>> '<1M' and default_bits. This patch makes the restore process smoother.
>> e.g.
>> The la default window is 106MB and the current la window is 13MB.
>> When a free action releases one block group's space, la should roll
>> the la size back to 26MB (13*2).
>> If many free actions release many block groups' space, la will smoothly
>> roll back to the default window (106MB).
>>
>> 2. Introduce a new state: OCFS2_LA_RESTORE.
>>
>> The current code uses OCFS2_LA_ENABLED to mark new big space as
>> available. That state overwrites OCFS2_LA_THROTTLED, which makes the la
>> window forget it is already throttled.
>> '->local_alloc_state' should keep OCFS2_LA_THROTTLED until the la
>> window is restored to default_bits.
> 
> Since now we have enough free space, why not restore to default la
> window?

The key is that the decrease speed is not the same as the increase speed.
The la window decreases when the current la window size is no longer
available. But the current restore action happens as soon as any single
default_bits-sized space appears.
e.g.:
The default la window is 100MB, and the current system only has some 20MB
contiguous spaces. The la window changes to half size (10MB) when there is
no 20MB space any more, but the current restore logic restores it to
100MB, and the allocation path suffers the O(n^4) timing.

From my understanding, most ocfs2 users use ocfs2 to manage big, but not
huge numbers of, files. But my patch addresses this scenario: an ocfs2
volume containing a huge number of small files, with a use case that is
creating/deleting/moving small files all the time. That fragments the fs
totally.

> 
> This is an issue that happens in a corner case, where the complaint is
> that the current restore window is too large. I agree with your method
> of restoring double the la size each time instead of jumping to the
> default directly. So why not just change the logic of
> ocfs2_recalc_la_window() to do this?

This patch relates to two issues:
- the restore speed is quicker than the decrease speed.
- the la window is restored unconditionally.

Only changing ocfs2_recalc_la_window() can't avoid the unconditional
restore, so I introduced the new state OCFS2_LA_RESTORE, which helps
ocfs2 keep the throttled state.

BTW, while investigating this bug, I found two other issues/work items:
- the current allocation algorithm generates fragmentation very easily.
- the timing is O(n^4); even with this patch, it only becomes O(n^3):
   <group chain number> * <block groups per chain> * <bg bitmap size>
We need to improve the allocation algorithm to make ocfs2 more powerful.

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-06-12  7:45     ` heming.zhao--- via Ocfs2-devel
@ 2022-06-12 12:38       ` Joseph Qi via Ocfs2-devel
  2022-06-13  1:48         ` heming.zhao--- via Ocfs2-devel
  0 siblings, 1 reply; 10+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-06-12 12:38 UTC (permalink / raw)
  To: heming.zhao, ocfs2-devel



On 6/12/22 3:45 PM, heming.zhao@suse.com wrote:
> On 6/12/22 10:57, Joseph Qi wrote:
>>
>>
>> On 5/21/22 6:14 PM, Heming Zhao wrote:
>>> [ snip: commit message, quoted in full earlier in the thread ]
>>
>> Since now we have enough free space, why not restore to default la
>> window?
> 
> The key is that the decrease speed is not the same as the increase speed.
> The la window decreases when the current la window size is no longer
> available. But the current restore action happens as soon as any single
> default_bits-sized space appears.
> e.g.:
> The default la window is 100MB, and the current system only has some 20MB
> contiguous spaces. The la window changes to half size (10MB) when there is
> no 20MB space any more, but the current restore logic restores it to
> 100MB, and the allocation path suffers the O(n^4) timing.
> 
> From my understanding, most ocfs2 users use ocfs2 to manage big, but not
> huge numbers of, files. But my patch addresses this scenario: an ocfs2
> volume containing a huge number of small files, with a use case that is
> creating/deleting/moving small files all the time. That fragments the fs
> totally.
> 

Yes, the typical scenario is VM images with cluster size 1M.
This is an old topic, and I'm afraid it still cannot resolve the above
case even with an optimized local alloc window.

>>
>> This is an issue that happens in a corner case, where the complaint is
>> that the current restore window is too large. I agree with your method
>> of restoring double the la size each time instead of jumping to the
>> default directly. So why not just change the logic of
>> ocfs2_recalc_la_window() to do this?
> 
> This patch relates to two issues:
> - the restore speed is quicker than the decrease speed.
> - the la window is restored unconditionally.
> 
> Only changing ocfs2_recalc_la_window() can't avoid the unconditional
> restore of the la window.

It seems the following would restore double each time?

if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
	osb->local_alloc_bits <<= 1;

This may introduce another complaint, that restore is too slow. So it's
a balance.

Thanks,
Joseph



* Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
  2022-06-12 12:38       ` Joseph Qi via Ocfs2-devel
@ 2022-06-13  1:48         ` heming.zhao--- via Ocfs2-devel
  0 siblings, 0 replies; 10+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-06-13  1:48 UTC (permalink / raw)
  To: Joseph Qi, ocfs2-devel

On 6/12/22 20:38, Joseph Qi wrote:
> 
> 
> On 6/12/22 3:45 PM, heming.zhao@suse.com wrote:
>> On 6/12/22 10:57, Joseph Qi wrote:
>>>
>>>
>>> On 5/21/22 6:14 PM, Heming Zhao wrote:
>>>> [ snip: commit message, quoted in full earlier in the thread ]
>>>
>>> Since now we have enough free space, why not restore to default la
>>> window?
>>
>> The key is that the decrease speed is not the same as the increase speed.
>> The la window decreases when the current la window size is no longer
>> available. But the current restore action happens as soon as any single
>> default_bits-sized space appears.
>> e.g.:
>> The default la window is 100MB, and the current system only has some 20MB
>> contiguous spaces. The la window changes to half size (10MB) when there is
>> no 20MB space any more, but the current restore logic restores it to
>> 100MB, and the allocation path suffers the O(n^4) timing.
>>
>> From my understanding, most ocfs2 users use ocfs2 to manage big, but not
>> huge numbers of, files. But my patch addresses this scenario: an ocfs2
>> volume containing a huge number of small files, with a use case that is
>> creating/deleting/moving small files all the time. That fragments the fs
>> totally.
>>
> 
> Yes, the typical scenario is VM images with cluster size 1M.
> This is an old topic, and I'm afraid it still cannot resolve the above
> case even with an optimized local alloc window.

You're right. The optimized la window helps the small-file case but
can't fully resolve it. A complete solution would require redesigning
something, e.g. the allocation algorithm.
This issue was reported by a SUSE customer who had a bad experience with
frequently occurring poor performance in their production environment.
In my view, if ocfs2 only works well for managing VM images (big files,
not too many of them), it's a waste, even ridiculous.
A powerful fs should work well for files of any size. ocfs2 should
support the small-file use case, but there is a long way to go.

> 
>>>
>>> This is an issue that happens in a corner case, where the complaint is
>>> that the current restore window is too large. I agree with your method
>>> of restoring double the la size each time instead of jumping to the
>>> default directly. So why not just change the logic of
>>> ocfs2_recalc_la_window() to do this?
>>
>> This patch relates to two issues:
>> - the restore speed is quicker than the decrease speed.
>> - the la window is restored unconditionally.
>>
>> Only changing ocfs2_recalc_la_window() can't avoid the unconditional
>> restore of the la window.
> 
> It seems the following would restore double each time?
YES.
> 
> if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
> 	osb->local_alloc_bits <<= 1;
> 
> This may introduce another complaint, that restore is too slow. So it's
> a balance.

Yes, I agree it's a balance, but I believe my patch works better than
the existing restore method. (IIUC) there are some reasons:
1. The truncate feature may delay releasing the freed space, so there is
    a time gap between restoring the la window and the space actually
    becoming available.
2. The existing method suits the big-file use case:
    restoring to the default la window gives a chance to grab big space
    at once. A VM file may be tens of GB in size; a user may free one VM
    file and then create a new one. The free action very likely releases
    enough space for the later create action, so ocfs2 benefits from
    restoring the la window to the default size.
3. The slowed-down restore introduced by this patch helps small-file
    allocation significantly, without regressing ocfs2's overall
    performance.
4. The scenario is described in this patch's commit log. If the la
    window is restored to the default size unconditionally from the
    previous DISABLED state, ocfs2 will be busy waiting on la window
    scanning.
For 3, I give an example (the analysis below is based on the patch code
already merged; the ocfs2 volume size is 1TB).

3.1> csize:1M, group:32G, la:31G
(these numbers come from the OCFS2_LA_MAX_DEFAULT_MB comment in
fs/ocfs2/localalloc.c)

In this case, the ocfs2 volume is used for storing big files. The worst
case is described below (please tell me if this case is not enough).

Current la window: 2MB; freeing a 40GB file changes the la window to 4MB.

The user wants to create a 40GB file. Because the la window is 4MB, it
can successfully get 4MB of contiguous space from the just-released 40GB.
The scanning timing is the same as before, but it only takes 4MB from
the global bitmap. Later slide-window actions are sped up by the saved
'hint' in ocfs2_claim_suballoc_bits(); the timing is not O(n^4), maybe
O(n) here.

In this case, the la window doubles on every la window slide:
   ocfs2_local_alloc_slide_window
    ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE)
     if (osb->local_alloc_state & OCFS2_LA_RESTORE) // 'la window' * 2

So the la window restore sequence is (14 sizes):
4MB,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,default_window(31GB)

It costs an extra 13 la window slides to allocate the space.
But in my view, the ocfs2 hot spot should be the new file's write IOs,
not busy-waiting on la window scanning. The user will very likely never
feel that allocation is slower than before.

3.2> csize:4K, group:126M, la:121M
(see fs/ocfs2/localalloc.c for the above numbers)

The worst case is like <3.1>: allocating a big file.

Current la window: 2MB; freeing a 40GB file changes the la window to 4MB.
The user wants to create a 40GB file. In this case, the la window restore
sequence is (6 sizes):
4MB,8,16,32,64,default_window(121MB)
It costs an extra 5 slides vs before. I'll say the same as above:
waiting on la window scanning is not the hot spot.

/Heming




Thread overview: 10+ messages
2022-05-21 10:14 [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Heming Zhao via Ocfs2-devel
2022-05-21 10:14 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally Heming Zhao via Ocfs2-devel
2022-06-02 10:02   ` Joseph Qi via Ocfs2-devel
2022-06-12  2:57   ` Joseph Qi via Ocfs2-devel
2022-06-12  7:45     ` heming.zhao--- via Ocfs2-devel
2022-06-12 12:38       ` Joseph Qi via Ocfs2-devel
2022-06-13  1:48         ` heming.zhao--- via Ocfs2-devel
2022-06-02  9:34 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path Joseph Qi via Ocfs2-devel
2022-06-04  0:08   ` Heming Zhao via Ocfs2-devel
2022-06-10  7:46     ` Joseph Qi via Ocfs2-devel
