On 3/18/24 7:56 PM, Su Yue wrote:
> From: Su Yue <glass.su@suse.com>
>
> inode ctime should be updated if ocfs2_fileattr_set is called.
>
> Signed-off-by: Su Yue <glass.su@suse.com>

Looks fine.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/ioctl.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
> index b1550ba73f96..71beef7f8a60 100644
> --- a/fs/ocfs2/ioctl.c
> +++ b/fs/ocfs2/ioctl.c
> @@ -125,6 +125,7 @@ int ocfs2_fileattr_set(struct mnt_idmap *idmap,
>
>  	ocfs2_inode->ip_attr = flags;
>  	ocfs2_set_inode_flags(inode);
> +	inode_set_ctime_current(inode);
>
>  	status = ocfs2_mark_inode_dirty(handle, inode, bh);
>  	if (status < 0)
From: Su Yue <glass.su@suse.com>

inode ctime should be updated if ocfs2_fileattr_set is called.

Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/ioctl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index b1550ba73f96..71beef7f8a60 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -125,6 +125,7 @@ int ocfs2_fileattr_set(struct mnt_idmap *idmap,
 
 	ocfs2_inode->ip_attr = flags;
 	ocfs2_set_inode_flags(inode);
+	inode_set_ctime_current(inode);
 
 	status = ocfs2_mark_inode_dirty(handle, inode, bh);
 	if (status < 0)
--
2.44.0
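For context, attribute changes reach ocfs2_fileattr_set() through the VFS fileattr interface (e.g. the FS_IOC_SETFLAGS ioctl). A minimal userspace sketch to observe the fix; the file path and the toggled attribute bit are arbitrary choices for illustration, not from the patch:

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	/* path is an assumption; pass any file on an ocfs2 mount */
	const char *path = argc > 1 ? argv[1] : "/mnt/ocfs2/testfile";
	struct stat before, after;
	int fd, attr;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &before) < 0)
		return 1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
		return 1;
	attr ^= FS_NOATIME_FL;	/* any flag change goes through fileattr_set */
	sleep(1);		/* make a ctime bump clearly visible */
	if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
		return 1;
	fstat(fd, &after);
	printf("ctime %s\n",
	       before.st_ctim.tv_sec != after.st_ctim.tv_sec ||
	       before.st_ctim.tv_nsec != after.st_ctim.tv_nsec ?
	       "updated" : "unchanged");
	close(fd);
	return 0;
}
```

Before the patch this prints "unchanged"; with it, the attribute change bumps ctime as expected.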
On 3/18/24 13:49, Joseph Qi wrote:
>
>
> On 3/18/24 1:29 PM, Heming Zhao wrote:
>> On 3/18/24 09:39, Joseph Qi wrote:
>>>
>>>
>>> On 3/16/24 11:44 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling a huge
>>>> number of small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy looking up an available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contiguous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> Let's examine the path below.
>>>>
>>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>>
>>>> 2. a user deletes a large file, triggering
>>>> ocfs2_local_alloc_seen_free_bits, which sets osb->local_alloc_state
>>>> unconditionally.
>>>>
>>>> 3. a write I/O thread runs and triggers the worst-performance path
>>>>
>>>> ```
>>>> ocfs2_reserve_clusters_with_limit
>>>> ocfs2_reserve_local_alloc_bits
>>>> ocfs2_local_alloc_slide_window //[1]
>>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>>> + ocfs2_local_alloc_new_window //[3]
>>>> ocfs2_recalc_la_window
>>>> ```
>>>>
>>>> [1]:
>>>> will be called when the la window bits are used up.
>>>>
>>>> [2]:
>>>> runs while the la state is ENABLED; since this func only checks the
>>>> global_bitmap free bits, it will succeed in general.
>>>>
>>>> [3]:
>>>> will use the default la window size to search for clusters, then fail.
>>>> ocfs2_recalc_la_window then attempts other la window sizes.
>>>> The time complexity is O(n^4), resulting in a significant time
>>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>>> in write I/Os (e.g., user space 'dd').
>>>>
>>>> i.e.
>>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>>> la window default size: 106MB.
>>>> The partition is fragmented by creating & deleting a huge amount of
>>>> small files.
>>>>
>>>> before this patch, the timing of [3] would be
>>>> (the numbers are taken from the real world):
>>>> - la window size change order (size: MB):
>>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>> only 0.8MB succeeds; 0.8MB also triggers disabling of the la window.
>>>> ocfs2_local_alloc_new_window retries 8 times; the first 7 attempts all
>>>> run in the worst case.
>>>> - group chain number: 242
>>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>>> - each chain has 49 block groups
>>>> ocfs2_search_chain calls while-loop 49 times
>>>> - each bg has 32256 bits
>>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>> since ocfs2_find_next_zero_bit uses ffz() to find a zero bit, let's use
>>>> (32256/64) (this is not the worst value) for the timing calculation.
>>>>
>>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>>
>>>> In the worst case, a user-space write of 1MB of data will trigger 42M
>>>> scan iterations.
>>>>
>>>> under this patch, the count is '7*242*49 = 83006', reduced by nearly
>>>> three orders of magnitude.
>>>>
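The loop counts quoted above are easy to verify mechanically; a trivial standalone sketch using the message's own numbers:

```
#include <stdio.h>

int main(void)
{
	long retries = 7;	 /* failing la window sizes: 106 .. 1.6 MB */
	long chains = 242;	 /* group chain number */
	long groups = 49;	 /* block groups per chain */
	long words = 32256 / 64; /* 64-bit words scanned per group bitmap */

	printf("before: %ld\n", retries * chains * groups * words); /* 41835024 */
	printf("after:  %ld\n", retries * chains * groups);	    /* 83006 */
	return 0;
}
```
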
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>> v2:
>>>> 1. fix the wrong length conversion, from cpu_to_le16() to cpu_to_le32(),
>>>> for setting bg->bg_contig_free_bits.
>>>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>>> 'unsigned int'.
>>>> 3. restore the ocfs2_block_group_set_bits() input parameter style; change
>>>> parameter 'failure_path' to 'fastpath'.
>>>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>>>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>>> 'suballoc.h' to 'suballoc.c'.
>>>> 6. fix some code indentation errors.
>>>>
>>>> ---
>>>> fs/ocfs2/move_extents.c | 2 +-
>>>> fs/ocfs2/ocfs2_fs.h | 4 +-
>>>> fs/ocfs2/resize.c | 7 +++
>>>> fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++----
>>>> fs/ocfs2/suballoc.h | 6 ++-
>>>> 5 files changed, 106 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>>>> --- a/fs/ocfs2/move_extents.c
>>>> +++ b/fs/ocfs2/move_extents.c
>>>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>>> }
>>>> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>>> - goal_bit, len);
>>>> + goal_bit, len, 0, 0);
>>>> if (ret) {
>>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>>> le16_to_cpu(gd->bg_chain));
>>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>>> --- a/fs/ocfs2/ocfs2_fs.h
>>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>>> __le16 bg_free_bits_count; /* Free bits count */
>>>> __le16 bg_chain; /* What chain I am in. */
>>>> /*10*/ __le32 bg_generation;
>>>> - __le32 bg_reserved1;
>>>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>>>
>>> I've found some parts are using le32, while others are using le16.
>>> It seems that le16 is enough here. So why not:
>>> __le16 bg_contig_free_bits;
>>> __le16 bg_reserved1;
>>>
>>> Joseph
>>
>> IIUC, from the comment before ocfs2_la_default_mb(), the max cluster
>> size is 1MB. And from 'struct ocfs2_group_desc', the offset of
>> bg_bitmap is 0x40. Therefore, the max contiguous bit of gd:
>> (1024*1024 - 0x40)*8 = 8388096 (23-bit length, bigger than __le16)
>>
> bg_bits and bg_free_bits_count are already using __le16 now.
You are right.
I rechecked the code. The gd block size should be the block size, not
the cluster size, and the block size cannot be larger than PAGE_SIZE,
so the max value of bg_free_bits_count is (4096-0x40)*8 = 32256, which
fits in a __le16.
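Both bounds are quick to verify (a minimal userspace sketch; 0x40 is the bg_bitmap offset in 'struct ocfs2_group_desc' mentioned above):

```
#include <stdio.h>

int main(void)
{
	unsigned int le16_max = 0xffff;			    /* 65535 */
	unsigned int max_gd_1mb = (1024 * 1024 - 0x40) * 8; /* 1MB gd */
	unsigned int max_gd_4kb = (4096 - 0x40) * 8;	    /* PAGE_SIZE gd */

	printf("1MB gd: %u bits (fits __le16? %d)\n",
	       max_gd_1mb, max_gd_1mb <= le16_max);	    /* 8388096, no */
	printf("4KB gd: %u bits (fits __le16? %d)\n",
	       max_gd_4kb, max_gd_4kb <= le16_max);	    /* 32256, yes */
	return 0;
}
```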
I will modify all the related code to use __le16 instead of __le32.
-Heming
On 3/18/24 11:16 AM, Heming Zhao wrote:
> On 3/18/24 09:26, Joseph Qi wrote:
>>
>>
>> On 3/15/24 8:26 PM, Heming Zhao wrote:
>>> On 3/15/24 15:22, Joseph Qi wrote:
>>>>
>>>>
>>>> On 3/15/24 10:25 AM, Heming Zhao wrote:
>>>>> On 3/15/24 09:09, Joseph Qi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>>>>> This patch resolves 3 journal-related issues:
>>>>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>>>>> protection for modifying dinode.
>>>>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>>>>> journal protection for writing fe->i_size.
>>>>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>>>>> This adjustment ensures that only content requiring protection is
>>>>>>> included in the journal.
>>>>>>>
>>>>>>
>>>>>> Any real issue you've found for above cases?
>>>>>
>>>>> No. I just found these issues when I was reading code.
>>>>>>
>>>>>> Take ocfs2_rollback_alloc_dinode_counts() for example: it will always be
>>>>>> used paired with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>>>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>>>>> already dirtied before. Since they are in the same transaction, I don't
>>>>>> see any problem here.
>>>>>
>>>>> Yes, they share the same transaction.
>>>>> However:
>>>>> In ocfs2_alloc_dinode_update_counts(), a jbd2 access/dirty pair is
>>>>> already used to protect metadata. In my view, jbd2_journal_access starts
>>>>> the protection stage and jbd2_journal_dirty closes it; the rollback
>>>>> function should start a new access/dirty pair to inform jbd2 that the
>>>>> di_bh has been modified again.
>>>>>
>>>> Refer jbd2_journal_dirty_metadata():
>>>>
>>>> /*
>>>> * fastpath, to avoid expensive locking. If this buffer is already
>>>> * on the running transaction's metadata list there is nothing to do.
>>>> * Nobody can take it off again because there is a handle open.
>>>> * I _think_ we're OK here with SMP barriers - a mistaken decision will
>>>> * result in this test being false, so we go in and take the locks.
>>>> */
>>>>
>>>> So IMO, we don't have to access/dirty again since the buffer head is
>>>> already added before.
>>>>
>>>> Joseph
>>>
>>> Thanks for the detailed info. Regarding the jbd2 code, you are right.
>>>
>>> What is your opinion on the other modifications in my patch?
>>> (the modifications to the jbd2 protection scope.)
>>> Under the above access/dirty buffer protection rule, the bh within a
>>> function stays the same, so it seems the jbd2 dirty function could be
>>> called anywhere, regardless of where the buffer_head is last used.
>>> That's counterintuitive.
>>>
>> Which part do you refer to?
>> It seems that all of them are changing the same buffer head, which is
>> dirtied before, so it won't be a problem since the transaction hasn't
>> been committed yet.
>> BTW, dinode size change should be protected under ip_lock, so we can't
>> move it out.
>>
>> Thanks,
>> Joseph
>>
> Thank you for your explanation.
> For my patch, it's wrong to move out fe->i_size from spin_[un]lock()
> area.
>
> But on the above point, I have a different view:
> ->ip_lock is not used to protect the dinode i_size.
>
> From the comment in 'struct ocfs2_inode_info', 'spinlock_t ip_lock'
> is used to protect the items below:
> ```
> /* These fields are protected by ip_lock */
> spinlock_t ip_lock;
> u32 ip_open_count;
> struct list_head ip_io_markers;
> u32 ip_clusters;
>
> u16 ip_dyn_features;
> struct mutex ip_io_mutex; //zhm: should not include this
> u32 ip_flags; /* see below */
> u32 ip_attr; /* inode attributes */
> ```
>
> from the code of ocfs2_block_group_alloc():
> ```
> 737 spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
> 738 OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> 739 fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> 740 le32_to_cpu(fe->i_clusters)));
> 741 spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
> ```
>
> we can see that ->ip_lock is protecting line 738, the assignment to
> "oi->ip_clusters". Because the cluster length is obtained from
> 'fe->i_clusters', if we move line 739 out of the spin_[un]lock() area,
> fe->i_clusters may change, resulting in a different value for fe->i_size.
> So author Mark put both 738 and 739 inside the spin_[un]lock() area. From
> another viewpoint, it appears that Mark thinks fe->i_size should stay
> consistent with oi->ip_clusters.
>
Make sense.

> we can also verify my analysis from ocfs2_mark_inode_dirty().
> In this function, ->ip_lock doesn't protect fe->i_size, because
> the assignment reads data from 'inode' and is not related
> to any 'ocfs2_inode_info' ip_xx items.
>
> ------------------
> Let's discuss the jbd2 access/dirty pairing.
> Because this code has already been running for almost twenty years,
> I accept leaving the existing code unchanged. But it's better to unify
> the jbd2 access/dirty code style.
>
> e.g:
> ocfs2_group_add() & ocfs2_block_group_alloc():
> - jbd2 dirty *before* spin_lock area.
>
> ocfs2_update_last_group_and_inode():
> - jbd2 dirty *after* spin_lock area.
>
> ocfs2_mark_inode_dirty():
> - jbd2 dirty *after* spin_lock area.
>
> Could we unify the code style? For example, by placing the jbd2 dirty
> *after* the spin_lock area.
>
> e.g:
> ```
> @@ -559,13 +559,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
> (input->clusters - input->frees) * cl_bpc);
> le32_add_cpu(&fe->i_clusters, input->clusters);
> - ocfs2_journal_dirty(handle, main_bm_bh);
> -
> spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
> OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
> i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
> + ocfs2_journal_dirty(handle, main_bm_bh);
> ocfs2_update_super_and_backups(main_bm_inode, input->clusters);
> ```
>
The above change looks sane.

Joseph
On 3/18/24 09:26, Joseph Qi wrote:
>
>
> On 3/15/24 8:26 PM, Heming Zhao wrote:
>> On 3/15/24 15:22, Joseph Qi wrote:
>>>
>>>
>>> On 3/15/24 10:25 AM, Heming Zhao wrote:
>>>> On 3/15/24 09:09, Joseph Qi wrote:
>>>>> Hi,
>>>>>
>>>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>>>> This patch resolves 3 journal-related issues:
>>>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>>>> protection for modifying dinode.
>>>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>>>> journal protection for writing fe->i_size.
>>>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>>>> This adjustment ensures that only content requiring protection is
>>>>>> included in the journal.
>>>>>>
>>>>>
>>>>> Any real issue you've found for above cases?
>>>>
>>>> No. I just found these issues when I was reading code.
>>>>>
>>>>> Take ocfs2_rollback_alloc_dinode_counts() for example: it will always be
>>>>> used paired with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>>>> already dirtied before. Since they are in the same transaction, I don't
>>>>> see any problem here.
>>>>
>>>> Yes, they share the same transaction.
>>>> However:
>>>> In ocfs2_alloc_dinode_update_counts(), a jbd2 access/dirty pair is
>>>> already used to protect metadata. In my view, jbd2_journal_access starts
>>>> the protection stage and jbd2_journal_dirty closes it; the rollback
>>>> function should start a new access/dirty pair to inform jbd2 that the
>>>> di_bh has been modified again.
>>>>
>>> Refer jbd2_journal_dirty_metadata():
>>>
>>> /*
>>> * fastpath, to avoid expensive locking. If this buffer is already
>>> * on the running transaction's metadata list there is nothing to do.
>>> * Nobody can take it off again because there is a handle open.
>>> * I _think_ we're OK here with SMP barriers - a mistaken decision will
>>> * result in this test being false, so we go in and take the locks.
>>> */
>>>
>>> So IMO, we don't have to access/dirty again since the buffer head is
>>> already added before.
>>>
>>> Joseph
>>
>> Thanks for the detailed info. Regarding the jbd2 code, you are right.
>>
>> What is your opinion on the other modifications in my patch?
>> (the modifications to the jbd2 protection scope.)
>> Under the above access/dirty buffer protection rule, the bh within a
>> function stays the same, so it seems the jbd2 dirty function could be
>> called anywhere, regardless of where the buffer_head is last used.
>> That's counterintuitive.
>>
> Which part do you refer to?
> It seems that all of them are changing the same buffer head, which is
> dirtied before, so it won't be a problem since the transaction hasn't
> been committed yet.
> BTW, dinode size change should be protected under ip_lock, so we can't
> move it out.
>
> Thanks,
> Joseph
>
Thank you for your explanation.
For my patch, it's wrong to move out fe->i_size from spin_[un]lock()
area.
But on the above point, I have a different view:
->ip_lock is not used to protect the dinode i_size.
From the comment in 'struct ocfs2_inode_info', 'spinlock_t ip_lock'
is used to protect the items below:
```
/* These fields are protected by ip_lock */
spinlock_t ip_lock;
u32 ip_open_count;
struct list_head ip_io_markers;
u32 ip_clusters;
u16 ip_dyn_features;
struct mutex ip_io_mutex; //zhm: should not include this
u32 ip_flags; /* see below */
u32 ip_attr; /* inode attributes */
```
from the code of ocfs2_block_group_alloc():
```
737 spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
738 OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
739 fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
740 le32_to_cpu(fe->i_clusters)));
741 spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
```
we can see that ->ip_lock is protecting line 738, the assignment to
"oi->ip_clusters". Because the cluster length is obtained from
'fe->i_clusters', if we move line 739 out of the spin_[un]lock() area,
fe->i_clusters may change, resulting in a different value for fe->i_size.
So author Mark put both 738 and 739 inside the spin_[un]lock() area. From
another viewpoint, it appears that Mark thinks fe->i_size should stay
consistent with oi->ip_clusters.
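For illustration, this invariant can be modeled in plain C (a toy sketch, not ocfs2 code; the names and the 4KB cluster size are stand-ins): both stores happen inside one critical section, so the derived size always matches the recorded cluster count.

```
#include <pthread.h>
#include <stdio.h>

#define CLUSTER_SIZE 4096ULL		/* assumed cluster size */

static pthread_mutex_t ip_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int fe_i_clusters;	/* stands in for fe->i_clusters */
static unsigned int oi_ip_clusters;	/* stands in for oi->ip_clusters */
static unsigned long long fe_i_size;	/* stands in for fe->i_size */

static void toy_grow(unsigned int add)
{
	pthread_mutex_lock(&ip_lock);
	fe_i_clusters += add;
	oi_ip_clusters = fe_i_clusters;		  /* like line 738 */
	fe_i_size = fe_i_clusters * CLUSTER_SIZE; /* like line 739 */
	pthread_mutex_unlock(&ip_lock);
	/*
	 * If the size computation moved below the unlock, a concurrent
	 * toy_grow() could bump fe_i_clusters in between, and the stored
	 * size would be computed from a value this thread never recorded
	 * in oi_ip_clusters.
	 */
}

int main(void)
{
	toy_grow(16);
	printf("clusters=%u size=%llu\n", oi_ip_clusters, fe_i_size);
	return 0;
}
```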
we can also verify my analysis from ocfs2_mark_inode_dirty().
In this function, ->ip_lock doesn't protect fe->i_size, because
the assignment reads data from 'inode' and is not related
to any 'ocfs2_inode_info' ip_xx items.
------------------
Let's discuss the jbd2 access/dirty pairing.
Because this code has already been running for almost twenty years,
I accept leaving the existing code unchanged. But it's better to unify
the jbd2 access/dirty code style.
e.g:
ocfs2_group_add() & ocfs2_block_group_alloc():
- jbd2 dirty *before* spin_lock area.
ocfs2_update_last_group_and_inode():
- jbd2 dirty *after* spin_lock area.
ocfs2_mark_inode_dirty():
- jbd2 dirty *after* spin_lock area.
Could we unify the code style? For example, by placing the jbd2 dirty
*after* the spin_lock area.
e.g:
```
@@ -559,13 +559,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
(input->clusters - input->frees) * cl_bpc);
le32_add_cpu(&fe->i_clusters, input->clusters);
- ocfs2_journal_dirty(handle, main_bm_bh);
-
spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
+ ocfs2_journal_dirty(handle, main_bm_bh);
ocfs2_update_super_and_backups(main_bm_inode, input->clusters);
```
Thanks,
Heming
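As an aside, the jbd2 fastpath quoted earlier in this thread can be modeled in a few lines (a toy sketch, not jbd2 code): once a buffer is on the running transaction's metadata list, dirtying it again is a no-op, which is why a second access/dirty pair for the same bh in the same transaction adds nothing.

```
#include <stdbool.h>
#include <stdio.h>

struct toy_bh {
	const char *name;
	bool on_tx_list; /* on the running transaction's metadata list */
};

static void toy_journal_access(struct toy_bh *bh)
{
	if (!bh->on_tx_list) {
		bh->on_tx_list = true;
		printf("%s: added to transaction\n", bh->name);
	}
}

static void toy_journal_dirty(struct toy_bh *bh)
{
	/* fastpath: already on the list, nothing to do */
	if (bh->on_tx_list) {
		printf("%s: already on list, nothing to do\n", bh->name);
		return;
	}
	bh->on_tx_list = true;
}

int main(void)
{
	struct toy_bh di_bh = { "di_bh", false };

	toy_journal_access(&di_bh);
	/* ...update the dinode counts, dirty it... */
	toy_journal_dirty(&di_bh);
	/* ...rollback modifies the same bh again, same transaction... */
	toy_journal_dirty(&di_bh); /* no-op: still covered */
	return 0;
}
```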
The group_search function ocfs2_cluster_group_search() should bypass groups with insufficient space to avoid unnecessary searches. This patch is particularly useful when ocfs2 is handling huge number small files, and volume fragmentation is very high. In this case, ocfs2 is busy with looking up available la window from //global_bitmap. This patch introduces a new member in the Group Description (gd) struct called 'bg_contig_free_bits', representing the max contigous free bits in this gd. When ocfs2 allocates a new la window from //global_bitmap, 'bg_contig_free_bits' helps expedite the search process. Let's image below path. 1. la state (->local_alloc_state) is set THROTTLED or DISABLED. 2. when user delete a large file and trigger ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state unconditionally. 3. a write IOs thread run and trigger the worst performance path ``` ocfs2_reserve_clusters_with_limit ocfs2_reserve_local_alloc_bits ocfs2_local_alloc_slide_window //[1] + ocfs2_local_alloc_reserve_for_window //[2] + ocfs2_local_alloc_new_window //[3] ocfs2_recalc_la_window ``` [1]: will be called when la window bits used up. [2]: under la state is ENABLED, and this func only check global_bitmap free bits, it will succeed in general. [3]: will use the default la window size to search clusters then fail. ocfs2_recalc_la_window attempts other la window sizes. the timing complexity is O(n^4), resulting in a significant time cost for scanning global bitmap. This leads to a dramatic slowdown in write I/Os (e.g., user space 'dd'). i.e. an ocfs2 partition size: 1.45TB, cluster size: 4KB, la window default size: 106MB. The partition is fragmentation by creating & deleting huge mount of small files. before this patch, the timing of [3] should be (the number got from real world): - la window size change order (size: MB): 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8 only 0.8MB succeed, 0.8MB also triggers la window to disable. ocfs2_local_alloc_new_window retries 8 times, first 7 times totally runs in worst case. - group chain number: 242 ocfs2_claim_suballoc_bits calls for-loop 242 times - each chain has 49 block group ocfs2_search_chain calls while-loop 49 times - each bg has 32256 blocks ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits. for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use (32256/64) (this is not worst value) for timing calucation. the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times) In the worst case, user space writes 1MB data will trigger 42M scanning times. under this patch, the timing is '7*242*49 = 83006', reduced by three orders of magnitude. Signed-off-by: Heming Zhao <heming.zhao@suse.com> --- v2: 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32() for setting bg->bg_contig_free_bits. 2. change ocfs2_find_max_contig_free_bits return type from 'void' to 'unsigned int'. 3. restore ocfs2_block_group_set_bits() input parameters style, change parameter 'failure_path' to 'fastpath'. 4. after <3>, add new parameter 'unsigned int max_contig_bits'. 5. after <3>, restore define 'struct ocfs2_suballoc_result' from 'suballoc.h' to 'suballoc.c'. 6. modify some code indent error. 
--- fs/ocfs2/move_extents.c | 2 +- fs/ocfs2/ocfs2_fs.h | 4 +- fs/ocfs2/resize.c | 7 +++ fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++---- fs/ocfs2/suballoc.h | 6 ++- 5 files changed, 106 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c index 1f9ed117e78b..f9d6a4f9ca92 100644 --- a/fs/ocfs2/move_extents.c +++ b/fs/ocfs2/move_extents.c @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context, } ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, - goal_bit, len); + goal_bit, len, 0, 0); if (ret) { ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len, le16_to_cpu(gd->bg_chain)); diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 7aebdbf5cc0a..eae18d772e93 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -883,14 +883,14 @@ struct ocfs2_group_desc __le16 bg_free_bits_count; /* Free bits count */ __le16 bg_chain; /* What chain I am in. */ /*10*/ __le32 bg_generation; - __le32 bg_reserved1; + __le32 bg_contig_free_bits; /* max contig free bits length */ __le64 bg_next_group; /* Next group in my list, in blocks */ /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in blocks */ __le64 bg_blkno; /* Offset on disk, in blocks */ /*30*/ struct ocfs2_block_check bg_check; /* Error checking */ - __le64 bg_reserved2; + __le64 bg_reserved1; /*40*/ union { DECLARE_FLEX_ARRAY(__u8, bg_bitmap); struct { diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c index d65d43c61857..2cf3772dd175 100644 --- a/fs/ocfs2/resize.c +++ b/fs/ocfs2/resize.c @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle, u16 cl_bpc = le16_to_cpu(cl->cl_bpc); u16 cl_cpg = le16_to_cpu(cl->cl_cpg); u16 old_bg_clusters; + u32 contig_bits, old_bg_contig_free_bits; trace_ocfs2_update_last_group_and_inode(new_clusters, first_new_cluster); @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle, le16_add_cpu(&group->bg_free_bits_count, -1 * backups); } + contig_bits = ocfs2_find_max_contig_free_bits(group->bg_bitmap, + group->bg_bits, 0); + old_bg_contig_free_bits = group->bg_contig_free_bits; + group->bg_contig_free_bits = cpu_to_le32(contig_bits); + ocfs2_journal_dirty(handle, group_bh); /* update the inode accordingly. 
*/
@@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
le16_add_cpu(&group->bg_free_bits_count, backups);
le16_add_cpu(&group->bg_bits, -1 * num_bits);
le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
+ group->bg_contig_free_bits = old_bg_contig_free_bits;
}
out:
if (ret)
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..2af63124065c 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -50,6 +50,10 @@ struct ocfs2_suballoc_result {
u64 sr_blkno; /* The first allocated block */
unsigned int sr_bit_offset; /* The bit in the bg */
unsigned int sr_bits; /* How many bits we claimed */
+ unsigned int sr_max_contig_bits; /* The length for contiguous
+ * free bits, only available
+ * for cluster group
+ */
};

static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
@@ -1272,6 +1276,26 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
return ret;
}

+int ocfs2_find_max_contig_free_bits(void *bitmap,
+ unsigned int total_bits, int start)
+{
+ int offset, free_bits;
+ unsigned int contig_bits = 0;
+
+ while (start < total_bits) {
+ offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
+ if (offset == total_bits)
+ break;
+
+ start = ocfs2_find_next_bit(bitmap, total_bits, offset);
+ free_bits = start - offset;
+ if (contig_bits < free_bits)
+ contig_bits = free_bits;
+ }
+
+ return contig_bits;
+}
+
static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
struct buffer_head *bg_bh,
unsigned int bits_wanted,
@@ -1280,6 +1304,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
{
void *bitmap;
u16 best_offset, best_size;
+ u16 prev_best_size = 0;
int offset, start, found, status = 0;
struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data;
@@ -1308,6 +1333,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
/* got a zero after some ones */
found = 1;
start = offset + 1;
+ prev_best_size = best_size;
}
if (found > best_size) {
best_size = found;
@@ -1320,6 +1346,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
}
}

+ /* best_size will be allocated, we save prev_best_size */
+ res->sr_max_contig_bits = prev_best_size;
if (best_size) {
res->sr_bit_offset = best_offset;
res->sr_bits = best_size;
@@ -1337,11 +1365,15 @@ int ocfs2_block_group_set_bits(handle_t *handle,
struct ocfs2_group_desc *bg,
struct buffer_head *group_bh,
unsigned int bit_off,
- unsigned int num_bits)
+ unsigned int num_bits,
+ unsigned int max_contig_bits,
+ int fastpath)
{
int status;
void *bitmap = bg->bg_bitmap;
int journal_type = OCFS2_JOURNAL_ACCESS_WRITE;
+ unsigned int start = bit_off + num_bits;
+ unsigned int contig_bits;

/* All callers get the descriptor via
* ocfs2_read_group_descriptor(). Any corruption is a code bug. */
@@ -1373,6 +1405,28 @@ int ocfs2_block_group_set_bits(handle_t *handle,
while(num_bits--)
ocfs2_set_bit(bit_off++, bitmap);

+ /*
+ * this is optimize path, caller set old contig value
+ * in max_contig_bits to bypass finding action.
+ */
+ if (fastpath) {
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else if (ocfs2_is_cluster_bitmap(alloc_inode)) {
+ /*
+ * Usually, the block group bitmap allocates only 1 bit
+ * at a time, while the cluster group allocates n bits
+ * each time. Therefore, we only save the contig bits for
+ * the cluster group.
+ */
+ contig_bits = ocfs2_find_max_contig_free_bits(bitmap,
+ le16_to_cpu(bg->bg_bits), start);
+ if (contig_bits > max_contig_bits)
+ max_contig_bits = contig_bits;
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else {
+ bg->bg_contig_free_bits = 0;
+ }
+
ocfs2_journal_dirty(handle, group_bh);

bail:
@@ -1486,7 +1540,12 @@ static int ocfs2_cluster_group_search(struct inode *inode,

BUG_ON(!ocfs2_is_cluster_bitmap(inode));

- if (gd->bg_free_bits_count) {
+ if (le16_to_cpu(gd->bg_contig_free_bits) &&
+ le16_to_cpu(gd->bg_contig_free_bits) < bits_wanted)
+ return -ENOSPC;
+
+ /* ->bg_contig_free_bits may un-initialized, so compare again */
+ if (le16_to_cpu(gd->bg_free_bits_count) >= bits_wanted) {
max_bits = le16_to_cpu(gd->bg_bits);

/* Tail groups in cluster bitmaps which aren't cpg
@@ -1555,7 +1614,7 @@ static int ocfs2_block_group_search(struct inode *inode,
BUG_ON(min_bits != 1);
BUG_ON(ocfs2_is_cluster_bitmap(inode));

- if (bg->bg_free_bits_count) {
+ if (le16_to_cpu(bg->bg_free_bits_count) >= bits_wanted) {
ret = ocfs2_block_group_find_clear_bits(OCFS2_SB(inode->i_sb),
group_bh, bits_wanted,
le16_to_cpu(bg->bg_bits),
@@ -1715,7 +1774,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
}

ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
- res->sr_bit_offset, res->sr_bits);
+ res->sr_bit_offset, res->sr_bits,
+ res->sr_max_contig_bits, 0);
if (ret < 0) {
ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
res->sr_bits,
@@ -1849,7 +1909,9 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
bg,
group_bh,
res->sr_bit_offset,
- res->sr_bits);
+ res->sr_bits,
+ res->sr_max_contig_bits,
+ 0);
if (status < 0) {
ocfs2_rollback_alloc_dinode_counts(alloc_inode,
ac->ac_bh, res->sr_bits, chain);
@@ -2163,7 +2225,9 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
bg,
bg_bh,
res->sr_bit_offset,
- res->sr_bits);
+ res->sr_bits,
+ res->sr_max_contig_bits,
+ 0);
if (ret < 0) {
ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
ac->ac_bh, res->sr_bits, chain);
@@ -2382,11 +2446,13 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
struct buffer_head *group_bh,
unsigned int bit_off,
unsigned int num_bits,
+ unsigned int max_contig_bits,
void (*undo_fn)(unsigned int bit,
unsigned long *bmap))
{
int status;
unsigned int tmp;
+ unsigned int contig_bits;
struct ocfs2_group_desc *undo_bg = NULL;
struct journal_head *jh;

@@ -2433,6 +2499,20 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
num_bits);
}

+ /*
+ * TODO: even 'num_bits == 1' (the worst case, release 1 cluster),
+ * we still need to rescan whole bitmap.
+ */
+ if (ocfs2_is_cluster_bitmap(alloc_inode)) {
+ contig_bits = ocfs2_find_max_contig_free_bits(bg->bg_bitmap,
+ le16_to_cpu(bg->bg_bits), 0);
+ if (contig_bits > max_contig_bits)
+ max_contig_bits = contig_bits;
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else {
+ bg->bg_contig_free_bits = 0;
+ }
+
if (undo_fn)
spin_unlock(&jh->b_state_lock);

@@ -2459,6 +2539,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
struct ocfs2_chain_list *cl = &fe->id2.i_chain;
struct buffer_head *group_bh = NULL;
struct ocfs2_group_desc *group;
+ u32 old_bg_contig_free_bits = 0;

/* The alloc_bh comes from ocfs2_free_dinode() or
* ocfs2_free_clusters(). The callers have all locked the
@@ -2483,9 +2564,11 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,

BUG_ON((count + start_bit) > le16_to_cpu(group->bg_bits));

+ if (ocfs2_is_cluster_bitmap(alloc_inode))
+ old_bg_contig_free_bits = group->bg_contig_free_bits;
status = ocfs2_block_group_clear_bits(handle, alloc_inode,
group, group_bh,
- start_bit, count, undo_fn);
+ start_bit, count, 0, undo_fn);
if (status < 0) {
mlog_errno(status);
goto bail;
@@ -2496,7 +2579,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
if (status < 0) {
mlog_errno(status);
ocfs2_block_group_set_bits(handle, alloc_inode, group, group_bh,
- start_bit, count);
+ start_bit, count, old_bg_contig_free_bits, 1);
goto bail;
}

diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 9c74eace3adc..429fa12eea11 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -79,12 +79,16 @@ void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
+int ocfs2_find_max_contig_free_bits(void *bitmap,
+ unsigned int total_bits, int start);
int ocfs2_block_group_set_bits(handle_t *handle,
struct inode *alloc_inode,
struct ocfs2_group_desc *bg,
struct buffer_head *group_bh,
unsigned int bit_off,
- unsigned int num_bits);
+ unsigned int num_bits,
+ unsigned int max_contig_bits,
+ int fastpath);

int ocfs2_claim_metadata(handle_t *handle,
struct ocfs2_alloc_context *ac,
-- 
2.35.3
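As a side note for readers of this series: the core of the optimization is the new fast-reject at the top of ocfs2_cluster_group_search() in the hunk above. The sketch below is a hypothetical userspace model of that check (the struct and helper are simplified stand-ins, not the kernel definitions); it shows how a cached max-contiguous-run value lets the allocator skip a fragmented group in O(1) instead of walking its bitmap.

```
/*
 * Hypothetical userspace model of the fast-reject added to
 * ocfs2_cluster_group_search(). 'bg_contig_free_bits' caches the
 * longest contiguous free run in the group; 0 means "not yet
 * initialized" (e.g. a group written by an older kernel), in which
 * case we fall back to the total-free-bits check.
 */
#include <errno.h>
#include <stdio.h>

struct group_desc {
	unsigned int bg_free_bits_count;  /* total free bits in the group */
	unsigned int bg_contig_free_bits; /* max contiguous run, 0 = unknown */
};

static int cluster_group_search(const struct group_desc *gd,
				unsigned int bits_wanted)
{
	/* cached run known and too short: reject without scanning */
	if (gd->bg_contig_free_bits &&
	    gd->bg_contig_free_bits < bits_wanted)
		return -ENOSPC;

	/* cached run unknown (0): at least require enough free bits */
	if (gd->bg_free_bits_count >= bits_wanted)
		return 0;	/* worth scanning this group's bitmap */

	return -ENOSPC;
}

int main(void)
{
	struct group_desc gd = { .bg_free_bits_count = 512,
				 .bg_contig_free_bits = 8 };

	printf("%d\n", cluster_group_search(&gd, 64)); /* -ENOSPC, no scan */
	printf("%d\n", cluster_group_search(&gd, 8));  /* 0, scan the bitmap */
	return 0;
}
```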
On 3/15/24 15:22, Joseph Qi wrote:
>
>
> On 3/15/24 10:25 AM, Heming Zhao wrote:
>> On 3/15/24 09:09, Joseph Qi wrote:
>>> Hi,
>>>
>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>> This patch resolves 3 journal-related issues:
>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>> protection for modifying dinode.
>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>> journal protection for writing fe->i_size.
>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>> This adjustment ensures that only content requiring protection is
>>>> included in the journal.
>>>>
>>>
>>> Any real issue you've found for above cases?
>>
>> No. I just found these issues when I was reading code.
>>>
>>> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
>>> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>> already dirtied before. Since they are in the same transaction, I don't
>>> see any problem here.
>>
>> Yes, they share the same transaction.
>> However:
>> In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
>> pair to protect metadata. In my view, jbd2_journal_access starts the
>> protection stage, jbd2_journal_dirty closes the protection stage. the rollback
>> function should start a new access/dirty pair to info jbd2 the di_bh had been
>> modified again.
>>
> Refer jbd2_journal_dirty_metadata():
>
> /*
> * fastpath, to avoid expensive locking. If this buffer is already
> * on the running transaction's metadata list there is nothing to do.
> * Nobody can take it off again because there is a handle open.
> * I _think_ we're OK here with SMP barriers - a mistaken decision will
> * result in this test being false, so we go in and take the locks.
> */
>
> So IMO, we don't have to access/dirty again since the buffer head is
> already added before.
>
> Joseph
Thanks for the detailed info. Regarding the jbd2 code, you are right.
What is your opinion on the other modifications in my patch
(the ones adjusting the jbd2 protection scope)?
Under the above access/dirty buffer protection rule, the bh is the
same within a function, so it seems the jbd2 dirty function could be
called anywhere, regardless of where the buffer_head is last used.
That is counterintuitive.
-Heming
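For readers following this subthread, the access/dirty pairing under discussion has the shape below. It is a minimal sketch modelled on the rollback helper from the patch later in this thread (same ocfs2 calls and dinode fields as quoted there); it illustrates where the protection window opens and closes, and is not a verbatim copy of the kernel source.

```
/*
 * Sketch of the jbd2 access/dirty pairing, as used by the patch's
 * reworked ocfs2_rollback_alloc_dinode_counts(). The access call opens
 * the protection window for di_bh in this transaction; the dirty call
 * closes it. Per Joseph's point above, jbd2_journal_dirty_metadata()
 * short-circuits when the buffer is already on the running
 * transaction's metadata list, so repeating the pair for the same bh
 * within one handle is harmless but not strictly required.
 */
static void rollback_sketch(handle_t *handle, struct inode *inode,
			    struct buffer_head *di_bh,
			    u32 num_bits, u16 chain)
{
	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
	struct ocfs2_chain_list *cl;
	u32 tmp_used;
	int ret;

	/* open the protection window for di_bh */
	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
				      OCFS2_JOURNAL_ACCESS_WRITE);
	if (ret < 0) {
		mlog_errno(ret);
		return;
	}

	/* modify the buffer while it is journaled */
	cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
	tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
	di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
	le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);

	/* close the window: queue di_bh on the transaction */
	ocfs2_journal_dirty(handle, di_bh);
}
```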
On 3/15/24 10:25 AM, Heming Zhao wrote:
> On 3/15/24 09:09, Joseph Qi wrote:
>> Hi,
>>
>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>> This patch resolves 3 journal-related issues:
>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>> protection for modifying dinode.
>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>> journal protection for writing fe->i_size.
>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>> This adjustment ensures that only content requiring protection is
>>> included in the journal.
>>>
>>
>> Any real issue you've found for above cases?
>
> No. I just found these issues when I was reading code.
>>
>> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
>> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>> already dirtied before. Since they are in the same transaction, I don't
>> see any problem here.
>
> Yes, they share the same transaction.
> However:
> In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
> pair to protect metadata. In my view, jbd2_journal_access starts the
> protection stage, jbd2_journal_dirty closes the protection stage. the rollback
> function should start a new access/dirty pair to info jbd2 the di_bh had been
> modified again.
>
Refer jbd2_journal_dirty_metadata():
/*
* fastpath, to avoid expensive locking. If this buffer is already
* on the running transaction's metadata list there is nothing to do.
* Nobody can take it off again because there is a handle open.
* I _think_ we're OK here with SMP barriers - a mistaken decision will
* result in this test being false, so we go in and take the locks.
*/
So IMO, we don't have to access/dirty again since the buffer head is
already added before.
Joseph
On 3/15/24 09:09, Joseph Qi wrote:
> Hi,
>
> On 3/14/24 3:33 PM, Heming Zhao wrote:
>> This patch resolves 3 journal-related issues:
>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>> protection for modifying dinode.
>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>> journal protection for writing fe->i_size.
>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>> This adjustment ensures that only content requiring protection is
>> included in the journal.
>>
>
> Any real issue you've found for above cases?

No. I just found these issues when I was reading code.

>
> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
> already dirtied before. Since they are in the same transaction, I don't
> see any problem here.

Yes, they share the same transaction.
However:
In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
pair to protect metadata. In my view, jbd2_journal_access starts the
protection stage, jbd2_journal_dirty closes the protection stage. the rollback
function should start a new access/dirty pair to info jbd2 the di_bh had been
modified again.

-Heming
Hi,

On 3/14/24 3:33 PM, Heming Zhao wrote:
> This patch resolves 3 journal-related issues:
> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
> protection for modifying dinode.
> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
> journal protection for writing fe->i_size.
> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
> This adjustment ensures that only content requiring protection is
> included in the journal.
>

Any real issue you've found for above cases?

Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
already dirtied before. Since they are in the same transaction, I don't
see any problem here.

Thanks,
Joseph

> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
> fs/ocfs2/move_extents.c | 2 +-
> fs/ocfs2/resize.c | 7 +++----
> fs/ocfs2/suballoc.c | 25 +++++++++++++++++--------
> fs/ocfs2/suballoc.h | 3 ++-
> 4 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
> index 1f9ed117e78b..a631f9cf0c05 100644
> --- a/fs/ocfs2/move_extents.c
> +++ b/fs/ocfs2/move_extents.c
> @@ -687,7 +687,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
> goal_bit, len);
> if (ret) {
> - ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
> + ocfs2_rollback_alloc_dinode_counts(handle, gb_inode, gb_bh, len,
> le16_to_cpu(gd->bg_chain));
> mlog_errno(ret);
> }
> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
> index d65d43c61857..2c7e3548ae82 100644
> --- a/fs/ocfs2/resize.c
> +++ b/fs/ocfs2/resize.c
> @@ -143,15 +143,14 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
> le32_add_cpu(&cr->c_free, -1 * backups);
> le32_add_cpu(&fe->id1.bitmap1.i_used, backups);
> }
> + le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
> + ocfs2_journal_dirty(handle, bm_bh);
>
> spin_lock(&OCFS2_I(bm_inode)->ip_lock);
> OCFS2_I(bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(bm_inode)->ip_lock);
> i_size_write(bm_inode, le64_to_cpu(fe->i_size));
>
> - ocfs2_journal_dirty(handle, bm_bh);
> -
> out_rollback:
> if (ret < 0) {
> ocfs2_calc_new_backup_super(bm_inode,
> @@ -551,12 +550,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
> le32_add_cpu(&fe->id1.bitmap1.i_used,
> (input->clusters - input->frees) * cl_bpc);
> le32_add_cpu(&fe->i_clusters, input->clusters);
> + le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
>
> ocfs2_journal_dirty(handle, main_bm_bh);
>
> spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
> OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
> i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
>
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..e1168db659df 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -727,17 +727,16 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
> le16_to_cpu(bg->bg_free_bits_count));
> le32_add_cpu(&fe->id1.bitmap1.i_total, le16_to_cpu(bg->bg_bits));
> le32_add_cpu(&fe->i_clusters, le16_to_cpu(cl->cl_cpg));
> -
> + fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> + le32_to_cpu(fe->i_clusters)));
> ocfs2_journal_dirty(handle, bh);
> + ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);
>
> spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
> OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> - le32_to_cpu(fe->i_clusters)));
> spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
> i_size_write(alloc_inode, le64_to_cpu(fe->i_size));
> alloc_inode->i_blocks = ocfs2_inode_sector_count(alloc_inode);
> - ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);
>
> status = 0;
>
> @@ -1601,19 +1600,28 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
> return ret;
> }
>
> -void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
> +void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
> + struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain)
> {
> + int ret;
> u32 tmp_used;
> struct ocfs2_dinode *di = (struct ocfs2_dinode *) di_bh->b_data;
> struct ocfs2_chain_list *cl;
>
> + ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> + OCFS2_JOURNAL_ACCESS_WRITE);
> + if (ret < 0) {
> + mlog_errno(ret);
> + return;
> + }
> cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
> tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
> di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
> le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);
> + ocfs2_journal_dirty(handle, di_bh);
> }
>
> static int ocfs2_bg_discontig_fix_by_rec(struct ocfs2_suballoc_result *res,
> @@ -1717,7 +1725,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
> ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
> res->sr_bit_offset, res->sr_bits);
> if (ret < 0) {
> - ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
> + ocfs2_rollback_alloc_dinode_counts(handle,
> + alloc_inode, ac->ac_bh,
> res->sr_bits,
> le16_to_cpu(gd->bg_chain));
> mlog_errno(ret);
> @@ -1851,7 +1860,7 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
> res->sr_bit_offset,
> res->sr_bits);
> if (status < 0) {
> - ocfs2_rollback_alloc_dinode_counts(alloc_inode,
> + ocfs2_rollback_alloc_dinode_counts(handle, alloc_inode,
> ac->ac_bh, res->sr_bits, chain);
> mlog_errno(status);
> goto bail;
> @@ -2165,7 +2174,7 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
> res->sr_bit_offset,
> res->sr_bits);
> if (ret < 0) {
> - ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
> + ocfs2_rollback_alloc_dinode_counts(handle, ac->ac_inode,
> ac->ac_bh, res->sr_bits, chain);
> mlog_errno(ret);
> goto out;
> diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
> index 9c74eace3adc..0c56ddce0752 100644
> --- a/fs/ocfs2/suballoc.h
> +++ b/fs/ocfs2/suballoc.h
> @@ -75,7 +75,8 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain);
> -void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
> +void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
> + struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain);
On Thu, 14 Mar 2024 09:57:57 -0700
Alison Schofield <alison.schofield@intel.com> wrote:

> On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> >
> > [
> > This is a treewide change. I will likely re-create this patch again in
> > the second week of the merge window of v6.9 and submit it then. Hoping
> > to keep the conflicts that it will cause to a minimum.
> > ]

Note, change of plans. I plan on sending this in the next merge window, as
this merge window I have this patch:

https://lore.kernel.org/linux-trace-kernel/20240312113002.00031668@gandalf.local.home/

That will warn if the source string of __string() is different than the
source string of __assign_str(). I want to make sure they are identical
before just dropping one of them.

> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index bdf117a33744..07ba4e033347 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h
>
> snip to poison
>
> > @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> > ),
> >
> > TP_fast_assign(
> > - __assign_str(memdev, dev_name(&cxlmd->dev));
> > - __assign_str(host, dev_name(cxlmd->dev.parent));
> > + __assign_str(memdev);
> > + __assign_str(host);
>
> I think I get that the above changes work because the TP_STRUCT__entry for
> these did:
> __string(memdev, dev_name(&cxlmd->dev))
> __string(host, dev_name(cxlmd->dev.parent))

That's the point. They have to be identical or you will likely bug.

The __string(name, src) is used to find the string length of src which
allocates the necessary length on the ring buffer. The
__assign_str(name, src) will copy src into the ring buffer.

Similar to:

len = strlen(src);
buf = malloc(len);
strcpy(buf, str);

Where __string() is strlen() and __assign_str() is strcpy(). It doesn't
make sense to use two different strings, and if you did, it would likely
be a bug.

But the magic behind __string() does much more than just get the length of
the string, and it could easily save the pointer to the string (along with
its length) and have it copy that in the __assign_str() call, making the
src parameter of __assign_str() useless.

> > __entry->serial = cxlmd->cxlds->serial;
> > __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> > __entry->dpa = cxl_poison_record_dpa(record);
> > @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> > __entry->trace_type = trace_type;
> > __entry->flags = flags;
> > if (region) {
> > - __assign_str(region, dev_name(&region->dev));
> > + __assign_str(region);
> > memcpy(__entry->uuid, &region->params.uuid, 16);
> > __entry->hpa = cxl_trace_hpa(region, cxlmd,
> > __entry->dpa);
> > } else {
> > - __assign_str(region, "");
> > + __assign_str(region);
> > memset(__entry->uuid, 0, 16);
> > __entry->hpa = ULLONG_MAX;
>
> For the above 2, there was no helper in TP_STRUCT__entry. A recently
> posted patch is fixing that up to be __string(region, NULL) See [1],
> with the actual assignment still happening in TP_fast_assign.

__string(region, NULL) doesn't make sense. It's like:

len = strlen(NULL);
buf = malloc(len);
strcpy(buf, NULL);

??

I'll reply to that email.

-- Steve

> Does that assign logic need to move to the TP_STRUCT__entry definition
> when you merge these changes? I'm not clear how much logic is able to be
> included, ie like 'C' style code in the TP_STRUCT__entry.
>
> [1]
> https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofield@intel.com/
On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> [
> This is a treewide change. I will likely re-create this patch again in
> the second week of the merge window of v6.9 and submit it then. Hoping
> to keep the conflicts that it will cause to a minimum.
> ]
>
> With the rework of how the __string() handles dynamic strings where it
> saves off the source string in field in the helper structure[1], the
> assignment of that value to the trace event field is stored in the helper
> value and does not need to be passed in again.
>
> This means that with:
>
> __string(field, mystring)
>
> Which use to be assigned with __assign_str(field, mystring), no longer
> needs the second parameter and it is unused. With this, __assign_str()
> will now only get a single parameter.
>
> There's over 700 users of __assign_str() and because coccinelle does not
> handle the TRACE_EVENT() macro I ended up using the following sed script:
>
> git grep -l __assign_str | while read a ; do
> sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
> mv /tmp/test-file $a;
> done
>
> I then searched for __assign_str() that did not end with ';' as those
> were multi line assignments that the sed script above would fail to catch.
>
> Note, the same updates will need to be done for:
>
> __assign_str_len()
> __assign_rel_str()
> __assign_rel_str_len()
> __assign_bitmask()
> __assign_rel_bitmask()
> __assign_cpumask()
> __assign_rel_cpumask()
>
> [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> arch/arm64/kernel/trace-events-emulation.h | 2 +-
> arch/powerpc/include/asm/trace.h | 4 +-
> arch/x86/kvm/trace.h | 2 +-
> drivers/base/regmap/trace.h | 18 +--
> drivers/base/trace.h | 2 +-
> drivers/block/rnbd/rnbd-srv-trace.h | 12 +-
> drivers/cxl/core/trace.h | 24 ++--

snip to CXL

> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index bdf117a33744..07ba4e033347 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h

snip to poison

> @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> ),
>
> TP_fast_assign(
> - __assign_str(memdev, dev_name(&cxlmd->dev));
> - __assign_str(host, dev_name(cxlmd->dev.parent));
> + __assign_str(memdev);
> + __assign_str(host);

I think I get that the above changes work because the TP_STRUCT__entry for
these did:
__string(memdev, dev_name(&cxlmd->dev))
__string(host, dev_name(cxlmd->dev.parent))

> __entry->serial = cxlmd->cxlds->serial;
> __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> __entry->dpa = cxl_poison_record_dpa(record);
> @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> __entry->trace_type = trace_type;
> __entry->flags = flags;
> if (region) {
> - __assign_str(region, dev_name(&region->dev));
> + __assign_str(region);
> memcpy(__entry->uuid, &region->params.uuid, 16);
> __entry->hpa = cxl_trace_hpa(region, cxlmd,
> __entry->dpa);
> } else {
> - __assign_str(region, "");
> + __assign_str(region);
> memset(__entry->uuid, 0, 16);
> __entry->hpa = ULLONG_MAX;

For the above 2, there was no helper in TP_STRUCT__entry. A recently
posted patch is fixing that up to be __string(region, NULL) See [1],
with the actual assignment still happening in TP_fast_assign.

Does that assign logic need to move to the TP_STRUCT__entry definition
when you merge these changes? I'm not clear how much logic is able to be
included, ie like 'C' style code in the TP_STRUCT__entry.

[1]
https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofield@intel.com/

Thanks for helping,
Alison

> }
This patch resolves 3 journal-related issues:
1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
protection for modifying dinode.
2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
journal protection for writing fe->i_size.
3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
This adjustment ensures that only content requiring protection is
included in the journal.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
fs/ocfs2/move_extents.c | 2 +-
fs/ocfs2/resize.c | 7 +++----
fs/ocfs2/suballoc.c | 25 +++++++++++++++++--------
fs/ocfs2/suballoc.h | 3 ++-
4 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 1f9ed117e78b..a631f9cf0c05 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -687,7 +687,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
goal_bit, len);
if (ret) {
- ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
+ ocfs2_rollback_alloc_dinode_counts(handle, gb_inode, gb_bh, len,
le16_to_cpu(gd->bg_chain));
mlog_errno(ret);
}
diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
index d65d43c61857..2c7e3548ae82 100644
--- a/fs/ocfs2/resize.c
+++ b/fs/ocfs2/resize.c
@@ -143,15 +143,14 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
le32_add_cpu(&cr->c_free, -1 * backups);
le32_add_cpu(&fe->id1.bitmap1.i_used, backups);
}
+ le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
+ ocfs2_journal_dirty(handle, bm_bh);

spin_lock(&OCFS2_I(bm_inode)->ip_lock);
OCFS2_I(bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(bm_inode)->ip_lock);
i_size_write(bm_inode, le64_to_cpu(fe->i_size));

- ocfs2_journal_dirty(handle, bm_bh);
-
out_rollback:
if (ret < 0) {
ocfs2_calc_new_backup_super(bm_inode,
@@ -551,12 +550,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
le32_add_cpu(&fe->id1.bitmap1.i_used,
(input->clusters - input->frees) * cl_bpc);
le32_add_cpu(&fe->i_clusters, input->clusters);
+ le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);

ocfs2_journal_dirty(handle, main_bm_bh);

spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..e1168db659df 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -727,17 +727,16 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
le16_to_cpu(bg->bg_free_bits_count));
le32_add_cpu(&fe->id1.bitmap1.i_total, le16_to_cpu(bg->bg_bits));
le32_add_cpu(&fe->i_clusters, le16_to_cpu(cl->cl_cpg));
-
+ fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
+ le32_to_cpu(fe->i_clusters)));
ocfs2_journal_dirty(handle, bh);
+ ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);

spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
- le32_to_cpu(fe->i_clusters)));
spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
i_size_write(alloc_inode, le64_to_cpu(fe->i_size));
alloc_inode->i_blocks = ocfs2_inode_sector_count(alloc_inode);
- ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);

status = 0;

@@ -1601,19 +1600,28 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
return ret;
}

-void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
+void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
+ struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain)
{
+ int ret;
u32 tmp_used;
struct ocfs2_dinode *di = (struct ocfs2_dinode *) di_bh->b_data;
struct ocfs2_chain_list *cl;

+ ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
+ OCFS2_JOURNAL_ACCESS_WRITE);
+ if (ret < 0) {
+ mlog_errno(ret);
+ return;
+ }
cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);
+ ocfs2_journal_dirty(handle, di_bh);
}

static int ocfs2_bg_discontig_fix_by_rec(struct ocfs2_suballoc_result *res,
@@ -1717,7 +1725,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
res->sr_bit_offset, res->sr_bits);
if (ret < 0) {
- ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
+ ocfs2_rollback_alloc_dinode_counts(handle,
+ alloc_inode, ac->ac_bh,
res->sr_bits,
le16_to_cpu(gd->bg_chain));
mlog_errno(ret);
@@ -1851,7 +1860,7 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
res->sr_bit_offset,
res->sr_bits);
if (status < 0) {
- ocfs2_rollback_alloc_dinode_counts(alloc_inode,
+ ocfs2_rollback_alloc_dinode_counts(handle, alloc_inode,
ac->ac_bh, res->sr_bits, chain);
mlog_errno(status);
goto bail;
@@ -2165,7 +2174,7 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
res->sr_bit_offset,
res->sr_bits);
if (ret < 0) {
- ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
+ ocfs2_rollback_alloc_dinode_counts(handle, ac->ac_inode,
ac->ac_bh, res->sr_bits, chain);
mlog_errno(ret);
goto out;
diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 9c74eace3adc..0c56ddce0752 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -75,7 +75,8 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
-void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
+void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
+ struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
-- 
2.35.3
On 3/14/24 10:17, Joseph Qi wrote:
> If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so
> check the return value with -1 is meaningless.
> Correct this usage and cleanup the code.
>
> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
> fs/ocfs2/localalloc.c | 19 ++++++-------------
> fs/ocfs2/reservations.c | 2 +-
> fs/ocfs2/suballoc.c | 6 ++----
> 3 files changed, 9 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
> index c803c10dd97e..33aeaaa056d7 100644
> --- a/fs/ocfs2/localalloc.c
> +++ b/fs/ocfs2/localalloc.c
> @@ -863,14 +863,8 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
>
> numfound = bitoff = startoff = 0;
> left = le32_to_cpu(alloc->id1.bitmap1.i_total);
> - while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) != -1) {
> - if (bitoff == left) {
> - /* mlog(0, "bitoff (%d) == left", bitoff); */
> - break;
> - }
> - /* mlog(0, "Found a zero: bitoff = %d, startoff = %d, "
> - "numfound = %d\n", bitoff, startoff, numfound);*/
> -
> + while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) <
> + left) {
> /* Ok, we found a zero bit... is it contig. or do we
> * start over?*/
> if (bitoff == startoff) {
> @@ -976,9 +970,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
> start = count = 0;
> left = le32_to_cpu(alloc->id1.bitmap1.i_total);
>
> - while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
> - != -1) {
> - if ((bit_off < left) && (bit_off == start)) {
> + while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
> + left) {
> + if (bit_off == start) {
> count++;
> start++;
> continue;
> @@ -1002,8 +996,7 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
> goto bail;
> }
> }
> - if (bit_off >= left)
> - break;
> +
> count = 1;
> start = bit_off + 1;
> }
> diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
> index a9d1296d736d..1fe61974d9f0 100644
> --- a/fs/ocfs2/reservations.c
> +++ b/fs/ocfs2/reservations.c
> @@ -414,7 +414,7 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap,
>
> start = search_start;
> while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len,
> - start)) != -1) {
> + start)) < resmap->m_bitmap_len) {
> /* Search reached end of the region */
> if (offset >= (search_start + search_len))
> break;
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..961998415308 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -1290,10 +1290,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
> found = start = best_offset = best_size = 0;
> bitmap = bg->bg_bitmap;
>
> - while((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) != -1) {
> - if (offset == total_bits)
> - break;
> -
> + while ((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) <
> + total_bits) {
> if (!ocfs2_test_bg_bit_allocatable(bg_bh, offset)) {
> /* We found a zero, but we can't use it as it
> * hasn't been put to disk yet! */
I found the same issue when I was writing the IO performance patch.
(So my patch doesn't follow the existing handling style.)
Thank you for taking the time to fix it.
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Thanks,
Heming
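The return-value semantics being fixed here are easy to model in plain C. The demo below is a hypothetical userspace stand-in for ocfs2_find_next_zero_bit() (not the kernel implementation): like the kernel helper, it returns the bitmap size rather than -1 when no zero bit remains, which is why the loop must terminate on '< size' instead of '!= -1'.

```
#include <stdio.h>

/* returns 'size', never -1, when no zero bit exists at or after 'start' */
static unsigned int find_next_zero_bit(const unsigned char *bitmap,
				       unsigned int size, unsigned int start)
{
	for (; start < size; start++)
		if (!(bitmap[start / 8] & (1u << (start % 8))))
			return start;
	return size;
}

int main(void)
{
	unsigned char bitmap[] = { 0xff, 0xbf };	/* only bit 14 is free */
	unsigned int size = 16, start = 0, off;

	/* the corrected idiom from the patch: terminate on '>= size' */
	while ((off = find_next_zero_bit(bitmap, size, start)) < size) {
		printf("free bit at %u\n", off);	/* prints 14 */
		start = off + 1;
	}
	return 0;
}
```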
If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so
check the return value with -1 is meaningless.
Correct this usage and cleanup the code.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
fs/ocfs2/localalloc.c | 19 ++++++-------------
fs/ocfs2/reservations.c | 2 +-
fs/ocfs2/suballoc.c | 6 ++----
3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c803c10dd97e..33aeaaa056d7 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -863,14 +863,8 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,

numfound = bitoff = startoff = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);
- while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) != -1) {
- if (bitoff == left) {
- /* mlog(0, "bitoff (%d) == left", bitoff); */
- break;
- }
- /* mlog(0, "Found a zero: bitoff = %d, startoff = %d, "
- "numfound = %d\n", bitoff, startoff, numfound);*/
-
+ while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) <
+ left) {
/* Ok, we found a zero bit... is it contig. or do we
* start over?*/
if (bitoff == startoff) {
@@ -976,9 +970,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
start = count = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);

- while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
- != -1) {
- if ((bit_off < left) && (bit_off == start)) {
+ while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
+ left) {
+ if (bit_off == start) {
count++;
start++;
continue;
@@ -1002,8 +996,7 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
goto bail;
}
}
- if (bit_off >= left)
- break;
+
count = 1;
start = bit_off + 1;
}
diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
index a9d1296d736d..1fe61974d9f0 100644
--- a/fs/ocfs2/reservations.c
+++ b/fs/ocfs2/reservations.c
@@ -414,7 +414,7 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap,

start = search_start;
while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len,
- start)) != -1) {
+ start)) < resmap->m_bitmap_len) {
/* Search reached end of the region */
if (offset >= (search_start + search_len))
break;
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..961998415308 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1290,10 +1290,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
found = start = best_offset = best_size = 0;
bitmap = bg->bg_bitmap;

- while((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) != -1) {
- if (offset == total_bits)
- break;
-
+ while ((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) <
+ total_bits) {
if (!ocfs2_test_bg_bit_allocatable(bg_bh, offset)) {
/* We found a zero, but we can't use it as it
* hasn't been put to disk yet! */
-- 
2.24.4
On 3/13/24 17:09, Joseph Qi wrote:
>
>
> On 3/13/24 9:34 AM, Heming Zhao wrote:
>> Hi Joseph,
>>
>> Before sending v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
>> for you. See the comment in below.
>>
>> On 3/11/24 19:04, Joseph Qi wrote:
>>>
>>> Hi,
>>>
>>> Please see my comments inline.
>>>
>>> On 3/8/24 9:52 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling huge
>>>> number small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy with looking up available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contigous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> ... ...
>>>> +/* caller should initialize contig_bits */
>>>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>>>> + unsigned int total_bits, int start,
>>>> + unsigned int *contig_bits)
>>>> +{
>>>> + int offset, free_bits;
>>>> +
>>>> + while (start < total_bits) {
>>>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>>>> + if (offset == total_bits)
>>>> + break;
>>>> +
>>>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>>>> + free_bits = start - offset;
>>>> + if (*contig_bits < free_bits)
>>>> + *contig_bits = free_bits;
>>>> + }
>>>
>>> Or we can just return the contig_bits. Something like:
>>> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
>>> unsigned int total_bits, int start)
>>> {
>>> int offset, start;
>>> unsigned int contig_bits = 0;
>>>
>>> while () {
>>> ...
>>> }
>>>
>>> return contig_bits;
>>> }
>>
>> the input *contig_bits has two jobs:
>> - return the max contiguous free bits of the bitmap
>> - if *contig_bits is not ZERO, it keeps the max-contig-bits
>> from the last search. When the input 'start' is not zero
>> (meaning the scan starts from the middle of the bitmap), the
>> max-contig-bits of the latter part may be smaller than that of
>> the first part. In this case, the function should keep & return
>> the input *contig_bits.
>>
>
> IC, I don't think it's a good idea to mix up the 2 responsibilities.
> You've hidden the logic that it may change the input contig_bits (not
> the case of first search).
> You can also implement it something like:
>
> contig_bits = ocfs2_find_max_contig_free_bits();
> if (config_bits > sr_max_contig_bits)
> sr_max_contig_bits = contig_bits;
>
> This will be more explicit for an optimized search.
>
> Thanks,
> Joseph
>
Thanks for the explanation; your code logic is better.
- Heming
On 3/13/24 9:34 AM, Heming Zhao wrote:
> Hi Joseph,
>
> Before sending v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
> for you. See the comment in below.
>
> On 3/11/24 19:04, Joseph Qi wrote:
>>
>> Hi,
>>
>> Please see my comments inline.
>>
>> On 3/8/24 9:52 AM, Heming Zhao wrote:
>>> The group_search function ocfs2_cluster_group_search() should
>>> bypass groups with insufficient space to avoid unnecessary
>>> searches.
>>>
>>> This patch is particularly useful when ocfs2 is handling huge
>>> number small files, and volume fragmentation is very high.
>>> In this case, ocfs2 is busy with looking up available la window
>>> from //global_bitmap.
>>>
>>> This patch introduces a new member in the Group Description (gd)
>>> struct called 'bg_contig_free_bits', representing the max
>>> contigous free bits in this gd. When ocfs2 allocates a new
>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>> expedite the search process.
>>>
>>> Let's image below path.
>>>
>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>
>>> 2. when user delete a large file and trigger
>>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>> unconditionally.
>>>
>>> 3. a write IOs thread run and trigger the worst performance path
>>>
>>> ```
>>> ocfs2_reserve_clusters_with_limit
>>> ocfs2_reserve_local_alloc_bits
>>> ocfs2_local_alloc_slide_window //[1]
>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>> + ocfs2_local_alloc_new_window //[3]
>>> ocfs2_recalc_la_window
>>> ```
>>>
>>> [1]:
>>> will be called when la window bits used up.
>>>
>>> [2]:
>>> under la state is ENABLED, and this func only check global_bitmap
>>> free bits, it will succeed in general.
>>>
>>> [3]:
>>> will use the default la window size to search clusters then fail.
>>> ocfs2_recalc_la_window attempts other la window sizes.
>>> the timing complexity is O(n^4), resulting in a significant time
>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>> in write I/Os (e.g., user space 'dd').
>>>
>>> i.e.
>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>> la window default size: 106MB.
>>> The partition is fragmentation by creating & deleting huge mount of
>>> small files.
>>>
>>> before this patch, the timing of [3] should be
>>> (the number got from real world):
>>> - la window size change order (size: MB):
>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>> runs in worst case.
>>> - group chain number: 242
>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>> - each chain has 49 block group
>>> ocfs2_search_chain calls while-loop 49 times
>>> - each bg has 32256 blocks
>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>> (32256/64) (this is not worst value) for timing calucation.
>>>
>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>
>>> In the worst case, user space writes 1MB data will trigger 42M scanning
>>> times.
>>>
>>> under this patch, the timing is '7*242*49 = 83006', reduce by three
>>> orders of magnitude.
>>>
>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>> ---
>>> fs/ocfs2/move_extents.c | 6 +-
>>> fs/ocfs2/ocfs2_fs.h | 4 +-
>>> fs/ocfs2/resize.c | 7 +++
>>> fs/ocfs2/suballoc.c | 120 ++++++++++++++++++++++++++++------------
>>> fs/ocfs2/suballoc.h | 28 +++++++++-
>>> 5 files changed, 122 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>> index 1f9ed117e78b..6b9753b8c6fb 100644
>>> --- a/fs/ocfs2/move_extents.c
>>> +++ b/fs/ocfs2/move_extents.c
>>> @@ -576,6 +576,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>> u32 move_max_hop = ocfs2_blocks_to_clusters(inode->i_sb,
>>> context->range->me_threshold);
>>> u64 phys_blkno, new_phys_blkno;
>>> + struct ocfs2_suballoc_result res = {0};
>>>
>>> phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
>>>
>>> @@ -684,8 +685,9 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>> goto out_commit;
>>> }
>>>
>>> - ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>> - goal_bit, len);
>>> + res.sr_bit_offset = goal_bit;
>>> + res.sr_bits = len;
>>> + ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, &res, 0);
>>> if (ret) {
>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>> le16_to_cpu(gd->bg_chain));
>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>> --- a/fs/ocfs2/ocfs2_fs.h
>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>> __le16 bg_free_bits_count; /* Free bits count */
>>> __le16 bg_chain; /* What chain I am in. */
>>> /*10*/ __le32 bg_generation;
>>> - __le32 bg_reserved1;
>>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>>> __le64 bg_next_group; /* Next group in my list, in blocks */
>>> /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in blocks */
>>> __le64 bg_blkno; /* Offset on disk, in blocks */
>>> /*30*/ struct ocfs2_block_check bg_check; /* Error checking */
>>> - __le64 bg_reserved2;
>>> + __le64 bg_reserved1;
>>> /*40*/ union {
>>> DECLARE_FLEX_ARRAY(__u8, bg_bitmap);
>>> struct {
>>> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
>>> index d65d43c61857..ab30518d5f04 100644
>>> --- a/fs/ocfs2/resize.c
>>> +++ b/fs/ocfs2/resize.c
>>> @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>>> u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>>> u16 old_bg_clusters;
>>> + u32 contig_bits = 0, old_bg_contig_free_bits;
>>>
>>> trace_ocfs2_update_last_group_and_inode(new_clusters, first_new_cluster);
>>>
>>> @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>>> }
>>>
>>> + ocfs2_find_max_contig_free_bits(group->bg_bitmap,
>>> + group->bg_bits, 0, &contig_bits);
>>> + old_bg_contig_free_bits = group->bg_contig_free_bits;
>>> + group->bg_contig_free_bits = cpu_to_le32(contig_bits);
>>> +
>>> ocfs2_journal_dirty(handle, group_bh);
>>>
>>> /* update the inode accordingly. */
>>> @@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> le16_add_cpu(&group->bg_free_bits_count, backups);
>>> le16_add_cpu(&group->bg_bits, -1 * num_bits);
>>> le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
>>> + group->bg_contig_free_bits = old_bg_contig_free_bits;
>>> }
>>> out:
>>> if (ret)
>>> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
>>> index 166c8918c825..df149c3b0eed 100644
>>> --- a/fs/ocfs2/suballoc.c
>>> +++ b/fs/ocfs2/suballoc.c
>>> @@ -37,21 +37,6 @@
>>>
>>> #define OCFS2_MAX_TO_STEAL 1024
>>>
>>> -struct ocfs2_suballoc_result {
>>> - u64 sr_bg_blkno; /* The bg we allocated from. Set
>>> - to 0 when a block group is
>>> - contiguous. */
>>> - u64 sr_bg_stable_blkno; /*
>>> - * Doesn't change, always
>>> - * set to target block
>>> - * group descriptor
>>> - * block.
>>> - */
>>> - u64 sr_blkno; /* The first allocated block */
>>> - unsigned int sr_bit_offset; /* The bit in the bg */
>>> - unsigned int sr_bits; /* How many bits we claimed */
>>> -};
>>> -
>>> static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
>>> {
>>> if (res->sr_blkno == 0)
>>> @@ -1272,6 +1257,25 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>>> return ret;
>>> }
>>>
>>> +/* caller should initialize contig_bits */
>>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>>> + unsigned int total_bits, int start,
>>> + unsigned int *contig_bits)
>>> +{
>>> + int offset, free_bits;
>>> +
>>> + while (start < total_bits) {
>>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>>> + if (offset == total_bits)
>>> + break;
>>> +
>>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>>> + free_bits = start - offset;
>>> + if (*contig_bits < free_bits)
>>> + *contig_bits = free_bits;
>>> + }
>>
>> Or we can just return the contig_bits. Something like:
>> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
>> unsigned int total_bits, int start)
>> {
>> int offset, start;
>> unsigned int contig_bits = 0;
>>
>> while () {
>> ...
>> }
>>
>> return contig_bits;
>> }
>
> the input *contig_bits has two jobs:
> - return the max contigous free bits of bitmap
> - if *contig_bits is not ZERO, it keeps the max-contig-bits
> from last search. When the input 'start' not zero
> (means start from middle of bitmap), the max-contig-bits
> from latter part of bitmap may smaller than first part.
> at this case, function should keep & return the input
> *contig_bits.
>

IC, I don't think it's a good idea to mix up the 2 responsibilities.
You've hidden the logic that it may change the input contig_bits (not
the case of first search).
You can also implement it something like:

contig_bits = ocfs2_find_max_contig_free_bits();
if (config_bits > sr_max_contig_bits)
sr_max_contig_bits = contig_bits;

This will be more explicit for an optimized search.

Thanks,
Joseph

> the view of ocfs2_find_max_contig_free_bits():
>
> front part | latter part
> [-------------+------------------] bitmap
> 0 'start' total_bits
>
> the call stack:
>
> ocfs2_search_one_group
> + ac->ac_group_search
> | ocfs2_cluster_group_search
> | ocfs2_block_group_find_clear_bits
> | + scan area [0, res->sr_bit_offset + res->sr_bits - 1]
> | | /* zhm: save the second max-contig-bits.
> | | * if it doesn't exist, the value is 0. */
> | + res->sr_max_contig_bits = prev_best_size;
> |
> + ocfs2_block_group_set_bits
> ocfs2_find_max_contig_free_bits(...,
> start, /* zhm: may not ZERO, help to speed up the search */
> res->sr_max_contig_bits) /* zhm: the optimal value may have been reached */
>
> So we should keep ocfs2_find_max_contig_free_bits() with a return type of 'void'.
>
Hi Joseph,
Before sending the v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
to you. See the comments inline below.
On 3/11/24 19:04, Joseph Qi wrote:
>
> Hi,
>
> Please see my comments inline.
>
> On 3/8/24 9:52 AM, Heming Zhao wrote:
>> The group_search function ocfs2_cluster_group_search() should
>> bypass groups with insufficient space to avoid unnecessary
>> searches.
>>
>> This patch is particularly useful when ocfs2 is handling huge
>> number small files, and volume fragmentation is very high.
>> In this case, ocfs2 is busy with looking up available la window
>> from //global_bitmap.
>>
>> This patch introduces a new member in the Group Description (gd)
>> struct called 'bg_contig_free_bits', representing the max
>> contigous free bits in this gd. When ocfs2 allocates a new
>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>> expedite the search process.
>>
>> Let's image below path.
>>
>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>
>> 2. when user delete a large file and trigger
>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>> unconditionally.
>>
>> 3. a write IOs thread run and trigger the worst performance path
>>
>> ```
>> ocfs2_reserve_clusters_with_limit
>> ocfs2_reserve_local_alloc_bits
>> ocfs2_local_alloc_slide_window //[1]
>> + ocfs2_local_alloc_reserve_for_window //[2]
>> + ocfs2_local_alloc_new_window //[3]
>> ocfs2_recalc_la_window
>> ```
>>
>> [1]:
>> will be called when la window bits used up.
>>
>> [2]:
>> under la state is ENABLED, and this func only check global_bitmap
>> free bits, it will succeed in general.
>>
>> [3]:
>> will use the default la window size to search clusters then fail.
>> ocfs2_recalc_la_window attempts other la window sizes.
>> the timing complexity is O(n^4), resulting in a significant time
>> cost for scanning global bitmap. This leads to a dramatic slowdown
>> in write I/Os (e.g., user space 'dd').
>>
>> i.e.
>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>> la window default size: 106MB.
>> The partition is fragmentation by creating & deleting huge mount of
>> small files.
>>
>> before this patch, the timing of [3] should be
>> (the number got from real world):
>> - la window size change order (size: MB):
>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>> runs in worst case.
>> - group chain number: 242
>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>> - each chain has 49 block group
>> ocfs2_search_chain calls while-loop 49 times
>> - each bg has 32256 blocks
>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>> (32256/64) (this is not worst value) for timing calucation.
>>
>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>
>> In the worst case, user space writes 1MB data will trigger 42M scanning
>> times.
>>
>> under this patch, the timing is '7*242*49 = 83006', reduce by three
>> orders of magnitude.
>>
>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>> ---
>> fs/ocfs2/move_extents.c | 6 +-
>> fs/ocfs2/ocfs2_fs.h | 4 +-
>> fs/ocfs2/resize.c | 7 +++
>> fs/ocfs2/suballoc.c | 120 ++++++++++++++++++++++++++++------------
>> fs/ocfs2/suballoc.h | 28 +++++++++-
>> 5 files changed, 122 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>> index 1f9ed117e78b..6b9753b8c6fb 100644
>> --- a/fs/ocfs2/move_extents.c
>> +++ b/fs/ocfs2/move_extents.c
>> @@ -576,6 +576,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>> u32 move_max_hop = ocfs2_blocks_to_clusters(inode->i_sb,
>> context->range->me_threshold);
>> u64 phys_blkno, new_phys_blkno;
>> + struct ocfs2_suballoc_result res = {0};
>>
>> phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
>>
>> @@ -684,8 +685,9 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>> goto out_commit;
>> }
>>
>> - ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>> - goal_bit, len);
>> + res.sr_bit_offset = goal_bit;
>> + res.sr_bits = len;
>> + ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, &res, 0);
>> if (ret) {
>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>> le16_to_cpu(gd->bg_chain));
>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>> index 7aebdbf5cc0a..eae18d772e93 100644
>> --- a/fs/ocfs2/ocfs2_fs.h
>> +++ b/fs/ocfs2/ocfs2_fs.h
>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>> __le16 bg_free_bits_count; /* Free bits count */
>> __le16 bg_chain; /* What chain I am in. */
>> /*10*/ __le32 bg_generation;
>> - __le32 bg_reserved1;
>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>> __le64 bg_next_group; /* Next group in my list, in
>> blocks */
>> /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in
>> blocks */
>> __le64 bg_blkno; /* Offset on disk, in blocks */
>> /*30*/ struct ocfs2_block_check bg_check; /* Error checking */
>> - __le64 bg_reserved2;
>> + __le64 bg_reserved1;
>> /*40*/ union {
>> DECLARE_FLEX_ARRAY(__u8, bg_bitmap);
>> struct {
>> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
>> index d65d43c61857..ab30518d5f04 100644
>> --- a/fs/ocfs2/resize.c
>> +++ b/fs/ocfs2/resize.c
>> @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>> u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>> u16 old_bg_clusters;
>> + u32 contig_bits = 0, old_bg_contig_free_bits;
>>
>> trace_ocfs2_update_last_group_and_inode(new_clusters,
>> first_new_cluster);
>> @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>> }
>>
>> + ocfs2_find_max_contig_free_bits(group->bg_bitmap,
>> + group->bg_bits, 0, &contig_bits);
>> + old_bg_contig_free_bits = group->bg_contig_free_bits;
>> + group->bg_contig_free_bits = cpu_to_le32(contig_bits);
>> +
>> ocfs2_journal_dirty(handle, group_bh);
>>
>> /* update the inode accordingly. */
>> @@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> le16_add_cpu(&group->bg_free_bits_count, backups);
>> le16_add_cpu(&group->bg_bits, -1 * num_bits);
>> le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
>> + group->bg_contig_free_bits = old_bg_contig_free_bits;
>> }
>> out:
>> if (ret)
>> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
>> index 166c8918c825..df149c3b0eed 100644
>> --- a/fs/ocfs2/suballoc.c
>> +++ b/fs/ocfs2/suballoc.c
>> @@ -37,21 +37,6 @@
>>
>> #define OCFS2_MAX_TO_STEAL 1024
>>
>> -struct ocfs2_suballoc_result {
>> - u64 sr_bg_blkno; /* The bg we allocated from. Set
>> - to 0 when a block group is
>> - contiguous. */
>> - u64 sr_bg_stable_blkno; /*
>> - * Doesn't change, always
>> - * set to target block
>> - * group descriptor
>> - * block.
>> - */
>> - u64 sr_blkno; /* The first allocated block */
>> - unsigned int sr_bit_offset; /* The bit in the bg */
>> - unsigned int sr_bits; /* How many bits we claimed */
>> -};
>> -
>> static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
>> {
>> if (res->sr_blkno == 0)
>> @@ -1272,6 +1257,25 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>> return ret;
>> }
>>
>> +/* caller should initialize contig_bits */
>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>> + unsigned int total_bits, int start,
>> + unsigned int *contig_bits)
>> +{
>> + int offset, free_bits;
>> +
>> + while (start < total_bits) {
>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>> + if (offset == total_bits)
>> + break;
>> +
>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>> + free_bits = start - offset;
>> + if (*contig_bits < free_bits)
>> + *contig_bits = free_bits;
>> + }
>
> Or we can just return the contig_bits. Something like:
> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
> unsigned int total_bits, int start)
> {
> int offset, start;
> unsigned int contig_bits = 0;
>
> while () {
> ...
> }
>
> return contig_bits;
> }
the input *contig_bits has two jobs:
- return the max contiguous free bits of the bitmap
- if *contig_bits is not ZERO, it keeps the max-contig-bits
from the last search. When the input 'start' is not zero
(meaning the scan starts from the middle of the bitmap), the
max-contig-bits of the latter part may be smaller than that of the
first part. In this case, the function should keep & return the
input *contig_bits.
the view of ocfs2_find_max_contig_free_bits():
front part | latter part
[-------------+------------------] bitmap
0 'start' total_bits
the call stack:
ocfs2_search_one_group
+ ac->ac_group_search
| ocfs2_cluster_group_search
| ocfs2_block_group_find_clear_bits
| + scan area [0, res->sr_bit_offset + res->sr_bits - 1]
| | /* zhm: save the second max-contig-bits.
| | * if it doesn't exist, the value is 0. */
| + res->sr_max_contig_bits = prev_best_size;
|
+ ocfs2_block_group_set_bits
ocfs2_find_max_contig_free_bits(...,
start, /* zhm: may not ZERO, help to speed up the search */
res->sr_max_contig_bits) /* zhm: the optimal value may have been reached */
So we should keep ocfs2_find_max_contig_free_bits() with a return type of 'void'.
-Heming
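For comparison, here is a hedged userspace model of the v2 shape Joseph proposed: the helper returns the longest free run in [start, total_bits), and the caller, not the helper, merges it with the best run seen so far. The bit helpers are simplified stand-ins for ocfs2_find_next_zero_bit()/ocfs2_find_next_bit(), not kernel code.

```
#include <stdio.h>

/* find the next bit with value 'want' (0 or 1); returns 'total' if none */
static unsigned int find_next(const unsigned char *bm, unsigned int total,
			      unsigned int start, int want)
{
	for (; start < total; start++) {
		int set = !!(bm[start / 8] & (1u << (start % 8)));
		if (set == want)
			return start;
	}
	return total;
}

/* v2 shape: return the longest zero run in [start, total) */
static unsigned int find_max_contig_free_bits(const unsigned char *bm,
					      unsigned int total,
					      unsigned int start)
{
	unsigned int offset, contig = 0;

	while (start < total) {
		offset = find_next(bm, total, start, 0);
		if (offset == total)
			break;
		start = find_next(bm, total, offset, 1);
		if (start - offset > contig)
			contig = start - offset;
	}
	return contig;
}

int main(void)
{
	/* bits 0-7 used, 8-10 free, bit 11 used, 12-15 free */
	unsigned char bm[] = { 0xff, 0x08 };
	unsigned int max = 3;	/* best run already known from the front part */
	unsigned int contig;

	/* rescan only the latter part, then merge in the caller */
	contig = find_max_contig_free_bits(bm, 16, 8);
	if (contig > max)
		max = contig;
	printf("max contig free bits: %u\n", max);	/* prints 4 */
	return 0;
}
```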
On Tue, Mar 12, 2024 at 03:55:46PM -0400, Alexander Aring wrote:
> Then you can reply to the mail with:
>
> Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>
>
> that would help to bring the fix upstream.

Sure, replied to the other thread.

> I see, I currently have plans/ideas to add namespace support into
> kernel DLM that might help you by extending your testing. The general
> idea would be that you can do something like:
>
> ...
> ip netns exec node1 mount /dev/fake_shared_block /mnt/node1
> ip netns exec node2 mount /dev/fake_shared_block /mnt/node2
> ip netns exec node3 mount /dev/fake_shared_block /mnt/node3
> ...

Yeah, that would definitely be cool. And thanks for the pointers, I will
definitely try to use namespaces for some simpler corosync tests first.

> It also requires a ocfs2-tools tool at a specific location:
>
> /sbin/ocfs2_hb_ctl
>
> maybe this is your issue why it wasn't working. My kernel was
> complaining about it because it was missing in /sbin/

That binary is already in the Debian package. I think the error fsck
reported was something like "Could not create domain".

-- 
Valentin
Hi,
On Tue, Mar 12, 2024 at 3:55 PM Alexander Aring <aahringo@redhat.com> wrote:
>
> Hi,
>
> On Tue, Mar 12, 2024 at 2:37 PM Valentin Vidić
> <vvidic@valentin-vidic.from.hr> wrote:
> >
> > On Tue, Mar 12, 2024 at 01:55:54PM -0400, Alexander Aring wrote:
> > > I was able to reproduce the issue and sent a fix out.
> >
> > Yay, thanks a lot for testing, I can confirm the patches fix the problem
> > for me!
> >
>
> Then you can reply to the mail with:
>
> Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>
>
> that would help to bring the fix upstream.
>
> > > If you want to extend your debian testing using different approach
> > > (not libdlm) this can be done by:
> > >
> > > mount -t ocfs2_dlmfs none /dlm
> >
> > I tried that but fsck than fails to make a lock, so I think ocfs2_dlmfs
> > is only relevant when used with o2cb cluster type instead of corosync.
> >
>
> Yes, I think you are right. It is a replacement for
> corosync/dlm_controld handling. It seems I had both setup working at
> the same time, but it still uses kernel DLM locking.
It also requires a ocfs2-tools tool at a specific location:
/sbin/ocfs2_hb_ctl
maybe this is your issue why it wasn't working. My kernel was
complaining about it because it was missing in /sbin/
- Alex
Hi,

On Tue, Mar 12, 2024 at 2:37 PM Valentin Vidić
<vvidic@valentin-vidic.from.hr> wrote:
>
> On Tue, Mar 12, 2024 at 01:55:54PM -0400, Alexander Aring wrote:
> > I was able to reproduce the issue and sent a fix out.
>
> Yay, thanks a lot for testing, I can confirm the patches fix the problem
> for me!
>

Then you can reply to the mail with:

Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>

that would help to bring the fix upstream.

> > If you want to extend your debian testing using different approach
> > (not libdlm) this can be done by:
> >
> > mount -t ocfs2_dlmfs none /dlm
>
> I tried that but fsck than fails to make a lock, so I think ocfs2_dlmfs
> is only relevant when used with o2cb cluster type instead of corosync.
>

Yes, I think you are right. It is a replacement for
corosync/dlm_controld handling. It seems I had both setup working at
the same time, but it still uses kernel DLM locking.

> > I am curious about your testing at debian cluster, do you also test gfs2?
>
> Yes, a similar test for gfs2 would be this:
>
> https://salsa.debian.org/ha-team/gfs2-utils/-/blob/master/debian/tests/corosync?ref_type=heads
>
> It is not really a cluster test since it only runs on one KVM
> machine. But as a package test it works well because even for
> 1-node cluster all the code (userspace and kernel) still needs
> to work correctly.
>

I see, I currently have plans/ideas to add namespace support into
kernel DLM that might help you by extending your testing. The general
idea would be that you can do something like:

...
ip netns exec node1 mount /dev/fake_shared_block /mnt/node1
ip netns exec node2 mount /dev/fake_shared_block /mnt/node2
ip netns exec node3 mount /dev/fake_shared_block /mnt/node3
...

I hope that somehow explains the idea behind it. Setup your cluster
networking with namespaces, e.g. selftests of Linux kernel is doing it
[0] and the DLM lockspaces are separated by net namespaces*. It
"should" run like a cluster with several machines, it is only to do
more testing and we don't need to synchronize tests over the network.
However it will _not_ replace real cluster testing but it is a
starting point to make testing "simpler". Somewhere is even a big fat
"software" switch you can run wireshark on it to see all cluster
traffic going on.

* In theory I am only abusing net namespaces for separate DLM
lockspaces, but I don't want to introduce another kind of namespace
now. However net namespaces fit here because it is necessary for a per
node networking setup anyway.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/icmp.sh