From: Heming Zhao <heming.zhao@suse.com>
To: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev, ailiop@suse.com
Subject: Re: [PATCH v4 1/3] ocfs2: improve write IO performance when fragmentation is high
Date: Thu, 28 Mar 2024 09:55:08 +0800 [thread overview]
Message-ID: <6be36d71-09b1-476b-8698-38d0a8821457@suse.com> (raw)
In-Reply-To: <8bbdaca3-cf86-40da-91a6-e3dd1405e495@linux.alibaba.com>
On 3/28/24 09:50, Joseph Qi wrote:
>
>
> On 3/27/24 8:54 PM, Heming Zhao wrote:
>> On 3/27/24 18:58, Joseph Qi wrote:
>>>
>>>
>>> On 3/27/24 4:21 PM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling huge
>>>> number small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy with looking up available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contigous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> Let's image below path.
>>>>
>>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>>
>>>> 2. when user delete a large file and trigger
>>>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>>> unconditionally.
>>>>
>>>> 3. a write IOs thread run and trigger the worst performance path
>>>>
>>>> ```
>>>> ocfs2_reserve_clusters_with_limit
>>>> ocfs2_reserve_local_alloc_bits
>>>> ocfs2_local_alloc_slide_window //[1]
>>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>>> + ocfs2_local_alloc_new_window //[3]
>>>> ocfs2_recalc_la_window
>>>> ```
>>>>
>>>> [1]:
>>>> will be called when la window bits used up.
>>>>
>>>> [2]:
>>>> under la state is ENABLED, and this func only check global_bitmap
>>>> free bits, it will succeed in general.
>>>>
>>>> [3]:
>>>> will use the default la window size to search clusters then fail.
>>>> ocfs2_recalc_la_window attempts other la window sizes.
>>>> the timing complexity is O(n^4), resulting in a significant time
>>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>>> in write I/Os (e.g., user space 'dd').
>>>>
>>>> i.e.
>>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>>> la window default size: 106MB.
>>>> The partition is fragmentation by creating & deleting huge mount of
>>>> small files.
>>>>
>>>> before this patch, the timing of [3] should be
>>>> (the number got from real world):
>>>> - la window size change order (size: MB):
>>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>>> runs in worst case.
>>>> - group chain number: 242
>>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>>> - each chain has 49 block group
>>>> ocfs2_search_chain calls while-loop 49 times
>>>> - each bg has 32256 blocks
>>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>>> (32256/64) (this is not worst value) for timing calucation.
>>>>
>>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>>
>>>> In the worst case, user space writes 1MB data will trigger 42M scanning
>>>> times.
>>>>
>>>> under this patch, the timing is '7*242*49 = 83006', reduced by three
>>>> orders of magnitude.
>>>>
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>> v4:
>>>> fix sparse warning:
>>>> - in ocfs2_update_last_group_and_inode(), change
>>>> 'old_bg_contig_free_bits' type from 'u16' to '__le16'.
>>>> - in _ocfs2_free_suballoc_bits(), do le16_to_cpu convert
>>>> for assigning 'old_bg_contig_free_bits'.
>>>>
>>>> v3:
>>>> 1. Fix wrong var length for 'struct ocfs2_group_desc'
>>>> .bg_contig_free_bits, change from '__le32' to '__le16'.
>>>> 2. change all related code to use '__le16' instead of '__le32'.
>>>> Please note: change ocfs2_find_max_contig_free_bits() input
>>>> parameters and return type to 'u16'.
>>>>
>>>> v2:
>>>> 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32()
>>>> for setting bg->bg_contig_free_bits.
>>>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>>> 'unsigned int'.
>>>> 3. restore ocfs2_block_group_set_bits() input parameters style, change
>>>> parameter 'failure_path' to 'fastpath'.
>>>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>>>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>>> 'suballoc.h' to 'suballoc.c'.
>>>> 6. modify some code indent error.
>>>> ---
>>>> fs/ocfs2/move_extents.c | 2 +-
>>>> fs/ocfs2/ocfs2_fs.h | 3 +-
>>>> fs/ocfs2/resize.c | 8 ++++
>>>> fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++----
>>>> fs/ocfs2/suballoc.h | 6 ++-
>>>> 5 files changed, 107 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>>>> --- a/fs/ocfs2/move_extents.c
>>>> +++ b/fs/ocfs2/move_extents.c
>>>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>>> }
>>>> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>>> - goal_bit, len);
>>>> + goal_bit, len, 0, 0);
>>>> if (ret) {
>>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>>> le16_to_cpu(gd->bg_chain));
>>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>>> index 7aebdbf5cc0a..c93689b568fe 100644
>>>> --- a/fs/ocfs2/ocfs2_fs.h
>>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>>> @@ -883,7 +883,8 @@ struct ocfs2_group_desc
>>>> __le16 bg_free_bits_count; /* Free bits count */
>>>> __le16 bg_chain; /* What chain I am in. */
>>>> /*10*/ __le32 bg_generation;
>>>> - __le32 bg_reserved1;
>>>> + __le16 bg_contig_free_bits; /* max contig free bits length */
>>>> + __le16 bg_reserved1;
>>>> __le64 bg_next_group; /* Next group in my list, in
>>>> blocks */
>>>> /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in
>>>> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
>>>> index d65d43c61857..c4a4016d3866 100644
>>>> --- a/fs/ocfs2/resize.c
>>>> +++ b/fs/ocfs2/resize.c
>>>> @@ -91,6 +91,8 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>>> u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>>>> u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>>>> u16 old_bg_clusters;
>>>> + u16 contig_bits;
>>>> + __le16 old_bg_contig_free_bits;
>>>> trace_ocfs2_update_last_group_and_inode(new_clusters,
>>>> first_new_cluster);
>>>> @@ -122,6 +124,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>>> le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>>>> }
>>>> + contig_bits = ocfs2_find_max_contig_free_bits(group->bg_bitmap,
>>>> + le16_to_cpu(group->bg_bits), 0);
>>>> + old_bg_contig_free_bits = group->bg_contig_free_bits;
>>>> + group->bg_contig_free_bits = cpu_to_le16(contig_bits);
>>>> +
>>>> ocfs2_journal_dirty(handle, group_bh);
>>>> /* update the inode accordingly. */
>>>> @@ -160,6 +167,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>>> le16_add_cpu(&group->bg_free_bits_count, backups);
>>>> le16_add_cpu(&group->bg_bits, -1 * num_bits);
>>>> le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
>>>> + group->bg_contig_free_bits = old_bg_contig_free_bits;
>>>> }
>>>> out:
>>>> if (ret)
>>>> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
>>>> index 166c8918c825..6fd67c8da9fe 100644
>>>> --- a/fs/ocfs2/suballoc.c
>>>> +++ b/fs/ocfs2/suballoc.c
>>>> @@ -50,6 +50,10 @@ struct ocfs2_suballoc_result {
>>>> u64 sr_blkno; /* The first allocated block */
>>>> unsigned int sr_bit_offset; /* The bit in the bg */
>>>> unsigned int sr_bits; /* How many bits we claimed */
>>>> + unsigned int sr_max_contig_bits; /* The length for contiguous
>>>> + * free bits, only available
>>>> + * for cluster group
>>>> + */
>>>> };
>>>> static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
>>>> @@ -1272,6 +1276,26 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>>>> return ret;
>>>> }
>>>> +u16 ocfs2_find_max_contig_free_bits(void *bitmap,
>>>> + u16 total_bits, u16 start)
>>>> +{
>>>> + u16 offset, free_bits;
>>>> + u16 contig_bits = 0;
>>>> +
>>>> + while (start < total_bits) {
>>>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>>>> + if (offset == total_bits)
>>>> + break;
>>>> +
>>>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>>>> + free_bits = start - offset;
>>>> + if (contig_bits < free_bits)
>>>> + contig_bits = free_bits;
>>>> + }
>>>> +
>>>> + return contig_bits;
>>>> +}
>>>> +
>>>> static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>>>> struct buffer_head *bg_bh,
>>>> unsigned int bits_wanted,
>>>> @@ -1280,6 +1304,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>>>> {
>>>> void *bitmap;
>>>> u16 best_offset, best_size;
>>>> + u16 prev_best_size = 0;
>>>> int offset, start, found, status = 0;
>>>> struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data;
>>>> @@ -1308,6 +1333,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>>>> /* got a zero after some ones */
>>>> found = 1;
>>>> start = offset + 1;
>>>> + prev_best_size = best_size;
>>>> }
>>>> if (found > best_size) {
>>>> best_size = found;
>>>> @@ -1320,6 +1346,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>>>> }
>>>> }
>>>> + /* best_size will be allocated, we save prev_best_size */
>>>> + res->sr_max_contig_bits = prev_best_size;
>>>> if (best_size) {
>>>> res->sr_bit_offset = best_offset;
>>>> res->sr_bits = best_size;
>>>> @@ -1337,11 +1365,15 @@ int ocfs2_block_group_set_bits(handle_t *handle,
>>>> struct ocfs2_group_desc *bg,
>>>> struct buffer_head *group_bh,
>>>> unsigned int bit_off,
>>>> - unsigned int num_bits)
>>>> + unsigned int num_bits,
>>>> + unsigned int max_contig_bits,
>>>> + int fastpath)
>>>> {
>>>> int status;
>>>> void *bitmap = bg->bg_bitmap;
>>>> int journal_type = OCFS2_JOURNAL_ACCESS_WRITE;
>>>> + unsigned int start = bit_off + num_bits;
>>>> + u16 contig_bits;
>>>> /* All callers get the descriptor via
>>>> * ocfs2_read_group_descriptor(). Any corruption is a code bug. */
>>>> @@ -1373,6 +1405,28 @@ int ocfs2_block_group_set_bits(handle_t *handle,
>>>> while(num_bits--)
>>>> ocfs2_set_bit(bit_off++, bitmap);
>>>> + /*
>>>> + * this is optimize path, caller set old contig value
>>>> + * in max_contig_bits to bypass finding action.
>>>> + */
>>>> + if (fastpath) {
>>>> + bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
>>>> + } else if (ocfs2_is_cluster_bitmap(alloc_inode)) {
>>>> + /*
>>>> + * Usually, the block group bitmap allocates only 1 bit
>>>> + * at a time, while the cluster group allocates n bits
>>>> + * each time. Therefore, we only save the contig bits for
>>>> + * the cluster group.
>>>> + */
>>>> + contig_bits = ocfs2_find_max_contig_free_bits(bitmap,
>>>> + le16_to_cpu(bg->bg_bits), start);
>>>> + if (contig_bits > max_contig_bits)
>>>> + max_contig_bits = contig_bits;
>>>> + bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
>>>> + } else {
>>>> + bg->bg_contig_free_bits = 0;
>>>> + }
>>>> +
>>>> ocfs2_journal_dirty(handle, group_bh);
>>>> bail:
>>>> @@ -1486,7 +1540,12 @@ static int ocfs2_cluster_group_search(struct inode *inode,
>>>> BUG_ON(!ocfs2_is_cluster_bitmap(inode));
>>>> - if (gd->bg_free_bits_count) {
>>>> + if (le16_to_cpu(gd->bg_contig_free_bits) &&
>>>> + le16_to_cpu(gd->bg_contig_free_bits) < bits_wanted)
>>>> + return -ENOSPC;
>>>> +
>>>> + /* ->bg_contig_free_bits may un-initialized, so compare again */
>>>> + if (le16_to_cpu(gd->bg_free_bits_count) >= bits_wanted) {
>>>> max_bits = le16_to_cpu(gd->bg_bits);
>>>> /* Tail groups in cluster bitmaps which aren't cpg
>>>> @@ -1555,7 +1614,7 @@ static int ocfs2_block_group_search(struct inode *inode,
>>>> BUG_ON(min_bits != 1);
>>>> BUG_ON(ocfs2_is_cluster_bitmap(inode));
>>>> - if (bg->bg_free_bits_count) {
>>>> + if (le16_to_cpu(bg->bg_free_bits_count) >= bits_wanted) {
>>>> ret = ocfs2_block_group_find_clear_bits(OCFS2_SB(inode->i_sb),
>>>> group_bh, bits_wanted,
>>>> le16_to_cpu(bg->bg_bits),
>>>> @@ -1715,7 +1774,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
>>>> }
>>>> ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
>>>> - res->sr_bit_offset, res->sr_bits);
>>>> + res->sr_bit_offset, res->sr_bits,
>>>> + res->sr_max_contig_bits, 0);
>>>> if (ret < 0) {
>>>> ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
>>>> res->sr_bits,
>>>> @@ -1849,7 +1909,9 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
>>>> bg,
>>>> group_bh,
>>>> res->sr_bit_offset,
>>>> - res->sr_bits);
>>>> + res->sr_bits,
>>>> + res->sr_max_contig_bits,
>>>> + 0);
>>>> if (status < 0) {
>>>> ocfs2_rollback_alloc_dinode_counts(alloc_inode,
>>>> ac->ac_bh, res->sr_bits, chain);
>>>> @@ -2163,7 +2225,9 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
>>>> bg,
>>>> bg_bh,
>>>> res->sr_bit_offset,
>>>> - res->sr_bits);
>>>> + res->sr_bits,
>>>> + res->sr_max_contig_bits,
>>>> + 0);
>>>> if (ret < 0) {
>>>> ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
>>>> ac->ac_bh, res->sr_bits, chain);
>>>> @@ -2382,11 +2446,13 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
>>>> struct buffer_head *group_bh,
>>>> unsigned int bit_off,
>>>> unsigned int num_bits,
>>>> + unsigned int max_contig_bits,
>>>> void (*undo_fn)(unsigned int bit,
>>>> unsigned long *bmap))
>>>> {
>>>> int status;
>>>> unsigned int tmp;
>>>> + u16 contig_bits;
>>>> struct ocfs2_group_desc *undo_bg = NULL;
>>>> struct journal_head *jh;
>>>> @@ -2433,6 +2499,20 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
>>>> num_bits);
>>>> }
>>>> + /*
>>>> + * TODO: even 'num_bits == 1' (the worst case, release 1 cluster),
>>>> + * we still need to rescan whole bitmap.
>>>> + */
>>>> + if (ocfs2_is_cluster_bitmap(alloc_inode)) {
>>>> + contig_bits = ocfs2_find_max_contig_free_bits(bg->bg_bitmap,
>>>> + le16_to_cpu(bg->bg_bits), 0);
>>>> + if (contig_bits > max_contig_bits)
>>>> + max_contig_bits = contig_bits;
>>>> + bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
>>>> + } else {
>>>> + bg->bg_contig_free_bits = 0;
>>>> + }
>>>> +
>>>> if (undo_fn)
>>>> spin_unlock(&jh->b_state_lock);
>>>> @@ -2459,6 +2539,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
>>>> struct ocfs2_chain_list *cl = &fe->id2.i_chain;
>>>> struct buffer_head *group_bh = NULL;
>>>> struct ocfs2_group_desc *group;
>>>> + u16 old_bg_contig_free_bits = 0;
>>>
>>> Why not also define old_bg_contig_free_bits as __le16?
>>> And do convert when calling ocfs2_block_group_set_bits().
>>> This looks consistent with ocfs2_update_last_group_and_inode().
>>>
>>> Joseph
>>
>> _ocfs2_free_suballoc_bits() is not similar to
>> ocfs2_update_last_group_and_inode(). Here, 'old_bg_contig_free_bits'
>> will be passed to ocfs2_block_group_set_bits() for error handling.
>> In ocfs2_block_group_set_bits(), cpu_to_le16() is already used to
>> perform the conversion.
>>
>
> Don't understand what's your concern here.
>
> __le16 old_bg_contig_free_bits = 0;
> ...
> old_bg_contig_free_bits = group->bg_contig_free_bits;
>
> if (error)
> ocfs2_block_group_set_bits(...,
> le16_to_cpu(old_bg_contig_free_bits),
> ...);
>
> BTW, error case is an 'unlikely' branch, right?
>
> Thanks,
> Joseph
Your code is better, I ignored that this is a piece of 'unlikely' code.
Will change it in next version.
-Heming
next prev parent reply other threads:[~2024-03-28 1:55 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-27 8:21 [PATCH v4 0/3] improve write IO performance when fragmentation is high Heming Zhao
2024-03-27 8:21 ` [PATCH v4 1/3] ocfs2: " Heming Zhao
2024-03-27 10:58 ` Joseph Qi
2024-03-27 12:54 ` Heming Zhao
2024-03-28 1:50 ` Joseph Qi
2024-03-28 1:55 ` Heming Zhao [this message]
2024-03-27 8:21 ` [PATCH v4 2/3] ocfs2: adjust enabling place for la window Heming Zhao
2024-03-27 11:05 ` Joseph Qi
2024-03-27 12:59 ` Heming Zhao
2024-03-28 1:58 ` Joseph Qi
2024-03-28 2:36 ` Heming Zhao
2024-03-27 8:21 ` [PATCH v4 3/3] ocfs2: speed up chain-list searching Heming Zhao
2024-03-27 11:17 ` Joseph Qi
2024-03-27 12:59 ` Heming Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6be36d71-09b1-476b-8698-38d0a8821457@suse.com \
--to=heming.zhao@suse.com \
--cc=ailiop@suse.com \
--cc=joseph.qi@linux.alibaba.com \
--cc=ocfs2-devel@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).