All of lore.kernel.org
 help / color / mirror / Atom feed
From: Heming Zhao <heming.zhao@suse.com>
To: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev, ailiop@suse.com
Subject: Re: [PATCH v2] ocfs2: improve write IO performance when fragmentation is high
Date: Mon, 18 Mar 2024 13:29:12 +0800	[thread overview]
Message-ID: <1f8715a2-cc77-4203-9d38-ecf99bb71198@suse.com> (raw)
In-Reply-To: <b09feb83-c506-4c5b-a6dd-c5a7d24f45e3@linux.alibaba.com>

On 3/18/24 09:39, Joseph Qi wrote:
> 
> 
> On 3/16/24 11:44 AM, Heming Zhao wrote:
>> The group_search function ocfs2_cluster_group_search() should
>> bypass groups with insufficient space to avoid unnecessary
>> searches.
>>
>> This patch is particularly useful when ocfs2 is handling huge
>> number small files, and volume fragmentation is very high.
>> In this case, ocfs2 is busy with looking up available la window
>> from //global_bitmap.
>>
>> This patch introduces a new member in the Group Description (gd)
>> struct called 'bg_contig_free_bits', representing the max
>> contigous free bits in this gd. When ocfs2 allocates a new
>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>> expedite the search process.
>>
>> Let's image below path.
>>
>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>
>> 2. when user delete a large file and trigger
>>     ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>     unconditionally.
>>
>> 3. a write IOs thread run and trigger the worst performance path
>>
>> ```
>> ocfs2_reserve_clusters_with_limit
>>   ocfs2_reserve_local_alloc_bits
>>    ocfs2_local_alloc_slide_window //[1]
>>     + ocfs2_local_alloc_reserve_for_window //[2]
>>     + ocfs2_local_alloc_new_window //[3]
>>        ocfs2_recalc_la_window
>> ```
>>
>> [1]:
>> will be called when la window bits used up.
>>
>> [2]:
>> under la state is ENABLED, and this func only check global_bitmap
>> free bits, it will succeed in general.
>>
>> [3]:
>> will use the default la window size to search clusters then fail.
>> ocfs2_recalc_la_window attempts other la window sizes.
>> the timing complexity is O(n^4), resulting in a significant time
>> cost for scanning global bitmap. This leads to a dramatic slowdown
>> in write I/Os (e.g., user space 'dd').
>>
>> i.e.
>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>> la window default size: 106MB.
>> The partition is fragmentation by creating & deleting huge mount of
>> small files.
>>
>> before this patch, the timing of [3] should be
>> (the number got from real world):
>> - la window size change order (size: MB):
>>    106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>    only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>    ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>    runs in worst case.
>> - group chain number: 242
>>    ocfs2_claim_suballoc_bits calls for-loop 242 times
>> - each chain has 49 block group
>>    ocfs2_search_chain calls while-loop 49 times
>> - each bg has 32256 blocks
>>    ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>    for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>    (32256/64) (this is not worst value) for timing calucation.
>>
>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>
>> In the worst case, user space writes 1MB data will trigger 42M scanning
>> times.
>>
>> under this patch, the timing is '7*242*49 = 83006', reduced by three
>> orders of magnitude.
>>
>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>> ---
>> v2:
>> 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32()
>>     for setting bg->bg_contig_free_bits.
>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>     'unsigned int'.
>> 3. restore ocfs2_block_group_set_bits() input parameters style, change
>>     parameter 'failure_path' to 'fastpath'.
>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>     'suballoc.h' to 'suballoc.c'.
>> 6. modify some code indent error.
>>
>> ---
>>   fs/ocfs2/move_extents.c |  2 +-
>>   fs/ocfs2/ocfs2_fs.h     |  4 +-
>>   fs/ocfs2/resize.c       |  7 +++
>>   fs/ocfs2/suballoc.c     | 99 +++++++++++++++++++++++++++++++++++++----
>>   fs/ocfs2/suballoc.h     |  6 ++-
>>   5 files changed, 106 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>> --- a/fs/ocfs2/move_extents.c
>> +++ b/fs/ocfs2/move_extents.c
>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>   	}
>>   
>>   	ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>> -					 goal_bit, len);
>> +					 goal_bit, len, 0, 0);
>>   	if (ret) {
>>   		ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>   					       le16_to_cpu(gd->bg_chain));
>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>> index 7aebdbf5cc0a..eae18d772e93 100644
>> --- a/fs/ocfs2/ocfs2_fs.h
>> +++ b/fs/ocfs2/ocfs2_fs.h
>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>   	__le16	bg_free_bits_count;     /* Free bits count */
>>   	__le16   bg_chain;               /* What chain I am in. */
>>   /*10*/	__le32   bg_generation;
>> -	__le32	bg_reserved1;
>> +	__le32   bg_contig_free_bits;  /* max contig free bits length */
> 
> I've found some parts you are using le32, while others using le16.
> It seems that le16 is enough here. So why not:
> __le16 bg_contig_free_bits;
> __le16 bg_reserved1;
> 
> Joseph

IIUC, From the comment before ocfs2_la_default_mb(), the max cluster
size is 1MB. And from 'struct ocfs2_group_desc', the offset of
bg_bitmap is 0x40. Therefore, the max contiguous bit of gd:
(1024*1024 - 0x40)*8 = 8388096 (23-bit length, bigger than __le16)

-Heming

  reply	other threads:[~2024-03-18  5:29 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-16  3:44 [PATCH v2] ocfs2: improve write IO performance when fragmentation is high Heming Zhao
2024-03-18  1:39 ` Joseph Qi
2024-03-18  5:29   ` Heming Zhao [this message]
2024-03-18  5:49     ` Joseph Qi
2024-03-18  7:47       ` Heming Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1f8715a2-cc77-4203-9d38-ecf99bb71198@suse.com \
    --to=heming.zhao@suse.com \
    --cc=ailiop@suse.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=ocfs2-devel@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.