* [PATCH 0/2] ext4: fix two bugs in ext4_mb_normalize_request @ 2022-05-21 13:42 Baokun Li 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li 0 siblings, 2 replies; 21+ messages in thread From: Baokun Li @ 2022-05-21 13:42 UTC (permalink / raw) To: linux-ext4 Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1 Baokun Li (2): ext4: fix bug_on ext4_mb_use_inode_pa ext4: correct the judgment of BUG in ext4_mb_normalize_request fs/ext4/mballoc.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa 2022-05-21 13:42 [PATCH 0/2] ext4: fix two bugs in ext4_mb_normalize_request Baokun Li @ 2022-05-21 13:42 ` Baokun Li 2022-05-23 9:29 ` Jan Kara ` (2 more replies) 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li 1 sibling, 3 replies; 21+ messages in thread From: Baokun Li @ 2022-05-21 13:42 UTC (permalink / raw) To: linux-ext4 Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1, Hulk Robot Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/mballoc.c:3211! [...] RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f [...] Call Trace: ext4_mb_new_blocks+0x9df/0x5d30 ext4_ext_map_blocks+0x1803/0x4d80 ext4_map_blocks+0x3a4/0x1a10 ext4_writepages+0x126d/0x2c30 do_writepages+0x7f/0x1b0 __filemap_fdatawrite_range+0x285/0x3b0 file_write_and_wait_range+0xb1/0x140 ext4_sync_file+0x1aa/0xca0 vfs_fsync_range+0xfb/0x260 do_fsync+0x48/0xa0 [...] ================================================================== Above issue may happen as follows: ------------------------------------- do_fsync vfs_fsync_range ext4_sync_file file_write_and_wait_range __filemap_fdatawrite_range do_writepages ext4_writepages mpage_map_and_submit_extent mpage_map_one_extent ext4_map_blocks ext4_mb_new_blocks ext4_mb_normalize_request >>> start + size <= ac->ac_o_ex.fe_logical ext4_mb_regular_allocator ext4_mb_simple_scan_group ext4_mb_use_best_found ext4_mb_new_preallocation ext4_mb_new_inode_pa ext4_mb_use_inode_pa >>> set ac->ac_b_ex.fe_len <= 0 ext4_mb_mark_diskspace_used >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); we can easily reproduce this problem with the following commands: `fallocate -l100M disk` `mkfs.ext4 -b 1024 -g 256 disk` `mount disk /mnt` `fsstress -d /mnt -l 0 -n 1000 -p 1` The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur when the size is truncated. So start should be the start position of the group where ac_o_ex.fe_logical is located after alignment. In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP is very large, the value calculated by start_off is more accurate. Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ea653d19f9ec..32410b79b664 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, size = size >> bsbits; start = start_off >> bsbits; + /* + * Because size must be less than or equal to + * EXT4_BLOCKS_PER_GROUP, start should be the start position of + * the group where ac_o_ex.fe_logical is located after alignment. + * In addition, when the value of fe_logical or + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated + * by start_off is more accurate. + */ + start = max(start, round_down(ac->ac_o_ex.fe_logical, + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); + /* don't cover already allocated blocks in selected range */ if (ar->pleft && start <= ar->lleft) { size -= ar->lleft + 1 - start; -- 2.31.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li @ 2022-05-23 9:29 ` Jan Kara 2022-05-23 9:58 ` Lukas Czerner 2022-05-23 19:51 ` Ritesh Harjani 2 siblings, 0 replies; 21+ messages in thread From: Jan Kara @ 2022-05-23 9:29 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Hulk Robot On Sat 21-05-22 21:42:16, Baokun Li wrote: > Hulk Robot reported a BUG_ON: > ================================================================== > kernel BUG at fs/ext4/mballoc.c:3211! > [...] > RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f > [...] > Call Trace: > ext4_mb_new_blocks+0x9df/0x5d30 > ext4_ext_map_blocks+0x1803/0x4d80 > ext4_map_blocks+0x3a4/0x1a10 > ext4_writepages+0x126d/0x2c30 > do_writepages+0x7f/0x1b0 > __filemap_fdatawrite_range+0x285/0x3b0 > file_write_and_wait_range+0xb1/0x140 > ext4_sync_file+0x1aa/0xca0 > vfs_fsync_range+0xfb/0x260 > do_fsync+0x48/0xa0 > [...] > ================================================================== > > Above issue may happen as follows: > ------------------------------------- > do_fsync > vfs_fsync_range > ext4_sync_file > file_write_and_wait_range > __filemap_fdatawrite_range > do_writepages > ext4_writepages > mpage_map_and_submit_extent > mpage_map_one_extent > ext4_map_blocks > ext4_mb_new_blocks > ext4_mb_normalize_request > >>> start + size <= ac->ac_o_ex.fe_logical > ext4_mb_regular_allocator > ext4_mb_simple_scan_group > ext4_mb_use_best_found > ext4_mb_new_preallocation > ext4_mb_new_inode_pa > ext4_mb_use_inode_pa > >>> set ac->ac_b_ex.fe_len <= 0 > ext4_mb_mark_diskspace_used > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > we can easily reproduce this problem with the following commands: > `fallocate -l100M disk` > `mkfs.ext4 -b 1024 -g 256 disk` > `mount disk /mnt` > `fsstress -d /mnt -l 0 -n 1000 -p 1` > > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. > Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur > when the size is truncated. So start should be the start position of > the group where ac_o_ex.fe_logical is located after alignment. > In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP > is very large, the value calculated by start_off is more accurate. > > Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") > Reported-by: Hulk Robot <hulkci@huawei.com> > Signed-off-by: Baokun Li <libaokun1@huawei.com> Looks good. I'd just phrase the comment below a bit differently: > + /* > + * Because size must be less than or equal to > + * EXT4_BLOCKS_PER_GROUP, start should be the start position of > + * the group where ac_o_ex.fe_logical is located after alignment. > + * In addition, when the value of fe_logical or > + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated > + * by start_off is more accurate. > + */ > + start = max(start, round_down(ac->ac_o_ex.fe_logical, > + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); > + Can we make the comment like: /* * For tiny groups (smaller than 8MB) the chosen allocation * alignment may be larger than group size. Make sure the alignment * does not move allocation to a different group which makes mballoc * fail assertions later. */ With that feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li 2022-05-23 9:29 ` Jan Kara @ 2022-05-23 9:58 ` Lukas Czerner [not found] ` <2525e39a-5be9-bae1-b77d-60f583892868@huawei.com> 2022-05-23 19:51 ` Ritesh Harjani 2 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2022-05-23 9:58 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Hulk Robot On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote: > Hulk Robot reported a BUG_ON: > ================================================================== > kernel BUG at fs/ext4/mballoc.c:3211! > [...] > RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f > [...] > Call Trace: > ext4_mb_new_blocks+0x9df/0x5d30 > ext4_ext_map_blocks+0x1803/0x4d80 > ext4_map_blocks+0x3a4/0x1a10 > ext4_writepages+0x126d/0x2c30 > do_writepages+0x7f/0x1b0 > __filemap_fdatawrite_range+0x285/0x3b0 > file_write_and_wait_range+0xb1/0x140 > ext4_sync_file+0x1aa/0xca0 > vfs_fsync_range+0xfb/0x260 > do_fsync+0x48/0xa0 > [...] > ================================================================== > > Above issue may happen as follows: > ------------------------------------- > do_fsync > vfs_fsync_range > ext4_sync_file > file_write_and_wait_range > __filemap_fdatawrite_range > do_writepages > ext4_writepages > mpage_map_and_submit_extent > mpage_map_one_extent > ext4_map_blocks > ext4_mb_new_blocks > ext4_mb_normalize_request > >>> start + size <= ac->ac_o_ex.fe_logical > ext4_mb_regular_allocator > ext4_mb_simple_scan_group > ext4_mb_use_best_found > ext4_mb_new_preallocation > ext4_mb_new_inode_pa > ext4_mb_use_inode_pa > >>> set ac->ac_b_ex.fe_len <= 0 > ext4_mb_mark_diskspace_used > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > we can easily reproduce this problem with the following commands: > `fallocate -l100M disk` > `mkfs.ext4 -b 1024 -g 256 disk` > `mount disk /mnt` > `fsstress -d /mnt -l 0 -n 1000 -p 1` > > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. > Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur > when the size is truncated. So start should be the start position of > the group where ac_o_ex.fe_logical is located after alignment. > In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP > is very large, the value calculated by start_off is more accurate. > > Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") > Reported-by: Hulk Robot <hulkci@huawei.com> > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > fs/ext4/mballoc.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index ea653d19f9ec..32410b79b664 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > size = size >> bsbits; > start = start_off >> bsbits; > > + /* > + * Because size must be less than or equal to > + * EXT4_BLOCKS_PER_GROUP, start should be the start position of > + * the group where ac_o_ex.fe_logical is located after alignment. > + * In addition, when the value of fe_logical or > + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated > + * by start_off is more accurate. > + */ > + start = max(start, round_down(ac->ac_o_ex.fe_logical, > + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); This does not look right. The second argument in round_down() must be a power of two, but there is no such restriction on blocks per group. Also I am not quite sure why do we adjust the start in this way at all? If we found what seems to be a preallocated extent which we can use and we're actually going to use 0 lenght extent it seems like the problem is somewhere else? Can you desribe the problem a bit more in detail? Maybe I need to look at the ext4_mb_normalize_request() some more. -Lukas > + > /* don't cover already allocated blocks in selected range */ > if (ar->pleft && start <= ar->lleft) { > size -= ar->lleft + 1 - start; > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <2525e39a-5be9-bae1-b77d-60f583892868@huawei.com>]
* Re: [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa [not found] ` <2525e39a-5be9-bae1-b77d-60f583892868@huawei.com> @ 2022-05-24 12:11 ` Lukas Czerner 2022-05-24 12:42 ` Baokun Li 0 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2022-05-24 12:11 UTC (permalink / raw) To: libaokun (A) Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Hulk Robot On Mon, May 23, 2022 at 08:19:03PM +0800, libaokun (A) wrote: > 在 2022/5/23 17:58, Lukas Czerner 写道: > > On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote: > > > Hulk Robot reported a BUG_ON: > > > ================================================================== > > > kernel BUG at fs/ext4/mballoc.c:3211! > > > [...] > > > RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f > > > [...] > > > Call Trace: > > > ext4_mb_new_blocks+0x9df/0x5d30 > > > ext4_ext_map_blocks+0x1803/0x4d80 > > > ext4_map_blocks+0x3a4/0x1a10 > > > ext4_writepages+0x126d/0x2c30 > > > do_writepages+0x7f/0x1b0 > > > __filemap_fdatawrite_range+0x285/0x3b0 > > > file_write_and_wait_range+0xb1/0x140 > > > ext4_sync_file+0x1aa/0xca0 > > > vfs_fsync_range+0xfb/0x260 > > > do_fsync+0x48/0xa0 > > > [...] > > > ================================================================== > > > > > > Above issue may happen as follows: > > > ------------------------------------- > > > do_fsync > > > vfs_fsync_range > > > ext4_sync_file > > > file_write_and_wait_range > > > __filemap_fdatawrite_range > > > do_writepages > > > ext4_writepages > > > mpage_map_and_submit_extent > > > mpage_map_one_extent > > > ext4_map_blocks > > > ext4_mb_new_blocks > > > ext4_mb_normalize_request > > > >>> start + size <= ac->ac_o_ex.fe_logical > > > ext4_mb_regular_allocator > > > ext4_mb_simple_scan_group > > > ext4_mb_use_best_found > > > ext4_mb_new_preallocation > > > ext4_mb_new_inode_pa > > > ext4_mb_use_inode_pa > > > >>> set ac->ac_b_ex.fe_len <= 0 > > > ext4_mb_mark_diskspace_used > > > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > > > > > we can easily reproduce this problem with the following commands: > > > `fallocate -l100M disk` > > > `mkfs.ext4 -b 1024 -g 256 disk` > > > `mount disk /mnt` > > > `fsstress -d /mnt -l 0 -n 1000 -p 1` > > > > > > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. > > > Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur > > > when the size is truncated. So start should be the start position of > > > the group where ac_o_ex.fe_logical is located after alignment. > > > In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP > > > is very large, the value calculated by start_off is more accurate. > > > > > > Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") > > > Reported-by: Hulk Robot<hulkci@huawei.com> > > > Signed-off-by: Baokun Li<libaokun1@huawei.com> > > > --- > > > fs/ext4/mballoc.c | 11 +++++++++++ > > > 1 file changed, 11 insertions(+) > > > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > > > index ea653d19f9ec..32410b79b664 100644 > > > --- a/fs/ext4/mballoc.c > > > +++ b/fs/ext4/mballoc.c > > > @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > > > size = size >> bsbits; > > > start = start_off >> bsbits; > > > + /* > > > + * Because size must be less than or equal to > > > + * EXT4_BLOCKS_PER_GROUP, start should be the start position of > > > + * the group where ac_o_ex.fe_logical is located after alignment. > > > + * In addition, when the value of fe_logical or > > > + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated > > > + * by start_off is more accurate. > > > + */ > > > + start = max(start, round_down(ac->ac_o_ex.fe_logical, > > > + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); > > This does not look right. The second argument in round_down() must be a > > power of two, but there is no such restriction on blocks per group. > > Indeed, block peer group size should be a multiple of 8. I forgot. > > Thank you very much for your correction. > > > Also I am not quite sure why do we adjust the start in this way at all? > > If we found what seems to be a preallocated extent which we can use and > > we're actually going to use 0 lenght extent it seems like the problem is > > somewhere else? Can you desribe the problem a bit more in detail? > > > > Maybe I need to look at the ext4_mb_normalize_request() some more. > > > > -Lukas > The logical block map reached before the problem stack was 1011. > > The estimated location of the size logical block of the inde plus the > required allocation length 7, the size is 1018. > > But the i_size of inode is 1299, so the size is 1299, the aligned size is > 2048, and the end is 2048. > > Because of the restriction of ar -> pleft, start==648. > > EXT4_BLOCKS_PER_GROUP (ac- > ac_sb) is 256, so the size is 256 and the end > is 904. > > It is not normal to truncate here, the end is less than 1299 of the target > logical block, > that is, the allocated range does not contain the target logical block. > > Then this new scope conflicts with the previous PA, as follows: > > pa_start-506 pa_end-759 > |____________P________V_________P__________V_____________l________| > 0 start-648 end-904 logical-1299 > 2048 > > In this case, start is changed to pa_end, that is, 759. > In this case, a bug_ON is reported in ext4_mb_mark_diskspace_used. > > The problem is caused by the truncation introduced in the > cd648b8a8fd5 ("ext4: trim allocation requests to group size"). > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. > However, the truncation method is incorrect. The group where the logical is > located should be used for allocation. If the value of EXT4_BLOCKS_PER_GROUP > is 256, size 2048 can be divided into eight groups. If the value of logical > is 1299, > the value of logical must be in the sixth group, that is, > start=1299/256*256=5*256=1280, end=size+1280=1536. > Then, the value range can be further narrowed down based on other > restrictions. > > 1024 1280 1536 > |________|________|________|________|________|__l______|________|________| > 0 group1 group2 group3 group4 group5 group6 group7 group8 2048 Ok, thanks for the explanation it makes sense now, although should not we just adjust the start only when the size is being truncated to the EXT4_BLOCKS_PER_GROUP? -Lukas > > > > + > > > /* don't cover already allocated blocks in selected range */ > > > if (ar->pleft && start <= ar->lleft) { > > > size -= ar->lleft + 1 - start; > > > -- > > > 2.31.1 > > > > > . > > -- > With Best Regards, > Baokun Li ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa 2022-05-24 12:11 ` Lukas Czerner @ 2022-05-24 12:42 ` Baokun Li 0 siblings, 0 replies; 21+ messages in thread From: Baokun Li @ 2022-05-24 12:42 UTC (permalink / raw) To: Lukas Czerner Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Hulk Robot, Baokun Li 在 2022/5/24 20:11, Lukas Czerner 写道: > On Mon, May 23, 2022 at 08:19:03PM +0800, libaokun (A) wrote: >> 在 2022/5/23 17:58, Lukas Czerner 写道: >>> On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote: >>>> Hulk Robot reported a BUG_ON: >>>> ================================================================== >>>> kernel BUG at fs/ext4/mballoc.c:3211! >>>> [...] >>>> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f >>>> [...] >>>> Call Trace: >>>> ext4_mb_new_blocks+0x9df/0x5d30 >>>> ext4_ext_map_blocks+0x1803/0x4d80 >>>> ext4_map_blocks+0x3a4/0x1a10 >>>> ext4_writepages+0x126d/0x2c30 >>>> do_writepages+0x7f/0x1b0 >>>> __filemap_fdatawrite_range+0x285/0x3b0 >>>> file_write_and_wait_range+0xb1/0x140 >>>> ext4_sync_file+0x1aa/0xca0 >>>> vfs_fsync_range+0xfb/0x260 >>>> do_fsync+0x48/0xa0 >>>> [...] >>>> ================================================================== >>>> >>>> Above issue may happen as follows: >>>> ------------------------------------- >>>> do_fsync >>>> vfs_fsync_range >>>> ext4_sync_file >>>> file_write_and_wait_range >>>> __filemap_fdatawrite_range >>>> do_writepages >>>> ext4_writepages >>>> mpage_map_and_submit_extent >>>> mpage_map_one_extent >>>> ext4_map_blocks >>>> ext4_mb_new_blocks >>>> ext4_mb_normalize_request >>>> >>> start + size <= ac->ac_o_ex.fe_logical >>>> ext4_mb_regular_allocator >>>> ext4_mb_simple_scan_group >>>> ext4_mb_use_best_found >>>> ext4_mb_new_preallocation >>>> ext4_mb_new_inode_pa >>>> ext4_mb_use_inode_pa >>>> >>> set ac->ac_b_ex.fe_len <= 0 >>>> ext4_mb_mark_diskspace_used >>>> >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); >>>> >>>> we can easily reproduce this problem with the following commands: >>>> `fallocate -l100M disk` >>>> `mkfs.ext4 -b 1024 -g 256 disk` >>>> `mount disk /mnt` >>>> `fsstress -d /mnt -l 0 -n 1000 -p 1` >>>> >>>> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. >>>> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur >>>> when the size is truncated. So start should be the start position of >>>> the group where ac_o_ex.fe_logical is located after alignment. >>>> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP >>>> is very large, the value calculated by start_off is more accurate. >>>> >>>> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") >>>> Reported-by: Hulk Robot<hulkci@huawei.com> >>>> Signed-off-by: Baokun Li<libaokun1@huawei.com> >>>> --- >>>> fs/ext4/mballoc.c | 11 +++++++++++ >>>> 1 file changed, 11 insertions(+) >>>> >>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c >>>> index ea653d19f9ec..32410b79b664 100644 >>>> --- a/fs/ext4/mballoc.c >>>> +++ b/fs/ext4/mballoc.c >>>> @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, >>>> size = size >> bsbits; >>>> start = start_off >> bsbits; >>>> + /* >>>> + * Because size must be less than or equal to >>>> + * EXT4_BLOCKS_PER_GROUP, start should be the start position of >>>> + * the group where ac_o_ex.fe_logical is located after alignment. >>>> + * In addition, when the value of fe_logical or >>>> + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated >>>> + * by start_off is more accurate. >>>> + */ >>>> + start = max(start, round_down(ac->ac_o_ex.fe_logical, >>>> + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); >>> This does not look right. The second argument in round_down() must be a >>> power of two, but there is no such restriction on blocks per group. >> Indeed, block peer group size should be a multiple of 8. I forgot. >> >> Thank you very much for your correction. >> >>> Also I am not quite sure why do we adjust the start in this way at all? >>> If we found what seems to be a preallocated extent which we can use and >>> we're actually going to use 0 lenght extent it seems like the problem is >>> somewhere else? Can you desribe the problem a bit more in detail? >>> >>> Maybe I need to look at the ext4_mb_normalize_request() some more. >>> >>> -Lukas >> The logical block map reached before the problem stack was 1011. >> >> The estimated location of the size logical block of the inde plus the >> required allocation length 7, the size is 1018. >> >> But the i_size of inode is 1299, so the size is 1299, the aligned size is >> 2048, and the end is 2048. >> >> Because of the restriction of ar -> pleft, start==648. >> >> EXT4_BLOCKS_PER_GROUP (ac- > ac_sb) is 256, so the size is 256 and the end >> is 904. >> >> It is not normal to truncate here, the end is less than 1299 of the target >> logical block, >> that is, the allocated range does not contain the target logical block. >> >> Then this new scope conflicts with the previous PA, as follows: >> >> pa_start-506 pa_end-759 >> |____________P________V_________P__________V_____________l________| >> 0 start-648 end-904 logical-1299 >> 2048 >> >> In this case, start is changed to pa_end, that is, 759. >> In this case, a bug_ON is reported in ext4_mb_mark_diskspace_used. >> >> The problem is caused by the truncation introduced in the >> cd648b8a8fd5 ("ext4: trim allocation requests to group size"). >> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. >> However, the truncation method is incorrect. The group where the logical is >> located should be used for allocation. If the value of EXT4_BLOCKS_PER_GROUP >> is 256, size 2048 can be divided into eight groups. If the value of logical >> is 1299, >> the value of logical must be in the sixth group, that is, >> start=1299/256*256=5*256=1280, end=size+1280=1536. >> Then, the value range can be further narrowed down based on other >> restrictions. >> >> 1024 1280 1536 >> |________|________|________|________|________|__l______|________|________| >> 0 group1 group2 group3 group4 group5 group6 group7 group8 2048 > Ok, thanks for the explanation it makes sense now, although should not > we just adjust the start only when the size is being truncated to the > EXT4_BLOCKS_PER_GROUP? > > -Lukas Yes, it is. Assume that the value of fe_logical is 1011, and the value of EXT4_BLOCKS_PER_GROUP is 1024. In this case, 1011 / 1024 = 0, and 0 x 1024 is still 0. Therefore, the value of start is 0, which does not change. >>>> + >>>> /* don't cover already allocated blocks in selected range */ >>>> if (ar->pleft && start <= ar->lleft) { >>>> size -= ar->lleft + 1 - start; >>>> -- >>>> 2.31.1 >>>> >>> . >> -- >> With Best Regards, >> Baokun Li > . -- With Best Regards, Baokun Li ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li 2022-05-23 9:29 ` Jan Kara 2022-05-23 9:58 ` Lukas Czerner @ 2022-05-23 19:51 ` Ritesh Harjani 2 siblings, 0 replies; 21+ messages in thread From: Ritesh Harjani @ 2022-05-23 19:51 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yebin10, yukuai3, Hulk Robot On 22/05/21 09:42PM, Baokun Li wrote: > Hulk Robot reported a BUG_ON: > ================================================================== > kernel BUG at fs/ext4/mballoc.c:3211! > [...] > RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f > [...] > Call Trace: > ext4_mb_new_blocks+0x9df/0x5d30 > ext4_ext_map_blocks+0x1803/0x4d80 > ext4_map_blocks+0x3a4/0x1a10 > ext4_writepages+0x126d/0x2c30 > do_writepages+0x7f/0x1b0 > __filemap_fdatawrite_range+0x285/0x3b0 > file_write_and_wait_range+0xb1/0x140 > ext4_sync_file+0x1aa/0xca0 > vfs_fsync_range+0xfb/0x260 > do_fsync+0x48/0xa0 > [...] > ================================================================== > > Above issue may happen as follows: > ------------------------------------- > do_fsync > vfs_fsync_range > ext4_sync_file > file_write_and_wait_range > __filemap_fdatawrite_range > do_writepages > ext4_writepages > mpage_map_and_submit_extent > mpage_map_one_extent > ext4_map_blocks > ext4_mb_new_blocks > ext4_mb_normalize_request > >>> start + size <= ac->ac_o_ex.fe_logical > ext4_mb_regular_allocator > ext4_mb_simple_scan_group > ext4_mb_use_best_found > ext4_mb_new_preallocation > ext4_mb_new_inode_pa > ext4_mb_use_inode_pa > >>> set ac->ac_b_ex.fe_len <= 0 > ext4_mb_mark_diskspace_used > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > we can easily reproduce this problem with the following commands: > `fallocate -l100M disk` > `mkfs.ext4 -b 1024 -g 256 disk` > `mount disk /mnt` > `fsstress -d /mnt -l 0 -n 1000 -p 1` Thanks for sharing the reproducer. Yes, this could easily trigger the bug_on. We don't even need "-b 1024". > > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. > Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur > when the size is truncated. So start should be the start position of > the group where ac_o_ex.fe_logical is located after alignment. > In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP > is very large, the value calculated by start_off is more accurate. > > Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") Commit message does say that it can result into allocation request bigger then the size of the block group. So then, what happens in case of flex_bg feature is enabled (which by default nowadays). Shouldn't we consider flex_bg_groups * EXT4_BLOCKS_PER_GROUP as the allocation size request? > > > Reported-by: Hulk Robot <hulkci@huawei.com> > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > fs/ext4/mballoc.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index ea653d19f9ec..32410b79b664 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > size = size >> bsbits; > start = start_off >> bsbits; > > + /* > + * Because size must be less than or equal to > + * EXT4_BLOCKS_PER_GROUP, We should confirm whether in case of flex_bg groups, this assumption holds correct? > start should be the start position of > + * the group where ac_o_ex.fe_logical is located after alignment. > + * In addition, when the value of fe_logical or > + * EXT4_BLOCKS_PER_GROUP is very large, the value calculated > + * by start_off is more accurate. > + */ > + start = max(start, round_down(ac->ac_o_ex.fe_logical, > + EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); > + This does looks like it could solve the problem at hand. This is because we try to allocate based on the start offset upto size. As of now we do allocate more space then requested (due to normalization), but due to the start offset of our preallocation, our allocation request of ac->ac_o_ex.fe_logical + fe_len doesn't lie in the allocated PA. This happens since we trim the size of the allocation request to only till blocks_per_group. So with that if we make the above changes (your patch), it does looks like, that we will then be allocating the PA such that our allocation request will fall within the allocated PA (in ext4_mb_use_inode_pa()) and hence we won't hit the bug_on in ext4_mb_mark_diskspace_used(). One more suggestion - I think we should also add a WARN_ON()/BUG_ON() here. This makes the start of the problem more visible, rather then waiting till ext4_mb_mark_diskspace_used() call to hit the bug_on(). Thoughts? diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 252c168454c7..e91d5aeb8efd 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4301,6 +4301,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac, BUG_ON(start < pa->pa_pstart); BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len)); BUG_ON(pa->pa_free < len); + WARN_ON(len <= 0); pa->pa_free -= len; mb_debug(ac->ac_sb, "use %llu/%d from inode pa %p\n", start, len, pa); -ritesh ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-21 13:42 [PATCH 0/2] ext4: fix two bugs in ext4_mb_normalize_request Baokun Li 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li @ 2022-05-21 13:42 ` Baokun Li 2022-05-23 9:40 ` Jan Kara ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Baokun Li @ 2022-05-21 13:42 UTC (permalink / raw) To: linux-ext4 Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1 When either of the "start + size <= ac->ac_o_ex.fe_logical" or "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates that the fe_logical is not in the allocated range. In this case, it should be bug_ON. Fixes: dfe076c106f6 ("ext4: get rid of code duplication") Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 32410b79b664..d0fb57970648 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, } rcu_read_unlock(); - if (start + size <= ac->ac_o_ex.fe_logical && + if (start + size <= ac->ac_o_ex.fe_logical || start > ac->ac_o_ex.fe_logical) { ext4_msg(ac->ac_sb, KERN_ERR, "start %lu, size %lu, fe_logical %lu", -- 2.31.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li @ 2022-05-23 9:40 ` Jan Kara [not found] ` <3755e40b-f817-83df-b239-b0697976c272@huawei.com> 2022-05-23 10:05 ` Lukas Czerner 2022-05-23 20:08 ` Ritesh Harjani 2 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2022-05-23 9:40 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3 On Sat 21-05-22 21:42:17, Baokun Li wrote: > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > that the fe_logical is not in the allocated range. > In this case, it should be bug_ON. > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication") > Signed-off-by: Baokun Li <libaokun1@huawei.com> I think this is actually wrong. The original condition checks whether start + size does not overflow the used integer type. Your condition is much stronger and I don't think it always has to be true. E.g. allocation goal block (start variable) can be pushed to larger values by existing preallocation or so. Honza > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 32410b79b664..d0fb57970648 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > } > rcu_read_unlock(); > > - if (start + size <= ac->ac_o_ex.fe_logical && > + if (start + size <= ac->ac_o_ex.fe_logical || > start > ac->ac_o_ex.fe_logical) { > ext4_msg(ac->ac_sb, KERN_ERR, > "start %lu, size %lu, fe_logical %lu", > -- > 2.31.1 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <3755e40b-f817-83df-b239-b0697976c272@huawei.com>]
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request [not found] ` <3755e40b-f817-83df-b239-b0697976c272@huawei.com> @ 2022-05-24 9:30 ` Jan Kara 2022-05-24 13:44 ` Baokun Li 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2022-05-24 9:30 UTC (permalink / raw) To: libaokun (A) Cc: Jan Kara, lczerner, linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3 On Mon 23-05-22 21:04:16, libaokun (A) wrote: > 在 2022/5/23 17:40, Jan Kara 写道: > > On Sat 21-05-22 21:42:17, Baokun Li wrote: > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > that the fe_logical is not in the allocated range. > > > In this case, it should be bug_ON. > > > > > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication") > > > Signed-off-by: Baokun Li<libaokun1@huawei.com> > > I think this is actually wrong. The original condition checks whether > > start + size does not overflow the used integer type. Your condition is > > much stronger and I don't think it always has to be true. E.g. allocation > > goal block (start variable) can be pushed to larger values by existing > > preallocation or so. > > > > Honza > > > I think there are two reasons for this: > > First of all, the code here is as follows. > ``` > size = end - start; > [...] > if (start + size <= ac->ac_o_ex.fe_logical && > start > ac->ac_o_ex.fe_logical) { > ext4_msg(ac->ac_sb, KERN_ERR, > "start %lu, size %lu, fe_logical %lu", > (unsigned long) start, (unsigned long) size, > (unsigned long) ac->ac_o_ex.fe_logical); > BUG(); > } > BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)); > ``` > First of all, there is no need to compare with ac_o_ex.fe_logical if it is > to determine whether there is an overflow. > Because the previous logic guarantees start < = ac_o_ex.fe_logical, and How does it guarantee that? The logic: if (ar->pleft && start <= ar->lleft) { size -= ar->lleft + 1 - start; start = ar->lleft + 1; } can move 'start' to further blocks... > limits the scope of size in > "BUG_ON (size < = 0 | | size > EXT4_BLOCKS_PER_GROUP (ac- > ac_sb))" > immediately following. OK, but what guarantees that ac_o_ex.fe_logical < UINT_MAX - size? > Secondly, the following code flow also reflects this logic. > > ext4_mb_normalize_request > >>> start + size <= ac->ac_o_ex.fe_logical > ext4_mb_regular_allocator > ext4_mb_simple_scan_group > ext4_mb_use_best_found > ext4_mb_new_preallocation > ext4_mb_new_inode_pa > ext4_mb_use_inode_pa > >>> set ac->ac_b_ex.fe_len <= 0 > ext4_mb_mark_diskspace_used > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > In ext4_mb_use_inode_pa, you have the following code. > ``` > start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); > end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi, > ac->ac_o_ex.fe_len)); > len = EXT4_NUM_B2C(sbi, end - start); > ac->ac_b_ex.fe_len = len; > ``` > The starting position in ext4_mb_mark_diskspace_used will be assert. > BUG_ON(ac->ac_b_ex.fe_len <= 0); > > When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of > end - start must be greater than 0. > However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this > bug_ON may be triggered. > When this bug_ON is triggered, that is, > > ac->ac_b_ex.fe_len <= 0 > end - start <= 0 > end <= start > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart + > (ac->ac_o_ex.fe_logical - pa->pa_lstart) > pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart > pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical > start + size <= ac->ac_o_ex.fe_logical > > So I think that "&&" here should be changed to "||". Sorry, I still disagree. After some more code reading I agree that ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks to be placed in the inode so logical extent of allocated blocks should include ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which case we can also remove some other code from ext4_mb_normalize_request()). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-24 9:30 ` Jan Kara @ 2022-05-24 13:44 ` Baokun Li 2022-05-25 11:29 ` Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: Baokun Li @ 2022-05-24 13:44 UTC (permalink / raw) To: Jan Kara Cc: lczerner, linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Baokun Li 在 2022/5/24 17:30, Jan Kara 写道: > On Mon 23-05-22 21:04:16, libaokun (A) wrote: >> 在 2022/5/23 17:40, Jan Kara 写道: >>> On Sat 21-05-22 21:42:17, Baokun Li wrote: >>>> When either of the "start + size <= ac->ac_o_ex.fe_logical" or >>>> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates >>>> that the fe_logical is not in the allocated range. >>>> In this case, it should be bug_ON. >>>> >>>> Fixes: dfe076c106f6 ("ext4: get rid of code duplication") >>>> Signed-off-by: Baokun Li<libaokun1@huawei.com> >>> I think this is actually wrong. The original condition checks whether >>> start + size does not overflow the used integer type. Your condition is >>> much stronger and I don't think it always has to be true. E.g. allocation >>> goal block (start variable) can be pushed to larger values by existing >>> preallocation or so. >>> >>> Honza >>> >> I think there are two reasons for this: >> >> First of all, the code here is as follows. >> ``` >> size = end - start; >> [...] >> if (start + size <= ac->ac_o_ex.fe_logical && >> start > ac->ac_o_ex.fe_logical) { >> ext4_msg(ac->ac_sb, KERN_ERR, >> "start %lu, size %lu, fe_logical %lu", >> (unsigned long) start, (unsigned long) size, >> (unsigned long) ac->ac_o_ex.fe_logical); >> BUG(); >> } >> BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)); >> ``` >> First of all, there is no need to compare with ac_o_ex.fe_logical if it is >> to determine whether there is an overflow. >> Because the previous logic guarantees start < = ac_o_ex.fe_logical, and > How does it guarantee that? The logic: > > if (ar->pleft && start <= ar->lleft) { > size -= ar->lleft + 1 - start; > start = ar->lleft + 1; > } > > can move 'start' to further blocks... This is not the case. According to the code of the preceding process, ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks. ar->pleft is the first allocated block found to the left by map->m_lblk (that is, fe_logical), and ar->pright is the first allocated block found to the right. ar->lleft and ar->lright are logical block numbers, so there must be "ar->lleft < ac_o_ex.fe_logical < ar->lright". > >> limits the scope of size in >> "BUG_ON (size < = 0 | | size > EXT4_BLOCKS_PER_GROUP (ac- > ac_sb))" >> immediately following. > OK, but what guarantees that ac_o_ex.fe_logical < UINT_MAX - size? When ac_o_ex.fe_logical is too large to overflow, predict filesize enters the last branch. In this case, start = ac->ac_o_ex.fe_logical and size = ac->ac_o_ex.fe_len. However, the overflow is checked in ext4_ext_check_overlap of ext4_ext_map_blocks. The code is as follows: ``` 1898 /* check for wrap through zero on extent logical start block*/ 1899 if (b1 + len1 < b1) { 1900 len1 = EXT_MAX_BLOCKS - b1; 1901 newext->ee_len = cpu_to_le16(len1); 1902 ret = 1; 1903 } ``` Therefore, no overflow occurs. > >> Secondly, the following code flow also reflects this logic. >> >> ext4_mb_normalize_request >> >>> start + size <= ac->ac_o_ex.fe_logical >> ext4_mb_regular_allocator >> ext4_mb_simple_scan_group >> ext4_mb_use_best_found >> ext4_mb_new_preallocation >> ext4_mb_new_inode_pa >> ext4_mb_use_inode_pa >> >>> set ac->ac_b_ex.fe_len <= 0 >> ext4_mb_mark_diskspace_used >> >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); >> >> In ext4_mb_use_inode_pa, you have the following code. >> ``` >> start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); >> end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi, >> ac->ac_o_ex.fe_len)); >> len = EXT4_NUM_B2C(sbi, end - start); >> ac->ac_b_ex.fe_len = len; >> ``` >> The starting position in ext4_mb_mark_diskspace_used will be assert. >> BUG_ON(ac->ac_b_ex.fe_len <= 0); >> >> When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of >> end - start must be greater than 0. >> However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this >> bug_ON may be triggered. >> When this bug_ON is triggered, that is, >> >> ac->ac_b_ex.fe_len <= 0 >> end - start <= 0 >> end <= start >> pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart + >> (ac->ac_o_ex.fe_logical - pa->pa_lstart) >> pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart >> pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical >> start + size <= ac->ac_o_ex.fe_logical >> >> So I think that "&&" here should be changed to "||". > Sorry, I still disagree. After some more code reading I agree that > ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks > to be placed in the inode so logical extent of allocated blocks should include > ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you > suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which > case we can also remove some other code from ext4_mb_normalize_request()). > > Honza > What codes are you referring to that can be deleted? -- With Best Regards, Baokun Li ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-24 13:44 ` Baokun Li @ 2022-05-25 11:29 ` Jan Kara 2022-05-26 1:16 ` Baokun Li 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2022-05-25 11:29 UTC (permalink / raw) To: Baokun Li Cc: Jan Kara, lczerner, linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3 On Tue 24-05-22 21:44:31, Baokun Li wrote: > 在 2022/5/24 17:30, Jan Kara 写道: > > On Mon 23-05-22 21:04:16, libaokun (A) wrote: > > > 在 2022/5/23 17:40, Jan Kara 写道: > > > > On Sat 21-05-22 21:42:17, Baokun Li wrote: > > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > > > that the fe_logical is not in the allocated range. > > > > > In this case, it should be bug_ON. > > > > > > > > > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication") > > > > > Signed-off-by: Baokun Li<libaokun1@huawei.com> > > > > I think this is actually wrong. The original condition checks whether > > > > start + size does not overflow the used integer type. Your condition is > > > > much stronger and I don't think it always has to be true. E.g. allocation > > > > goal block (start variable) can be pushed to larger values by existing > > > > preallocation or so. > > > > > > > > Honza > > > > > > > I think there are two reasons for this: > > > > > > First of all, the code here is as follows. > > > ``` > > > size = end - start; > > > [...] > > > if (start + size <= ac->ac_o_ex.fe_logical && > > > start > ac->ac_o_ex.fe_logical) { > > > ext4_msg(ac->ac_sb, KERN_ERR, > > > "start %lu, size %lu, fe_logical %lu", > > > (unsigned long) start, (unsigned long) size, > > > (unsigned long) ac->ac_o_ex.fe_logical); > > > BUG(); > > > } > > > BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)); > > > ``` > > > First of all, there is no need to compare with ac_o_ex.fe_logical if it is > > > to determine whether there is an overflow. > > > Because the previous logic guarantees start < = ac_o_ex.fe_logical, and > > How does it guarantee that? The logic: > > > > if (ar->pleft && start <= ar->lleft) { > > size -= ar->lleft + 1 - start; > > start = ar->lleft + 1; > > } > > > > can move 'start' to further blocks... > This is not the case. According to the code of the preceding process, > ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks. > ar->pleft is the first allocated block found to the left by map->m_lblk > (that is, fe_logical), > and ar->pright is the first allocated block found to the right. > ar->lleft and ar->lright are logical block numbers, so there must be > "ar->lleft < ac_o_ex.fe_logical < ar->lright". Right, I've found that out after sending my previous email. Sorry for confusion. > > > Secondly, the following code flow also reflects this logic. > > > > > > ext4_mb_normalize_request > > > >>> start + size <= ac->ac_o_ex.fe_logical > > > ext4_mb_regular_allocator > > > ext4_mb_simple_scan_group > > > ext4_mb_use_best_found > > > ext4_mb_new_preallocation > > > ext4_mb_new_inode_pa > > > ext4_mb_use_inode_pa > > > >>> set ac->ac_b_ex.fe_len <= 0 > > > ext4_mb_mark_diskspace_used > > > >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); > > > > > > In ext4_mb_use_inode_pa, you have the following code. > > > ``` > > > start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); > > > end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi, > > > ac->ac_o_ex.fe_len)); > > > len = EXT4_NUM_B2C(sbi, end - start); > > > ac->ac_b_ex.fe_len = len; > > > ``` > > > The starting position in ext4_mb_mark_diskspace_used will be assert. > > > BUG_ON(ac->ac_b_ex.fe_len <= 0); > > > When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of > > > end - start must be greater than 0. > > > However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this > > > bug_ON may be triggered. > > > When this bug_ON is triggered, that is, > > > > > > ac->ac_b_ex.fe_len <= 0 > > > end - start <= 0 > > > end <= start > > > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart + > > > (ac->ac_o_ex.fe_logical - pa->pa_lstart) > > > pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart > > > pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical > > > start + size <= ac->ac_o_ex.fe_logical > > > > > > So I think that "&&" here should be changed to "||". > > Sorry, I still disagree. After some more code reading I agree that > > ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks > > to be placed in the inode so logical extent of allocated blocks should include > > ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you > > suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which > > case we can also remove some other code from ext4_mb_normalize_request()). > > > > Honza > > > What codes are you referring to that can be deleted? So I though the shifting of 'start' by lleft cannot happen but then I realized that if 'start' got aligned down, it can now be lower than lleft so the shifting is indeed needed. So all code is needed there. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-25 11:29 ` Jan Kara @ 2022-05-26 1:16 ` Baokun Li 0 siblings, 0 replies; 21+ messages in thread From: Baokun Li @ 2022-05-26 1:16 UTC (permalink / raw) To: Jan Kara Cc: lczerner, linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3, Baokun Li 在 2022/5/25 19:29, Jan Kara 写道: > On Tue 24-05-22 21:44:31, Baokun Li wrote: >> 在 2022/5/24 17:30, Jan Kara 写道: >>> On Mon 23-05-22 21:04:16, libaokun (A) wrote: >>>> 在 2022/5/23 17:40, Jan Kara 写道: >>>>> On Sat 21-05-22 21:42:17, Baokun Li wrote: >>>>>> When either of the "start + size <= ac->ac_o_ex.fe_logical" or >>>>>> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates >>>>>> that the fe_logical is not in the allocated range. >>>>>> In this case, it should be bug_ON. >>>>>> >>>>>> Fixes: dfe076c106f6 ("ext4: get rid of code duplication") >>>>>> Signed-off-by: Baokun Li<libaokun1@huawei.com> >>>>> I think this is actually wrong. The original condition checks whether >>>>> start + size does not overflow the used integer type. Your condition is >>>>> much stronger and I don't think it always has to be true. E.g. allocation >>>>> goal block (start variable) can be pushed to larger values by existing >>>>> preallocation or so. >>>>> >>>>> Honza >>>>> >>>> I think there are two reasons for this: >>>> >>>> First of all, the code here is as follows. >>>> ``` >>>> size = end - start; >>>> [...] >>>> if (start + size <= ac->ac_o_ex.fe_logical && >>>> start > ac->ac_o_ex.fe_logical) { >>>> ext4_msg(ac->ac_sb, KERN_ERR, >>>> "start %lu, size %lu, fe_logical %lu", >>>> (unsigned long) start, (unsigned long) size, >>>> (unsigned long) ac->ac_o_ex.fe_logical); >>>> BUG(); >>>> } >>>> BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)); >>>> ``` >>>> First of all, there is no need to compare with ac_o_ex.fe_logical if it is >>>> to determine whether there is an overflow. >>>> Because the previous logic guarantees start < = ac_o_ex.fe_logical, and >>> How does it guarantee that? The logic: >>> >>> if (ar->pleft && start <= ar->lleft) { >>> size -= ar->lleft + 1 - start; >>> start = ar->lleft + 1; >>> } >>> >>> can move 'start' to further blocks... >> This is not the case. According to the code of the preceding process, >> ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks. >> ar->pleft is the first allocated block found to the left by map->m_lblk >> (that is, fe_logical), >> and ar->pright is the first allocated block found to the right. >> ar->lleft and ar->lright are logical block numbers, so there must be >> "ar->lleft < ac_o_ex.fe_logical < ar->lright". > Right, I've found that out after sending my previous email. Sorry for > confusion. Don't be sorry. Thank you very much for your advice. It has benefited me a lot. > >>>> Secondly, the following code flow also reflects this logic. >>>> >>>> ext4_mb_normalize_request >>>> >>> start + size <= ac->ac_o_ex.fe_logical >>>> ext4_mb_regular_allocator >>>> ext4_mb_simple_scan_group >>>> ext4_mb_use_best_found >>>> ext4_mb_new_preallocation >>>> ext4_mb_new_inode_pa >>>> ext4_mb_use_inode_pa >>>> >>> set ac->ac_b_ex.fe_len <= 0 >>>> ext4_mb_mark_diskspace_used >>>> >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); >>>> >>>> In ext4_mb_use_inode_pa, you have the following code. >>>> ``` >>>> start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); >>>> end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi, >>>> ac->ac_o_ex.fe_len)); >>>> len = EXT4_NUM_B2C(sbi, end - start); >>>> ac->ac_b_ex.fe_len = len; >>>> ``` >>>> The starting position in ext4_mb_mark_diskspace_used will be assert. >>>> BUG_ON(ac->ac_b_ex.fe_len <= 0); >>>> When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of >>>> end - start must be greater than 0. >>>> However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this >>>> bug_ON may be triggered. >>>> When this bug_ON is triggered, that is, >>>> >>>> ac->ac_b_ex.fe_len <= 0 >>>> end - start <= 0 >>>> end <= start >>>> pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart + >>>> (ac->ac_o_ex.fe_logical - pa->pa_lstart) >>>> pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart >>>> pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical >>>> start + size <= ac->ac_o_ex.fe_logical >>>> >>>> So I think that "&&" here should be changed to "||". >>> Sorry, I still disagree. After some more code reading I agree that >>> ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks >>> to be placed in the inode so logical extent of allocated blocks should include >>> ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you >>> suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which >>> case we can also remove some other code from ext4_mb_normalize_request()). >>> >>> Honza >>> >> What codes are you referring to that can be deleted? > So I though the shifting of 'start' by lleft cannot happen but then I > realized that if 'start' got aligned down, it can now be lower than lleft > so the shifting is indeed needed. So all code is needed there. > > Honza Okay, thanks again! -- With Best Regards, Baokun Li . ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li 2022-05-23 9:40 ` Jan Kara @ 2022-05-23 10:05 ` Lukas Czerner 2022-05-23 20:08 ` Ritesh Harjani 2 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2022-05-23 10:05 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang, yebin10, yukuai3 On Sat, May 21, 2022 at 09:42:17PM +0800, Baokun Li wrote: > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > that the fe_logical is not in the allocated range. > In this case, it should be bug_ON. This seems wrong, I think that this condition is testing overflow and it's correct as it is. Or am I missing something? -Lukas > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication") > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > fs/ext4/mballoc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 32410b79b664..d0fb57970648 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > } > rcu_read_unlock(); > > - if (start + size <= ac->ac_o_ex.fe_logical && > + if (start + size <= ac->ac_o_ex.fe_logical || > start > ac->ac_o_ex.fe_logical) { > ext4_msg(ac->ac_sb, KERN_ERR, > "start %lu, size %lu, fe_logical %lu", > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li 2022-05-23 9:40 ` Jan Kara 2022-05-23 10:05 ` Lukas Czerner @ 2022-05-23 20:08 ` Ritesh Harjani 2022-05-23 21:08 ` Jan Kara 2022-05-24 6:09 ` Baokun Li 2 siblings, 2 replies; 21+ messages in thread From: Ritesh Harjani @ 2022-05-23 20:08 UTC (permalink / raw) To: Baokun Li Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yebin10, yukuai3 On 22/05/21 09:42PM, Baokun Li wrote: > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > that the fe_logical is not in the allocated range. Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall within the preallocated range. So if our start or start + size doesn't include fe_logical then it is a bug in the ext4_mb_normalize_request() logic. But should we be so harsh to hit a bug_on() or make it warn_on()? Also did you run any fs tests with this change. Since it looks like this logic existed since mballoc was introduced. > In this case, it should be bug_ON. > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication") No, there is no issue with this patch. It correctly just removes the duplicate logic. > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > fs/ext4/mballoc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 32410b79b664..d0fb57970648 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > } > rcu_read_unlock(); > > - if (start + size <= ac->ac_o_ex.fe_logical && > + if (start + size <= ac->ac_o_ex.fe_logical || > start > ac->ac_o_ex.fe_logical) { > ext4_msg(ac->ac_sb, KERN_ERR, > "start %lu, size %lu, fe_logical %lu", > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-23 20:08 ` Ritesh Harjani @ 2022-05-23 21:08 ` Jan Kara 2022-05-24 6:26 ` Ritesh Harjani 2022-05-24 6:09 ` Baokun Li 1 sibling, 1 reply; 21+ messages in thread From: Jan Kara @ 2022-05-23 21:08 UTC (permalink / raw) To: Ritesh Harjani Cc: Baokun Li, linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yebin10, yukuai3 On Tue 24-05-22 01:38:44, Ritesh Harjani wrote: > On 22/05/21 09:42PM, Baokun Li wrote: > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > that the fe_logical is not in the allocated range. > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > within the preallocated range. So if our start or start + size doesn't include > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a hard guarantee that we would allocate extent that includes that block. It was always treated as a hint only. In particular if you look at the logic in ext4_mb_normalize_request() it does shift the start of the allocation to avoid preallocated ranges etc. so I don't see how we are guaranteed that ext4_mb_normalize_request() will result in an allocation request that includes ac->ac_o_ex.fe_logical. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-23 21:08 ` Jan Kara @ 2022-05-24 6:26 ` Ritesh Harjani 2022-05-24 9:39 ` Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: Ritesh Harjani @ 2022-05-24 6:26 UTC (permalink / raw) To: Jan Kara Cc: Baokun Li, linux-ext4, tytso, adilger.kernel, linux-kernel, yi.zhang, yebin10, yukuai3 On 22/05/23 11:08PM, Jan Kara wrote: > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote: > > On 22/05/21 09:42PM, Baokun Li wrote: > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > that the fe_logical is not in the allocated range. > > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > > within the preallocated range. So if our start or start + size doesn't include > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a > hard guarantee that we would allocate extent that includes that block. It Agree that the guarantee is not about the extent which finally gets allocated. It is only within ext4_mb_normalize_request() that the "start" and "size" variable calculations is done in such a way that the ac->ac_o_ex.fe_logical block will always fall within the "start" & "end" boundaries after normalization. That is how it also updates the goal block at the end too. ac->ac_g_ex. > was always treated as a hint only. In particular if you look at the logic > in ext4_mb_normalize_request() it does shift the start of the allocation to > avoid preallocated ranges etc. Yes, I checked the logic of ext4_mb_normalize_request() again. As I see it (I can be wrong, so please correct me), there is always an attempt to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical except just one change where we are trimming the size of the request to only EXT4_BLOCKS_PER_GROUP. For e.g. when it compares against preallocated ranges. It has a BUG() which checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range. Because then we should never have come here to do allocation of a new block. 4143 /* PA must not overlap original request */ 4144 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end || 4145 ac->ac_o_ex.fe_logical < pa->pa_lstart)); <...> 4152 BUG_ON(pa->pa_lstart <= start && pa_end >= end); Then after skipping the preallocated regions which doesn't fall in between "start" and "end"... 4147 /* skip PAs this normalized request doesn't overlap with */ 4148 if (pa->pa_lstart >= end || pa_end <= start) { 4149 spin_unlock(&pa->pa_lock); 4150 continue; 4151 } ...it adjusts "start" and "end" boundary according to allocated PAs boundaries such that fe_logical is always in between "start" and "end". 4154 /* adjust start or end to be adjacent to this pa */ 4155 if (pa_end <= ac->ac_o_ex.fe_logical) { 4156 BUG_ON(pa_end < start); 4157 start = pa_end; 4158 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) { 4159 BUG_ON(pa->pa_lstart > end); 4160 end = pa->pa_lstart; 4161 } > so I don't see how we are guaranteed that > ext4_mb_normalize_request() will result in an allocation request that > includes ac->ac_o_ex.fe_logical. It could be I am wrong, but looks like ext4_mb_normalize_request() keeps ac->ac_o_ex.fe_logical within "start" and "end" of allocation request. And then updates the goal block. 4196 ac->ac_g_ex.fe_logical = start; 4197 ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size); Thoughts? -ritesh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-24 6:26 ` Ritesh Harjani @ 2022-05-24 9:39 ` Jan Kara 2022-05-24 17:31 ` Ritesh Harjani 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2022-05-24 9:39 UTC (permalink / raw) To: Ritesh Harjani Cc: Jan Kara, Baokun Li, linux-ext4, tytso, adilger.kernel, linux-kernel, yi.zhang, yebin10, yukuai3 On Tue 24-05-22 11:56:55, Ritesh Harjani wrote: > On 22/05/23 11:08PM, Jan Kara wrote: > > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote: > > > On 22/05/21 09:42PM, Baokun Li wrote: > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > > that the fe_logical is not in the allocated range. > > > > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > > > within the preallocated range. So if our start or start + size doesn't include > > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. > > > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a > > hard guarantee that we would allocate extent that includes that block. It > > Agree that the guarantee is not about the extent which finally gets allocated. > It is only within ext4_mb_normalize_request() that the "start" and "size" > variable calculations is done in such a way that the ac->ac_o_ex.fe_logical > block will always fall within the "start" & "end" boundaries after > normalization. > > That is how it also updates the goal block at the end too. ac->ac_g_ex. > > > was always treated as a hint only. In particular if you look at the logic > > in ext4_mb_normalize_request() it does shift the start of the allocation to > > avoid preallocated ranges etc. > > Yes, I checked the logic of ext4_mb_normalize_request() again. > As I see it (I can be wrong, so please correct me), there is always an attempt > to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical > except just one change where we are trimming the size of the request to only > EXT4_BLOCKS_PER_GROUP. > > For e.g. when it compares against preallocated ranges. It has a BUG() which > checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range. > Because then we should never have come here to do allocation of a new block. > > 4143 /* PA must not overlap original request */ > 4144 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end || > 4145 ac->ac_o_ex.fe_logical < pa->pa_lstart)); > <...> > 4152 BUG_ON(pa->pa_lstart <= start && pa_end >= end); > > Then after skipping the preallocated regions which doesn't fall in between > "start" and "end"... > > 4147 /* skip PAs this normalized request doesn't overlap with */ > 4148 if (pa->pa_lstart >= end || pa_end <= start) { > 4149 spin_unlock(&pa->pa_lock); > 4150 continue; > 4151 } > > ...it adjusts "start" and "end" boundary according to allocated PAs boundaries > such that fe_logical is always in between "start" and "end". > > 4154 /* adjust start or end to be adjacent to this pa */ > 4155 if (pa_end <= ac->ac_o_ex.fe_logical) { > 4156 BUG_ON(pa_end < start); > 4157 start = pa_end; > 4158 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) { > 4159 BUG_ON(pa->pa_lstart > end); > 4160 end = pa->pa_lstart; > 4161 } > > > > > so I don't see how we are guaranteed that > > ext4_mb_normalize_request() will result in an allocation request that > > includes ac->ac_o_ex.fe_logical. > > It could be I am wrong, but looks like ext4_mb_normalize_request() keeps > ac->ac_o_ex.fe_logical within "start" and "end" of allocation request. > And then updates the goal block. > > 4196 ac->ac_g_ex.fe_logical = start; > 4197 ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size); > > Thoughts? Right, after some more inspection the only thing I'm concerned about is: /* don't cover already allocated blocks in selected range */ if (ar->pleft && start <= ar->lleft) { size -= ar->lleft + 1 - start; start = ar->lleft + 1; } which can shift start beyond ac->ac_o_ex.fe_logical if the block would be already allocated. But I guess in that case we should not be calling ext4_mb_normalize_request()? ... some more code digging .. Yes, that is guaranteed in how lleft is initialized in ext4_ext_map_blocks(). So OK, I withdraw my objection to the stronger check but the changelog really needs to do a better job to explain why the stronger condition should be true. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-24 9:39 ` Jan Kara @ 2022-05-24 17:31 ` Ritesh Harjani 2022-05-25 12:12 ` Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: Ritesh Harjani @ 2022-05-24 17:31 UTC (permalink / raw) To: Jan Kara Cc: Baokun Li, linux-ext4, tytso, adilger.kernel, linux-kernel, yi.zhang, yebin10, yukuai3 On 22/05/24 11:39AM, Jan Kara wrote: > On Tue 24-05-22 11:56:55, Ritesh Harjani wrote: > > On 22/05/23 11:08PM, Jan Kara wrote: > > > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote: > > > > On 22/05/21 09:42PM, Baokun Li wrote: > > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > > > that the fe_logical is not in the allocated range. > > > > > > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > > > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > > > > within the preallocated range. So if our start or start + size doesn't include > > > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. > > > > > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a > > > hard guarantee that we would allocate extent that includes that block. It > > > > Agree that the guarantee is not about the extent which finally gets allocated. > > It is only within ext4_mb_normalize_request() that the "start" and "size" > > variable calculations is done in such a way that the ac->ac_o_ex.fe_logical > > block will always fall within the "start" & "end" boundaries after > > normalization. > > > > That is how it also updates the goal block at the end too. ac->ac_g_ex. > > > > > was always treated as a hint only. In particular if you look at the logic > > > in ext4_mb_normalize_request() it does shift the start of the allocation to > > > avoid preallocated ranges etc. > > > > Yes, I checked the logic of ext4_mb_normalize_request() again. > > As I see it (I can be wrong, so please correct me), there is always an attempt > > to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical > > except just one change where we are trimming the size of the request to only > > EXT4_BLOCKS_PER_GROUP. > > > > For e.g. when it compares against preallocated ranges. It has a BUG() which > > checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range. > > Because then we should never have come here to do allocation of a new block. > > > > 4143 /* PA must not overlap original request */ > > 4144 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end || > > 4145 ac->ac_o_ex.fe_logical < pa->pa_lstart)); > > <...> > > 4152 BUG_ON(pa->pa_lstart <= start && pa_end >= end); > > > > Then after skipping the preallocated regions which doesn't fall in between > > "start" and "end"... > > > > 4147 /* skip PAs this normalized request doesn't overlap with */ > > 4148 if (pa->pa_lstart >= end || pa_end <= start) { > > 4149 spin_unlock(&pa->pa_lock); > > 4150 continue; > > 4151 } > > > > ...it adjusts "start" and "end" boundary according to allocated PAs boundaries > > such that fe_logical is always in between "start" and "end". > > > > 4154 /* adjust start or end to be adjacent to this pa */ > > 4155 if (pa_end <= ac->ac_o_ex.fe_logical) { > > 4156 BUG_ON(pa_end < start); > > 4157 start = pa_end; > > 4158 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) { > > 4159 BUG_ON(pa->pa_lstart > end); > > 4160 end = pa->pa_lstart; > > 4161 } > > > > > > > > > so I don't see how we are guaranteed that > > > ext4_mb_normalize_request() will result in an allocation request that > > > includes ac->ac_o_ex.fe_logical. > > > > It could be I am wrong, but looks like ext4_mb_normalize_request() keeps > > ac->ac_o_ex.fe_logical within "start" and "end" of allocation request. > > And then updates the goal block. > > > > 4196 ac->ac_g_ex.fe_logical = start; > > 4197 ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size); > > > > Thoughts? > > Right, after some more inspection the only thing I'm concerned about is: > > /* don't cover already allocated blocks in selected range */ > if (ar->pleft && start <= ar->lleft) { > size -= ar->lleft + 1 - start; > start = ar->lleft + 1; > } > > which can shift start beyond ac->ac_o_ex.fe_logical if the block would be > already allocated. But I guess in that case we should not be calling > ext4_mb_normalize_request()? ... some more code digging .. Yes, that is > guaranteed in how lleft is initialized in ext4_ext_map_blocks(). Yes. > So OK, I withdraw my objection to the stronger check but the changelog really needs Thanks Jan for confirming it. > to do a better job to explain why the stronger condition should be true. Agreed. diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 252c168454c7..9e7c145e9aa2 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4179,7 +4179,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, } rcu_read_unlock(); - if (start + size <= ac->ac_o_ex.fe_logical && + /* + * In this function "start" and "size" are normalized for better + * alignment and length such that we could preallocate more blocks. + * This normalization is done such that original request of + * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and + * "size" boundaries. Does above comment look good to you? + * (Note fe_len can be relaxed since FS block allocation API does not + * provide gurantee on number of contiguous blocks allocation since that + * depends upon free space left, etc). + * In case of inode pa, later we use the allocated blocks + * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated + * range of goal/best blocks [start, size] to put it at the + * ac_o_ex.fe_logical extent of this inode. + * (See ext4_mb_use_inode_pa() for more details) + */ ^^^ We can even put more info if needed (maybe in commit message?) + if (start + size <= ac->ac_o_ex.fe_logical || start > ac->ac_o_ex.fe_logical) { ext4_msg(ac->ac_sb, KERN_ERR, "start %lu, size %lu, fe_logical %lu", FYI - I ran the fsstress test (with -g 256) shared by Baokun with only above change (&& -> ||) and not the original fix. And I see that we can hit this mentioned BUG() now. <logs> ======== [ 599.619875] EXT4-fs (loop2): start 692, size 196, fe_logical 982 ...I think we should also add (orig_size >> bsbits) in above print msg ^^ [ 599.621043] ------------[ cut here ]------------ [ 599.625099] kernel BUG at fs/ext4/mballoc.c:4188! -ritesh ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-24 17:31 ` Ritesh Harjani @ 2022-05-25 12:12 ` Jan Kara 0 siblings, 0 replies; 21+ messages in thread From: Jan Kara @ 2022-05-25 12:12 UTC (permalink / raw) To: Ritesh Harjani Cc: Jan Kara, Baokun Li, linux-ext4, tytso, adilger.kernel, linux-kernel, yi.zhang, yebin10, yukuai3 On Tue 24-05-22 23:01:35, Ritesh Harjani wrote: > On 22/05/24 11:39AM, Jan Kara wrote: > > On Tue 24-05-22 11:56:55, Ritesh Harjani wrote: > > > On 22/05/23 11:08PM, Jan Kara wrote: > > > > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote: > > > > > On 22/05/21 09:42PM, Baokun Li wrote: > > > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or > > > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates > > > > > > that the fe_logical is not in the allocated range. > > > > > > > > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > > > > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > > > > > within the preallocated range. So if our start or start + size doesn't include > > > > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. > > > > > > > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a > > > > hard guarantee that we would allocate extent that includes that block. It > > > > > > Agree that the guarantee is not about the extent which finally gets allocated. > > > It is only within ext4_mb_normalize_request() that the "start" and "size" > > > variable calculations is done in such a way that the ac->ac_o_ex.fe_logical > > > block will always fall within the "start" & "end" boundaries after > > > normalization. > > > > > > That is how it also updates the goal block at the end too. ac->ac_g_ex. > > > > > > > was always treated as a hint only. In particular if you look at the logic > > > > in ext4_mb_normalize_request() it does shift the start of the allocation to > > > > avoid preallocated ranges etc. > > > > > > Yes, I checked the logic of ext4_mb_normalize_request() again. > > > As I see it (I can be wrong, so please correct me), there is always an attempt > > > to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical > > > except just one change where we are trimming the size of the request to only > > > EXT4_BLOCKS_PER_GROUP. > > > > > > For e.g. when it compares against preallocated ranges. It has a BUG() which > > > checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range. > > > Because then we should never have come here to do allocation of a new block. > > > > > > 4143 /* PA must not overlap original request */ > > > 4144 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end || > > > 4145 ac->ac_o_ex.fe_logical < pa->pa_lstart)); > > > <...> > > > 4152 BUG_ON(pa->pa_lstart <= start && pa_end >= end); > > > > > > Then after skipping the preallocated regions which doesn't fall in between > > > "start" and "end"... > > > > > > 4147 /* skip PAs this normalized request doesn't overlap with */ > > > 4148 if (pa->pa_lstart >= end || pa_end <= start) { > > > 4149 spin_unlock(&pa->pa_lock); > > > 4150 continue; > > > 4151 } > > > > > > ...it adjusts "start" and "end" boundary according to allocated PAs boundaries > > > such that fe_logical is always in between "start" and "end". > > > > > > 4154 /* adjust start or end to be adjacent to this pa */ > > > 4155 if (pa_end <= ac->ac_o_ex.fe_logical) { > > > 4156 BUG_ON(pa_end < start); > > > 4157 start = pa_end; > > > 4158 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) { > > > 4159 BUG_ON(pa->pa_lstart > end); > > > 4160 end = pa->pa_lstart; > > > 4161 } > > > > > > > > > > > > > so I don't see how we are guaranteed that > > > > ext4_mb_normalize_request() will result in an allocation request that > > > > includes ac->ac_o_ex.fe_logical. > > > > > > It could be I am wrong, but looks like ext4_mb_normalize_request() keeps > > > ac->ac_o_ex.fe_logical within "start" and "end" of allocation request. > > > And then updates the goal block. > > > > > > 4196 ac->ac_g_ex.fe_logical = start; > > > 4197 ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size); > > > > > > Thoughts? > > > > Right, after some more inspection the only thing I'm concerned about is: > > > > /* don't cover already allocated blocks in selected range */ > > if (ar->pleft && start <= ar->lleft) { > > size -= ar->lleft + 1 - start; > > start = ar->lleft + 1; > > } > > > > which can shift start beyond ac->ac_o_ex.fe_logical if the block would be > > already allocated. But I guess in that case we should not be calling > > ext4_mb_normalize_request()? ... some more code digging .. Yes, that is > > guaranteed in how lleft is initialized in ext4_ext_map_blocks(). > > Yes. > > > So OK, I withdraw my objection to the stronger check but the changelog really needs > > Thanks Jan for confirming it. > > > > to do a better job to explain why the stronger condition should be true. > > Agreed. > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 252c168454c7..9e7c145e9aa2 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4179,7 +4179,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, > } > rcu_read_unlock(); > > - if (start + size <= ac->ac_o_ex.fe_logical && > + /* > + * In this function "start" and "size" are normalized for better > + * alignment and length such that we could preallocate more blocks. > + * This normalization is done such that original request of > + * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and > + * "size" boundaries. > > Does above comment look good to you? Yes, thanks! > + * (Note fe_len can be relaxed since FS block allocation API does not > + * provide gurantee on number of contiguous blocks allocation since that > + * depends upon free space left, etc). > + * In case of inode pa, later we use the allocated blocks > + * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated > + * range of goal/best blocks [start, size] to put it at the > + * ac_o_ex.fe_logical extent of this inode. > + * (See ext4_mb_use_inode_pa() for more details) > + */ > > ^^^ We can even put more info if needed (maybe in commit message?) I'd just put into commit message a note like: ext4_mb_normalize_request() can move logical start of allocated blocks to reduce fragmentation and better utilize preallocation. However logical block requested as a start of allocation (ac->ac_o_ex.fe_logical) should always be covered by allocated blocks so we add assertion to check that. > > + if (start + size <= ac->ac_o_ex.fe_logical || > start > ac->ac_o_ex.fe_logical) { > ext4_msg(ac->ac_sb, KERN_ERR, > "start %lu, size %lu, fe_logical %lu", > > FYI - I ran the fsstress test (with -g 256) shared by Baokun with only above > change (&& -> ||) and not the original fix. And I see that we can hit this > mentioned BUG() now. Cool. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request 2022-05-23 20:08 ` Ritesh Harjani 2022-05-23 21:08 ` Jan Kara @ 2022-05-24 6:09 ` Baokun Li 1 sibling, 0 replies; 21+ messages in thread From: Baokun Li @ 2022-05-24 6:09 UTC (permalink / raw) To: Ritesh Harjani Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yebin10, yukuai3 在 2022/5/24 4:08, Ritesh Harjani 写道: > On 22/05/21 09:42PM, Baokun Li wrote: >> When either of the "start + size <= ac->ac_o_ex.fe_logical" or >> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates >> that the fe_logical is not in the allocated range. > Sounds about right to me based on the logic in ext4_mb_use_inode_pa(). > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall > within the preallocated range. So if our start or start + size doesn't include > fe_logical then it is a bug in the ext4_mb_normalize_request() logic. Yes, exactly. > But should we be so harsh to hit a bug_on() or make it warn_on()? I don't think hit a bug_on() is a problem. BUG_ON is not triggered here and will be triggered later. > Also did you run any fs tests with this change. Yes, I ran xfstests on ext3 and ext4 and found no problems. > Since it looks like this > logic existed since mballoc was introduced. > Yes, on our coverage report, those lines of code never seem to get there. >> In this case, it should be bug_ON. >> >> Fixes: dfe076c106f6 ("ext4: get rid of code duplication") > No, there is no issue with this patch. It correctly just removes the duplicate > logic. Okay, I'm going to remove this tag. >> Signed-off-by: Baokun Li<libaokun1@huawei.com> >> --- >> fs/ext4/mballoc.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c >> index 32410b79b664..d0fb57970648 100644 >> --- a/fs/ext4/mballoc.c >> +++ b/fs/ext4/mballoc.c >> @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, >> } >> rcu_read_unlock(); >> >> - if (start + size <= ac->ac_o_ex.fe_logical && >> + if (start + size <= ac->ac_o_ex.fe_logical || >> start > ac->ac_o_ex.fe_logical) { >> ext4_msg(ac->ac_sb, KERN_ERR, >> "start %lu, size %lu, fe_logical %lu", >> -- >> 2.31.1 >> > . -- With Best Regards, Baokun Li ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2022-05-26 1:16 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-21 13:42 [PATCH 0/2] ext4: fix two bugs in ext4_mb_normalize_request Baokun Li 2022-05-21 13:42 ` [PATCH 1/2] ext4: fix bug_on ext4_mb_use_inode_pa Baokun Li 2022-05-23 9:29 ` Jan Kara 2022-05-23 9:58 ` Lukas Czerner [not found] ` <2525e39a-5be9-bae1-b77d-60f583892868@huawei.com> 2022-05-24 12:11 ` Lukas Czerner 2022-05-24 12:42 ` Baokun Li 2022-05-23 19:51 ` Ritesh Harjani 2022-05-21 13:42 ` [PATCH 2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request Baokun Li 2022-05-23 9:40 ` Jan Kara [not found] ` <3755e40b-f817-83df-b239-b0697976c272@huawei.com> 2022-05-24 9:30 ` Jan Kara 2022-05-24 13:44 ` Baokun Li 2022-05-25 11:29 ` Jan Kara 2022-05-26 1:16 ` Baokun Li 2022-05-23 10:05 ` Lukas Czerner 2022-05-23 20:08 ` Ritesh Harjani 2022-05-23 21:08 ` Jan Kara 2022-05-24 6:26 ` Ritesh Harjani 2022-05-24 9:39 ` Jan Kara 2022-05-24 17:31 ` Ritesh Harjani 2022-05-25 12:12 ` Jan Kara 2022-05-24 6:09 ` Baokun Li
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.