linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing
@ 2022-08-17 13:26 Baokun Li
  2022-08-17 13:27 ` [PATCH 1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024 Baokun Li
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Baokun Li @ 2022-08-17 13:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, lczerner, enwlinux,
	linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="y", Size: 1466 bytes --]

We got a issue: the ext4 writeback process was stuck in do_writepages and
do_writepages kept retrying. However, '-ENOMEM' is returned each time, even
if there is still free memory on the current machine.

We find that the direct cause of this issue is that the bg_inode_table_hi
in the group descriptor is written to an incorrect value, which causes the
inode block found through the inode table to exceed the end_ block。Then,
sb_getblk always returns null, __ext4_get_inode_loc returns `-ENOMEM`,
and do_writepages keeps retrying.

The root cause is that the GDT is overwritten when the backup superblock
is updated in the online resizing process of the disk. The prerequisite is
that the block size of the disk is 1024, bigalloc and meta_bg are enabled,
and sparse_super is disabled.

Therefore, the check on inode_table is added to __ext4_get_inode_loc by
referring to the check on inode_bitmap in ext4_read_inode_bitmap to avoid
infinite loops in similar cases. In addition, the offset of the backup
super block in the group in the above case is also corrected to avoid some
strange problems caused by the GDT being overwritten.

Baokun Li (2):
  ext4: fix GDT corruption after online resizing with bigalloc enable
    and blocksize is 1024
  ext4: add inode table check in __ext4_get_inode_loc to aovid possible
    infinite loop

 fs/ext4/inode.c  | 10 +++++++++-
 fs/ext4/resize.c |  6 +++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024
  2022-08-17 13:26 [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Baokun Li
@ 2022-08-17 13:27 ` Baokun Li
  2022-08-17 13:27 ` [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop Baokun Li
  2022-11-29 21:12 ` [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Theodore Ts'o
  2 siblings, 0 replies; 15+ messages in thread
From: Baokun Li @ 2022-08-17 13:27 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, lczerner, enwlinux,
	linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1

When the backup superblock is updated in update_backups, the offset of
the current superblock in the group (that is, sbi->s_sbh->b_blocknr)
is used as the offset of the backup superblock in the group where the
backup superblock resides.

When blocksize==1024, sbi->s_sbh->b_blocknr is 1. Their block distribution
of groups is {0: 1-8192, 1:8193-16384...}, so the location of the backup
superblock is still the first block in each group.

If bigalloc is enabled at the same time, the block distribution of each
group changes to {0: 0-131071, 1:131072-262143...}. In this case,
update_backups overwrites the second block instead of the first block in
each group that contains backup superblocks with the current superblock.
As a result, both the backup superblock and the backup GDT are incorrect.
This is nothing, after all, the backup GDT is only used when the disk is
repaired.

However, in some cases, this may cause file system corruption, data loss,
and even some programs stuck in the D state. We can easily reproduce this
problem with the following commands:

  mkfs.ext4 -F -O ^resize_inode,^sparse_super,bigalloc -b 1024 /dev/sdb 4M
  mount /dev/sdb /tmp/test
  resize2fs /dev/sdb 4G

This is because the GDT for each meta_bg is placed in its first group. When
sparse_super is disabled, backup superblocks exist in each group. In this
case, the GDT of the new meta_bg obtained by online resizing is corrupt.

To solve this issue, we only need to specify the offset of the backup
superblock in the group to 0 when bigalloc is enabled.

Fixes: d77147ff443b ("ext4: add support for online resizing with bigalloc")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/resize.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index fea2a68d067b..0146a11efd06 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1590,8 +1590,12 @@ static int ext4_flex_group_add(struct super_block *sb,
 				   EXT4_DESC_PER_BLOCK(sb));
 		int meta_bg = ext4_has_feature_meta_bg(sb);
 		sector_t old_gdb = 0;
+		sector_t blk_off = sbi->s_sbh->b_blocknr;
 
-		update_backups(sb, sbi->s_sbh->b_blocknr, (char *)es,
+		if (ext4_has_feature_bigalloc(sb))
+			blk_off = 0;
+
+		update_backups(sb, blk_off, (char *)es,
 			       sizeof(struct ext4_super_block), 0);
 		for (; gdb_num <= gdb_num_end; gdb_num++) {
 			struct buffer_head *gdb_bh;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-17 13:26 [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Baokun Li
  2022-08-17 13:27 ` [PATCH 1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024 Baokun Li
@ 2022-08-17 13:27 ` Baokun Li
  2022-08-17 14:31   ` Jan Kara
  2022-11-29 21:12 ` [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Theodore Ts'o
  2 siblings, 1 reply; 15+ messages in thread
From: Baokun Li @ 2022-08-17 13:27 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, lczerner, enwlinux,
	linux-kernel, yi.zhang, yebin10, yukuai3, libaokun1

In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.

In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
the function returns -ENOMEM.

In __getblk_slow, if the return value of grow_buffers is less than 0,
the function returns NULL.

When the three processes are connected in series like the following stack,
an infinite loop may occur:

do_writepages					<--- keep retrying
 ext4_writepages
  mpage_map_and_submit_extent
   mpage_map_one_extent
    ext4_map_blocks
     ext4_ext_map_blocks
      ext4_ext_handle_unwritten_extents
       ext4_ext_convert_to_initialized
        ext4_split_extent
         ext4_split_extent_at
          __ext4_ext_dirty
           __ext4_mark_inode_dirty
            ext4_reserve_inode_write
             ext4_get_inode_loc
              __ext4_get_inode_loc		<--- return -ENOMEM
               sb_getblk
                __getblk_gfp
                 __getblk_slow			<--- return NULL
                  grow_buffers
                   grow_dev_page		<--- return -ENXIO
                    ret = (block < end_block) ? 1 : -ENXIO;

In this issue, bg_inode_table_hi is overwritten as an incorrect value.
As a result, `block < end_block` cannot be met in grow_dev_page.
Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
keeps retrying. As a result, the writeback process is in the D state due
to an infinite loop.

Add a check on inode table block in the __ext4_get_inode_loc function by
referring to ext4_read_inode_bitmap to avoid this infinite loop.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/inode.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..5e171879fa23 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
 	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
 	inode_offset = ((ino - 1) %
 			EXT4_INODES_PER_GROUP(sb));
-	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
 	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
 
+	block = ext4_inode_table(sb, gdp);
+	if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
+	    (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
+		ext4_error(sb, "Invalid inode table block %llu in "
+			   "block_group %u", block, iloc->block_group);
+		return -EFSCORRUPTED;
+	}
+	block += (inode_offset / inodes_per_block);
+
 	bh = sb_getblk(sb, block);
 	if (unlikely(!bh))
 		return -ENOMEM;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-17 13:27 ` [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop Baokun Li
@ 2022-08-17 14:31   ` Jan Kara
  2022-08-18  1:54     ` Baokun Li
  2022-08-18 14:43     ` Ritesh Harjani (IBM)
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kara @ 2022-08-17 14:31 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3

On Wed 17-08-22 21:27:01, Baokun Li wrote:
> In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> 
> In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> the function returns -ENOMEM.
> 
> In __getblk_slow, if the return value of grow_buffers is less than 0,
> the function returns NULL.
> 
> When the three processes are connected in series like the following stack,
> an infinite loop may occur:
> 
> do_writepages					<--- keep retrying
>  ext4_writepages
>   mpage_map_and_submit_extent
>    mpage_map_one_extent
>     ext4_map_blocks
>      ext4_ext_map_blocks
>       ext4_ext_handle_unwritten_extents
>        ext4_ext_convert_to_initialized
>         ext4_split_extent
>          ext4_split_extent_at
>           __ext4_ext_dirty
>            __ext4_mark_inode_dirty
>             ext4_reserve_inode_write
>              ext4_get_inode_loc
>               __ext4_get_inode_loc		<--- return -ENOMEM
>                sb_getblk
>                 __getblk_gfp
>                  __getblk_slow			<--- return NULL
>                   grow_buffers
>                    grow_dev_page		<--- return -ENXIO
>                     ret = (block < end_block) ? 1 : -ENXIO;
> 
> In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> As a result, `block < end_block` cannot be met in grow_dev_page.
> Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> keeps retrying. As a result, the writeback process is in the D state due
> to an infinite loop.
> 
> Add a check on inode table block in the __ext4_get_inode_loc function by
> referring to ext4_read_inode_bitmap to avoid this infinite loop.
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Thanks for the fixes. Normally, we check that inode table is fine in
ext4_check_descriptors() (and those checks are much stricter) so it seems
unnecessary to check it again here. I understand that in your case it was
resize that corrupted the group descriptor after the filesystem was mounted
which is nasty but there's much more metadata that can be corrupted like
this and it's infeasible to check each metadata block before we use it.

IMHO a proper fix to this class of issues would be for sb_getblk() to
return proper error so that we can distinguish ENOMEM from other errors.
But that will be a larger undertaking...

								Honza

> ---
>  fs/ext4/inode.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 601214453c3a..5e171879fa23 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
>  	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
>  	inode_offset = ((ino - 1) %
>  			EXT4_INODES_PER_GROUP(sb));
> -	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
>  	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
>  
> +	block = ext4_inode_table(sb, gdp);
> +	if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> +	    (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> +		ext4_error(sb, "Invalid inode table block %llu in "
> +			   "block_group %u", block, iloc->block_group);
> +		return -EFSCORRUPTED;
> +	}
> +	block += (inode_offset / inodes_per_block);
> +
>  	bh = sb_getblk(sb, block);
>  	if (unlikely(!bh))
>  		return -ENOMEM;
> -- 
> 2.31.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-17 14:31   ` Jan Kara
@ 2022-08-18  1:54     ` Baokun Li
  2022-08-18 14:43     ` Ritesh Harjani (IBM)
  1 sibling, 0 replies; 15+ messages in thread
From: Baokun Li @ 2022-08-18  1:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ritesh.list, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3, Baokun Li

On 8/17/2022 10:31 PM, Jan Kara wrote:
> On Wed 17-08-22 21:27:01, Baokun Li wrote:
>> In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
>> and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
>>
>> In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
>> the function returns -ENOMEM.
>>
>> In __getblk_slow, if the return value of grow_buffers is less than 0,
>> the function returns NULL.
>>
>> When the three processes are connected in series like the following stack,
>> an infinite loop may occur:
>>
>> do_writepages					<--- keep retrying
>>   ext4_writepages
>>    mpage_map_and_submit_extent
>>     mpage_map_one_extent
>>      ext4_map_blocks
>>       ext4_ext_map_blocks
>>        ext4_ext_handle_unwritten_extents
>>         ext4_ext_convert_to_initialized
>>          ext4_split_extent
>>           ext4_split_extent_at
>>            __ext4_ext_dirty
>>             __ext4_mark_inode_dirty
>>              ext4_reserve_inode_write
>>               ext4_get_inode_loc
>>                __ext4_get_inode_loc		<--- return -ENOMEM
>>                 sb_getblk
>>                  __getblk_gfp
>>                   __getblk_slow			<--- return NULL
>>                    grow_buffers
>>                     grow_dev_page		<--- return -ENXIO
>>                      ret = (block < end_block) ? 1 : -ENXIO;
>>
>> In this issue, bg_inode_table_hi is overwritten as an incorrect value.
>> As a result, `block < end_block` cannot be met in grow_dev_page.
>> Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
>> keeps retrying. As a result, the writeback process is in the D state due
>> to an infinite loop.
>>
>> Add a check on inode table block in the __ext4_get_inode_loc function by
>> referring to ext4_read_inode_bitmap to avoid this infinite loop.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Thanks for the fixes. Normally, we check that inode table is fine in
> ext4_check_descriptors() (and those checks are much stricter) so it seems
Yes, when I first found this problem, I thought that the inode table was 
an abnormal value
when the disk was mounted. However, in ext4_ check_ Descriptors see the 
inspection of inode table.
ext4_ check_ Descriptors is more strict, and it also checks blocks_ 
Bitmap and inode_ bitmap。

However, in addition to mounting, I can't find any changes to 
bg_inode_table_hi  in other parts of
the kernel, which makes me very confused. It wasn't until I tracked the 
address of the variable that
I found it was modified by memcpy.

This check is added here to facilitate us to find problems and avoid 
dead loops. Otherwise,
we may not find problems until the write back process is stuck in the D 
state for several hours.
  At this time, the file system may be in a mess.
> unnecessary to check it again here. I understand that in your case it was
> resize that corrupted the group descriptor after the filesystem was mounted
> which is nasty but there's much more metadata that can be corrupted like
> this and it's infeasible to check each metadata block before we use it.
Indeed, But is it true that checking for inode_bitmap in 
ext4_read_inode_bitmap and
checking for block_bitmap in ext4_read_block_bitmap_nowait also 
unnecessary?
After all, inode_bitmap and block_bitmap may be faulty only when the 
metadata  is corrupted.
I think it seems unreasonable that both block_bitmap and inode_bitmap 
have checks and inode_table does not.
>
> IMHO a proper fix to this class of issues would be for sb_getblk() to
> return proper error so that we can distinguish ENOMEM from other errors.

Totally agree! There is a FIXME in the comments of __getblk_gfp:

` ` `
__getblk_gfp() will lock up the machine if grow_dev_page's
try_to_free_buffers() attempt is failing. FIXME, perhaps?
` ` `

This is the same as this issue because the return value of the 
grow_buffers function is not propagated correctly.

> But that will be a larger undertaking...
>
> 								Honza
Yes, this function can be called in many places, and external interface 
changes are involved.
>> ---
>>   fs/ext4/inode.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
Thanks!
-- 
With Best Regards,
Baokun Li

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-17 14:31   ` Jan Kara
  2022-08-18  1:54     ` Baokun Li
@ 2022-08-18 14:43     ` Ritesh Harjani (IBM)
  2022-08-18 17:23       ` Jan Kara
  2022-11-28 20:44       ` Theodore Ts'o
  1 sibling, 2 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-08-18 14:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Baokun Li, linux-ext4, tytso, adilger.kernel, lczerner, enwlinux,
	linux-kernel, yi.zhang, yebin10, yukuai3

On 22/08/17 04:31PM, Jan Kara wrote:
> On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> >
> > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > the function returns -ENOMEM.
> >
> > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > the function returns NULL.
> >
> > When the three processes are connected in series like the following stack,
> > an infinite loop may occur:
> >
> > do_writepages					<--- keep retrying
> >  ext4_writepages
> >   mpage_map_and_submit_extent
> >    mpage_map_one_extent
> >     ext4_map_blocks
> >      ext4_ext_map_blocks
> >       ext4_ext_handle_unwritten_extents
> >        ext4_ext_convert_to_initialized
> >         ext4_split_extent
> >          ext4_split_extent_at
> >           __ext4_ext_dirty
> >            __ext4_mark_inode_dirty
> >             ext4_reserve_inode_write
> >              ext4_get_inode_loc
> >               __ext4_get_inode_loc		<--- return -ENOMEM
> >                sb_getblk
> >                 __getblk_gfp
> >                  __getblk_slow			<--- return NULL
> >                   grow_buffers
> >                    grow_dev_page		<--- return -ENXIO
> >                     ret = (block < end_block) ? 1 : -ENXIO;
> >
> > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > As a result, `block < end_block` cannot be met in grow_dev_page.
> > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > keeps retrying. As a result, the writeback process is in the D state due
> > to an infinite loop.
> >
> > Add a check on inode table block in the __ext4_get_inode_loc function by
> > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> >
> > Signed-off-by: Baokun Li <libaokun1@huawei.com>
>
> Thanks for the fixes. Normally, we check that inode table is fine in
> ext4_check_descriptors() (and those checks are much stricter) so it seems
> unnecessary to check it again here. I understand that in your case it was
> resize that corrupted the group descriptor after the filesystem was mounted
> which is nasty but there's much more metadata that can be corrupted like
> this and it's infeasible to check each metadata block before we use it.
>
> IMHO a proper fix to this class of issues would be for sb_getblk() to
> return proper error so that we can distinguish ENOMEM from other errors.
> But that will be a larger undertaking...
>

Hi Jan,

How about adding a wrapper around sb_getblk() which will do the basic block
bound checks for ext4. Then we can carefully convert all the callers of
sb_getblk() in ext4 to call ext4_sb_getblk().

ext4_sb_getblk() will then return either of below -
1. ERR_PTR(-EFSCORRUPTED)
2. NULL
3. struct buffer_head*

It's caller can then implement the proper error handling.

Folding a small patch to implement the simple bound check. Is this the right
approach?

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Date: Thu, 18 Aug 2022 07:53:58 -0500
Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking

We might need more bounds checking on the block before calling sb_getblk().
This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
Later we will need to carefully convert the callers to use ext4_sb_getblk()
instead of sb_getblk().
For e.g. __ext4_get_inode_loc()

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/block_validity.c |  4 +---
 fs/ext4/ext4.h           | 12 ++++++++++++
 fs/ext4/super.c          |  9 +++++++++
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 5504f72bbbbe..69347fab7c38 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
 	struct rb_node *n;
 	int ret = 1;

-	if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
-	    (start_blk + count < start_blk) ||
-	    (start_blk + count > ext4_blocks_count(sbi->s_es)))
+	if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
 		return 0;

 	/*
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..79d0e45185d3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
 					 sector_t block, blk_opf_t op_flags);
 extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
 						   sector_t block);
+extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
+					  sector_t block);
 extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
 				bh_end_io_t *end_io);
 extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
@@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
 	return 1 << sbi->s_log_groups_per_flex;
 }

+static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
+					sector_t start_blk, unsigned int count)
+{
+	if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
+	    (start_blk + count < start_blk) ||
+	    (start_blk + count > ext4_blocks_count(es)))
+		return false;
+	return true;
+}
+
 #define ext4_std_error(sb, errno)				\
 do {								\
 	if ((errno))						\
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..5b29272ad9a9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
 	}
 }

+struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
+{
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+	if (!ext4_sb_block_valid_fastcheck(es, block, 1))
+		return ERR_PTR(-EFSCORRUPTED);
+	return sb_getblk(sb, block);
+}
+
 static int ext4_verify_csum_type(struct super_block *sb,
 				 struct ext4_super_block *es)
 {
--
2.25.1

-ritesh



> 								Honza
>
> > ---
> >  fs/ext4/inode.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 601214453c3a..5e171879fa23 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> >  	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> >  	inode_offset = ((ino - 1) %
> >  			EXT4_INODES_PER_GROUP(sb));
> > -	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> >  	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> >
> > +	block = ext4_inode_table(sb, gdp);
> > +	if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > +	    (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > +		ext4_error(sb, "Invalid inode table block %llu in "
> > +			   "block_group %u", block, iloc->block_group);
> > +		return -EFSCORRUPTED;
> > +	}
> > +	block += (inode_offset / inodes_per_block);
> > +
> >  	bh = sb_getblk(sb, block);
> >  	if (unlikely(!bh))
> >  		return -ENOMEM;
> > --
> > 2.31.1
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-18 14:43     ` Ritesh Harjani (IBM)
@ 2022-08-18 17:23       ` Jan Kara
  2022-08-18 23:15         ` Ritesh Harjani (IBM)
  2022-11-28 20:44       ` Theodore Ts'o
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2022-08-18 17:23 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Jan Kara, Baokun Li, linux-ext4, tytso, adilger.kernel, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3

On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> On 22/08/17 04:31PM, Jan Kara wrote:
> > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > >
> > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > the function returns -ENOMEM.
> > >
> > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > the function returns NULL.
> > >
> > > When the three processes are connected in series like the following stack,
> > > an infinite loop may occur:
> > >
> > > do_writepages					<--- keep retrying
> > >  ext4_writepages
> > >   mpage_map_and_submit_extent
> > >    mpage_map_one_extent
> > >     ext4_map_blocks
> > >      ext4_ext_map_blocks
> > >       ext4_ext_handle_unwritten_extents
> > >        ext4_ext_convert_to_initialized
> > >         ext4_split_extent
> > >          ext4_split_extent_at
> > >           __ext4_ext_dirty
> > >            __ext4_mark_inode_dirty
> > >             ext4_reserve_inode_write
> > >              ext4_get_inode_loc
> > >               __ext4_get_inode_loc		<--- return -ENOMEM
> > >                sb_getblk
> > >                 __getblk_gfp
> > >                  __getblk_slow			<--- return NULL
> > >                   grow_buffers
> > >                    grow_dev_page		<--- return -ENXIO
> > >                     ret = (block < end_block) ? 1 : -ENXIO;
> > >
> > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > keeps retrying. As a result, the writeback process is in the D state due
> > > to an infinite loop.
> > >
> > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > >
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> >
> > Thanks for the fixes. Normally, we check that inode table is fine in
> > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > unnecessary to check it again here. I understand that in your case it was
> > resize that corrupted the group descriptor after the filesystem was mounted
> > which is nasty but there's much more metadata that can be corrupted like
> > this and it's infeasible to check each metadata block before we use it.
> >
> > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > return proper error so that we can distinguish ENOMEM from other errors.
> > But that will be a larger undertaking...
> >
> 
> Hi Jan,
> 
> How about adding a wrapper around sb_getblk() which will do the basic block
> bound checks for ext4. Then we can carefully convert all the callers of
> sb_getblk() in ext4 to call ext4_sb_getblk().
> 
> ext4_sb_getblk() will then return either of below -
> 1. ERR_PTR(-EFSCORRUPTED)
> 2. NULL
> 3. struct buffer_head*
> 
> It's caller can then implement the proper error handling.
> 
> Folding a small patch to implement the simple bound check. Is this the right
> approach?

Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
pointer.

								Honza

> 
> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> Date: Thu, 18 Aug 2022 07:53:58 -0500
> Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> 
> We might need more bounds checking on the block before calling sb_getblk().
> This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> Later we will need to carefully convert the callers to use ext4_sb_getblk()
> instead of sb_getblk().
> For e.g. __ext4_get_inode_loc()
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/block_validity.c |  4 +---
>  fs/ext4/ext4.h           | 12 ++++++++++++
>  fs/ext4/super.c          |  9 +++++++++
>  3 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
> index 5504f72bbbbe..69347fab7c38 100644
> --- a/fs/ext4/block_validity.c
> +++ b/fs/ext4/block_validity.c
> @@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
>  	struct rb_node *n;
>  	int ret = 1;
> 
> -	if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> -	    (start_blk + count < start_blk) ||
> -	    (start_blk + count > ext4_blocks_count(sbi->s_es)))
> +	if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
>  		return 0;
> 
>  	/*
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9bca5565547b..79d0e45185d3 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
>  					 sector_t block, blk_opf_t op_flags);
>  extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>  						   sector_t block);
> +extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
> +					  sector_t block);
>  extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
>  				bh_end_io_t *end_io);
>  extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> @@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
>  	return 1 << sbi->s_log_groups_per_flex;
>  }
> 
> +static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
> +					sector_t start_blk, unsigned int count)
> +{
> +	if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
> +	    (start_blk + count < start_blk) ||
> +	    (start_blk + count > ext4_blocks_count(es)))
> +		return false;
> +	return true;
> +}
> +
>  #define ext4_std_error(sb, errno)				\
>  do {								\
>  	if ((errno))						\
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..5b29272ad9a9 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
>  	}
>  }
> 
> +struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
> +{
> +	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> +
> +	if (!ext4_sb_block_valid_fastcheck(es, block, 1))
> +		return ERR_PTR(-EFSCORRUPTED);
> +	return sb_getblk(sb, block);
> +}
> +
>  static int ext4_verify_csum_type(struct super_block *sb,
>  				 struct ext4_super_block *es)
>  {
> --
> 2.25.1
> 
> -ritesh
> 
> 
> 
> > 								Honza
> >
> > > ---
> > >  fs/ext4/inode.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 601214453c3a..5e171879fa23 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> > >  	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > >  	inode_offset = ((ino - 1) %
> > >  			EXT4_INODES_PER_GROUP(sb));
> > > -	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> > >  	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> > >
> > > +	block = ext4_inode_table(sb, gdp);
> > > +	if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > > +	    (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > > +		ext4_error(sb, "Invalid inode table block %llu in "
> > > +			   "block_group %u", block, iloc->block_group);
> > > +		return -EFSCORRUPTED;
> > > +	}
> > > +	block += (inode_offset / inodes_per_block);
> > > +
> > >  	bh = sb_getblk(sb, block);
> > >  	if (unlikely(!bh))
> > >  		return -ENOMEM;
> > > --
> > > 2.31.1
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-18 17:23       ` Jan Kara
@ 2022-08-18 23:15         ` Ritesh Harjani (IBM)
  2022-08-19  8:44           ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-08-18 23:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Baokun Li, linux-ext4, tytso, adilger.kernel, lczerner, enwlinux,
	linux-kernel, yi.zhang, yebin10, yukuai3

On 22/08/18 07:23PM, Jan Kara wrote:
> On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> > On 22/08/17 04:31PM, Jan Kara wrote:
> > > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > > >
> > > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > > the function returns -ENOMEM.
> > > >
> > > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > > the function returns NULL.
> > > >
> > > > When the three processes are connected in series like the following stack,
> > > > an infinite loop may occur:
> > > >
> > > > do_writepages					<--- keep retrying
> > > >  ext4_writepages
> > > >   mpage_map_and_submit_extent
> > > >    mpage_map_one_extent
> > > >     ext4_map_blocks
> > > >      ext4_ext_map_blocks
> > > >       ext4_ext_handle_unwritten_extents
> > > >        ext4_ext_convert_to_initialized
> > > >         ext4_split_extent
> > > >          ext4_split_extent_at
> > > >           __ext4_ext_dirty
> > > >            __ext4_mark_inode_dirty
> > > >             ext4_reserve_inode_write
> > > >              ext4_get_inode_loc
> > > >               __ext4_get_inode_loc		<--- return -ENOMEM
> > > >                sb_getblk
> > > >                 __getblk_gfp
> > > >                  __getblk_slow			<--- return NULL
> > > >                   grow_buffers
> > > >                    grow_dev_page		<--- return -ENXIO
> > > >                     ret = (block < end_block) ? 1 : -ENXIO;
> > > >
> > > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > > keeps retrying. As a result, the writeback process is in the D state due
> > > > to an infinite loop.
> > > >
> > > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > > >
> > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > >
> > > Thanks for the fixes. Normally, we check that inode table is fine in
> > > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > > unnecessary to check it again here. I understand that in your case it was
> > > resize that corrupted the group descriptor after the filesystem was mounted
> > > which is nasty but there's much more metadata that can be corrupted like
> > > this and it's infeasible to check each metadata block before we use it.
> > >
> > > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > > return proper error so that we can distinguish ENOMEM from other errors.
> > > But that will be a larger undertaking...
> > >
> >
> > Hi Jan,
> >
> > How about adding a wrapper around sb_getblk() which will do the basic block
> > bound checks for ext4. Then we can carefully convert all the callers of
> > sb_getblk() in ext4 to call ext4_sb_getblk().
> >
> > ext4_sb_getblk() will then return either of below -
> > 1. ERR_PTR(-EFSCORRUPTED)
> > 2. NULL
> > 3. struct buffer_head*
> >
> > It's caller can then implement the proper error handling.
> >
> > Folding a small patch to implement the simple bound check. Is this the right
> > approach?
>
> Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
> or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
> pointer.

Sure, Thanks Jan. Will do that once I clear some confusion w.r.t
	"start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)"

At some places this is checked with "<= s_first_data_block"
	e.g. fs/ext4/ialloc.c, ext4_sb_block_valid()

while at some places I see it to be "< s_first_data_block"
	e.g. fs/ext4/mballoc.c, fs/ext4/mmp.c

Will spend sometime to understand why the difference and if there is anything
I might miss here for off-by-one check.

Adding more to the confusion would be difference w.r.t blocksize = 1024 v/s
other blocksizes. Based on the blocksize value I guess, s_first_data_block can
be different (0/1??). Or can bigalloc can change this...
...Will look more into this.

-ritesh

>
> 								Honza
>
> >
> > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> > Date: Thu, 18 Aug 2022 07:53:58 -0500
> > Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> >
> > We might need more bounds checking on the block before calling sb_getblk().
> > This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> > Later we will need to carefully convert the callers to use ext4_sb_getblk()
> > instead of sb_getblk().
> > For e.g. __ext4_get_inode_loc()
> >
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > ---
> >  fs/ext4/block_validity.c |  4 +---
> >  fs/ext4/ext4.h           | 12 ++++++++++++
> >  fs/ext4/super.c          |  9 +++++++++
> >  3 files changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
> > index 5504f72bbbbe..69347fab7c38 100644
> > --- a/fs/ext4/block_validity.c
> > +++ b/fs/ext4/block_validity.c
> > @@ -301,9 +301,7 @@ int ext4_sb_block_valid(struct super_block *sb, struct inode *inode,
> >  	struct rb_node *n;
> >  	int ret = 1;
> >
> > -	if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> > -	    (start_blk + count < start_blk) ||
> > -	    (start_blk + count > ext4_blocks_count(sbi->s_es)))
> > +	if (!ext4_sb_block_valid_fastcheck(sbi->s_es, start_blk, count))
> >  		return 0;
> >
> >  	/*
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 9bca5565547b..79d0e45185d3 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -3072,6 +3072,8 @@ extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
> >  					 sector_t block, blk_opf_t op_flags);
> >  extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
> >  						   sector_t block);
> > +extern struct buffer_head *ext4_sb_getblk(struct super_block *sb,
> > +					  sector_t block);
> >  extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> >  				bh_end_io_t *end_io);
> >  extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> > @@ -3358,6 +3360,16 @@ static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
> >  	return 1 << sbi->s_log_groups_per_flex;
> >  }
> >
> > +static inline bool ext4_sb_block_valid_fastcheck(struct ext4_super_block *es,
> > +					sector_t start_blk, unsigned int count)
> > +{
> > +	if ((start_blk <= le32_to_cpu(es->s_first_data_block)) ||
> > +	    (start_blk + count < start_blk) ||
> > +	    (start_blk + count > ext4_blocks_count(es)))
> > +		return false;
> > +	return true;
> > +}
> > +
> >  #define ext4_std_error(sb, errno)				\
> >  do {								\
> >  	if ((errno))						\
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 9a66abcca1a8..5b29272ad9a9 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -269,6 +269,15 @@ void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
> >  	}
> >  }
> >
> > +struct buffer_head *ext4_sb_getblk(struct super_block *sb, sector_t block)
> > +{
> > +	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> > +
> > +	if (!ext4_sb_block_valid_fastcheck(es, block, 1))
> > +		return ERR_PTR(-EFSCORRUPTED);
> > +	return sb_getblk(sb, block);
> > +}
> > +
> >  static int ext4_verify_csum_type(struct super_block *sb,
> >  				 struct ext4_super_block *es)
> >  {
> > --
> > 2.25.1
> >
> > -ritesh
> >
> >
> >
> > > 								Honza
> > >
> > > > ---
> > > >  fs/ext4/inode.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 601214453c3a..5e171879fa23 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -4466,9 +4466,17 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> > > >  	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > > >  	inode_offset = ((ino - 1) %
> > > >  			EXT4_INODES_PER_GROUP(sb));
> > > > -	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
> > > >  	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> > > >
> > > > +	block = ext4_inode_table(sb, gdp);
> > > > +	if ((block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block)) ||
> > > > +	    (block >= ext4_blocks_count(EXT4_SB(sb)->s_es))) {
> > > > +		ext4_error(sb, "Invalid inode table block %llu in "
> > > > +			   "block_group %u", block, iloc->block_group);
> > > > +		return -EFSCORRUPTED;
> > > > +	}
> > > > +	block += (inode_offset / inodes_per_block);
> > > > +
> > > >  	bh = sb_getblk(sb, block);
> > > >  	if (unlikely(!bh))
> > > >  		return -ENOMEM;
> > > > --
> > > > 2.31.1
> > > >
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-18 23:15         ` Ritesh Harjani (IBM)
@ 2022-08-19  8:44           ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2022-08-19  8:44 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Jan Kara, Baokun Li, linux-ext4, tytso, adilger.kernel, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3

On Fri 19-08-22 04:45:41, Ritesh Harjani (IBM) wrote:
> On 22/08/18 07:23PM, Jan Kara wrote:
> > On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> > > On 22/08/17 04:31PM, Jan Kara wrote:
> > > > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > > > In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
> > > > > and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
> > > > >
> > > > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > > > the function returns -ENOMEM.
> > > > >
> > > > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > > > the function returns NULL.
> > > > >
> > > > > When the three processes are connected in series like the following stack,
> > > > > an infinite loop may occur:
> > > > >
> > > > > do_writepages					<--- keep retrying
> > > > >  ext4_writepages
> > > > >   mpage_map_and_submit_extent
> > > > >    mpage_map_one_extent
> > > > >     ext4_map_blocks
> > > > >      ext4_ext_map_blocks
> > > > >       ext4_ext_handle_unwritten_extents
> > > > >        ext4_ext_convert_to_initialized
> > > > >         ext4_split_extent
> > > > >          ext4_split_extent_at
> > > > >           __ext4_ext_dirty
> > > > >            __ext4_mark_inode_dirty
> > > > >             ext4_reserve_inode_write
> > > > >              ext4_get_inode_loc
> > > > >               __ext4_get_inode_loc		<--- return -ENOMEM
> > > > >                sb_getblk
> > > > >                 __getblk_gfp
> > > > >                  __getblk_slow			<--- return NULL
> > > > >                   grow_buffers
> > > > >                    grow_dev_page		<--- return -ENXIO
> > > > >                     ret = (block < end_block) ? 1 : -ENXIO;
> > > > >
> > > > > In this issue, bg_inode_table_hi is overwritten as an incorrect value.
> > > > > As a result, `block < end_block` cannot be met in grow_dev_page.
> > > > > Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
> > > > > keeps retrying. As a result, the writeback process is in the D state due
> > > > > to an infinite loop.
> > > > >
> > > > > Add a check on inode table block in the __ext4_get_inode_loc function by
> > > > > referring to ext4_read_inode_bitmap to avoid this infinite loop.
> > > > >
> > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > >
> > > > Thanks for the fixes. Normally, we check that inode table is fine in
> > > > ext4_check_descriptors() (and those checks are much stricter) so it seems
> > > > unnecessary to check it again here. I understand that in your case it was
> > > > resize that corrupted the group descriptor after the filesystem was mounted
> > > > which is nasty but there's much more metadata that can be corrupted like
> > > > this and it's infeasible to check each metadata block before we use it.
> > > >
> > > > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > > > return proper error so that we can distinguish ENOMEM from other errors.
> > > > But that will be a larger undertaking...
> > > >
> > >
> > > Hi Jan,
> > >
> > > How about adding a wrapper around sb_getblk() which will do the basic block
> > > bound checks for ext4. Then we can carefully convert all the callers of
> > > sb_getblk() in ext4 to call ext4_sb_getblk().
> > >
> > > ext4_sb_getblk() will then return either of below -
> > > 1. ERR_PTR(-EFSCORRUPTED)
> > > 2. NULL
> > > 3. struct buffer_head*
> > >
> > > It's caller can then implement the proper error handling.
> > >
> > > Folding a small patch to implement the simple bound check. Is this the right
> > > approach?
> >
> > Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return bh
> > or ERR_PTR so something like ERR_PTR(-EFSCORRUPTED), ERR_PTR(-ENXIO), or bh
> > pointer.
> 
> Sure, Thanks Jan. Will do that once I clear some confusion w.r.t
> 	"start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)"
> 
> At some places this is checked with "<= s_first_data_block"
> 	e.g. fs/ext4/ialloc.c, ext4_sb_block_valid()
> 
> while at some places I see it to be "< s_first_data_block"
> 	e.g. fs/ext4/mballoc.c, fs/ext4/mmp.c

Well, superblock is stored at s_first_data_block offset. So strictly
speaking the check should be < s_first_data_block because that block is a
valid filesystem block. OTOH in most places you are not supposed to look at
block with the superblock so stricter <= s_first_data_block is fine.

> Will spend sometime to understand why the difference and if there is anything
> I might miss here for off-by-one check.
> 
> Adding more to the confusion would be difference w.r.t blocksize = 1024 v/s
> other blocksizes. Based on the blocksize value I guess, s_first_data_block can
> be different (0/1??). Or can bigalloc can change this...
> ...Will look more into this.

That's what we discussed on our ext4 call yesterday. Normally,
s_first_data_block is 1 for blocksize 1k and 0 for blocksize > 1k. So for
1k blocksize the first group begins with block 1 and not block 0.
Effectively the whole filesystem is shifted by 1 block with 1k blocksize.
When bigalloc comes to play and blocksize is 1k, things are even more
interesting because there, the first group starts at block 0 while
s_first_data_block is still 1, because superblock is stored in block 1 of
cluster 0.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-08-18 14:43     ` Ritesh Harjani (IBM)
  2022-08-18 17:23       ` Jan Kara
@ 2022-11-28 20:44       ` Theodore Ts'o
  2022-11-29  8:54         ` Ritesh Harjani (IBM)
  1 sibling, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2022-11-28 20:44 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Jan Kara, Baokun Li, linux-ext4, adilger.kernel, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3

On Thu, Aug 18, 2022 at 08:13:53PM +0530, Ritesh Harjani (IBM) wrote:
> Folding a small patch to implement the simple bound check. Is this the right
> approach?
> 
> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> Date: Thu, 18 Aug 2022 07:53:58 -0500
> Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> 
> We might need more bounds checking on the block before calling sb_getblk().
> This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> Later we will need to carefully convert the callers to use ext4_sb_getblk()
> instead of sb_getblk().

Hey Ritesh,

I was going through some old patches and came across this RFC patch.
Have you had a chance to polish this up?  I don't think I've seen a
newer version of this patch, but maybe I missed it.

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
  2022-11-28 20:44       ` Theodore Ts'o
@ 2022-11-29  8:54         ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-11-29  8:54 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Baokun Li, linux-ext4, adilger.kernel, lczerner,
	enwlinux, linux-kernel, yi.zhang, yebin10, yukuai3

On 22/11/28 03:44PM, Theodore Ts'o wrote:
> On Thu, Aug 18, 2022 at 08:13:53PM +0530, Ritesh Harjani (IBM) wrote:
> > Folding a small patch to implement the simple bound check. Is this the right
> > approach?
> >
> > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> > Date: Thu, 18 Aug 2022 07:53:58 -0500
> > Subject: [RFC] ext4: Add ext4_sb_getblk() wrapper for block bounds checking
> >
> > We might need more bounds checking on the block before calling sb_getblk().
> > This helper does that and if it is not valid then returns ERR_PTR(-EFSCORRUPTED)
> > Later we will need to carefully convert the callers to use ext4_sb_getblk()
> > instead of sb_getblk().
>
> Hey Ritesh,
>
> I was going through some old patches and came across this RFC patch.
> Have you had a chance to polish this up?  I don't think I've seen a
> newer version of this patch, but maybe I missed it.

Hello Ted,

Sorry, I guess I completely forgot about this, since around those dates
I was travelling. I will definitely try and get back to this once I finish
few ongoing things on my plate.

Thanks for bringing it up!

-ritesh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing
  2022-08-17 13:26 [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Baokun Li
  2022-08-17 13:27 ` [PATCH 1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024 Baokun Li
  2022-08-17 13:27 ` [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop Baokun Li
@ 2022-11-29 21:12 ` Theodore Ts'o
  2022-11-30  2:08   ` Baokun Li
  2 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2022-11-29 21:12 UTC (permalink / raw)
  To: linux-ext4, Baokun Li
  Cc: Theodore Ts'o, linux-kernel, enwlinux, jack, lczerner,
	yukuai3, yi.zhang, yebin10, ritesh.list, adilger.kernel

On Wed, 17 Aug 2022 21:26:59 +0800, Baokun Li wrote:
> We got a issue: the ext4 writeback process was stuck in do_writepages and
> do_writepages kept retrying. However, '-ENOMEM' is returned each time, even
> if there is still free memory on the current machine.
> 
> We find that the direct cause of this issue is that the bg_inode_table_hi
> in the group descriptor is written to an incorrect value, which causes the
> inode block found through the inode table to exceed the end_ block。Then,
> sb_getblk always returns null, __ext4_get_inode_loc returns `-ENOMEM`,
> and do_writepages keeps retrying.
> 
> [...]

Applied, thanks!

[1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024
      commit: 496fb12f8e236f303de6bc73a0334dd50c4eb64a
[2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
      commit: bfb0625e8e86f8797264b1c7d10e146afe243d23

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing
  2022-11-29 21:12 ` [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Theodore Ts'o
@ 2022-11-30  2:08   ` Baokun Li
  2022-12-01  3:42     ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Baokun Li @ 2022-11-30  2:08 UTC (permalink / raw)
  To: Theodore Ts'o, linux-ext4
  Cc: linux-kernel, enwlinux, jack, lczerner, yukuai3, yi.zhang,
	yebin10, ritesh.list, adilger.kernel

On 2022/11/30 5:12, Theodore Ts'o wrote:
> On Wed, 17 Aug 2022 21:26:59 +0800, Baokun Li wrote:
>> We got a issue: the ext4 writeback process was stuck in do_writepages and
>> do_writepages kept retrying. However, '-ENOMEM' is returned each time, even
>> if there is still free memory on the current machine.
>>
>> We find that the direct cause of this issue is that the bg_inode_table_hi
>> in the group descriptor is written to an incorrect value, which causes the
>> inode block found through the inode table to exceed the end_ block。Then,
>> sb_getblk always returns null, __ext4_get_inode_loc returns `-ENOMEM`,
>> and do_writepages keeps retrying.
>>
>> [...]
> Applied, thanks!
>
> [1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024
>        commit: 496fb12f8e236f303de6bc73a0334dd50c4eb64a
> [2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop
>        commit: bfb0625e8e86f8797264b1c7d10e146afe243d23
>
> Best regards,

Hi Theodore,

Thank you very much for applying this patch set!

But I thought this patch set was discarded because there was no 
"Reviewed-by".
And a few days ago, I came up with a better solution to the problem 
fixed by PATCH 1/2.
The new solution is called "ext4: fix corruption when online resizing a 
1K bigalloc fs", which
is in another patch set  ("[PATCH v3 0/3] ext4: fix some bugs in online 
resize") that fixes
some online resize problems. This patch set has been reviewed, and I 
would appreciate it if
you could revert PATCH 1/2 and apply the patch set containing the new 
solution.

Sorry for wasting your time without stating that a new solution is 
available after the old patch.

Thanks again!
-- 
With Best Regards,
Baokun Li

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing
  2022-11-30  2:08   ` Baokun Li
@ 2022-12-01  3:42     ` Theodore Ts'o
  2022-12-01  6:26       ` Baokun Li
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2022-12-01  3:42 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, linux-kernel, enwlinux, jack, lczerner, yukuai3,
	yi.zhang, yebin10, ritesh.list, adilger.kernel

On Wed, Nov 30, 2022 at 10:08:13AM +0800, Baokun Li wrote:
> 
> But I thought this patch set was discarded because there was no
> "Reviewed-by".

No, it had just slipped through the cracks; and when I was going back
through patchwork to see what patches were outstanding, I came across
the patch set which had gotten missed during the last cycle; and I
didn't notice that it had been superceded with a newer patch which
achieved the same goal.

I'll drop it from the ext4 tree and then pick up your newer series.

          	     	      	   	- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing
  2022-12-01  3:42     ` Theodore Ts'o
@ 2022-12-01  6:26       ` Baokun Li
  0 siblings, 0 replies; 15+ messages in thread
From: Baokun Li @ 2022-12-01  6:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4, linux-kernel, enwlinux, jack, lczerner, yukuai3,
	yi.zhang, yebin10, ritesh.list, adilger.kernel

On 2022/12/1 11:42, Theodore Ts'o wrote:
> On Wed, Nov 30, 2022 at 10:08:13AM +0800, Baokun Li wrote:
>> But I thought this patch set was discarded because there was no
>> "Reviewed-by".
> No, it had just slipped through the cracks; and when I was going back
> through patchwork to see what patches were outstanding, I came across
> the patch set which had gotten missed during the last cycle; and I
> didn't notice that it had been superceded with a newer patch which
> achieved the same goal.
>
> I'll drop it from the ext4 tree and then pick up your newer series.
>
>            	     	      	   	- Ted
>
Thank you very much! 😉

-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-12-01  6:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-17 13:26 [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Baokun Li
2022-08-17 13:27 ` [PATCH 1/2] ext4: fix GDT corruption after online resizing with bigalloc enable and blocksize is 1024 Baokun Li
2022-08-17 13:27 ` [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop Baokun Li
2022-08-17 14:31   ` Jan Kara
2022-08-18  1:54     ` Baokun Li
2022-08-18 14:43     ` Ritesh Harjani (IBM)
2022-08-18 17:23       ` Jan Kara
2022-08-18 23:15         ` Ritesh Harjani (IBM)
2022-08-19  8:44           ` Jan Kara
2022-11-28 20:44       ` Theodore Ts'o
2022-11-29  8:54         ` Ritesh Harjani (IBM)
2022-11-29 21:12 ` [PATCH 0/2] ext4: fix a infinite loop in do_writepages after online resizing Theodore Ts'o
2022-11-30  2:08   ` Baokun Li
2022-12-01  3:42     ` Theodore Ts'o
2022-12-01  6:26       ` Baokun Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).