linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] ext4: fix WARNING in ext4_add_complete_io
@ 2023-03-16 11:28 Baokun Li
  2023-03-16 11:28 ` [PATCH 1/3] ext4: correct error ctx->mask_s_##name in ctx_set_##name Baokun Li
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Baokun Li @ 2023-03-16 11:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang,
	yangerkun, yukuai3, libaokun1

Baokun Li (3):
  ext4: correct error ctx->mask_s_##name in ctx_set_##name
  ext4: add helper to check if flag is changed by ctx
  ext4: fix race between writepages and remount

 fs/ext4/ext4.h      |  3 ++-
 fs/ext4/ext4_jbd2.h |  9 +++++----
 fs/ext4/super.c     | 33 +++++++++++++++++++++++++++++----
 3 files changed, 36 insertions(+), 9 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] ext4: correct error ctx->mask_s_##name in ctx_set_##name
  2023-03-16 11:28 [PATCH 0/3] ext4: fix WARNING in ext4_add_complete_io Baokun Li
@ 2023-03-16 11:28 ` Baokun Li
  2023-03-16 11:28 ` [PATCH 2/3] ext4: add helper to check if flag is changed by ctx Baokun Li
  2023-03-16 11:28 ` [PATCH 3/3] ext4: fix race between writepages and remount Baokun Li
  2 siblings, 0 replies; 8+ messages in thread
From: Baokun Li @ 2023-03-16 11:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang,
	yangerkun, yukuai3, libaokun1

We should only save the flag to be cleared in ctx->mask_s_##name.

Fixes: 6e47a3cc68fc ("ext4: get rid of super block and sbi from handle_mount_ops()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f43e526112ae..5b4a323c218b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2023,7 +2023,7 @@ static int ext4_parse_test_dummy_encryption(const struct fs_parameter *param,
 static inline void ctx_set_##name(struct ext4_fs_context *ctx,		\
 				  unsigned long flag)			\
 {									\
-	ctx->mask_s_##name |= flag;					\
+	ctx->mask_s_##name &= ~flag;					\
 	ctx->vals_s_##name |= flag;					\
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] ext4: add helper to check if flag is changed by ctx
  2023-03-16 11:28 [PATCH 0/3] ext4: fix WARNING in ext4_add_complete_io Baokun Li
  2023-03-16 11:28 ` [PATCH 1/3] ext4: correct error ctx->mask_s_##name in ctx_set_##name Baokun Li
@ 2023-03-16 11:28 ` Baokun Li
  2023-03-16 11:28 ` [PATCH 3/3] ext4: fix race between writepages and remount Baokun Li
  2 siblings, 0 replies; 8+ messages in thread
From: Baokun Li @ 2023-03-16 11:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang,
	yangerkun, yukuai3, libaokun1

Add helper to check if flag will be changed, return 1 means it will be
set, -1 means it will be cleared, 0 means it will not be changed.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/super.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5b4a323c218b..fefcd42f34ea 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2042,13 +2042,26 @@ ctx_test_##name(struct ext4_fs_context *ctx, unsigned long flag)	\
 	return (ctx->vals_s_##name & flag);				\
 }
 
+#define EXT4_CHANGED_CTX(name)						\
+static inline int							\
+ctx_changed_##name(struct ext4_fs_context *ctx, unsigned long flag)	\
+{									\
+	if (ctx->vals_s_##name & flag)                                  \
+		return 1;						\
+	if (ctx->mask_s_##name & flag)					\
+		return -1;						\
+	return 0;							\
+}
+
 EXT4_SET_CTX(flags); /* set only */
 EXT4_SET_CTX(mount_opt);
 EXT4_CLEAR_CTX(mount_opt);
 EXT4_TEST_CTX(mount_opt);
+EXT4_CHANGED_CTX(mount_opt);
 EXT4_SET_CTX(mount_opt2);
 EXT4_CLEAR_CTX(mount_opt2);
 EXT4_TEST_CTX(mount_opt2);
+EXT4_CHANGED_CTX(mount_opt2);
 
 static inline void ctx_set_mount_flag(struct ext4_fs_context *ctx, int bit)
 {
@@ -2537,9 +2550,7 @@ static int ext4_check_quota_consistency(struct fs_context *fc,
 
 	quota_flags = EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
 		      EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA;
-	if (quota_loaded &&
-	    ctx->mask_s_mount_opt & quota_flags &&
-	    !ctx_test_mount_opt(ctx, quota_flags))
+	if (quota_loaded && (ctx_changed_mount_opt(ctx, quota_flags) < 0))
 		goto err_quota_change;
 
 	if (ctx->spec & EXT4_SPEC_JQUOTA) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] ext4: fix race between writepages and remount
  2023-03-16 11:28 [PATCH 0/3] ext4: fix WARNING in ext4_add_complete_io Baokun Li
  2023-03-16 11:28 ` [PATCH 1/3] ext4: correct error ctx->mask_s_##name in ctx_set_##name Baokun Li
  2023-03-16 11:28 ` [PATCH 2/3] ext4: add helper to check if flag is changed by ctx Baokun Li
@ 2023-03-16 11:28 ` Baokun Li
  2023-03-23 11:44   ` Jan Kara
  2 siblings, 1 reply; 8+ messages in thread
From: Baokun Li @ 2023-03-16 11:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, ritesh.list, linux-kernel, yi.zhang,
	yangerkun, yukuai3, libaokun1, stable

We got a WARNING in ext4_add_complete_io:
==================================================================
 WARNING: at fs/ext4/page-io.c:231 ext4_put_io_end_defer+0x182/0x250
 CPU: 10 PID: 77 Comm: ksoftirqd/10 Tainted: 6.3.0-rc2 #85
 RIP: 0010:ext4_put_io_end_defer+0x182/0x250 [ext4]
 [...]
 Call Trace:
  <TASK>
  ext4_end_bio+0xa8/0x240 [ext4]
  bio_endio+0x195/0x310
  blk_update_request+0x184/0x770
  scsi_end_request+0x2f/0x240
  scsi_io_completion+0x75/0x450
  scsi_finish_command+0xef/0x160
  scsi_complete+0xa3/0x180
  blk_complete_reqs+0x60/0x80
  blk_done_softirq+0x25/0x40
  __do_softirq+0x119/0x4c8
  run_ksoftirqd+0x42/0x70
  smpboot_thread_fn+0x136/0x3c0
  kthread+0x140/0x1a0
  ret_from_fork+0x2c/0x50
==================================================================

Above issue may happen as follows:
           cpu1                           cpu2
______________________________|_____________________________
mount -o dioread_lock
ext4_writepages
 ext4_do_writepages
  *if (ext4_should_dioread_nolock(inode))*
    // rsv_blocks is not assigned here
                                 mount -o remount,dioread_nolock
  ext4_journal_start_with_reserve
   __ext4_journal_start
    __ext4_journal_start_sb
     jbd2__journal_start
      *if (rsv_blocks)*
        // h_rsv_handle is not initialized here
  mpage_map_and_submit_extent
    mpage_map_one_extent
      dioread_nolock = ext4_should_dioread_nolock(inode)
      if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN))
        mpd->io_submit.io_end->handle = handle->h_rsv_handle
        ext4_set_io_unwritten_flag
          io_end->flag |= EXT4_IO_END_UNWRITTEN
      // now io_end->handle is NULL but has EXT4_IO_END_UNWRITTEN flag

scsi_finish_command
 scsi_io_completion
  scsi_io_completion_action
   scsi_end_request
    blk_update_request
     req_bio_endio
      bio_endio
       bio->bi_end_io  > ext4_end_bio
        ext4_put_io_end_defer
	 ext4_add_complete_io
	  // trigger WARN_ON(!io_end->handle && sbi->s_journal);

The immediate cause of this problem is that ext4_should_dioread_nolock()
function returns inconsistent values in the ext4_do_writepages() and
mpage_map_one_extent(). There are four conditions in this function that
can be changed at mount time to cause this problem. These four conditions
can be divided into two categories:
    (1) journal_data and EXT4_EXTENTS_FL, which can be changed by ioctl
    (2) DELALLOC and DIOREAD_NOLOCK, which can be changed by remount
The two in the first category have been fixed by commit c8585c6fcaf2
("ext4: fix races between changing inode journal mode and ext4_writepages")
and commit cb85f4d23f79 ("ext4: fix race between writepages and enabling
EXT4_EXTENTS_FL") respectively.
Two cases in the other category have not yet been fixed, and the above
issue is caused by this situation. We refer to the fix for the first
category, When DELALLOC or DIOREAD_NOLOCK is detected to be changed
during remount, we hold the s_writepages_rwsem lock to avoid racing with
ext4_writepages to trigger the problem.
Moreover, we add an EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK macro to ensure that
the mount options used by ext4_should_dioread_nolock() and __ext4_remount()
are always consistent.

Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Cc: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h      |  3 ++-
 fs/ext4/ext4_jbd2.h |  9 +++++----
 fs/ext4/super.c     | 14 ++++++++++++++
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 08b29c289da4..f60967fa648f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1703,7 +1703,8 @@ struct ext4_sb_info {
 
 	/*
 	 * Barrier between writepages ops and changing any inode's JOURNAL_DATA
-	 * or EXTENTS flag.
+	 * or EXTENTS flag or between changing SHOULD_DIOREAD_NOLOCK flag on
+	 * remount and writepages ops.
 	 */
 	struct percpu_rw_semaphore s_writepages_rwsem;
 	struct dax_device *s_daxdev;
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..d82bfcdd56e5 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -488,6 +488,9 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
 	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
 }
 
+/* delalloc is a temporary fix to prevent generic/422 test failures*/
+#define EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK (EXT4_MOUNT_DIOREAD_NOLOCK | \
+					  EXT4_MOUNT_DELALLOC)
 /*
  * This function controls whether or not we should try to go down the
  * dioread_nolock code paths, which makes it safe to avoid taking
@@ -499,7 +502,8 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
  */
 static inline int ext4_should_dioread_nolock(struct inode *inode)
 {
-	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
+	if (test_opt(inode->i_sb, SHOULD_DIOREAD_NOLOCK) !=
+	    EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK)
 		return 0;
 	if (!S_ISREG(inode->i_mode))
 		return 0;
@@ -507,9 +511,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 		return 0;
 	if (ext4_should_journal_data(inode))
 		return 0;
-	/* temporary fix to prevent generic/422 test failures */
-	if (!test_opt(inode->i_sb, DELALLOC))
-		return 0;
 	return 1;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index fefcd42f34ea..bdf6b288aeff 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6403,8 +6403,22 @@ static int __ext4_remount(struct fs_context *fc, struct super_block *sb)
 
 	}
 
+	/* Get the flag we really need to set/clear. */
+	ctx->mask_s_mount_opt &= sbi->s_mount_opt;
+	ctx->vals_s_mount_opt &= ~sbi->s_mount_opt;
+
+	/*
+	 * If EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK change on remount, we need
+	 * to hold s_writepages_rwsem to avoid racing with writepages ops.
+	 */
+	if (ctx_changed_mount_opt(ctx, EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK))
+		percpu_down_write(&sbi->s_writepages_rwsem);
+
 	ext4_apply_options(fc, sb);
 
+	if (ctx_changed_mount_opt(ctx, EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK))
+		percpu_up_write(&sbi->s_writepages_rwsem);
+
 	if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) ^
 	    test_opt(sb, JOURNAL_CHECKSUM)) {
 		ext4_msg(sb, KERN_ERR, "changing journal_checksum "
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] ext4: fix race between writepages and remount
  2023-03-16 11:28 ` [PATCH 3/3] ext4: fix race between writepages and remount Baokun Li
@ 2023-03-23 11:44   ` Jan Kara
  2023-03-23 14:18     ` Baokun Li
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2023-03-23 11:44 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, ritesh.list,
	linux-kernel, yi.zhang, yangerkun, yukuai3, stable

On Thu 16-03-23 19:28:32, Baokun Li wrote:
> We got a WARNING in ext4_add_complete_io:
> ==================================================================
>  WARNING: at fs/ext4/page-io.c:231 ext4_put_io_end_defer+0x182/0x250
>  CPU: 10 PID: 77 Comm: ksoftirqd/10 Tainted: 6.3.0-rc2 #85
>  RIP: 0010:ext4_put_io_end_defer+0x182/0x250 [ext4]
>  [...]
>  Call Trace:
>   <TASK>
>   ext4_end_bio+0xa8/0x240 [ext4]
>   bio_endio+0x195/0x310
>   blk_update_request+0x184/0x770
>   scsi_end_request+0x2f/0x240
>   scsi_io_completion+0x75/0x450
>   scsi_finish_command+0xef/0x160
>   scsi_complete+0xa3/0x180
>   blk_complete_reqs+0x60/0x80
>   blk_done_softirq+0x25/0x40
>   __do_softirq+0x119/0x4c8
>   run_ksoftirqd+0x42/0x70
>   smpboot_thread_fn+0x136/0x3c0
>   kthread+0x140/0x1a0
>   ret_from_fork+0x2c/0x50
> ==================================================================
> 
> Above issue may happen as follows:
>            cpu1                           cpu2
> ______________________________|_____________________________
> mount -o dioread_lock
> ext4_writepages
>  ext4_do_writepages
>   *if (ext4_should_dioread_nolock(inode))*
>     // rsv_blocks is not assigned here
>                                  mount -o remount,dioread_nolock
>   ext4_journal_start_with_reserve
>    __ext4_journal_start
>     __ext4_journal_start_sb
>      jbd2__journal_start
>       *if (rsv_blocks)*
>         // h_rsv_handle is not initialized here
>   mpage_map_and_submit_extent
>     mpage_map_one_extent
>       dioread_nolock = ext4_should_dioread_nolock(inode)
>       if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN))
>         mpd->io_submit.io_end->handle = handle->h_rsv_handle
>         ext4_set_io_unwritten_flag
>           io_end->flag |= EXT4_IO_END_UNWRITTEN
>       // now io_end->handle is NULL but has EXT4_IO_END_UNWRITTEN flag
> 
> scsi_finish_command
>  scsi_io_completion
>   scsi_io_completion_action
>    scsi_end_request
>     blk_update_request
>      req_bio_endio
>       bio_endio
>        bio->bi_end_io  > ext4_end_bio
>         ext4_put_io_end_defer
> 	 ext4_add_complete_io
> 	  // trigger WARN_ON(!io_end->handle && sbi->s_journal);
> 
> The immediate cause of this problem is that ext4_should_dioread_nolock()
> function returns inconsistent values in the ext4_do_writepages() and
> mpage_map_one_extent(). There are four conditions in this function that
> can be changed at mount time to cause this problem. These four conditions
> can be divided into two categories:
>     (1) journal_data and EXT4_EXTENTS_FL, which can be changed by ioctl
>     (2) DELALLOC and DIOREAD_NOLOCK, which can be changed by remount
> The two in the first category have been fixed by commit c8585c6fcaf2
> ("ext4: fix races between changing inode journal mode and ext4_writepages")
> and commit cb85f4d23f79 ("ext4: fix race between writepages and enabling
> EXT4_EXTENTS_FL") respectively.
> Two cases in the other category have not yet been fixed, and the above
> issue is caused by this situation. We refer to the fix for the first
> category, When DELALLOC or DIOREAD_NOLOCK is detected to be changed
> during remount, we hold the s_writepages_rwsem lock to avoid racing with
> ext4_writepages to trigger the problem.
> Moreover, we add an EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK macro to ensure that
> the mount options used by ext4_should_dioread_nolock() and __ext4_remount()
> are always consistent.
> 
> Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
> Cc: stable@vger.kernel.org
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Nice catch! One comment below:

> ---
>  fs/ext4/ext4.h      |  3 ++-
>  fs/ext4/ext4_jbd2.h |  9 +++++----
>  fs/ext4/super.c     | 14 ++++++++++++++
>  3 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 08b29c289da4..f60967fa648f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1703,7 +1703,8 @@ struct ext4_sb_info {
>  
>  	/*
>  	 * Barrier between writepages ops and changing any inode's JOURNAL_DATA
> -	 * or EXTENTS flag.
> +	 * or EXTENTS flag or between changing SHOULD_DIOREAD_NOLOCK flag on
> +	 * remount and writepages ops.
>  	 */
>  	struct percpu_rw_semaphore s_writepages_rwsem;
>  	struct dax_device *s_daxdev;
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 0c77697d5e90..d82bfcdd56e5 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -488,6 +488,9 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>  	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
>  }
>  
> +/* delalloc is a temporary fix to prevent generic/422 test failures*/
> +#define EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK (EXT4_MOUNT_DIOREAD_NOLOCK | \
> +					  EXT4_MOUNT_DELALLOC)
>  /*
>   * This function controls whether or not we should try to go down the
>   * dioread_nolock code paths, which makes it safe to avoid taking
> @@ -499,7 +502,8 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>   */
>  static inline int ext4_should_dioread_nolock(struct inode *inode)
>  {
> -	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
> +	if (test_opt(inode->i_sb, SHOULD_DIOREAD_NOLOCK) !=
> +	    EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK)
>  		return 0;
>  	if (!S_ISREG(inode->i_mode))
>  		return 0;
> @@ -507,9 +511,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>  		return 0;
>  	if (ext4_should_journal_data(inode))
>  		return 0;
> -	/* temporary fix to prevent generic/422 test failures */
> -	if (!test_opt(inode->i_sb, DELALLOC))
> -		return 0;
>  	return 1;
>  }

Is there a need for this SHOULD_DIOREAD_NOLOCK? When called from writeback
we will be protected by s_writepages_rwsem anyway. When called from other
places, we either decide to do dioread_nolock or don't but the situation
can change at any instant so I don't see how unifying this check would
help. And the new SHOULD_DIOREAD_NOLOCK somewhat obfuscates what's going
on.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index fefcd42f34ea..bdf6b288aeff 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -6403,8 +6403,22 @@ static int __ext4_remount(struct fs_context *fc, struct super_block *sb)
>  
>  	}
>  
> +	/* Get the flag we really need to set/clear. */
> +	ctx->mask_s_mount_opt &= sbi->s_mount_opt;
> +	ctx->vals_s_mount_opt &= ~sbi->s_mount_opt;
> +
> +	/*
> +	 * If EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK change on remount, we need
> +	 * to hold s_writepages_rwsem to avoid racing with writepages ops.
> +	 */
> +	if (ctx_changed_mount_opt(ctx, EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK))
> +		percpu_down_write(&sbi->s_writepages_rwsem);
> +

Honestly, I'd be inclined to grab s_writepages_rwsem unconditionally during
remount. Remount is not a fast path operation and waiting for writepages
isn't too bad. Also it's easier for testing :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] ext4: fix race between writepages and remount
  2023-03-23 11:44   ` Jan Kara
@ 2023-03-23 14:18     ` Baokun Li
  2023-03-27  9:35       ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Baokun Li @ 2023-03-23 14:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel,
	yi.zhang, yangerkun, yukuai3, stable, Baokun Li

On 2023/3/23 19:44, Jan Kara wrote:
> On Thu 16-03-23 19:28:32, Baokun Li wrote:
>> Above issue may happen as follows:
>>             cpu1                           cpu2
>> ______________________________|_____________________________
>> mount -o dioread_lock
>> ext4_writepages
>>   ext4_do_writepages
>>    *if (ext4_should_dioread_nolock(inode))*
>>      // rsv_blocks is not assigned here
>>                                   mount -o remount,dioread_nolock
>>    ext4_journal_start_with_reserve
>>     __ext4_journal_start
>>      __ext4_journal_start_sb
>>       jbd2__journal_start
>>        *if (rsv_blocks)*
>>          // h_rsv_handle is not initialized here
>>    mpage_map_and_submit_extent
>>      mpage_map_one_extent
>>        dioread_nolock = ext4_should_dioread_nolock(inode)
>>        if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN))
>>          mpd->io_submit.io_end->handle = handle->h_rsv_handle
>>          ext4_set_io_unwritten_flag
>>            io_end->flag |= EXT4_IO_END_UNWRITTEN
>>        // now io_end->handle is NULL but has EXT4_IO_END_UNWRITTEN flag
>>
>> scsi_finish_command
>>   scsi_io_completion
>>    scsi_io_completion_action
>>     scsi_end_request
>>      blk_update_request
>>       req_bio_endio
>>        bio_endio
>>         bio->bi_end_io  > ext4_end_bio
>>          ext4_put_io_end_defer
>> 	 ext4_add_complete_io
>> 	  // trigger WARN_ON(!io_end->handle && sbi->s_journal);
>>
>> The immediate cause of this problem is that ext4_should_dioread_nolock()
>> function returns inconsistent values in the ext4_do_writepages() and
>> mpage_map_one_extent(). There are four conditions in this function that
>> can be changed at mount time to cause this problem. These four conditions
>> can be divided into two categories:
>>      (1) journal_data and EXT4_EXTENTS_FL, which can be changed by ioctl
>>      (2) DELALLOC and DIOREAD_NOLOCK, which can be changed by remount
>> The two in the first category have been fixed by commit c8585c6fcaf2
>> ("ext4: fix races between changing inode journal mode and ext4_writepages")
>> and commit cb85f4d23f79 ("ext4: fix race between writepages and enabling
>> EXT4_EXTENTS_FL") respectively.
>> Two cases in the other category have not yet been fixed, and the above
>> issue is caused by this situation. We refer to the fix for the first
>> category, When DELALLOC or DIOREAD_NOLOCK is detected to be changed
>> during remount, we hold the s_writepages_rwsem lock to avoid racing with
>> ext4_writepages to trigger the problem.
>> Moreover, we add an EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK macro to ensure that
>> the mount options used by ext4_should_dioread_nolock() and __ext4_remount()
>> are always consistent.
>>
>> Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Nice catch! One comment below:
Thank you very much for your review!
>> ---
>>   fs/ext4/ext4.h      |  3 ++-
>>   fs/ext4/ext4_jbd2.h |  9 +++++----
>>   fs/ext4/super.c     | 14 ++++++++++++++
>>   3 files changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 08b29c289da4..f60967fa648f 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1703,7 +1703,8 @@ struct ext4_sb_info {
>>   
>>   	/*
>>   	 * Barrier between writepages ops and changing any inode's JOURNAL_DATA
>> -	 * or EXTENTS flag.
>> +	 * or EXTENTS flag or between changing SHOULD_DIOREAD_NOLOCK flag on
>> +	 * remount and writepages ops.
>>   	 */
>>   	struct percpu_rw_semaphore s_writepages_rwsem;
>>   	struct dax_device *s_daxdev;
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index 0c77697d5e90..d82bfcdd56e5 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -488,6 +488,9 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>>   	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
>>   }
>>   
>> +/* delalloc is a temporary fix to prevent generic/422 test failures*/
>> +#define EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK (EXT4_MOUNT_DIOREAD_NOLOCK | \
>> +					  EXT4_MOUNT_DELALLOC)
>>   /*
>>    * This function controls whether or not we should try to go down the
>>    * dioread_nolock code paths, which makes it safe to avoid taking
>> @@ -499,7 +502,8 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>>    */
>>   static inline int ext4_should_dioread_nolock(struct inode *inode)
>>   {
>> -	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
>> +	if (test_opt(inode->i_sb, SHOULD_DIOREAD_NOLOCK) !=
>> +	    EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK)
>>   		return 0;
>>   	if (!S_ISREG(inode->i_mode))
>>   		return 0;
>> @@ -507,9 +511,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>>   		return 0;
>>   	if (ext4_should_journal_data(inode))
>>   		return 0;
>> -	/* temporary fix to prevent generic/422 test failures */
>> -	if (!test_opt(inode->i_sb, DELALLOC))
>> -		return 0;
>>   	return 1;
>>   }
> Is there a need for this SHOULD_DIOREAD_NOLOCK? When called from writeback
> we will be protected by s_writepages_rwsem anyway. When called from other
> places, we either decide to do dioread_nolock or don't but the situation
> can change at any instant so I don't see how unifying this check would
> help. And the new SHOULD_DIOREAD_NOLOCK somewhat obfuscates what's going
> on.
We're thinking that the mount-related flags in 
ext4_should_dioread_nolock() might
be modified, such as DELALLOC being removed because generic/422 test 
failures
were fixed in some other way, resulting in some unnecessary locking 
during remount,
or for whatever reason a mount-related flag was added to 
ext4_should_dioread_nolock(),
and we didn't make a synchronization change in __ext4_remount() that 
would cause
the problem to recur.  So we added this flag to this function (instead 
of in ext4.h), so that
when we change the mount option in ext4_should_dioread_nolock(), we 
directly change
this flag, and we don't have to consider making synchronization changes 
in __ext4_remount().

We have checked where this function is called and there are two types of 
calls to this function:
1. One category is ext4_do_writepages() and mpage_map_one_extent(), 
which are protected
     by s_writepages_rwsem, the location of the problem;
2. The other type is in ext4_page_mkwrite(), 
ext4_convert_inline_data_to_extent(),ext4_write_begin()
     to determine whether to get the block using 
ext4_get_block_unwritten() or ext4_get_ block().
     1) If we just started fetching written blocks, it looks like there 
is no problem;
     2) If we start getting unwritten blocks, when DIOREAD_NOLOCK is 
cleared by remount,
         we will convert the blocks to written in ext4_map_blocks(). The 
data=ordered mode
         ensures that we don't see stale data.
>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index fefcd42f34ea..bdf6b288aeff 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -6403,8 +6403,22 @@ static int __ext4_remount(struct fs_context *fc, struct super_block *sb)
>>   
>>   	}
>>   
>> +	/* Get the flag we really need to set/clear. */
>> +	ctx->mask_s_mount_opt &= sbi->s_mount_opt;
>> +	ctx->vals_s_mount_opt &= ~sbi->s_mount_opt;
>> +
>> +	/*
>> +	 * If EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK change on remount, we need
>> +	 * to hold s_writepages_rwsem to avoid racing with writepages ops.
>> +	 */
>> +	if (ctx_changed_mount_opt(ctx, EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK))
>> +		percpu_down_write(&sbi->s_writepages_rwsem);
>> +
> Honestly, I'd be inclined to grab s_writepages_rwsem unconditionally during
> remount. Remount is not a fast path operation and waiting for writepages
> isn't too bad. Also it's easier for testing :).
>
> 								Honza
Initially I was concerned that adding this lock to remount would cause 
users to
wait too long when remounting, so I added the lock only when the **actual**
mount flags being modified **contained** DIOREAD_NOLOCK or DELALLOC.

But now that I think about it, if there is no better way to do it, it 
seems to be
acceptable to add the lock unconditionally, so that the code looks cleaner.

Thanks again!
Any comments are welcome!
-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] ext4: fix race between writepages and remount
  2023-03-23 14:18     ` Baokun Li
@ 2023-03-27  9:35       ` Jan Kara
  2023-03-27 11:11         ` Baokun Li
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2023-03-27  9:35 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ritesh.list,
	linux-kernel, yi.zhang, yangerkun, yukuai3, stable

On Thu 23-03-23 22:18:53, Baokun Li wrote:
> On 2023/3/23 19:44, Jan Kara wrote:
> > > ---
> > >   fs/ext4/ext4.h      |  3 ++-
> > >   fs/ext4/ext4_jbd2.h |  9 +++++----
> > >   fs/ext4/super.c     | 14 ++++++++++++++
> > >   3 files changed, 21 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 08b29c289da4..f60967fa648f 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1703,7 +1703,8 @@ struct ext4_sb_info {
> > >   	/*
> > >   	 * Barrier between writepages ops and changing any inode's JOURNAL_DATA
> > > -	 * or EXTENTS flag.
> > > +	 * or EXTENTS flag or between changing SHOULD_DIOREAD_NOLOCK flag on
> > > +	 * remount and writepages ops.
> > >   	 */
> > >   	struct percpu_rw_semaphore s_writepages_rwsem;
> > >   	struct dax_device *s_daxdev;
> > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > index 0c77697d5e90..d82bfcdd56e5 100644
> > > --- a/fs/ext4/ext4_jbd2.h
> > > +++ b/fs/ext4/ext4_jbd2.h
> > > @@ -488,6 +488,9 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
> > >   	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
> > >   }
> > > +/* delalloc is a temporary fix to prevent generic/422 test failures*/
> > > +#define EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK (EXT4_MOUNT_DIOREAD_NOLOCK | \
> > > +					  EXT4_MOUNT_DELALLOC)
> > >   /*
> > >    * This function controls whether or not we should try to go down the
> > >    * dioread_nolock code paths, which makes it safe to avoid taking
> > > @@ -499,7 +502,8 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
> > >    */
> > >   static inline int ext4_should_dioread_nolock(struct inode *inode)
> > >   {
> > > -	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
> > > +	if (test_opt(inode->i_sb, SHOULD_DIOREAD_NOLOCK) !=
> > > +	    EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK)
> > >   		return 0;
> > >   	if (!S_ISREG(inode->i_mode))
> > >   		return 0;
> > > @@ -507,9 +511,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > >   		return 0;
> > >   	if (ext4_should_journal_data(inode))
> > >   		return 0;
> > > -	/* temporary fix to prevent generic/422 test failures */
> > > -	if (!test_opt(inode->i_sb, DELALLOC))
> > > -		return 0;
> > >   	return 1;
> > >   }
> > Is there a need for this SHOULD_DIOREAD_NOLOCK? When called from writeback
> > we will be protected by s_writepages_rwsem anyway. When called from other
> > places, we either decide to do dioread_nolock or don't but the situation
> > can change at any instant so I don't see how unifying this check would
> > help. And the new SHOULD_DIOREAD_NOLOCK somewhat obfuscates what's going
> > on.
> We're thinking that the mount-related flags in
> ext4_should_dioread_nolock() might be modified, such as DELALLOC being
> removed because generic/422 test failures were fixed in some other way,
> resulting in some unnecessary locking during remount, or for whatever
> reason a mount-related flag was added to ext4_should_dioread_nolock(),
> and we didn't make a synchronization change in __ext4_remount() that
> would cause the problem to recur.  So we added this flag to this function
> (instead of in ext4.h), so that when we change the mount option in
> ext4_should_dioread_nolock(), we directly change this flag, and we don't
> have to consider making synchronization changes in __ext4_remount().
> 
> We have checked where this function is called and there are two types of
> calls to this function:
> 1. One category is ext4_do_writepages() and mpage_map_one_extent(), which
> are protected by s_writepages_rwsem, the location of the problem;
> 2. The other type is in ext4_page_mkwrite(),
> ext4_convert_inline_data_to_extent(), ext4_write_begin() to determine
> whether to get the block using ext4_get_block_unwritten() or
> ext4_get_block().
>
>     1) If we just started fetching written blocks, it looks like there is no
> problem;
>     2) If we start getting unwritten blocks, when DIOREAD_NOLOCK is cleared
> by remount,
>         we will convert the blocks to written in ext4_map_blocks(). The
> data=ordered mode ensures that we don't see stale data.

Yes. So do you agree that EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK is not really
needed?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] ext4: fix race between writepages and remount
  2023-03-27  9:35       ` Jan Kara
@ 2023-03-27 11:11         ` Baokun Li
  0 siblings, 0 replies; 8+ messages in thread
From: Baokun Li @ 2023-03-27 11:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ritesh.list, linux-kernel,
	yi.zhang, yangerkun, yukuai3, stable, Baokun Li

On 2023/3/27 17:35, Jan Kara wrote:
> On Thu 23-03-23 22:18:53, Baokun Li wrote:
>> On 2023/3/23 19:44, Jan Kara wrote:
>>>> ---
>>>>    fs/ext4/ext4.h      |  3 ++-
>>>>    fs/ext4/ext4_jbd2.h |  9 +++++----
>>>>    fs/ext4/super.c     | 14 ++++++++++++++
>>>>    3 files changed, 21 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 08b29c289da4..f60967fa648f 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1703,7 +1703,8 @@ struct ext4_sb_info {
>>>>    	/*
>>>>    	 * Barrier between writepages ops and changing any inode's JOURNAL_DATA
>>>> -	 * or EXTENTS flag.
>>>> +	 * or EXTENTS flag or between changing SHOULD_DIOREAD_NOLOCK flag on
>>>> +	 * remount and writepages ops.
>>>>    	 */
>>>>    	struct percpu_rw_semaphore s_writepages_rwsem;
>>>>    	struct dax_device *s_daxdev;
>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>> index 0c77697d5e90..d82bfcdd56e5 100644
>>>> --- a/fs/ext4/ext4_jbd2.h
>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>> @@ -488,6 +488,9 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>>>>    	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
>>>>    }
>>>> +/* delalloc is a temporary fix to prevent generic/422 test failures*/
>>>> +#define EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK (EXT4_MOUNT_DIOREAD_NOLOCK | \
>>>> +					  EXT4_MOUNT_DELALLOC)
>>>>    /*
>>>>     * This function controls whether or not we should try to go down the
>>>>     * dioread_nolock code paths, which makes it safe to avoid taking
>>>> @@ -499,7 +502,8 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
>>>>     */
>>>>    static inline int ext4_should_dioread_nolock(struct inode *inode)
>>>>    {
>>>> -	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
>>>> +	if (test_opt(inode->i_sb, SHOULD_DIOREAD_NOLOCK) !=
>>>> +	    EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK)
>>>>    		return 0;
>>>>    	if (!S_ISREG(inode->i_mode))
>>>>    		return 0;
>>>> @@ -507,9 +511,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>>>>    		return 0;
>>>>    	if (ext4_should_journal_data(inode))
>>>>    		return 0;
>>>> -	/* temporary fix to prevent generic/422 test failures */
>>>> -	if (!test_opt(inode->i_sb, DELALLOC))
>>>> -		return 0;
>>>>    	return 1;
>>>>    }
>>> Is there a need for this SHOULD_DIOREAD_NOLOCK? When called from writeback
>>> we will be protected by s_writepages_rwsem anyway. When called from other
>>> places, we either decide to do dioread_nolock or don't but the situation
>>> can change at any instant so I don't see how unifying this check would
>>> help. And the new SHOULD_DIOREAD_NOLOCK somewhat obfuscates what's going
>>> on.
>> We're thinking that the mount-related flags in
>> ext4_should_dioread_nolock() might be modified, such as DELALLOC being
>> removed because generic/422 test failures were fixed in some other way,
>> resulting in some unnecessary locking during remount, or for whatever
>> reason a mount-related flag was added to ext4_should_dioread_nolock(),
>> and we didn't make a synchronization change in __ext4_remount() that
>> would cause the problem to recur.  So we added this flag to this function
>> (instead of in ext4.h), so that when we change the mount option in
>> ext4_should_dioread_nolock(), we directly change this flag, and we don't
>> have to consider making synchronization changes in __ext4_remount().
>>
>> We have checked where this function is called and there are two types of
>> calls to this function:
>> 1. One category is ext4_do_writepages() and mpage_map_one_extent(), which
>> are protected by s_writepages_rwsem, the location of the problem;
>> 2. The other type is in ext4_page_mkwrite(),
>> ext4_convert_inline_data_to_extent(), ext4_write_begin() to determine
>> whether to get the block using ext4_get_block_unwritten() or
>> ext4_get_block().
>>
>>      1) If we just started fetching written blocks, it looks like there is no
>> problem;
>>      2) If we start getting unwritten blocks, when DIOREAD_NOLOCK is cleared
>> by remount,
>>          we will convert the blocks to written in ext4_map_blocks(). The
>> data=ordered mode ensures that we don't see stale data.
> Yes. So do you agree that EXT4_MOUNT_SHOULD_DIOREAD_NOLOCK is not really
> needed?
>
> 								Honza
Yes, I totally agree!
If we unconditionally grabbed s_writepages_rwsem when remounting,
there would be no mount option synchronization problem, and the flag
would be completely unnecessary.

I will send a patch V2 with the changes suggested by you.
-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-03-27 11:12 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-16 11:28 [PATCH 0/3] ext4: fix WARNING in ext4_add_complete_io Baokun Li
2023-03-16 11:28 ` [PATCH 1/3] ext4: correct error ctx->mask_s_##name in ctx_set_##name Baokun Li
2023-03-16 11:28 ` [PATCH 2/3] ext4: add helper to check if flag is changed by ctx Baokun Li
2023-03-16 11:28 ` [PATCH 3/3] ext4: fix race between writepages and remount Baokun Li
2023-03-23 11:44   ` Jan Kara
2023-03-23 14:18     ` Baokun Li
2023-03-27  9:35       ` Jan Kara
2023-03-27 11:11         ` Baokun Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).