linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next v2 0/6] Fix some issues about mmp
@ 2021-09-11  9:00 Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd Ye Bin
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

I test mmp function as follow steps:
1. Inject delay 5s in ext4_multi_mount_protect function after
"skip:" label.
2. Share HostA block device(sda) with HostB(nbd0) by NBD.
3. Enable mmp feature when mkfs.ext4 sda.
4. Mount sda and nbd0 at the same time.

I found kmmpd never trigger detect multi-mount. Reason is as follow:
1. Kmmpd init seq with 0, if two host have same nodename, will lead to
detect confliction very slow, even never detect confliction.
2. When detect confliction in kmmpd, we get 'check_bh' is same with 'bh'.
so we compare data with itself.
3. We only trigger detect when ”diff > mmp_check_interval * HZ“,
'mmp_check_interval' is double of 'mmp_update_interval', 'diff' is
about 'mmp_update_interval'. So 'diff' is little than 'mmp_check_interval * HZ'
normaly.

And i also found that 'check_interval' value store in disk  is not sure
after umount.

v1->v2:
Fix 'last_check_time' not initialized before checking.

Ye Bin (6):
  ext4: init seq with random value in kmmpd
  ext4: introduce last_check_time record previous check time
  ext4: compare to local seq and nodename when check conflict
  ext4: avoid to re-read mmp check data get from page cache
  ext4: avoid to double free s_mmp_bh
  ext4: fix possible store wrong check interval value in disk when
    umount

 fs/ext4/mmp.c | 77 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 34 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-10-07 12:26   ` Jan Kara
  2021-09-11  9:00 ` [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time Ye Bin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

If two host has the same nodename, and seq start from 0, May cause the
detection mechanism to fail.
So init seq with random value to accelerate conflict detection.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index cebea4270817..12af6dc8457b 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -122,6 +122,21 @@ void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
 		       (int)sizeof(mmp->mmp_bdevname), mmp->mmp_bdevname);
 }
 
+/*
+ * Get a random new sequence number but make sure it is not greater than
+ * EXT4_MMP_SEQ_MAX.
+ */
+static unsigned int mmp_new_seq(void)
+{
+	u32 new_seq;
+
+	do {
+		new_seq = prandom_u32();
+	} while (new_seq > EXT4_MMP_SEQ_MAX);
+
+	return new_seq;
+}
+
 /*
  * kmmpd will update the MMP sequence every s_mmp_update_interval seconds
  */
@@ -132,7 +147,7 @@ static int kmmpd(void *data)
 	struct buffer_head *bh = EXT4_SB(sb)->s_mmp_bh;
 	struct mmp_struct *mmp;
 	ext4_fsblk_t mmp_block;
-	u32 seq = 0;
+	u32 seq = mmp_new_seq();
 	unsigned long failed_writes = 0;
 	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
 	unsigned mmp_check_interval;
@@ -258,21 +273,6 @@ void ext4_stop_mmpd(struct ext4_sb_info *sbi)
 	}
 }
 
-/*
- * Get a random new sequence number but make sure it is not greater than
- * EXT4_MMP_SEQ_MAX.
- */
-static unsigned int mmp_new_seq(void)
-{
-	u32 new_seq;
-
-	do {
-		new_seq = prandom_u32();
-	} while (new_seq > EXT4_MMP_SEQ_MAX);
-
-	return new_seq;
-}
-
 /*
  * Protect the filesystem from being mounted more than once.
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-10-07 12:31   ` Jan Kara
  2021-09-11  9:00 ` [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict Ye Bin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

kmmpd:
...
    diff = jiffies - last_update_time;
    if (diff > mmp_check_interval * HZ) {
...
As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
than 'mmp_update_interval', so there will never trigger detection.
Introduce last_check_time record previous check time.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 12af6dc8457b..c781b09a78c9 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -152,6 +152,7 @@ static int kmmpd(void *data)
 	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
 	unsigned mmp_check_interval;
 	unsigned long last_update_time;
+	unsigned long last_check_time;
 	unsigned long diff;
 	int retval = 0;
 
@@ -170,6 +171,7 @@ static int kmmpd(void *data)
 
 	memcpy(mmp->mmp_nodename, init_utsname()->nodename,
 	       sizeof(mmp->mmp_nodename));
+	last_check_time = jiffies;
 
 	while (!kthread_should_stop() && !sb_rdonly(sb)) {
 		if (!ext4_has_feature_mmp(sb)) {
@@ -198,17 +200,18 @@ static int kmmpd(void *data)
 		}
 
 		diff = jiffies - last_update_time;
-		if (diff < mmp_update_interval * HZ)
+		if (diff < mmp_update_interval * HZ) {
 			schedule_timeout_interruptible(mmp_update_interval *
 						       HZ - diff);
+			diff = jiffies - last_update_time;
+		}
 
 		/*
 		 * We need to make sure that more than mmp_check_interval
-		 * seconds have not passed since writing. If that has happened
-		 * we need to check if the MMP block is as we left it.
+		 * seconds have not passed since check. If that has happened
+		 * we need to check if the MMP block is as we write it.
 		 */
-		diff = jiffies - last_update_time;
-		if (diff > mmp_check_interval * HZ) {
+		if (jiffies - last_check_time > mmp_check_interval * HZ) {
 			struct buffer_head *bh_check = NULL;
 			struct mmp_struct *mmp_check;
 
@@ -234,6 +237,7 @@ static int kmmpd(void *data)
 				goto wait_to_exit;
 			}
 			put_bh(bh_check);
+			last_check_time = jiffies;
 		}
 
 		 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-10-07 12:36   ` Jan Kara
  2021-09-11  9:00 ` [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache Ye Bin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

As mmp and check_mmp is point to the same data, so there will never
detect conflict.
To solve this issue just compare to local data.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index c781b09a78c9..4433fe7e9e86 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -154,6 +154,7 @@ static int kmmpd(void *data)
 	unsigned long last_update_time;
 	unsigned long last_check_time;
 	unsigned long diff;
+	char nodename[64];
 	int retval = 0;
 
 	mmp_block = le64_to_cpu(es->s_mmp_block);
@@ -169,8 +170,8 @@ static int kmmpd(void *data)
 	BUILD_BUG_ON(sizeof(mmp->mmp_bdevname) < BDEVNAME_SIZE);
 	bdevname(bh->b_bdev, mmp->mmp_bdevname);
 
-	memcpy(mmp->mmp_nodename, init_utsname()->nodename,
-	       sizeof(mmp->mmp_nodename));
+	memcpy(nodename, init_utsname()->nodename, sizeof(nodename));
+	memcpy(mmp->mmp_nodename, nodename, sizeof(mmp->mmp_nodename));
 	last_check_time = jiffies;
 
 	while (!kthread_should_stop() && !sb_rdonly(sb)) {
@@ -224,8 +225,8 @@ static int kmmpd(void *data)
 			}
 
 			mmp_check = (struct mmp_struct *)(bh_check->b_data);
-			if (mmp->mmp_seq != mmp_check->mmp_seq ||
-			    memcmp(mmp->mmp_nodename, mmp_check->mmp_nodename,
+			if (seq != mmp_check->mmp_seq ||
+			    memcmp(nodename, mmp_check->mmp_nodename,
 				   sizeof(mmp->mmp_nodename))) {
 				dump_mmp_msg(sb, mmp_check,
 					     "Error while updating MMP info. "
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
                   ` (2 preceding siblings ...)
  2021-09-11  9:00 ` [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-10-07 12:44   ` Jan Kara
  2021-09-11  9:00 ` [PATCH -next v2 5/6] ext4: avoid to double free s_mmp_bh Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount Ye Bin
  5 siblings, 1 reply; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

As call read_mmp_block pass bh_check which value is NULL, then call
sb_getblk to get buffer_head. But mmp_block's buffer_head is already exist
 which also is uptodate. Lead to compare the same data.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 4433fe7e9e86..007bde3c12b8 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -213,10 +213,7 @@ static int kmmpd(void *data)
 		 * we need to check if the MMP block is as we write it.
 		 */
 		if (jiffies - last_check_time > mmp_check_interval * HZ) {
-			struct buffer_head *bh_check = NULL;
-			struct mmp_struct *mmp_check;
-
-			retval = read_mmp_block(sb, &bh_check, mmp_block);
+			retval = read_mmp_block(sb, &bh, mmp_block);
 			if (retval) {
 				ext4_error_err(sb, -retval,
 					       "error reading MMP data: %d",
@@ -224,20 +221,18 @@ static int kmmpd(void *data)
 				goto wait_to_exit;
 			}
 
-			mmp_check = (struct mmp_struct *)(bh_check->b_data);
-			if (seq != mmp_check->mmp_seq ||
-			    memcmp(nodename, mmp_check->mmp_nodename,
-				   sizeof(mmp->mmp_nodename))) {
-				dump_mmp_msg(sb, mmp_check,
+			mmp = (struct mmp_struct *)(bh->b_data);
+			if (seq != le32_to_cpu(mmp->mmp_seq) ||
+			    memcmp(nodename, mmp->mmp_nodename,
+				    sizeof(nodename))) {
+				dump_mmp_msg(sb, mmp,
 					     "Error while updating MMP info. "
 					     "The filesystem seems to have been"
 					     " multiply mounted.");
 				ext4_error_err(sb, EBUSY, "abort");
-				put_bh(bh_check);
 				retval = -EBUSY;
 				goto wait_to_exit;
 			}
-			put_bh(bh_check);
 			last_check_time = jiffies;
 		}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH -next v2 5/6] ext4: avoid to double free s_mmp_bh
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
                   ` (3 preceding siblings ...)
  2021-09-11  9:00 ` [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-09-11  9:00 ` [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount Ye Bin
  5 siblings, 0 replies; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

If call read_mmp_block failed then s_mmp_bh will be freed in read_mmp_block.
Kmmpd wait to be killed, ext4_stop_mmpd stop kmmpd and also release s_mmp_bh.
To avoid double free, just set EXT4_SB(sb)->s_mmp_bh with NULL.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 007bde3c12b8..a0d47a906faa 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -218,6 +218,7 @@ static int kmmpd(void *data)
 				ext4_error_err(sb, -retval,
 					       "error reading MMP data: %d",
 					       retval);
+				EXT4_SB(sb)->s_mmp_bh = NULL;
 				goto wait_to_exit;
 			}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount
  2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
                   ` (4 preceding siblings ...)
  2021-09-11  9:00 ` [PATCH -next v2 5/6] ext4: avoid to double free s_mmp_bh Ye Bin
@ 2021-09-11  9:00 ` Ye Bin
  2021-10-07 13:12   ` Jan Kara
  5 siblings, 1 reply; 24+ messages in thread
From: Ye Bin @ 2021-09-11  9:00 UTC (permalink / raw)
  To: tytso, adilger.kernel, linux-ext4; +Cc: linux-kernel, jack, Ye Bin

Test follow steps:
1. mkfs.ext4 /dev/sda -O mmp
2. mount /dev/sda  /mnt
3. wait for about 1 minute
4. umount mnt
5. debugfs /dev/sda
6. dump_mmp
7. fsck.ext4 /dev/sda

I found 'check_interval' is range in [5, 10]. And sometime run fsck
print "MMP interval is 10 seconds and total wait time is 42 seconds.
Please wait...".
kmmpd:
...
	if (diff < mmp_update_interval * HZ)
		schedule_timeout_interruptible(mmp_update_interval * HZ - diff);
	 diff = jiffies - last_update_time;
...
	mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ,
				EXT4_MMP_MAX_CHECK_INTERVAL),
			        EXT4_MMP_MIN_CHECK_INTERVAL);
	mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
...
We will call ext4_stop_mmpd to stop kmmpd kthread when umount, and
schedule_timeout_interruptible will be interrupted, so 'diff' maybe
little than mmp_update_interval. Then mmp_check_interval will range
in [EXT4_MMP_MAX_CHECK_INTERVAL, EXT4_MMP_CHECK_MULT * diff / HZ].
To solve this issue, if 'diff' little then mmp_update_interval * HZ
just break loop, don't update check interval.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mmp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index a0d47a906faa..f39e1fa0c6db 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -205,6 +205,14 @@ static int kmmpd(void *data)
 			schedule_timeout_interruptible(mmp_update_interval *
 						       HZ - diff);
 			diff = jiffies - last_update_time;
+			/* If 'diff' little 'than mmp_update_interval * HZ', it
+			 * means someone call ext4_stop_mmpd to stop kmmpd
+			 * kthread. We don't need to update mmp_check_interval
+			 * any more, as 'diff' is not exact value.
+			 */
+			if (unlikely(diff < mmp_update_interval * HZ &&
+			    kthread_should_stop()))
+				break;
 		}
 
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd
  2021-09-11  9:00 ` [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd Ye Bin
@ 2021-10-07 12:26   ` Jan Kara
  2021-10-08  1:50     ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-10-07 12:26 UTC (permalink / raw)
  To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, jack

On Sat 11-09-21 17:00:54, Ye Bin wrote:
> If two host has the same nodename, and seq start from 0, May cause the
> detection mechanism to fail.
> So init seq with random value to accelerate conflict detection.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>

Thanks for the patch. I agree the code in kmmpd looks suspicious.  The
sequence number is initialized to a random number in
ext4_multi_mount_protect() which is called before kmmpd is started. I think
kmmpd() should initialize its 'seq' to a number fetched from the mmp
block, instead of 0 as is currently in the code. I don't think generating a
new random number as you do is really needed...

								Honza

> ---
>  fs/ext4/mmp.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index cebea4270817..12af6dc8457b 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -122,6 +122,21 @@ void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
>  		       (int)sizeof(mmp->mmp_bdevname), mmp->mmp_bdevname);
>  }
>  
> +/*
> + * Get a random new sequence number but make sure it is not greater than
> + * EXT4_MMP_SEQ_MAX.
> + */
> +static unsigned int mmp_new_seq(void)
> +{
> +	u32 new_seq;
> +
> +	do {
> +		new_seq = prandom_u32();
> +	} while (new_seq > EXT4_MMP_SEQ_MAX);
> +
> +	return new_seq;
> +}
> +
>  /*
>   * kmmpd will update the MMP sequence every s_mmp_update_interval seconds
>   */
> @@ -132,7 +147,7 @@ static int kmmpd(void *data)
>  	struct buffer_head *bh = EXT4_SB(sb)->s_mmp_bh;
>  	struct mmp_struct *mmp;
>  	ext4_fsblk_t mmp_block;
> -	u32 seq = 0;
> +	u32 seq = mmp_new_seq();
>  	unsigned long failed_writes = 0;
>  	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>  	unsigned mmp_check_interval;
> @@ -258,21 +273,6 @@ void ext4_stop_mmpd(struct ext4_sb_info *sbi)
>  	}
>  }
>  
> -/*
> - * Get a random new sequence number but make sure it is not greater than
> - * EXT4_MMP_SEQ_MAX.
> - */
> -static unsigned int mmp_new_seq(void)
> -{
> -	u32 new_seq;
> -
> -	do {
> -		new_seq = prandom_u32();
> -	} while (new_seq > EXT4_MMP_SEQ_MAX);
> -
> -	return new_seq;
> -}
> -
>  /*
>   * Protect the filesystem from being mounted more than once.
>   */
> -- 
> 2.31.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-09-11  9:00 ` [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time Ye Bin
@ 2021-10-07 12:31   ` Jan Kara
  2021-10-08  1:56     ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-10-07 12:31 UTC (permalink / raw)
  To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, jack

On Sat 11-09-21 17:00:55, Ye Bin wrote:
> kmmpd:
> ...
>     diff = jiffies - last_update_time;
>     if (diff > mmp_check_interval * HZ) {
> ...
> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> than 'mmp_update_interval', so there will never trigger detection.
> Introduce last_check_time record previous check time.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>

I think the check is there only for the case where write_mmp_block() +
sleep took longer than mmp_check_interval. I agree that should rarely
happen but on a really busy system it is possible and in that case we would
miss updating mmp block for too long and so another node could have started
using the filesystem. I actually don't see a reason why kmmpd should be
checking the block each mmp_check_interval as you do - mmp_check_interval
is just for ext4_multi_mount_protect() to know how long it should wait
before considering mmp block stale... Am I missing something?

								Honza

> ---
>  fs/ext4/mmp.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 12af6dc8457b..c781b09a78c9 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
>  	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>  	unsigned mmp_check_interval;
>  	unsigned long last_update_time;
> +	unsigned long last_check_time;
>  	unsigned long diff;
>  	int retval = 0;
>  
> @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>  
>  	memcpy(mmp->mmp_nodename, init_utsname()->nodename,
>  	       sizeof(mmp->mmp_nodename));
> +	last_check_time = jiffies;
>  
>  	while (!kthread_should_stop() && !sb_rdonly(sb)) {
>  		if (!ext4_has_feature_mmp(sb)) {
> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
>  		}
>  
>  		diff = jiffies - last_update_time;
> -		if (diff < mmp_update_interval * HZ)
> +		if (diff < mmp_update_interval * HZ) {
>  			schedule_timeout_interruptible(mmp_update_interval *
>  						       HZ - diff);
> +			diff = jiffies - last_update_time;
> +		}
>  
>  		/*
>  		 * We need to make sure that more than mmp_check_interval
> -		 * seconds have not passed since writing. If that has happened
> -		 * we need to check if the MMP block is as we left it.
> +		 * seconds have not passed since check. If that has happened
> +		 * we need to check if the MMP block is as we write it.
>  		 */
> -		diff = jiffies - last_update_time;
> -		if (diff > mmp_check_interval * HZ) {
> +		if (jiffies - last_check_time > mmp_check_interval * HZ) {
>  			struct buffer_head *bh_check = NULL;
>  			struct mmp_struct *mmp_check;
>  
> @@ -234,6 +237,7 @@ static int kmmpd(void *data)
>  				goto wait_to_exit;
>  			}
>  			put_bh(bh_check);
> +			last_check_time = jiffies;
>  		}
>  
>  		 /*
> -- 
> 2.31.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict
  2021-09-11  9:00 ` [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict Ye Bin
@ 2021-10-07 12:36   ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2021-10-07 12:36 UTC (permalink / raw)
  To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, jack

On Sat 11-09-21 17:00:56, Ye Bin wrote:
> As mmp and check_mmp is point to the same data, so there will never
> detect conflict.
> To solve this issue just compare to local data.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>

Good spotting! Just one nit below.

> ---
>  fs/ext4/mmp.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index c781b09a78c9..4433fe7e9e86 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -154,6 +154,7 @@ static int kmmpd(void *data)
>  	unsigned long last_update_time;
>  	unsigned long last_check_time;
>  	unsigned long diff;
> +	char nodename[64];

Perhaps define a constant for the length of nodename and use it here as
well as in the declaration of mmp_struct->mmp_nodename?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache
  2021-09-11  9:00 ` [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache Ye Bin
@ 2021-10-07 12:44   ` Jan Kara
  2021-10-08  3:52     ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-10-07 12:44 UTC (permalink / raw)
  To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, jack

On Sat 11-09-21 17:00:57, Ye Bin wrote:
> As call read_mmp_block pass bh_check which value is NULL, then call
> sb_getblk to get buffer_head. But mmp_block's buffer_head is already exist
>  which also is uptodate. Lead to compare the same data.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>

This looks good, except that read_mmp_block() also releases bh it is passed
in case of error. So it can free buffer head which is still referenced from
EXT4_SB(sb)->s_mmp_bh and cause use-after-free issues.

I guess I would just get rid of sb_getblk() in read_mmp_block() and always
expect valid bh passed. The only place that passes NULL bh after this
patch is one case in ext4_multi_mount_protect() and that can call
sb_getblk() on its own. That way we can also simplify read_mmp_block()
prototype to:

static int read_mmp_block(struct super_block *sb, struct buffer_head *bh);

								Honza

> ---
>  fs/ext4/mmp.c | 17 ++++++-----------
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 4433fe7e9e86..007bde3c12b8 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -213,10 +213,7 @@ static int kmmpd(void *data)
>  		 * we need to check if the MMP block is as we write it.
>  		 */
>  		if (jiffies - last_check_time > mmp_check_interval * HZ) {
> -			struct buffer_head *bh_check = NULL;
> -			struct mmp_struct *mmp_check;
> -
> -			retval = read_mmp_block(sb, &bh_check, mmp_block);
> +			retval = read_mmp_block(sb, &bh, mmp_block);
>  			if (retval) {
>  				ext4_error_err(sb, -retval,
>  					       "error reading MMP data: %d",
> @@ -224,20 +221,18 @@ static int kmmpd(void *data)
>  				goto wait_to_exit;
>  			}
>  
> -			mmp_check = (struct mmp_struct *)(bh_check->b_data);
> -			if (seq != mmp_check->mmp_seq ||
> -			    memcmp(nodename, mmp_check->mmp_nodename,
> -				   sizeof(mmp->mmp_nodename))) {
> -				dump_mmp_msg(sb, mmp_check,
> +			mmp = (struct mmp_struct *)(bh->b_data);
> +			if (seq != le32_to_cpu(mmp->mmp_seq) ||
> +			    memcmp(nodename, mmp->mmp_nodename,
> +				    sizeof(nodename))) {
> +				dump_mmp_msg(sb, mmp,
>  					     "Error while updating MMP info. "
>  					     "The filesystem seems to have been"
>  					     " multiply mounted.");
>  				ext4_error_err(sb, EBUSY, "abort");
> -				put_bh(bh_check);
>  				retval = -EBUSY;
>  				goto wait_to_exit;
>  			}
> -			put_bh(bh_check);
>  			last_check_time = jiffies;
>  		}
>  
> -- 
> 2.31.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount
  2021-09-11  9:00 ` [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount Ye Bin
@ 2021-10-07 13:12   ` Jan Kara
  2021-10-08  3:49     ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-10-07 13:12 UTC (permalink / raw)
  To: Ye Bin; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel, jack

On Sat 11-09-21 17:00:59, Ye Bin wrote:
> Test follow steps:
> 1. mkfs.ext4 /dev/sda -O mmp
> 2. mount /dev/sda  /mnt
> 3. wait for about 1 minute
> 4. umount mnt
> 5. debugfs /dev/sda
> 6. dump_mmp
> 7. fsck.ext4 /dev/sda
> 
> I found 'check_interval' is range in [5, 10]. And sometime run fsck
> print "MMP interval is 10 seconds and total wait time is 42 seconds.
> Please wait...".
> kmmpd:
> ...
> 	if (diff < mmp_update_interval * HZ)
> 		schedule_timeout_interruptible(mmp_update_interval * HZ - diff);
> 	 diff = jiffies - last_update_time;
> ...
> 	mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ,
> 				EXT4_MMP_MAX_CHECK_INTERVAL),
> 			        EXT4_MMP_MIN_CHECK_INTERVAL);
> 	mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
> ...
> We will call ext4_stop_mmpd to stop kmmpd kthread when umount, and
> schedule_timeout_interruptible will be interrupted, so 'diff' maybe
> little than mmp_update_interval. Then mmp_check_interval will range
> in [EXT4_MMP_MAX_CHECK_INTERVAL, EXT4_MMP_CHECK_MULT * diff / HZ].
> To solve this issue, if 'diff' little then mmp_update_interval * HZ
> just break loop, don't update check interval.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  fs/ext4/mmp.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index a0d47a906faa..f39e1fa0c6db 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -205,6 +205,14 @@ static int kmmpd(void *data)
>  			schedule_timeout_interruptible(mmp_update_interval *
>  						       HZ - diff);
>  			diff = jiffies - last_update_time;
> +			/* If 'diff' little 'than mmp_update_interval * HZ', it
> +			 * means someone call ext4_stop_mmpd to stop kmmpd
> +			 * kthread. We don't need to update mmp_check_interval
> +			 * any more, as 'diff' is not exact value.
> +			 */
> +			if (unlikely(diff < mmp_update_interval * HZ &&
> +			    kthread_should_stop()))
> +				break;
>  		}

So in this case, mmp_check_interval would be EXT4_MMP_MIN_CHECK_INTERVAL. I
don't quite understand what the practical problem is - the fsck message?
That will happen anytime mmp_check_interval is >= 10 AFAICT and I don't
quite see how that is connected to this condition... Can you explain a bit
more please?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd
  2021-10-07 12:26   ` Jan Kara
@ 2021-10-08  1:50     ` yebin
  0 siblings, 0 replies; 24+ messages in thread
From: yebin @ 2021-10-08  1:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/7 20:26, Jan Kara wrote:
> On Sat 11-09-21 17:00:54, Ye Bin wrote:
>> If two host has the same nodename, and seq start from 0, May cause the
>> detection mechanism to fail.
>> So init seq with random value to accelerate conflict detection.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
> Thanks for the patch. I agree the code in kmmpd looks suspicious.  The
> sequence number is initialized to a random number in
> ext4_multi_mount_protect() which is called before kmmpd is started. I think
> kmmpd() should initialize its 'seq' to a number fetched from the mmp
> block, instead of 0 as is currently in the code. I don't think generating a
> new random number as you do is really needed...
>
> 								Honza
Thank you for your advice,i  will  fix it.
>
>> ---
>>   fs/ext4/mmp.c | 32 ++++++++++++++++----------------
>>   1 file changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>> index cebea4270817..12af6dc8457b 100644
>> --- a/fs/ext4/mmp.c
>> +++ b/fs/ext4/mmp.c
>> @@ -122,6 +122,21 @@ void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
>>   		       (int)sizeof(mmp->mmp_bdevname), mmp->mmp_bdevname);
>>   }
>>   
>> +/*
>> + * Get a random new sequence number but make sure it is not greater than
>> + * EXT4_MMP_SEQ_MAX.
>> + */
>> +static unsigned int mmp_new_seq(void)
>> +{
>> +	u32 new_seq;
>> +
>> +	do {
>> +		new_seq = prandom_u32();
>> +	} while (new_seq > EXT4_MMP_SEQ_MAX);
>> +
>> +	return new_seq;
>> +}
>> +
>>   /*
>>    * kmmpd will update the MMP sequence every s_mmp_update_interval seconds
>>    */
>> @@ -132,7 +147,7 @@ static int kmmpd(void *data)
>>   	struct buffer_head *bh = EXT4_SB(sb)->s_mmp_bh;
>>   	struct mmp_struct *mmp;
>>   	ext4_fsblk_t mmp_block;
>> -	u32 seq = 0;
>> +	u32 seq = mmp_new_seq();
>>   	unsigned long failed_writes = 0;
>>   	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>>   	unsigned mmp_check_interval;
>> @@ -258,21 +273,6 @@ void ext4_stop_mmpd(struct ext4_sb_info *sbi)
>>   	}
>>   }
>>   
>> -/*
>> - * Get a random new sequence number but make sure it is not greater than
>> - * EXT4_MMP_SEQ_MAX.
>> - */
>> -static unsigned int mmp_new_seq(void)
>> -{
>> -	u32 new_seq;
>> -
>> -	do {
>> -		new_seq = prandom_u32();
>> -	} while (new_seq > EXT4_MMP_SEQ_MAX);
>> -
>> -	return new_seq;
>> -}
>> -
>>   /*
>>    * Protect the filesystem from being mounted more than once.
>>    */
>> -- 
>> 2.31.1
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-07 12:31   ` Jan Kara
@ 2021-10-08  1:56     ` yebin
  2021-10-08  2:38       ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: yebin @ 2021-10-08  1:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/7 20:31, Jan Kara wrote:
> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>> kmmpd:
>> ...
>>      diff = jiffies - last_update_time;
>>      if (diff > mmp_check_interval * HZ) {
>> ...
>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>> than 'mmp_update_interval', so there will never trigger detection.
>> Introduce last_check_time record previous check time.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
> I think the check is there only for the case where write_mmp_block() +
> sleep took longer than mmp_check_interval. I agree that should rarely
> happen but on a really busy system it is possible and in that case we would
> miss updating mmp block for too long and so another node could have started
> using the filesystem. I actually don't see a reason why kmmpd should be
> checking the block each mmp_check_interval as you do - mmp_check_interval
> is just for ext4_multi_mount_protect() to know how long it should wait
> before considering mmp block stale... Am I missing something?
>
> 								Honza
I'm sorry, I didn't understand the detection mechanism here before. Now 
I understand
the detection mechanism here.
As you said, it's just an abnormal protection. There's really no problem.

>> ---
>>   fs/ext4/mmp.c | 14 +++++++++-----
>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>> index 12af6dc8457b..c781b09a78c9 100644
>> --- a/fs/ext4/mmp.c
>> +++ b/fs/ext4/mmp.c
>> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
>>   	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>>   	unsigned mmp_check_interval;
>>   	unsigned long last_update_time;
>> +	unsigned long last_check_time;
>>   	unsigned long diff;
>>   	int retval = 0;
>>   
>> @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>>   
>>   	memcpy(mmp->mmp_nodename, init_utsname()->nodename,
>>   	       sizeof(mmp->mmp_nodename));
>> +	last_check_time = jiffies;
>>   
>>   	while (!kthread_should_stop() && !sb_rdonly(sb)) {
>>   		if (!ext4_has_feature_mmp(sb)) {
>> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
>>   		}
>>   
>>   		diff = jiffies - last_update_time;
>> -		if (diff < mmp_update_interval * HZ)
>> +		if (diff < mmp_update_interval * HZ) {
>>   			schedule_timeout_interruptible(mmp_update_interval *
>>   						       HZ - diff);
>> +			diff = jiffies - last_update_time;
>> +		}
>>   
>>   		/*
>>   		 * We need to make sure that more than mmp_check_interval
>> -		 * seconds have not passed since writing. If that has happened
>> -		 * we need to check if the MMP block is as we left it.
>> +		 * seconds have not passed since check. If that has happened
>> +		 * we need to check if the MMP block is as we write it.
>>   		 */
>> -		diff = jiffies - last_update_time;
>> -		if (diff > mmp_check_interval * HZ) {
>> +		if (jiffies - last_check_time > mmp_check_interval * HZ) {
>>   			struct buffer_head *bh_check = NULL;
>>   			struct mmp_struct *mmp_check;
>>   
>> @@ -234,6 +237,7 @@ static int kmmpd(void *data)
>>   				goto wait_to_exit;
>>   			}
>>   			put_bh(bh_check);
>> +			last_check_time = jiffies;
>>   		}
>>   
>>   		 /*
>> -- 
>> 2.31.1
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-08  1:56     ` yebin
@ 2021-10-08  2:38       ` yebin
  2021-10-12  8:47         ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: yebin @ 2021-10-08  2:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/8 9:56, yebin wrote:
>
>
> On 2021/10/7 20:31, Jan Kara wrote:
>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>> kmmpd:
>>> ...
>>>      diff = jiffies - last_update_time;
>>>      if (diff > mmp_check_interval * HZ) {
>>> ...
>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>> than 'mmp_update_interval', so there will never trigger detection.
>>> Introduce last_check_time record previous check time.
>>>
>>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>> I think the check is there only for the case where write_mmp_block() +
>> sleep took longer than mmp_check_interval. I agree that should rarely
>> happen but on a really busy system it is possible and in that case we 
>> would
>> miss updating mmp block for too long and so another node could have 
>> started
>> using the filesystem. I actually don't see a reason why kmmpd should be
>> checking the block each mmp_check_interval as you do - 
>> mmp_check_interval
>> is just for ext4_multi_mount_protect() to know how long it should wait
>> before considering mmp block stale... Am I missing something?
>>
>>                                 Honza
> I'm sorry, I didn't understand the detection mechanism here before. 
> Now I understand
> the detection mechanism here.
> As you said, it's just an abnormal protection. There's really no problem.
>
Yeah, i did test as following steps
hostA                        hostB
    mount
      ext4_multi_mount_protect  -> seq == EXT4_MMP_SEQ_CLEAN
         delay 5s after label "skip" so hostB will see seq is 
EXT4_MMP_SEQ_CLEAN
                        mount
                        ext4_multi_mount_protect -> seq == 
EXT4_MMP_SEQ_CLEAN
                                run  kmmpd
     run kmmpd

Actually,in this  situation kmmpd will not detect  confliction.
In  ext4_multi_mount_protect  function we write mmp data fisrt and wait 
'wait_time * HZ'  seconds,
read mmp data do check.Most of the time, If 'wait_time' is zero, it can 
pass check.
>>> ---
>>>   fs/ext4/mmp.c | 14 +++++++++-----
>>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>>> index 12af6dc8457b..c781b09a78c9 100644
>>> --- a/fs/ext4/mmp.c
>>> +++ b/fs/ext4/mmp.c
>>> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
>>>       int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>>>       unsigned mmp_check_interval;
>>>       unsigned long last_update_time;
>>> +    unsigned long last_check_time;
>>>       unsigned long diff;
>>>       int retval = 0;
>>>   @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>>>         memcpy(mmp->mmp_nodename, init_utsname()->nodename,
>>>              sizeof(mmp->mmp_nodename));
>>> +    last_check_time = jiffies;
>>>         while (!kthread_should_stop() && !sb_rdonly(sb)) {
>>>           if (!ext4_has_feature_mmp(sb)) {
>>> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
>>>           }
>>>             diff = jiffies - last_update_time;
>>> -        if (diff < mmp_update_interval * HZ)
>>> +        if (diff < mmp_update_interval * HZ) {
>>> schedule_timeout_interruptible(mmp_update_interval *
>>>                                  HZ - diff);
>>> +            diff = jiffies - last_update_time;
>>> +        }
>>>             /*
>>>            * We need to make sure that more than mmp_check_interval
>>> -         * seconds have not passed since writing. If that has happened
>>> -         * we need to check if the MMP block is as we left it.
>>> +         * seconds have not passed since check. If that has happened
>>> +         * we need to check if the MMP block is as we write it.
>>>            */
>>> -        diff = jiffies - last_update_time;
>>> -        if (diff > mmp_check_interval * HZ) {
>>> +        if (jiffies - last_check_time > mmp_check_interval * HZ) {
>>>               struct buffer_head *bh_check = NULL;
>>>               struct mmp_struct *mmp_check;
>>>   @@ -234,6 +237,7 @@ static int kmmpd(void *data)
>>>                   goto wait_to_exit;
>>>               }
>>>               put_bh(bh_check);
>>> +            last_check_time = jiffies;
>>>           }
>>>              /*
>>> -- 
>>> 2.31.1
>>>
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount
  2021-10-07 13:12   ` Jan Kara
@ 2021-10-08  3:49     ` yebin
  0 siblings, 0 replies; 24+ messages in thread
From: yebin @ 2021-10-08  3:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/7 21:12, Jan Kara wrote:
> On Sat 11-09-21 17:00:59, Ye Bin wrote:
>> Test follow steps:
>> 1. mkfs.ext4 /dev/sda -O mmp
>> 2. mount /dev/sda  /mnt
>> 3. wait for about 1 minute
>> 4. umount mnt
>> 5. debugfs /dev/sda
>> 6. dump_mmp
>> 7. fsck.ext4 /dev/sda
>>
>> I found 'check_interval' is range in [5, 10]. And sometime run fsck
>> print "MMP interval is 10 seconds and total wait time is 42 seconds.
>> Please wait...".
>> kmmpd:
>> ...
>> 	if (diff < mmp_update_interval * HZ)
>> 		schedule_timeout_interruptible(mmp_update_interval * HZ - diff);
>> 	 diff = jiffies - last_update_time;
>> ...
>> 	mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ,
>> 				EXT4_MMP_MAX_CHECK_INTERVAL),
>> 			        EXT4_MMP_MIN_CHECK_INTERVAL);
>> 	mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
>> ...
>> We will call ext4_stop_mmpd to stop kmmpd kthread when umount, and
>> schedule_timeout_interruptible will be interrupted, so 'diff' maybe
>> little than mmp_update_interval. Then mmp_check_interval will range
>> in [EXT4_MMP_MAX_CHECK_INTERVAL, EXT4_MMP_CHECK_MULT * diff / HZ].
>> To solve this issue, if 'diff' little then mmp_update_interval * HZ
>> just break loop, don't update check interval.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>> ---
>>   fs/ext4/mmp.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>> index a0d47a906faa..f39e1fa0c6db 100644
>> --- a/fs/ext4/mmp.c
>> +++ b/fs/ext4/mmp.c
>> @@ -205,6 +205,14 @@ static int kmmpd(void *data)
>>   			schedule_timeout_interruptible(mmp_update_interval *
>>   						       HZ - diff);
>>   			diff = jiffies - last_update_time;
>> +			/* If 'diff' little 'than mmp_update_interval * HZ', it
>> +			 * means someone call ext4_stop_mmpd to stop kmmpd
>> +			 * kthread. We don't need to update mmp_check_interval
>> +			 * any more, as 'diff' is not exact value.
>> +			 */
>> +			if (unlikely(diff < mmp_update_interval * HZ &&
>> +			    kthread_should_stop()))
>> +				break;
>>   		}
> So in this case, mmp_check_interval would be EXT4_MMP_MIN_CHECK_INTERVAL. I
> don't quite understand what the practical problem is - the fsck message?
> That will happen anytime mmp_check_interval is >= 10 AFAICT and I don't
> quite see how that is connected to this condition... Can you explain a bit
> more please?
>
> 								Honza
I just think 'mmp_check_interval' is not reflect real check interval, 
and also sometime run fsck
print "MMP interval is 10 seconds and total wait time is 42 seconds. 
Please wait...", but
sometime not.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache
  2021-10-07 12:44   ` Jan Kara
@ 2021-10-08  3:52     ` yebin
  0 siblings, 0 replies; 24+ messages in thread
From: yebin @ 2021-10-08  3:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/7 20:44, Jan Kara wrote:
> On Sat 11-09-21 17:00:57, Ye Bin wrote:
>> As call read_mmp_block pass bh_check which value is NULL, then call
>> sb_getblk to get buffer_head. But mmp_block's buffer_head is already exist
>>   which also is uptodate. Lead to compare the same data.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
> This looks good, except that read_mmp_block() also releases bh it is passed
> in case of error. So it can free buffer head which is still referenced from
> EXT4_SB(sb)->s_mmp_bh and cause use-after-free issues.
>
> I guess I would just get rid of sb_getblk() in read_mmp_block() and always
> expect valid bh passed. The only place that passes NULL bh after this
> patch is one case in ext4_multi_mount_protect() and that can call
> sb_getblk() on its own. That way we can also simplify read_mmp_block()
> prototype to:
>
> static int read_mmp_block(struct super_block *sb, struct buffer_head *bh);
>
> 								Honza
> Yeah, I will refactor 'read_mmp_block'.
>> ---
>>   fs/ext4/mmp.c | 17 ++++++-----------
>>   1 file changed, 6 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>> index 4433fe7e9e86..007bde3c12b8 100644
>> --- a/fs/ext4/mmp.c
>> +++ b/fs/ext4/mmp.c
>> @@ -213,10 +213,7 @@ static int kmmpd(void *data)
>>   		 * we need to check if the MMP block is as we write it.
>>   		 */
>>   		if (jiffies - last_check_time > mmp_check_interval * HZ) {
>> -			struct buffer_head *bh_check = NULL;
>> -			struct mmp_struct *mmp_check;
>> -
>> -			retval = read_mmp_block(sb, &bh_check, mmp_block);
>> +			retval = read_mmp_block(sb, &bh, mmp_block);
>>   			if (retval) {
>>   				ext4_error_err(sb, -retval,
>>   					       "error reading MMP data: %d",
>> @@ -224,20 +221,18 @@ static int kmmpd(void *data)
>>   				goto wait_to_exit;
>>   			}
>>   
>> -			mmp_check = (struct mmp_struct *)(bh_check->b_data);
>> -			if (seq != mmp_check->mmp_seq ||
>> -			    memcmp(nodename, mmp_check->mmp_nodename,
>> -				   sizeof(mmp->mmp_nodename))) {
>> -				dump_mmp_msg(sb, mmp_check,
>> +			mmp = (struct mmp_struct *)(bh->b_data);
>> +			if (seq != le32_to_cpu(mmp->mmp_seq) ||
>> +			    memcmp(nodename, mmp->mmp_nodename,
>> +				    sizeof(nodename))) {
>> +				dump_mmp_msg(sb, mmp,
>>   					     "Error while updating MMP info. "
>>   					     "The filesystem seems to have been"
>>   					     " multiply mounted.");
>>   				ext4_error_err(sb, EBUSY, "abort");
>> -				put_bh(bh_check);
>>   				retval = -EBUSY;
>>   				goto wait_to_exit;
>>   			}
>> -			put_bh(bh_check);
>>   			last_check_time = jiffies;
>>   		}
>>   
>> -- 
>> 2.31.1
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-08  2:38       ` yebin
@ 2021-10-12  8:47         ` Jan Kara
  2021-10-12 11:46           ` yebin
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-10-12  8:47 UTC (permalink / raw)
  To: yebin; +Cc: Jan Kara, tytso, adilger.kernel, linux-ext4, linux-kernel

On Fri 08-10-21 10:38:31, yebin wrote:
> On 2021/10/8 9:56, yebin wrote:
> > On 2021/10/7 20:31, Jan Kara wrote:
> > > On Sat 11-09-21 17:00:55, Ye Bin wrote:
> > > > kmmpd:
> > > > ...
> > > >      diff = jiffies - last_update_time;
> > > >      if (diff > mmp_check_interval * HZ) {
> > > > ...
> > > > As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> > > > than 'mmp_update_interval', so there will never trigger detection.
> > > > Introduce last_check_time record previous check time.
> > > > 
> > > > Signed-off-by: Ye Bin <yebin10@huawei.com>
> > > I think the check is there only for the case where write_mmp_block() +
> > > sleep took longer than mmp_check_interval. I agree that should rarely
> > > happen but on a really busy system it is possible and in that case
> > > we would
> > > miss updating mmp block for too long and so another node could have
> > > started
> > > using the filesystem. I actually don't see a reason why kmmpd should be
> > > checking the block each mmp_check_interval as you do -
> > > mmp_check_interval
> > > is just for ext4_multi_mount_protect() to know how long it should wait
> > > before considering mmp block stale... Am I missing something?
> > > 
> > >                                 Honza
> > I'm sorry, I didn't understand the detection mechanism here before. Now
> > I understand
> > the detection mechanism here.
> > As you said, it's just an abnormal protection. There's really no problem.
> > 
> Yeah, i did test as following steps
> hostA                        hostB
>    mount
>      ext4_multi_mount_protect  -> seq == EXT4_MMP_SEQ_CLEAN
>         delay 5s after label "skip" so hostB will see seq is
> EXT4_MMP_SEQ_CLEAN
>                        mount
>                        ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>                                run  kmmpd
>     run kmmpd
> 
> Actually,in this  situation kmmpd will not detect  confliction.
> In ext4_multi_mount_protect function we write mmp data first and wait
> 'wait_time * HZ'  seconds,
> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
> check.

But how can be wait_time zero? As far as I'm reading the code, wait_time
must be at least EXT4_MMP_MIN_CHECK_INTERVAL...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-12  8:47         ` Jan Kara
@ 2021-10-12 11:46           ` yebin
  2021-10-13  9:38             ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: yebin @ 2021-10-12 11:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/12 16:47, Jan Kara wrote:
> On Fri 08-10-21 10:38:31, yebin wrote:
>> On 2021/10/8 9:56, yebin wrote:
>>> On 2021/10/7 20:31, Jan Kara wrote:
>>>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>>>> kmmpd:
>>>>> ...
>>>>>       diff = jiffies - last_update_time;
>>>>>       if (diff > mmp_check_interval * HZ) {
>>>>> ...
>>>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>>>> than 'mmp_update_interval', so there will never trigger detection.
>>>>> Introduce last_check_time record previous check time.
>>>>>
>>>>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>>>> I think the check is there only for the case where write_mmp_block() +
>>>> sleep took longer than mmp_check_interval. I agree that should rarely
>>>> happen but on a really busy system it is possible and in that case
>>>> we would
>>>> miss updating mmp block for too long and so another node could have
>>>> started
>>>> using the filesystem. I actually don't see a reason why kmmpd should be
>>>> checking the block each mmp_check_interval as you do -
>>>> mmp_check_interval
>>>> is just for ext4_multi_mount_protect() to know how long it should wait
>>>> before considering mmp block stale... Am I missing something?
>>>>
>>>>                                  Honza
>>> I'm sorry, I didn't understand the detection mechanism here before. Now
>>> I understand
>>> the detection mechanism here.
>>> As you said, it's just an abnormal protection. There's really no problem.
>>>
>> Yeah, i did test as following steps
>> hostA                        hostB
>>     mount
>>       ext4_multi_mount_protect  -> seq == EXT4_MMP_SEQ_CLEAN
>>          delay 5s after label "skip" so hostB will see seq is
>> EXT4_MMP_SEQ_CLEAN
>>                         mount
>>                         ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>>                                 run  kmmpd
>>      run kmmpd
>>
>> Actually,in this  situation kmmpd will not detect  confliction.
>> In ext4_multi_mount_protect function we write mmp data first and wait
>> 'wait_time * HZ'  seconds,
>> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
>> check.
> But how can be wait_time zero? As far as I'm reading the code, wait_time
> must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
>
> 								Honza
  int ext4_multi_mount_protect(struct super_block *sb,
                                      ext4_fsblk_t mmp_block)
  {
          struct ext4_super_block *es = EXT4_SB(sb)->s_es;
          struct buffer_head *bh = NULL;
          struct mmp_struct *mmp = NULL;
          u32 seq;
          unsigned int mmp_check_interval = 
le16_to_cpu(es->s_mmp_update_interval);
          unsigned int wait_time = 0;                    --> wait_time 
is equal with zero
          int retval;

          if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
              mmp_block >= ext4_blocks_count(es)) {
                  ext4_warning(sb, "Invalid MMP block in superblock");
                  goto failed;
          }

          retval = read_mmp_block(sb, &bh, mmp_block);
          if (retval)
                  goto failed;

          mmp = (struct mmp_struct *)(bh->b_data);

          if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
                  mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;

          /*
           * If check_interval in MMP block is larger, use that instead of
           * update_interval from the superblock.
           */
          if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
                  mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);

          seq = le32_to_cpu(mmp->mmp_seq);
          if (seq == EXT4_MMP_SEQ_CLEAN)   --> If hostA and hostB mount 
the same block device at the same time,
--> HostA and hostB  maybe get 'seq' with the same value 
EXT4_MMP_SEQ_CLEAN.
                  goto skip;
...
skip:
         /*
          * write a new random sequence number.
          */
         seq = mmp_new_seq();
         mmp->mmp_seq = cpu_to_le32(seq);

         retval = write_mmp_block(sb, bh);
         if (retval)
                 goto failed;

         /*
          * wait for MMP interval and check mmp_seq.
          */
         if (schedule_timeout_interruptible(HZ * wait_time) != 0) 
{        --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
                 ext4_warning(sb, "MMP startup interrupted, failing mount");
                 goto failed;
         }

         retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the 
same data with which we wrote, so we can't detect conflict at here.
         if (retval)
                 goto failed;
         mmp = (struct mmp_struct *)(bh->b_data);
         if (seq != le32_to_cpu(mmp->mmp_seq)) {
                 dump_mmp_msg(sb, mmp,
                              "Device is already active on another node.");
                 goto failed;
         }
...
}


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-12 11:46           ` yebin
@ 2021-10-13  9:38             ` Jan Kara
  2021-10-13 12:33               ` yebin
  2021-10-13 21:41               ` Theodore Ts'o
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Kara @ 2021-10-13  9:38 UTC (permalink / raw)
  To: yebin; +Cc: Jan Kara, tytso, adilger.kernel, linux-ext4, linux-kernel

On Tue 12-10-21 19:46:24, yebin wrote:
> On 2021/10/12 16:47, Jan Kara wrote:
> > On Fri 08-10-21 10:38:31, yebin wrote:
> > > On 2021/10/8 9:56, yebin wrote:
> > > > On 2021/10/7 20:31, Jan Kara wrote:
> > > > > On Sat 11-09-21 17:00:55, Ye Bin wrote:
> > > > > > kmmpd:
> > > > > > ...
> > > > > >       diff = jiffies - last_update_time;
> > > > > >       if (diff > mmp_check_interval * HZ) {
> > > > > > ...
> > > > > > As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> > > > > > than 'mmp_update_interval', so there will never trigger detection.
> > > > > > Introduce last_check_time record previous check time.
> > > > > > 
> > > > > > Signed-off-by: Ye Bin <yebin10@huawei.com>
> > > > > I think the check is there only for the case where write_mmp_block() +
> > > > > sleep took longer than mmp_check_interval. I agree that should rarely
> > > > > happen but on a really busy system it is possible and in that case
> > > > > we would
> > > > > miss updating mmp block for too long and so another node could have
> > > > > started
> > > > > using the filesystem. I actually don't see a reason why kmmpd should be
> > > > > checking the block each mmp_check_interval as you do -
> > > > > mmp_check_interval
> > > > > is just for ext4_multi_mount_protect() to know how long it should wait
> > > > > before considering mmp block stale... Am I missing something?
> > > > > 
> > > > >                                  Honza
> > > > I'm sorry, I didn't understand the detection mechanism here before. Now
> > > > I understand
> > > > the detection mechanism here.
> > > > As you said, it's just an abnormal protection. There's really no problem.
> > > > 
> > > Yeah, i did test as following steps
> > > hostA                        hostB
> > >     mount
> > >       ext4_multi_mount_protect  -> seq == EXT4_MMP_SEQ_CLEAN
> > >          delay 5s after label "skip" so hostB will see seq is
> > > EXT4_MMP_SEQ_CLEAN
> > >                         mount
> > >                         ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
> > >                                 run  kmmpd
> > >      run kmmpd
> > > 
> > > Actually,in this  situation kmmpd will not detect  confliction.
> > > In ext4_multi_mount_protect function we write mmp data first and wait
> > > 'wait_time * HZ'  seconds,
> > > read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
> > > check.
> > But how can be wait_time zero? As far as I'm reading the code, wait_time
> > must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
> > 
> > 								Honza
>  int ext4_multi_mount_protect(struct super_block *sb,
>                                      ext4_fsblk_t mmp_block)
>  {
>          struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>          struct buffer_head *bh = NULL;
>          struct mmp_struct *mmp = NULL;
>          u32 seq;
>          unsigned int mmp_check_interval =
> le16_to_cpu(es->s_mmp_update_interval);
>          unsigned int wait_time = 0;                    --> wait_time is
> equal with zero
>          int retval;
> 
>          if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
>              mmp_block >= ext4_blocks_count(es)) {
>                  ext4_warning(sb, "Invalid MMP block in superblock");
>                  goto failed;
>          }
> 
>          retval = read_mmp_block(sb, &bh, mmp_block);
>          if (retval)
>                  goto failed;
> 
>          mmp = (struct mmp_struct *)(bh->b_data);
> 
>          if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
>                  mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
> 
>          /*
>           * If check_interval in MMP block is larger, use that instead of
>           * update_interval from the superblock.
>           */
>          if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
>                  mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);
> 
>          seq = le32_to_cpu(mmp->mmp_seq);
>          if (seq == EXT4_MMP_SEQ_CLEAN)   --> If hostA and hostB mount the
> same block device at the same time,
> --> HostA and hostB  maybe get 'seq' with the same value EXT4_MMP_SEQ_CLEAN.
>                  goto skip;

Oh, I see. Thanks for explanation. 

> ...
> skip:
>         /*
>          * write a new random sequence number.
>          */
>         seq = mmp_new_seq();
>         mmp->mmp_seq = cpu_to_le32(seq);
> 
>         retval = write_mmp_block(sb, bh);
>         if (retval)
>                 goto failed;
> 
>         /*
>          * wait for MMP interval and check mmp_seq.
>          */
>         if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
> --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
>                 ext4_warning(sb, "MMP startup interrupted, failing mount");
>                 goto failed;
>         }
> 
>         retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the same
> data with which we wrote, so we can't detect conflict at here.

OK, I see. So the race in ext4_multi_mount_protect() goes like:

hostA				hostB

read_mmp_block()		read_mmp_block()
- sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
write_mmp_block()
wait_time == 0 -> no wait
read_mmp_block()
  - all OK, mount
				write_mmp_block()
				wait_time == 0 -> no wait
				read_mmp_block()
				  - all OK, mount

Do I get it right? Actually, if we passed seq we wrote in
ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
notice the conflict on its first invocation but still that would be a bit
late because there would be a time window where hostA and hostB would be
both using the fs.

We could reduce the likelyhood of this race by always waiting in
ext4_multi_mount_protect() between write & read but I guess that is
undesirable as it would slow down all clean mounts. Ted?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-13  9:38             ` Jan Kara
@ 2021-10-13 12:33               ` yebin
  2021-10-13 21:41               ` Theodore Ts'o
  1 sibling, 0 replies; 24+ messages in thread
From: yebin @ 2021-10-13 12:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4, linux-kernel



On 2021/10/13 17:38, Jan Kara wrote:
> On Tue 12-10-21 19:46:24, yebin wrote:
>> On 2021/10/12 16:47, Jan Kara wrote:
>>> On Fri 08-10-21 10:38:31, yebin wrote:
>>>> On 2021/10/8 9:56, yebin wrote:
>>>>> On 2021/10/7 20:31, Jan Kara wrote:
>>>>>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>>>>>> kmmpd:
>>>>>>> ...
>>>>>>>        diff = jiffies - last_update_time;
>>>>>>>        if (diff > mmp_check_interval * HZ) {
>>>>>>> ...
>>>>>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>>>>>> than 'mmp_update_interval', so there will never trigger detection.
>>>>>>> Introduce last_check_time record previous check time.
>>>>>>>
>>>>>>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>>>>>> I think the check is there only for the case where write_mmp_block() +
>>>>>> sleep took longer than mmp_check_interval. I agree that should rarely
>>>>>> happen but on a really busy system it is possible and in that case
>>>>>> we would
>>>>>> miss updating mmp block for too long and so another node could have
>>>>>> started
>>>>>> using the filesystem. I actually don't see a reason why kmmpd should be
>>>>>> checking the block each mmp_check_interval as you do -
>>>>>> mmp_check_interval
>>>>>> is just for ext4_multi_mount_protect() to know how long it should wait
>>>>>> before considering mmp block stale... Am I missing something?
>>>>>>
>>>>>>                                   Honza
>>>>> I'm sorry, I didn't understand the detection mechanism here before. Now
>>>>> I understand
>>>>> the detection mechanism here.
>>>>> As you said, it's just an abnormal protection. There's really no problem.
>>>>>
>>>> Yeah, i did test as following steps
>>>> hostA                        hostB
>>>>      mount
>>>>        ext4_multi_mount_protect  -> seq == EXT4_MMP_SEQ_CLEAN
>>>>           delay 5s after label "skip" so hostB will see seq is
>>>> EXT4_MMP_SEQ_CLEAN
>>>>                          mount
>>>>                          ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>>>>                                  run  kmmpd
>>>>       run kmmpd
>>>>
>>>> Actually,in this  situation kmmpd will not detect  confliction.
>>>> In ext4_multi_mount_protect function we write mmp data first and wait
>>>> 'wait_time * HZ'  seconds,
>>>> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
>>>> check.
>>> But how can be wait_time zero? As far as I'm reading the code, wait_time
>>> must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
>>>
>>> 								Honza
>>   int ext4_multi_mount_protect(struct super_block *sb,
>>                                       ext4_fsblk_t mmp_block)
>>   {
>>           struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>           struct buffer_head *bh = NULL;
>>           struct mmp_struct *mmp = NULL;
>>           u32 seq;
>>           unsigned int mmp_check_interval =
>> le16_to_cpu(es->s_mmp_update_interval);
>>           unsigned int wait_time = 0;                    --> wait_time is
>> equal with zero
>>           int retval;
>>
>>           if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
>>               mmp_block >= ext4_blocks_count(es)) {
>>                   ext4_warning(sb, "Invalid MMP block in superblock");
>>                   goto failed;
>>           }
>>
>>           retval = read_mmp_block(sb, &bh, mmp_block);
>>           if (retval)
>>                   goto failed;
>>
>>           mmp = (struct mmp_struct *)(bh->b_data);
>>
>>           if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
>>                   mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
>>
>>           /*
>>            * If check_interval in MMP block is larger, use that instead of
>>            * update_interval from the superblock.
>>            */
>>           if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
>>                   mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);
>>
>>           seq = le32_to_cpu(mmp->mmp_seq);
>>           if (seq == EXT4_MMP_SEQ_CLEAN)   --> If hostA and hostB mount the
>> same block device at the same time,
>> --> HostA and hostB  maybe get 'seq' with the same value EXT4_MMP_SEQ_CLEAN.
>>                   goto skip;
> Oh, I see. Thanks for explanation.
>
>> ...
>> skip:
>>          /*
>>           * write a new random sequence number.
>>           */
>>          seq = mmp_new_seq();
>>          mmp->mmp_seq = cpu_to_le32(seq);
>>
>>          retval = write_mmp_block(sb, bh);
>>          if (retval)
>>                  goto failed;
>>
>>          /*
>>           * wait for MMP interval and check mmp_seq.
>>           */
>>          if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
>> --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
>>                  ext4_warning(sb, "MMP startup interrupted, failing mount");
>>                  goto failed;
>>          }
>>
>>          retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the same
>> data with which we wrote, so we can't detect conflict at here.
> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>
> hostA				hostB
>
> read_mmp_block()		read_mmp_block()
> - sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
>    - all OK, mount
> 				write_mmp_block()
> 				wait_time == 0 -> no wait
> 				read_mmp_block()
> 				  - all OK, mount
Yes, that's what i mean.
>
> Do I get it right? Actually, if we passed seq we wrote in
> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
> notice the conflict on its first invocation but still that would be a bit
> late because there would be a time window where hostA and hostB would be
> both using the fs.
>
> We could reduce the likelyhood of this race by always waiting in
> ext4_multi_mount_protect() between write & read but I guess that is
> undesirable as it would slow down all clean mounts. Ted?
>
> 								Honza


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-13  9:38             ` Jan Kara
  2021-10-13 12:33               ` yebin
@ 2021-10-13 21:41               ` Theodore Ts'o
  2021-10-15  3:21                 ` Andreas Dilger
  2021-10-15  3:21                 ` Andreas Dilger
  1 sibling, 2 replies; 24+ messages in thread
From: Theodore Ts'o @ 2021-10-13 21:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: yebin, adilger.kernel, linux-ext4, linux-kernel

On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
> 
> OK, I see. So the race in ext4_multi_mount_protect() goes like:
> 
> hostA				hostB
> 
> read_mmp_block()		read_mmp_block()
> - sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
>   - all OK, mount
> 				write_mmp_block()
> 				wait_time == 0 -> no wait
> 				read_mmp_block()
> 				  - all OK, mount
> 
> Do I get it right? Actually, if we passed seq we wrote in
> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
> notice the conflict on its first invocation but still that would be a bit
> late because there would be a time window where hostA and hostB would be
> both using the fs.
> 
> We could reduce the likelyhood of this race by always waiting in
> ext4_multi_mount_protect() between write & read but I guess that is
> undesirable as it would slow down all clean mounts. Ted?

I'd like Andreas to comment here.  My understanding is that MMP
originally intended as a safety mechanism which would be used as part
of a primary/backup high availability system, but not as the *primary*
system where you might try to have two servers simultaneously try to
mount the file system and use MMP as the "election" mechanism to
decide which server is going to be the primary system, and which would
be the backup system.

The cost of being able to handle this particular race is it would slow
down the mounts of cleanly unmounted systems.

There *are* better systems to implement leader elections[1] than using
MMP.  Most of these more efficient leader elections assume that you
have a working IP network, and so if you have a separate storage
network (including a shared SCSI bus) from your standard IP network,
then MMP is a useful failsafe in the face of a network partition of
your IP network.  The question is whether MMP should be useful for
more than that.  And if it isn't, then we should probably document
what MMP is and isn't good for, and give advice in the form of an
application note for how MMP should be used in the context of a larger
system.

[1] https://en.wikipedia.org/wiki/Leader_election

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-13 21:41               ` Theodore Ts'o
@ 2021-10-15  3:21                 ` Andreas Dilger
  2021-10-15  3:21                 ` Andreas Dilger
  1 sibling, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2021-10-15  3:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, yebin, Andreas Dilger, linux-ext4, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3518 bytes --]

On Oct 13, 2021, at 3:41 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>> 
>> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>> 
>> hostA				hostB
>> 
>> read_mmp_block()		read_mmp_block()
>> - sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>>  - all OK, mount
>> 				write_mmp_block()
>> 				wait_time == 0 -> no wait
>> 				read_mmp_block()
>> 				  - all OK, mount
>> 
>> Do I get it right? Actually, if we passed seq we wrote in
>> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
>> notice the conflict on its first invocation but still that would be a bit
>> late because there would be a time window where hostA and hostB would be
>> both using the fs.

It would be enough to have even a short delay between write and read to
detect this case.  I _thought_ there should be a delay in this case,
but maybe it was removed after the patch was originally submitted?

>> We could reduce the likelyhood of this race by always waiting in
>> ext4_multi_mount_protect() between write & read but I guess that is
>> undesirable as it would slow down all clean mounts. Ted?
> 
> I'd like Andreas to comment here.  My understanding is that MMP
> originally intended as a safety mechanism which would be used as part
> of a primary/backup high availability system, but not as the *primary*
> system where you might try to have two servers simultaneously try to
> mount the file system and use MMP as the "election" mechanism to
> decide which server is going to be the primary system, and which would
> be the backup system.
> 
> The cost of being able to handle this particular race is it would slow
> down the mounts of cleanly unmounted systems.

Ted's understanding is correct - MMP is intended to be a backup mechanism
to prevent filesystem corruption in the case where external HA methods
do the wrong thing.  This has avoided problems countless times on systems
with multi-port access to the same storage, and can also be useful in the
case of shared VM images accessed over the network, and similar.

When MMP was implemented for ZFS, a slightly different mechanism was used.
Rather than having the delay to detect concurrent mounts, it instead writes
to multiple different blocks in a random order, and then reads them all.
If two nodes try to mount the filesystem concurrently, they would pick
different block orders, and the chance of them having the same order (and
one clobbering all of the blocks of the other) would be 1/2^num_blocks.
The drawback is that this would consume more space in the filesystem, but
it wouldn't be a huge deal these days.

> There *are* better systems to implement leader elections[1] than using
> MMP.  Most of these more efficient leader elections assume that you
> have a working IP network, and so if you have a separate storage
> network (including a shared SCSI bus) from your standard IP network,
> then MMP is a useful failsafe in the face of a network partition of
> your IP network.  The question is whether MMP should be useful for
> more than that.  And if it isn't, then we should probably document
> what MMP is and isn't good for, and give advice in the form of an
> application note for how MMP should be used in the context of a larger
> system.

One of the existing failure cases with HA that MMP detects is loss of
network connection, so I wouldn't want to depend on that.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time
  2021-10-13 21:41               ` Theodore Ts'o
  2021-10-15  3:21                 ` Andreas Dilger
@ 2021-10-15  3:21                 ` Andreas Dilger
  1 sibling, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2021-10-15  3:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, yebin, Andreas Dilger, linux-ext4, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3518 bytes --]

On Oct 13, 2021, at 3:41 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>> 
>> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>> 
>> hostA				hostB
>> 
>> read_mmp_block()		read_mmp_block()
>> - sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>>  - all OK, mount
>> 				write_mmp_block()
>> 				wait_time == 0 -> no wait
>> 				read_mmp_block()
>> 				  - all OK, mount
>> 
>> Do I get it right? Actually, if we passed seq we wrote in
>> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
>> notice the conflict on its first invocation but still that would be a bit
>> late because there would be a time window where hostA and hostB would be
>> both using the fs.

It would be enough to have even a short delay between write and read to
detect this case.  I _thought_ there should be a delay in this case,
but maybe it was removed after the patch was originally submitted?

>> We could reduce the likelyhood of this race by always waiting in
>> ext4_multi_mount_protect() between write & read but I guess that is
>> undesirable as it would slow down all clean mounts. Ted?
> 
> I'd like Andreas to comment here.  My understanding is that MMP
> originally intended as a safety mechanism which would be used as part
> of a primary/backup high availability system, but not as the *primary*
> system where you might try to have two servers simultaneously try to
> mount the file system and use MMP as the "election" mechanism to
> decide which server is going to be the primary system, and which would
> be the backup system.
> 
> The cost of being able to handle this particular race is it would slow
> down the mounts of cleanly unmounted systems.

Ted's understanding is correct - MMP is intended to be a backup mechanism
to prevent filesystem corruption in the case where external HA methods
do the wrong thing.  This has avoided problems countless times on systems
with multi-port access to the same storage, and can also be useful in the
case of shared VM images accessed over the network, and similar.

When MMP was implemented for ZFS, a slightly different mechanism was used.
Rather than having the delay to detect concurrent mounts, it instead writes
to multiple different blocks in a random order, and then reads them all.
If two nodes try to mount the filesystem concurrently, they would pick
different block orders, and the chance of them having the same order (and
one clobbering all of the blocks of the other) would be 1/2^num_blocks.
The drawback is that this would consume more space in the filesystem, but
it wouldn't be a huge deal these days.

> There *are* better systems to implement leader elections[1] than using
> MMP.  Most of these more efficient leader elections assume that you
> have a working IP network, and so if you have a separate storage
> network (including a shared SCSI bus) from your standard IP network,
> then MMP is a useful failsafe in the face of a network partition of
> your IP network.  The question is whether MMP should be useful for
> more than that.  And if it isn't, then we should probably document
> what MMP is and isn't good for, and give advice in the form of an
> application note for how MMP should be used in the context of a larger
> system.

One of the existing failure cases with HA that MMP detects is loss of
network connection, so I wouldn't want to depend on that.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-10-15  3:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-11  9:00 [PATCH -next v2 0/6] Fix some issues about mmp Ye Bin
2021-09-11  9:00 ` [PATCH -next v2 1/6] ext4: init seq with random value in kmmpd Ye Bin
2021-10-07 12:26   ` Jan Kara
2021-10-08  1:50     ` yebin
2021-09-11  9:00 ` [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time Ye Bin
2021-10-07 12:31   ` Jan Kara
2021-10-08  1:56     ` yebin
2021-10-08  2:38       ` yebin
2021-10-12  8:47         ` Jan Kara
2021-10-12 11:46           ` yebin
2021-10-13  9:38             ` Jan Kara
2021-10-13 12:33               ` yebin
2021-10-13 21:41               ` Theodore Ts'o
2021-10-15  3:21                 ` Andreas Dilger
2021-10-15  3:21                 ` Andreas Dilger
2021-09-11  9:00 ` [PATCH -next v2 3/6] ext4: compare to local seq and nodename when check conflict Ye Bin
2021-10-07 12:36   ` Jan Kara
2021-09-11  9:00 ` [PATCH -next v2 4/6] ext4: avoid to re-read mmp check data get from page cache Ye Bin
2021-10-07 12:44   ` Jan Kara
2021-10-08  3:52     ` yebin
2021-09-11  9:00 ` [PATCH -next v2 5/6] ext4: avoid to double free s_mmp_bh Ye Bin
2021-09-11  9:00 ` [PATCH -next v2 6/6] ext4: fix possible store wrong check interval value in disk when umount Ye Bin
2021-10-07 13:12   ` Jan Kara
2021-10-08  3:49     ` yebin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).