* [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
@ 2017-09-12 1:49 ` NeilBrown
2017-09-12 1:49 ` [PATCH 1/4] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
` (5 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-09-12 1:49 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
mddev_suspend() is a more general interface than
calling ->quiesce() and so is more extensible. A
future patch will make use of this.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/md.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d6d491243411..0b167169fef9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4820,8 +4820,8 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
mddev->pers->quiesce(mddev, 2);
else {
/* Expanding suspended region - need to wait */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
+ mddev_suspend(mddev);
+ mddev_resume(mddev);
}
err = 0;
unlock:
@@ -4863,8 +4863,8 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
mddev->pers->quiesce(mddev, 2);
else {
/* Expanding suspended region - need to wait */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
+ mddev_suspend(mddev);
+ mddev_resume(mddev);
}
err = 0;
unlock:
@@ -6595,7 +6595,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
struct bitmap *bitmap;
bitmap = bitmap_create(mddev, -1);
- mddev->pers->quiesce(mddev, 1);
+ mddev_suspend(mddev);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
err = bitmap_load(mddev);
@@ -6605,11 +6605,11 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
bitmap_destroy(mddev);
fd = -1;
}
- mddev->pers->quiesce(mddev, 0);
+ mddev_resume(mddev);
} else if (fd < 0) {
- mddev->pers->quiesce(mddev, 1);
+ mddev_suspend(mddev);
bitmap_destroy(mddev);
- mddev->pers->quiesce(mddev, 0);
+ mddev_resume(mddev);
}
}
if (fd < 0) {
@@ -6895,7 +6895,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.space =
mddev->bitmap_info.default_space;
bitmap = bitmap_create(mddev, -1);
- mddev->pers->quiesce(mddev, 1);
+ mddev_suspend(mddev);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
rv = bitmap_load(mddev);
@@ -6903,7 +6903,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
rv = PTR_ERR(bitmap);
if (rv)
bitmap_destroy(mddev);
- mddev->pers->quiesce(mddev, 0);
+ mddev_resume(mddev);
} else {
/* remove the bitmap */
if (!mddev->bitmap) {
@@ -6926,9 +6926,9 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.nodes = 0;
md_cluster_ops->leave(mddev);
}
- mddev->pers->quiesce(mddev, 1);
+ mddev_suspend(mddev);
bitmap_destroy(mddev);
- mddev->pers->quiesce(mddev, 0);
+ mddev_resume(mddev);
mddev->bitmap_info.offset = 0;
}
}
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 1/4] md: always hold reconfig_mutex when calling mddev_suspend()
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
2017-09-12 1:49 ` [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
@ 2017-09-12 1:49 ` NeilBrown
2017-09-12 1:49 ` [PATCH 4/4] md: allow metadata update while suspending NeilBrown
` (4 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-09-12 1:49 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
Most often mddev_suspend() is called with
reconfig_mutex held. Make this a requirement in
preparation for a subsequent patch.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/dm-raid.c | 5 ++++-
drivers/md/md.c | 1 +
drivers/md/raid5-cache.c | 2 ++
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5bfe285ea9d1..f013b03e15c3 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3628,8 +3628,11 @@ static void raid_postsuspend(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
- if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
+ if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+ mddev_lock_nointr(&rs->md);
mddev_suspend(&rs->md);
+ mddev_unlock(&rs->md);
+ }
rs->md.ro = 1;
}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b01e458d31e9..675c9e6495e4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -336,6 +336,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
void mddev_suspend(struct mddev *mddev)
{
WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
+ lockdep_assert_held(&mddev->reconfig_mutex);
if (mddev->suspended++)
return;
synchronize_rcu();
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2dcbafa8e66c..8d09c2fa3d63 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -703,9 +703,11 @@ static void r5c_disable_writeback_async(struct work_struct *work)
wait_event(mddev->sb_wait,
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
+ mddev_lock_nointr(mddev);
mddev_suspend(mddev);
log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
mddev_resume(mddev);
+ mddev_unlock(mddev);
}
static void r5l_submit_current_io(struct r5l_log *log)
* [PATCH 4/4] md: allow metadata update while suspending.
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
2017-09-12 1:49 ` [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
2017-09-12 1:49 ` [PATCH 1/4] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
@ 2017-09-12 1:49 ` NeilBrown
2017-09-12 1:49 ` [PATCH 2/4] md: don't call bitmap_create() while array is quiesced NeilBrown
` (3 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-09-12 1:49 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As the md thread cannot update the metadata while the
reconfig mutex is held, the array may not fully
quiesce.
For example, raid5 will not complete stripes
while a metadata update is pending, and that
prevents raid5_quiesce() from completing.
->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held. So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.
So add 2 new flags, one which says the unlocked update is
allowed, and one which says it is happening. Then allow
it while the quiesce completes, and afterwards wait for it to finish.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/md.c | 14 ++++++++++++++
drivers/md/md.h | 6 ++++++
2 files changed, 20 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0b167169fef9..f63721109d0e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -341,8 +341,12 @@ void mddev_suspend(struct mddev *mddev)
return;
synchronize_rcu();
wake_up(&mddev->sb_wait);
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
+ smp_mb__after_atomic();
wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
mddev->pers->quiesce(mddev, 1);
+ clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
+ wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
del_timer_sync(&mddev->safemode_timer);
}
@@ -8790,6 +8794,16 @@ void md_check_recovery(struct mddev *mddev)
unlock:
wake_up(&mddev->sb_wait);
mddev_unlock(mddev);
+ } else if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
+ /* Write superblock - thread that called mddev_suspend()
+ * holds reconfig_mutex for us.
+ */
+ set_bit(MD_UPDATING_SB, &mddev->flags);
+ smp_mb__after_atomic();
+ if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags))
+ md_update_sb(mddev, 0);
+ clear_bit_unlock(MD_UPDATING_SB, &mddev->flags);
+ wake_up(&mddev->sb_wait);
}
}
EXPORT_SYMBOL(md_check_recovery);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 09db03455801..8ab5e1bcb10f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -236,6 +236,12 @@ enum mddev_flags {
* never cause the array to become failed.
*/
MD_HAS_PPL, /* The raid array has PPL feature set */
+ MD_ALLOW_SB_UPDATE, /* md_check_recovery is allowed to update
+ * the metadata without taking reconfig_mutex.
+ */
+ MD_UPDATING_SB, /* md_check_recovery is updating the metadata
+ * without explicitly holding reconfig_mutex.
+ */
};
enum mddev_sb_flags {
* [PATCH 2/4] md: don't call bitmap_create() while array is quiesced.
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
` (2 preceding siblings ...)
2017-09-12 1:49 ` [PATCH 4/4] md: allow metadata update while suspending NeilBrown
@ 2017-09-12 1:49 ` NeilBrown
2017-09-12 2:51 ` [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without Xiao Ni
` (2 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-09-12 1:49 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
bitmap_create() allocates memory with GFP_KERNEL and
so can wait for IO.
If called while the array is quiesced, it could wait indefinitely
for writeout to the array, causing a deadlock.
So call bitmap_create() before quiescing the array.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/md.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 675c9e6495e4..d6d491243411 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6591,22 +6591,26 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
return -ENOENT; /* cannot remove what isn't there */
err = 0;
if (mddev->pers) {
- mddev->pers->quiesce(mddev, 1);
if (fd >= 0) {
struct bitmap *bitmap;
bitmap = bitmap_create(mddev, -1);
+ mddev->pers->quiesce(mddev, 1);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
err = bitmap_load(mddev);
} else
err = PTR_ERR(bitmap);
- }
- if (fd < 0 || err) {
+ if (err) {
+ bitmap_destroy(mddev);
+ fd = -1;
+ }
+ mddev->pers->quiesce(mddev, 0);
+ } else if (fd < 0) {
+ mddev->pers->quiesce(mddev, 1);
bitmap_destroy(mddev);
- fd = -1; /* make sure to put the file */
+ mddev->pers->quiesce(mddev, 0);
}
- mddev->pers->quiesce(mddev, 0);
}
if (fd < 0) {
struct file *f = mddev->bitmap_info.file;
@@ -6890,8 +6894,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.default_offset;
mddev->bitmap_info.space =
mddev->bitmap_info.default_space;
- mddev->pers->quiesce(mddev, 1);
bitmap = bitmap_create(mddev, -1);
+ mddev->pers->quiesce(mddev, 1);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
rv = bitmap_load(mddev);
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
` (3 preceding siblings ...)
2017-09-12 1:49 ` [PATCH 2/4] md: don't call bitmap_create() while array is quiesced NeilBrown
@ 2017-09-12 2:51 ` Xiao Ni
2017-09-13 2:11 ` Xiao Ni
2017-09-30 9:46 ` Xiao Ni
6 siblings, 0 replies; 27+ messages in thread
From: Xiao Ni @ 2017-09-12 2:51 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Tuesday, September 12, 2017 9:49:12 AM
> Subject: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
> Hi,
> I looked again at the previous patch I posted which tried to make
> md_update_sb() safe without taking reconfig_mutex, and realized that
> it had serious problems, particularly around devices being added or
> removed while the update was happening.
>
> So I decided to try a different approach, which is embodied in these
> patches. The md thread is now explicitly allowed to call
> md_update_sb() while some other thread holds the lock and
> waits for mddev_suspend() to complete.
>
> Please test these and confirm that they still address the problem you
> found.
Hi Neil
Thanks for the patches. I'll report the test results.
Regards
Xiao
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
` (4 preceding siblings ...)
2017-09-12 2:51 ` [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without Xiao Ni
@ 2017-09-13 2:11 ` Xiao Ni
2017-09-13 15:09 ` Xiao Ni
2017-09-30 9:46 ` Xiao Ni
6 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-09-13 2:11 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Tuesday, September 12, 2017 9:49:12 AM
> Subject: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
> Hi,
> I looked again at the previous patch I posted which tried to make
> md_update_sb() safe without taking reconfig_mutex, and realized that
> it had serious problems, particularly around devices being added or
> removed while the update was happening.
>
> So I decided to try a different approach, which is embodied in these
> patches. The md thread is now explicitly allowed to call
> md_update_sb() while some other thread holds the lock and
> waits for mddev_suspend() to complete.
>
> Please test these and confirm that they still address the problem you
> found.
Hi Neil
The test has been running for more than 24 hours and the problem hasn't appeared.
The patches fix this bug.
Best Regards
Xiao
>
> Thanks,
> NeilBrown
>
> ---
>
> NeilBrown (4):
> md: always hold reconfig_mutex when calling mddev_suspend()
> md: don't call bitmap_create() while array is quiesced.
> md: use mddev_suspend/resume instead of ->quiesce()
> md: allow metadata update while suspending.
>
>
> drivers/md/dm-raid.c | 5 ++++-
> drivers/md/md.c | 45
> ++++++++++++++++++++++++++++++++-------------
> drivers/md/md.h | 6 ++++++
> drivers/md/raid5-cache.c | 2 ++
> 4 files changed, 44 insertions(+), 14 deletions(-)
>
> --
> Signature
>
>
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-13 2:11 ` Xiao Ni
@ 2017-09-13 15:09 ` Xiao Ni
2017-09-13 23:05 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-09-13 15:09 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "Xiao Ni" <xni@redhat.com>
> To: "NeilBrown" <neilb@suse.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Wednesday, September 13, 2017 10:11:50 AM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
>
>
> ----- Original Message -----
> > From: "NeilBrown" <neilb@suse.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: linux-raid@vger.kernel.org
> > Sent: Tuesday, September 12, 2017 9:49:12 AM
> > Subject: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata
> > without
> >
> > Hi,
> > I looked again at the previous patch I posted which tried to make
> > md_update_sb() safe without taking reconfig_mutex, and realized that
> > it had serious problems, particularly around devices being added or
> > removed while the update was happening.
> >
> > So I decided to try a different approach, which is embodied in these
> > patches. The md thread is now explicitly allowed to call
> > md_update_sb() while some other thread holds the lock and
> > waits for mddev_suspend() to complete.
> >
> > Please test these and confirm that they still address the problem you
> > found.
>
> Hi Neil
>
> The test have been running for more than 24 hours. The problem doesn't
> appear.
> The patches can fix this bug.
>
Hi Neil
Sorry for the bad news. The test is still running and it's stuck again.
Regards
Xiao
> >
> > Thanks,
> > NeilBrown
> >
> > ---
> >
> > NeilBrown (4):
> > md: always hold reconfig_mutex when calling mddev_suspend()
> > md: don't call bitmap_create() while array is quiesced.
> > md: use mddev_suspend/resume instead of ->quiesce()
> > md: allow metadata update while suspending.
> >
> >
> > drivers/md/dm-raid.c | 5 ++++-
> > drivers/md/md.c | 45
> > ++++++++++++++++++++++++++++++++-------------
> > drivers/md/md.h | 6 ++++++
> > drivers/md/raid5-cache.c | 2 ++
> > 4 files changed, 44 insertions(+), 14 deletions(-)
> >
> > --
> > Signature
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-13 15:09 ` Xiao Ni
@ 2017-09-13 23:05 ` NeilBrown
2017-09-14 4:55 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-09-13 23:05 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
On Wed, Sep 13 2017, Xiao Ni wrote:
>
> Hi Neil
>
> Sorry for the bad news. The test is still running and it's stuck again.
Any details? Anything at all? Just a little hint maybe?
Just saying "it's stuck again" is very nearly useless.
Thanks,
NeilBrown
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-13 23:05 ` NeilBrown
@ 2017-09-14 4:55 ` Xiao Ni
2017-09-14 5:32 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-09-14 4:55 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Thursday, September 14, 2017 7:05:20 AM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
> On Wed, Sep 13 2017, Xiao Ni wrote:
> >
> > Hi Neil
> >
> > Sorry for the bad news. The test is still running and it's stuck again.
>
> Any details? Anything at all? Just a little hint maybe?
>
> Just saying "it's stuck again" is very nearly useless.
>
Hi Neil
It doesn't show any useful information in /var/log/messages.
echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
There aren't any messages from that either.
It looks like another problem.
[root@dell-pr1700-02 ~]# ps auxf | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 8381 0.0 0.0 0 0 ? D Sep13 0:00 \_ [kworker/u8:1]
root 8966 0.0 0.0 0 0 ? D Sep13 0:00 \_ [jbd2/md0-8]
root 824 0.0 0.1 216856 8492 ? Ss Sep03 0:06 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
root 836 0.0 0.0 195052 3200 ? Ssl Sep03 0:00 /usr/sbin/gssproxy -D
root 1225 0.0 0.0 106008 7436 ? Ss Sep03 0:00 /usr/sbin/sshd -D
root 12411 0.0 0.0 112672 2264 pts/0 S+ 00:50 0:00 \_ grep --color=auto D
root 8987 0.0 0.0 109000 2728 pts/2 D+ Sep13 0:04 \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
root 8983 0.0 0.0 7116 2080 ? Ds Sep13 0:00 /usr/sbin/mdadm --grow --continue /dev/md0
[root@dell-pr1700-02 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
[>....................] reshape = 0.0% (1/509952) finish=1059.5min speed=7K/sec
unused devices: <none>
It looks like the reshape hasn't started. This time I didn't add code to check
the state of mddev->suspended and active_stripes; I just applied the patches
to the source. Do you have any suggestions for what else to check?
Best Regards
Xiao
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-14 4:55 ` Xiao Ni
@ 2017-09-14 5:32 ` NeilBrown
2017-09-14 7:57 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-09-14 5:32 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
On Thu, Sep 14 2017, Xiao Ni wrote:
> ----- Original Message -----
>> From: "NeilBrown" <neilb@suse.com>
>> To: "Xiao Ni" <xni@redhat.com>
>> Cc: linux-raid@vger.kernel.org
>> Sent: Thursday, September 14, 2017 7:05:20 AM
>> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>>
>> On Wed, Sep 13 2017, Xiao Ni wrote:
>> >
>> > Hi Neil
>> >
>> > Sorry for the bad news. The test is still running and it's stuck again.
>>
>> Any details? Anything at all? Just a little hint maybe?
>>
>> Just saying "it's stuck again" is very nearly useless.
>>
> Hi Neil
>
> It doesn't show any useful information in /var/log/messages
>
> echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> There aren't any messages too.
>
> It looks like another problem.
>
> [root@dell-pr1700-02 ~]# ps auxf | grep D
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 8381 0.0 0.0 0 0 ? D Sep13 0:00 \_ [kworker/u8:1]
> root 8966 0.0 0.0 0 0 ? D Sep13 0:00 \_ [jbd2/md0-8]
> root 824 0.0 0.1 216856 8492 ? Ss Sep03 0:06 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
> root 836 0.0 0.0 195052 3200 ? Ssl Sep03 0:00 /usr/sbin/gssproxy -D
> root 1225 0.0 0.0 106008 7436 ? Ss Sep03 0:00 /usr/sbin/sshd -D
> root 12411 0.0 0.0 112672 2264 pts/0 S+ 00:50 0:00 \_ grep --color=auto D
> root 8987 0.0 0.0 109000 2728 pts/2 D+ Sep13 0:04 \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> root 8983 0.0 0.0 7116 2080 ? Ds Sep13 0:00 /usr/sbin/mdadm --grow --continue /dev/md0
>
> [root@dell-pr1700-02 ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
> 2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
> [>....................] reshape = 0.0% (1/509952) finish=1059.5min speed=7K/sec
>
> unused devices: <none>
>
>
> It looks like the reshape doesn't start. This time I didn't add the codes to check
> the information of mddev->suspended and active_stripes. I just added the patches
> to source codes. Do you have other suggestions to check more things?
>
> Best Regards
> Xiao
What do
cat /proc/8987/stack
cat /proc/8983/stack
cat /proc/8966/stack
cat /proc/8381/stack
show??
NeilBrown
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-14 5:32 ` NeilBrown
@ 2017-09-14 7:57 ` Xiao Ni
2017-09-16 13:15 ` Xiao Ni
2017-10-05 5:17 ` NeilBrown
0 siblings, 2 replies; 27+ messages in thread
From: Xiao Ni @ 2017-09-14 7:57 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Thursday, September 14, 2017 1:32:02 PM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
> On Thu, Sep 14 2017, Xiao Ni wrote:
>
> > ----- Original Message -----
> >> From: "NeilBrown" <neilb@suse.com>
> >> To: "Xiao Ni" <xni@redhat.com>
> >> Cc: linux-raid@vger.kernel.org
> >> Sent: Thursday, September 14, 2017 7:05:20 AM
> >> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata
> >> without
> >>
> >> On Wed, Sep 13 2017, Xiao Ni wrote:
> >> >
> >> > Hi Neil
> >> >
> >> > Sorry for the bad news. The test is still running and it's stuck again.
> >>
> >> Any details? Anything at all? Just a little hint maybe?
> >>
> >> Just saying "it's stuck again" is very nearly useless.
> >>
> > Hi Neil
> >
> > It doesn't show any useful information in /var/log/messages
> >
> > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> > There aren't any messages too.
> >
> > It looks like another problem.
> >
> > [root@dell-pr1700-02 ~]# ps auxf | grep D
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > root 8381 0.0 0.0 0 0 ? D Sep13 0:00 \_
> > [kworker/u8:1]
> > root 8966 0.0 0.0 0 0 ? D Sep13 0:00 \_
> > [jbd2/md0-8]
> > root 824 0.0 0.1 216856 8492 ? Ss Sep03 0:06
> > /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible
> > recursive locking detected ernel BUG at list_del corruption list_add
> > corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral
> > protection fault nable to handle kernel ouble fault: RTNL: assertion
> > failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG
> > ysctl table check failed : nobody cared IRQ handler type mismatch Machine
> > Check Exception: Machine check events logged divide error: bounds:
> > coprocessor segment overrun: invalid TSS: segment not present: invalid
> > opcode: alignment check: stack segment: fpu exception: simd exception:
> > iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
> > root 836 0.0 0.0 195052 3200 ? Ssl Sep03 0:00
> > /usr/sbin/gssproxy -D
> > root 1225 0.0 0.0 106008 7436 ? Ss Sep03 0:00
> > /usr/sbin/sshd -D
> > root 12411 0.0 0.0 112672 2264 pts/0 S+ 00:50 0:00
> > \_ grep --color=auto D
> > root 8987 0.0 0.0 109000 2728 pts/2 D+ Sep13 0:04
> > \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> > root 8983 0.0 0.0 7116 2080 ? Ds Sep13 0:00
> > /usr/sbin/mdadm --grow --continue /dev/md0
> >
> > [root@dell-pr1700-02 ~]# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1]
> > loop0[0]
> > 2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6]
> > [UUUUUU]
> > [>....................] reshape = 0.0% (1/509952) finish=1059.5min
> > speed=7K/sec
> >
> > unused devices: <none>
> >
> >
> > It looks like the reshape doesn't start. This time I didn't add the codes
> > to check
> > the information of mddev->suspended and active_stripes. I just added the
> > patches
> > to source codes. Do you have other suggestions to check more things?
> >
> > Best Regards
> > Xiao
>
> What do
> cat /proc/8987/stack
> cat /proc/8983/stack
> cat /proc/8966/stack
> cat /proc/8381/stack
>
> show??
dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
[root@dell-pr1700-02 ~]# cat /proc/8987/stack
[<ffffffff810d4ea6>] io_schedule+0x16/0x40
[<ffffffff811c66ae>] __lock_page+0x10e/0x160
[<ffffffffa09b4ef0>] mpage_prepare_extent_to_map+0x290/0x310 [ext4]
[<ffffffffa09ba007>] ext4_writepages+0x467/0xe80 [ext4]
[<ffffffff811d6bec>] do_writepages+0x1c/0x70
[<ffffffff811c7c66>] __filemap_fdatawrite_range+0xc6/0x100
[<ffffffff811c7d6c>] filemap_flush+0x1c/0x20
[<ffffffffa09b757c>] ext4_alloc_da_blocks+0x2c/0x70 [ext4]
[<ffffffffa09a89a9>] ext4_release_file+0x79/0xc0 [ext4]
[<ffffffff81263d67>] __fput+0xe7/0x210
[<ffffffff81263ece>] ____fput+0xe/0x10
[<ffffffff810c59c3>] task_work_run+0x83/0xb0
[<ffffffff81003d64>] exit_to_usermode_loop+0x6c/0xa8
[<ffffffff8100389a>] do_syscall_64+0x13a/0x150
[<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
/usr/sbin/mdadm --grow --continue /dev/md0. Is this the reason for adding lockdep_assert_held(&mddev->reconfig_mutex)?
[root@dell-pr1700-02 ~]# cat /proc/8983/stack
[<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
[<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
[<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
[<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
[<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
[<ffffffff81260457>] __vfs_write+0x37/0x170
[<ffffffff812619e2>] vfs_write+0xb2/0x1b0
[<ffffffff81263015>] SyS_write+0x55/0xc0
[<ffffffff810037c7>] do_syscall_64+0x67/0x150
[<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
[jbd2/md0-8]
[root@dell-pr1700-02 ~]# cat /proc/8966/stack
[<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
[<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
[<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
[<ffffffff81376427>] generic_make_request+0x117/0x2f0
[<ffffffff81376675>] submit_bio+0x75/0x150
[<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
[<ffffffff8129e683>] submit_bh+0x13/0x20
[<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
[<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
[<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
[<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
[<ffffffff810c73f9>] kthread+0x109/0x140
[<ffffffff817776c5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
[kworker/u8:1]
[root@dell-pr1700-02 ~]# cat /proc/8381/stack
[<ffffffffa0a34131>] md_make_request+0xb1/0x260 [md_mod]
[<ffffffff81376427>] generic_make_request+0x117/0x2f0
[<ffffffff81376675>] submit_bio+0x75/0x150
[<ffffffffa09d421c>] ext4_io_submit+0x4c/0x60 [ext4]
[<ffffffffa09d43f4>] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
[<ffffffffa09b44f7>] mpage_submit_page+0x57/0x70 [ext4]
[<ffffffffa09b4778>] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
[<ffffffffa09ba3f2>] ext4_writepages+0x852/0xe80 [ext4]
[<ffffffff811d6bec>] do_writepages+0x1c/0x70
[<ffffffff81293895>] __writeback_single_inode+0x45/0x320
[<ffffffff812940c0>] writeback_sb_inodes+0x280/0x570
[<ffffffff8129443c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812946e6>] wb_writeback+0x276/0x310
[<ffffffff81294f9c>] wb_workfn+0x19c/0x3b0
[<ffffffff810c0ff9>] process_one_work+0x149/0x360
[<ffffffff810c177d>] worker_thread+0x4d/0x3c0
[<ffffffff810c73f9>] kthread+0x109/0x140
[<ffffffff817776c5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
If these don't give useful hints, I can add more debug output and run the test again.
Best Regards
Xiao
>
> NeilBrown
>
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-14 7:57 ` Xiao Ni
@ 2017-09-16 13:15 ` Xiao Ni
2017-10-05 5:17 ` NeilBrown
1 sibling, 0 replies; 27+ messages in thread
From: Xiao Ni @ 2017-09-16 13:15 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
----- Original Message -----
> From: "Xiao Ni" <xni@redhat.com>
> To: "NeilBrown" <neilb@suse.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Thursday, September 14, 2017 3:57:21 PM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
>
>
> ----- Original Message -----
> > From: "NeilBrown" <neilb@suse.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: linux-raid@vger.kernel.org
> > Sent: Thursday, September 14, 2017 1:32:02 PM
> > Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata
> > without
> >
> > On Thu, Sep 14 2017, Xiao Ni wrote:
> >
> > > ----- Original Message -----
> > >> From: "NeilBrown" <neilb@suse.com>
> > >> To: "Xiao Ni" <xni@redhat.com>
> > >> Cc: linux-raid@vger.kernel.org
> > >> Sent: Thursday, September 14, 2017 7:05:20 AM
> > >> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with
> > >> metadata
> > >> without
> > >>
> > >> On Wed, Sep 13 2017, Xiao Ni wrote:
> > >> >
> > >> > Hi Neil
> > >> >
> > >> > Sorry for the bad news. The test is still running and it's stuck
> > >> > again.
> > >>
> > >> Any details? Anything at all? Just a little hint maybe?
> > >>
> > >> Just saying "it's stuck again" is very nearly useless.
> > >>
> > > Hi Neil
> > >
> > > It doesn't show any useful information in /var/log/messages
> > >
> > > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> > > There aren't any messages either.
> > >
> > > It looks like another problem.
> > >
> > > [root@dell-pr1700-02 ~]# ps auxf | grep D
> > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > > root 8381 0.0 0.0 0 0 ? D Sep13 0:00 \_
> > > [kworker/u8:1]
> > > root 8966 0.0 0.0 0 0 ? D Sep13 0:00 \_
> > > [jbd2/md0-8]
> > > root 824 0.0 0.1 216856 8492 ? Ss Sep03 0:06
> > > /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible
> > > recursive locking detected ernel BUG at list_del corruption list_add
> > > corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral
> > > protection fault nable to handle kernel ouble fault: RTNL: assertion
> > > failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG
> > > ysctl table check failed : nobody cared IRQ handler type mismatch Machine
> > > Check Exception: Machine check events logged divide error: bounds:
> > > coprocessor segment overrun: invalid TSS: segment not present: invalid
> > > opcode: alignment check: stack segment: fpu exception: simd exception:
> > > iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
> > > root 836 0.0 0.0 195052 3200 ? Ssl Sep03 0:00
> > > /usr/sbin/gssproxy -D
> > > root 1225 0.0 0.0 106008 7436 ? Ss Sep03 0:00
> > > /usr/sbin/sshd -D
> > > root 12411 0.0 0.0 112672 2264 pts/0 S+ 00:50 0:00
> > > \_ grep --color=auto D
> > > root 8987 0.0 0.0 109000 2728 pts/2 D+ Sep13 0:04
> > > \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> > > root 8983 0.0 0.0 7116 2080 ? Ds Sep13 0:00
> > > /usr/sbin/mdadm --grow --continue /dev/md0
> > >
> > > [root@dell-pr1700-02 ~]# cat /proc/mdstat
> > > Personalities : [raid6] [raid5] [raid4]
> > > md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2]
> > > loop1[1]
> > > loop0[0]
> > > 2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6]
> > > [UUUUUU]
> > > [>....................] reshape = 0.0% (1/509952)
> > > finish=1059.5min
> > > speed=7K/sec
> > >
> > > unused devices: <none>
> > >
> > >
> > > It looks like the reshape doesn't start. This time I didn't add the code
> > > to check the information of mddev->suspended and active_stripes. I just
> > > applied the patches to the source. Do you have other suggestions for what
> > > to check?
> > >
> > > Best Regards
> > > Xiao
> >
> > What do
> > cat /proc/8987/stack
> > cat /proc/8983/stack
> > cat /proc/8966/stack
> > cat /proc/8381/stack
> >
> > show??
>
> dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
>
> [root@dell-pr1700-02 ~]# cat /proc/8987/stack
> [<ffffffff810d4ea6>] io_schedule+0x16/0x40
> [<ffffffff811c66ae>] __lock_page+0x10e/0x160
> [<ffffffffa09b4ef0>] mpage_prepare_extent_to_map+0x290/0x310 [ext4]
> [<ffffffffa09ba007>] ext4_writepages+0x467/0xe80 [ext4]
> [<ffffffff811d6bec>] do_writepages+0x1c/0x70
> [<ffffffff811c7c66>] __filemap_fdatawrite_range+0xc6/0x100
> [<ffffffff811c7d6c>] filemap_flush+0x1c/0x20
> [<ffffffffa09b757c>] ext4_alloc_da_blocks+0x2c/0x70 [ext4]
> [<ffffffffa09a89a9>] ext4_release_file+0x79/0xc0 [ext4]
> [<ffffffff81263d67>] __fput+0xe7/0x210
> [<ffffffff81263ece>] ____fput+0xe/0x10
> [<ffffffff810c59c3>] task_work_run+0x83/0xb0
> [<ffffffff81003d64>] exit_to_usermode_loop+0x6c/0xa8
> [<ffffffff8100389a>] do_syscall_64+0x13a/0x150
> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> /usr/sbin/mdadm --grow --continue /dev/md0. Is this the reason to add
> lockdep_assert_held(&mddev->reconfig_mutex)?
> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
> [<ffffffff81260457>] __vfs_write+0x37/0x170
> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
> [<ffffffff81263015>] SyS_write+0x55/0xc0
> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [jbd2/md0-8]
> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> [<ffffffff81376675>] submit_bio+0x75/0x150
> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
> [<ffffffff8129e683>] submit_bh+0x13/0x20
> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
> [<ffffffff810c73f9>] kthread+0x109/0x140
> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [kworker/u8:1]
> [root@dell-pr1700-02 ~]# cat /proc/8381/stack
> [<ffffffffa0a34131>] md_make_request+0xb1/0x260 [md_mod]
> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> [<ffffffff81376675>] submit_bio+0x75/0x150
> [<ffffffffa09d421c>] ext4_io_submit+0x4c/0x60 [ext4]
> [<ffffffffa09d43f4>] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
> [<ffffffffa09b44f7>] mpage_submit_page+0x57/0x70 [ext4]
> [<ffffffffa09b4778>] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
> [<ffffffffa09ba3f2>] ext4_writepages+0x852/0xe80 [ext4]
> [<ffffffff811d6bec>] do_writepages+0x1c/0x70
> [<ffffffff81293895>] __writeback_single_inode+0x45/0x320
> [<ffffffff812940c0>] writeback_sb_inodes+0x280/0x570
> [<ffffffff8129443c>] __writeback_inodes_wb+0x8c/0xc0
> [<ffffffff812946e6>] wb_writeback+0x276/0x310
> [<ffffffff81294f9c>] wb_workfn+0x19c/0x3b0
> [<ffffffff810c0ff9>] process_one_work+0x149/0x360
> [<ffffffff810c177d>] worker_thread+0x4d/0x3c0
> [<ffffffff810c73f9>] kthread+0x109/0x140
> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> If these don't give useful hints, I can try to print more information and run
> the test again.
Hi Neil
I added some code to print some information.
[13404.528231] mddev->suspended : 1
[13404.531170] mddev->active_io : 1
[13404.533774] conf->quiesce 0
MD_SB_CHANGE_PENDING of mddev->sb_flags is not set
MD_UPDATING_SB of mddev->flags is not set
It's stuck in mddev_suspend() at
    wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
and in md_write_start() at
    wait_event(mddev->sb_wait,
               !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
Best Regards
Xiao
>
> Best Regards
> Xiao
> >
> > NeilBrown
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-14 7:57 ` Xiao Ni
2017-09-16 13:15 ` Xiao Ni
@ 2017-10-05 5:17 ` NeilBrown
2017-10-06 3:53 ` Xiao Ni
1 sibling, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-05 5:17 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2418 bytes --]
On Thu, Sep 14 2017, Xiao Ni wrote:
>>
>> What do
>> cat /proc/8987/stack
>> cat /proc/8983/stack
>> cat /proc/8966/stack
>> cat /proc/8381/stack
>>
>> show??
>
...
>
> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add lockdep_assert_held(&mddev->reconfig_mutex)?
> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
> [<ffffffff81260457>] __vfs_write+0x37/0x170
> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
> [<ffffffff81263015>] SyS_write+0x55/0xc0
> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [jbd2/md0-8]
> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> [<ffffffff81376675>] submit_bio+0x75/0x150
> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
> [<ffffffff8129e683>] submit_bh+0x13/0x20
> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
> [<ffffffff810c73f9>] kthread+0x109/0x140
> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
Thanks for this (and sorry it took so long to get to it).
It looks like
Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()")
is badly broken. I wonder how it ever passed testing.
In md_write_start() it changed the wait_event() call to
wait_event(mddev->sb_wait,
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
That should be
wait_event(mddev->sb_wait,
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) || mddev->suspended);
i.e. it was (!A && !B), it should be (!A || B) !!!!!
Could you please make that change and try again.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-05 5:17 ` NeilBrown
@ 2017-10-06 3:53 ` Xiao Ni
2017-10-06 4:32 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-10-06 3:53 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 10/05/2017 01:17 PM, NeilBrown wrote:
> On Thu, Sep 14 2017, Xiao Ni wrote:
>
>>> What do
>>> cat /proc/8987/stack
>>> cat /proc/8983/stack
>>> cat /proc/8966/stack
>>> cat /proc/8381/stack
>>>
>>> show??
> ...
>
>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add lockdep_assert_held(&mddev->reconfig_mutex)?
>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
>> [<ffffffff81260457>] __vfs_write+0x37/0x170
>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
>> [<ffffffff81263015>] SyS_write+0x55/0xc0
>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> [jbd2/md0-8]
>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
>> [<ffffffff81376675>] submit_bio+0x75/0x150
>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
>> [<ffffffff8129e683>] submit_bh+0x13/0x20
>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
>> [<ffffffff810c73f9>] kthread+0x109/0x140
>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
> Thanks for this (and sorry it took so long to get to it).
> It looks like
>
> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()")
>
> is badly broken. I wonder how it ever passed testing.
>
> In md_write_start() it changed the wait_event() call to
>
> wait_event(mddev->sb_wait,
> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
>
>
> That should be
>
> wait_event(mddev->sb_wait,
> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) || mddev->suspended);
Hi Neil
Do we want write bios to be handled when mddev->suspended is 1? After
this change, write bios can be handled while mddev->suspended is 1.
When the hang happens, mddev->suspended is 0 and MD_SB_CHANGE_PENDING
is set, so the patch can't fix this problem. I tried the patch and the
problem still exists.
[ 7710.589274] mddev suspend : 0
[ 7710.592228] mddev ro : 0
[ 7710.594746] mddev insync : 0
[ 7710.597620] mddev SB CHANGE PENDING is set
[ 7710.601698] mddev SB CHANGE CLEAN is set
[ 7710.605601] mddev->persistent : 1
[ 7710.608905] mddev->external : 0
[ 7710.612030] conf quiesce : 2
raid5 is still spinning.
Hmm, I have a question. Why can't we call md_check_recovery() when
MD_SB_CHANGE_PENDING is set in raid5d?
if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
spin_unlock_irq(&conf->device_lock);
md_check_recovery(mddev);
spin_lock_irq(&conf->device_lock);
}
Best Regards
Xiao
>
> i.e. it was (!A && !B), it should be (!A || B) !!!!!
>
> Could you please make that change and try again.
Hi Neil
I tried the patch and it doesn't work.
>
> Thanks,
> NeilBrown
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-06 3:53 ` Xiao Ni
@ 2017-10-06 4:32 ` NeilBrown
2017-10-09 1:21 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-06 4:32 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 4261 bytes --]
On Fri, Oct 06 2017, Xiao Ni wrote:
> On 10/05/2017 01:17 PM, NeilBrown wrote:
>> On Thu, Sep 14 2017, Xiao Ni wrote:
>>
>>>> What do
>>>> cat /proc/8987/stack
>>>> cat /proc/8983/stack
>>>> cat /proc/8966/stack
>>>> cat /proc/8381/stack
>>>>
>>>> show??
>> ...
>>
>>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add lockdep_assert_held(&mddev->reconfig_mutex)?
>>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
>>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
>>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
>>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
>>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
>>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
>>> [<ffffffff81260457>] __vfs_write+0x37/0x170
>>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
>>> [<ffffffff81263015>] SyS_write+0x55/0xc0
>>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
>>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> [jbd2/md0-8]
>>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
>>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
>>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
>>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
>>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
>>> [<ffffffff81376675>] submit_bio+0x75/0x150
>>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
>>> [<ffffffff8129e683>] submit_bh+0x13/0x20
>>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
>>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
>>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
>>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
>>> [<ffffffff810c73f9>] kthread+0x109/0x140
>>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>> Thanks for this (and sorry it took so long to get to it).
>> It looks like
>>
>> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()")
>>
>> is badly broken. I wonder how it ever passed testing.
>>
>> In md_write_start() it changed the wait_event() call to
>>
>> wait_event(mddev->sb_wait,
>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
>>
>>
>> That should be
>>
>> wait_event(mddev->sb_wait,
>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) || mddev->suspended);
> Hi Neil
>
> Do we want write bio can be handled when mddev->suspended is 1? After
> changing to this,
> write bio can be handled when mddev->suspended is 1.
This is OK.
New write bios will not get past md_handle_request().
A write bio that did get past md_handle_request() is still allowed
through md_write_start(). The mddev_suspend() call won't complete until
that write bio has finished.
>
> When the stuck happens, mddev->suspended is 0 and MD_SB_CHANGE_PENDING
> is set. So
> the patch can't fix this problem. I tried the patch, the problem still
> exists.
>
I need to see all the stack traces.
> [ 7710.589274] mddev suspend : 0
> [ 7710.592228] mddev ro : 0
> [ 7710.594746] mddev insync : 0
> [ 7710.597620] mddev SB CHANGE PENDING is set
> [ 7710.601698] mddev SB CHANGE CLEAN is set
> [ 7710.605601] mddev->persistent : 1
> [ 7710.608905] mddev->external : 0
> [ 7710.612030] conf quiesce : 2
>
> raid5 is still spinning.
>
> Hmm, I have a question. Why can't call md_check_recovery when
> MD_SB_CHANGE_PENDING
> is set in raid5d?
When MD_SB_CHANGE_PENDING is not set, there is no need to call
md_check_recovery(). It wouldn't hurt except that it would be a waste of
time.
NeilBrown
>
> if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
> spin_unlock_irq(&conf->device_lock);
> md_check_recovery(mddev);
> spin_lock_irq(&conf->device_lock);
> }
>
> Best Regards
> Xiao
>
>
>>
>> i.e. it was (!A && !B), it should be (!A || B) !!!!!
>>
>> Could you please make that change and try again.
> Hi Neil
>
> I tried the patch and it can't work.
>>
>> Thanks,
>> NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-06 4:32 ` NeilBrown
@ 2017-10-09 1:21 ` Xiao Ni
2017-10-09 4:57 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-10-09 1:21 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 5911 bytes --]
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Friday, October 6, 2017 12:32:19 PM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>
> On Fri, Oct 06 2017, Xiao Ni wrote:
>
> > On 10/05/2017 01:17 PM, NeilBrown wrote:
> >> On Thu, Sep 14 2017, Xiao Ni wrote:
> >>
> >>>> What do
> >>>> cat /proc/8987/stack
> >>>> cat /proc/8983/stack
> >>>> cat /proc/8966/stack
> >>>> cat /proc/8381/stack
> >>>>
> >>>> show??
> >> ...
> >>
> >>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add
> >>> lockdep_assert_held(&mddev->reconfig_mutex)?
> >>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
> >>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
> >>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
> >>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
> >>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
> >>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
> >>> [<ffffffff81260457>] __vfs_write+0x37/0x170
> >>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
> >>> [<ffffffff81263015>] SyS_write+0x55/0xc0
> >>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
> >>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> >>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>
> >>> [jbd2/md0-8]
> >>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
> >>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
> >>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
> >>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
> >>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> >>> [<ffffffff81376675>] submit_bio+0x75/0x150
> >>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
> >>> [<ffffffff8129e683>] submit_bh+0x13/0x20
> >>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
> >>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
> >>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
> >>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
> >>> [<ffffffff810c73f9>] kthread+0x109/0x140
> >>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> >>> [<ffffffffffffffff>] 0xffffffffffffffff
> >> Thanks for this (and sorry it took so long to get to it).
> >> It looks like
> >>
> >> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and
> >> md_write_start()")
> >>
> >> is badly broken. I wonder how it ever passed testing.
> >>
> >> In md_write_start() it changed the wait_event() call to
> >>
> >> wait_event(mddev->sb_wait,
> >> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
> >> !mddev->suspended);
> >>
> >>
> >> That should be
> >>
> >> wait_event(mddev->sb_wait,
> >> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
> >> mddev->suspended);
> > Hi Neil
> >
> > Do we want write bio can be handled when mddev->suspended is 1? After
> > changing to this,
> > write bio can be handled when mddev->suspended is 1.
>
> This is OK.
> New write bios will not get past md_handle_request().
> A write bios that did get past md_handle_request() is still allowed
> through md_write_start(). The mddev_suspend() call won't complete until
> that write bio has finished.
Hi Neil
Thanks for the explanation. I took some time to read the emails about
commit cc27b0c78 which introduced this. It's similar to the problem I
encountered, but there mddev_suspend() is called from level_store().
So adding the check of mddev->suspended in md_write_start() could fix the
problem "reshape raid5 -> raid6 atop bcache deadlocks at start on
md_attr_store / raid5_make_request".
In suspend_lo_store(), however, mddev_suspend() is not called under
mddev->reconfig_mutex, so there is still a race possibility, as you said
in your first analysis.
>
> >
> > When the stuck happens, mddev->suspended is 0 and MD_SB_CHANGE_PENDING
> > is set. So
> > the patch can't fix this problem. I tried the patch, the problem still
> > exists.
> >
>
>
> I need to see all the stack traces.
I've added the calltrace as an attachment.
>
>
> > [ 7710.589274] mddev suspend : 0
> > [ 7710.592228] mddev ro : 0
> > [ 7710.594746] mddev insync : 0
> > [ 7710.597620] mddev SB CHANGE PENDING is set
> > [ 7710.601698] mddev SB CHANGE CLEAN is set
> > [ 7710.605601] mddev->persistent : 1
> > [ 7710.608905] mddev->external : 0
> > [ 7710.612030] conf quiesce : 2
> >
> > raid5 is still spinning.
> >
> > Hmm, I have a question. Why can't call md_check_recovery when
> > MD_SB_CHANGE_PENDING
> > is set in raid5d?
>
> When MD_SB_CHANGE_PENDING is not set, there is no need to call
> md_check_recovery(). I wouldn't hurt except that it would be a waste of
> time.
I'm confused. If we want to call md_check_recovery() when MD_SB_CHANGE_PENDING
is set, it should be:
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6299,7 +6299,7 @@ static void raid5d(struct md_thread *thread)
 			break;
 		handled += batch_size;
-		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
+		if (mddev->sb_flags & (1 << MD_SB_CHANGE_PENDING)) {
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
Right?
Regards
Xiao
>
> NeilBrown
>
>
> >
> > if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
> > spin_unlock_irq(&conf->device_lock);
> > md_check_recovery(mddev);
> > spin_lock_irq(&conf->device_lock);
> > }
> >
> > Best Regards
> > Xiao
> >
> >
> >>
> >> i.e. it was (!A && !B), it should be (!A || B) !!!!!
> >>
> >> Could you please make that change and try again.
> > Hi Neil
> >
> > I tried the patch and it can't work.
> >>
> >> Thanks,
> >> NeilBrown
>
[-- Attachment #2: calltrace --]
[-- Type: application/octet-stream, Size: 20356 bytes --]
Oct 5 22:46:18 localhost kernel: INFO: task kworker/u8:3:2453 blocked for more than 120 seconds.
Oct 5 22:46:18 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:18 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:18 localhost kernel: kworker/u8:3 D 0 2453 2 0x00000080
Oct 5 22:46:18 localhost kernel: Workqueue: writeback wb_workfn (flush-9:0)
Oct 5 22:46:18 localhost kernel: Call Trace:
Oct 5 22:46:18 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:18 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:18 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:46:18 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:18 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:46:18 localhost kernel: ? bio_split+0x5d/0x90
Oct 5 22:46:18 localhost kernel: ? blk_queue_split+0xd2/0x630
Oct 5 22:46:18 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:18 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:46:18 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:46:18 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:46:18 localhost kernel: ? __test_set_page_writeback+0xc6/0x320
Oct 5 22:46:18 localhost kernel: ext4_io_submit+0x4c/0x60 [ext4]
Oct 5 22:46:18 localhost kernel: ext4_bio_write_page+0x1a4/0x3b0 [ext4]
Oct 5 22:46:18 localhost kernel: mpage_submit_page+0x57/0x70 [ext4]
Oct 5 22:46:18 localhost kernel: mpage_map_and_submit_buffers+0x168/0x290 [ext4]
Oct 5 22:46:18 localhost kernel: ext4_writepages+0x852/0xe80 [ext4]
Oct 5 22:46:18 localhost kernel: ? account_entity_enqueue+0xd8/0x100
Oct 5 22:46:18 localhost kernel: do_writepages+0x1c/0x70
Oct 5 22:46:18 localhost kernel: __writeback_single_inode+0x45/0x320
Oct 5 22:46:18 localhost kernel: writeback_sb_inodes+0x280/0x570
Oct 5 22:46:18 localhost kernel: __writeback_inodes_wb+0x8c/0xc0
Oct 5 22:46:18 localhost kernel: wb_writeback+0x276/0x310
Oct 5 22:46:18 localhost kernel: wb_workfn+0x19c/0x3b0
Oct 5 22:46:18 localhost kernel: process_one_work+0x149/0x360
Oct 5 22:46:18 localhost kernel: worker_thread+0x4d/0x3c0
Oct 5 22:46:18 localhost kernel: kthread+0x109/0x140
Oct 5 22:46:18 localhost kernel: ? rescuer_thread+0x380/0x380
Oct 5 22:46:18 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:46:18 localhost kernel: ? do_syscall_64+0x67/0x150
Oct 5 22:46:18 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:46:18 localhost kernel: INFO: task jbd2/md0-8:3740 blocked for more than 120 seconds.
Oct 5 22:46:18 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:18 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:18 localhost kernel: jbd2/md0-8 D 0 3740 2 0x00000080
Oct 5 22:46:18 localhost kernel: Call Trace:
Oct 5 22:46:18 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:18 localhost kernel: ? __enqueue_entity+0x6c/0x70
Oct 5 22:46:18 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:18 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:46:18 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:18 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:46:19 localhost kernel: ? find_next_bit+0xb/0x10
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:46:19 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:46:19 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:46:19 localhost kernel: ? select_idle_sibling+0x2e0/0x3b0
Oct 5 22:46:19 localhost kernel: submit_bh_wbc+0x140/0x170
Oct 5 22:46:19 localhost kernel: submit_bh+0x13/0x20
Oct 5 22:46:19 localhost kernel: jbd2_write_superblock+0x109/0x230 [jbd2]
Oct 5 22:46:19 localhost kernel: ? __enqueue_entity+0x6c/0x70
Oct 5 22:46:19 localhost kernel: ? enqueue_entity+0x1f3/0x720
Oct 5 22:46:19 localhost kernel: jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
Oct 5 22:46:19 localhost kernel: ? mutex_lock_io+0x25/0x30
Oct 5 22:46:19 localhost kernel: jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
Oct 5 22:46:19 localhost kernel: ? account_entity_dequeue+0xaa/0xe0
Oct 5 22:46:19 localhost kernel: ? dequeue_entity+0xed/0x460
Oct 5 22:46:19 localhost kernel: ? ttwu_do_activate+0x7a/0x90
Oct 5 22:46:19 localhost kernel: ? dequeue_task_fair+0x565/0x820
Oct 5 22:46:19 localhost kernel: ? __switch_to+0x229/0x440
Oct 5 22:46:19 localhost kernel: ? lock_timer_base+0x7d/0xa0
Oct 5 22:46:19 localhost kernel: ? try_to_del_timer_sync+0x53/0x80
Oct 5 22:46:19 localhost kernel: kjournald2+0xd2/0x260 [jbd2]
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: kthread+0x109/0x140
Oct 5 22:46:19 localhost kernel: ? commit_timeout+0x10/0x10 [jbd2]
Oct 5 22:46:19 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:46:19 localhost kernel: ? do_syscall_64+0x67/0x150
Oct 5 22:46:19 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:46:19 localhost kernel: INFO: task ext4lazyinit:3743 blocked for more than 120 seconds.
Oct 5 22:46:19 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:19 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:19 localhost kernel: ext4lazyinit D 0 3743 2 0x00000080
Oct 5 22:46:19 localhost kernel: Call Trace:
Oct 5 22:46:19 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:19 localhost kernel: ? mempool_alloc+0x6e/0x170
Oct 5 22:46:19 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:19 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:46:19 localhost kernel: ? bio_split+0x5d/0x90
Oct 5 22:46:19 localhost kernel: ? blk_queue_split+0xd2/0x630
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:46:19 localhost kernel: ? mempool_alloc_slab+0x15/0x20
Oct 5 22:46:19 localhost kernel: ? mempool_alloc+0x6e/0x170
Oct 5 22:46:19 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:46:19 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:46:19 localhost kernel: next_bio+0x38/0x40
Oct 5 22:46:19 localhost kernel: __blkdev_issue_zeroout+0x164/0x210
Oct 5 22:46:19 localhost kernel: blkdev_issue_zeroout+0x62/0xc0
Oct 5 22:46:19 localhost kernel: ext4_init_inode_table+0x166/0x380 [ext4]
Oct 5 22:46:19 localhost kernel: ext4_lazyinit_thread+0x2d3/0x350 [ext4]
Oct 5 22:46:19 localhost kernel: kthread+0x109/0x140
Oct 5 22:46:19 localhost kernel: ? ext4_clear_request_list+0x70/0x70 [ext4]
Oct 5 22:46:19 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:46:19 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:46:19 localhost kernel: INFO: task md0_reshape:3751 blocked for more than 120 seconds.
Oct 5 22:46:19 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:19 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:19 localhost kernel: md0_reshape D 0 3751 2 0x00000080
Oct 5 22:46:19 localhost kernel: Call Trace:
Oct 5 22:46:19 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:19 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:19 localhost kernel: raid5_sync_request+0x2cf/0x370 [raid456]
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: md_do_sync+0xafe/0xee0 [md_mod]
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: md_thread+0x132/0x180 [md_mod]
Oct 5 22:46:19 localhost kernel: kthread+0x109/0x140
Oct 5 22:46:19 localhost kernel: ? find_pers+0x70/0x70 [md_mod]
Oct 5 22:46:19 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:46:19 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:46:19 localhost kernel: INFO: task mdadm:3758 blocked for more than 120 seconds.
Oct 5 22:46:19 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:19 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:19 localhost kernel: mdadm D 0 3758 1 0x00000080
Oct 5 22:46:19 localhost kernel: Call Trace:
Oct 5 22:46:19 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:19 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:19 localhost kernel: raid5_quiesce+0x274/0x2b0 [raid456]
Oct 5 22:46:19 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:46:19 localhost kernel: suspend_lo_store+0x82/0xe0 [md_mod]
Oct 5 22:46:19 localhost kernel: md_attr_store+0x80/0xc0 [md_mod]
Oct 5 22:46:19 localhost kernel: sysfs_kf_write+0x3a/0x50
Oct 5 22:46:19 localhost kernel: kernfs_fop_write+0xff/0x180
Oct 5 22:46:19 localhost kernel: __vfs_write+0x37/0x170
Oct 5 22:46:19 localhost kernel: ? selinux_file_permission+0xe5/0x120
Oct 5 22:46:19 localhost kernel: ? security_file_permission+0x3b/0xc0
Oct 5 22:46:19 localhost kernel: vfs_write+0xb2/0x1b0
Oct 5 22:46:19 localhost kernel: ? syscall_trace_enter+0x1d0/0x2b0
Oct 5 22:46:19 localhost kernel: SyS_write+0x55/0xc0
Oct 5 22:46:19 localhost kernel: do_syscall_64+0x67/0x150
Oct 5 22:46:19 localhost kernel: entry_SYSCALL64_slow_path+0x25/0x25
Oct 5 22:46:19 localhost kernel: RIP: 0033:0x7ff6fa57c840
Oct 5 22:46:19 localhost kernel: RSP: 002b:00007ffe2fe145b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Oct 5 22:46:19 localhost kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007ff6fa57c840
Oct 5 22:46:19 localhost kernel: RDX: 0000000000000001 RSI: 00007ffe2fe14660 RDI: 0000000000000003
Oct 5 22:46:19 localhost kernel: RBP: 00007ffe2fe14660 R08: 00007ffe2fe14660 R09: 000000000000001d
Oct 5 22:46:19 localhost kernel: R10: 000000000000000a R11: 0000000000000246 R12: 00000000004699a6
Oct 5 22:46:19 localhost kernel: R13: 0000000000000000 R14: 0000000001a1bd00 R15: 0000000001a1bd00
Oct 5 22:46:19 localhost kernel: INFO: task dd:3761 blocked for more than 120 seconds.
Oct 5 22:46:19 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:46:19 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:46:19 localhost kernel: dd D 0 3761 2291 0x00000080
Oct 5 22:46:19 localhost kernel: Call Trace:
Oct 5 22:46:19 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:46:19 localhost kernel: schedule+0x36/0x80
Oct 5 22:46:19 localhost kernel: io_schedule+0x16/0x40
Oct 5 22:46:19 localhost kernel: __lock_page+0x10e/0x160
Oct 5 22:46:19 localhost kernel: ? page_cache_tree_insert+0xf0/0xf0
Oct 5 22:46:19 localhost kernel: mpage_prepare_extent_to_map+0x290/0x310 [ext4]
Oct 5 22:46:19 localhost kernel: ext4_writepages+0x467/0xe80 [ext4]
Oct 5 22:46:19 localhost kernel: do_writepages+0x1c/0x70
Oct 5 22:46:19 localhost kernel: __filemap_fdatawrite_range+0xc6/0x100
Oct 5 22:46:19 localhost kernel: filemap_flush+0x1c/0x20
Oct 5 22:46:19 localhost kernel: ext4_alloc_da_blocks+0x2c/0x70 [ext4]
Oct 5 22:46:19 localhost kernel: ext4_release_file+0x79/0xc0 [ext4]
Oct 5 22:46:19 localhost kernel: __fput+0xe7/0x210
Oct 5 22:46:19 localhost kernel: ____fput+0xe/0x10
Oct 5 22:46:19 localhost kernel: task_work_run+0x83/0xb0
Oct 5 22:46:19 localhost kernel: exit_to_usermode_loop+0x6c/0xa8
Oct 5 22:46:19 localhost kernel: do_syscall_64+0x13a/0x150
Oct 5 22:46:19 localhost kernel: entry_SYSCALL64_slow_path+0x25/0x25
Oct 5 22:46:19 localhost kernel: RIP: 0033:0x7f09c0d15e90
Oct 5 22:46:19 localhost kernel: RSP: 002b:00007ffd681b4f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Oct 5 22:46:19 localhost kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f09c0d15e90
Oct 5 22:46:19 localhost kernel: RDX: 0000000000100000 RSI: 0000000000000000 RDI: 0000000000000001
Oct 5 22:46:19 localhost kernel: RBP: 00000000000003e8 R08: ffffffffffffffff R09: 0000000000102003
Oct 5 22:46:19 localhost kernel: R10: 00007ffd681b4c60 R11: 0000000000000246 R12: 00000000000003e8
Oct 5 22:46:19 localhost kernel: R13: 0000000000000000 R14: 00007ffd681b634b R15: 00007ffd681b51d0
Oct 5 22:48:21 localhost kernel: INFO: task kworker/u8:3:2453 blocked for more than 120 seconds.
Oct 5 22:48:21 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:48:21 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:48:21 localhost kernel: kworker/u8:3 D 0 2453 2 0x00000080
Oct 5 22:48:21 localhost kernel: Workqueue: writeback wb_workfn (flush-9:0)
Oct 5 22:48:21 localhost kernel: Call Trace:
Oct 5 22:48:21 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:48:21 localhost kernel: schedule+0x36/0x80
Oct 5 22:48:21 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:48:21 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:21 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:48:21 localhost kernel: ? bio_split+0x5d/0x90
Oct 5 22:48:21 localhost kernel: ? blk_queue_split+0xd2/0x630
Oct 5 22:48:21 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:21 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:48:21 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:48:21 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:48:21 localhost kernel: ? __test_set_page_writeback+0xc6/0x320
Oct 5 22:48:21 localhost kernel: ext4_io_submit+0x4c/0x60 [ext4]
Oct 5 22:48:21 localhost kernel: ext4_bio_write_page+0x1a4/0x3b0 [ext4]
Oct 5 22:48:21 localhost kernel: mpage_submit_page+0x57/0x70 [ext4]
Oct 5 22:48:21 localhost kernel: mpage_map_and_submit_buffers+0x168/0x290 [ext4]
Oct 5 22:48:21 localhost kernel: ext4_writepages+0x852/0xe80 [ext4]
Oct 5 22:48:21 localhost kernel: ? account_entity_enqueue+0xd8/0x100
Oct 5 22:48:21 localhost kernel: do_writepages+0x1c/0x70
Oct 5 22:48:21 localhost kernel: __writeback_single_inode+0x45/0x320
Oct 5 22:48:21 localhost kernel: writeback_sb_inodes+0x280/0x570
Oct 5 22:48:21 localhost kernel: __writeback_inodes_wb+0x8c/0xc0
Oct 5 22:48:21 localhost kernel: wb_writeback+0x276/0x310
Oct 5 22:48:21 localhost kernel: wb_workfn+0x19c/0x3b0
Oct 5 22:48:21 localhost kernel: process_one_work+0x149/0x360
Oct 5 22:48:21 localhost kernel: worker_thread+0x4d/0x3c0
Oct 5 22:48:21 localhost kernel: kthread+0x109/0x140
Oct 5 22:48:21 localhost kernel: ? rescuer_thread+0x380/0x380
Oct 5 22:48:21 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:48:21 localhost kernel: ? do_syscall_64+0x67/0x150
Oct 5 22:48:21 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:48:21 localhost kernel: INFO: task jbd2/md0-8:3740 blocked for more than 120 seconds.
Oct 5 22:48:21 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:48:21 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:48:21 localhost kernel: jbd2/md0-8 D 0 3740 2 0x00000080
Oct 5 22:48:21 localhost kernel: Call Trace:
Oct 5 22:48:21 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:48:21 localhost kernel: ? __enqueue_entity+0x6c/0x70
Oct 5 22:48:21 localhost kernel: schedule+0x36/0x80
Oct 5 22:48:21 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:48:21 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:21 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:48:21 localhost kernel: ? find_next_bit+0xb/0x10
Oct 5 22:48:21 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:21 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:48:21 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:48:21 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:48:21 localhost kernel: ? select_idle_sibling+0x2e0/0x3b0
Oct 5 22:48:21 localhost kernel: submit_bh_wbc+0x140/0x170
Oct 5 22:48:21 localhost kernel: submit_bh+0x13/0x20
Oct 5 22:48:21 localhost kernel: jbd2_write_superblock+0x109/0x230 [jbd2]
Oct 5 22:48:21 localhost kernel: ? __enqueue_entity+0x6c/0x70
Oct 5 22:48:21 localhost kernel: ? enqueue_entity+0x1f3/0x720
Oct 5 22:48:21 localhost kernel: jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
Oct 5 22:48:21 localhost kernel: ? mutex_lock_io+0x25/0x30
Oct 5 22:48:21 localhost kernel: jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
Oct 5 22:48:21 localhost kernel: ? account_entity_dequeue+0xaa/0xe0
Oct 5 22:48:21 localhost kernel: ? dequeue_entity+0xed/0x460
Oct 5 22:48:21 localhost kernel: ? ttwu_do_activate+0x7a/0x90
Oct 5 22:48:21 localhost kernel: ? dequeue_task_fair+0x565/0x820
Oct 5 22:48:21 localhost kernel: ? __switch_to+0x229/0x440
Oct 5 22:48:21 localhost kernel: ? lock_timer_base+0x7d/0xa0
Oct 5 22:48:21 localhost kernel: ? try_to_del_timer_sync+0x53/0x80
Oct 5 22:48:21 localhost kernel: kjournald2+0xd2/0x260 [jbd2]
Oct 5 22:48:21 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:21 localhost kernel: kthread+0x109/0x140
Oct 5 22:48:21 localhost kernel: ? commit_timeout+0x10/0x10 [jbd2]
Oct 5 22:48:21 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:48:21 localhost kernel: ? do_syscall_64+0x67/0x150
Oct 5 22:48:21 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:48:21 localhost kernel: INFO: task ext4lazyinit:3743 blocked for more than 120 seconds.
Oct 5 22:48:22 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:48:22 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:48:22 localhost kernel: ext4lazyinit D 0 3743 2 0x00000080
Oct 5 22:48:22 localhost kernel: Call Trace:
Oct 5 22:48:22 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:48:22 localhost kernel: ? mempool_alloc+0x6e/0x170
Oct 5 22:48:22 localhost kernel: schedule+0x36/0x80
Oct 5 22:48:22 localhost kernel: md_write_start+0x195/0x230 [md_mod]
Oct 5 22:48:22 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:22 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Oct 5 22:48:22 localhost kernel: ? bio_split+0x5d/0x90
Oct 5 22:48:22 localhost kernel: ? blk_queue_split+0xd2/0x630
Oct 5 22:48:22 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:22 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
Oct 5 22:48:22 localhost kernel: ? mempool_alloc_slab+0x15/0x20
Oct 5 22:48:22 localhost kernel: ? mempool_alloc+0x6e/0x170
Oct 5 22:48:22 localhost kernel: generic_make_request+0x117/0x2f0
Oct 5 22:48:22 localhost kernel: submit_bio+0x75/0x150
Oct 5 22:48:22 localhost kernel: next_bio+0x38/0x40
Oct 5 22:48:22 localhost kernel: __blkdev_issue_zeroout+0x164/0x210
Oct 5 22:48:22 localhost kernel: blkdev_issue_zeroout+0x62/0xc0
Oct 5 22:48:22 localhost kernel: ext4_init_inode_table+0x166/0x380 [ext4]
Oct 5 22:48:22 localhost kernel: ext4_lazyinit_thread+0x2d3/0x350 [ext4]
Oct 5 22:48:22 localhost kernel: kthread+0x109/0x140
Oct 5 22:48:22 localhost kernel: ? ext4_clear_request_list+0x70/0x70 [ext4]
Oct 5 22:48:22 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:48:22 localhost kernel: ret_from_fork+0x25/0x30
Oct 5 22:48:22 localhost kernel: INFO: task md0_reshape:3751 blocked for more than 120 seconds.
Oct 5 22:48:22 localhost kernel: Tainted: G OE 4.13.0-rc5 #1
Oct 5 22:48:22 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 5 22:48:22 localhost kernel: md0_reshape D 0 3751 2 0x00000080
Oct 5 22:48:22 localhost kernel: Call Trace:
Oct 5 22:48:22 localhost kernel: __schedule+0x28d/0x890
Oct 5 22:48:22 localhost kernel: schedule+0x36/0x80
Oct 5 22:48:22 localhost kernel: raid5_sync_request+0x2cf/0x370 [raid456]
Oct 5 22:48:22 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:22 localhost kernel: md_do_sync+0xafe/0xee0 [md_mod]
Oct 5 22:48:22 localhost kernel: ? remove_wait_queue+0x60/0x60
Oct 5 22:48:22 localhost kernel: md_thread+0x132/0x180 [md_mod]
Oct 5 22:48:22 localhost kernel: kthread+0x109/0x140
Oct 5 22:48:22 localhost kernel: ? find_pers+0x70/0x70 [md_mod]
Oct 5 22:48:22 localhost kernel: ? kthread_park+0x60/0x60
Oct 5 22:48:22 localhost kernel: ret_from_fork+0x25/0x30
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-09 1:21 ` Xiao Ni
@ 2017-10-09 4:57 ` NeilBrown
2017-10-09 5:32 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-09 4:57 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
On Sun, Oct 08 2017, Xiao Ni wrote:
> ----- Original Message -----
>> From: "NeilBrown" <neilb@suse.com>
>> To: "Xiao Ni" <xni@redhat.com>
>> Cc: linux-raid@vger.kernel.org
>> Sent: Friday, October 6, 2017 12:32:19 PM
>> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>>
>> On Fri, Oct 06 2017, Xiao Ni wrote:
>>
>> > On 10/05/2017 01:17 PM, NeilBrown wrote:
>> >> On Thu, Sep 14 2017, Xiao Ni wrote:
>> >>
>> >>>> What do
>> >>>> cat /proc/8987/stack
>> >>>> cat /proc/8983/stack
>> >>>> cat /proc/8966/stack
>> >>>> cat /proc/8381/stack
>> >>>>
>> >>>> show??
>> >> ...
>> >>
>> >>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add
>> >>> lockdep_assert_held(&mddev->reconfig_mutex)?
>> >>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
>> >>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
>> >>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
>> >>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
>> >>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
>> >>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
>> >>> [<ffffffff81260457>] __vfs_write+0x37/0x170
>> >>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
>> >>> [<ffffffff81263015>] SyS_write+0x55/0xc0
>> >>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
>> >>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
>> >>> [<ffffffffffffffff>] 0xffffffffffffffff
>> >>>
>> >>> [jbd2/md0-8]
>> >>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
>> >>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
>> >>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
>> >>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
>> >>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
>> >>> [<ffffffff81376675>] submit_bio+0x75/0x150
>> >>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
>> >>> [<ffffffff8129e683>] submit_bh+0x13/0x20
>> >>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
>> >>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
>> >>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
>> >>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
>> >>> [<ffffffff810c73f9>] kthread+0x109/0x140
>> >>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
>> >>> [<ffffffffffffffff>] 0xffffffffffffffff
>> >> Thanks for this (and sorry it took so long to get to it).
>> >> It looks like
>> >>
>> >> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and
>> >> md_write_start()")
>> >>
>> >> is badly broken. I wonder how it ever passed testing.
>> >>
>> >> In md_write_start() it changed the wait_event() call to
>> >>
>> >> wait_event(mddev->sb_wait,
>> >> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>> >> !mddev->suspended);
>> >>
>> >>
>> >> That should be
>> >>
>> >> wait_event(mddev->sb_wait,
>> >> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
>> >> mddev->suspended);
>> > Hi Neil
>> >
>> > Do we want write bios to be handled when mddev->suspended is 1? After
>> > this change,
>> > write bios can be handled when mddev->suspended is 1.
>>
>> This is OK.
>> New write bios will not get past md_handle_request().
>> A write bio that did get past md_handle_request() is still allowed
>> through md_write_start(). The mddev_suspend() call won't complete until
>> that write bio has finished.
>
> Hi Neil
>
> Thanks for the explanation. I took some time to read the emails about the
> patch cc27b0c78 which introduced this. It's similar to the problem I
> encountered. But there is a call of mddev_suspend() in level_store.
> So adding the check of mddev->suspended in md_write_start() can fix the problem
> "reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store /
> raid5_make_request".
>
> In function suspend_lo_store it doesn't call mddev_suspend under mddev->reconfig_mutex.
It would if you had applied
[PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
Did you apply all 4 patches?
> So there is still a race possibility as you said at first analysis.
>>
>> >
>> > When the stuck happens, mddev->suspended is 0 and MD_SB_CHANGE_PENDING
>> > is set. So
>> > the patch can't fix this problem. I tried the patch, the problem still
>> > exists.
>> >
>>
>>
>> I need to see all the stack traces.
>
> I've added the call trace as an attachment.
Thanks. It looks like suspend_lo_store() is calling raid5_quiesce() directly
as you say - so a patch is missing.
>
>>
>>
>> > [ 7710.589274] mddev suspend : 0
>> > [ 7710.592228] mddev ro : 0
>> > [ 7710.594746] mddev insync : 0
>> > [ 7710.597620] mddev SB CHANGE PENDING is set
>> > [ 7710.601698] mddev SB CHANGE CLEAN is set
>> > [ 7710.605601] mddev->persistent : 1
>> > [ 7710.608905] mddev->external : 0
>> > [ 7710.612030] conf quiesce : 2
>> >
>> > raid5 is still spinning.
>> >
>> > Hmm, I have a question. Why can't we call md_check_recovery() when
>> > MD_SB_CHANGE_PENDING
>> > is set in raid5d?
>>
>> When MD_SB_CHANGE_PENDING is not set, there is no need to call
>> md_check_recovery(). It wouldn't hurt, except that it would be a waste of
>> time.
>
> I'm confused. If we want to call md_check_recovery when MD_SB_CHANGE_PENDING
> is set, it should be
Sorry, I described the condition wrongly.
If any bit is set in ->sb_flags (except MD_SB_CHANGE_PENDING), then
we need to call md_check_recovery(). If none of those other bits
are set, there is no need.
NeilBrown
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-09 4:57 ` NeilBrown
@ 2017-10-09 5:32 ` Xiao Ni
2017-10-09 5:52 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-10-09 5:32 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 10/09/2017 12:57 PM, NeilBrown wrote:
> On Sun, Oct 08 2017, Xiao Ni wrote:
>
>> ----- Original Message -----
>>> From: "NeilBrown" <neilb@suse.com>
>>> To: "Xiao Ni" <xni@redhat.com>
>>> Cc: linux-raid@vger.kernel.org
>>> Sent: Friday, October 6, 2017 12:32:19 PM
>>> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>>>
>>> On Fri, Oct 06 2017, Xiao Ni wrote:
>>>
>>>> On 10/05/2017 01:17 PM, NeilBrown wrote:
>>>>> On Thu, Sep 14 2017, Xiao Ni wrote:
>>>>>
>>>>>>> What do
>>>>>>> cat /proc/8987/stack
>>>>>>> cat /proc/8983/stack
>>>>>>> cat /proc/8966/stack
>>>>>>> cat /proc/8381/stack
>>>>>>>
>>>>>>> show??
>>>>> ...
>>>>>
>>>>>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add
>>>>>> lockdep_assert_held(&mddev->reconfig_mutex)?
>>>>>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
>>>>>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
>>>>>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
>>>>>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
>>>>>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
>>>>>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
>>>>>> [<ffffffff81260457>] __vfs_write+0x37/0x170
>>>>>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
>>>>>> [<ffffffff81263015>] SyS_write+0x55/0xc0
>>>>>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
>>>>>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>
>>>>>> [jbd2/md0-8]
>>>>>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
>>>>>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
>>>>>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
>>>>>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
>>>>>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
>>>>>> [<ffffffff81376675>] submit_bio+0x75/0x150
>>>>>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
>>>>>> [<ffffffff8129e683>] submit_bh+0x13/0x20
>>>>>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
>>>>>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
>>>>>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
>>>>>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
>>>>>> [<ffffffff810c73f9>] kthread+0x109/0x140
>>>>>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>> Thanks for this (and sorry it took so long to get to it).
>>>>> It looks like
>>>>>
>>>>> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and
>>>>> md_write_start()")
>>>>>
>>>>> is badly broken. I wonder how it ever passed testing.
>>>>>
>>>>> In md_write_start() it changed the wait_event() call to
>>>>>
>>>>> wait_event(mddev->sb_wait,
>>>>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>>>>> !mddev->suspended);
>>>>>
>>>>>
>>>>> That should be
>>>>>
>>>>> wait_event(mddev->sb_wait,
>>>>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
>>>>> mddev->suspended);
>>>> Hi Neil
>>>>
>>>> Do we want write bios to be handled when mddev->suspended is 1? After
>>>> this change,
>>>> write bios can be handled when mddev->suspended is 1.
>>> This is OK.
>>> New write bios will not get past md_handle_request().
>>> A write bio that did get past md_handle_request() is still allowed
>>> through md_write_start(). The mddev_suspend() call won't complete until
>>> that write bio has finished.
>> Hi Neil
>>
>> Thanks for the explanation. I took some time to read the emails about the
>> patch cc27b0c78 which introduced this. It's similar to the problem I
>> encountered. But there is a call of mddev_suspend() in level_store.
>> So adding the check of mddev->suspended in md_write_start() can fix the problem
>> "reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store /
>> raid5_make_request".
>>
>> In function suspend_lo_store it doesn't call mddev_suspend under mddev->reconfig_mutex.
> It would if you had applied
> [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>
> Did you apply all 4 patches?
Sorry, it's my mistake. I insmodded the wrong module. I'll apply the four
patches
and test again.
> Thanks. It looks like suspend_lo_store() is calling raid5_quiesce() directly
> as you say - so a patch is missing.
Yes, thanks for pointing this out.
>>>> Hmm, I have a question. Why can't we call md_check_recovery() when
>>>> MD_SB_CHANGE_PENDING
>>>> is set in raid5d?
>>> When MD_SB_CHANGE_PENDING is not set, there is no need to call
>>> md_check_recovery(). It wouldn't hurt, except that it would be a waste of
>>> time.
>> I'm confused. If we want to call md_check_recovery when MD_SB_CHANGE_PENDING
>> is set, it should be
> Sorry, I described the condition wrongly.
> If any bit is set in ->sb_flags (except MD_SB_CHANGE_PENDING), then
> we need to call md_check_recovery(). If none of those other bits
> are set, there is no need.
Hmm, so back to my first question: why can't we call md_check_recovery() when
MD_SB_CHANGE_PENDING is set? The superblock still needs to be updated when
MD_SB_CHANGE_PENDING is set. I can't understand this part.
Can it be:
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6299,7 +6299,7 @@ static void raid5d(struct md_thread *thread)
break;
handled += batch_size;
- if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
+ if (mddev->sb_flags) {
Best Regards
Xiao
>
> NeilBrown
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-09 5:32 ` Xiao Ni
@ 2017-10-09 5:52 ` NeilBrown
2017-10-10 6:05 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-09 5:52 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
On Mon, Oct 09 2017, Xiao Ni wrote:
> On 10/09/2017 12:57 PM, NeilBrown wrote:
>> On Sun, Oct 08 2017, Xiao Ni wrote:
>>
>>> ----- Original Message -----
>>>> From: "NeilBrown" <neilb@suse.com>
>>>> To: "Xiao Ni" <xni@redhat.com>
>>>> Cc: linux-raid@vger.kernel.org
>>>> Sent: Friday, October 6, 2017 12:32:19 PM
>>>> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
>>>>
>>>> On Fri, Oct 06 2017, Xiao Ni wrote:
>>>>
>>>>> On 10/05/2017 01:17 PM, NeilBrown wrote:
>>>>>> On Thu, Sep 14 2017, Xiao Ni wrote:
>>>>>>
>>>>>>>> What do
>>>>>>>> cat /proc/8987/stack
>>>>>>>> cat /proc/8983/stack
>>>>>>>> cat /proc/8966/stack
>>>>>>>> cat /proc/8381/stack
>>>>>>>>
>>>>>>>> show??
>>>>>> ...
>>>>>>
>>>>>>> /usr/sbin/mdadm --grow --continue /dev/md0. Is it the reason to add
>>>>>>> lockdep_assert_held(&mddev->reconfig_mutex)?
>>>>>>> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
>>>>>>> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
>>>>>>> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
>>>>>>> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
>>>>>>> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
>>>>>>> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
>>>>>>> [<ffffffff81260457>] __vfs_write+0x37/0x170
>>>>>>> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
>>>>>>> [<ffffffff81263015>] SyS_write+0x55/0xc0
>>>>>>> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
>>>>>>> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>
>>>>>>> [jbd2/md0-8]
>>>>>>> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
>>>>>>> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
>>>>>>> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
>>>>>>> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
>>>>>>> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
>>>>>>> [<ffffffff81376675>] submit_bio+0x75/0x150
>>>>>>> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
>>>>>>> [<ffffffff8129e683>] submit_bh+0x13/0x20
>>>>>>> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
>>>>>>> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
>>>>>>> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
>>>>>>> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
>>>>>>> [<ffffffff810c73f9>] kthread+0x109/0x140
>>>>>>> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>> Thanks for this (and sorry it took so long to get to it).
>>>>>> It looks like
>>>>>>
>>>>>> Commit: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and
>>>>>> md_write_start()")
>>>>>>
>>>>>> is badly broken. I wonder how it ever passed testing.
>>>>>>
>>>>>> In md_write_start() it changed the wait_event() call to
>>>>>>
>>>>>> wait_event(mddev->sb_wait,
>>>>>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>>>>>> !mddev->suspended);
>>>>>>
>>>>>>
>>>>>> That should be
>>>>>>
>>>>>> wait_event(mddev->sb_wait,
>>>>>> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
>>>>>> mddev->suspended);
>>>>> Hi Neil
>>>>>
>>>>> Do we want write bios to be handled when mddev->suspended is 1? After
>>>>> this change,
>>>>> write bios can be handled when mddev->suspended is 1.
>>>> This is OK.
>>>> New write bios will not get past md_handle_request().
>>>> A write bio that did get past md_handle_request() is still allowed
>>>> through md_write_start(). The mddev_suspend() call won't complete until
>>>> that write bio has finished.
>>> Hi Neil
>>>
>>> Thanks for the explanation. I took some time to read the emails about the
>>> patch cc27b0c78 which introduced this. It's similar to the problem I
>>> encountered. But there is a call of mddev_suspend() in level_store.
>>> So adding the check of mddev->suspended in md_write_start() can fix the problem
>>> "reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store /
>>> raid5_make_request".
>>>
>>> In function suspend_lo_store it doesn't call mddev_suspend under mddev->reconfig_mutex.
>> It would if you had applied
>> [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>>
>> Did you apply all 4 patches?
>
> Sorry, it's my mistake. I insmodded the wrong module. I'll apply the four
> patches
> and test again.
>> Thanks. It looks like suspend_lo_store() is calling raid5_quiesce() directly
>> as you say - so a patch is missing.
>
> Yes, thanks for pointing this out.
>
>>>>> Hmm, I have a question. Why can't we call md_check_recovery() when
>>>>> MD_SB_CHANGE_PENDING
>>>>> is set in raid5d?
>>>> When MD_SB_CHANGE_PENDING is not set, there is no need to call
>>>> md_check_recovery(). It wouldn't hurt, except that it would be a waste of
>>>> time.
>>> I'm confused. If we want to call md_check_recovery when MD_SB_CHANGE_PENDING
>>> is set, it should be
>> Sorry, I described the condition wrongly.
>> If any bit is set in ->sb_flags (except MD_SB_CHANGE_PENDING), then
>> we need to call md_check_recovery(). If none of those other bits
>> are set, there is no need.
>
> Hmm, so back to my first question: why can't we call md_check_recovery() when
> MD_SB_CHANGE_PENDING is set? The superblock still needs to be updated when
> MD_SB_CHANGE_PENDING is set. I can't understand this part.
>
> Can it be:
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6299,7 +6299,7 @@ static void raid5d(struct md_thread *thread)
> break;
> handled += batch_size;
>
> - if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
> + if (mddev->sb_flags) {
>
Maybe it could, but there is a test in md_check_recovery()
if ( ! (
(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
and it makes sense to match that. There is no point dropping the
spinlock and reclaiming it if md_check_recovery() isn't going to do
anything useful.
NeilBrown
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-09 5:52 ` NeilBrown
@ 2017-10-10 6:05 ` Xiao Ni
2017-10-10 21:20 ` NeilBrown
0 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-10-10 6:05 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 10/09/2017 01:52 PM, NeilBrown wrote:
> On Mon, Oct 09 2017, Xiao Ni wrote:
>
>> On 10/09/2017 12:57 PM, NeilBrown wrote:
>>> It would if you had applied
>>> [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>>>
>>> Did you apply all 4 patches?
>> Sorry, it's my mistake. I insmodded the wrong module. I'll apply the four
>> patches
>> and test again.
>>> Thanks. It looks like suspend_lo_store() is calling raid5_quiesce() directly
>>> as you say - so a patch is missing.
>> Yes, thanks for pointing this out.
Hi Neil
I applied the four patches and one patch "md: fix deadlock error in
recent patch."
There is a new hang. It's stuck at suspend_hi_store this time. I've added
the call trace
as an attachment.
I added some printk to print some information.
[12695.993329] mddev suspend : 1
[12695.996270] mddev ro : 0
[12695.998790] mddev insync : 0
[12696.001641] mddev active io: 1
mddev->flags doesn't have MD_SB_CHANGE_PENDING.
root 8653 0.0 0.0 9688 4752 ? DLs 00:52 0:00
/usr/sbin/mdadm --grow --continue /dev/md0
[root@dell-pr1700-02 md]# cat /proc/8653/stack
[<ffffffffa080d64c>] mddev_suspend+0x12c/0x160 [md_mod]
[<ffffffffa081090c>] suspend_hi_store+0x7c/0xe0 [md_mod]
[<ffffffffa0814450>] md_attr_store+0x80/0xc0 [md_mod]
[<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
[<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
[<ffffffff81260457>] __vfs_write+0x37/0x170
[<ffffffff812619e2>] vfs_write+0xb2/0x1b0
[<ffffffff81263015>] SyS_write+0x55/0xc0
[<ffffffff810037c7>] do_syscall_64+0x67/0x150
[<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
root 1234 0.0 0.0 106008 7280 ? Ss Oct09 0:00
/usr/sbin/sshd -D
root 8655 0.1 0.0 108996 2752 pts/0 D+ 00:52 0:04
| \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
[root@dell-pr1700-02 md]# cat /proc/8655/stack
[<ffffffffa097b09a>] wait_transaction_locked+0x8a/0xd0 [jbd2]
[<ffffffffa097b2d4>] add_transaction_credits+0x1c4/0x2a0 [jbd2]
[<ffffffffa097b587>] start_this_handle+0x197/0x400 [jbd2]
[<ffffffffa097ba3b>] jbd2__journal_start+0xeb/0x1f0 [jbd2]
[<ffffffffa0a26dfd>] __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[<ffffffffa0a45a10>] ext4_da_write_begin+0x140/0x410 [ext4]
[<ffffffff811c4dee>] generic_perform_write+0xbe/0x1b0
[<ffffffff811c812b>] __generic_file_write_iter+0x19b/0x1e0
[<ffffffffa0a32c7f>] ext4_file_write_iter+0x28f/0x3f0 [ext4]
[<ffffffff81260513>] __vfs_write+0xf3/0x170
[<ffffffff812619e2>] vfs_write+0xb2/0x1b0
[<ffffffff81263015>] SyS_write+0x55/0xc0
[<ffffffff810037c7>] do_syscall_64+0x67/0x150
[<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
root 8143 0.0 0.0 0 0 ? D 00:40 0:00 \_
[kworker/u8:4]
[root@dell-pr1700-02 md]# cat /proc/8143/stack
[<ffffffffa080d131>] md_make_request+0xb1/0x260 [md_mod]
[<ffffffff81376427>] generic_make_request+0x117/0x2f0
[<ffffffff81376675>] submit_bio+0x75/0x150
[<ffffffffa0a5e21c>] ext4_io_submit+0x4c/0x60 [ext4]
[<ffffffffa0a5e3f4>] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
[<ffffffffa0a3e4f7>] mpage_submit_page+0x57/0x70 [ext4]
[<ffffffffa0a3e778>] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
[<ffffffffa0a443f2>] ext4_writepages+0x852/0xe80 [ext4]
[<ffffffff811d6bec>] do_writepages+0x1c/0x70
[<ffffffff81293895>] __writeback_single_inode+0x45/0x320
[<ffffffff812940c0>] writeback_sb_inodes+0x280/0x570
[<ffffffff8129443c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812946e6>] wb_writeback+0x276/0x310
[<ffffffff81294f9c>] wb_workfn+0x19c/0x3b0
[<ffffffff810c0ff9>] process_one_work+0x149/0x360
[<ffffffff810c177d>] worker_thread+0x4d/0x3c0
[<ffffffff810c73f9>] kthread+0x109/0x140
[<ffffffff817776c5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
>> Hmm, so it's the first question. Why can't md_check_recovery be called
>> when MD_SB_CHANGE_PENDING is set? It needs to update the superblock
>> too when MD_SB_CHANGE_PENDING is set. I can't understand this part.
>>
>> Can it be:
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6299,7 +6299,7 @@ static void raid5d(struct md_thread *thread)
>> break;
>> handled += batch_size;
>>
>> - if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
>> + if (mddev->sb_flags) {
>>
> Maybe it could, but there is a test in md_check_recovery()
>
> if ( ! (
> (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
>
> and it makes sense to match that. There is no point dropping the
> spinlock and reclaiming it if md_check_recovery() isn't going to do
> anything useful.
It was introduced by:
commit 126925c090155f13e90b9e7e8c4010e96027c00a
Author: NeilBrown <neilb@suse.de>
Date: Tue Sep 7 17:02:47 2010 +1000
md: call md_update_sb even for 'external' metadata arrays.
For external metadata we don't want to take the mddev lock if only
MD_SB_CHANGE_PENDING is set. But it also reduces the opportunity
to update the superblock for internal metadata when only
MD_SB_CHANGE_PENDING is set.
Can it be:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b6b7a28..55e9280 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7777,7 +7777,7 @@ void md_check_recovery(struct mddev *mddev)
if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
return;
if ( ! (
- (mddev->flags & ~ (1<<MD_CHANGE_PENDING)) ||
+ (mddev->flags & (mddev->external == 1 && ~(1<<MD_CHANGE_PENDING))) ||
test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
(mddev->external == 0 && mddev->safemode == 1) ||
Best Regards
Xiao
[-- Attachment #2: calltrace --]
[-- Type: text/plain, Size: 15020 bytes --]
[11302.691461] INFO: task kworker/u8:4:8143 blocked for more than 120 seconds.
[11302.698364] Tainted: G OE 4.13.0-rc5 #1
[11302.703719] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11302.711482] kworker/u8:4 D 0 8143 2 0x00000080
[11302.711488] Workqueue: writeback wb_workfn (flush-9:0)
[11302.711489] Call Trace:
[11302.713914] __schedule+0x28d/0x890
[11302.717370] schedule+0x36/0x80
[11302.720485] md_make_request+0xb1/0x260 [md_mod]
[11302.725064] ? remove_wait_queue+0x60/0x60
[11302.729123] generic_make_request+0x117/0x2f0
[11302.733442] submit_bio+0x75/0x150
[11302.736812] ? __test_set_page_writeback+0xc6/0x320
[11302.741660] ext4_io_submit+0x4c/0x60 [ext4]
[11302.745896] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
[11302.750733] mpage_submit_page+0x57/0x70 [ext4]
[11302.755231] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
[11302.760845] ext4_writepages+0x852/0xe80 [ext4]
[11302.765340] ? account_entity_enqueue+0xd8/0x100
[11302.769916] do_writepages+0x1c/0x70
[11302.773462] __writeback_single_inode+0x45/0x320
[11302.778040] writeback_sb_inodes+0x280/0x570
[11302.782273] __writeback_inodes_wb+0x8c/0xc0
[11302.786508] wb_writeback+0x276/0x310
[11302.790137] wb_workfn+0x19c/0x3b0
[11302.793509] process_one_work+0x149/0x360
[11302.797483] worker_thread+0x4d/0x3c0
[11302.801109] kthread+0x109/0x140
[11302.804307] ? rescuer_thread+0x380/0x380
[11302.808280] ? kthread_park+0x60/0x60
[11302.811913] ? do_syscall_64+0x67/0x150
[11302.815715] ret_from_fork+0x25/0x30
[11302.819257] INFO: task jbd2/md0-8:8635 blocked for more than 120 seconds.
[11302.825990] Tainted: G OE 4.13.0-rc5 #1
[11302.831338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11302.839100] jbd2/md0-8 D 0 8635 2 0x00000080
[11302.839101] Call Trace:
[11302.841527] __schedule+0x28d/0x890
[11302.844985] schedule+0x36/0x80
[11302.848097] jbd2_journal_commit_transaction+0x275/0x19e0 [jbd2]
[11302.854052] ? account_entity_dequeue+0xaa/0xe0
[11302.858541] ? dequeue_entity+0xed/0x460
[11302.862428] ? ttwu_do_activate+0x7a/0x90
[11302.866400] ? dequeue_task_fair+0x565/0x820
[11302.870632] ? __switch_to+0x229/0x440
[11302.874350] ? remove_wait_queue+0x60/0x60
[11302.878409] ? lock_timer_base+0x7d/0xa0
[11302.882299] ? try_to_del_timer_sync+0x53/0x80
[11302.886704] kjournald2+0xd2/0x260 [jbd2]
[11302.890676] ? remove_wait_queue+0x60/0x60
[11302.894740] kthread+0x109/0x140
[11302.897940] ? commit_timeout+0x10/0x10 [jbd2]
[11302.902346] ? kthread_park+0x60/0x60
[11302.905973] ret_from_fork+0x25/0x30
[11302.909514] INFO: task mdadm:8653 blocked for more than 120 seconds.
[11302.915812] Tainted: G OE 4.13.0-rc5 #1
[11302.921160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11302.928923] mdadm D 0 8653 1 0x00000080
[11302.928924] Call Trace:
[11302.931347] __schedule+0x28d/0x890
[11302.934807] schedule+0x36/0x80
[11302.937923] mddev_suspend+0x12c/0x160 [md_mod]
[11302.942416] ? remove_wait_queue+0x60/0x60
[11302.946476] suspend_hi_store+0x7c/0xe0 [md_mod]
[11302.951052] md_attr_store+0x80/0xc0 [md_mod]
[11302.955375] sysfs_kf_write+0x3a/0x50
[11302.959002] kernfs_fop_write+0xff/0x180
[11302.962894] __vfs_write+0x37/0x170
[11302.966351] ? selinux_file_permission+0xe5/0x120
[11302.971013] ? security_file_permission+0x3b/0xc0
[11302.975679] vfs_write+0xb2/0x1b0
[11302.978961] ? syscall_trace_enter+0x1d0/0x2b0
[11302.983366] SyS_write+0x55/0xc0
[11302.986564] do_syscall_64+0x67/0x150
[11302.990193] entry_SYSCALL64_slow_path+0x25/0x25
[11302.994769] RIP: 0033:0x7f016632a840
[11302.998308] RSP: 002b:00007ffdcb5f1d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11303.005811] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f016632a840
[11303.012883] RDX: 0000000000000005 RSI: 00007ffdcb5f1e30 RDI: 0000000000000003
[11303.019953] RBP: 00007ffdcb5f1e30 R08: 00007ffdcb5f1e30 R09: 000000000000001d
[11303.027027] R10: 000000000000000a R11: 0000000000000246 R12: 00000000004699b1
[11303.034101] R13: 0000000000000000 R14: 00000000004dd000 R15: 0000000000000001
[11303.041173] INFO: task dd:8655 blocked for more than 120 seconds.
[11303.047213] Tainted: G OE 4.13.0-rc5 #1
[11303.052564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11303.060325] dd D 0 8655 2687 0x00000080
[11303.060326] Call Trace:
[11303.062752] __schedule+0x28d/0x890
[11303.066206] schedule+0x36/0x80
[11303.069317] wait_transaction_locked+0x8a/0xd0 [jbd2]
[11303.074326] ? remove_wait_queue+0x60/0x60
[11303.078385] add_transaction_credits+0x1c4/0x2a0 [jbd2]
[11303.083567] start_this_handle+0x197/0x400 [jbd2]
[11303.088229] ? __add_to_page_cache_locked+0x11c/0x1f0
[11303.093236] ? kmem_cache_alloc+0x194/0x1a0
[11303.097384] jbd2__journal_start+0xeb/0x1f0 [jbd2]
[11303.102142] ? ext4_da_write_begin+0x140/0x410 [ext4]
[11303.107151] __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[11303.112164] ext4_da_write_begin+0x140/0x410 [ext4]
[11303.116997] generic_perform_write+0xbe/0x1b0
[11303.121316] ? file_update_time+0x5e/0x110
[11303.125380] __generic_file_write_iter+0x19b/0x1e0
[11303.130130] ? _crng_backtrack_protect+0x63/0x80
[11303.134712] ext4_file_write_iter+0x28f/0x3f0 [ext4]
[11303.139630] __vfs_write+0xf3/0x170
[11303.143087] vfs_write+0xb2/0x1b0
[11303.146369] ? syscall_trace_enter+0x1d0/0x2b0
[11303.150772] SyS_write+0x55/0xc0
[11303.153971] do_syscall_64+0x67/0x150
[11303.157598] entry_SYSCALL64_slow_path+0x25/0x25
[11303.162177] RIP: 0033:0x7f951dfa0840
[11303.165720] RSP: 002b:00007fff82316978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11303.173226] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f951dfa0840
[11303.180296] RDX: 0000000000100000 RSI: 00007f951e38a000 RDI: 0000000000000001
[11303.187369] RBP: 0000000000100000 R08: ffffffffffffffff R09: 0000000000102003
[11303.194445] R10: 00007fff82316690 R11: 0000000000000246 R12: 0000000000100000
[11303.201520] R13: 00007f951e38a000 R14: 00007f951e48a000 R15: 0000000000000000
[11425.569412] INFO: task kworker/u8:4:8143 blocked for more than 120 seconds.
[11425.576313] Tainted: G OE 4.13.0-rc5 #1
[11425.581666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11425.589428] kworker/u8:4 D 0 8143 2 0x00000080
[11425.589434] Workqueue: writeback wb_workfn (flush-9:0)
[11425.589435] Call Trace:
[11425.591862] __schedule+0x28d/0x890
[11425.595317] schedule+0x36/0x80
[11425.598433] md_make_request+0xb1/0x260 [md_mod]
[11425.603015] ? remove_wait_queue+0x60/0x60
[11425.607074] generic_make_request+0x117/0x2f0
[11425.611392] submit_bio+0x75/0x150
[11425.614763] ? __test_set_page_writeback+0xc6/0x320
[11425.619607] ext4_io_submit+0x4c/0x60 [ext4]
[11425.623845] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
[11425.628682] mpage_submit_page+0x57/0x70 [ext4]
[11425.633177] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
[11425.638791] ext4_writepages+0x852/0xe80 [ext4]
[11425.643283] ? account_entity_enqueue+0xd8/0x100
[11425.647856] do_writepages+0x1c/0x70
[11425.651400] __writeback_single_inode+0x45/0x320
[11425.655972] writeback_sb_inodes+0x280/0x570
[11425.660206] __writeback_inodes_wb+0x8c/0xc0
[11425.664436] wb_writeback+0x276/0x310
[11425.668065] wb_workfn+0x19c/0x3b0
[11425.671439] process_one_work+0x149/0x360
[11425.675410] worker_thread+0x4d/0x3c0
[11425.679036] kthread+0x109/0x140
[11425.682236] ? rescuer_thread+0x380/0x380
[11425.686207] ? kthread_park+0x60/0x60
[11425.689837] ? do_syscall_64+0x67/0x150
[11425.693637] ret_from_fork+0x25/0x30
[11425.697180] INFO: task jbd2/md0-8:8635 blocked for more than 120 seconds.
[11425.703907] Tainted: G OE 4.13.0-rc5 #1
[11425.709253] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11425.717011] jbd2/md0-8 D 0 8635 2 0x00000080
[11425.717012] Call Trace:
[11425.719436] __schedule+0x28d/0x890
[11425.722891] schedule+0x36/0x80
[11425.726006] jbd2_journal_commit_transaction+0x275/0x19e0 [jbd2]
[11425.731960] ? account_entity_dequeue+0xaa/0xe0
[11425.736448] ? dequeue_entity+0xed/0x460
[11425.740337] ? ttwu_do_activate+0x7a/0x90
[11425.744309] ? dequeue_task_fair+0x565/0x820
[11425.748538] ? __switch_to+0x229/0x440
[11425.752253] ? remove_wait_queue+0x60/0x60
[11425.756313] ? lock_timer_base+0x7d/0xa0
[11425.760201] ? try_to_del_timer_sync+0x53/0x80
[11425.764605] kjournald2+0xd2/0x260 [jbd2]
[11425.768575] ? remove_wait_queue+0x60/0x60
[11425.772636] kthread+0x109/0x140
[11425.775832] ? commit_timeout+0x10/0x10 [jbd2]
[11425.780236] ? kthread_park+0x60/0x60
[11425.783863] ret_from_fork+0x25/0x30
[11425.787406] INFO: task mdadm:8653 blocked for more than 120 seconds.
[11425.793704] Tainted: G OE 4.13.0-rc5 #1
[11425.799053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11425.806815] mdadm D 0 8653 1 0x00000080
[11425.806816] Call Trace:
[11425.809237] __schedule+0x28d/0x890
[11425.812694] schedule+0x36/0x80
[11425.815807] mddev_suspend+0x12c/0x160 [md_mod]
[11425.820297] ? remove_wait_queue+0x60/0x60
[11425.824357] suspend_hi_store+0x7c/0xe0 [md_mod]
[11425.828933] md_attr_store+0x80/0xc0 [md_mod]
[11425.833253] sysfs_kf_write+0x3a/0x50
[11425.836881] kernfs_fop_write+0xff/0x180
[11425.840771] __vfs_write+0x37/0x170
[11425.844228] ? selinux_file_permission+0xe5/0x120
[11425.848888] ? security_file_permission+0x3b/0xc0
[11425.853552] vfs_write+0xb2/0x1b0
[11425.856836] ? syscall_trace_enter+0x1d0/0x2b0
[11425.861242] SyS_write+0x55/0xc0
[11425.864438] do_syscall_64+0x67/0x150
[11425.868063] entry_SYSCALL64_slow_path+0x25/0x25
[11425.872639] RIP: 0033:0x7f016632a840
[11425.876177] RSP: 002b:00007ffdcb5f1d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11425.883680] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f016632a840
[11425.890752] RDX: 0000000000000005 RSI: 00007ffdcb5f1e30 RDI: 0000000000000003
[11425.897824] RBP: 00007ffdcb5f1e30 R08: 00007ffdcb5f1e30 R09: 000000000000001d
[11425.904895] R10: 000000000000000a R11: 0000000000000246 R12: 00000000004699b1
[11425.911966] R13: 0000000000000000 R14: 00000000004dd000 R15: 0000000000000001
[11425.919038] INFO: task dd:8655 blocked for more than 120 seconds.
[11425.925076] Tainted: G OE 4.13.0-rc5 #1
[11425.930424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11425.938181] dd D 0 8655 2687 0x00000080
[11425.938182] Call Trace:
[11425.940604] __schedule+0x28d/0x890
[11425.944059] schedule+0x36/0x80
[11425.947170] wait_transaction_locked+0x8a/0xd0 [jbd2]
[11425.952178] ? remove_wait_queue+0x60/0x60
[11425.956238] add_transaction_credits+0x1c4/0x2a0 [jbd2]
[11425.961418] start_this_handle+0x197/0x400 [jbd2]
[11425.966079] ? __add_to_page_cache_locked+0x11c/0x1f0
[11425.971086] ? kmem_cache_alloc+0x194/0x1a0
[11425.975232] jbd2__journal_start+0xeb/0x1f0 [jbd2]
[11425.979987] ? ext4_da_write_begin+0x140/0x410 [ext4]
[11425.984997] __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[11425.990009] ext4_da_write_begin+0x140/0x410 [ext4]
[11425.994840] generic_perform_write+0xbe/0x1b0
[11425.999156] ? file_update_time+0x5e/0x110
[11426.003217] __generic_file_write_iter+0x19b/0x1e0
[11426.007963] ? _crng_backtrack_protect+0x63/0x80
[11426.012546] ext4_file_write_iter+0x28f/0x3f0 [ext4]
[11426.017464] __vfs_write+0xf3/0x170
[11426.020921] vfs_write+0xb2/0x1b0
[11426.024204] ? syscall_trace_enter+0x1d0/0x2b0
[11426.028606] SyS_write+0x55/0xc0
[11426.031806] do_syscall_64+0x67/0x150
[11426.035432] entry_SYSCALL64_slow_path+0x25/0x25
[11426.040010] RIP: 0033:0x7f951dfa0840
[11426.043550] RSP: 002b:00007fff82316978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11426.051054] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f951dfa0840
[11426.058126] RDX: 0000000000100000 RSI: 00007f951e38a000 RDI: 0000000000000001
[11426.065200] RBP: 0000000000100000 R08: ffffffffffffffff R09: 0000000000102003
[11426.072273] R10: 00007fff82316690 R11: 0000000000000246 R12: 0000000000100000
[11426.079345] R13: 00007f951e38a000 R14: 00007f951e48a000 R15: 0000000000000000
[11548.447362] INFO: task kworker/u8:4:8143 blocked for more than 120 seconds.
[11548.454261] Tainted: G OE 4.13.0-rc5 #1
[11548.459616] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11548.467378] kworker/u8:4 D 0 8143 2 0x00000080
[11548.467383] Workqueue: writeback wb_workfn (flush-9:0)
[11548.467384] Call Trace:
[11548.469811] __schedule+0x28d/0x890
[11548.473266] schedule+0x36/0x80
[11548.476379] md_make_request+0xb1/0x260 [md_mod]
[11548.480956] ? remove_wait_queue+0x60/0x60
[11548.485014] generic_make_request+0x117/0x2f0
[11548.489331] submit_bio+0x75/0x150
[11548.492702] ? __test_set_page_writeback+0xc6/0x320
[11548.497550] ext4_io_submit+0x4c/0x60 [ext4]
[11548.501786] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
[11548.506624] mpage_submit_page+0x57/0x70 [ext4]
[11548.511119] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
[11548.516731] ext4_writepages+0x852/0xe80 [ext4]
[11548.521222] ? account_entity_enqueue+0xd8/0x100
[11548.525796] do_writepages+0x1c/0x70
[11548.529340] __writeback_single_inode+0x45/0x320
[11548.533913] writeback_sb_inodes+0x280/0x570
[11548.538146] __writeback_inodes_wb+0x8c/0xc0
[11548.542375] wb_writeback+0x276/0x310
[11548.546001] wb_workfn+0x19c/0x3b0
[11548.549373] process_one_work+0x149/0x360
[11548.553344] worker_thread+0x4d/0x3c0
[11548.556972] kthread+0x109/0x140
[11548.560172] ? rescuer_thread+0x380/0x380
[11548.564142] ? kthread_park+0x60/0x60
[11548.567770] ? do_syscall_64+0x67/0x150
[11548.571571] ret_from_fork+0x25/0x30
[11548.575111] INFO: task jbd2/md0-8:8635 blocked for more than 120 seconds.
[11548.581839] Tainted: G OE 4.13.0-rc5 #1
[11548.587186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11548.594948] jbd2/md0-8 D 0 8635 2 0x00000080
[11548.594949] Call Trace:
[11548.597372] __schedule+0x28d/0x890
[11548.600825] schedule+0x36/0x80
[11548.603937] jbd2_journal_commit_transaction+0x275/0x19e0 [jbd2]
[11548.609892] ? account_entity_dequeue+0xaa/0xe0
[11548.614379] ? dequeue_entity+0xed/0x460
[11548.618268] ? ttwu_do_activate+0x7a/0x90
[11548.622240] ? dequeue_task_fair+0x565/0x820
[11548.626469] ? __switch_to+0x229/0x440
[11548.630187] ? remove_wait_queue+0x60/0x60
[11548.634245] ? lock_timer_base+0x7d/0xa0
[11548.638133] ? try_to_del_timer_sync+0x53/0x80
[11548.642536] kjournald2+0xd2/0x260 [jbd2]
[11548.646506] ? remove_wait_queue+0x60/0x60
[11548.650564] kthread+0x109/0x140
[11548.653763] ? commit_timeout+0x10/0x10 [jbd2]
[11548.658168] ? kthread_park+0x60/0x60
[11548.661794] ret_from_fork+0x25/0x30
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-10 6:05 ` Xiao Ni
@ 2017-10-10 21:20 ` NeilBrown
[not found] ` <960568852.19225619.1507689864371.JavaMail.zimbra@redhat.com>
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-10 21:20 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2470 bytes --]
On Tue, Oct 10 2017, Xiao Ni wrote:
> On 10/09/2017 01:52 PM, NeilBrown wrote:
>> On Mon, Oct 09 2017, Xiao Ni wrote:
>>
>>> On 10/09/2017 12:57 PM, NeilBrown wrote:
>>>> It would if you had applied
>>>> [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>>>>
>>>> Did you apply all 4 patches?
>>> Sorry, it's my mistake. I insmodded the wrong module. I'll apply the
>>> four patches and test again.
>>>> Thanks. It looks like suspend_lo_store() is calling raid5_quiesce()
>>>> directly, as you say - so a patch is missing.
>>> Yes, thanks for pointing this out.
>
> Hi Neil
>
> I applied the four patches and the patch "md: fix deadlock error in
> recent patch".
> There is a new hang. It's stuck in suspend_hi_store this time. I've added
> the calltrace as an attachment.
>
> I added some printk to print some information.
>
> [12695.993329] mddev suspend : 1
> [12695.996270] mddev ro : 0
> [12695.998790] mddev insync : 0
> [12696.001641] mddev active io: 1
You didn't tell me where (in the code) you printed this information.
That makes it hard to interpret.
If mddev->active_io is 1, then some thread must be in this range
of code
atomic_inc(&mddev->active_io);
rcu_read_unlock();
if (!mddev->pers->make_request(mddev, bio)) {
atomic_dec(&mddev->active_io);
wake_up(&mddev->sb_wait);
goto check_suspended;
}
if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
wake_up(&mddev->sb_wait);
If that thread is blocked (which appears to be the case) it must be in
->make_request() because nothing else there blocks.
None of the threads you showed are in that code.
But you didn't report all the threads - only those which had printed
warnings.
echo t > /proc/sysrq-trigger
will produce the stack traces of *all* threads. That would be more
useful.
>
> Can it be:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a28..55e9280 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7777,7 +7777,7 @@ void md_check_recovery(struct mddev *mddev)
> if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
> return;
> if ( ! (
> - (mddev->flags & ~ (1<<MD_CHANGE_PENDING)) ||
> + (mddev->flags & (mddev->external == 1 && ~(1<<MD_CHANGE_PENDING))) ||
Please read that code again and see how it doesn't make any sense at
all.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-12 1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
` (5 preceding siblings ...)
2017-09-13 2:11 ` Xiao Ni
@ 2017-09-30 9:46 ` Xiao Ni
2017-10-05 5:03 ` NeilBrown
6 siblings, 1 reply; 27+ messages in thread
From: Xiao Ni @ 2017-09-30 9:46 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 09/12/2017 09:49 AM, NeilBrown wrote:
> Hi,
> I looked again at the previous patch I posted which tried to make
> md_update_sb() safe without taking reconfig_mutex, and realized that
> it had serious problems, particularly around devices being added or
> removed while the update was happening.
Could you explain this in detail? What's the serious problems?
Regards
Xiao
>
> So I decided to try a different approach, which is embodied in these
> patches. The md thread is now explicitly allowed to call
> md_update_sb() while some other thread holds the lock and
> waits for mddev_suspend() to complete.
>
> Please test these and confirm that they still address the problem you
> found.
>
> Thanks,
> NeilBrown
>
> ---
>
> NeilBrown (4):
> md: always hold reconfig_mutex when calling mddev_suspend()
> md: don't call bitmap_create() while array is quiesced.
> md: use mddev_suspend/resume instead of ->quiesce()
> md: allow metadata update while suspending.
>
>
> drivers/md/dm-raid.c | 5 ++++-
> drivers/md/md.c | 45 ++++++++++++++++++++++++++++++++-------------
> drivers/md/md.h | 6 ++++++
> drivers/md/raid5-cache.c | 2 ++
> 4 files changed, 44 insertions(+), 14 deletions(-)
>
> --
> Signature
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-09-30 9:46 ` Xiao Ni
@ 2017-10-05 5:03 ` NeilBrown
2017-10-06 3:40 ` Xiao Ni
0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-10-05 5:03 UTC (permalink / raw)
To: Xiao Ni; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1908 bytes --]
On Sat, Sep 30 2017, Xiao Ni wrote:
> On 09/12/2017 09:49 AM, NeilBrown wrote:
>> Hi,
>> I looked again at the previous patch I posted which tried to make
>> md_update_sb() safe without taking reconfig_mutex, and realized that
>> it had serious problems, particularly around devices being added or
>> removed while the update was happening.
> Could you explain this in detail? What's the serious problems?
My patch allowed md_update_sb() to run without holding ->reconfig_mutex.
md_update_sb() uses rdev_for_each() in several places, so either that
list needs to be kept stable, or md_update_sb() needs to be very
careful while walking the list.
I couldn't just use a spinlock to protect the lists because one of the
rdev_for_each loops calls md_super_write() on some of the devices, which
is not allowed while holding a spinlock.
I decided it was too hard to make md_update_sb() walk the list in a
fully safe manner.
NeilBrown
>
> Regards
> Xiao
>>
>> So I decided to try a different approach, which is embodied in these
>> patches. The md thread is now explicitly allowed to call
>> md_update_sb() while some other thread holds the lock and
>> waits for mddev_suspend() to complete.
>>
>> Please test these and confirm that they still address the problem you
>> found.
>>
>> Thanks,
>> NeilBrown
>>
>> ---
>>
>> NeilBrown (4):
>> md: always hold reconfig_mutex when calling mddev_suspend()
>> md: don't call bitmap_create() while array is quiesced.
>> md: use mddev_suspend/resume instead of ->quiesce()
>> md: allow metadata update while suspending.
>>
>>
>> drivers/md/dm-raid.c | 5 ++++-
>> drivers/md/md.c | 45 ++++++++++++++++++++++++++++++++-------------
>> drivers/md/md.h | 6 ++++++
>> drivers/md/raid5-cache.c | 2 ++
>> 4 files changed, 44 insertions(+), 14 deletions(-)
>>
>> --
>> Signature
>>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
2017-10-05 5:03 ` NeilBrown
@ 2017-10-06 3:40 ` Xiao Ni
0 siblings, 0 replies; 27+ messages in thread
From: Xiao Ni @ 2017-10-06 3:40 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 10/05/2017 01:03 PM, NeilBrown wrote:
> On Sat, Sep 30 2017, Xiao Ni wrote:
>
>> On 09/12/2017 09:49 AM, NeilBrown wrote:
>>> Hi,
>>> I looked again at the previous patch I posted which tried to make
>>> md_update_sb() safe without taking reconfig_mutex, and realized that
>>> it had serious problems, particularly around devices being added or
>>> removed while the update was happening.
>> Could you explain this in detail? What's the serious problems?
> My patch allowed md_update_sb() to run without holding ->reconfig_mutex.
> md_update_sb() uses rdev_for_each() in several places, so either that
> list needs to be kept stable, or md_update_sb() needs to be very
> careful while walking the list.
>
> I couldn't just use a spinlock to protect the lists because one of the
> rdev_for_each loops calls md_super_write() on some of the devices, which
> is not allowed while holding a spinlock.
>
> I decided it was too hard to make md_update_sb() walk the list in a
> fully safe manner.
>
> NeilBrown
Hi Neil
Thanks for this explanation. Now I know the reason.
Regards
Xiao
>
>
>> Regards
>> Xiao
>>> So I decided to try a different approach, which is embodied in these
>>> patches. The md thread is now explicitly allowed to call
>>> md_update_sb() while some other thread holds the lock and
>>> waits for mddev_suspend() to complete.
>>>
>>> Please test these and confirm that they still address the problem you
>>> found.
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>> ---
>>>
>>> NeilBrown (4):
>>> md: always hold reconfig_mutex when calling mddev_suspend()
>>> md: don't call bitmap_create() while array is quiesced.
>>> md: use mddev_suspend/resume instead of ->quiesce()
>>> md: allow metadata update while suspending.
>>>
>>>
>>> drivers/md/dm-raid.c | 5 ++++-
>>> drivers/md/md.c | 45 ++++++++++++++++++++++++++++++++-------------
>>> drivers/md/md.h | 6 ++++++
>>> drivers/md/raid5-cache.c | 2 ++
>>> 4 files changed, 44 insertions(+), 14 deletions(-)
>>>
>>> --
>>> Signature
>>>
^ permalink raw reply [flat|nested] 27+ messages in thread