* raid5d hangs when stopping an array during reshape
@ 2015-12-30 13:45 Artur Paszkiewicz
  2016-02-24 21:21 ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Artur Paszkiewicz @ 2015-12-30 13:45 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm seeing a hang when trying to stop a RAID5 array that is undergoing
reshape:

[   99.629924] md: reshape of RAID array md0
[   99.631150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[   99.632737] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
[   99.635366] md: using 128k window, over a total of 1047552k.
[  103.819848] md: md0: reshape interrupted.
[  150.127132] INFO: task md0_raid5:3234 blocked for more than 30 seconds.
[  150.128717]       Not tainted 4.4.0-rc5+ #54
[  150.129939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  150.132116] md0_raid5       D ffff88003b1d7ba0 14104  3234      2 0x00000000
[  150.134081]  ffff88003b1d7ba0 ffffffff81e104c0 ffff88003bad0000 ffff88003b1d8000
[  150.137205]  ffff88003d66380c 0000000000000001 ffff88003d663a50 ffff88003d663800
[  150.139994]  ffff88003b1d7bb8 ffffffff81876050 ffff88003d663800 ffff88003b1d7c28
[  150.142606] Call Trace:
[  150.143551]  [<ffffffff81876050>] schedule+0x30/0x80
[  150.144883]  [<ffffffffa005fc80>] raid5_quiesce+0x200/0x250 [raid456]
[  150.147964]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
[  150.149661]  [<ffffffffa0003bca>] mddev_suspend.part.26+0x7a/0x90 [md_mod]
[  150.151376]  [<ffffffffa0003bf7>] mddev_suspend+0x17/0x20 [md_mod]
[  150.153268]  [<ffffffffa0064e29>] check_reshape+0xb9/0x6b0 [raid456]
[  150.154869]  [<ffffffff8107e63f>] ? set_next_entity+0x9f/0x6d0
[  150.156359]  [<ffffffff8107af68>] ? sched_clock_local+0x18/0x80
[  150.157848]  [<ffffffff81081400>] ? pick_next_entity+0xa0/0x150
[  150.159348]  [<ffffffff810830ae>] ? pick_next_task_fair+0x3fe/0x460
[  150.160887]  [<ffffffffa0065471>] raid5_check_reshape+0x51/0xa0 [raid456]
[  150.162482]  [<ffffffffa000ba59>] md_check_recovery+0x2f9/0x480 [md_mod]
[  150.164074]  [<ffffffffa00697b4>] raid5d+0x34/0x650 [raid456]
[  150.165751]  [<ffffffff81876050>] ? schedule+0x30/0x80
[  150.167508]  [<ffffffff818786ef>] ? schedule_timeout+0x1ef/0x270
[  150.169784]  [<ffffffff81875ac3>] ? __schedule+0x313/0x870
[  150.171194]  [<ffffffffa0002e61>] md_thread+0x111/0x130 [md_mod]
[  150.172671]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
[  150.174206]  [<ffffffffa0002d50>] ? find_pers+0x70/0x70 [md_mod]
[  150.175697]  [<ffffffff8106c8d4>] kthread+0xc4/0xe0
[  150.178294]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50
[  150.179745]  [<ffffffff818796df>] ret_from_fork+0x3f/0x70
[  150.181134]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50

Two tasks end up blocked:
 3866 ?        D      0:00 [systemd-udevd]
 4051 ?        D      0:00 [md0_raid5]

This happens when a udev change event is triggered by mdadm -S and it
causes some reads on the array. I think the hang occurs because
raid5_quiesce() is called from the raid5d thread and it blocks waiting
for active_stripes to become 0, which won't happen, since stripes are
released by raid5d. Commit 738a273 ("md/raid5: fix allocation of
'scribble' array.") added mddev_suspend() in resize_chunks(), causing
this problem. Skipping mddev_suspend()/mddev_resume() in resize_chunks()
when running in raid5d context seems to fix it, but I don't think that's
a correct fix...
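
Roughly, the hack I tried looks like this (only a sketch of the workaround,
not a proposed patch; it assumes conf->mddev->thread->tsk identifies the
raid5d task):

static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
{
	/* true when resize_chunks() was entered from the raid5d thread */
	bool in_raid5d = conf->mddev->thread &&
			 current == conf->mddev->thread->tsk;
	int err = 0;

	if (!in_raid5d)
		mddev_suspend(conf->mddev);

	/* ... reallocate the per-cpu scribble buffers as before ... */

	if (!in_raid5d)
		mddev_resume(conf->mddev);

	return err;
}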

Regards,
Artur



* Re: raid5d hangs when stopping an array during reshape
  2015-12-30 13:45 raid5d hangs when stopping an array during reshape Artur Paszkiewicz
@ 2016-02-24 21:21 ` Dan Williams
  2016-02-25  0:03   ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2016-02-24 21:21 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid

On Wed, Dec 30, 2015 at 5:45 AM, Artur Paszkiewicz
<artur.paszkiewicz@intel.com> wrote:
> Hi,
>
> I'm seeing a hang when trying to stop a RAID5 array that is undergoing
> reshape:
>
> [   99.629924] md: reshape of RAID array md0
> [   99.631150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> [   99.632737] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> [   99.635366] md: using 128k window, over a total of 1047552k.
> [  103.819848] md: md0: reshape interrupted.
> [  150.127132] INFO: task md0_raid5:3234 blocked for more than 30 seconds.
> [  150.128717]       Not tainted 4.4.0-rc5+ #54
> [  150.129939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  150.132116] md0_raid5       D ffff88003b1d7ba0 14104  3234      2 0x00000000
> [  150.134081]  ffff88003b1d7ba0 ffffffff81e104c0 ffff88003bad0000 ffff88003b1d8000
> [  150.137205]  ffff88003d66380c 0000000000000001 ffff88003d663a50 ffff88003d663800
> [  150.139994]  ffff88003b1d7bb8 ffffffff81876050 ffff88003d663800 ffff88003b1d7c28
> [  150.142606] Call Trace:
> [  150.143551]  [<ffffffff81876050>] schedule+0x30/0x80
> [  150.144883]  [<ffffffffa005fc80>] raid5_quiesce+0x200/0x250 [raid456]
> [  150.147964]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
> [  150.149661]  [<ffffffffa0003bca>] mddev_suspend.part.26+0x7a/0x90 [md_mod]
> [  150.151376]  [<ffffffffa0003bf7>] mddev_suspend+0x17/0x20 [md_mod]
> [  150.153268]  [<ffffffffa0064e29>] check_reshape+0xb9/0x6b0 [raid456]
> [  150.154869]  [<ffffffff8107e63f>] ? set_next_entity+0x9f/0x6d0
> [  150.156359]  [<ffffffff8107af68>] ? sched_clock_local+0x18/0x80
> [  150.157848]  [<ffffffff81081400>] ? pick_next_entity+0xa0/0x150
> [  150.159348]  [<ffffffff810830ae>] ? pick_next_task_fair+0x3fe/0x460
> [  150.160887]  [<ffffffffa0065471>] raid5_check_reshape+0x51/0xa0 [raid456]
> [  150.162482]  [<ffffffffa000ba59>] md_check_recovery+0x2f9/0x480 [md_mod]
> [  150.164074]  [<ffffffffa00697b4>] raid5d+0x34/0x650 [raid456]
> [  150.165751]  [<ffffffff81876050>] ? schedule+0x30/0x80
> [  150.167508]  [<ffffffff818786ef>] ? schedule_timeout+0x1ef/0x270
> [  150.169784]  [<ffffffff81875ac3>] ? __schedule+0x313/0x870
> [  150.171194]  [<ffffffffa0002e61>] md_thread+0x111/0x130 [md_mod]
> [  150.172671]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
> [  150.174206]  [<ffffffffa0002d50>] ? find_pers+0x70/0x70 [md_mod]
> [  150.175697]  [<ffffffff8106c8d4>] kthread+0xc4/0xe0
> [  150.178294]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50
> [  150.179745]  [<ffffffff818796df>] ret_from_fork+0x3f/0x70
> [  150.181134]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50
>
> Two tasks end up blocked:
>  3866 ?        D      0:00 [systemd-udevd]
>  4051 ?        D      0:00 [md0_raid5]
>
> This happens when udev change event is triggered by mdadm -S and it
> causes some reads on the array. I think the hang occurs because
> raid5_quiesce() is called from the raid5d thread and it blocks waiting
> for active_stripes to become 0, which won't happen, since stripes are
> released by raid5d. Commit 738a273 ("md/raid5: fix allocation of
> 'scribble' array.") added mddev_suspend() in resize_chunks(), causing
> this problem. Skipping mddev_suspend()/mddev_resume() in resize_chunks()
> when running in raid5d context seems to fix it, but I don't think that's
> a correct fix...

One approach to spotting the correct fix might be to go add lockdep
annotations to validate the "locking" order of these events.

See the usage of:

        lock_map_acquire(&wq->lockdep_map);
        lock_map_release(&wq->lockdep_map);

...in the workqueue code as a way to validate flush ordering.  For
example you want lockdep to report when the current thread would
deadlock due to a circular or ABBA dependency.
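
A rough, untested sketch of what I mean (md_suspend_map / md_suspend_key are
made-up names, just for illustration):

/* a pseudo-lock representing "mddev_suspend() has to wait for the md thread" */
static struct lock_class_key md_suspend_key;
static struct lockdep_map md_suspend_map =
	STATIC_LOCKDEP_MAP_INIT("mddev_suspend", &md_suspend_key);

void mddev_suspend(struct mddev *mddev)
{
	/* the waiter "acquires" the dependency before it blocks */
	lock_map_acquire(&md_suspend_map);
	lock_map_release(&md_suspend_map);
	/* ... existing suspend logic ... */
}

/* and in md_thread(), around the work that mddev_suspend() waits on: */
	lock_map_acquire(&md_suspend_map);
	thread->run(thread);		/* e.g. raid5d() */
	lock_map_release(&md_suspend_map);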


* Re: raid5d hangs when stopping an array during reshape
  2016-02-24 21:21 ` Dan Williams
@ 2016-02-25  0:03   ` Shaohua Li
  2016-02-25  0:31     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-02-25  0:03 UTC (permalink / raw)
  To: Neil Brown, Dan Williams; +Cc: Artur Paszkiewicz, linux-raid

On Wed, Feb 24, 2016 at 01:21:08PM -0800, Dan Williams wrote:
> On Wed, Dec 30, 2015 at 5:45 AM, Artur Paszkiewicz
> <artur.paszkiewicz@intel.com> wrote:
> > Hi,
> >
> > I'm seeing a hang when trying to stop a RAID5 array that is undergoing
> > reshape:
> >
> > [   99.629924] md: reshape of RAID array md0
> > [   99.631150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> > [   99.632737] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> > [   99.635366] md: using 128k window, over a total of 1047552k.
> > [  103.819848] md: md0: reshape interrupted.
> > [  150.127132] INFO: task md0_raid5:3234 blocked for more than 30 seconds.
> > [  150.128717]       Not tainted 4.4.0-rc5+ #54
> > [  150.129939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  150.132116] md0_raid5       D ffff88003b1d7ba0 14104  3234      2 0x00000000
> > [  150.134081]  ffff88003b1d7ba0 ffffffff81e104c0 ffff88003bad0000 ffff88003b1d8000
> > [  150.137205]  ffff88003d66380c 0000000000000001 ffff88003d663a50 ffff88003d663800
> > [  150.139994]  ffff88003b1d7bb8 ffffffff81876050 ffff88003d663800 ffff88003b1d7c28
> > [  150.142606] Call Trace:
> > [  150.143551]  [<ffffffff81876050>] schedule+0x30/0x80
> > [  150.144883]  [<ffffffffa005fc80>] raid5_quiesce+0x200/0x250 [raid456]
> > [  150.147964]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
> > [  150.149661]  [<ffffffffa0003bca>] mddev_suspend.part.26+0x7a/0x90 [md_mod]
> > [  150.151376]  [<ffffffffa0003bf7>] mddev_suspend+0x17/0x20 [md_mod]
> > [  150.153268]  [<ffffffffa0064e29>] check_reshape+0xb9/0x6b0 [raid456]
> > [  150.154869]  [<ffffffff8107e63f>] ? set_next_entity+0x9f/0x6d0
> > [  150.156359]  [<ffffffff8107af68>] ? sched_clock_local+0x18/0x80
> > [  150.157848]  [<ffffffff81081400>] ? pick_next_entity+0xa0/0x150
> > [  150.159348]  [<ffffffff810830ae>] ? pick_next_task_fair+0x3fe/0x460
> > [  150.160887]  [<ffffffffa0065471>] raid5_check_reshape+0x51/0xa0 [raid456]
> > [  150.162482]  [<ffffffffa000ba59>] md_check_recovery+0x2f9/0x480 [md_mod]
> > [  150.164074]  [<ffffffffa00697b4>] raid5d+0x34/0x650 [raid456]
> > [  150.165751]  [<ffffffff81876050>] ? schedule+0x30/0x80
> > [  150.167508]  [<ffffffff818786ef>] ? schedule_timeout+0x1ef/0x270
> > [  150.169784]  [<ffffffff81875ac3>] ? __schedule+0x313/0x870
> > [  150.171194]  [<ffffffffa0002e61>] md_thread+0x111/0x130 [md_mod]
> > [  150.172671]  [<ffffffff810882a0>] ? prepare_to_wait_event+0xf0/0xf0
> > [  150.174206]  [<ffffffffa0002d50>] ? find_pers+0x70/0x70 [md_mod]
> > [  150.175697]  [<ffffffff8106c8d4>] kthread+0xc4/0xe0
> > [  150.178294]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50
> > [  150.179745]  [<ffffffff818796df>] ret_from_fork+0x3f/0x70
> > [  150.181134]  [<ffffffff8106c810>] ? kthread_park+0x50/0x50
> >
> > Two tasks end up blocked:
> >  3866 ?        D      0:00 [systemd-udevd]
> >  4051 ?        D      0:00 [md0_raid5]
> >
> > This happens when udev change event is triggered by mdadm -S and it
> > causes some reads on the array. I think the hang occurs because
> > raid5_quiesce() is called from the raid5d thread and it blocks waiting
> > for active_stripes to become 0, which won't happen, since stripes are
> > released by raid5d. Commit 738a273 ("md/raid5: fix allocation of
> > 'scribble' array.") added mddev_suspend() in resize_chunks(), causing
> > this problem. Skipping mddev_suspend()/mddev_resume() in resize_chunks()
> > when running in raid5d context seems to fix it, but I don't think that's
> > a correct fix...
> 
> One approach to spotting the correct fix might be to go add lockdep
> annotations to validate the "locking" order of these events.
> 
> See the usage of:
> 
>         lock_map_acquire(&wq->lockdep_map);
>         lock_map_release(&wq->lockdep_map);
> 
> ...in the workqueue code as a way to validate flush ordering.  For
> example you want lockdep to report when the current thread would
> deadlock due to a circular or ABBA dependency.

Yes, we should really add lockdep here.

As for the bug: write requests run in raid5d, and mddev_suspend() waits for all
IO, which in turn waits for those write requests, so this is a clear deadlock.
I think we should delete the check_reshape() call in md_check_recovery(). If we
change layout/disks/chunk_size, check_reshape() is already called. If we start
an array, .run() already handles the new layout. There is no point in
md_check_recovery() calling check_reshape() again.

Artur, can you check if below works for you?


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 464627b..7fb1103 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8408,8 +8408,7 @@ void md_check_recovery(struct mddev *mddev)
 		 */
 
 		if (mddev->reshape_position != MaxSector) {
-			if (mddev->pers->check_reshape == NULL ||
-			    mddev->pers->check_reshape(mddev) != 0)
+			if (mddev->pers->check_reshape == NULL)
 				/* Cannot proceed */
 				goto not_running;
 			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);

Thanks,
Shaohua


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25  0:03   ` Shaohua Li
@ 2016-02-25  0:31     ` NeilBrown
  2016-02-25  1:17       ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2016-02-25  0:31 UTC (permalink / raw)
  To: Shaohua Li, Dan Williams; +Cc: Artur Paszkiewicz, linux-raid

On Thu, Feb 25 2016, Shaohua Li wrote:

>
> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
> which waits for the write requests. So this is a clear deadlock. I think we
> should delete the check_reshape() in md_check_recovery(). If we change
> layout/disks/chunk_size, check_reshape() is already called. If we start an
> array, the .run() already handles new layout. There is no point
> md_check_recovery() check_reshape() again.

Are you sure?
Did you look at the commit which added that code?
commit b4c4c7b8095298ff4ce20b40bf180ada070812d0

When there is an IO error, reshape (or resync or recovery) will abort
and then possibly be automatically restarted.

Without the check here a reshape might be attempted on an array which
has failed.  Not sure if that would be harmful, but it would certainly
be pointless.

But you are right that this is causing the problem.
Maybe we should keep track of the size of the 'scribble' arrays and only
call resize_chunks if the size needs to change?  Similar to what
resize_stripes does.
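
i.e. something like this (rough sketch only; scribble_disks and
scribble_sectors would be new fields in struct r5conf):

static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
{
	int err = 0;

	/* nothing to do if the current scribble buffers are already big enough */
	if (conf->scribble_disks >= new_disks &&
	    conf->scribble_sectors >= new_sectors)
		return 0;

	mddev_suspend(conf->mddev);
	/* ... reallocate the per-cpu scribble buffers as before ... */
	mddev_resume(conf->mddev);

	if (!err) {
		conf->scribble_disks = new_disks;
		conf->scribble_sectors = new_sectors;
	}
	return err;
}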

It might also be good to put something like
  WARN_ON(current == mddev->thread->tsk);
in mddev_suspend() ... or whatever code would cause this sort of error
to trigger a warning early.

Thanks,
NeilBrown

>
> Artur, can you check if below works for you?
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 464627b..7fb1103 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8408,8 +8408,7 @@ void md_check_recovery(struct mddev *mddev)
>  		 */
>  
>  		if (mddev->reshape_position != MaxSector) {
> -			if (mddev->pers->check_reshape == NULL ||
> -			    mddev->pers->check_reshape(mddev) != 0)
> +			if (mddev->pers->check_reshape == NULL)
>  				/* Cannot proceed */
>  				goto not_running;
>  			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
>
> Thanks,
> Shaohua


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25  0:31     ` NeilBrown
@ 2016-02-25  1:17       ` Shaohua Li
  2016-02-25 16:05         ` Artur Paszkiewicz
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-02-25  1:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, Artur Paszkiewicz, linux-raid

On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
> On Thu, Feb 25 2016, Shaohua Li wrote:
> 
> >
> > As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
> > which waits for the write requests. So this is a clear deadlock. I think we
> > should delete the check_reshape() in md_check_recovery(). If we change
> > layout/disks/chunk_size, check_reshape() is already called. If we start an
> > array, the .run() already handles new layout. There is no point
> > md_check_recovery() check_reshape() again.
> 
> Are you sure?
> Did you look at the commit which added that code?
> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
> 
> When there is an IO error, reshape (or resync or recovery) will abort
> and then possibly be automatically restarted.

Thanks for pointing this out.
> Without the check here a reshape might be attempted on an array which
> has failed.  Not sure if that would be harmful, but it would certainly
> be pointless.
> 
> But you are right that this is causing the problem.
> Maybe we should keep track of the size of the 'scribble' arrays and only
> call resize_chunks if the size needs to change?  Similar to what
> resize_stripes does.

Yep, this was my first solution, but I later thought check_reshape() was
useless here; apparently I missed the restart case. I'll go this way.

> It might also be good to put something like
>   WARN_ON(current == mddev->thread->task);
> in mddev_suspend() ... or whatever code would cause this sort of error
> to trigger a warning early.

Sounds good.

Thanks,
Shaohua


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25  1:17       ` Shaohua Li
@ 2016-02-25 16:05         ` Artur Paszkiewicz
  2016-02-25 18:42           ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Artur Paszkiewicz @ 2016-02-25 16:05 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown; +Cc: Dan Williams, linux-raid

On 02/25/2016 02:17 AM, Shaohua Li wrote:
> On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
>> On Thu, Feb 25 2016, Shaohua Li wrote:
>>
>>>
>>> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
>>> which waits for the write requests. So this is a clear deadlock. I think we
>>> should delete the check_reshape() in md_check_recovery(). If we change
>>> layout/disks/chunk_size, check_reshape() is already called. If we start an
>>> array, the .run() already handles new layout. There is no point
>>> md_check_recovery() check_reshape() again.
>>
>> Are you sure?
>> Did you look at the commit which added that code?
>> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
>>
>> When there is an IO error, reshape (or resync or recovery) will abort
>> and then possibly be automatically restarted.
> 
> thanks pointing out this. 
>> Without the check here a reshape might be attempted on an array which
>> has failed.  Not sure if that would be harmful, but it would certainly
>> be pointless.
>>
>> But you are right that this is causing the problem.
>> Maybe we should keep track of the size of the 'scribble' arrays and only
>> call resize_chunks if the size needs to change?  Similar to what
>> resize_stripes does.
> 
> yep, this is my first solution, but think check_reshape() is useless here
> later, apparently miss the restart case. I'll go this way.

My idea was to replace mddev_suspend()/mddev_resume() in resize_chunks()
with a rw lock that would prevent collisions with raid_run_ops(), since
scribble is used only there. But if the parity operations are executed
asynchronously this would also need to wait until all the submitted
operations have completed. Seems a bit overkill, but I came up with
this:

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a086014..3b7bbec 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -55,6 +55,7 @@
 #include <linux/ratelimit.h>
 #include <linux/nodemask.h>
 #include <linux/flex_array.h>
+#include <linux/delay.h>
 #include <trace/events/block.h>
 
 #include "md.h"
@@ -1267,6 +1268,8 @@ static void ops_complete_compute(void *stripe_head_ref)
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_dec(&sh->raid_conf->scribble_count);
+
 	/* mark the computed target(s) as uptodate */
 	mark_target_uptodate(sh, sh->ops.target);
 	mark_target_uptodate(sh, sh->ops.target2);
@@ -1314,6 +1317,9 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 
 	pr_debug("%s: stripe %llu block: %d\n",
 		__func__, (unsigned long long)sh->sector, target);
+
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
 
 	for (i = disks; i--; )
@@ -1399,6 +1405,8 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
 	pr_debug("%s: stripe %llu block: %d\n",
 		__func__, (unsigned long long)sh->sector, target);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	tgt = &sh->dev[target];
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
 	dest = tgt->page;
@@ -1449,6 +1457,9 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
 	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu block1: %d block2: %d\n",
 		 __func__, (unsigned long long)sh->sector, target, target2);
+
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	BUG_ON(target < 0 || target2 < 0);
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt2->flags));
@@ -1545,6 +1556,8 @@ static void ops_complete_prexor(void *stripe_head_ref)
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
+
+	atomic_dec(&sh->raid_conf->scribble_count);
 }
 
 static struct dma_async_tx_descriptor *
@@ -1563,6 +1576,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 		/* Only process blocks that are known to be uptodate */
@@ -1588,6 +1603,8 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
@@ -1672,6 +1689,8 @@ static void ops_complete_reconstruct(void *stripe_head_ref)
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_dec(&sh->raid_conf->scribble_count);
+
 	for (i = disks; i--; ) {
 		fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
 		sync |= test_bit(R5_SyncIO, &sh->dev[i].flags);
@@ -1722,6 +1741,8 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	for (i = 0; i < sh->disks; i++) {
 		if (pd_idx == i)
 			continue;
@@ -1804,6 +1825,8 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
 
 	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	for (i = 0; i < sh->disks; i++) {
 		if (sh->pd_idx == i || sh->qd_idx == i)
 			continue;
@@ -1857,6 +1880,8 @@ static void ops_complete_check(void *stripe_head_ref)
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_dec(&sh->raid_conf->scribble_count);
+
 	sh->check_state = check_state_check_result;
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
@@ -1877,6 +1902,8 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	BUG_ON(sh->batch_head);
 	count = 0;
 	xor_dest = sh->dev[pd_idx].page;
@@ -1906,6 +1933,8 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
 	pr_debug("%s: stripe %llu checkp: %d\n", __func__,
 		(unsigned long long)sh->sector, checkp);
 
+	atomic_inc(&sh->raid_conf->scribble_count);
+
 	BUG_ON(sh->batch_head);
 	count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
 	if (!checkp)
@@ -1927,6 +1956,7 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	struct raid5_percpu *percpu;
 	unsigned long cpu;
 
+	down_read(&conf->scribble_lock);
 	cpu = get_cpu();
 	percpu = per_cpu_ptr(conf->percpu, cpu);
 	if (test_bit(STRIPE_OP_BIOFILL, &ops_request)) {
@@ -1985,6 +2015,7 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 				wake_up(&sh->raid_conf->wait_for_overlap);
 		}
 	put_cpu();
+	up_read(&conf->scribble_lock);
 }
 
 static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
@@ -2089,7 +2120,10 @@ static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
 	unsigned long cpu;
 	int err = 0;
 
-	mddev_suspend(conf->mddev);
+	down_write(&conf->scribble_lock);
+	/* wait for async operations using scribble to complete */
+	while (atomic_read(&conf->scribble_count))
+		udelay(10);
 	get_online_cpus();
 	for_each_present_cpu(cpu) {
 		struct raid5_percpu *percpu;
@@ -2109,7 +2143,8 @@ static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
 		}
 	}
 	put_online_cpus();
-	mddev_resume(conf->mddev);
+	up_write(&conf->scribble_lock);
+
 	return err;
 }
 
@@ -6501,6 +6536,8 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	spin_lock_init(&conf->device_lock);
 	seqcount_init(&conf->gen_lock);
 	mutex_init(&conf->cache_size_mutex);
+	init_rwsem(&conf->scribble_lock);
+	atomic_set(&conf->scribble_count, 0);
 	init_waitqueue_head(&conf->wait_for_quiescent);
 	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
 		init_waitqueue_head(&conf->wait_for_stripe[i]);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a415e1c..8361156 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -3,6 +3,7 @@
 
 #include <linux/raid/xor.h>
 #include <linux/dmaengine.h>
+#include <linux/rwsem.h>
 
 /*
  *
@@ -494,6 +495,9 @@ struct r5conf {
 	struct kmem_cache	*slab_cache; /* for allocating stripes */
 	struct mutex		cache_size_mutex; /* Protect changes to cache size */
 
+	struct rw_semaphore	scribble_lock;
+	atomic_t		scribble_count;
+
 	int			seq_flush, seq_write;
 	int			quiesce;
 


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25 16:05         ` Artur Paszkiewicz
@ 2016-02-25 18:42           ` Shaohua Li
  2016-02-25 18:48             ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-02-25 18:42 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: NeilBrown, Dan Williams, linux-raid

On Thu, Feb 25, 2016 at 05:05:17PM +0100, Artur Paszkiewicz wrote:
> On 02/25/2016 02:17 AM, Shaohua Li wrote:
> > On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
> >> On Thu, Feb 25 2016, Shaohua Li wrote:
> >>
> >>>
> >>> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
> >>> which waits for the write requests. So this is a clear deadlock. I think we
> >>> should delete the check_reshape() in md_check_recovery(). If we change
> >>> layout/disks/chunk_size, check_reshape() is already called. If we start an
> >>> array, the .run() already handles new layout. There is no point
> >>> md_check_recovery() check_reshape() again.
> >>
> >> Are you sure?
> >> Did you look at the commit which added that code?
> >> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
> >>
> >> When there is an IO error, reshape (or resync or recovery) will abort
> >> and then possibly be automatically restarted.
> > 
> > thanks pointing out this. 
> >> Without the check here a reshape might be attempted on an array which
> >> has failed.  Not sure if that would be harmful, but it would certainly
> >> be pointless.
> >>
> >> But you are right that this is causing the problem.
> >> Maybe we should keep track of the size of the 'scribble' arrays and only
> >> call resize_chunks if the size needs to change?  Similar to what
> >> resize_stripes does.
> > 
> > yep, this is my first solution, but think check_reshape() is useless here
> > later, apparently miss the restart case. I'll go this way.
> 
> My idea was to replace mddev_suspend()/mddev_resume() in resize_chunks()
> with a rw lock that would prevent collisions with raid_run_ops(), since
> scribble is used only there. But if the parity operations are executed
> asynchronously this would also need to wait until all the submitted
> operations have completed. Seems a bit overkill, but I came up with
> this:

Looks like it should work, but it's overkill indeed, especially the extra
lock; we could replace it with srcu, though. The 'track the scribble array
size' approach is much simpler, so I'd prefer that way. In the future, we
should probably move resize_stripes()/resize_chunks() to .start_reshape();
resize_stripes()/resize_chunks() doesn't really sound like .check_reshape()
material.
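
The srcu variant I mean would be roughly (untested, just to illustrate; the
scribble_srcu field and the bookkeeping for carrying the srcu index from
submission to completion are not worked out here):

/* new field in struct r5conf: struct srcu_struct scribble_srcu; */

/* reader side: enter where the patch does atomic_inc() (ops submission),
 * exit in the matching ops_complete_*() callback, carrying the returned
 * index along with the operation:
 */
	idx = srcu_read_lock(&conf->scribble_srcu);
	...
	srcu_read_unlock(&conf->scribble_srcu, idx);

/* writer side, in resize_chunks(), instead of the rwsem + udelay() polling: */
	synchronize_srcu(&conf->scribble_srcu);	/* waits for in-flight users */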

Thanks,
Shaohua


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25 18:42           ` Shaohua Li
@ 2016-02-25 18:48             ` Dan Williams
  2016-02-25 19:17               ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2016-02-25 18:48 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Artur Paszkiewicz, NeilBrown, linux-raid

On Thu, Feb 25, 2016 at 10:42 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Feb 25, 2016 at 05:05:17PM +0100, Artur Paszkiewicz wrote:
>> On 02/25/2016 02:17 AM, Shaohua Li wrote:
>> > On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
>> >> On Thu, Feb 25 2016, Shaohua Li wrote:
>> >>
>> >>>
>> >>> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
>> >>> which waits for the write requests. So this is a clear deadlock. I think we
>> >>> should delete the check_reshape() in md_check_recovery(). If we change
>> >>> layout/disks/chunk_size, check_reshape() is already called. If we start an
>> >>> array, the .run() already handles new layout. There is no point
>> >>> md_check_recovery() check_reshape() again.
>> >>
>> >> Are you sure?
>> >> Did you look at the commit which added that code?
>> >> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
>> >>
>> >> When there is an IO error, reshape (or resync or recovery) will abort
>> >> and then possibly be automatically restarted.
>> >
>> > thanks pointing out this.
>> >> Without the check here a reshape might be attempted on an array which
>> >> has failed.  Not sure if that would be harmful, but it would certainly
>> >> be pointless.
>> >>
>> >> But you are right that this is causing the problem.
>> >> Maybe we should keep track of the size of the 'scribble' arrays and only
>> >> call resize_chunks if the size needs to change?  Similar to what
>> >> resize_stripes does.
>> >
>> > yep, this is my first solution, but think check_reshape() is useless here
>> > later, apparently miss the restart case. I'll go this way.
>>
>> My idea was to replace mddev_suspend()/mddev_resume() in resize_chunks()
>> with a rw lock that would prevent collisions with raid_run_ops(), since
>> scribble is used only there. But if the parity operations are executed
>> asynchronously this would also need to wait until all the submitted
>> operations have completed. Seems a bit overkill, but I came up with
>> this:
>
> Looks it should work, but it's overkill indead, especially the extra lock, we
> can replace it with srcu though. The 'track scribble array size' is much
> simpler, so I'd prefer that way. In the future, we probably should move
> resize_stripes()/resize_chunks() to .start_reshape().
> resize_stripes()/resize_chunks() sounds not qualified as .check_reshape().
>

Any time a linux-raid mail mentions the raid5_run_ops infrastructure I am
prompted to remind everyone that async_tx needs to die and be up-leveled
into md directly.  The "help wanted" request is still pending.


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25 18:48             ` Dan Williams
@ 2016-02-25 19:17               ` Shaohua Li
  2016-02-25 19:58                 ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-02-25 19:17 UTC (permalink / raw)
  To: Dan Williams; +Cc: Artur Paszkiewicz, NeilBrown, linux-raid

On Thu, Feb 25, 2016 at 10:48:45AM -0800, Dan Williams wrote:
> On Thu, Feb 25, 2016 at 10:42 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Feb 25, 2016 at 05:05:17PM +0100, Artur Paszkiewicz wrote:
> >> On 02/25/2016 02:17 AM, Shaohua Li wrote:
> >> > On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
> >> >> On Thu, Feb 25 2016, Shaohua Li wrote:
> >> >>
> >> >>>
> >> >>> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
> >> >>> which waits for the write requests. So this is a clear deadlock. I think we
> >> >>> should delete the check_reshape() in md_check_recovery(). If we change
> >> >>> layout/disks/chunk_size, check_reshape() is already called. If we start an
> >> >>> array, the .run() already handles new layout. There is no point
> >> >>> md_check_recovery() check_reshape() again.
> >> >>
> >> >> Are you sure?
> >> >> Did you look at the commit which added that code?
> >> >> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
> >> >>
> >> >> When there is an IO error, reshape (or resync or recovery) will abort
> >> >> and then possibly be automatically restarted.
> >> >
> >> > thanks pointing out this.
> >> >> Without the check here a reshape might be attempted on an array which
> >> >> has failed.  Not sure if that would be harmful, but it would certainly
> >> >> be pointless.
> >> >>
> >> >> But you are right that this is causing the problem.
> >> >> Maybe we should keep track of the size of the 'scribble' arrays and only
> >> >> call resize_chunks if the size needs to change?  Similar to what
> >> >> resize_stripes does.
> >> >
> >> > yep, this is my first solution, but think check_reshape() is useless here
> >> > later, apparently miss the restart case. I'll go this way.
> >>
> >> My idea was to replace mddev_suspend()/mddev_resume() in resize_chunks()
> >> with a rw lock that would prevent collisions with raid_run_ops(), since
> >> scribble is used only there. But if the parity operations are executed
> >> asynchronously this would also need to wait until all the submitted
> >> operations have completed. Seems a bit overkill, but I came up with
> >> this:
> >
> > Looks it should work, but it's overkill indead, especially the extra lock, we
> > can replace it with srcu though. The 'track scribble array size' is much
> > simpler, so I'd prefer that way. In the future, we probably should move
> > resize_stripes()/resize_chunks() to .start_reshape().
> > resize_stripes()/resize_chunks() sounds not qualified as .check_reshape().
> >
> 
> Any time any linux-raid mail mentions the raid5_run_ops infrastructure
> I am prompted to remind that async_tx needs to die and be up leveled
> to md directly.  The "help wanted" request is still pending.

A quick search shows async_tx has another user: exofs


* Re: raid5d hangs when stopping an array during reshape
  2016-02-25 19:17               ` Shaohua Li
@ 2016-02-25 19:58                 ` Dan Williams
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2016-02-25 19:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Artur Paszkiewicz, NeilBrown, linux-raid

On Thu, Feb 25, 2016 at 11:17 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Feb 25, 2016 at 10:48:45AM -0800, Dan Williams wrote:
>> On Thu, Feb 25, 2016 at 10:42 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Thu, Feb 25, 2016 at 05:05:17PM +0100, Artur Paszkiewicz wrote:
>> >> On 02/25/2016 02:17 AM, Shaohua Li wrote:
>> >> > On Thu, Feb 25, 2016 at 11:31:04AM +1100, Neil Brown wrote:
>> >> >> On Thu, Feb 25 2016, Shaohua Li wrote:
>> >> >>
>> >> >>>
>> >> >>> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
>> >> >>> which waits for the write requests. So this is a clear deadlock. I think we
>> >> >>> should delete the check_reshape() in md_check_recovery(). If we change
>> >> >>> layout/disks/chunk_size, check_reshape() is already called. If we start an
>> >> >>> array, the .run() already handles new layout. There is no point
>> >> >>> md_check_recovery() check_reshape() again.
>> >> >>
>> >> >> Are you sure?
>> >> >> Did you look at the commit which added that code?
>> >> >> commit b4c4c7b8095298ff4ce20b40bf180ada070812d0
>> >> >>
>> >> >> When there is an IO error, reshape (or resync or recovery) will abort
>> >> >> and then possibly be automatically restarted.
>> >> >
>> >> > thanks pointing out this.
>> >> >> Without the check here a reshape might be attempted on an array which
>> >> >> has failed.  Not sure if that would be harmful, but it would certainly
>> >> >> be pointless.
>> >> >>
>> >> >> But you are right that this is causing the problem.
>> >> >> Maybe we should keep track of the size of the 'scribble' arrays and only
>> >> >> call resize_chunks if the size needs to change?  Similar to what
>> >> >> resize_stripes does.
>> >> >
>> >> > yep, this is my first solution, but think check_reshape() is useless here
>> >> > later, apparently miss the restart case. I'll go this way.
>> >>
>> >> My idea was to replace mddev_suspend()/mddev_resume() in resize_chunks()
>> >> with a rw lock that would prevent collisions with raid_run_ops(), since
>> >> scribble is used only there. But if the parity operations are executed
>> >> asynchronously this would also need to wait until all the submitted
>> >> operations have completed. Seems a bit overkill, but I came up with
>> >> this:
>> >
>> > Looks it should work, but it's overkill indead, especially the extra lock, we
>> > can replace it with srcu though. The 'track scribble array size' is much
>> > simpler, so I'd prefer that way. In the future, we probably should move
>> > resize_stripes()/resize_chunks() to .start_reshape().
>> > resize_stripes()/resize_chunks() sounds not qualified as .check_reshape().
>> >
>>
>> Any time any linux-raid mail mentions the raid5_run_ops infrastructure
>> I am prompted to remind that async_tx needs to die and be up leveled
>> to md directly.  The "help wanted" request is still pending.
>
> A quick search shows async_tx has another user: exofs

Yes, the same up-leveling of the API internals directly into the user needs
to be done there as well.  More help wanted :-).


Thread overview: 10+ messages
2015-12-30 13:45 raid5d hangs when stopping an array during reshape Artur Paszkiewicz
2016-02-24 21:21 ` Dan Williams
2016-02-25  0:03   ` Shaohua Li
2016-02-25  0:31     ` NeilBrown
2016-02-25  1:17       ` Shaohua Li
2016-02-25 16:05         ` Artur Paszkiewicz
2016-02-25 18:42           ` Shaohua Li
2016-02-25 18:48             ` Dan Williams
2016-02-25 19:17               ` Shaohua Li
2016-02-25 19:58                 ` Dan Williams