* [RFC PATCH] md/raid5: set faulty when cannot record badblocks
@ 2020-08-19 13:50 Yufen Yu
  2020-08-19 13:52 ` Yufen Yu
From: Yufen Yu @ 2020-08-19 13:50 UTC
  To: song; +Cc: linux-raid, houtao1

From: y00427003 <yuyufen@huawei.com>

Recently, we hit an I/O hang on raid5 that can be reproduced easily, as
follows:

 $ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd
 $ vgcreate dm-test /dev/sda /dev/sdb /dev/sdc /dev/sdd
 $ lvcreate --type raid5 -L 2G -i 3 -I 64 -n mylv dm-test
 $ echo offline > /sys/block/sda/device/state

 Then issue I/O to trigger raid5 to mark sda as faulty:
 $ dd if=/dev/mapper/dm--test-mylv of=/dev/null bs=4k count=1 iflag=direct

 After that, take another disk, sdb, offline:
 $ echo offline > /sys/block/sdb/device/state
 $ dd if=/dev/mapper/dm--test-mylv of=/dev/null bs=8k count=1 iflag=direct

The 'dd' command then hangs forever and never returns, while the kernel
repeatedly prints messages like:

[  105.381220] blk_update_request: I/O error, dev sdb, sector 10248 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[  105.382955] md/raid:mdX: read error not correctable (sector 8 on dm-3).
[  105.384145] blk_update_request: I/O error, dev sdb, sector 10248 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[  105.385871] md/raid:mdX: read error not correctable (sector 8 on dm-3).
... (the same two messages repeat endlessly)

We found that raid5 keeps reissuing the requested read bio (sector 10248),
but it can never complete. Since sdb is offline, every bio raid5 issues to
that disk returns an error. At the same time, mddev->degraded was already
set to 1 when sda went offline. So raid5_end_read_request() prints
"md/raid:mdX: read error not correctable" and tries to record a badblock.

However, a raid5 array created by dm-raid does not support badblocks.
Thus, rdev_set_badblocks() fails and md_error() tries to set the device
faulty. After commit fb73b357fb98 ("raid5: block failing device if raid
will be failed"), the device cannot be set faulty either. The issued bio
then neither finds a badblock nor sees the device as Faulty, so it is
retried again and again but never completes.

Even when the raid5 array is not created by dm-raid, rdev_set_badblocks()
can still fail once the badblocks count exceeds MAX_BADBLOCKS. If the
array is degraded at that point, the same hang can happen.

To fix this problem, set the disk faulty when badblocks are not supported
or the badblocks count has already reached MAX_BADBLOCKS.
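
In other words, the guard added by fb73b357fb98 should only keep a
device in the array while a new badblock can still be recorded for it.
A hypothetical helper (not part of this patch) expressing that
condition:

	/* Hypothetical helper: can this rdev still record a badblock?
	 * shift == -1 means badblocks are disabled, e.g. for arrays
	 * assembled by dm-raid. */
	static bool rdev_can_record_badblocks(struct md_rdev *rdev)
	{
		return rdev->badblocks.shift != -1 &&
		       rdev->badblocks.count < MAX_BADBLOCKS;
	}

The condition added in the diff below is exactly the body of this
helper.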

Signed-off-by: yuyufen <yuyufen@huawei.com>
---
 drivers/md/raid5.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ef0fd4830803..4bc9cd9a99bb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2704,7 +2704,9 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	spin_lock_irqsave(&conf->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
+	    mddev->degraded == conf->max_degraded &&
+	    (rdev->badblocks.shift != -1 &&
+	     rdev->badblocks.count < MAX_BADBLOCKS)) {
 		/*
 		 * Don't allow to achieve failed state
 		 * Don't try to recover this device
-- 
2.16.2.dirty



* Re: [RFC PATCH] md/raid5: set faulty when cannot record badblocks
  2020-08-19 13:50 [RFC PATCH] md/raid5: set faulty when cannot record badblocks Yufen Yu
@ 2020-08-19 13:52 ` Yufen Yu
From: Yufen Yu @ 2020-08-19 13:52 UTC
  To: song; +Cc: linux-raid, houtao1


On 2020/8/19 21:50, Yufen Yu wrote:
> From: y00427003 <yuyufen@huawei.com>
> 

Oh, the name above is wrong; the correct one is:

Yufen Yu <yuyufen@huawei.com>


