How to online remove an error scsi disk from the system?

linux-kernel.vger.kernel.org archive mirror

* How to online remove an error scsi disk from the system?
@ 2013-02-01  6:13 Tao Ma
  2013-02-01  7:54 ` Bart Van Assche
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Tao Ma @ 2013-02-01  6:13 UTC
  To: linux-scsi, LKML

Hi All,
	In our production system, we have several SATA disks attached to one
machine. When one of the disks fails, jbd2 (yes, we use ext4) hangs
forever and we get messages like the ones below in /var/log/messages.
It seems to me that the I/O sent to the SCSI layer is never returned
with -EIO, which surprises me a little (there should be a timeout
somewhere, right?). We have tried echo "offline" >
/sys/block/sdl/device/state, but it doesn't work. Is there any way
to make the SCSI device return all outstanding I/O requests with EIO
so that every end_io can be called accordingly? Am I missing something
here?

Thanks,
Tao


sd 0:0:11:0: attempting task abort! scmd(ffff88180e900580)
sd 0:0:11:0: [sdl] CDB: Write(10): 2a 00 0d ca e0 3f 00 04 00 00
target0:0:11: handle(0x0015), sas_address(0x500e004aaaaaaa0b), phy(11)
target0:0:11: enclosure_logical_id(0x500e004aaaaaaa00), slot(11)
INFO: task jbd2/sdl1-8:4629 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sdl1-8   D 0000000000000000     0  4629      2 0x00000000
 ffff88180aa79ae0 0000000000000046 ffff88180aa79aa8 0000000000000000
 ffff88007ce0fe40 0000000000015f40 ffff8818102c0638 ffff8818102c0080
 ffff880a9184a100 ffff8818102c0638 0000000105006028 0000000100000000
Call Trace:
 [<ffffffff81236a15>] ? cpumask_next_and+0x25/0x40
 [<ffffffff810122b6>] ? read_tsc+0x16/0x40
 [<ffffffff81093cd9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff810122b6>] ? read_tsc+0x16/0x40
 [<ffffffff81093cd9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff814a8a53>] io_schedule+0x73/0xc0
 [<ffffffff811036a8>] sync_page+0x38/0x50
 [<ffffffff814a927e>] __wait_on_bit+0x5e/0x90
 [<ffffffff81103670>] ? sync_page+0x0/0x50
 [<ffffffff81103845>] wait_on_page_bit+0x75/0x80
 [<ffffffff81089320>] ? wake_bit_function+0x0/0x40
 [<ffffffff811197c7>] ? pagevec_lookup_tag+0x27/0x40
 [<ffffffff81118b55>] write_cache_pages+0x1d5/0x440
 [<ffffffff811172f0>] ? __writepage+0x0/0x40
 [<ffffffff81118de4>] generic_writepages+0x24/0x30
 [<ffffffffa02dc719>] jbd2_journal_commit_transaction+0x3e9/0x1490 [jbd2]
 [<ffffffff81074299>] ? try_to_del_timer_sync+0x49/0xe0
 [<ffffffffa02e2734>] kjournald2+0xb4/0x220 [jbd2]
 [<ffffffff810892e0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02e2680>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff81089166>] kthread+0x96/0xa0
 [<ffffffff8100c08a>] child_rip+0xa/0x20
 [<ffffffff810890d0>] ? kthread+0x0/0xa0
 [<ffffffff8100c080>] ? child_rip+0x0/0x20
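
For reference, the device state written above and the per-command timeout the question asks about can both be read back from sysfs. The sketch below assumes the standard `state` and `timeout` attributes under `/sys/block/<dev>/device/`. The helper name `scsi_dev_info` and its optional alternate-root argument are made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the SCSI device state
# and per-command timeout for a block device such as "sdl".
# $1 = block device name, $2 = sysfs root (defaults to /sys; overridable for testing).
scsi_dev_info() {
    dev=$1
    root=${2:-/sys}
    dir="$root/block/$dev/device"
    if [ ! -r "$dir/state" ] || [ ! -r "$dir/timeout" ]; then
        echo "$dev: no SCSI state/timeout attributes under $dir" >&2
        return 1
    fi
    state=$(cat "$dir/state")
    timeout=$(cat "$dir/timeout")
    echo "$dev state=$state timeout=${timeout}s"
}

# Example (read-only): scsi_dev_info sdl
```

The timeout value is in seconds. The "attempting task abort" line in the log is consistent with that timeout having expired and error handling having started, but reading these attributes only reports state; it does not complete or fail any outstanding request.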



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  6:13 How to online remove an error scsi disk from the system? Tao Ma
@ 2013-02-01  7:54 ` Bart Van Assche
  2013-02-01  9:07   ` Tao Ma
  2013-02-01  9:52   ` Bryn M. Reeves
  2013-02-01  8:50 ` Jack Wang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Bart Van Assche @ 2013-02-01  7:54 UTC
  To: Tao Ma; +Cc: linux-scsi, LKML

On 02/01/13 07:13, Tao Ma wrote:
> 	In our product system, we have several sata disks attached to one
> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will
> hang forever and we will get something in /var/log/messages like below.
> It seems to me that the io sent to the scsi layer is never returned back
> with -EIO which is a little bit surprised for me(It should be a timeout
> somewhere, right?). We have tried echo "offline" >
> /sys/block/sdl/device/state, but it doesn't work. So is there any way
> for us to let the scsi device returns all the io requests back with EIO
> so that all the end_io can be called accordingly? Am I missing something
> here?

Please note that I'm not familiar with SAS. But I found this in 
drivers/scsi/scsi_proc.c:

  * proc_scsi_write - handle writes to /proc/scsi/scsi
  * @file: not used
  * @buf: buffer to write
  * @length: length of buf, at most PAGE_SIZE
  * @ppos: not used
  *
  * Description: this provides a legacy mechanism to add or remove
  * devices by Host, Channel, ID, and Lun.  To use,
  * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
  * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
  * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.

Bart.
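
The four numbers in the quoted comment can usually be read from sysfs rather than guessed: `/sys/block/<dev>/device` is a symlink to a directory named `Host:Channel:Id:Lun` (for `sdl` in the log above, `0:0:11:0`). The sketch below only builds the command string and does not write it anywhere. The helper names `scsi_hcil` and `scsi_remove_cmd` and the alternate-root argument are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: derive "Host Channel Id Lun" for a block device from
# the name of the directory that /sys/block/<dev>/device links to (H:C:I:L).
# $1 = block device name, $2 = sysfs root (defaults to /sys).
scsi_hcil() {
    dev=$1
    root=${2:-/sys}
    target=$(readlink -f "$root/block/$dev/device") || return 1
    # "0:0:11:0" -> "0 0 11 0"
    basename "$target" | tr ':' ' '
}

# Build (but do not run) the legacy /proc/scsi/scsi removal command string.
scsi_remove_cmd() {
    hcil=$(scsi_hcil "$@") || return 1
    echo "scsi remove-single-device $hcil"
}

# Example: scsi_remove_cmd sdl
# The printed string is what would be echoed into /proc/scsi/scsi.
```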


* RE: How to online remove an error scsi disk from the system?
  2013-02-01  6:13 How to online remove an error scsi disk from the system? Tao Ma
  2013-02-01  7:54 ` Bart Van Assche
@ 2013-02-01  8:50 ` Jack Wang
  2013-02-01  9:17   ` Tao Ma
  2013-02-01 14:41 ` Hillf Danton
  2013-10-16 16:22 ` taco
  3 siblings, 1 reply; 14+ messages in thread
From: Jack Wang @ 2013-02-01  8:50 UTC
  To: 'Tao Ma', linux-scsi, 'LKML'

> Hi All,
> 	In our product system, we have several sata disks attached to one
> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will hang
> forever and we will get something in /var/log/messages like below.
> It seems to me that the io sent to the scsi layer is never returned back
> with -EIO which is a little bit surprised for me(It should be a timeout
> somewhere, right?). We have tried echo "offline" >
> /sys/block/sdl/device/state, but it doesn't work. So is there any way for us
> to let the scsi device returns all the io requests back with EIO so that all
> the end_io can be called accordingly? Am I missing something here?
>
> Thanks,
> Tao

[Jack Wang] 
Hi Tao,

Have you tried:
 echo 1 > /sys/block/sdv/device/delete
 echo "- - -" > /sys/class/scsi_host/host

Another way is to find out which phy the disk is attached to, then run:
echo 1 > /sys/class/sas_phy/phy-x:x:x/link_reset

Jack
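
The two sysfs writes suggested above can be wrapped as small functions. This is a sketch under assumptions: the function names and the alternate-root argument are invented for illustration, and the rescan path uses the full `host<N>/scan` form that is spelled out later in the thread. On a device whose I/O is already stuck, the write to `delete` may itself block, so this is not a guaranteed way out.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the two sysfs writes suggested above.
# The last argument is the sysfs root (defaults to /sys; overridable for testing).

# Ask the SCSI layer to remove a device, e.g. scsi_delete sdl
scsi_delete() {
    root=${2:-/sys}
    echo 1 > "$root/block/$1/device/delete"
}

# Ask a host adapter to rescan all channels, targets and LUNs,
# e.g. scsi_rescan 0 for host0. "- - -" is the wildcard triple.
scsi_rescan() {
    root=${2:-/sys}
    echo "- - -" > "$root/class/scsi_host/host$1/scan"
}
```

Passing the sysfs root explicitly keeps the functions easy to try against a scratch directory before pointing them at a live system.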



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  7:54 ` Bart Van Assche
@ 2013-02-01  9:07   ` Tao Ma
  2013-02-01  9:52   ` Bryn M. Reeves
  1 sibling, 0 replies; 14+ messages in thread
From: Tao Ma @ 2013-02-01  9:07 UTC
  To: Bart Van Assche; +Cc: linux-scsi, LKML

On 02/01/2013 03:54 PM, Bart Van Assche wrote:
> On 02/01/13 07:13, Tao Ma wrote:
>>     In our product system, we have several sata disks attached to one
>> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will
>> hang forever and we will get something in /var/log/messages like below.
>> It seems to me that the io sent to the scsi layer is never returned back
>> with -EIO which is a little bit surprised for me(It should be a timeout
>> somewhere, right?). We have tried echo "offline" >
>> /sys/block/sdl/device/state, but it doesn't work. So is there any way
>> for us to let the scsi device returns all the io requests back with EIO
>> so that all the end_io can be called accordingly? Am I missing something
>> here?
> 
> Please note that I'm not familiar with SAS. But I found this in
> drivers/scsi/scsi_proc.c:
> 
>  * proc_scsi_write - handle writes to /proc/scsi/scsi
>  * @file: not used
>  * @buf: buffer to write
>  * @length: length of buf, at most PAGE_SIZE
>  * @ppos: not used
>  *
>  * Description: this provides a legacy mechanism to add or remove
>  * devices by Host, Channel, ID, and Lun.  To use,
>  * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
>  * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
>  * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.
Sorry, it doesn't work: it also sends some I/O to the SCSI device, and
then it hangs...

bash          D 0000000000000000     0 57479  57477 0x00000000
 ffff8817fee2dba0 0000000000000086 0000000000000000 0000000000000002
 ffffffff817c4ed5 0000000000015f40 ffff88180c7e45f8 ffff88180c7e4040
 ffffffff81a2d020 ffff88180c7e45f8 000000010fa4af09 0000000000000004
Call Trace:
 [<ffffffff8123eecf>] ? string+0x3f/0xd0
 [<ffffffff8123fdc2>] ? vsnprintf+0x242/0x580
 [<ffffffff811a0b14>] ? fsnotify_clear_marks_by_inode+0x34/0xf0
 [<ffffffff811d33c0>] ? sysfs_delete_inode+0x0/0x60
 [<ffffffff814aa5c5>] rwsem_down_failed_common+0x95/0x1c0
 [<ffffffff814aa746>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81241814>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff812385a0>] ? kobject_release+0x0/0x1f0
 [<ffffffff814a9cd4>] ? down_read+0x24/0x30
 [<ffffffff81167794>] get_super+0x74/0xc0
 [<ffffffff8119aa9e>] fsync_bdev+0x1e/0x60
 [<ffffffff812253ce>] invalidate_partition+0x2e/0x60
 [<ffffffff811d0bfe>] del_gendisk+0x3e/0x130
 [<ffffffff813070da>] ? device_del+0x16a/0x1a0
 [<ffffffff8132f437>] sd_remove+0x67/0xb0
 [<ffffffff8130adcf>] __device_release_driver+0x6f/0xe0
 [<ffffffff8130ae6d>] device_release_driver+0x2d/0x40
 [<ffffffff8130a723>] bus_remove_device+0x83/0xe0
 [<ffffffff8130709f>] device_del+0x12f/0x1a0
 [<ffffffff8132a7f5>] __scsi_remove_device+0xa5/0xb0
 [<ffffffff8132a830>] scsi_remove_device+0x30/0x50
 [<ffffffff8132cc3f>] proc_scsi_write+0x23f/0x280
 [<ffffffff81182869>] ? mntput_no_expire+0x39/0xd0
 [<ffffffff811c482f>] proc_reg_write+0x7f/0xc0
 [<ffffffff81165c6c>] vfs_write+0xcc/0x1a0
 [<ffffffff81165e25>] sys_write+0x55/0x90
 [<ffffffff8100b032>] system_call_fastpath+0x16/0x1b


Thanks,
Tao
> 
> Bart.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  8:50 ` Jack Wang
@ 2013-02-01  9:17   ` Tao Ma
  2013-02-01  9:24     ` Jack Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Tao Ma @ 2013-02-01  9:17 UTC
  To: Jack Wang; +Cc: linux-scsi, 'LKML'

On 02/01/2013 04:50 PM, Jack Wang wrote:
> Hi All,
> 	In our product system, we have several sata disks attached to one
> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will hang
> forever and we will get something in /var/log/messages like below.
> It seems to me that the io sent to the scsi layer is never returned back
> with -EIO which is a little bit surprised for me(It should be a timeout
> somewhere, right?). We have tried echo "offline" >
> /sys/block/sdl/device/state, but it doesn't work. So is there any way for us
> to let the scsi device returns all the io requests back with EIO so that all
> the end_io can be called accordingly? Am I missing something here?
> 
> Thanks,
> Tao
> [Jack Wang] 
> Hi Tao,
> 
> Have you tried:
>  echo 1 > /sys/block/sdv/device/delete
It does some I/O first, so it hangs waiting on that I/O.
>  echo "- - -" > /sys/class/scsi_host/host
What do you mean by this line?
> 
> another way is :
> find out which phy the disk attached to and:
> echo 1 > /sys/class/sas_phy/phy-x:x:x/link_reset
Sorry, I have tried that, but there is no response.

Thanks,
Tao
> 
> Jack
> 



* RE: How to online remove an error scsi disk from the system?
  2013-02-01  9:17   ` Tao Ma
@ 2013-02-01  9:24     ` Jack Wang
  2013-02-01  9:48       ` Tao Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Jack Wang @ 2013-02-01  9:24 UTC
  To: 'Tao Ma'; +Cc: linux-scsi, 'LKML'


> On 02/01/2013 04:50 PM, Jack Wang wrote:
>> Hi All,
>> 	In our product system, we have several sata disks attached to one
>> machine. So when one of the disk fails, the jbd2(yes, we use ext4)
>> will hang forever and we will get something in /var/log/messages like
>> below.
>> It seems to me that the io sent to the scsi layer is never returned
>> back with -EIO which is a little bit surprised for me(It should be a
>> timeout somewhere, right?). We have tried echo "offline" >
>> /sys/block/sdl/device/state, but it doesn't work. So is there any way
>> for us to let the scsi device returns all the io requests back with
>> EIO so that all the end_io can be called accordingly? Am I missing
>> something here?
>>
>> Thanks,
>> Tao
>> [Jack Wang]
>> Hi Tao,
>>
>> Have you tried:
>>  echo 1 > /sys/block/sdv/device/delete
> It will do some IO first so it will hang doing IO.
>>  echo "- - -" > /sys/class/scsi_host/host
> What do you mean for this line?

[Jack Wang] Sorry, I meant to make the driver rescan to get the disk back.
The line should be:
 echo "- - -" > /sys/class/scsi_host/hostx/scan

As noted above, delete does not work, so there is no need to run this.
>>
>> another way is :
>> find out which phy the disk attached to and:
>> echo 1 > /sys/class/sas_phy/phy-x:x:x/link_reset
> sorry, I have done it, but there is no response.

[Jack Wang]
What about
echo 1 > /sys/class/sas_phy/phy-x:x:x/hard_reset
?
> Thanks,
> Tao
>>
>> Jack



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  9:24     ` Jack Wang
@ 2013-02-01  9:48       ` Tao Ma
  0 siblings, 0 replies; 14+ messages in thread
From: Tao Ma @ 2013-02-01  9:48 UTC
  To: Jack Wang; +Cc: linux-scsi, 'LKML'

On 02/01/2013 05:24 PM, Jack Wang wrote:
> 
> On 02/01/2013 04:50 PM, Jack Wang wrote:
>> Hi All,
>> 	In our product system, we have several sata disks attached to one 
>> machine. So when one of the disk fails, the jbd2(yes, we use ext4) 
>> will hang forever and we will get something in /var/log/messages like
>> below.
>> It seems to me that the io sent to the scsi layer is never returned 
>> back with -EIO which is a little bit surprised for me(It should be a 
>> timeout somewhere, right?). We have tried echo "offline" > 
>> /sys/block/sdl/device/state, but it doesn't work. So is there any way 
>> for us to let the scsi device returns all the io requests back with 
>> EIO so that all the end_io can be called accordingly? Am I missing
>> something here?
>>
>> Thanks,
>> Tao
>> [Jack Wang]
>> Hi Tao,
>>
>> Have you tried:
>>  echo 1 > /sys/block/sdv/device/delete
> It will do some IO first so it will hang doing IO.
>>  echo "- - -" > /sys/class/scsi_host/host
> What do you mean for this line?
> 
> [Jack Wang] Sorry I mean to let the driver rescan to get the disk back.
> The line should be :
>  echo "- - -" > /sys/class/scsi_host/hostx/scan.
> 
> Per above delete does not work , so no need to run this.
>>
>> another way is :
>> find out which phy the disk attached to and:
>> echo 1 > /sys/class/sas_phy/phy-x:x:x/link_reset
> sorry, I have done it, but there is no response.
> 
> [Jack Wang] 
> What about
> echo 1 > /sys/class/sas_phy/phy-x:x:x/hard_reset
sorry, no response either.

Thanks,
Tao
> 
> ?
> Thanks,
> Tao
>>
>> Jack
>>



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  7:54 ` Bart Van Assche
  2013-02-01  9:07   ` Tao Ma
@ 2013-02-01  9:52   ` Bryn M. Reeves
  2013-02-01  9:59     ` Tao Ma
  1 sibling, 1 reply; 14+ messages in thread
From: Bryn M. Reeves @ 2013-02-01  9:52 UTC
  To: Bart Van Assche; +Cc: Tao Ma, linux-scsi, LKML

On 02/01/2013 07:54 AM, Bart Van Assche wrote:
>   * proc_scsi_write - handle writes to /proc/scsi/scsi
>   * @file: not used
>   * @buf: buffer to write
>   * @length: length of buf, at most PAGE_SIZE
>   * @ppos: not used
>   *
>   * Description: this provides a legacy mechanism to add or remove
>   * devices by Host, Channel, ID, and Lun.  To use,
>   * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
>   * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
>   * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.

The proc interface is deprecated; this can all be done via sysfs today, 
e.g.:

echo 1 > /sys/block/sdc/device/delete

This is equivalent to issuing scsi remove-single-device to proc.

Regards,
Bryn.




* Re: How to online remove an error scsi disk from the system?
  2013-02-01  9:52   ` Bryn M. Reeves
@ 2013-02-01  9:59     ` Tao Ma
  2013-02-01 10:07       ` Bryn M. Reeves
  0 siblings, 1 reply; 14+ messages in thread
From: Tao Ma @ 2013-02-01  9:59 UTC
  To: Bryn M. Reeves; +Cc: Bart Van Assche, linux-scsi, LKML

On 02/01/2013 05:52 PM, Bryn M. Reeves wrote:
> On 02/01/2013 07:54 AM, Bart Van Assche wrote:
>>   * proc_scsi_write - handle writes to /proc/scsi/scsi
>>   * @file: not used
>>   * @buf: buffer to write
>>   * @length: length of buf, at most PAGE_SIZE
>>   * @ppos: not used
>>   *
>>   * Description: this provides a legacy mechanism to add or remove
>>   * devices by Host, Channel, ID, and Lun.  To use,
>>   * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
>>   * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
>>   * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.
> 
> The proc interface is deprecated; this can all be done via sysfs today,
> e.g.:
> 
> echo 1 > /sys/block/sdc/device/delete
> 
> Is equivalent to issuing scsi remove-single-device to proc.
Yes, but the result is the same. It does some I/O first, which
causes the command to hang.

Thanks,
Tao
> 
> Regards,
> Bryn.



* Re: How to online remove an error scsi disk from the system?
  2013-02-01  9:59     ` Tao Ma
@ 2013-02-01 10:07       ` Bryn M. Reeves
  2013-02-01 11:13         ` Tao Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Bryn M. Reeves @ 2013-02-01 10:07 UTC
  To: Tao Ma; +Cc: Bart Van Assche, linux-scsi, LKML

On 02/01/2013 09:59 AM, Tao Ma wrote:
> yes, but the result is the same. It will do some IO first which will
> cause this command hang.

You seem to have a problem with either the device/adapter or in the 
driver. The backtrace you posted shows that jbd2 (ext4) is still waiting 
on IO that's been submitted to an mpt2sas or mpt3sas adapter (I only 
know that because I recognise their log messages - you should try to 
include relevant details like this when seeking assistance).

The adapter/driver hasn't completed the IO and it looks like the SCSI 
layer is trying to abort it. Depending on the state of the driver and 
hardware your only option might be to reboot (or physically hot remove 
the device if your hardware allows it).

You don't mention the versions of the kernel and driver you're using.
If the system is in production, I would suggest contacting whoever
normally provides support for the kernel and distribution that you are
running.

Regards,
Bryn.


* Re: How to online remove an error scsi disk from the system?
  2013-02-01 10:07       ` Bryn M. Reeves
@ 2013-02-01 11:13         ` Tao Ma
  2013-02-01 11:20           ` Bryn M. Reeves
  0 siblings, 1 reply; 14+ messages in thread
From: Tao Ma @ 2013-02-01 11:13 UTC
  To: Bryn M. Reeves; +Cc: Bart Van Assche, linux-scsi, LKML

On 02/01/2013 06:07 PM, Bryn M. Reeves wrote:
> On 02/01/2013 09:59 AM, Tao Ma wrote:
>> yes, but the result is the same. It will do some IO first which will
>> cause this command hang.
> 
> You seem to have a problem with either the device/adapter or in the
> driver. The backtrace you posted shows that jbd2 (ext4) is still waiting
> on IO that's been submitted to an mpt2sas or mpt3sas adapter (I only
> know that because I recognise their log messages - you should try to
> include relevant details like this when seeking assistance).
This should be an mpt2sas adapter:
#lsmod|grep mpt
mptctl                 96789  0
mptbase                97052  1 mptctl
mpt2sas               164962  18
scsi_transport_sas     35232  3 isci,libsas,mpt2sas
raid_class              4746  1 mpt2sas

The system has 12 sata disks. What else do you need? I am willing to
provide any details you want.

> 
> The adapter/driver hasn't completed the IO and it looks like the SCSI
> layer is trying to abort it. Depending on the state of the driver and
> hardware your only option might be to reboot (or physically hot remove
> the device if your hardware allows it).
OK, so let me describe the situation here. This is one of our storage
systems, with twelve 2TB SATA disks in one box. Normally, when one disk
fails, we just want to remove it from the system by *software* and then
continue to use the 11 disks that are left. We have found that an
unsuccessful umount or some other action against the failed disk can
sometimes lead to a bad situation (for example, very high load because
many processes are stuck in the 'D' state). Ideally, if we can remove
the device successfully, all I/O to the disk will fail, there will be
no 'D' processes, and the loadavg will stay low.
> 
> You don't mention the versions of the kernel and driver you're using -
> if the system is in production I would suggest contacting who ever
> normally provides support for the kernel and distribution that you are
> running.
We use CentOS 6.2 and the kernel version is 2.6.32-220.23.1.

Thanks,
Tao
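
The 'D' processes mentioned above are tasks in uninterruptible sleep, which Linux counts toward the load average. A minimal sketch for listing them from `ps` output follows; the function name `filter_d_state` is an assumption for illustration, and it only reports the tasks without unblocking anything.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -eo pid,stat,comm` output on stdin and print
# "PID COMMAND" for every task whose state starts with D (uninterruptible sleep).
filter_d_state() {
    # NR > 1 skips the header line; $2 is the STAT column.
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Example: ps -eo pid,stat,comm | filter_d_state
```

Comparing this list before and after an attempted removal shows whether the blocked tasks were actually released.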


* Re: How to online remove an error scsi disk from the system?
  2013-02-01 11:13         ` Tao Ma
@ 2013-02-01 11:20           ` Bryn M. Reeves
  0 siblings, 0 replies; 14+ messages in thread
From: Bryn M. Reeves @ 2013-02-01 11:20 UTC
  To: Tao Ma; +Cc: Bart Van Assche, linux-scsi, LKML

On 02/01/2013 11:13 AM, Tao Ma wrote:
>> You don't mention the versions of the kernel and driver you're using -
>> if the system is in production I would suggest contacting who ever
>> normally provides support for the kernel and distribution that you are
>> running.
> We use CentOS6.2 and the kernel version is 2.6.32-220.23.1.

This is ancient, even by CentOS or RHEL standards. There are thousands 
of patches in more recent kernels (either at kernel.org or in the 
updates in CentOS repositories).

Nobody on linux-kernel or the other lists you copied is going to want to 
investigate problems on such an old kernel - you'll need to either 
reproduce with something current or seek assistance from the CentOS 
community (who will probably tell you to update your kernel first anyway).

Regards,
Bryn.


* Re: How to online remove an error scsi disk from the system?
  2013-02-01  6:13 How to online remove an error scsi disk from the system? Tao Ma
  2013-02-01  7:54 ` Bart Van Assche
  2013-02-01  8:50 ` Jack Wang
@ 2013-02-01 14:41 ` Hillf Danton
  2013-10-16 16:22 ` taco
  3 siblings, 0 replies; 14+ messages in thread
From: Hillf Danton @ 2013-02-01 14:41 UTC
  To: Tao Ma; +Cc: linux-scsi, LKML

On Fri, Feb 1, 2013 at 2:13 PM, Tao Ma <tm@tao.ma> wrote:
> Hi All,
>         In our product system, we have several sata disks attached to one
> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will
> hang forever and we will get something in /var/log/messages like below.
> It seems to me that the io sent to the scsi layer is never returned back
> with -EIO which is a little bit surprised for me(It should be a timeout
> somewhere, right?). We have tried echo "offline" >
> /sys/block/sdl/device/state, but it doesn't work. So is there any way
> for us to let the scsi device returns all the io requests back with EIO
> so that all the end_io can be called accordingly? Am I missing something
> here?
>
> Thanks,
> Tao
>
Can you try upstream?


* Re: How to online remove an error scsi disk from the system?
  2013-02-01  6:13 How to online remove an error scsi disk from the system? Tao Ma
                   ` (2 preceding siblings ...)
  2013-02-01 14:41 ` Hillf Danton
@ 2013-10-16 16:22 ` taco
  3 siblings, 0 replies; 14+ messages in thread
From: taco @ 2013-10-16 16:22 UTC
  To: Tao Ma; +Cc: linux-scsi, LKML

On Fri, Feb 01, 2013 at 02:13:16PM +0800, Tao Ma wrote:
> Hi All,
> 	In our product system, we have several sata disks attached to one
> machine. So when one of the disk fails, the jbd2(yes, we use ext4) will
> hang forever and we will get something in /var/log/messages like below.
> It seems to me that the io sent to the scsi layer is never returned back
> with -EIO which is a little bit surprised for me(It should be a timeout
> somewhere, right?). We have tried echo "offline" >
> /sys/block/sdl/device/state, but it doesn't work. So is there any way
> for us to let the scsi device returns all the io requests back with EIO
> so that all the end_io can be called accordingly? Am I missing something
> here?
> 
> Thanks,
> Tao
> 
> 
> sd 0:0:11:0: attempting task abort! scmd(ffff88180e900580)
It seems that the I/O timeout causes the HBA driver to abort the scmd, and
the aborted I/O comes back with scmd->result = DID_RESET << 16.
With this result code, the SCSI midlayer retries the I/O.
The I/O times out again because of the bad disk, so it loops forever and
never completes.

This might be a bug in the mpt2sas driver.
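
To make the result code above concrete: the host byte sits in bits 16-23 of `scmd->result`, and `DID_RESET` is `0x08` in the kernel's SCSI headers, so `DID_RESET << 16` is `0x00080000`. The sketch below decodes such a value, for example one copied from a debug log. The function names are illustrative assumptions, and only a few host byte values are mapped.

```shell
#!/bin/sh
# Extract the host byte (bits 16-23) from a SCSI result word.
scsi_host_byte() {
    echo $(( ($1 >> 16) & 0xff ))
}

# Map a few host byte values to their kernel names; values not listed
# here are printed numerically instead of being guessed.
scsi_host_name() {
    case $1 in
        0) echo DID_OK ;;
        3) echo DID_TIME_OUT ;;
        5) echo DID_ABORT ;;
        8) echo DID_RESET ;;
        *) echo "host_byte=$1" ;;
    esac
}

# Example: scsi_host_name "$(scsi_host_byte 0x00080000)"
```

This only decodes the value. Whether the midlayer actually retries a command with this host byte depends on the kernel version and its retry limits.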



Thread overview: 14+ messages
2013-02-01  6:13 How to online remove an error scsi disk from the system? Tao Ma
2013-02-01  7:54 ` Bart Van Assche
2013-02-01  9:07   ` Tao Ma
2013-02-01  9:52   ` Bryn M. Reeves
2013-02-01  9:59     ` Tao Ma
2013-02-01 10:07       ` Bryn M. Reeves
2013-02-01 11:13         ` Tao Ma
2013-02-01 11:20           ` Bryn M. Reeves
2013-02-01  8:50 ` Jack Wang
2013-02-01  9:17   ` Tao Ma
2013-02-01  9:24     ` Jack Wang
2013-02-01  9:48       ` Tao Ma
2013-02-01 14:41 ` Hillf Danton
2013-10-16 16:22 ` taco