Re: MD RAID1 deadlock on failed disk

* Re: MD RAID1 deadlock on failed disk
@ 2010-10-27  0:18 Hubert Tonneau
  2010-10-26 23:56 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Hubert Tonneau @ 2010-10-27  0:18 UTC (permalink / raw)
  To: linux-raid

The 2.6.32.24 kernel worked fine.

Hubert Tonneau wrote:
>
> Hi,
> 
> The configuration is:
> Perc H200 controller configured with no RAID (mpt2sas driver),
> 2 SATA disks (sda and sdb),
> Linux MD Software RAID1 (md0),
> stock Linux 2.6.35.7 kernel.
> 
> I hot-unplug the second (sdb) disk, and the result is:
> . as expected, I can read sda device,
> . as expected, any read to sdb device fails,
> . unexpectedly, any read to md0 never returns.
> 
> No oops or anything like that in the kernel log.
> I did not try the same with other kernel releases.
> 
> Regards,
> Hubert Tonneau



* Re: MD RAID1 deadlock on failed disk
  2010-10-27  0:18 MD RAID1 deadlock on failed disk Hubert Tonneau
@ 2010-10-26 23:56 ` Neil Brown
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Brown @ 2010-10-26 23:56 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-raid

On Wed, 27 Oct 2010 00:18:25 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:

> The 2.6.32.24 kernel worked fine.

Is this repeatable?
i.e. does it hang every time you pull a device on a 2.6.35.7 kernel?

If you can reproduce it, could you
   echo t > /proc/sysrq-trigger

and post the output that is written to the kernel log.
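
For anyone repeating this on a test machine, the capture can be scripted. The
following is only a minimal Python sketch: it assumes it runs as root and that
`dmesg` is on the PATH; `request_task_dump` and `save_kernel_log` are
hypothetical helper names, not part of any existing tool.

```python
import subprocess


def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log the state of every task (same as `echo t`)."""
    with open(trigger_path, "w") as trigger:
        trigger.write("t")


def save_kernel_log(out_path, dmesg_cmd=("dmesg",)):
    """Copy the current kernel log to out_path; return the number of lines saved."""
    result = subprocess.run(
        list(dmesg_cmd), check=True, capture_output=True, text=True
    )
    with open(out_path, "w") as out:
        out.write(result.stdout)
    return len(result.stdout.splitlines())


# Typical use, as root, once the array has hung:
#     request_task_dump()
#     save_kernel_log("sysrq-t.log")
```

Saving the log to a file right after the dump keeps as much of it as the kernel
ring buffer still holds; a long task list can still overflow that buffer.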

Thanks,
NeilBrown

> 
> Hubert Tonneau wrote:
> >
> > Hi,
> > 
> > The configuration is:
> > Perc H200 controller configured with no RAID (mpt2sas driver),
> > 2 SATA disks (sda and sdb),
> > Linux MD Software RAID1 (md0),
> > stock Linux 2.6.35.7 kernel.
> > 
> > I hot-unplug the second (sdb) disk, and the result is:
> > . as expected, I can read sda device,
> > . as expected, any read to sdb device fails,
> > . unexpectedly, any read to md0 never returns.
> > 
> > No oops or anything like that in the kernel log.
> > I did not try the same with other kernel releases.
> > 
> > Regards,
> > Hubert Tonneau
> 



* Re: MD RAID1 deadlock on failed disk
@ 2010-10-27 10:44 Hubert Tonneau
  2010-10-27  9:52 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Hubert Tonneau @ 2010-10-27 10:44 UTC (permalink / raw)
  To: linux-scsi; +Cc: Neil Brown

Hi,

The configuration is:
Perc H200 controller configured with no RAID (mpt2sas driver),
2 SATA disks (sda and sdb),
Linux MD Software RAID1 (md0),
stock Linux 2.6.35.7 kernel.

I hot-unplug the second (sdb) disk, and the result is:
. as expected, I can read sda device,
. as expected, any read to sdb device fails,
. unexpectedly, any read to md0 never returns.

No oops or anything like that in the kernel log.
The 2.6.32.24 kernel worked fine; I have not tried any other kernel releases.

Neil Brown asked for the /proc/sysrq-trigger output,
and concluded that the problem is related to 'fw_event0'.
See his answer below.

Regards,
Hubert Tonneau

Neil Brown wrote:
>
> The fw_event0 process is interesting.
> It seems to be hung trying to 'sync' the drive that has just been pulled.
> If that is somehow causing some IO request from the md/raid1 to be delayed
> then that would certainly hang the array.
> 
> There is a section in the middle of the trace which is missing - presumably
> the sysrq-trigger output overflowed a buffer - that isn't uncommon.
> 
> So I cannot see all the timing clearly.
> How long after pulling the drive was this trace taken?
> 
> I suspect that you need to post this to linux-scsi@vger.kernel.org
> and ask about that fw_event0 thread - whether that should happen, whether it
> has been fixed, and whether it could delay pending IO requests.
> 
> NeilBrown


* Re: MD RAID1 deadlock on failed disk
  2010-10-27 10:44 Hubert Tonneau
@ 2010-10-27  9:52 ` Neil Brown
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Brown @ 2010-10-27  9:52 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-scsi

On Wed, 27 Oct 2010 10:44:02 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:

> Hi,
> 
> The configuration is:
> Perc H200 controller configured with no RAID (mpt2sas driver),
> 2 SATA disks (sda and sdb),
> Linux MD Software RAID1 (md0),
> stock Linux 2.6.35.7 kernel.
> 
> I hot-unplug the second (sdb) disk, and the result is:
> . as expected, I can read sda device,
> . as expected, any read to sdb device fails,
> . unexpectedly, any read to md0 never returns.
> 
> No oops or anything like that in the kernel log.
> The 2.6.32.24 kernel worked fine; I have not tried any other kernel releases.
> 
> Neil Brown asked for the /proc/sysrq-trigger output,
> and concluded that the problem is related to 'fw_event0'.
> See his answer below.
> 
> Regards,
> Hubert Tonneau
> 
> 
> Neil Brown wrote:
> >
> > The fw_event0 process is interesting.
> > It seems to be hung trying to 'sync' the drive that has just been pulled.
> > If that is somehow causing some IO request from the md/raid1 to be delayed
> > then that would certainly hang the array.
> > 
> > There is a section in the middle of the trace which is missing - presumably
> > the sysrq-trigger output overflowed a buffer - that isn't uncommon.
> > 
> > So I cannot see all the timing clearly.
> > How long after pulling the drive was this trace taken?
> > 
> > I suspect that you need to post this to linux-scsi@vger.kernel.org
> > and ask about that fw_event0 thread - whether that should happen, whether it
> > has been fixed, and whether it could delay pending IO requests.
> > 
> > NeilBrown

It probably would help to have included the sysrq-T output so the scsi people
could see why I pointed the finger at fw_event0.

Here is that part of the trace:

<6>[  318.881486] fw_event0     D 0000000000000000     0   244      2 0x00000000
<4>[  318.881493]  ffff88081d191570 0000000000000046 ffff880800000000 00000000000158c0
<4>[  318.881500]  ffff88081d191fd8 00000000000158c0 ffff88081d191fd8 ffff88081d188000
<4>[  318.881507]  00000000000158c0 00000000000158c0 ffff88081d191fd8 00000000000158c0
<4>[  318.881514] Call Trace:
<4>[  318.881520]  [<ffffffff815a296d>] schedule_timeout+0x22d/0x310
<4>[  318.881526]  [<ffffffff813a21f0>] ? __scsi_queue_insert+0xb0/0x130
<4>[  318.881533]  [<ffffffff815a252b>] wait_for_common+0xdb/0x1a0
<4>[  318.881540]  [<ffffffff81051910>] ? default_wake_function+0x0/0x20
<4>[  318.881546]  [<ffffffff81294093>] ? __generic_unplug_device+0x33/0x40
<4>[  318.881553]  [<ffffffff815a26cd>] wait_for_completion+0x1d/0x20
<4>[  318.881560]  [<ffffffff8129a9fe>] blk_execute_rq+0x8e/0xf0
<4>[  318.881567]  [<ffffffff8129666c>] ? blk_get_request+0x6c/0xa0
<4>[  318.881573]  [<ffffffff813a129c>] scsi_execute+0xfc/0x160
<4>[  318.881580]  [<ffffffff813a2cec>] scsi_execute_req+0xac/0x180
<4>[  318.881589]  [<ffffffff813c5fd0>] sd_sync_cache+0xd0/0x120
<4>[  318.881598]  [<ffffffff815a187a>] ? printk+0x68/0x6e
<4>[  318.881604]  [<ffffffff813c6283>] sd_shutdown+0x83/0x1b0
<4>[  318.881610]  [<ffffffff813c6562>] sd_remove+0x62/0xa0
<4>[  318.881618]  [<ffffffff81377555>] __device_release_driver+0x75/0xe0
<4>[  318.881624]  [<ffffffff81377acd>] device_release_driver+0x2d/0x40
<4>[  318.881631]  [<ffffffff81376532>] bus_remove_device+0xb2/0xf0
<4>[  318.881637]  [<ffffffff81374237>] device_del+0x127/0x1b0
<4>[  318.881644]  [<ffffffff813a74d5>] __scsi_remove_device+0xb5/0xc0
<4>[  318.881650]  [<ffffffff813a7510>] scsi_remove_device+0x30/0x50
<4>[  318.881656]  [<ffffffff813a7601>] __scsi_remove_target+0xb1/0xe0
<4>[  318.881662]  [<ffffffff813a76a0>] ? __remove_child+0x0/0x30
<4>[  318.881667]  [<ffffffff813a76c3>] __remove_child+0x23/0x30
<4>[  318.881673]  [<ffffffff8137399c>] device_for_each_child+0x4c/0x80
<4>[  318.881679]  [<ffffffff813a766e>] scsi_remove_target+0x3e/0x70
<4>[  318.881686]  [<ffffffff813abcc5>] sas_rphy_remove+0x75/0x80
<4>[  318.881692]  [<ffffffff813ac266>] sas_rphy_delete+0x16/0x30
<4>[  318.881698]  [<ffffffff813ac2aa>] sas_port_delete+0x2a/0x130
<4>[  318.881704]  [<ffffffff813bf3ca>] mpt2sas_transport_port_remove+0x15a/0x240
<4>[  318.881711]  [<ffffffff813ba9ed>] _scsih_remove_device+0xcd/0x120
<4>[  318.881720]  [<ffffffff81035d09>] ? default_spin_lock_flags+0x9/0x10
<4>[  318.881726]  [<ffffffff813bea00>] ? mpt2sas_transport_update_links+0x80/0x1a0
<4>[  318.881733]  [<ffffffff813be0ee>] _firmware_event_work+0x155e/0x1af0
<4>[  318.881742]  [<ffffffff8100860b>] ? __switch_to+0xcb/0x350
<4>[  318.881749]  [<ffffffff8104de5a>] ? finish_task_switch+0x4a/0xd0
<4>[  318.881756]  [<ffffffff813bcb90>] ? _firmware_event_work+0x0/0x1af0
<4>[  318.881762]  [<ffffffff810792cf>] worker_thread+0x17f/0x2b0
<4>[  318.881769]  [<ffffffff8107d9c0>] ? autoremove_wake_function+0x0/0x40
<4>[  318.881775]  [<ffffffff81079150>] ? worker_thread+0x0/0x2b0
<4>[  318.881781]  [<ffffffff8107d466>] kthread+0x96/0xa0
<4>[  318.881787]  [<ffffffff8100ae64>] kernel_thread_helper+0x4/0x10
<4>[  318.881794]  [<ffffffff8107d3d0>] ? kthread+0x0/0xa0
<4>[  318.881799]  [<ffffffff8100ae60>] ? kernel_thread_helper+0x0/0x10


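fw_event0 seems to hang here, in sd_sync_cache(), waiting for a cache-flush
request to the drive that has just been removed. While it hangs, old IO
requests don't complete, so md/raid1 cannot proceed.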

NeilBrown


* MD RAID1 deadlock on failed disk
@ 2010-10-26 22:32 Hubert Tonneau
  0 siblings, 0 replies; 5+ messages in thread
From: Hubert Tonneau @ 2010-10-26 22:32 UTC (permalink / raw)
  To: linux-raid

Hi,

The configuration is:
Perc H200 controller configured with no RAID (mpt2sas driver),
2 SATA disks (sda and sdb),
Linux MD Software RAID1 (md0),
stock Linux 2.6.35.7 kernel.

I hot-unplug the second (sdb) disk, and the result is:
. as expected, I can read sda device,
. as expected, any read to sdb device fails,
. unexpectedly, any read to md0 never returns.

No oops or anything like that in the kernel log.
I did not try the same with other kernel releases.

Regards,
Hubert Tonneau
