* [Bug 47701] New: When too many disks fall out at the same time, RCU hangs
@ 2012-09-18 23:13 bugzilla-daemon
  2012-10-06 14:30 ` [Bug 47701] " bugzilla-daemon
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-09-18 23:13 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=47701

           Summary: When too many disks fall out at the same time, RCU
                    hangs
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 3.5.4
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
        AssignedTo: linux-scsi@vger.kernel.org
        ReportedBy: sgunderson@bigfoot.com
        Regression: No


Hi,

For whatever reason, I lost all of my disks at the same time (I guess a SAS
cable fell out; I'll know tomorrow). As expected, I/O on the machine was not
happy afterwards; what was not expected was the following output on the serial
console a minute or so later:

[292657.601264] INFO: rcu_sched self-detected stall on CPU[292657.602441] INFO:
rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 4, t=60002 jiffies)
[292657.602444] INFO: Stall ended before state dump start

[292657.620982]  {[292657.622964]  6}  (t=60024 jiffies)
Pid: 7800, comm: kworker/u:9 Not tainted 3.5.4 #1
[292657.631337] Call Trace:
[292657.634069]  <IRQ>  [<ffffffff8108ecaf>] __rcu_pending+0xbd/0x417
[292657.640471]  [<ffffffff8108f0de>] rcu_check_callbacks+0xd5/0x138
[292657.646765]  [<ffffffff81041b73>] update_process_times+0x3c/0x73
[292657.653050]  [<ffffffff81072076>] tick_sched_timer+0x6a/0x93
[292657.658999]  [<ffffffff8105301e>] __run_hrtimer+0xb3/0x13e
[292657.664763]  [<ffffffff8107200c>] ? tick_nohz_handler+0xd3/0xd3
[292657.670972]  [<ffffffff8103b7a3>] ? __do_softirq+0x16c/0x182
[292657.676916]  [<ffffffff810533d5>] hrtimer_interrupt+0xce/0x1b0
[292657.683030]  [<ffffffff8101ad3d>] smp_apic_timer_interrupt+0x81/0x94
[292657.689676]  [<ffffffff81377447>] apic_timer_interrupt+0x67/0x70
[292657.695965]  <EOI>  [<ffffffff813703dd>] ?
_raw_spin_unlock_irqrestore+0x9/0xb
[292657.703695]  [<ffffffff8123daaa>] scsi_remove_target+0x137/0x153
[292657.709985]  [<ffffffff812425dc>] sas_rphy_remove+0x25/0x4e
[292657.715841]  [<ffffffff81242616>] sas_rphy_delete+0x11/0x1e
[292657.721699]  [<ffffffff81242648>] sas_port_delete+0x25/0x11a
[292657.727644]  [<ffffffff8136de53>] ? mutex_unlock+0x9/0xb
[292657.733254]  [<ffffffffa0020fd5>] mpt2sas_transport_port_remove+0x16f/0x190
[mpt2sas]
[292657.741576]  [<ffffffffa001a70b>] _scsih_remove_device+0x58/0x84 [mpt2sas]
[292657.748731]  [<ffffffffa001a7f4>] _scsih_device_remove_by_handle+0xbd/0xc6
[mpt2sas]
[292657.756960]  [<ffffffffa001c5bb>]
_scsih_sas_topology_change_event+0x422/0x46d [mpt2sas]
[292657.765531]  [<ffffffff81064556>] ? idle_balance+0xde/0x10c
[292657.771395]  [<ffffffffa001e098>] ? _scsih_abort+0x1c1/0x1c1 [mpt2sas]
[292657.778212]  [<ffffffffa001e38d>] _firmware_event_work+0x2f5/0x920
[mpt2sas]
[292657.785547]  [<ffffffff81042089>] ? add_timer+0x17/0x1a
[292657.791058]  [<ffffffff8104bc64>] ? queue_delayed_work_on+0xda/0xe8
[292657.797607]  [<ffffffffa001e098>] ? _scsih_abort+0x1c1/0x1c1 [mpt2sas]
[292657.804418]  [<ffffffff8104c641>] process_one_work+0x253/0x3c5
[292657.810534]  [<ffffffff8104cbb7>] worker_thread+0x1d4/0x34d
[292657.816394]  [<ffffffff8104c9e3>] ? rescuer_thread+0x230/0x230
[292657.822511]  [<ffffffff810500bb>] kthread+0x84/0x8c
[292657.827675]  [<ffffffff81377c94>] kernel_thread_helper+0x4/0x10
[292657.833875]  [<ffffffff81050037>] ? kthread_freezable_should_stop+0x58/0x58
[292657.841117]  [<ffffffff81377c90>] ? gs_change+0xb/0xb

I'm reporting this primarily because it could cause problems in some other
context (say, when only one or two disks disappear); in my case it makes
little difference whether I/O is "properly" rejected or the entire machine
hangs, since the box is useless either way.
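
For reference, the pattern the trace points at looks roughly like the
following (an illustrative, paraphrased sketch of the scsi_remove_target()
removal loop around 3.5/3.6, not a verbatim copy of the code; the wrapper name
is made up, and it uses SCSI-midlayer-internal symbols, so it is a reading aid
rather than something that builds standalone):

/*
 * Illustrative sketch only.  The removal path rescans the host's target
 * list from the top after each removal while targets are still
 * disappearing.  As long as this kworker keeps spinning here without
 * calling schedule() or cond_resched(), the CPU never passes an RCU
 * quiescent state (on a non-preemptible kernel), and after t=60000
 * jiffies the tick path (update_process_times -> rcu_check_callbacks)
 * reports the "rcu_sched self-detected stall" shown above.
 */
static void remove_targets_sketch(struct Scsi_Host *shost, struct device *dev)
{
        struct scsi_target *starget, *found;
        unsigned long flags;

 restart:
        found = NULL;
        spin_lock_irqsave(shost->host_lock, flags);
        list_for_each_entry(starget, &shost->__targets, siblings) {
                if (starget->state == STARGET_DEL)
                        continue;               /* already torn down */
                if (starget->dev.parent == dev || &starget->dev == dev) {
                        found = starget;
                        found->reap_ref++;      /* hold it across the unlock */
                        break;
                }
        }
        spin_unlock_irqrestore(shost->host_lock, flags);

        if (found) {
                __scsi_remove_target(found);
                scsi_target_reap(found);
                goto restart;                   /* rescan from the top */
        }
}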

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug 47701] When too many disks fall out at the same time, RCU hangs
  2012-09-18 23:13 [Bug 47701] New: When too many disks fall out at the same time, RCU hangs bugzilla-daemon
@ 2012-10-06 14:30 ` bugzilla-daemon
  2012-10-07  2:20 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-10-06 14:30 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=47701

--- Comment #1 from Brad Campbell <lists2009@fnarfbargle.com>  2012-10-06 14:30:58 ---
Created an attachment (id=82531)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=82531)
dmesg of boot, and removal of one drive at 9462 seconds.

I can reproduce this on 3.5.5 & 3.6.
I have two of these cards: 01:00.0 Serial Attached SCSI controller: LSI Logic /
Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)

My test setup places only two drives on the SAS cards, and I create a RAID10
from them. Simply pulling a drive causes the RCU hang shown in the attached
dmesg and prevents the machine from syncing, rebooting, or using the array.
Alt-SysRq gets me rebooted and back up and running. 100% reproducible.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug 47701] When too many disks fall out at the same time, RCU hangs
  2012-09-18 23:13 [Bug 47701] New: When too many disks fall out at the same time, RCU hangs bugzilla-daemon
  2012-10-06 14:30 ` [Bug 47701] " bugzilla-daemon
@ 2012-10-07  2:20 ` bugzilla-daemon
  2012-10-07 14:20 ` bugzilla-daemon
  2013-11-19 23:10 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-10-07  2:20 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=47701


Joe Lawrence <joe.lawrence@stratus.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joe.lawrence@stratus.com

--- Comment #2 from Joe Lawrence <joe.lawrence@stratus.com>  2012-10-07 02:20:43 ---
Stratus noticed a similar crash (a hang, actually) earlier this week when
removing a single SAS disk that was part of a RAID 1 MD mirror.  In our case,
all CPUs were idle except one running scsi_target_reap and another waiting on
RCU synchronize_sched.  Since the former was stuck in some loop, RCU stalled
and the machine wedged.  Another Stratus engineer noticed patch [1], and once
we applied it to our kernel, MD/mpt2sas disk removal no longer hung the
machine.

[1] [SCSI] scsi_remove_target: fix softlockup regression on hot remove
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=bc3f02a795d3b4faa99d37390174be2a75d091bd
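
Roughly, the approach in that commit (a paraphrased sketch, not the exact
upstream diff; shost/dev/flags as in scsi_remove_target()): walk the target
list once instead of restarting from the head after every removal, drop
host_lock only around the actual removal, and defer reaping the previous
target so the list cursor stays valid and the worker keeps making forward
progress:

        struct scsi_target *starget, *last = NULL;

        spin_lock_irqsave(shost->host_lock, flags);
        list_for_each_entry(starget, &shost->__targets, siblings) {
                if (starget->state == STARGET_DEL)
                        continue;
                if (starget->dev.parent == dev || &starget->dev == dev) {
                        starget->reap_ref++;    /* keep the entry alive across the unlock */
                        spin_unlock_irqrestore(shost->host_lock, flags);
                        if (last)
                                scsi_target_reap(last); /* previous target, outside the lock */
                        last = starget;
                        __scsi_remove_target(starget);
                        spin_lock_irqsave(shost->host_lock, flags);
                }
        }
        spin_unlock_irqrestore(shost->host_lock, flags);

        if (last)
                scsi_target_reap(last);

With no "goto restart" the worker can no longer spin in place, so the RCU
stall goes away, which is consistent with comment #3 below reporting the
problem gone in post-3.6 git.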

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug 47701] When too many disks fall out at the same time, RCU hangs
  2012-09-18 23:13 [Bug 47701] New: When too many disks fall out at the same time, RCU hangs bugzilla-daemon
  2012-10-06 14:30 ` [Bug 47701] " bugzilla-daemon
  2012-10-07  2:20 ` bugzilla-daemon
@ 2012-10-07 14:20 ` bugzilla-daemon
  2013-11-19 23:10 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2012-10-07 14:20 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=47701


Brad Campbell <lists2009@fnarfbargle.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lists2009@fnarfbargle.com

--- Comment #3 from Brad Campbell <lists2009@fnarfbargle.com>  2012-10-07 14:20:46 ---
Apparently fixed as of 3.6.0-07201-ged5062d (current git as of 8 hours ago).

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug 47701] When too many disks fall out at the same time, RCU hangs
  2012-09-18 23:13 [Bug 47701] New: When too many disks fall out at the same time, RCU hangs bugzilla-daemon
                   ` (2 preceding siblings ...)
  2012-10-07 14:20 ` bugzilla-daemon
@ 2013-11-19 23:10 ` bugzilla-daemon
  3 siblings, 0 replies; 5+ messages in thread
From: bugzilla-daemon @ 2013-11-19 23:10 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=47701

Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |alan@lxorguk.ukuu.org.uk
         Resolution|---                         |CODE_FIX

-- 
You are receiving this mail because:
You are the assignee for the bug.

