* stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-03-21 23:16 UTC
To: lkml; +Cc: linux-scsi

I recently got an IBM M1015 (MegaRaid 9240-8i) card, and after getting a new
motherboard, the system now boots, but the megaraid_sas driver seems to be
getting stuck when trying to initialize the card.

Looking through the source, it seems to be stuck in the megasas_adp_reset_gen2
function, in the while loop at the end. Now, according to the code it can't
actually get stuck there permanently, but it does take quite a while for the
loop to finish, and the udev timeout messages to stop.

I've looked around quite a bit, but haven't found any solutions thus far. If
anyone could point me in the right direction I'd appreciate it.

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: adam radford @ 2012-03-21 23:36 UTC
To: thomas; +Cc: lkml, linux-scsi

On Wed, Mar 21, 2012 at 4:16 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> I recently got an IBM M1015 (MegaRaid 9240-8i) card, and after getting a new
> motherboard, the system now boots, but the megaraid_sas driver seems to be
> getting stuck when trying to initialize the card.
>
> Looking through the source, it seems to be stuck in the megasas_adp_reset_gen2
> function, in the while loop at the end. Now, according to the code it can't
> actually get stuck there permanently, but it does take quite a while for the
> loop to finish, and the udev timeout messages to stop.
>
> I've looked around quite a bit, but haven't found any solutions thus far. If
> anyone could point me in the right direction I'd appreciate it.

If you are getting controller resets during driver load, you must not
be getting interrupts or firmware is not responding to the inquiry
roll-call. Make sure you have the latest firmware.

The code at the end of megasas_adp_reset_gen2() just looks for
DIAG_RESET_ADAPTER flag to clear on the host diag register when
issuing a controller reset... that should happen almost immediately
unless there is a hardware or firmware issue.

Are you sure your 'new' motherboard is actually good ?

Can you move your megaraid 9240-8i into a 'known working' system and
re-test ?

-Adam

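The wait described in the message above (poll the host diag register until the DIAG_RESET_ADAPTER flag clears, with a bounded number of retries) can be modelled in a few lines. This is a minimal Python sketch of that pattern, not the driver's code. The bit value `0x04` for `DIAG_RESET_ADAPTER` is an assumption to be checked against megaraid_sas.h in your kernel tree, and `wait_for_reset_clear`, `fake_register`, `max_retries` and `delay_s` are hypothetical names and placeholder values chosen for illustration.

```python
import time
from typing import Callable, List

# Assumed bit value; verify against megaraid_sas.h in your kernel tree.
DIAG_RESET_ADAPTER = 0x04


def wait_for_reset_clear(read_hostdiag: Callable[[], int],
                         max_retries: int = 1000,
                         delay_s: float = 0.0) -> bool:
    """Poll until the reset bit clears; return False if the retries run out."""
    for _ in range(max_retries):
        if not (read_hostdiag() & DIAG_RESET_ADAPTER):
            return True
        time.sleep(delay_s)
    return False


def fake_register(values: List[int]) -> Callable[[], int]:
    """Hypothetical register: yields each value once, then repeats the last."""
    it = iter(values)
    last = [values[-1]]

    def read() -> int:
        last[0] = next(it, last[0])
        return last[0]
    return read


# Firmware that clears the bit after two reads: the poll succeeds.
healthy = wait_for_reset_clear(fake_register([0xA4, 0xA4, 0xA0]))
# Firmware that never clears the bit: the poll gives up after max_retries.
stuck = wait_for_reset_clear(fake_register([0xA4]), max_retries=50)
print(healthy, stuck)
```

Because the loop is bounded, a controller that never clears the bit makes the poll slow rather than infinite, which would be consistent with the observation that the loop takes a long time but does eventually finish.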
* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-03-21 23:52 UTC
To: adam radford; +Cc: lkml, linux-scsi

On Wed Mar 21, 2012, adam radford wrote:
> On Wed, Mar 21, 2012 at 4:16 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > I recently got an IBM M1015 (MegaRaid 9240-8i) card, and after getting a
> > new motherboard, the system now boots, but the megaraid_sas driver seems
> > to be getting stuck when trying to initialize the card.
> >
> > Looking through the source, it seems to be stuck in the
> > megasas_adp_reset_gen2 function, in the while loop at the end. Now,
> > according to the code it can't actually get stuck there permanently, but
> > it does take quite a while for the loop to finish, and the udev timeout
> > messages to stop.
> >
> > I've looked around quite a bit, but haven't found any solutions thus far.
> > If anyone could point me in the right direction I'd appreciate it.
>
> If you are getting controller resets during driver load, you must not
> be getting interrupts or firmware is not responding to the inquiry
> roll-call. Make sure you have the latest firmware.

I updated to the latest on LSI's site today before emailing. It changes the
behavior slightly. With the older firmware, it would not print any of the
initial reset messages, but would once udev decides to start killing
modprobe. With the new firmware, I get a:

ADP_RESET_GEN2: HostDiag=a0

followed by a bunch of:

RESET_GEN2: retry=%x, hostdiag=a4

Now I'm not sure the hostdiag should be different between the two. If this
aN identifier is similar to the aN identifiers in the MegaCli tool, then
it would mean it's trying to reset a device that doesn't exist? I only have
a single M1015 card installed.

> The code at the end of megasas_adp_reset_gen2() just looks for
> DIAG_RESET_ADAPTER flag to clear on the host diag register when
> issuing a controller reset... that should happen almost immediately
> unless there is a hardware or firmware issue.
>
> Are you sure your 'new' motherboard is actually good ?

It boots and runs fine without the sas card installed. I haven't run any
heavy load tests, but it seems ok.

> Can you move your megaraid 9240-8i into a 'known working' system and
> re-test ?

Nope. This is the furthest I've gotten with this card installed.
The old system would fail to boot into grub properly, let alone linux.
These cards seem to be /very/ picky about what motherboard you install
them in.

> -Adam

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-04-11 20:17 UTC
To: adam radford; +Cc: lkml, linux-scsi

On Wed Mar 21, 2012, you wrote:
> On Wed Mar 21, 2012, adam radford wrote:
> > On Wed, Mar 21, 2012 at 4:16 PM, Thomas Fjellstrom <thomas@fjellstrom.ca>
> > wrote:
> > > I recently got an IBM M1015 (MegaRaid 9240-8i) card, and after getting
> > > a new motherboard, the system now boots, but the megaraid_sas driver
> > > seems to be getting stuck when trying to initialize the card.
> > >
> > > Looking through the source, it seems to be stuck in the
> > > megasas_adp_reset_gen2 function, in the while loop at the end. Now,
> > > according to the code it can't actually get stuck there permanently,
> > > but it does take quite a while for the loop to finish, and the udev
> > > timeout messages to stop.
> > >
> > > I've looked around quite a bit, but haven't found any solutions thus
> > > far. If anyone could point me in the right direction I'd appreciate
> > > it.
> >
> > If you are getting controller resets during driver load, you must not
> > be getting interrupts or firmware is not responding to the inquiry
> > roll-call. Make sure you have the latest firmware.
>
> I updated to the latest on LSI's site today before emailing. It changes the
> behavior slightly. With the older firmware, it would not print any of the
> initial reset messages, but would once udev decides to start killing
> modprobe. With the new firmware, I get a:
>
> ADP_RESET_GEN2: HostDiag=a0
>
> followed by a bunch of:
>
> RESET_GEN2: retry=%x, hostdiag=a4
>
> Now I'm not sure the hostdiag should be different between the two. If this
> aN identifier is similar to the aN identifiers in the MegaCli tool, then
> it would mean it's trying to reset a device that doesn't exist? I only have
> a single M1015 card installed.
>
> > The code at the end of megasas_adp_reset_gen2() just looks for
> > DIAG_RESET_ADAPTER flag to clear on the host diag register when
> > issuing a controller reset... that should happen almost immediately
> > unless there is a hardware or firmware issue.
> >
> > Are you sure your 'new' motherboard is actually good ?
>
> It boots and runs fine without the sas card installed. I haven't run any
> heavy load tests, but it seems ok.

Machine has been solid as a rock (sans 9240-8i) for the past month with mild
to half load. It runs several virtual machines, an NFS share, my firewall, a
minecraft server, and some other miscellaneous stuff. Not a single hiccup.

> > Can you move your megaraid 9240-8i into a 'known working' system and
> > re-test ?
>
> Nope. This is the furthest I've gotten with this card installed.
> The old system would fail to boot into grub properly, let alone linux.
> These cards seem to be /very/ picky about what motherboard you install
> them in.
>
> > -Adam

I just got a second M1015 card in today and gave it a go. Similar issues,
different log messages. (hand typed from picture taken of screens)

Lots of:

megasas: Waiting for 1 commands to complete

for quite a while (5-10 minutes), along with udevd trying to kill modprobe.

Then:

megasas: moving cmd[0]:hexstringherewithcolons queue as internal
megaraid_sas: FW detected to be in fault state, restarting it...
ADP_RESET_GEN2: HostDiag=a0
megaraid_sas: FW restarted successfully,initializing next stage...
megaraid_sas: HBA recovery state machine,state 1 starting...
(sits here for a while)
megasas: Waiting for FW to come to ready state
megasas: FW now in ready state
megaraid_sas: command hexstringhere, hexstringhere detected (something?) while HBA reset
megasas: command hexstring scsi cmd [12]detected on the internal (something?) again
megasas: reset successful
scsi:0:0:0:0: megasas: RESET cmd=12 retries=0
megaraid_sas: no pending cmds after reset
megasas: reset successful
scsi:0:0:0:0: megasas: RESET cmd=12 retries=0
megaraid_sas: no pending cmds after reset
megasas: reset successful
scsi:0:0:0:0: Device offlined - not ready after error recovery
(other scsi devices are detected)
(bootup hangs here)

Eventually there are some "hung task" timeout backtraces. This is where I
tried to kill udevd; CTRL+C didn't stop it from trying to kill modprobe, and
ALT+SYSRQ+K caused a silent oops (keyboard leds blinking, no backtrace or
OOPS text). If it's similar to last time, eventually the kernel will outright
OOPS without any intervention.

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: adam radford @ 2012-04-11 20:57 UTC
To: thomas; +Cc: lkml, linux-scsi

On Wed, Apr 11, 2012 at 1:17 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
>>
>> ADP_RESET_GEN2: HostDiag=a0
>>
>> followed by a bunch of:
>>
>> RESET_GEN2: retry=%x, hostdiag=a4
>>
>> Now I'm not sure the hostdiag should be different between the two. If this
>> aN identifier is similar to the aN identifiers in the MegaCli tool, then
>> it would mean it's trying to reset a device that doesn't exist? I only have
>> a single M1015 card installed.

host diag register output a0 or a4 has absolutely nothing to do with
MegaCli -aN command line argument for specifying adapter number.

> I just got a second M1015 card in today and gave it a go. Similar issues,
> different log messages. (hand typed from picture taken of screens)
>
> Lots of:
>
> megasas: Waiting for 1 commands to complete

Can you try booting with kernel command line argument pcie_aspm=off ?

-Adam

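Since the a0/a4 values are register contents rather than adapter numbers, it can help to read them as bit fields. The sketch below decodes the two values reported in this thread. It assumes `DIAG_WRITE_ENABLE = 0x80` and `DIAG_RESET_ADAPTER = 0x04`, which should be checked against megaraid_sas.h in your kernel tree; `decode_hostdiag` is a hypothetical helper, and any bit without an assumed name is only reported as unknown.

```python
from typing import List

# Assumed bit definitions; verify against megaraid_sas.h in your kernel tree.
DIAG_WRITE_ENABLE = 0x80
DIAG_RESET_ADAPTER = 0x04

KNOWN_BITS = {
    DIAG_WRITE_ENABLE: "DIAG_WRITE_ENABLE",
    DIAG_RESET_ADAPTER: "DIAG_RESET_ADAPTER",
}


def decode_hostdiag(value: int) -> List[str]:
    """Name each set bit (lowest first); unnamed bits get a hex label."""
    names = []
    for bit in (1 << n for n in range(8)):
        if value & bit:
            names.append(KNOWN_BITS.get(bit, f"unknown(0x{bit:02x})"))
    return names


for raw in (0xA0, 0xA4):
    print(f"hostdiag={raw:02x}: {', '.join(decode_hostdiag(raw))}")
```

The two values differ only in bit 0x04. Under the assumption above, a repeated `hostdiag=a4` would mean the reset flag is still set each time the register is read, that is, the controller has not yet completed the reset.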
* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-04-11 21:44 UTC
To: adam radford; +Cc: lkml, linux-scsi

On Wed Apr 11, 2012, adam radford wrote:
> On Wed, Apr 11, 2012 at 1:17 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> >> ADP_RESET_GEN2: HostDiag=a0
> >>
> >> followed by a bunch of:
> >>
> >> RESET_GEN2: retry=%x, hostdiag=a4
> >>
> >> Now I'm not sure the hostdiag should be different between the two. If
> >> this aN identifier is similar to the aN identifiers in the MegaCli
> >> tool, then it would mean it's trying to reset a device that doesn't
> >> exist? I only have a single M1015 card installed.
>
> host diag register output a0 or a4 has absolutely nothing to do with
> MegaCli -aN command line argument for specifying adapter number.
>
> > I just got a second M1015 card in today and gave it a go. Similar issues,
> > different log messages. (hand typed from picture taken of screens)
> >
> > Lots of:
> >
> > megasas: Waiting for 1 commands to complete
>
> Can you try booting with kernel command line argument pcie_aspm=off

No problem. Things are quite similar.

Startup goes like:

<detected a onboard sata ports>
scsi: waiting for bus probes to complete...
Refined TSC...
Switched to clocksource tsc
<pause here>
udevd[...]: timeout: killing '/sbin/modprobe -b ...'
(lots of these, so much that I hit scroll lock so I can see the kernel
messages as they come up)
scsi 0:0:0:0: megasas: RESET cmd=12 retries=0
megasas: [ 0] waiting for 1 commands to complete
(many more waiting messages)
<hung task kworker/u:4>
Call Trace:
[<ffffffff810641d0>] ? async_synchronize_cookie_domain+0xb2/...c
[<ffffffff8105f583>] ? add_wait_queue+0x3c/0x3c
....
megasas: [55] waiting for 1 commands to complete
....
megasas: [175] waiting for 1 commands to complete
megasas: moving cmd[0]:ffff880234bcb940:0:ffff88002339beec0 the defer queue as internal
megaraid_sas: FW detected to be in faultstate, restarting it...
ADP_RESET_GEN2: HostDiag=a0
(10s wait)
megaraid_sas: FW restarted successfully,initializing next stage...
megaraid_sas: HBA recovery state machine,state 2 starting...
(30s wait)
megasas: Waiting for FW to come to ready state
megasas: FW now in ready state
megaraid_sas: command ffff880234bcb940, ffff8802339beec0:0detected to be pending while HBA reset
megasas: ffff880234bcb940 scsi cmd [12]detected on the internal queue, issue again.
megasas: reset successful
scsi: 0:0:0:0: megasas: RESET cmd=12 retries 0
megaraid_sas: no pending cmds after reset
megasas: reset successful
(20s wait)
(device offlined message here, missed it this time)
(detected all sata devices)

And it stalled there.

> -Adam

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: adam radford @ 2012-04-12 8:11 UTC
To: thomas; +Cc: lkml, linux-scsi

On Wed, Apr 11, 2012 at 2:44 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> megasas: [55] waiting for 1 commands to complete
> ....
> megasas: [175] waiting for 1 commands to complete

You still aren't getting interrupts from the card. Can you re-try
with megaraid_sas module parameter msix_disable ?

If that doesn't work, contact LSI support and tell them which
motherboard, any PCIe bridges, risers, etc you have.

-Adam

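One way to check the "no interrupts" diagnosis from userspace is to look at the per-CPU counters in /proc/interrupts for the controller's IRQ line. The following is a rough Python sketch of that check. It assumes the driver registers its handler under the name `megasas`, which should be confirmed on the system in question; `irq_totals` is a hypothetical helper, and `SAMPLE` is made-up text for illustration, not output from the machine discussed in this thread.

```python
from typing import Dict


def irq_totals(proc_interrupts: str, name: str = "megasas") -> Dict[str, int]:
    """Sum the per-CPU counts on each /proc/interrupts line mentioning `name`."""
    totals = {}
    for line in proc_interrupts.splitlines():
        fields = line.split()
        # Skip the CPU header and any line that does not mention the handler.
        if len(fields) < 2 or not fields[0].endswith(":") or name not in line:
            continue
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # counters end where the controller/handler names begin
            total += int(field)
        totals[fields[0].rstrip(":")] = total
    return totals


# Made-up sample for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:         41          0   IO-APIC-edge      timer
 16:          0          0   IO-APIC-fasteoi   megasas
 19:      10422       9981   IO-APIC-fasteoi   ahci
"""

print(irq_totals(SAMPLE))
```

On a live system the input would be the contents of /proc/interrupts. A total that stays at zero while the driver logs "waiting for 1 commands to complete" would be consistent with interrupts not being delivered.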
* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-04-12 18:16 UTC
To: adam radford; +Cc: lkml, linux-scsi

On Thu Apr 12, 2012, adam radford wrote:
> On Wed, Apr 11, 2012 at 2:44 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > megasas: [55] waiting for 1 commands to complete
> > ....
> > megasas: [175] waiting for 1 commands to complete
>
> You still aren't getting interrupts from the card. Can you re-try
> with megaraid_sas module parameter msix_disable ?
> If that doesn't work, contact LSI support and tell them which
> motherboard, any PCIe bridges, risers, etc you have.

No luck, darn it, with either card.

> -Adam

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: stuck in megaraid_sas.c megasas_adp_reset_gen2
From: Thomas Fjellstrom @ 2012-04-13 18:50 UTC
To: adam radford; +Cc: lkml, linux-scsi

On Thu Apr 12, 2012, Thomas Fjellstrom wrote:
> On Thu Apr 12, 2012, adam radford wrote:
> > On Wed, Apr 11, 2012 at 2:44 PM, Thomas Fjellstrom <thomas@fjellstrom.ca>
> > wrote:
> > > megasas: [55] waiting for 1 commands to complete
> > > ....
> > > megasas: [175] waiting for 1 commands to complete
> >
> > You still aren't getting interrupts from the card. Can you re-try
> > with megaraid_sas module parameter msix_disable ?
> > If that doesn't work, contact LSI support and tell them which
> > motherboard, any PCIe bridges, risers, etc you have.
>
> No luck, darn it, with either card.

I've been talking to LSI support for the past few days. So far they are
unwilling to support my current use case, or fix their INT19 firmware bug
(which they have acknowledged).

I just hope there aren't any more serious bugs like this in their firmware.
It's looking pretty scary. If they let this bug live for years, imagine what
other bugs might lurk. I may have to avoid LSI completely in the future, even
on real server systems.

> > -Adam

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca