* bnx2x machine check in bnx2x_ack_int()
@ 2008-12-09 19:04 Bjorn Helgaas
  2008-12-15  8:03 ` Eilon Greenstein
  0 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2008-12-09 19:04 UTC
  To: Eilon Greenstein; +Cc: netdev

Hi Eilon,

I'm using bnx2x 1.45.23 from RHEL5.3s4 on a prototype ia64 platform,
and I see intermittent machine checks at bnx2x_ack_int+176, which is
just after __ia64_readl() returns.

This is a proto with incomplete firmware, and the driver correctly
complains about that, but it seems like there's still a hole where
things blow up.

The machine check happens intermittently on boot, but I can reproduce
it instantly with a loop like this:

  # while /bin/true; do ifup eth6; date; done

Here's some lspci and dmesg information.  I added a little debug in
bnx2x_ack_int(), so this is a kernel I compiled myself.

Let me know if there's any information I can collect or testing I
can do.

Thanks,
  Bjorn



0001:80:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Subsystem: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Flags: fast devsel, IRQ 56
        Memory at 40020000000 (64-bit, non-prefetchable) [disabled] [size=8M]
        Memory at 40020800000 (64-bit, non-prefetchable) [disabled] [size=8M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

0001:80:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Subsystem: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Flags: fast devsel, IRQ 86
        Memory at 40021000000 (64-bit, non-prefetchable) [disabled] [size=8M]
        Memory at 40021800000 (64-bit, non-prefetchable) [disabled] [size=8M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

0007:80:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Subsystem: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Flags: bus master, fast devsel, latency 0, IRQ 66
        Memory at 40320000000 (64-bit, non-prefetchable) [size=8M]
        Memory at 40320800000 (64-bit, non-prefetchable) [size=8M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

0007:80:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Subsystem: Broadcom Corporation NetXtreme II BCM57710 10Gigabit PCIe [Everest]
        Flags: bus master, fast devsel, latency 0, IRQ 87
        Memory at 40321000000 (64-bit, non-prefetchable) [size=8M]
        Memory at 40321800000 (64-bit, non-prefetchable) [size=8M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

Linux version 2.6.18-124.el5 (root@tahiti.helgaas) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #10 SMP Fri Dec 5 12:57:00 MST 2008
...
Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.45.23 (2008/11/03)
ACPI: PCI Interrupt 0001:80:00.0[A] -> GSI 33 (level, low) -> IRQ 56
[bnx2x_get_hwinfo:7490(eth0)]warning random MAC workaround active
bnx2x: MCP disabled, must load devices in order!
eth0: Broadcom NetXtreme II BCM57710 XGb (A1) PCI-E x8 2.5GHz found at mem 40020000000, IRQ 56, node addr 5a12c9643ce9
GSI 34 (level, low) -> CPU 5 (0x1100) vector 86
ACPI: PCI Interrupt 0001:80:00.1[B] -> GSI 34 (level, low) -> IRQ 86
[bnx2x_get_hwinfo:7490(eth1)]warning random MAC workaround active
eth1: Broadcom NetXtreme II BCM57710 XGb (A1) PCI-E x8 2.5GHz found at mem 40021000000, IRQ 86, node addr 3a7a42d2f596
ACPI: PCI Interrupt 0007:80:00.0[A] -> GSI 201 (level, low) -> IRQ 66
[bnx2x_get_hwinfo:7490(eth2)]warning random MAC workaround active
bnx2x: MCP disabled, must load devices in order!
eth2: Broadcom NetXtreme II BCM57710 XGb (A1) PCI-E x8 2.5GHz found at mem 40320000000, IRQ 66, node addr 02af31dabdd3
GSI 202 (level, low) -> CPU 6 (0x1200) vector 87
ACPI: PCI Interrupt 0007:80:00.1[B] -> GSI 202 (level, low) -> IRQ 87
[bnx2x_get_hwinfo:7490(eth3)]warning random MAC workaround active
eth3: Broadcom NetXtreme II BCM57710 XGb (A1) PCI-E x8 2.5GHz found at mem 40321000000, IRQ 87, node addr ae914301e30b
...

Bringing up interface eth6:  
Determining IP information for eth6...bnx2x_ack_int bp 0xe0001140f8972500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x_init_common:5394(eth6)]Bootcode is missing - can not initialize link
[bnx2x__link_reset:2001(eth6)]Bootcode is missing -not resetting link
bnx2x_ack_int bp 0xe0001140f8972500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x_initial_phy_init:1978(eth6)]Bootcode is missing -not initializing link
bnx2x_ack_int bp 0xe0001140f8972500 hc_addr 0x108198 addr 0xc000040020108198
bnx2x_ack_int bp 0xe0001140f8972500 hc_addr 0x108198 addr 0xc000040020108198
ADDRCONF(NETDEV_UP): eth6: link is not ready
 failed; no link present.  Check cable?
bnx2x_ack_int bp 0xe0001140f8972500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x__link_reset:2001(eth6)]Bootcode is missing -not resetting link
[FAILED]
Bringing up interface eth7:  
Determining IP information for eth7...bnx2x_ack_int bp 0xe0001140f8970500 hc_addr 0x1081b8 addr 0xc0000400211081b8
[bnx2x_init_common:5394(eth7)]Bootcode is missing - can not initialize link
[bnx2x__link_reset:2001(eth7)]Bootcode is missing -not resetting link
bnx2x_ack_int bp 0xe0001140f8970500 hc_addr 0x1081b8 addr 0xc0000400211081b8
[bnx2x_initial_phy_init:1978(eth7)]Bootcode is missing -not initializing link
bnx2x_ack_int bp 0xe0001140f8970500 hc_addr 0x1081b8 addr 0xc0000400211081b8
bnx2x_ack_int bp 0xe0001140f8970500 hc_addr 0x1081b8 addr 0xc0000400211081b8
ADDRCONF(NETDEV_UP): eth7: link is not ready
 failed; no link present.  Check cable?
bnx2x_ack_int bp 0xe0001140f8970500 hc_addr 0x1081b8 addr 0xc0000400211081b8
[bnx2x__link_reset:2001(eth7)]Bootcode is missing -not resetting link
[FAILED]
Bringing up interface eth8:  
Determining IP information for eth8...bnx2x_ack_int bp 0xe0001140f896e500 hc_addr 0x108198 addr 0xc000040320108198
[bnx2x_init_common:5394(eth8)]Bootcode is missing - can not initialize link
[bnx2x__link_reset:2001(eth8)]Bootcode is missing -not resetting link
bnx2x_ack_int bp 0xe0001140f896e500 hc_addr 0x108198 addr 0xc000040320108198
[bnx2x_initial_phy_init:1978(eth8)]Bootcode is missing -not initializing link
bnx2x_ack_int bp 0xe0001140f896e500 hc_addr 0x108198 addr 0xc000040320108198
bnx2x_ack_int bp 0xe0001140f896e500 hc_addr 0x108198 addr 0xc000040320108198
ADDRCONF(NETDEV_UP): eth8: link is not ready
 failed; no link present.  Check cable?
bnx2x_ack_int bp 0xe0001140f896e500 hc_addr 0x108198 addr 0xc000040320108198
[bnx2x__link_reset:2001(eth8)]Bootcode is missing -not resetting link
[FAILED]
Bringing up interface eth9:  
Determining IP information for eth9...bnx2x_ack_int bp 0xe0001140f896c500 hc_addr 0x1081b8 addr 0xc0000403211081b8
[bnx2x_init_common:5394(eth9)]Bootcode is missing - can not initialize link
[bnx2x__link_reset:2001(eth9)]Bootcode is missing -not resetting link
bnx2x_ack_int bp 0xe0001140f896c500 hc_addr 0x1081b8 addr 0xc0000403211081b8
[bnx2x_initial_phy_init:1978(eth9)]Bootcode is missing -not initializing link
bnx2x_ack_int bp 0xe0001140f896c500 hc_addr 0x1081b8 addr 0xc0000403211081b8
bnx2x_ack_int bp 0xe0001140f896c500 hc_addr 0x1081b8 addr 0xc0000403211081b8
ADDRCONF(NETDEV_UP): eth9: link is not ready
 failed; no link present.  Check cable?
bnx2x_ack_int bp 0xe0001140f896c500 hc_addr 0x1081b8 addr 0xc0000403211081b8
[bnx2x__link_reset:2001(eth9)]Bootcode is missing -not resetting link
[FAILED]
...


[root@localhost ~]# while /bin/true; do ifup eth6; date; done

Determining IP information for eth6...bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x_init_common:5394(eth6)]Bootcode is missing - can not initialize link
[bnx2x__link_reset:2001(eth6)]Bootcode is missing -not resetting link
bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x_initial_phy_init:1978(eth6)]Bootcode is missing -not initializing link
bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
ADDRCONF(NETDEV_UP): eth6: link is not ready
 failed; no link present.  Check cable?
bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
[bnx2x__link_reset:2001(eth6)]Bootcode is missing -not resetting link
Sat Nov 29 06:27:53 MST 2008

Determining IP information for eth6...bnx2x_ack_int bp 0xe0001140ffef6500 hc_addr 0x108198 addr 0xc000040020108198
[Sending CFW FIFO message (0x12000102) Local MCA (18) partition 0 CPU LID = 258 (0x102) to PDH 0 ... sent.]

sequencer: MCA on cpu17.

MCA occurred : iip=0xA00000010040D1B0, ipsr=0x0000101008526030, ipfs=0x800000000000030B.
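
The debug output above prints both the register offset (hc_addr) and the final address handed to readl(). As a rough consistency check, the sketch below recomputes that address from the BAR0 values in the lspci output. It assumes that the mapping is the ia64 uncached identity-mapped region based at 0xc000000000000000, and that eth6 through eth9 correspond to the four PCI functions in the order listed; both points are inferred from the numbers matching, not stated in the report. The names mmio_virt, SAMPLES, and check_samples are made up for illustration.

```python
# Sketch only: checks that the logged MMIO addresses are consistent with
# the BARs reported by lspci. This is not driver code.

# Assumption: base of the ia64 uncached identity-mapped region (region 6).
IA64_UNCACHED_BASE = 0xC000000000000000


def mmio_virt(bar_phys, hc_addr):
    """Virtual address expected for a register at hc_addr within a BAR."""
    return IA64_UNCACHED_BASE | (bar_phys + hc_addr)


# (BAR0 from lspci, hc_addr from debug output, addr from debug output)
SAMPLES = [
    (0x40020000000, 0x108198, 0xC000040020108198),  # 0001:80:00.0
    (0x40021000000, 0x1081B8, 0xC0000400211081B8),  # 0001:80:00.1
    (0x40320000000, 0x108198, 0xC000040320108198),  # 0007:80:00.0
    (0x40321000000, 0x1081B8, 0xC0000403211081B8),  # 0007:80:00.1
]


def check_samples():
    """Return one boolean per sample: does BAR0 + hc_addr match the log?"""
    return [mmio_virt(bar, off) == logged for bar, off, logged in SAMPLES]


if __name__ == "__main__":
    for (bar, off, logged), ok in zip(SAMPLES, check_samples()):
        print(f"BAR {bar:#x} + {off:#x} -> {mmio_virt(bar, off):#x} match={ok}")
```

If every sample matches, the faulting read targets an offset inside the device's first BAR, so the address arithmetic itself looks sound and attention shifts to whether the device responds to that read.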


* Re: bnx2x machine check in bnx2x_ack_int()
  2008-12-09 19:04 bnx2x machine check in bnx2x_ack_int() Bjorn Helgaas
@ 2008-12-15  8:03 ` Eilon Greenstein
  2008-12-15 16:21   ` Bjorn Helgaas
  0 siblings, 1 reply; 6+ messages in thread
From: Eilon Greenstein @ 2008-12-15  8:03 UTC
  To: Bjorn Helgaas; +Cc: netdev

Hi Bjorn,

My sincere apologies for the late response. When searching the mailing
list to see if I had accidentally missed your reply, I realized
that my response was never sent (a problem with my email client).

On Tue, 2008-12-09 at 11:04 -0800, Bjorn Helgaas wrote:
> Hi Eilon,
> 
> I'm using bnx2x 1.45.23 from RHEL5.3s4 on a prototype ia64 platform,
> and I see intermittent machine checks at bnx2x_ack_int+176, which is
> just after __ia64_readl() returns.
I see that you are using two 57710 chips. What kind of board are you
using (which PHY)?

> This is a proto with incomplete firmware, and the driver correctly
> complains about that, but it seems like there's still a hole where
> things blow up.
If you set the right FW on the board, do you still see the problem? I'm
asking to determine whether this is a problem specific to the "no FW" case
or a general issue with this configuration.

> The machine check happens intermittently on boot, but I can reproduce
> it instantly with a loop like this:
> 
>   # while /bin/true; do ifup eth6; date; done
> 
> Here's some lspci and dmesg information.  I added a little debug in
> bnx2x_ack_int(), so this is a kernel I compiled myself.
> 
> Let me know if there's any information I can collect or testing I
> can do.
Unfortunately, with the 57711 Mezzanine on the IA64 blade that I have, I
cannot reproduce this failure, so I will probably need more information.
Can you please send me the nvram content?

Thanks,
Eilon




* Re: bnx2x machine check in bnx2x_ack_int()
  2008-12-15  8:03 ` Eilon Greenstein
@ 2008-12-15 16:21   ` Bjorn Helgaas
  2008-12-15 17:18     ` Eilon Greenstein
  0 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2008-12-15 16:21 UTC
  To: Eilon Greenstein; +Cc: netdev

On Monday 15 December 2008 01:03:29 am Eilon Greenstein wrote:
> My sincere apologies for the late response. When searching the mailing
> list to see if I had accidentally missed your reply, I realized
> that my response was never sent (a problem with my email client).

No problem, thanks for the response!

> On Tue, 2008-12-09 at 11:04 -0800, Bjorn Helgaas wrote:
> > Hi Eilon,
> > 
> > I'm using bnx2x 1.45.23 from RHEL5.3s4 on a prototype ia64 platform,
> > and I see intermittent machine checks at bnx2x_ack_int+176, which is
> > just after __ia64_readl() returns.
> I see that you are using two 57710 chips. What kind of board are you
> using (which PHY)?

This is a blade with separate PHYs.  In this case, there is no PHY
at all because the module that would contain the PHY is not installed.

> > This is a proto with incomplete firmware, and the driver correctly
> > complains about that, but it seems like there's still a hole where
> > things blow up.
> If you set the right FW on the board, do you still see the problem? I'm
> asking to determine if this is a problem specific to the "no FW" case or
> a general issue with this configuration.

I'm trying to get FW installed, but haven't had any luck yet.

> > The machine check happens intermittently on boot, but I can reproduce
> > it instantly with a loop like this:
> > 
> >   # while /bin/true; do ifup eth6; date; done
> > 
> > Here's some lspci and dmesg information.  I added a little debug in
> > bnx2x_ack_int(), so this is a kernel I compiled myself.
> > 
> > Let me know if there's any information I can collect or testing I
> > can do.
> Unfortunately, with the 57711 Mezzanine on the IA64 blade that I have, I
> cannot reproduce this failure, so I will probably need more information.
> Can you please send me the nvram content?

Sure.  Is there an easy way to collect this, or should I instrument the
driver to print it out or something?

Bjorn


* Re: bnx2x machine check in bnx2x_ack_int()
  2008-12-15 16:21   ` Bjorn Helgaas
@ 2008-12-15 17:18     ` Eilon Greenstein
  2008-12-15 18:04       ` Bjorn Helgaas
  0 siblings, 1 reply; 6+ messages in thread
From: Eilon Greenstein @ 2008-12-15 17:18 UTC
  To: Bjorn Helgaas; +Cc: netdev

On Mon, 2008-12-15 at 08:21 -0800, Bjorn Helgaas wrote:

> > > I'm using bnx2x 1.45.23 from RHEL5.3s4 on a prototype ia64 platform,
> > > and I see intermittent machine checks at bnx2x_ack_int+176, which is
> > > just after __ia64_readl() returns.
> > I see that you are using two 57710 chips. What kind of board are you
> > using (which PHY)?
> 
> This is a blade with separate PHYs.  In this case, there is no PHY
> at all because the module that would contain the PHY is not installed.
> 
> > > This is a proto with incomplete firmware, and the driver correctly
> > > complains about that, but it seems like there's still a hole where
> > > things blow up.
> > If you set the right FW on the board, do you still see the problem? I'm
> > asking to determine if this is a problem specific to the "no FW" case or
> > a general issue with this configuration.
> 
> I'm trying to get FW installed, but haven't had any luck yet.

Hmmm... So it is possible that this is HW related (since we are talking
about a preliminary board with a missing component and missing FW). I will
ask for some field support - someone will contact you shortly.

Thanks,
Eilon




* Re: bnx2x machine check in bnx2x_ack_int()
  2008-12-15 17:18     ` Eilon Greenstein
@ 2008-12-15 18:04       ` Bjorn Helgaas
  2008-12-15 19:36         ` Eilon Greenstein
  0 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2008-12-15 18:04 UTC
  To: Eilon Greenstein; +Cc: netdev

On Monday 15 December 2008 10:18:00 am Eilon Greenstein wrote:
> On Mon, 2008-12-15 at 08:21 -0800, Bjorn Helgaas wrote:
> 
> > > > I'm using bnx2x 1.45.23 from RHEL5.3s4 on a prototype ia64 platform,
> > > > and I see intermittent machine checks at bnx2x_ack_int+176, which is
> > > > just after __ia64_readl() returns.
> > > I see that you are using two 57710 chips. What kind of board are you
> > > using (which PHY)?
> > 
> > This is a blade with separate PHYs.  In this case, there is no PHY
> > at all because the module that would contain the PHY is not installed.
> > 
> > > > This is a proto with incomplete firmware, and the driver correctly
> > > > complains about that, but it seems like there's still a hole where
> > > > things blow up.
> > > If you set the right FW on the board, do you still see the problem? I'm
> > > asking to determine if this is a problem specific to the "no FW" case or
> > > a general issue with this configuration.
> > 
> > I'm trying to get FW installed, but haven't had any luck yet.
> 
> Hmmm... So it is possible that this is HW related (since we are talking
> about a preliminary board with a missing component and missing FW). I will
> ask for some field support - someone will contact you shortly.

Yep, it's definitely possible that this is HW related.  The proto
has 57710, but the product will ultimately use the 57711, so I don't
really even care about testing the 57710.

After upgrading some pieces on my boards, I can't reproduce the
MCA either.  So I guess we can just write it off for now.  The only
reason I pointed out the issue at all is because it really hampered
testing of standard distro bits on the rest of the box.

Sorry for the distraction.

Bjorn


* Re: bnx2x machine check in bnx2x_ack_int()
  2008-12-15 18:04       ` Bjorn Helgaas
@ 2008-12-15 19:36         ` Eilon Greenstein
  0 siblings, 0 replies; 6+ messages in thread
From: Eilon Greenstein @ 2008-12-15 19:36 UTC
  To: Bjorn Helgaas; +Cc: netdev

On Mon, 2008-12-15 at 10:04 -0800, Bjorn Helgaas wrote:
> Yep, it's definitely possible that this is HW related.  The proto
> has 57710, but the product will ultimately use the 57711, so I don't
> really even care about testing the 57710.
> 
> After upgrading some pieces on my boards, I can't reproduce the
> MCA either.  So I guess we can just write it off for now.  The only
> reason I pointed out the issue at all is because it really hampered
> testing of standard distro bits on the rest of the box.

Please let me know if you see this issue again - we can try to
root-cause it. And again, sorry for the late initial response.

Eilon


