All of lore.kernel.org
 help / color / mirror / Atom feed
* Driver SFC: Possible bug in LM87 temperature XFP detection code
@ 2009-04-28  9:36 Jesper Dangaard Brouer
  2009-04-28 13:36 ` Ben Hutchings
  0 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-28  9:36 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

Hi Ben,

I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
evaluation.  The NICs uses a Solarflare Chip and the SFC driver.

If unpluging the fiber cable I start getting these errors:

+--------
 sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
 sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY

 sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
 sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY

 sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
 sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
+---------

Reading through the driver code (drivers/net/sfc/boards.c), this problem
is related to temperature.

The real issues is that I cannot get the device up and running again
after lowering the temperature.  Only if I unload and load the sfc
driver, then I can get the device running again.

I'm thinking perhaps there is missing a PHY power up again, after the
temperature alarm has gone?

I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.


To Ben; do you have anything you want me to try. Do you want to fix this
you self, or can you give me some code hints or patches to try out?

I'm wondering what chip the SMC NIC is using? From lspci is says
SFC4000, but does that corrospond to EFX_BOARD_SFE4001 or
EFX_BOARD_SFE4002 ?

(Additional tech info below signature)

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer



* XFP modules used
------------------

 I have tested with two different XFP modules:
  (1) Finisar FTLX1412D3BCL 10GBASE-LR/LW
  (2) Sumitomo SXP3101LX

* lspci info
------------

 lspci -xvs 12:00.0

+---------
12:00.0 Ethernet controller: Solarflare Communications SFC4000 rev B
[Solarstorm] (rev 02)
        Subsystem: Standard Microsystems Corp [SMC] SMC10GPCIe-XFP (A1)
[TigerCard]
        Flags: bus master, fast devsel, latency 0, IRQ 19
        I/O ports at e800 [size=256]
        Memory at f8000000 (64-bit, non-prefetchable) [size=64M]
        Expansion ROM at feaf0000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+
Queue=0/3 Enable-
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [90] MSI-X: Enable+ Mask- TabSize=64
        Capabilities: [b0] Vital Product Data <?>
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [140] Device Serial Number 00-0f-53-ff-ff-2c-b2-1a
        Kernel driver in use: sfc
        Kernel modules: sfc
00: 24 19 10 07 07 05 10 00 02 00 00 02 10 00 00 00
10: 01 e8 00 00 00 00 00 00 04 00 00 f8 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b8 10 01 02
30: 00 00 af fe 40 00 00 00 00 00 00 00 0a 01 00 00
+-------

* Relevant kernel CONFIG_xxx
----------------------------

 CONFIG_SFC_MTD not set
 CONFIG_SENSORS_LM87=m
 CONFIG_SFC=m


* ethtool status
----------------
 ethtool eth88

+-------
Settings for eth88:
        Supported ports: [ FIBRE ]
        Supported link modes:   
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 2
        Transceiver: internal
        Auto-negotiation: off
        Link detected: no
+-------

* SMC links
-----------

 SMC10GPCIe-XFP TigerCard™ 10G
 http://www.smc.com/index.cfm?event=viewProduct&cid=9&scid=51&pid=1648




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-28  9:36 Driver SFC: Possible bug in LM87 temperature XFP detection code Jesper Dangaard Brouer
@ 2009-04-28 13:36 ` Ben Hutchings
  2009-04-28 14:44   ` Jesper Dangaard Brouer
  2009-04-28 17:04   ` Driver SFC: Possible bug in LM87 temperature XFP detection code Ben Hutchings
  0 siblings, 2 replies; 11+ messages in thread
From: Ben Hutchings @ 2009-04-28 13:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> Hi Ben,
> 
> I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> evaluation.  The NICs uses a Solarflare Chip and the SFC driver.
> 
> If unpluging the fiber cable I start getting these errors:
> 
> +--------
>  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
>  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> 
>  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
>  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> 
>  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
>  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> +---------
> 
> Reading through the driver code (drivers/net/sfc/boards.c), this problem
> is related to temperature.

Right.  And the sensors are not polled while the link is up, on the
assumption that a temperature or voltage fault will cause the link to go
down, and because bit-banged I2C will reduce throughput slightly.

> The real issues is that I cannot get the device up and running again
> after lowering the temperature.  Only if I unload and load the sfc
> driver, then I can get the device running again.
> 
> I'm thinking perhaps there is missing a PHY power up again, after the
> temperature alarm has gone?

We considered it most important to shut down the board to prevent or
mitigate damage, and did not implement any recovery beyond that.

> I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> 
> 
> To Ben; do you have anything you want me to try. Do you want to fix this
> you self, or can you give me some code hints or patches to try out?

I don't intend to fix this myself.  If you want to try implementing this
then you should start by looking at efx_monitor() in efx.c.  However, I
think your time might be better spent in fixing the air flow in the
computer before the board is permanently damaged.

> I'm wondering what chip the SMC NIC is using? From lspci is says
> SFC4000, but does that corrospond to EFX_BOARD_SFE4001 or
> EFX_BOARD_SFE4002 ?

The SMC10GPCIe-XFP is based on SFE4002.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-28 13:36 ` Ben Hutchings
@ 2009-04-28 14:44   ` Jesper Dangaard Brouer
  2009-04-28 14:48     ` [PATCH] sfc: Make temperature warnings/alarms more explicit Jesper Dangaard Brouer
  2009-04-28 17:04   ` Driver SFC: Possible bug in LM87 temperature XFP detection code Ben Hutchings
  1 sibling, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-28 14:44 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> > 
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation.  The NICs uses a Solarflare Chip and the SFC driver.
> > 
> > If unpluging the fiber cable I start getting these errors:
> > 
> > +--------
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> > 
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
> 
> Right.  And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.

In my situation the link does not go down due the temperature issue.


> > The real issues is that I cannot get the device up and running again
> > after lowering the temperature.  Only if I unload and load the sfc
> > driver, then I can get the device running again.
> > 
> > I'm thinking perhaps there is missing a PHY power up again, after the
> > temperature alarm has gone?
> 
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.

Im my case putting the PHY in PHY_MODE_LOW_POWER, does not help lowering
the temperature.  The errors are continous, until I apply "manual"
airflow ;-)


> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> > 
> > 
> > To Ben; do you have anything you want me to try. Do you want to fix this
> > you self, or can you give me some code hints or patches to try out?
> 
> I don't intend to fix this myself.  If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c.  However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.

I see you point, I don't want to damage the board... not sure I want to
fix it then... Although in a production environment, I think the driver
should support exchanging a failed XFP without rebooting the server.

Then I also think that we should make the error message a bit more
explicit, in order to warn people before the board is permanently
damaged.  I'll post a patch proposal as reply to this message...

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] sfc: Make temperature warnings/alarms more explicit.
  2009-04-28 14:44   ` Jesper Dangaard Brouer
@ 2009-04-28 14:48     ` Jesper Dangaard Brouer
  2009-04-30  0:50       ` David Miller
  2009-04-30  1:25       ` Ben Hutchings
  0 siblings, 2 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-28 14:48 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev


The sfc driver can detect different hardware failures via the
LM87 system.  One of the failures I have experienced is the
temperature alarm, but the error message didn't reveal that this
error was temperature related.  I had to read the code to
discover that.

I think that the temperature error should be more explicit, in
order to warn people before the board is permanently damaged.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 drivers/net/sfc/boards.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)


diff --git a/drivers/net/sfc/boards.c b/drivers/net/sfc/boards.c
index 4a4c74c..b1822fe 100644
--- a/drivers/net/sfc/boards.c
+++ b/drivers/net/sfc/boards.c
@@ -121,8 +121,10 @@ static int efx_check_lm87(struct efx_nic *efx, unsigned mask)
 	if (alarms1 || alarms2) {
 		EFX_ERR(efx,
 			"LM87 detected a hardware failure (status %02x:%02x)"
-			"%s%s\n",
+			"%s%s%s\n",
 			alarms1, alarms2,
+			(alarms1 & (LM87_ALARM_TEMP_INT|LM87_ALARM_TEMP_EXT1))
+			 ? " high temperature" : "",
 			(alarms1 & LM87_ALARM_TEMP_INT) ? " INTERNAL" : "",
 			(alarms1 & LM87_ALARM_TEMP_EXT1) ? " EXTERNAL" : "");
 		return -ERANGE;


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-28 13:36 ` Ben Hutchings
  2009-04-28 14:44   ` Jesper Dangaard Brouer
@ 2009-04-28 17:04   ` Ben Hutchings
  2009-04-29  8:52     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2009-04-28 17:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

I wrote:
> However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.

It may also be worth checking that the sensors are actually reading
correctly.  You can read them using the "sensors" command from the
lm_sensors package or by running:

    grep . /sys/class/net/eth88/device/i2c-adapter:*/*-002e/temp*_input

Divide the numbers by 1000 to get temperatures in degrees celsius.

The "internal" temperature sensor (temp1_input) is in the LM87, which is
placed at the corner of the board away from the bracket and the edge
connector.

The "external" temperature sensor (temp2_input) is in the SFC4000.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-28 17:04   ` Driver SFC: Possible bug in LM87 temperature XFP detection code Ben Hutchings
@ 2009-04-29  8:52     ` Jesper Dangaard Brouer
  2009-04-29 12:11       ` Jesper Dangaard Brouer
  2009-04-29 12:47       ` Ben Hutchings
  0 siblings, 2 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-29  8:52 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Jesper Dangaard Brouer, netdev, Robert Olsson

On Tue, 28 Apr 2009, Ben Hutchings wrote:

> I wrote:
>> However, I
>> think your time might be better spent in fixing the air flow in the
>> computer before the board is permanently damaged.
>
> It may also be worth checking that the sensors are actually reading
> correctly.  You can read them using the "sensors" command from the
> lm_sensors package or by running:
>
>    grep . /sys/class/net/eth88/device/i2c-adapter:*/*-002e/temp*_input
>
> Divide the numbers by 1000 to get temperatures in degrees celsius.

/sys/class/net/eth88/device/i2c-adapter:i2c-1/1-002e/temp1_input:61000
/sys/class/net/eth88/device/i2c-adapter:i2c-1/1-002e/temp2_input:70000

Guess the temperature is fairly high... I can touch it with a finger 
without getting burned.  I also have a 10GbE Sun Neptune (niu) in the 
machine, if I touch the heatsink on that I burn my finger, so that is even 
hotter...


> The "internal" temperature sensor (temp1_input) is in the LM87, which is
> placed at the corner of the board away from the bracket and the edge
> connector.
>
> The "external" temperature sensor (temp2_input) is in the SFC4000.

I though this was read from the XFP.  I was hoping the NIC supported 
reading i2c stuff from the XFP.  Does it support that? (Robert Olsson 
wanted to play with this stuff)


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-29  8:52     ` Jesper Dangaard Brouer
@ 2009-04-29 12:11       ` Jesper Dangaard Brouer
  2009-04-29 12:47       ` Ben Hutchings
  1 sibling, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-29 12:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Ben Hutchings, netdev, Robert Olsson

On Wed, 2009-04-29 at 10:52 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 28 Apr 2009, Ben Hutchings wrote:

> > It may also be worth checking that the sensors are actually reading
> > correctly.  You can read them using the "sensors" command from the
> > lm_sensors package or by running:
> >
> >    grep . /sys/class/net/eth88/device/i2c-adapter:*/*-002e/temp*_input
> >
> > Divide the numbers by 1000 to get temperatures in degrees celsius.
> 
> /sys/class/net/eth88/device/i2c-adapter:i2c-1/1-002e/temp1_input:61000
> /sys/class/net/eth88/device/i2c-adapter:i2c-1/1-002e/temp2_input:70000
> 
> Guess the temperature is fairly high... I can touch it with a finger 
> without getting burned.  

> I also have a 10GbE Sun Neptune (niu) in the 
> machine, if I touch the heatsink on that I burn my finger, so that is even 
> hotter...

I got my hands on a InfraRed Thermometer.
The Sun NIC was 98 degrees celsius!!!


> > The "internal" temperature sensor (temp1_input) is in the LM87, which is
> > placed at the corner of the board away from the bracket and the edge
> > connector.

Measuring that point gives the correct measurement using the external
infrared thermometer.

> > The "external" temperature sensor (temp2_input) is in the SFC4000.

It was harder to measure the temperature on the NIC CPU because of its
blank surface.  I can get a reading of 66 Celcius where the sensor says
68-69 Celcius.

The XFP it self of "only" 55 Celcius.  Measured by quickly pulling it
out and measuring (and taking care of measuring on a non blank surface
point).

> I though this was read from the XFP.  I was hoping the NIC supported 
> reading i2c stuff from the XFP.  Does it support that? (Robert Olsson 
> wanted to play with this stuff)

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
  2009-04-29  8:52     ` Jesper Dangaard Brouer
  2009-04-29 12:11       ` Jesper Dangaard Brouer
@ 2009-04-29 12:47       ` Ben Hutchings
  1 sibling, 0 replies; 11+ messages in thread
From: Ben Hutchings @ 2009-04-29 12:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Jesper Dangaard Brouer, netdev, Robert Olsson

On Wed, 2009-04-29 at 10:52 +0200, Jesper Dangaard Brouer wrote:
[...]
> > The "internal" temperature sensor (temp1_input) is in the LM87, which is
> > placed at the corner of the board away from the bracket and the edge
> > connector.
> >
> > The "external" temperature sensor (temp2_input) is in the SFC4000.
> 
> I though this was read from the XFP.

No.

> I was hoping the NIC supported 
> reading i2c stuff from the XFP.  Does it support that? (Robert Olsson 
> wanted to play with this stuff)

That's connected to the PHY and not the controller.  According to the
datasheet for the QT2022, you can read out the EEPROM from MDIO
registers 1.32775-1.33030 (ignoring bits 8-15).  Use the MII ioctl
interface and set phy_id to 0x441 (the formula is 0x400 | (prtad << 5) |
devad).

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sfc: Make temperature warnings/alarms more explicit.
  2009-04-28 14:48     ` [PATCH] sfc: Make temperature warnings/alarms more explicit Jesper Dangaard Brouer
@ 2009-04-30  0:50       ` David Miller
  2009-04-30  1:25       ` Ben Hutchings
  1 sibling, 0 replies; 11+ messages in thread
From: David Miller @ 2009-04-30  0:50 UTC (permalink / raw)
  To: jdb; +Cc: bhutchings, netdev

From: Jesper Dangaard Brouer <jdb@comx.dk>
Date: Tue, 28 Apr 2009 16:48:04 +0200

> 
> The sfc driver can detect different hardware failures via the
> LM87 system.  One of the failures I have experienced is the
> temperature alarm, but the error message didn't reveal that this
> error was temperature related.  I had to read the code to
> discover that.
> 
> I think that the temperature error should be more explicit, in
> order to warn people before the board is permanently damaged.
> 
> Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>

Ben, ACK or something?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sfc: Make temperature warnings/alarms more explicit.
  2009-04-28 14:48     ` [PATCH] sfc: Make temperature warnings/alarms more explicit Jesper Dangaard Brouer
  2009-04-30  0:50       ` David Miller
@ 2009-04-30  1:25       ` Ben Hutchings
  2009-04-30  8:44         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2009-04-30  1:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev

On Tue, 2009-04-28 at 16:48 +0200, Jesper Dangaard Brouer wrote:
> The sfc driver can detect different hardware failures via the
> LM87 system.  One of the failures I have experienced is the
> temperature alarm, but the error message didn't reveal that this
> error was temperature related.  I had to read the code to
> discover that.
> 
> I think that the temperature error should be more explicit, in
> order to warn people before the board is permanently damaged.

You are right, but...

> diff --git a/drivers/net/sfc/boards.c b/drivers/net/sfc/boards.c
> index 4a4c74c..b1822fe 100644
> --- a/drivers/net/sfc/boards.c
> +++ b/drivers/net/sfc/boards.c
> @@ -121,8 +121,10 @@ static int efx_check_lm87(struct efx_nic *efx, unsigned mask)
>  	if (alarms1 || alarms2) {
>  		EFX_ERR(efx,
>  			"LM87 detected a hardware failure (status %02x:%02x)"
> -			"%s%s\n",
> +			"%s%s%s\n",
>  			alarms1, alarms2,
> +			(alarms1 & (LM87_ALARM_TEMP_INT|LM87_ALARM_TEMP_EXT1))
> +			 ? " high temperature" : "",
>  			(alarms1 & LM87_ALARM_TEMP_INT) ? " INTERNAL" : "",
>  			(alarms1 & LM87_ALARM_TEMP_EXT1) ? " EXTERNAL" : "");
>  		return -ERANGE;

We could be more explicit still.  How about:

		EFX_ERR(efx,
			"%s out of range (LM87 status %02x:%02x)\n",
			(alarms1 & LM87_ALARM_TEMP_INT) ? "Board temperature" :
			(alarms1 & LM87_ALARM_TEMP_EXT1) ? "Controller temperature :
			"Voltage",
			alarms1, alarms2);

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sfc: Make temperature warnings/alarms more explicit.
  2009-04-30  1:25       ` Ben Hutchings
@ 2009-04-30  8:44         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-30  8:44 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev

On Thu, 2009-04-30 at 02:25 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 16:48 +0200, Jesper Dangaard Brouer wrote:
> > The sfc driver can detect different hardware failures via the
> > LM87 system.  One of the failures I have experienced is the
> > temperature alarm, but the error message didn't reveal that this
> > error was temperature related.  I had to read the code to
> > discover that.
> > 
> > I think that the temperature error should be more explicit, in
> > order to warn people before the board is permanently damaged.
> 
> You are right, but...
> 
> > diff --git a/drivers/net/sfc/boards.c b/drivers/net/sfc/boards.c
> > index 4a4c74c..b1822fe 100644
> > --- a/drivers/net/sfc/boards.c
> > +++ b/drivers/net/sfc/boards.c
> > @@ -121,8 +121,10 @@ static int efx_check_lm87(struct efx_nic *efx, unsigned mask)
> >  	if (alarms1 || alarms2) {
> >  		EFX_ERR(efx,
> >  			"LM87 detected a hardware failure (status %02x:%02x)"
> > -			"%s%s\n",
> > +			"%s%s%s\n",
> >  			alarms1, alarms2,
> > +			(alarms1 & (LM87_ALARM_TEMP_INT|LM87_ALARM_TEMP_EXT1))
> > +			 ? " high temperature" : "",
> >  			(alarms1 & LM87_ALARM_TEMP_INT) ? " INTERNAL" : "",
> >  			(alarms1 & LM87_ALARM_TEMP_EXT1) ? " EXTERNAL" : "");
> >  		return -ERANGE;
> 
> We could be more explicit still.  How about:

Yes, the "INTERNAL" and "EXTERNAL" is not very descriptive.

> 
> 		EFX_ERR(efx,
> 			"%s out of range (LM87 status %02x:%02x)\n",
> 			(alarms1 & LM87_ALARM_TEMP_INT) ? "Board temperature" :
> 			(alarms1 & LM87_ALARM_TEMP_EXT1) ? "Controller temperature :
> 			"Voltage",

Notice that both LM87_ALARM_TEMP_INT and LM87_ALARM_TEMP_EXT1 can be set
at the same time.  In that case we only print "Board temperature", is
that good enough?  (I thinks its okay, as the status will provide
developers enough info for debugging the event)

> 			alarms1, alarms2);

I would like to emphesize that its a hardware alarm, by adding "HW
alarm:" see revised patch...

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH] sfc: Make temperature warnings/alarms more explicit.

The sfc driver can detect different hardware failures via the
LM87 system.  One of the failures I have experienced is the
temperature alarm, but the error message didn't reveal that this
error was temperature related.  I had to read the code to
discover that.

I think that the temperature error should be more explicit, in
order to warn people before the board is permanently damaged.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 drivers/net/sfc/boards.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)


diff --git a/drivers/net/sfc/boards.c b/drivers/net/sfc/boards.c
index 4a4c74c..223866c 100644
--- a/drivers/net/sfc/boards.c
+++ b/drivers/net/sfc/boards.c
@@ -120,11 +120,11 @@ static int efx_check_lm87(struct efx_nic *efx, unsigned mask)
 	alarms2 &= mask >> 8;
 	if (alarms1 || alarms2) {
 		EFX_ERR(efx,
-			"LM87 detected a hardware failure (status %02x:%02x)"
-			"%s%s\n",
-			alarms1, alarms2,
-			(alarms1 & LM87_ALARM_TEMP_INT) ? " INTERNAL" : "",
-			(alarms1 & LM87_ALARM_TEMP_EXT1) ? " EXTERNAL" : "");
+			"HW alarm: %s out of range (LM87 status %02x:%02x)\n",
+			(alarms1 & LM87_ALARM_TEMP_INT) ? "Board temperature" :
+			(alarms1 & LM87_ALARM_TEMP_EXT1) ? "Controller temperature" :
+			 "Voltage",
+			alarms1, alarms2);
 		return -ERANGE;
 	}
 


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-04-30  8:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-28  9:36 Driver SFC: Possible bug in LM87 temperature XFP detection code Jesper Dangaard Brouer
2009-04-28 13:36 ` Ben Hutchings
2009-04-28 14:44   ` Jesper Dangaard Brouer
2009-04-28 14:48     ` [PATCH] sfc: Make temperature warnings/alarms more explicit Jesper Dangaard Brouer
2009-04-30  0:50       ` David Miller
2009-04-30  1:25       ` Ben Hutchings
2009-04-30  8:44         ` Jesper Dangaard Brouer
2009-04-28 17:04   ` Driver SFC: Possible bug in LM87 temperature XFP detection code Ben Hutchings
2009-04-29  8:52     ` Jesper Dangaard Brouer
2009-04-29 12:11       ` Jesper Dangaard Brouer
2009-04-29 12:47       ` Ben Hutchings

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.