netdev.vger.kernel.org archive mirror
* atlantic: weird hwmon temperature readings with AQC107 NIC (kernel 5.2/5.3)
From: Holger Hoffstätte @ 2019-09-24 14:16 UTC
  To: Netdev, Igor Russkikh

Hi,

I recently upgraded my home network with two AQC107-based NICs and a
multi-speed switch. Everything works great, but I couldn't help but notice
very weird hwmon temperature output (which I wanted to use for monitoring
and alerting).

Both cards identify as:

$ lspci -v -s 06:00.0
06:00.0 Ethernet controller: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 02)
	Subsystem: ASUSTeK Computer Inc. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]

On one machine lm_sensors says:

eth0-pci-0200
Adapter: PCI adapter
PHY Temperature: +315.1°C

This seems quite wrong since the card is only slightly warm to the touch, and
315.1 is exactly 255 + 60.1 - the latter value feels more like the actual
temperature.

On a second machine it says:

eth0-pci-0600
Adapter: PCI adapter
PHY Temperature: +6977.0°C

I feel qualified to say that is definitely wrong as well, since the machine is
currently not melting its way to the earth's core, and also only slightly warm
to the touch. :)

Both cards also reported wrong values with kernel 5.2, but since I'm on 5.3.1
I might as well report the current wrongness.

Do we know who's to blame here - motherboards, NICs, driver, kernel, hwmon
infrastructure? I believe the hwmon patches landed first in 5.2.

Thanks,
Holger

* Re: atlantic: weird hwmon temperature readings with AQC107 NIC (kernel 5.2/5.3)
From: Holger Hoffstätte @ 2019-09-24 14:30 UTC
  To: Netdev, Igor Russkikh

Another observation: the hwmon output immediately becomes sane (~58°C)
when I take the link down with ifconfig. As soon as I bring the link back
up, the temperature jumps from 58°C to 6976°C within one second.
It seems that the presence of the carrier somehow mangles the sensor
readings. I hope this helps to find the issue.

thanks,
Holger

* Re: atlantic: weird hwmon temperature readings with AQC107 NIC (kernel 5.2/5.3)
From: Igor Russkikh @ 2019-09-24 14:32 UTC
  To: Holger Hoffstätte, Netdev



Hi Holger,

Thanks for the report.

We recently found this out as well. Could you try applying the following patch:

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
index da726489e3c8..08b026b41571 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
@@ -337,7 +337,7 @@ static int aq_fw2x_get_phy_temp(struct aq_hw_s *self, int *temp)
        /* Convert PHY temperature from 1/256 degree Celsius
         * to 1/1000 degree Celsius.
         */
-       *temp = temp_res  * 1000 / 256;
+       *temp = (temp_res & 0xFFFF)  * 1000 / 256;

        return 0;
 }



The funny thing is that this value is read out of HW memory. All reads are
done in full dwords, but the value is only one word wide. The high word of
that dword is the estimated cable length. With the short cables we use, it is
often zero ;)

As I see from your readings, your cables are a bit longer :)

This also explains why the temperature is good when you take the interface down.
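
To make the effect concrete, here is a minimal C sketch of the conversion before and after the patch. The function names and the make_dword() helper are hypothetical and not part of the driver. Only the arithmetic follows the diff above, and the dword layout (temperature in the low word, estimated cable length in the high word) is taken from the description above.

```c
#include <stdint.h>

/* Hypothetical helper: build a synthetic dword as described in this thread,
 * with the temperature (in 1/256 degree Celsius) in the low word and some
 * other value (here: the cable length estimate) in the high word.
 */
uint32_t make_dword(uint16_t high_word, uint16_t temp_1_256)
{
	return ((uint32_t)high_word << 16) | temp_1_256;
}

/* Pre-patch arithmetic: the whole dword is scaled to millidegrees, so any
 * non-zero high word leaks into the result. Very large high words would
 * wrap the 32-bit product; this sketch does not try to model that case.
 */
int32_t phy_temp_mdeg_unmasked(uint32_t temp_res)
{
	return (int32_t)(temp_res * 1000u / 256u);
}

/* Post-patch arithmetic: only the low word is treated as temperature. */
int32_t phy_temp_mdeg_masked(uint32_t temp_res)
{
	return (int32_t)((temp_res & 0xFFFFu) * 1000u / 256u);
}
```

For a synthetic dword with a high word of 27 and a low word of 65 * 256, the unmasked version returns 6977000 (6977.0°C) and the masked version returns 65000 (65.0°C). Each unit in the high word adds exactly 256°C to the unmasked result.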

Regards,
  Igor

* Re: atlantic: weird hwmon temperature readings with AQC107 NIC (kernel 5.2/5.3)
From: Holger Hoffstätte @ 2019-09-24 14:55 UTC
  To: Igor Russkikh, Netdev

On 9/24/19 4:32 PM, Igor Russkikh wrote:
> We've recently found out that as well, could you try applying the following patch:
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
> index da726489e3c8..08b026b41571 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
> @@ -337,7 +337,7 @@ static int aq_fw2x_get_phy_temp(struct aq_hw_s *self, int *temp)
>          /* Convert PHY temperature from 1/256 degree Celsius
>           * to 1/1000 degree Celsius.
>           */
> -       *temp = temp_res  * 1000 / 256;
> +       *temp = (temp_res & 0xFFFF)  * 1000 / 256;
> 
>          return 0;
>   }

Well, what do you know!

eth0-pci-0600
Adapter: PCI adapter
PHY Temperature:  +63.5°C

This looks like it aligns with reality. :-)
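
The earlier bogus readings are consistent with a non-zero high word. Each unit of the high word contributes exactly 256000 millidegrees to the unmasked result, so a reading can be split back into its two parts. The sketch below is only an illustration: the struct and function names are hypothetical, and it assumes the reading came from the unpatched formula without 32-bit overflow.

```c
#include <stdint.h>

/* Hypothetical result type for splitting a pre-patch reading. */
struct split_reading {
	uint32_t high_word;  /* value that leaked in from the upper 16 bits */
	uint32_t temp_mdeg;  /* remaining temperature in millidegrees Celsius */
};

/* One unit in the high word is 65536 / 256 = 256 degrees, i.e. 256000
 * millidegrees. The low word alone always stays below that, so integer
 * division and remainder separate the two parts.
 */
struct split_reading split_bogus_reading(uint32_t reading_mdeg)
{
	struct split_reading r;

	r.high_word = reading_mdeg / 256000u;
	r.temp_mdeg = reading_mdeg % 256000u;
	return r;
}
```

Taking the displayed values as millidegrees, 6977000 splits into a high word of 27 and 65.0°C, and 315100 into a high word of 1 and 59.1°C. The unit of the cable length estimate is not stated in this thread.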

> Funny thing is that this value gets readout from HW memory, all the readouts are
> done by full dwords, but the value is only word width. High word of that dword

I suspected upper-word-garbage but of course don't know internals.

> is estimated cable length. On short cables we use it is often zero ;)

:D

When you send the proper patch feel free to add:
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>

Thanks again for the quick help!

Holger
