* [lm-sensors] Ticket #2382
@ 2013-11-18 18:56 Mike Gilbert
  2013-11-18 22:39 ` Guenter Roeck
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: Mike Gilbert @ 2013-11-18 18:56 UTC (permalink / raw)
  To: lm-sensors


Do you have any additional information pertaining to ticket 2382?

The CPU card we use in our products is going end-of-life. The CPU card 
vendor sent us a new card that is supposed to be a drop-in replacement 
(it's the same card with a newer Atom chip). The new card returns an 
error when reading the coretemp:

    # cat /sys/bus/platform/devices/coretemp.0/temp2_input
    cat: read error: Resource temporarily unavailable
    #
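
For scripts that poll this file, the failure above can be handled
explicitly. "Resource temporarily unavailable" is the text for EAGAIN, and
hwmon temperature files report millidegrees Celsius. The sketch below is only
an illustration: the function name is made up here, and the path is the one
from the command above.

```python
import os


def read_temp_celsius(path):
    """Return the temperature in degrees C, or None if the read fails with EAGAIN."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # hwmon values are short ASCII integers in millidegrees C.
        raw = os.read(fd, 32)
    except BlockingIOError:
        # errno EAGAIN: the driver has no valid reading right now.
        return None
    finally:
        os.close(fd)
    return int(raw.decode().strip()) / 1000.0


if __name__ == "__main__":
    path = "/sys/bus/platform/devices/coretemp.0/temp2_input"
    if os.path.exists(path):
        print(read_temp_celsius(path))
```

A caller can then treat None as "no reading available" instead of aborting.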

Some printk debugging yields:

    ENTER show_temp
    status_reg @ 19C
    eax = 8620000 edx = 0
    temp = 0 valid = 0
    EXIT show_temp

This looks like the same issue described in your ticket 2382.
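
The register dump above can be decoded by hand. MSR 0x19C is
IA32_THERM_STATUS; per the Intel SDM, bit 31 is the "reading valid" flag and
bits 22:16 hold the digital readout (degrees below TjMax). The sketch below
applies that layout to the logged eax value; the helper name is made up for
illustration.

```python
READING_VALID = 1 << 31   # bit 31: reading valid
READOUT_SHIFT = 16        # bits 22:16: digital readout, degrees below TjMax
READOUT_MASK = 0x7F


def decode_therm_status(eax):
    """Split the low 32 bits of IA32_THERM_STATUS into (valid, readout)."""
    valid = bool(eax & READING_VALID)
    readout = (eax >> READOUT_SHIFT) & READOUT_MASK
    return valid, readout


valid, readout = decode_therm_status(0x08620000)
print(valid, readout)  # False 98
```

With the valid bit clear, the readout field is not trustworthy, which is
consistent with the read error shown above.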

Any information you can provide will be appreciated.

Mike Gilbert
Principal Engineer
Bay Microsystems, Inc.


_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
@ 2013-11-18 22:39 ` Guenter Roeck
  2013-11-19  7:51 ` Jean Delvare
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-18 22:39 UTC (permalink / raw)
  To: lm-sensors

Which Atom chip? Can you provide the output of /proc/cpuinfo?

Thanks,
Guenter


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
  2013-11-18 22:39 ` Guenter Roeck
@ 2013-11-19  7:51 ` Jean Delvare
  2013-11-19 14:33 ` Guenter Roeck
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-19  7:51 UTC (permalink / raw)
  To: lm-sensors

Hi Mike,

On Mon, 18 Nov 2013 13:56:28 -0500, Mike Gilbert wrote:
> 
> Do you have any additional information pertaining to ticket 2382?
> 
> The CPU card we use in our products is going end-of-life. The CPU card 
> vendor sent us a new card that is supposed to be a drop-in replacement 
> (it's the same card with a newer Atom chip). The new card returns an 
> error when reading the coretemp:
> 
>     # cat /sys/bus/platform/devices/coretemp.0/temp2_input
>     cat: read error: Resource temporarily unavailable
>     #
> 
> Some printk debugging yields:
> 
>     ENTER show_temp
>     status_reg @ 19C
>     eax = 8620000 edx = 0
>     temp = 0 valid = 0
>     EXIT show_temp
> 
> This looks like the same issue described in your ticket 2382.

Indeed.

> Any information you can provide will be appreciated.

Unfortunately, no, I do not have any additional information about this
issue. I can only recommend using the external hardware monitoring chip
(if there is one) instead of coretemp to monitor the temperature of
these CPUs.

-- 
Jean Delvare


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
  2013-11-18 22:39 ` Guenter Roeck
  2013-11-19  7:51 ` Jean Delvare
@ 2013-11-19 14:33 ` Guenter Roeck
  2013-11-19 15:04 ` Mike Gilbert
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-19 14:33 UTC (permalink / raw)
  To: lm-sensors

On 11/19/2013 06:05 AM, Mike Gilbert wrote:
> Guenter,
>
> Thanks for responding. The cards are both made by Emerson. The old one is a COMX-430. The new one is a COMX-440.
>
> Mike
>
>
> Here's the info from the old CPU card:
>
> processor    : 0
> vendor_id    : GenuineIntel
> cpu family    : 15
> model        : 4
> model name    : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping    : 3
> cpu MHz        : 3000.000
> cache size    : 2048 KB
> physical id    : 0
> siblings    : 2
> core id        : 0
> cpu cores    : 1
> apicid        : 0
> initial apicid    : 0
> fpu        : yes
> fpu_exception    : yes
> cpuid level    : 5
> wp        : yes
> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr
> bogomips    : 5986.15
> clflush size    : 64
> cache_alignment    : 128
> address sizes    : 36 bits physical, 48 bits virtual
> power management:
>
> processor    : 1
> vendor_id    : GenuineIntel
> cpu family    : 15
> model        : 4
> model name    : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping    : 3
> cpu MHz        : 3000.000
> cache size    : 2048 KB
> physical id    : 0
> siblings    : 2
> core id        : 0
> cpu cores    : 1
> apicid        : 1
> initial apicid    : 1
> fpu        : yes
> fpu_exception    : yes
> cpuid level    : 5
> wp        : yes
> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr
> bogomips    : 5985.19
> clflush size    : 64
> cache_alignment    : 128
> address sizes    : 36 bits physical, 48 bits virtual
> power management:
>
>
> Here's the info from the new CPU card:
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 28
> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz

Yes, Jean was right, same issue, same response.

One thing to try might be to see what happens
if you put the system under load, i.e. heat up the CPU.
Can you try that?

Thanks,
Guenter

> stepping        : 10
> microcode       : 0x107
> cpu MHz         : 1662.657
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 2
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dtherm
> bogomips        : 3325.31
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 1
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 28
> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
> stepping        : 10
> microcode       : 0x107
> cpu MHz         : 1662.657
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 2
> apicid          : 1
> initial apicid  : 1
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dtherm
> bogomips        : 3325.31
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 2
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 28
> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
> stepping        : 10
> microcode       : 0x107
> cpu MHz         : 1662.657
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 1
> cpu cores       : 2
> apicid          : 2
> initial apicid  : 2
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dtherm
> bogomips        : 3325.31
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 3
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 28
> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
> stepping        : 10
> microcode       : 0x107
> cpu MHz         : 1662.657
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 1
> cpu cores       : 2
> apicid          : 3
> initial apicid  : 3
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dtherm
> bogomips        : 3325.31
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:

* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (2 preceding siblings ...)
  2013-11-19 14:33 ` Guenter Roeck
@ 2013-11-19 15:04 ` Mike Gilbert
  2013-11-19 16:38 ` Guenter Roeck
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Mike Gilbert @ 2013-11-19 15:04 UTC (permalink / raw)
  To: lm-sensors


Guenter,

We're evaluating the new card in an open chassis. It is on the test bench 
with a table fan for cooling. I turned off the fan and got:

     ENTER show_temp
     cpu 0 (0)
     status_reg @ 19C
     eax = 885E0000 edx = 0
     temp = 1770 valid = 1
     EXIT show_temp
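
This second dump can be turned into a temperature. The sketch below makes two
assumptions that the output does not state: TjMax is 100 degrees C for this
chip, and the debug printk printed temp in hexadecimal. Under those
assumptions the numbers line up (0x1770 is 6000 millidegrees).

```python
TJMAX_MILLI_C = 100_000  # assumption: TjMax of 100 degrees C for this CPU


def therm_status_to_milli_c(eax, tjmax=TJMAX_MILLI_C):
    """Convert IA32_THERM_STATUS (low 32 bits) to millidegrees C, or None if invalid."""
    if not eax & (1 << 31):           # bit 31: reading valid
        return None
    readout = (eax >> 16) & 0x7F      # bits 22:16: degrees below TjMax
    return tjmax - readout * 1000


temp = therm_status_to_milli_c(0x885E0000)
print(temp, hex(temp))  # 6000 0x1770
```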

It seems like you've seen this before. What's going on?

Thanks,
Mike



* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (3 preceding siblings ...)
  2013-11-19 15:04 ` Mike Gilbert
@ 2013-11-19 16:38 ` Guenter Roeck
  2013-11-19 17:18 ` Jean Delvare
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-19 16:38 UTC (permalink / raw)
  To: lm-sensors

On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> 
> Guenter,
> 
> We're evaluating the new card in an open chassis. It is on the test
> bench with a table fan for cooling. I turned off the fan and got:
> 
>     ENTER show_temp
>     cpu 0 (0)
>     status_reg @ 19C
>     eax = 885E0000 edx = 0
>     temp = 1770 valid = 1
>     EXIT show_temp
> 
> It seems like you've seen this before. What's going on?
> 
No, I was just throwing darts at a wall with my eyes closed.
Seriously, it was just a wild guess. The idea was that the valid bit may be 0
if the temperature is too low to be even remotely close to the maximum.
For this chip, just to give you an example, the datasheet says that any
reported temperature below 50 degrees C only means that the temperature
is below 50 degrees C.

Jean, any idea what we can do about this? Report X degrees C (some constant
below TjMax) if valid is 0?
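
To make the proposal concrete, here is a rough sketch of the idea in Python
rather than driver code. It is not the coretemp implementation; the function
name, the TjMax default and the floor value are placeholders chosen only for
illustration.

```python
def temp_or_floor(eax, tjmax_milli_c=100_000, floor_milli_c=40_000):
    """Return millidegrees C, substituting a constant when the valid bit is 0.

    floor_milli_c is an arbitrary placeholder for "X degrees C"; a real value
    would have to come from the datasheet of each affected CPU model.
    """
    if eax & (1 << 31):                    # reading valid
        readout = (eax >> 16) & 0x7F       # degrees below TjMax
        return tjmax_milli_c - readout * 1000
    return floor_milli_c                   # invalid: report the constant


print(temp_or_floor(0x08620000))  # 40000 (valid bit clear, constant reported)
print(temp_or_floor(0x885E0000))  # 6000  (valid reading passed through)
```

A constant like this hides the distinction between "too cold to measure" and
a genuine read failure.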

Guenter


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (4 preceding siblings ...)
  2013-11-19 16:38 ` Guenter Roeck
@ 2013-11-19 17:18 ` Jean Delvare
  2013-11-19 17:24 ` Mike Gilbert
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-19 17:18 UTC (permalink / raw)
  To: lm-sensors

Hi Guenter, Mike,

On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > 
> > Guenter,
> > 
> > We're evaluating the new card in an open chassis. It is on the test
> > bench with a table fan for cooling. I turned off the fan and got:
> > 
> >     ENTER show_temp
> >     cpu 0 (0)
> >     status_reg @ 19C
> >     eax = 885E0000 edx = 0
> >     temp = 1770 valid = 1
> >     EXIT show_temp
> > 
> > It seems like you've seen this before. What's going on?
> 
> No, I was just throwing darts at a wall with my eyes closed.

Oh, you thought that was a wall? :D

> Seriously, it was just a wild guess. The idea was that the valid bit may be 0
> if the temperature is too low to be even remotely close to the maximum.

That was my theory in ticket #2382, indeed. It was never tested until
today I think, thanks Mike for doing that.

> For this chip, just to give you an example, the datasheet says that any
> reported temperature below 50 degrees C only means that the temperature
> is below 50 degrees C.

That's a start... I didn't know it was documented. Is it documented for
all CPU models? If we can gather the values at least for all affected
Atom CPU models (as I suppose the value will vary per model) we could
tweak something in the driver.

> Jean, any idea what we can do about this? Report X degrees C (some constant
> below TjMax) if valid is 0?

Well well, we don't really have a sane way to transmit the information
("temperature is below X") down to the monitoring applications. The
sysfs interface has no provision for it, libsensors wouldn't handle it
and "sensors" wouldn't either, of course.

We could hard-code an arbitrarily low temperature as you suggest,
however I'm not sure if we want to do it for all CPU models or only the
ones listed in ticket #2382. My concern is that the Intel specification
doesn't limit "valid = 0" to too low temperature values. They don't
give any detail, so assuming that "too low" is the only reason seems
weird. I remember we saw transient errors on coretemp readings in the
past, but I can't remember if that was on these Atom models (i.e. just
another incarnation of ticket #2382) or other CPU models. I'm afraid we
may start reporting temperature values instead of actual errors if the
fix-up is too broad.

Either way, the current situation is rather bad, as "N/A" looks more
like "it's broken" than "it's cold". So I have no objection to crafting
"something" into the driver to make it look better, if you are
motivated to give it a try.

If you are even more motivated and want to extend the sysfs to properly
report the situation to user-space, feel free to do that as well. I
volunteer to review any kernel patch related to this, and to write the
user-space code to deal with it. I'm just not sure it's worth the
effort for just 3 CPU models.

Thanks,
-- 
Jean Delvare


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (5 preceding siblings ...)
  2013-11-19 17:18 ` Jean Delvare
@ 2013-11-19 17:24 ` Mike Gilbert
  2013-11-19 17:53 ` Guenter Roeck
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Mike Gilbert @ 2013-11-19 17:24 UTC (permalink / raw)
  To: lm-sensors


Guenter,

I think we understand this better now. Here is a table of diode temp & 
core temp:

diode core
   43  read error
   44    6
   45    7
   46   10
   48   13
   49   14
   50   15
   55   24
   66   40
   73   52
   76   58
   83   67
thermal shutdown

We got these readings by heating the CPU card.

From
http://www.lm-sensors.org/wiki/FAQ/Chapter3#coretempreturnsunrealisticvalues:
> The temperature value returned by the coretemp driver isn't absolute. 
> It's a thermal margin from the critical limit, and the greater the 
> margin, the worse the accuracy.

It appears that the new CPU has a very different core temperature 
profile from the previous CPU.

Mike


On 11/19/2013 11:38 AM, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
>> Guenter,
>>
>> We're evaluating the new card in a open chassis. It is on the test
>> bench with a table fan for cooling. I turned off the fan and got:
>>
>>      ENTER show_temp
>>      cpu 0 (0)
>>      status_reg @ 19C
>>      eax = 885E0000 edx = 0
>>      temp = 1770 valid = 1
>>      EXIT show_temp
>>
>> It seems like you've seen this before. What's going on?
>>
> No, I was just throwing darts at a wall with my eyes closed.
> Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> if the temperature is too low to be even remotely close to the maximum.
> For this chip, just to give you an example, the datasheet says that any
> reported temperature below 50 degrees C only means that the temperature
> is below 50 degrees C.
>
> Jean, any idea what we can do about this ? Report X degrees C (some constant
> below TjMax) if valid is 0 ?
>
> Guenter
>
>> Thanks,
>> Mike
>>
>>
>> On 11/19/2013 09:33 AM, Guenter Roeck wrote:
>>> On 11/19/2013 06:05 AM, Mike Gilbert wrote:
>>>> Guenter,
>>>>
>>>> Thanks for responding. The cards are both made by Emerson. The
>>>> old one is a COMX-430. The new one is a COMX-440.
>>>>
>>>> Mike
>>>>
>>>>
>>>> Here's the info from the old CPU card:
>>>>
>>>> processor    : 0
>>>> vendor_id    : GenuineIntel
>>>> cpu family    : 15
>>>> model        : 4
>>>> model name    : Intel(R) Xeon(TM) CPU 3.00GHz
>>>> stepping    : 3
>>>> cpu MHz        : 3000.000
>>>> cache size    : 2048 KB
>>>> physical id    : 0
>>>> siblings    : 2
>>>> core id        : 0
>>>> cpu cores    : 1
>>>> apicid        : 0
>>>> initial apicid    : 0
>>>> fpu        : yes
>>>> fpu_exception    : yes
>>>> cpuid level    : 5
>>>> wp        : yes
>>>> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>>>> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
>>>> tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64
>>>> monitor ds_cpl cid cx16 xtpr
>>>> bogomips    : 5986.15
>>>> clflush size    : 64
>>>> cache_alignment    : 128
>>>> address sizes    : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> processor    : 1
>>>> vendor_id    : GenuineIntel
>>>> cpu family    : 15
>>>> model        : 4
>>>> model name    : Intel(R) Xeon(TM) CPU 3.00GHz
>>>> stepping    : 3
>>>> cpu MHz        : 3000.000
>>>> cache size    : 2048 KB
>>>> physical id    : 0
>>>> siblings    : 2
>>>> core id        : 0
>>>> cpu cores    : 1
>>>> apicid        : 1
>>>> initial apicid    : 1
>>>> fpu        : yes
>>>> fpu_exception    : yes
>>>> cpuid level    : 5
>>>> wp        : yes
>>>> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>>>> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
>>>> tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64
>>>> monitor ds_cpl cid cx16 xtpr
>>>> bogomips    : 5985.19
>>>> clflush size    : 64
>>>> cache_alignment    : 128
>>>> address sizes    : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>>
>>>> Here's the info from the new CPU card:
>>>>
>>>> processor       : 0
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 28
>>>> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
>>> Yes, Jean was right, same issue, same response.
>>>
>>> One thing to try might be to see what happens
>>> if you put the system under load, ie heat up the CPU.
>>> Can you try that ?
>>>
>>> Thanks,
>>> Guenter
>>>
>>>> stepping        : 10
>>>> microcode       : 0x107
>>>> cpu MHz         : 1662.657
>>>> cache size      : 512 KB
>>>> physical id     : 0
>>>> siblings        : 4
>>>> core id         : 0
>>>> cpu cores       : 2
>>>> apicid          : 0
>>>> initial apicid  : 0
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
>>>> mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
>>>> ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts
>>>> rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3
>>>> cx16 xtpr pdcm movbe lahf_lm dtherm
>>>> bogomips        : 3325.31
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> processor       : 1
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 28
>>>> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
>>>> stepping        : 10
>>>> microcode       : 0x107
>>>> cpu MHz         : 1662.657
>>>> cache size      : 512 KB
>>>> physical id     : 0
>>>> siblings        : 4
>>>> core id         : 0
>>>> cpu cores       : 2
>>>> apicid          : 1
>>>> initial apicid  : 1
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
>>>> mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
>>>> ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts
>>>> rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3
>>>> cx16 xtpr pdcm movbe lahf_lm dtherm
>>>> bogomips        : 3325.31
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> processor       : 2
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 28
>>>> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
>>>> stepping        : 10
>>>> microcode       : 0x107
>>>> cpu MHz         : 1662.657
>>>> cache size      : 512 KB
>>>> physical id     : 0
>>>> siblings        : 4
>>>> core id         : 1
>>>> cpu cores       : 2
>>>> apicid          : 2
>>>> initial apicid  : 2
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
>>>> mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
>>>> ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts
>>>> rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3
>>>> cx16 xtpr pdcm movbe lahf_lm dtherm
>>>> bogomips        : 3325.31
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> processor       : 3
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 28
>>>> model name      : Intel(R) Atom(TM) CPU D510   @ 1.66GHz
>>>> stepping        : 10
>>>> microcode       : 0x107
>>>> cpu MHz         : 1662.657
>>>> cache size      : 512 KB
>>>> physical id     : 0
>>>> siblings        : 4
>>>> core id         : 1
>>>> cpu cores       : 2
>>>> apicid          : 3
>>>> initial apicid  : 3
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep
>>>> mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
>>>> ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts
>>>> rep_good nopl aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3
>>>> cx16 xtpr pdcm movbe lahf_lm dtherm
>>>> bogomips        : 3325.31
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>>
>>>> On 11/18/2013 05:39 PM, Guenter Roeck wrote:
>>>>> What Atom chip ? Can you provide output of /proc/cpuinfo ?
>>>>>
>>>>> Thanks,
>>>>> Guenter
>>>>>
>>>>> On Mon, Nov 18, 2013 at 01:56:28PM -0500, Mike Gilbert wrote:
>>>>>> Do you have any additional information pertaining to ticket 2382?
>>>>>>
>>>>>> The CPU card we use in our products is going end-of-life. The CPU
>>>>>> card vendor send us a new card that is supposed to be a drop in
>>>>>> replacement (it's the same card with a newer Atom chip). The new
>>>>>> card returns an error when reading the coretemp:
>>>>>>
>>>>>>     # cat /sys/bus/platform/devices/coretemp.0/temp2_input
>>>>>>     cat: read error: Resource temporarily unavailable
>>>>>>     #
>>>>>>
>>>>>> Some printk debugging yields:
>>>>>>
>>>>>>     ENTER show_temp
>>>>>>     status_reg @ 19C
>>>>>>     eax = 8620000 edx = 0
>>>>>>     temp = 0 valid = 0
>>>>>>     EXIT show_temp
>>>>>>
>>>>>> This looks like the same issue described in your ticket 2382.
>>>>>>
>>>>>> Any information you can provide will be appreciated.
>>>>>>
>>>>>> Mike Gilbert
>>>>>> Principle Engineer
>>>>>> Bay Microsystems, Inc.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> lm-sensors mailing list
>>>>>> lm-sensors@lm-sensors.org
>>>>>> http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
>>>>>>
>>>>


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (6 preceding siblings ...)
  2013-11-19 17:24 ` Mike Gilbert
@ 2013-11-19 17:53 ` Guenter Roeck
  2013-11-19 19:23 ` Mike Gilbert
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-19 17:53 UTC (permalink / raw)
  To: lm-sensors

On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
> Hi Guenter, Mike,
> 
> On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> > On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > > 
> > > Guenter,
> > > 
> > > We're evaluating the new card in a open chassis. It is on the test
> > > bench with a table fan for cooling. I turned off the fan and got:
> > > 
> > >     ENTER show_temp
> > >     cpu 0 (0)
> > >     status_reg @ 19C
> > >     eax = 885E0000 edx = 0
> > >     temp = 1770 valid = 1
> > >     EXIT show_temp
> > > 
> > > It seems like you've seen this before. What's going on?
> > 
> > No, I was just throwing darts at a wall with my eyes closed.
> 
> Oh, you thought that was a wall? :D
> 
> > Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> > if the temperature is too low to be even remotely close to the maximum.
> 
> That was my theory in ticket #2382, indeed. It was never tested until
> today I think, thanks Mike for doing that.
> 
> > For this chip, just to give you an example, the datasheet says that any
> > reported temperature below 50 degrees C only means that the temperature
> > is below 50 degrees C.
> 
> That's a start... I didn't know it was documented. Is it documented for
> all CPU models? If we can gather the values at least for all affected

Uuh ... I didn't say it was documented. If it is, I don't know about it.
As I said, it was just a wild guess.... even without reading your comment
on the ticket.

> Atom CPU models (as I suppose the value will vary per model) we could
> tweak something in the driver.
> 
> > Jean, any idea what we can do about this ? Report X degrees C (some constant
> > below TjMax) if valid is 0 ?
> 
> Well well, we don't really have a sane way to transmit the information
> ("temperature is below X") down to the monitoring applications. The
> sysfs interface has no provision for it, libsensors wouldn't handle it
> and "sensors" wouldn't either, of course.
> 
> We could hard-code an arbitrarily low temperature as you suggest,
> however I'm not sure if we want to do it for all CPU models or only the
> ones listed in ticket #2382. My concern is that the Intel specification
> doesn't limit "valid = 0" to too low temperature values. They don't
> give any detail, so assuming that "too low" is the only reason seems
> weird. I remember we saw transient errors on coretemp readings in the
> past, but I can't remember if that was on these Atom models (i.e. just
> another incarnation of ticket #2382) or other CPU models. I'm afraid we
> may start reporting temperature values instead of actual errors if the
> fix-up is too broad.
> 
> Either way, the current situation is rather bad, as "N/A" looks more
> like "it's broken" than "it's cold". So I have no objection to crafting
> "something" into the driver to make it look better, if you are
> motivated to give it a try.
> 
> If you are even more motivated and want to extend the sysfs to properly
> report the situation to user-space, feel free to do that as well. I
> volunteer to review any kernel patch related to this, and to write the
> user-space code to deal with it. I'm just not sure it's worth the
> effort for just 3 CPU models.
> 
I'd rather go with an exception table, or rather extend the existing tables.
It is probably somewhat safe to assume that the problem applies to all CPUs
with the same model/mask. Based on that we could declare a "tjmin" and
report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
numbers, would then be 36 degrees C (100 - 64).

If you are ok with that I'll submit a patch for it.
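
To make the idea concrete, here is a rough sketch of the lookup, written in
Python rather than driver C. The table name, the function name and the single
entry are made up for illustration, and the 36 is only the figure mentioned
above, not a verified value.

```python
from typing import Optional

# Hypothetical exception table: (model, stepping mask) -> tjmin in millidegrees C.
# The one entry below is a placeholder, not data taken from the driver.
TJMIN_BY_MODEL = {
    (0x1C, 10): 36000,  # D5xx
}


def reported_temp_mdeg(model: int, mask: int, valid: bool,
                       temp_mdeg: int) -> Optional[int]:
    """Return the temperature to report, or None to signal a read error."""
    if valid:
        # Normal case: the sensor reading can be used as is.
        return temp_mdeg
    # valid == 0: report tjmin only if one is defined for this model/mask,
    # otherwise keep returning an error as today.
    return TJMIN_BY_MODEL.get((model, mask))
```

CPUs without a table entry keep the current behaviour, so an invalid reading
on any other model would still show up as an error.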

Guenter


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (7 preceding siblings ...)
  2013-11-19 17:53 ` Guenter Roeck
@ 2013-11-19 19:23 ` Mike Gilbert
  2013-11-19 19:41 ` Jean Delvare
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Mike Gilbert @ 2013-11-19 19:23 UTC (permalink / raw)
  To: lm-sensors


On 11/19/2013 12:53 PM, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
>> Hi Guenter, Mike,
>>
>> On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
>>> On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
>>>> Guenter,
>>>>
>>>> We're evaluating the new card in a open chassis. It is on the test
>>>> bench with a table fan for cooling. I turned off the fan and got:
>>>>
>>>>      ENTER show_temp
>>>>      cpu 0 (0)
>>>>      status_reg @ 19C
>>>>      eax = 885E0000 edx = 0
>>>>      temp = 1770 valid = 1
>>>>      EXIT show_temp
>>>>
>>>> It seems like you've seen this before. What's going on?
>>> No, I was just throwing darts at a wall with my eyes closed.
>> Oh, you thought that was a wall? :D
>>
>>> Seriously, it was just a wild guess. Idea was that the valid bit may be 0
>>> if the temperature is too low to be even remotely close to the maximum.
>> That was my theory in ticket #2382, indeed. It was never tested until
>> today I think, thanks Mike for doing that.
>>
>>> For this chip, just to give you an example, the datasheet says that any
>>> reported temperature below 50 degrees C only means that the temperature
>>> is below 50 degrees C.
>> That's a start... I didn't know it was documented. Is it documented for
>> all CPU models? If we can gather the values at least for all affected
> Uuh ... I didn't say it was documented. If it is, I don't know about it.
> As I said, it was just a wild guess.... even without reading your comment
> on the ticket.
>
>> Atom CPU models (as I suppose the value will vary per model) we could
>> tweak something in the driver.
>>
>>> Jean, any idea what we can do about this ? Report X degrees C (some constant
>>> below TjMax) if valid is 0 ?
>> Well well, we don't really have a sane way to transmit the information
>> ("temperature is below X") down to the monitoring applications. The
>> sysfs interface has no provision for it, libsensors wouldn't handle it
>> and "sensors" wouldn't either, of course.
>>
>> We could hard-code an arbitrarily low temperature as you suggest,
>> however I'm not sure if we want to do it for all CPU models or only the
>> ones listed in ticket #2382. My concern is that the Intel specification
>> doesn't limit "valid = 0" to too low temperature values. They don't
>> give any detail, so assuming that "too low" is the only reason seems
>> weird. I remember we saw transient errors on coretemp readings in the
>> past, but I can't remember if that was on these Atom models (i.e. just
>> another incarnation of ticket #2382) or other CPU models. I'm afraid we
>> may start reporting temperature values instead of actual errors if the
>> fix-up is too broad.
>>
>> Either way, the current situation is rather bad, as "N/A" looks more
>> like "it's broken" than "it's cold". So I have no objection to crafting
>> "something" into the driver to make it look better, if you are
>> motivated to give it a try.
>>
>> If you are even more motivated and want to extend the sysfs to properly
>> report the situation to user-space, feel free to do that as well. I
>> volunteer to review any kernel patch related to this, and to write the
>> user-space code to deal with it. I'm just not sure it's worth the
>> effort for just 3 CPU models.
>>
> I'd rather go with an exception table, or rather extend the existing tables.
> It is probably somewhat safe to assume that the problem applies to all CPUs
> with the same model/mask. Based on that we could declare a "tjmin" and
> report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> numbers, would then be 36 degrees C (100 - 64).
>
> If you are ok with that I'll submit a patch for it.
>
> Guenter

I plotted the data and I think a fair approximation formula is:

Celsius = (((60/100) * return-value) + 40);

So temperatures less than 40 are reported as 40 and temperatures over 
100 cause thermal shut-down and it doesn't matter.
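
As a quick sanity check, here is the formula compared against the diode/core
table from my earlier mail, as a small Python sketch. The function names are
made up for this example. Over the measured points the formula stays within
about 3 degrees C of the diode value.

```python
# (diode temperature in degrees C, coretemp reading) pairs from the earlier table
READINGS = [(44, 6), (45, 7), (46, 10), (48, 13), (49, 14), (50, 15),
            (55, 24), (66, 40), (73, 52), (76, 58), (83, 67)]


def approx_celsius(reading: float) -> float:
    """Approximate the diode temperature from a coretemp reading."""
    # Floating point on purpose: in C, 60/100 with integer operands is 0.
    return 0.6 * reading + 40


def worst_error() -> float:
    """Largest absolute difference between the formula and the measured diode value."""
    return max(abs(approx_celsius(core) - diode) for diode, core in READINGS)
```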

Have fun,
Mike



* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (8 preceding siblings ...)
  2013-11-19 19:23 ` Mike Gilbert
@ 2013-11-19 19:41 ` Jean Delvare
  2013-11-19 21:14 ` Guenter Roeck
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-19 19:41 UTC (permalink / raw)
  To: lm-sensors

On Tue, 19 Nov 2013 09:53:51 -0800, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
> > Hi Guenter, Mike,
> > 
> > On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> > > On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > > > 
> > > > Guenter,
> > > > 
> > > > We're evaluating the new card in a open chassis. It is on the test
> > > > bench with a table fan for cooling. I turned off the fan and got:
> > > > 
> > > >     ENTER show_temp
> > > >     cpu 0 (0)
> > > >     status_reg @ 19C
> > > >     eax = 885E0000 edx = 0
> > > >     temp = 1770 valid = 1
> > > >     EXIT show_temp
> > > > 
> > > > It seems like you've seen this before. What's going on?
> > > 
> > > No, I was just throwing darts at a wall with my eyes closed.
> > 
> > Oh, you thought that was a wall? :D
> > 
> > > Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> > > if the temperature is too low to be even remotely close to the maximum.
> > 
> > That was my theory in ticket #2382, indeed. It was never tested until
> > today I think, thanks Mike for doing that.
> > 
> > > For this chip, just to give you an example, the datasheet says that any
> > > reported temperature below 50 degrees C only means that the temperature
> > > is below 50 degrees C.
> > 
> > That's a start... I didn't know it was documented. Is it documented for
> > all CPU models? If we can gather the values at least for all affected
> 
> Uuh ... I didn't say it was documented. If it is, I don't know about it.
> As I said, it was just a wild guess.... even without reading your comment
> on the ticket.

I must have misread you. What were you talking about when you said
"For this chip, just to give you an example, the datasheet says that
any reported temperature below 50 degrees C only means that the
temperature is below 50 degrees C"?

> > Atom CPU models (as I suppose the value will vary per model) we could
> > tweak something in the driver.
> > 
> > > Jean, any idea what we can do about this ? Report X degrees C (some constant
> > > below TjMax) if valid is 0 ?
> > 
> > Well well, we don't really have a sane way to transmit the information
> > ("temperature is below X") down to the monitoring applications. The
> > sysfs interface has no provision for it, libsensors wouldn't handle it
> > and "sensors" wouldn't either, of course.
> > 
> > We could hard-code an arbitrarily low temperature as you suggest,
> > however I'm not sure if we want to do it for all CPU models or only the
> > ones listed in ticket #2382. My concern is that the Intel specification
> > doesn't limit "valid = 0" to too low temperature values. They don't
> > give any detail, so assuming that "too low" is the only reason seems
> > weird. I remember we saw transient errors on coretemp readings in the
> > past, but I can't remember if that was on these Atom models (i.e. just
> > another incarnation of ticket #2382) or other CPU models. I'm afraid we
> > may start reporting temperature values instead of actual errors if the
> > fix-up is too broad.
> > 
> > Either way, the current situation is rather bad, as "N/A" looks more
> > like "it's broken" than "it's cold". So I have no objection to crafting
> > "something" into the driver to make it look better, if you are
> > motivated to give it a try.
> > 
> > If you are even more motivated and want to extend the sysfs to properly
> > report the situation to user-space, feel free to do that as well. I
> > volunteer to review any kernel patch related to this, and to write the
> > user-space code to deal with it. I'm just not sure it's worth the
> > effort for just 3 CPU models.
> 
> I'd rather go with an exception table, or rather extend the existing tables.
> It is probably somewhat safe to assume that the problem applies to all CPUs
> with the same model/mask. Based on that we could declare a "tjmin" and
> report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> numbers, would then be 36 degrees C (100 - 64).

Not sure where you drew the "36" from. From Mike's table it seems the
valid flag wears off when the reported temperature would be < 6°C. This
correlates with my findings in the ticket where the valid flag would be
0 for 1°C and 4°C.
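
For reference, a minimal Python sketch that decodes the two raw eax values
Mike posted. It assumes the documented IA32_THERM_STATUS layout (bit 31 is
"reading valid", bits 22:16 are the digital readout in degrees C below TjMax)
and a TjMax of 100 degrees C; the TjMax value and the function name are
assumptions made for this example.

```python
def decode_therm_status(eax: int, tjmax_c: int = 100):
    """Split the low word of IA32_THERM_STATUS (MSR 0x19C) into its parts.

    Returns (valid, readout, temp_c), where temp_c = tjmax_c - readout.
    tjmax_c defaults to an assumed 100 degrees C.
    """
    valid = bool(eax & (1 << 31))   # bit 31: reading valid
    readout = (eax >> 16) & 0x7F    # bits 22:16: degrees C below TjMax
    return valid, readout, tjmax_c - readout


# Raw values from Mike's debug output
first_report = decode_therm_status(0x08620000)  # fan on
fan_off = decode_therm_status(0x885E0000)       # fan off
```

Under those assumptions the first value decodes to a readout of 98 (2 degrees
C) with valid = 0, and the second to a readout of 94 (6 degrees C) with
valid = 1, which fits the threshold seen in the table.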

> If you are ok with that I'll submit a patch for it.

Yes I am.

-- 
Jean Delvare


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (9 preceding siblings ...)
  2013-11-19 19:41 ` Jean Delvare
@ 2013-11-19 21:14 ` Guenter Roeck
  2013-11-19 21:53 ` Guenter Roeck
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-19 21:14 UTC (permalink / raw)
  To: lm-sensors

On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
> On Tue, 19 Nov 2013 09:53:51 -0800, Guenter Roeck wrote:
> > On Tue, Nov 19, 2013 at 06:18:57PM +0100, Jean Delvare wrote:
> > > Hi Guenter, Mike,
> > >
> > > On Tue, 19 Nov 2013 08:38:40 -0800, Guenter Roeck wrote:
> > > > On Tue, Nov 19, 2013 at 10:04:08AM -0500, Mike Gilbert wrote:
> > > > >
> > > > > Guenter,
> > > > >
> > > > > We're evaluating the new card in a open chassis. It is on the test
> > > > > bench with a table fan for cooling. I turned off the fan and got:
> > > > >
> > > > >     ENTER show_temp
> > > > >     cpu 0 (0)
> > > > >     status_reg @ 19C
> > > > >     eax = 885E0000 edx = 0
> > > > >     temp = 1770 valid = 1
> > > > >     EXIT show_temp
> > > > >
> > > > > It seems like you've seen this before. What's going on?
> > > >
> > > > No, I was just throwing darts at a wall with my eyes closed.
> > >
> > > Oh, you thought that was a wall? :D
> > >
> > > > Seriously, it was just a wild guess. Idea was that the valid bit may be 0
> > > > if the temperature is too low to be even remotely close to the maximum.
> > >
> > > That was my theory in ticket #2382, indeed. It was never tested until
> > > today I think, thanks Mike for doing that.
> > >
> > > > For this chip, just to give you an example, the datasheet says that any
> > > > reported temperature below 50 degrees C only means that the temperature
> > > > is below 50 degrees C.
> > >
> > > That's a start... I didn't know it was documented. Is it documented for
> > > all CPU models? If we can gather the values at least for all affected
> >
> > Uuh ... I didn't say it was documented. If it is, I don't know about it.
> > As I said, it was just a wild guess.... even without reading your comment
> > on the ticket.
>
> I must have misread you. What where you talking about when you said
> "For this chip, just to give you an example, the datasheet says that
> any reported temperature below 50 degrees C only means that the
> temperature is below 50 degrees C"?
>
It does. Sorry, I thought you were referring to the valid bit. My bad.

The exact wording is "Any DTS reading below 50°C should be considered
to indicate only a temperature below 50°C and not a specific temperature".
This is from Intel® Atom™ Processor D400 and D500 Series Datasheet,
Volume 1, "7.1.3 Digital Thermal Sensor".

Just for fun, I also checked the datasheets for Z5xx, Z6xx, N400/N500, and
D2000/N2000. The D2000/N2000 datasheet says "Any temperature below 25 ...",
the others are silent on the subject.

> > > Atom CPU models (as I suppose the value will vary per model) we could
> > > tweak something in the driver.
> > >
> > > > Jean, any idea what we can do about this ? Report X degrees C (some constant
> > > > below TjMax) if valid is 0 ?
> > >
> > > Well well, we don't really have a sane way to transmit the information
> > > ("temperature is below X") down to the monitoring applications. The
> > > sysfs interface has no provision for it, libsensors wouldn't handle it
> > > and "sensors" wouldn't either, of course.
> > >
> > > We could hard-code an arbitrarily low temperature as you suggest,
> > > however I'm not sure if we want to do it for all CPU models or only the
> > > ones listed in ticket #2382. My concern is that the Intel specification
> > > doesn't limit "valid = 0" to too low temperature values. They don't
> > > give any detail, so assuming that "too low" is the only reason seems
> > > weird. I remember we saw transient errors on coretemp readings in the
> > > past, but I can't remember if that was on these Atom models (i.e. just
> > > another incarnation of ticket #2382) or other CPU models. I'm afraid we
> > > may start reporting temperature values instead of actual errors if the
> > > fix-up is too broad.
> > >
> > > Either way, the current situation is rather bad, as "N/A" looks more
> > > like "it's broken" than "it's cold". So I have no objection to crafting
> > > "something" into the driver to make it look better, if you are
> > > motivated to give it a try.
> > >
> > > If you are even more motivated and want to extend the sysfs to properly
> > > report the situation to user-space, feel free to do that as well. I
> > > volunteer to review any kernel patch related to this, and to write the
> > > user-space code to deal with it. I'm just not sure it's worth the
> > > effort for just 3 CPU models.
> >
> > I'd rather go with an exception table, or rather extend the existing tables.
> > It is probably somewhat safe to assume that the problem applies to all CPUs
> > with the same model/mask. Based on that we could declare a "tjmin" and
> > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > numbers, would then be 36 degrees C (100 - 64).
>
> Not sure where you drew the "36" from. From Mike's table it seems the
> valid flag wears off when the reported temperature would be < 6°C. This
> correlates with my findings in the ticket where the valid flag would be
> 0 for 1°C and 4°C.
>
You are right. No idea myself; maybe it was too early and I didn't have
enough coffee.

How about this: define tjmin at X degrees C, and report that temperature if
valid==0 or if the reported temperature is lower. Would that make sense ?

Only question remains what X should be for model 0x1c/10. 25 ? 30 ?

Thanks,
Guenter


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (10 preceding siblings ...)
  2013-11-19 21:14 ` Guenter Roeck
@ 2013-11-19 21:53 ` Guenter Roeck
  2013-11-20  9:19 ` Jean Delvare
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-19 21:53 UTC (permalink / raw)
  To: lm-sensors

On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
[ ... ]
> > with the same model/mask. Based on that we could declare a "tjmin" and
> > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > numbers, would then be 36 degrees C (100 - 64).
> 
> Not sure where you drew the "36" from. From Mike's table it seems the
> valid flag wears off when the reported temperature would be < 6°C. This
> correlates with my findings in the ticket where the valid flag would be
> 0 for 1°C and 4°C.
> 
Now I remember what I was thinking. In Mike's table, the real temperature at
which the sensor last reported 'valid' (according to the thermal diode)
was at 44 degrees C, or 56 degrees below TjMax. Add the reported temperature
of 6 degrees C to that number and you get 62. Round up to 64 below TjMax,
or 36 degrees C.

Not that this calculation really makes any sense ;-), but with Mike's 'real'
numbers from the thermal diode it sounds at least somewhat reasonable.

Guenter
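
For reference, the arithmetic in the message above can be written out as a
small C helper. This is only an illustration of the numbers quoted in the
thread (TjMax of 100 degrees C, diode reading of 44 degrees C, last valid DTS
reading of 6 degrees C). The function name and the "round up to a multiple of
8" step are assumptions chosen to reproduce the 62 -> 64 rounding; nothing
like this exists in the driver.

```c
/* Illustrative only: reproduces the 36 degrees C estimate from the thread. */
static int estimate_floor_c(int tjmax_c, int diode_c, int last_valid_dts_c,
                            int round_to)
{
    /* Distance below TjMax at the diode, plus the last valid DTS reading. */
    int below_tjmax = (tjmax_c - diode_c) + last_valid_dts_c;

    /* Round up to the next multiple of round_to (assumed granularity). */
    int rounded = ((below_tjmax + round_to - 1) / round_to) * round_to;

    return tjmax_c - rounded;
}
```

With the thread's numbers, estimate_floor_c(100, 44, 6, 8) works out as
56 + 6 = 62, rounded up to 64, giving 100 - 64 = 36.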


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (11 preceding siblings ...)
  2013-11-19 21:53 ` Guenter Roeck
@ 2013-11-20  9:19 ` Jean Delvare
  2013-11-20 17:29 ` Guenter Roeck
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-20  9:19 UTC (permalink / raw)
  To: lm-sensors

Hi Guenter,

On Tue, 19 Nov 2013 13:53:58 -0800, Guenter Roeck wrote:
> On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
> [ ... ]
> > > with the same model/mask. Based on that we could declare a "tjmin" and
> > > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > > numbers, would then be 36 degrees C (100 - 64).
> > 
> > Not sure where you drew the "36" from. From Mike's table it seems the
> > valid flag wears off when the reported temperature would be < 6°C. This
> > correlates with my findings in the ticket where the valid flag would be
> > 0 for 1°C and 4°C.
> > 
> Now I remember what I was thinking. In Mike's table, the real temperature at
> which the sensor last reported 'valid' (according to the thermal diode)
> was at 44 degrees C, or 56 degrees below TjMax.

That I agree with.

> Add the reported temperature
> of 6 degrees C to that number and you get 62. Round up to 64 below TjMax,
> or 36 degrees C.

All this means is that the DTS would return 0°C at approximately 36°C
physical (if we can trust the external sensor AND ignoring the expected
difference between internal and external temperature measurement.) I
don't think you can deduce tjmin from that, as the DTS scale and the
physical scale are distinct.

> Not that this calculation really makes any sense ;-), but with Mike's 'real'
> numbers from the thermal diode it sounds at least somewhat reasonable.

It is well known that the CPU DTS loses accuracy at low temperatures,
and Mike's numbers only show that.

Tjmin should be taken from the datasheet when it is present there. When
it is not in the datasheet, it becomes arbitrary, the only hard
constraint being that it must be greater than the values for which the
valid flag is no longer set (i.e. >= 6 in Mike's case.)

-- 
Jean Delvare
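
The "tjmin" idea discussed in this thread (report a floor value when the valid
bit is clear or the reading falls below the floor) could look roughly like the
sketch below. It is a hypothetical helper, not coretemp code; the name
report_temp_mc and the millidegree units are assumptions. The one hard
constraint stated above is that the floor must not be lower than the last
reading for which the valid flag was still set (6 degrees C in Mike's case).

```c
#include <stdbool.h>

/* Hypothetical helper, not part of coretemp. All values in millidegrees C. */
static int report_temp_mc(bool valid, int temp_mc, int tjmin_mc)
{
    /* Fall back to the floor when the reading is invalid or below it. */
    if (!valid || temp_mc < tjmin_mc)
        return tjmin_mc;
    return temp_mc;
}
```

With a floor of 6000, an invalid reading is reported as 6000 while a valid
44000 passes through unchanged.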


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (12 preceding siblings ...)
  2013-11-20  9:19 ` Jean Delvare
@ 2013-11-20 17:29 ` Guenter Roeck
  2013-11-20 18:06 ` Jean Delvare
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-20 17:29 UTC (permalink / raw)
  To: lm-sensors

On Wed, Nov 20, 2013 at 10:19:42AM +0100, Jean Delvare wrote:
> Hi Guenter,
> 
> On Tue, 19 Nov 2013 13:53:58 -0800, Guenter Roeck wrote:
> > On Tue, Nov 19, 2013 at 08:41:01PM +0100, Jean Delvare wrote:
> > [ ... ]
> > > > with the same model/mask. Based on that we could declare a "tjmin" and
> > > > report that if it is 1) defined and 2) the valid bit is 0. A somewhat "safe"
> > > > temperature to report for the D5xx (model 0x1c/mask 10), based on Mike's
> > > > numbers, would then be 36 degrees C (100 - 64).
> > > 
> > > Not sure where you drew the "36" from. From Mike's table it seems the
> > > valid flag wears off when the reported temperature would be < 6°C. This
> > > correlates with my findings in the ticket where the valid flag would be
> > > 0 for 1°C and 4°C.
> > > 
> > Now I remember what I was thinking. In Mike's table, the real temperature at
> > which the sensor last reported 'valid' (according to the thermal diode)
> > was at 44 degrees C, or 56 degrees below TjMax.
> 
> That I agree with.
> 
> > Add the reported temperature
> > of 6 degrees C to that number and you get 62. Round up to 64 below TjMax,
> > or 36 degrees C.
> 
> All this means is that the DTS would return 0°C at approximately 36°C
> physical (if we can trust the external sensor AND ignoring the expected
> difference between internal and external temperature measurement.) I
> don't think you can deduce tjmin from that, as the DTS scale and the
> physical scale are distinct.
> 
Yes, you are right. As I say below, that calculation doesn't really make much
sense.

> > Not that this calculation really makes any sense ;-), but with Mike's 'real'
> > numbers from the thermal diode it sounds at least somewhat reasonable.
> 
> It is well known that the CPU DTS loses accuracy at low temperatures,
> and Mike's numbers only show that.
> 
> Tjmin should be taken from the datasheet when it is present there. When
> it is not in the datasheet, it becomes arbitrary, the only hard
> constraint being that it must be greater than the values for which the
> valid flag is no longer set (i.e. >= 6 in Mike's case.)
> 
I don't see it anywhere, and I don't think it exists.

Mike's graph is quite interesting - it shows that the temperature reading error
is linear, at least for his CPU. Unfortunately, I don't think we can use
that knowledge to "fix" the reading automatically, as the error is very likely
different for other CPUs. We might consider adding an ideality factor module
parameter, though. What do you think about that ?

Another question is what temperature to use as tjmin. If we add an ideality
factor module parameter, it could be quite low, such as 20 degrees C.
We could even calculate tjmin based on the ideality factor if specified.
    tjmin = tjmax - (tjmax * ideality_factor / 100); /* ideality_factor in % */

Otherwise I would prefer something higher, at least 30 degrees C.

Thanks,
Guenter
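
As a worked example of the formula above, here is a minimal C sketch. The
function name is hypothetical, temperatures are assumed to be in millidegrees
C, and the factor values used below (80 and 70 percent) are made-up inputs
picked so that the results land on the 20 and 30 degrees C figures mentioned
in the message.

```c
/* tjmax_mc and the result are in millidegrees C; ideality_factor is in %. */
static int tjmin_from_ideality(int tjmax_mc, int ideality_factor)
{
    return tjmax_mc - (tjmax_mc * ideality_factor / 100);
}
```

With tjmax = 100000, a factor of 80 gives 100000 - 80000 = 20000 (20 degrees
C), and a factor of 70 gives 30000 (30 degrees C).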


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (13 preceding siblings ...)
  2013-11-20 17:29 ` Guenter Roeck
@ 2013-11-20 18:06 ` Jean Delvare
  2013-11-20 18:15 ` Guenter Roeck
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-20 18:06 UTC (permalink / raw)
  To: lm-sensors

On Wed, 20 Nov 2013 09:29:44 -0800, Guenter Roeck wrote:
> Mike's graph is quite interesting - it shows that the temperature reading error
> is linear, at least for his CPU. Unfortunately, I don't think we can use
> that knowledge to "fix" the reading automatically, as the error is very likely
> different for other CPUs. We might consider adding an ideality factor module
> parameter, though. What do you think about that ?

Everyone can compute the formula and use libsensors to apply it. If the
user has to provide the value manually for each CPU sample then it
might as well be that way, no need to add a module parameter. A single
module parameter would additionally become a problem for multi-socket
systems, you'd need an array and a reliable way to map each entry to
the logical CPUs of a given socket (assuming the ideality factor is per
package... which may not always be true.)

> Another question is what temperature to use as tjmin. If we add an ideality
> factor module parameter, it could be quite low, such as 20 degrees C.
> We could even calculate tjmin based on the ideality factor if specified.
>     tjmin = tjmax - (tjmax * ideality_factor / 100); /* ideality_factor in % */
> 
> Otherwise I would prefer something higher, at least 30 degrees C.

Personally I'd just do the minimum to avoid returning an error. In
other words I'd be fine returning values down to 6 degrees (for the
Atom D510 at least). We know the value is wrong but it can be corrected
in user-space, while if we clamp higher, it can no longer be corrected.

-- 
Jean Delvare
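
To illustrate the user-space correction argument, the sketch below applies a
linear calibration to a raw reading. Everything here is an assumption for
illustration: the struct, the function name, and above all the coefficients,
which would have to be measured per CPU sample. libsensors can apply a formula
of this kind through a "compute" statement in its configuration file; the C
version only shows the arithmetic.

```c
/* Hypothetical per-CPU calibration: corrected = raw * num / den + offset. */
struct linear_cal {
    int num;        /* slope numerator */
    int den;        /* slope denominator, must be non-zero */
    int offset_mc;  /* offset in millidegrees C */
};

static int correct_mc(const struct linear_cal *cal, int raw_mc)
{
    return raw_mc * cal->num / cal->den + cal->offset_mc;
}
```

The correction only works if the driver passes the raw value through. If the
driver clamped every reading below, say, 30000 to 30000, raw values of 6000
and 20000 would become indistinguishable and no formula could recover them.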


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (14 preceding siblings ...)
  2013-11-20 18:06 ` Jean Delvare
@ 2013-11-20 18:15 ` Guenter Roeck
  2013-11-20 18:25 ` Jean Delvare
  2013-11-20 18:38 ` Guenter Roeck
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-20 18:15 UTC (permalink / raw)
  To: lm-sensors

On Wed, Nov 20, 2013 at 07:06:41PM +0100, Jean Delvare wrote:
> On Wed, 20 Nov 2013 09:29:44 -0800, Guenter Roeck wrote:
> > Mike's graph is quite interesting - it shows that the temperature reading error
> > is linear, at least for his CPU. Unfortunately, I don't think we can use
> > that knowledge to "fix" the reading automatically, as the error is very likely
> > different for other CPUs. We might consider adding an ideality factor module
> > parameter, though. What do you think about that ?
> 
> Everyone can compute the formula and use libsensors to apply it. If the
> user has to provide the value manually for each CPU sample then it
> might as well be that way, no need to add a module parameter. A single
> module parameter would additionally become a problem for multi-socket
> systems, you'd need an array and a reliable way to map each entry to
> the logical CPUs of a given socket (assuming the ideality factor is per
> package... which may not always be true.)
> 
Good point.

> > Another question is what temperature to use as tjmin. If we add an ideality
> > factor module parameter, it could be quite low, such as 20 degrees C.
> > We could even calculate tjmin based on the ideality factor if specified.
> >     tjmin = tjmax - (tjmax * ideality_factor / 100); /* ideality_factor in % */
> > 
> > Otherwise I would prefer something higher, at least 30 degrees C.
> 
> Personally I'd just do the minimum to avoid returning an error. In
> other words I'd be fine returning values down to 6 degrees (for the
> Atom D510 at least). We know the value is wrong but it can be corrected
> in user-space, while if we clamp higher, it can no longer be corrected.
> 
Seems to me it would be much simpler to just return 0 if the valid bit is 0,
and not bother returning -EAGAIN in that case. After all, that is what
it boils down to, isn't it ?

Thanks,
Guenter


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (15 preceding siblings ...)
  2013-11-20 18:15 ` Guenter Roeck
@ 2013-11-20 18:25 ` Jean Delvare
  2013-11-20 18:38 ` Guenter Roeck
  17 siblings, 0 replies; 19+ messages in thread
From: Jean Delvare @ 2013-11-20 18:25 UTC (permalink / raw)
  To: lm-sensors

On Wed, 20 Nov 2013 10:15:29 -0800, Guenter Roeck wrote:
> On Wed, Nov 20, 2013 at 07:06:41PM +0100, Jean Delvare wrote:
> > Personally I'd just do the minimum to avoid returning an error. In
> > other words I'd be fine returning values down to 6 degrees (for the
> > Atom D510 at least). We know the value is wrong but it can be corrected
> > in user-space, while if we clamp higher, it can no longer be corrected.
> 
> Seems to me it would be much simpler to just return 0 if the valid bit is 0,
> and not bother returning -EAGAIN in that case. After all, that is what
> it boils down to, isn't it ?

In the specific case of these Atom chips, sort of, yes (although I'd
return maybe 5000 for continuity with the last known good value, but
that's not so important in the end, plus there's no guarantee that the
limit is the same for other Atom CPU models / samples.)

However we can't really do that in general, as again there is no Intel
documentation saying that valid == 0 can only mean "temperature too
low". There may be other causes.

Now I really can't remember if we ever had reports of -EAGAIN being
returned transiently. If the error is always permanent then I'd agree
that returning 0 (or any other arbitrarily low value) is no worse than
returning -EAGAIN.

Alternatively, as long as we have no clean way to return "lower than X
but I don't know the exact value" to user-space, it might be more
simple to just ignore the valid bit and always return the value. In
ticket #2382 I found that values 1°C and 4°C would be returned instead
of the error in that case, it's no worse than returning 0°C or an
arbitrary 5°C, and it would come for absolutely no extra code.

-- 
Jean Delvare
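
To show what "ignore the valid bit and always return the value" would yield,
here is a small C sketch that decodes a thermal status value following the
IA32_THERM_STATUS layout (bit 31 = reading valid, bits 22:16 = digital
readout below TjMax). The struct and function names are hypothetical. The
sample value comes from Mike's debug output at the start of the thread
(eax = 8620000, assumed to be hexadecimal), and TjMax = 100 degrees C is
assumed from the "100 - 64" calculation quoted earlier.

```c
#include <stdbool.h>
#include <stdint.h>

struct dts_reading {
    bool valid;   /* bit 31 of the status register */
    int temp_mc;  /* tjmax minus digital readout, in millidegrees C */
};

static struct dts_reading decode_therm_status(uint32_t eax, int tjmax_mc)
{
    struct dts_reading r;

    r.valid = (eax & 0x80000000u) != 0;
    /* Bits 22:16 hold the distance below TjMax in whole degrees. */
    r.temp_mc = tjmax_mc - (int)((eax >> 16) & 0x7f) * 1000;
    return r;
}
```

For eax = 0x08620000 the valid bit is clear and the readout field is 0x62
(98), so with tjmax_mc = 100000 the decoded value is 2000, i.e. 2 degrees C,
in the same range as the 1 and 4 degrees C seen in the ticket.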


* Re: [lm-sensors] Ticket #2382
  2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
                   ` (16 preceding siblings ...)
  2013-11-20 18:25 ` Jean Delvare
@ 2013-11-20 18:38 ` Guenter Roeck
  17 siblings, 0 replies; 19+ messages in thread
From: Guenter Roeck @ 2013-11-20 18:38 UTC (permalink / raw)
  To: lm-sensors

On Wed, Nov 20, 2013 at 07:25:09PM +0100, Jean Delvare wrote:
> On Wed, 20 Nov 2013 10:15:29 -0800, Guenter Roeck wrote:
> > On Wed, Nov 20, 2013 at 07:06:41PM +0100, Jean Delvare wrote:
> > > Personally I'd just do the minimum to avoid returning an error. In
> > > other words I'd be fine returning values down to 6 degrees (for the
> > > Atom D510 at least). We know the value is wrong but it can be corrected
> > > in user-space, while if we clamp higher, it can no longer be corrected.
> > 
> > Seems to me it would be much simpler to just return 0 if the valid bit is 0,
> > and not bother returning -EAGAIN in that case. After all, that is what
> > it boils down to, isn't it ?
> 
> In the specific case of these Atom chips, sort of, yes (although I'd
> return maybe 5000 for continuity with the last known good value, but
> that's not so important in the end, plus there's no guarantee that the
> limit is the same for other Atom CPU models / samples.)
> 
> However we can't really do that in general, as again there is no Intel
> documentation saying that valid == 0 can only mean "temperature too
> low". There may be other causes.
> 
> Now I really can't remember if we ever had reports of -EAGAIN being
> returned transiently. If the error is always permanent then I'd agree
> that returning 0 (or any other arbitrarily low value) is no worse than
> returning -EAGAIN.
> 
> Alternatively, as long as we have no clean way to return "lower than X
> but I don't know the exact value" to user-space, it might be more
> simple to just ignore the valid bit and always return the value. In
> ticket #2382 I found that values 1°C and 4°C would be returned instead
> of the error in that case, it's no worse than returning 0°C or an
> arbitrary 5°C, and it would come for absolutely no extra code.
> 
Both are OK with me. In either case it would be a one-line change.

-       return tdata->valid ? sprintf(buf, "%d\n", tdata->temp) : -EAGAIN;
+       return sprintf(buf, "%d\n", tdata->valid ? tdata->temp : 0);

or

-       return tdata->valid ? sprintf(buf, "%d\n", tdata->temp) : -EAGAIN;
+       return sprintf(buf, "%d\n", tdata->temp);

Let me know which one you prefer and I'll submit a patch.

Thanks,
Guenter
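
The current behavior and the two proposed one-line changes can be compared
side by side with a user-space sketch. The enum, the function name, and the
explicit buffer length are assumptions added so the code compiles outside the
kernel; the patch candidates above operate directly on tdata and the sysfs
buffer.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical policies for a reading whose valid bit is clear. */
enum invalid_policy {
    POLICY_EAGAIN,        /* current behavior: return -EAGAIN */
    POLICY_ZERO,          /* first option: report 0 */
    POLICY_IGNORE_VALID   /* second option: report the raw value */
};

static int format_temp(char *buf, size_t len, bool valid, int temp_mc,
                       enum invalid_policy policy)
{
    if (!valid) {
        if (policy == POLICY_EAGAIN)
            return -EAGAIN;
        if (policy == POLICY_ZERO)
            temp_mc = 0;
    }
    /* Returns the number of characters written, like the sysfs show call. */
    return snprintf(buf, len, "%d\n", temp_mc);
}
```

For an invalid reading of 4000, the three policies give -EAGAIN, "0" and
"4000" respectively; a valid reading is formatted the same way under all
three.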


end of thread (newest message: 2013-11-20 18:38 UTC)

Thread overview: 19+ messages
2013-11-18 18:56 [lm-sensors] Ticket #2382 Mike Gilbert
2013-11-18 22:39 ` Guenter Roeck
2013-11-19  7:51 ` Jean Delvare
2013-11-19 14:33 ` Guenter Roeck
2013-11-19 15:04 ` Mike Gilbert
2013-11-19 16:38 ` Guenter Roeck
2013-11-19 17:18 ` Jean Delvare
2013-11-19 17:24 ` Mike Gilbert
2013-11-19 17:53 ` Guenter Roeck
2013-11-19 19:23 ` Mike Gilbert
2013-11-19 19:41 ` Jean Delvare
2013-11-19 21:14 ` Guenter Roeck
2013-11-19 21:53 ` Guenter Roeck
2013-11-20  9:19 ` Jean Delvare
2013-11-20 17:29 ` Guenter Roeck
2013-11-20 18:06 ` Jean Delvare
2013-11-20 18:15 ` Guenter Roeck
2013-11-20 18:25 ` Jean Delvare
2013-11-20 18:38 ` Guenter Roeck
