All of lore.kernel.org
 help / color / mirror / Atom feed
* [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
@ 2011-05-31 19:50 Jean Delvare
  2011-05-31 20:29 ` Guenter Roeck
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Jean Delvare @ 2011-05-31 19:50 UTC (permalink / raw)
  To: lm-sensors

The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
seems too strict to me, some TjMax values documented in
Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
all the documented values pass.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
---
I'm not even sure why we need to check the range. Why would the
value read from the MSR be wrong?

 drivers/hwmon/coretemp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-3.0-rc1.orig/drivers/hwmon/coretemp.c	2011-05-30 20:45:09.000000000 +0200
+++ linux-3.0-rc1/drivers/hwmon/coretemp.c	2011-05-31 19:36:09.000000000 +0200
@@ -296,7 +296,7 @@ static int get_tjmax(struct cpuinfo_x86
 		 * If the TjMax is not plausible, an assumption
 		 * will be used
 		 */
-		if (val > 80 && val < 120) {
+		if (val >= 70 && val <= 125) {
 			dev_info(dev, "TjMax is %d C.\n", val);
 			return val * 1000;
 		}


-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
@ 2011-05-31 20:29 ` Guenter Roeck
  2011-05-31 20:54 ` Jean Delvare
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2011-05-31 20:29 UTC (permalink / raw)
  To: lm-sensors

On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
> seems too strict to me, some TjMax values documented in
> Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
> all the documented values pass.
> 
Maybe the reason is that the processors which support reading TjMax via
the register are all in the 80 .. 120 range. Who knows - unfortunately,
a simple table describing which processor supports which set of
registers does not seem to exist, or at least I was unable to find it.

> Signed-off-by: Jean Delvare <khali@linux-fr.org>
> Cc: Carsten Emde <C.Emde@osadl.org>
> Cc: Fenghua Yu <fenghua.yu@intel.com>

> ---
> I'm not even sure why we need to check the range. Why would the
> value read from the MSR be wrong?

If I understand correctly, some processors support the register but not
the TjMax value in bits 16..23. Here is an exchange about it:

http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean

Patch applied.

Thanks,
Guenter



_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
  2011-05-31 20:29 ` Guenter Roeck
@ 2011-05-31 20:54 ` Jean Delvare
  2011-05-31 21:07 ` Guenter Roeck
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jean Delvare @ 2011-05-31 20:54 UTC (permalink / raw)
  To: lm-sensors

On Tue, 31 May 2011 13:29:18 -0700, Guenter Roeck wrote:
> On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
> > seems too strict to me, some TjMax values documented in
> > Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
> > all the documented values pass.
> > 
> Maybe the reason is that the processors which support reading TjMax via
> the register are all in the 80 .. 120 range. Who knows - unfortunately,
> a simple table describing which processor supports which set of
> registers does not seem to exist, or at least I was unable to find it.
> 
> > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > Cc: Carsten Emde <C.Emde@osadl.org>
> > Cc: Fenghua Yu <fenghua.yu@intel.com>
> 
> > ---
> > I'm not even sure why we need to check the range. Why would the
> > value read from the MSR be wrong?
> 
> If I understand correctly, some processors support the register but not
> the TjMax value in bits 16..23. Here is an exchange about it:
> 
> http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean

OK but then the value in these bits would be 0, not a random value,
right? Wouldn't checking for val != 0 be more appropriate than the
heuristic we have?

-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
  2011-05-31 20:29 ` Guenter Roeck
  2011-05-31 20:54 ` Jean Delvare
@ 2011-05-31 21:07 ` Guenter Roeck
  2011-05-31 21:39 ` Yu, Fenghua
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2011-05-31 21:07 UTC (permalink / raw)
  To: lm-sensors

On Tue, 2011-05-31 at 16:54 -0400, Jean Delvare wrote:
> On Tue, 31 May 2011 13:29:18 -0700, Guenter Roeck wrote:
> > On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > > The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
> > > seems too strict to me, some TjMax values documented in
> > > Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
> > > all the documented values pass.
> > > 
> > Maybe the reason is that the processors which support reading TjMax via
> > the register are all in the 80 .. 120 range. Who knows - unfortunately,
> > a simple table describing which processor supports which set of
> > registers does not seem to exist, or at least I was unable to find it.
> > 
> > > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > > Cc: Carsten Emde <C.Emde@osadl.org>
> > > Cc: Fenghua Yu <fenghua.yu@intel.com>
> > 
> > > ---
> > > I'm not even sure why we need to check the range. Why would the
> > > value read from the MSR be wrong?
> > 
> > If I understand correctly, some processors support the register but not
> > the TjMax value in bits 16..23. Here is an exchange about it:
> > 
> > http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean
> 
> OK but then the value in these bits would be 0, not a random value,
> right? Wouldn't checking for val != 0 be more appropriate than the
> heuristic we have?
> 
I would think so, but then I don't have access to any of the affected
CPUs, and documentation does not seem to be available, so this is hard
to guess without additional information.

Guenter



_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (2 preceding siblings ...)
  2011-05-31 21:07 ` Guenter Roeck
@ 2011-05-31 21:39 ` Yu, Fenghua
  2011-05-31 22:17 ` Guenter Roeck
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Yu, Fenghua @ 2011-05-31 21:39 UTC (permalink / raw)
  To: lm-sensors

> -----Original Message-----
> From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> Sent: Tuesday, May 31, 2011 1:29 PM
> To: Jean Delvare
> Cc: LM Sensors; Yu, Fenghua; Carsten Emde
> Subject: Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> temperature range check
> 
> On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
> > seems too strict to me, some TjMax values documented in
> > Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
> > all the documented values pass.
> >
> Maybe the reason is that the processors which support reading TjMax via
> the register are all in the 80 .. 120 range. Who knows - unfortunately,
> a simple table describing which processor supports which set of
> registers does not seem to exist, or at least I was unable to find it.
> 
> > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > Cc: Carsten Emde <C.Emde@osadl.org>
> > Cc: Fenghua Yu <fenghua.yu@intel.com>
> 
> > ---
> > I'm not even sure why we need to check the range. Why would the
> > value read from the MSR be wrong?
> 
> If I understand correctly, some processors support the register but not
> the TjMax value in bits 16..23. Here is an exchange about it:
> 
> http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean

I couldn't find " some processors support the register but not
the TjMax value in bits 16..23" in the above site.

Even if there is such hint in the site, I wouldn't use it as a reference for coding. Our coding needs to base on official Intel specifications.

I check the SDM about the MSR_TEMPERATURE_TARGET register. My understanding is if the MSR is implemented, its bits 16..23 have the Temperature Target. Otherwise it's a hardware bug.

There is no spec for the Temperature Target range in SDM (can you find one?). Whatever range we put in coretemp is just speculation. Strictly speaking, we shouldn't have the range check.

Thanks.

-Fenghua


_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (3 preceding siblings ...)
  2011-05-31 21:39 ` Yu, Fenghua
@ 2011-05-31 22:17 ` Guenter Roeck
  2011-05-31 22:56 ` Yu, Fenghua
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2011-05-31 22:17 UTC (permalink / raw)
  To: lm-sensors

On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > -----Original Message-----
> > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > Sent: Tuesday, May 31, 2011 1:29 PM
> > To: Jean Delvare
> > Cc: LM Sensors; Yu, Fenghua; Carsten Emde
> > Subject: Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> > temperature range check
> > 
> > On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > > The current temperature range check of MSR_IA32_TEMPERATURE_TARGET
> > > seems too strict to me, some TjMax values documented in
> > > Documentation/hwmon/coretemp wouldn't pass. Relax the check so that
> > > all the documented values pass.
> > >
> > Maybe the reason is that the processors which support reading TjMax via
> > the register are all in the 80 .. 120 range. Who knows - unfortunately,
> > a simple table describing which processor supports which set of
> > registers does not seem to exist, or at least I was unable to find it.
> > 
> > > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > > Cc: Carsten Emde <C.Emde@osadl.org>
> > > Cc: Fenghua Yu <fenghua.yu@intel.com>
> > 
> > > ---
> > > I'm not even sure why we need to check the range. Why would the
> > > value read from the MSR be wrong?
> > 
> > If I understand correctly, some processors support the register but not
> > the TjMax value in bits 16..23. Here is an exchange about it:
> > 
> > http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean
> 
> I couldn't find " some processors support the register but not
> the TjMax value in bits 16..23" in the above site.

Not directly, but it says:

"Note Tj is not a fixed value and the IA32_TEMPERATURE_TARGET[15:8]
value can vary from part to part. Tj is also not software readable."

which, as I read it, seems to imply that TjMax is not returned when
reading IA32_TEMPERATURE_TARGET on the CPUs mentioned in the article.

> Even if there is such hint in the site, I wouldn't use it as a reference for coding. Our coding needs to base on official Intel specifications.
> 
Agreed, if one is available...

> I check the SDM about the MSR_TEMPERATURE_TARGET register. My understanding is if the MSR is implemented, its bits 16..23 have the Temperature Target. Otherwise it's a hardware bug.
> 
> There is no spec for the Temperature Target range in SDM (can you find one?). Whatever range we put in coretemp is just speculation. Strictly speaking, we shouldn't have the range check.
> 
I'd be somewhat hesitant to make such a change right away. Maybe we can
replace the check with a check against 0 and display a warning (or even
an error) if it reads 0. Would that make sense ?

Thanks,
Guenter



_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (4 preceding siblings ...)
  2011-05-31 22:17 ` Guenter Roeck
@ 2011-05-31 22:56 ` Yu, Fenghua
  2011-05-31 23:26 ` Guenter Roeck
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Yu, Fenghua @ 2011-05-31 22:56 UTC (permalink / raw)
  To: lm-sensors

> -----Original Message-----
> From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> Sent: Tuesday, May 31, 2011 3:17 PM
> To: Yu, Fenghua
> Cc: Jean Delvare; LM Sensors; Carsten Emde
> Subject: RE: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> temperature range check
> 
> On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > > -----Original Message-----
> > > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > > Sent: Tuesday, May 31, 2011 1:29 PM
> > > To: Jean Delvare
> > > Cc: LM Sensors; Yu, Fenghua; Carsten Emde
> > > Subject: Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> > > temperature range check
> > >
> > > On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > > > The current temperature range check of
> MSR_IA32_TEMPERATURE_TARGET
> > > > seems too strict to me, some TjMax values documented in
> > > > Documentation/hwmon/coretemp wouldn't pass. Relax the check so
> that
> > > > all the documented values pass.
> > > >
> > > Maybe the reason is that the processors which support reading TjMax
> via
> > > the register are all in the 80 .. 120 range. Who knows -
> unfortunately,
> > > a simple table describing which processor supports which set of
> > > registers does not seem to exist, or at least I was unable to find
> it.
> > >
> > > > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > > > Cc: Carsten Emde <C.Emde@osadl.org>
> > > > Cc: Fenghua Yu <fenghua.yu@intel.com>
> > >
> > > > ---
> > > > I'm not even sure why we need to check the range. Why would the
> > > > value read from the MSR be wrong?
> > >
> > > If I understand correctly, some processors support the register but
> not
> > > the TjMax value in bits 16..23. Here is an exchange about it:
> > >
> > > http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean
> >
> > I couldn't find " some processors support the register but not
> > the TjMax value in bits 16..23" in the above site.
> 
> Not directly, but it says:
> 
> "Note Tj is not a fixed value and the IA32_TEMPERATURE_TARGET[15:8]
> value can vary from part to part. Tj is also not software readable."
> 

"vary from model to model" would be more accurate, I think. But again this kind of words and info may not be accurate or misleading and shouldn't be used for our coding.

> which, as I read it, seems to imply that TjMax is not returned when
> reading IA32_TEMPERATURE_TARGET on the CPUs mentioned in the article.
> 
> > Even if there is such hint in the site, I wouldn't use it as a
> reference for coding. Our coding needs to base on official Intel
> specifications.
> >
> Agreed, if one is available...
> 
> > I check the SDM about the MSR_TEMPERATURE_TARGET register. My
> understanding is if the MSR is implemented, its bits 16..23 have the
> Temperature Target. Otherwise it's a hardware bug.
> >
> > There is no spec for the Temperature Target range in SDM (can you
> find one?). Whatever range we put in coretemp is just speculation.
> Strictly speaking, we shouldn't have the range check.
> >
> I'd be somewhat hesitant to make such a change right away. Maybe we can
> replace the check with a check against 0 and display a warning (or even
> an error) if it reads 0. Would that make sense ?

After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be:
1. a design issue which should be detected and fixed during internal design validation;
2. a manufacture issue which should be detected and fixed during manufacture process.

There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc?

Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless.

A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS.

Thanks.

-Fenghua
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (5 preceding siblings ...)
  2011-05-31 22:56 ` Yu, Fenghua
@ 2011-05-31 23:26 ` Guenter Roeck
  2011-05-31 23:39 ` Yu, Fenghua
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2011-05-31 23:26 UTC (permalink / raw)
  To: lm-sensors

On Tue, 2011-05-31 at 18:56 -0400, Yu, Fenghua wrote:
> > -----Original Message-----
> > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > Sent: Tuesday, May 31, 2011 3:17 PM
> > To: Yu, Fenghua
> > Cc: Jean Delvare; LM Sensors; Carsten Emde
> > Subject: RE: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> > temperature range check
> > 
> > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > > > -----Original Message-----
> > > > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > > > Sent: Tuesday, May 31, 2011 1:29 PM
> > > > To: Jean Delvare
> > > > Cc: LM Sensors; Yu, Fenghua; Carsten Emde
> > > > Subject: Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> > > > temperature range check
> > > >
> > > > On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > > > > The current temperature range check of
> > MSR_IA32_TEMPERATURE_TARGET
> > > > > seems too strict to me, some TjMax values documented in
> > > > > Documentation/hwmon/coretemp wouldn't pass. Relax the check so
> > that
> > > > > all the documented values pass.
> > > > >
> > > > Maybe the reason is that the processors which support reading TjMax
> > via
> > > > the register are all in the 80 .. 120 range. Who knows -
> > unfortunately,
> > > > a simple table describing which processor supports which set of
> > > > registers does not seem to exist, or at least I was unable to find
> > it.
> > > >
> > > > > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > > > > Cc: Carsten Emde <C.Emde@osadl.org>
> > > > > Cc: Fenghua Yu <fenghua.yu@intel.com>
> > > >
> > > > > ---
> > > > > I'm not even sure why we need to check the range. Why would the
> > > > > value read from the MSR be wrong?
> > > >
> > > > If I understand correctly, some processors support the register but
> > not
> > > > the TjMax value in bits 16..23. Here is an exchange about it:
> > > >
> > > > http://www.tomshardware.com/forum/230079-29-intel-tjunction-mean
> > >
> > > I couldn't find " some processors support the register but not
> > > the TjMax value in bits 16..23" in the above site.
> > 
> > Not directly, but it says:
> > 
> > "Note Tj is not a fixed value and the IA32_TEMPERATURE_TARGET[15:8]
> > value can vary from part to part. Tj is also not software readable."
> > 
> 
> "vary from model to model" would be more accurate, I think. But again this kind of words and info may not be accurate or misleading and shouldn't be used for our coding.
> 
Sure, but keep in mind that official information or documentation about
the register was not available when the statement was made.

> > which, as I read it, seems to imply that TjMax is not returned when
> > reading IA32_TEMPERATURE_TARGET on the CPUs mentioned in the article.
> > 
> > > Even if there is such hint in the site, I wouldn't use it as a
> > reference for coding. Our coding needs to base on official Intel
> > specifications.
> > >
> > Agreed, if one is available...
> > 
> > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My
> > understanding is if the MSR is implemented, its bits 16..23 have the
> > Temperature Target. Otherwise it's a hardware bug.
> > >
> > > There is no spec for the Temperature Target range in SDM (can you
> > find one?). Whatever range we put in coretemp is just speculation.
> > Strictly speaking, we shouldn't have the range check.
> > >
> > I'd be somewhat hesitant to make such a change right away. Maybe we can
> > replace the check with a check against 0 and display a warning (or even
> > an error) if it reads 0. Would that make sense ?
> 
> After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be:
> 1. a design issue which should be detected and fixed during internal design validation;
> 2. a manufacture issue which should be detected and fixed during manufacture process.
> 
> There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc?
> 
Guess that is why the "random" check is there ...

> Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless.
> 
- or the case where bit 16..23 to read TjMax are not supported in some
chip models. After all, it _does_ happen that additional bits are added
to a register over time.

> A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS.
> 

So what do we want to do ? 

One possible option would be to remove the check in 3.0-rc (ie not Cc
-stable) and see what happens. If we are lucky, and things are as you
suggest, we should be fine. Otherwise, we would have some time before
3.0 is released to fix any resulting problems.

Thanks,
Guenter



_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (6 preceding siblings ...)
  2011-05-31 23:26 ` Guenter Roeck
@ 2011-05-31 23:39 ` Yu, Fenghua
  2011-06-01  9:40 ` Jean Delvare
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Yu, Fenghua @ 2011-05-31 23:39 UTC (permalink / raw)
  To: lm-sensors

> -----Original Message-----
> From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> Sent: Tuesday, May 31, 2011 4:27 PM
> To: Yu, Fenghua
> Cc: Jean Delvare; LM Sensors; Carsten Emde
> Subject: RE: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> temperature range check
> 
> On Tue, 2011-05-31 at 18:56 -0400, Yu, Fenghua wrote:
> > > -----Original Message-----
> > > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > > Sent: Tuesday, May 31, 2011 3:17 PM
> > > To: Yu, Fenghua
> > > Cc: Jean Delvare; LM Sensors; Carsten Emde
> > > Subject: RE: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target
> > > temperature range check
> > >
> > > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > > > > -----Original Message-----
> > > > > From: Guenter Roeck [mailto:guenter.roeck@ericsson.com]
> > > > > Sent: Tuesday, May 31, 2011 1:29 PM
> > > > > To: Jean Delvare
> > > > > Cc: LM Sensors; Yu, Fenghua; Carsten Emde
> > > > > Subject: Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax
> target
> > > > > temperature range check
> > > > >
> > > > > On Tue, 2011-05-31 at 15:50 -0400, Jean Delvare wrote:
> > > > > > The current temperature range check of
> > > MSR_IA32_TEMPERATURE_TARGET
> > > > > > seems too strict to me, some TjMax values documented in
> > > > > > Documentation/hwmon/coretemp wouldn't pass. Relax the check
> so
> > > that
> > > > > > all the documented values pass.
> > > > > >
> > > > > Maybe the reason is that the processors which support reading
> TjMax
> > > via
> > > > > the register are all in the 80 .. 120 range. Who knows -
> > > unfortunately,
> > > > > a simple table describing which processor supports which set of
> > > > > registers does not seem to exist, or at least I was unable to
> find
> > > it.
> > > > >
> > > > > > Signed-off-by: Jean Delvare <khali@linux-fr.org>
> > > > > > Cc: Carsten Emde <C.Emde@osadl.org>
> > > > > > Cc: Fenghua Yu <fenghua.yu@intel.com>
> > > > >
> > > > > > ---
> > > > > > I'm not even sure why we need to check the range. Why would
> the
> > > > > > value read from the MSR be wrong?
> > > > >
> > > > > If I understand correctly, some processors support the register
> but
> > > not
> > > > > the TjMax value in bits 16..23. Here is an exchange about it:
> > > > >
> > > > > http://www.tomshardware.com/forum/230079-29-intel-tjunction-
> mean
> > > >
> > > > I couldn't find " some processors support the register but not
> > > > the TjMax value in bits 16..23" in the above site.
> > >
> > > Not directly, but it says:
> > >
> > > "Note Tj is not a fixed value and the IA32_TEMPERATURE_TARGET[15:8]
> > > value can vary from part to part. Tj is also not software
> readable."
> > >
> >
> > "vary from model to model" would be more accurate, I think. But again
> this kind of words and info may not be accurate or misleading and
> shouldn't be used for our coding.
> >
> Sure, but keep in mind that official information or documentation about
> the register was not available when the statement was made.
> 
> > > which, as I read it, seems to imply that TjMax is not returned when
> > > reading IA32_TEMPERATURE_TARGET on the CPUs mentioned in the
> article.
> > >
> > > > Even if there is such hint in the site, I wouldn't use it as a
> > > reference for coding. Our coding needs to base on official Intel
> > > specifications.
> > > >
> > > Agreed, if one is available...
> > >
> > > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My
> > > understanding is if the MSR is implemented, its bits 16..23 have
> the
> > > Temperature Target. Otherwise it's a hardware bug.
> > > >
> > > > There is no spec for the Temperature Target range in SDM (can you
> > > find one?). Whatever range we put in coretemp is just speculation.
> > > Strictly speaking, we shouldn't have the range check.
> > > >
> > > I'd be somewhat hesitant to make such a change right away. Maybe we
> can
> > > replace the check with a check against 0 and display a warning (or
> even
> > > an error) if it reads 0. Would that make sense ?
> >
> > After a CPU model comes out, the number of Temperature Target is
> fixed for whatever value. If that value doesn't meet design spec, it
> could be:
> > 1. a design issue which should be detected and fixed during internal
> design validation;
> > 2. a manufacture issue which should be detected and fixed during
> manufacture process.
> >
> > There is no spec to claim 0 is wrong for bits 16..23. Check against 0
> is still a speculation. If checking against 0, why not check against 1,
> 2, etc?
> >
> Guess that is why the "random" check is there ...
> 
> > Without any spec guidline, a sanity checking on the bits is trying to
> capture either a hardware design bug or a hardware manufacture bug.
> This is not based on any spec. And the effort is very likely fruitless.
> >
> - or the case where bit 16..23 to read TjMax are not supported in some
> chip models. After all, it _does_ happen that additional bits are added
> to a register over time.
> 
> > A lot of similar places in kernel don't do sanity checking, e.g. bits
> 0..15 in MSR_IA32_PERF_STATUS.
> >
> 
> So what do we want to do ?
> 
> One possible option would be to remove the check in 3.0-rc (ie not Cc
> -stable) and see what happens. If we are lucky, and things are as you
> suggest, we should be fine. Otherwise, we would have some time before
> 3.0 is released to fix any resulting problems.

Although it should be rare if it really happens, we could see abnormal tjmax numbers (out of current range check) because we do hit some processor issues after removing the check. If someone sees the case, the hardware issue will be reported to Intel. But I hope we won't see this case because I'm confident in Intel's processor:)

Thanks.

-Fenghua

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (7 preceding siblings ...)
  2011-05-31 23:39 ` Yu, Fenghua
@ 2011-06-01  9:40 ` Jean Delvare
  2011-06-01 16:12 ` Guenter Roeck
  2011-06-01 20:18 ` Jean Delvare
  10 siblings, 0 replies; 12+ messages in thread
From: Jean Delvare @ 2011-06-01  9:40 UTC (permalink / raw)
  To: lm-sensors

Hi Fenghua,

On Tue, 31 May 2011 15:56:11 -0700, Yu, Fenghua wrote:
> > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My
> > > understanding is if the MSR is implemented, its bits 16..23 have the
> > > Temperature Target. Otherwise it's a hardware bug.

This doesn't seem to be the case in practice, see below.

> > > There is no spec for the Temperature Target range in SDM (can you
> > > find one?). Whatever range we put in coretemp is just speculation.
> > > Strictly speaking, we shouldn't have the range check.

I tend to agree.

> > I'd be somewhat hesitant to make such a change right away. Maybe we can
> > replace the check with a check against 0 and display a warning (or even
> > an error) if it reads 0. Would that make sense ?

Almost, except that I don't think we should complain about value 0, see
below.

> After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be:
> 1. a design issue which should be detected and fixed during internal design validation;
> 2. a manufacture issue which should be detected and fixed during manufacture process.
> 
> There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc?

0 is different because it is what unused/unimplemented MSR bits return.
I have 3 supported CPUs here for my testing: family 6 model 0xe (Core
Duo) in a laptop, family 6 model 0xf (Core2 Duo) in a desktop and
family 6 model 0x1a in a workstation (Xeon).

On model 0xe, reading from MSR 0x1a2 return an I/O error. On model 0xf,
the MSR exists, but reading from it returns 0 for bits 23:16. On model
0x1a, it returns a valid temperature limit in bits 23:16 (97°C).

> 
> Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless.
> 
> A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS.

I would be perfectly happy to not have any sanity check on the value,
but this requires that we know upfront which CPU models have bits 23:16
of MSR 0x1a2 implemented, and which do not. As my experiment above
shows, some models have the MSR implemented but not bits 23:16, and
checking for value 0 is the only way to handle them in a generic way.

According to the SDM, MSR 0x1a2 is supposed to only exist in Nehalem
(i.e. family 6 model 0x1a) and Sandy Bridge CPUs (family and model
unknown to me) but I guess it will also exist in future CPUs.

I see two possible implementations:

1* Go by the SDM. First check that we are on a model which supports
MSR_TEMPERATURE_TARGET before calling it, and always trust the result.
For other models, use the heuristic we had before (adjust_tjmax). The
major drawback is that we will have to update the list of CPU models
supporting MSR_TEMPERATURE_TARGET as they are released by Intel. We
just went away from having to continuously maintain a device list in
the coretemp driver and it seems wrong to go back to this situation
again.

2* Go by my experimental result. Read MSR_TEMPERATURE_TARGET
unconditionally, but only use the result if there was no I/O error and
the value isn't 0. In other cases, don't emit any warning message and
simply fallback to the heuristic we had before (adjust_tjmax). This is
closer to what we have at the moment, and has the advantage that it
won't require further updates.

Implementation 2 clearly has my favors.

For stable kernel series, I'd rather just apply the patch I submitted,
because it is the least intrusive option and we aren't aware of any
actual system requiring a more intrusive fix.

-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (8 preceding siblings ...)
  2011-06-01  9:40 ` Jean Delvare
@ 2011-06-01 16:12 ` Guenter Roeck
  2011-06-01 20:18 ` Jean Delvare
  10 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2011-06-01 16:12 UTC (permalink / raw)
  To: lm-sensors

On Wed, 2011-06-01 at 05:40 -0400, Jean Delvare wrote:
> Hi Fenghua,
> 
> On Tue, 31 May 2011 15:56:11 -0700, Yu, Fenghua wrote:
> > > On Tue, 2011-05-31 at 17:39 -0400, Yu, Fenghua wrote:
> > > > I check the SDM about the MSR_TEMPERATURE_TARGET register. My
> > > > understanding is if the MSR is implemented, its bits 16..23 have the
> > > > Temperature Target. Otherwise it's a hardware bug.
> 
> This doesn't seem to be the case in practice, see below.
> 
> > > > There is no spec for the Temperature Target range in SDM (can you
> > > > find one?). Whatever range we put in coretemp is just speculation.
> > > > Strictly speaking, we shouldn't have the range check.
> 
> I tend to agree.
> 
> > > I'd be somewhat hesitant to make such a change right away. Maybe we can
> > > replace the check with a check against 0 and display a warning (or even
> > > an error) if it reads 0. Would that make sense ?
> 
> Almost, except that I don't think we should complain about value 0, see
> below.
> 
> > After a CPU model comes out, the number of Temperature Target is fixed for whatever value. If that value doesn't meet design spec, it could be:
> > 1. a design issue which should be detected and fixed during internal design validation;
> > 2. a manufacture issue which should be detected and fixed during manufacture process.
> > 
> > There is no spec to claim 0 is wrong for bits 16..23. Check against 0 is still a speculation. If checking against 0, why not check against 1, 2, etc?
> 
> 0 is different because it is what unused/unimplemented MSR bits return.
> I have 3 supported CPUs here for my testing: family 6 model 0xe (Core
> Duo) in a laptop, family 6 model 0xf (Core2 Duo) in a desktop and
> family 6 model 0x1a in a workstation (Xeon).
> 
> On model 0xe, reading from MSR 0x1a2 return an I/O error. On model 0xf,
> the MSR exists, but reading from it returns 0 for bits 23:16. On model
> 0x1a, it returns a valid temperature limit in bits 23:16 (97°C).
> 
> > 
> > Without any spec guidline, a sanity checking on the bits is trying to capture either a hardware design bug or a hardware manufacture bug. This is not based on any spec. And the effort is very likely fruitless.
> > 
> > A lot of similar places in kernel don't do sanity checking, e.g. bits 0..15 in MSR_IA32_PERF_STATUS.
> 
> I would be perfectly happy to not have any sanity check on the value,
> but this requires that we know upfront which CPU models have bits 23:16
> of MSR 0x1a2 implemented, and which do not. As my experiment above
> shows, some models have the MSR implemented but not bits 23:16, and
> checking for value 0 is the only way to handle them in a generic way.
> 
> According to the SDM, MSR 0x1a2 is supposed to only exist in Nehalem
> (i.e. family 6 model 0x1a) and Sandy Bridge CPUs (family and model
> unknown to me) but I guess it will also exist in future CPUs.
> 
> I see two possible implementations:
> 
> 1* Go by the SDM. First check that we are on a model which supports
> MSR_TEMPERATURE_TARGET before calling it, and always trust the result.
> For other models, use the heuristic we had before (adjust_tjmax). The
> major drawback is that we will have to update the list of CPU models
> supporting MSR_TEMPERATURE_TARGET as they are released by Intel. We
> just went away from having to continuously maintain a device list in
> the coretemp driver and it seems wrong to go back to this situation
> again.
> 
> 2* Go by my experimental result. Read MSR_TEMPERATURE_TARGET
> unconditionally, but only use the result if there was no I/O error and
> the value isn't 0. In other cases, don't emit any warning message and
> simply fallback to the heuristic we had before (adjust_tjmax). This is
> closer to what we have at the moment, and has the advantage that it
> won't require further updates.
> 
> Implementation 2 clearly has my favors.
> 
Definitely. 1* doesn't look like very compelling.

> For stable kernel series, I'd rather just apply the patch I submitted,
> because it is the least intrusive option and we aren't aware of any
> actual system requiring a more intrusive fix.
> 
Do we really want/need this in -stable ? Just wondering. 

We do need the second part of 2*, ie unconditionally calling
adjust_tjmax().

Thanks,
Guenter



_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature
  2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
                   ` (9 preceding siblings ...)
  2011-06-01 16:12 ` Guenter Roeck
@ 2011-06-01 20:18 ` Jean Delvare
  10 siblings, 0 replies; 12+ messages in thread
From: Jean Delvare @ 2011-06-01 20:18 UTC (permalink / raw)
  To: lm-sensors

On Wed, 1 Jun 2011 09:12:11 -0700, Guenter Roeck wrote:
> On Wed, 2011-06-01 at 05:40 -0400, Jean Delvare wrote:
> > I would be perfectly happy to not have any sanity check on the value,
> > but this requires that we know upfront which CPU models have bits 23:16
> > of MSR 0x1a2 implemented, and which do not. As my experiment above
> > shows, some models have the MSR implemented but not bits 23:16, and
> > checking for value 0 is the only way to handle them in a generic way.
> > 
> > According to the SDM, MSR 0x1a2 is supposed to only exist in Nehalem
> > (i.e. family 6 model 0x1a) and Sandy Bridge CPUs (family and model
> > unknown to me) but I guess it will also exist in future CPUs.
> > 
> > I see two possible implementations:
> > 
> > 1* Go by the SDM. First check that we are on a model which supports
> > MSR_TEMPERATURE_TARGET before calling it, and always trust the result.
> > For other models, use the heuristic we had before (adjust_tjmax). The
> > major drawback is that we will have to update the list of CPU models
> > supporting MSR_TEMPERATURE_TARGET as they are released by Intel. We
> > just went away from having to continuously maintain a device list in
> > the coretemp driver and it seems wrong to go back to this situation
> > again.
> > 
> > 2* Go by my experimental result. Read MSR_TEMPERATURE_TARGET
> > unconditionally, but only use the result if there was no I/O error and
> > the value isn't 0. In other cases, don't emit any warning message and
> > simply fallback to the heuristic we had before (adjust_tjmax). This is
> > closer to what we have at the moment, and has the advantage that it
> > won't require further updates.
> > 
> > Implementation 2 clearly has my favors.
> > 
> Definitely. 1* doesn't look like very compelling.
> 
> > For stable kernel series, I'd rather just apply the patch I submitted,
> > because it is the least intrusive option and we aren't aware of any
> > actual system requiring a more intrusive fix.
>
> Do we really want/need this in -stable ? Just wondering. 

In all honesty, I don't know. I did not have any bug report, but if
someone were to hit that one, all they would get is a wrong TjMax,
that's not necessarily something people will notice.

In fact it depends when MSR_TEMPERATURE_TARGET started reporting valid
values in bits 23:16. I know models 0xe, 0xf and 0x1c do not, but I am
unsure about 0x16 and 0x17 (for which we know some models have TjMax
outside the original range check.) Officially supported started with
Nehalem but you never know...

Also, I can only suppose that bits 23:16 of the MSR will be supported
by all Intel CPUs moving forward, and it seems reasonable to imagine
that at least one model will get outside the current check at some
point in the future (if it isn't already the case.) Pushing the fix to
stable now would anticipate this. Best would be to find an actual CPU
model affected by the bug, but I don't have one at hand and finding one
would take much time, which I suspect would be better spent differently.

> We do need the second part of 2*, ie unconditionally calling
> adjust_tjmax().

Yes we do.

-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-06-01 20:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-31 19:50 [lm-sensors] [PATCH] hwmon: (coretemp) Relax target temperature Jean Delvare
2011-05-31 20:29 ` Guenter Roeck
2011-05-31 20:54 ` Jean Delvare
2011-05-31 21:07 ` Guenter Roeck
2011-05-31 21:39 ` Yu, Fenghua
2011-05-31 22:17 ` Guenter Roeck
2011-05-31 22:56 ` Yu, Fenghua
2011-05-31 23:26 ` Guenter Roeck
2011-05-31 23:39 ` Yu, Fenghua
2011-06-01  9:40 ` Jean Delvare
2011-06-01 16:12 ` Guenter Roeck
2011-06-01 20:18 ` Jean Delvare

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.