All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fenghua Yu <fenghua.yu@intel.com>
To: Jean Delvare <khali@nerim.net>
Cc: "Yu, Fenghua" <fenghua.yu@intel.com>,
	Rudolf Marek <r.marek@assembler.cz>,
	"Wan, Huaxu" <huaxu.wan@intel.com>, H Peter Anvin <hpa@zytor.com>,
	Chen Gong <gong.chen@linux.intel.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	lm-sensors <lm-sensors@lm-sensors.org>
Subject: Re: [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue
Date: Fri, 20 Aug 2010 14:53:53 -0700	[thread overview]
Message-ID: <20100820215353.GC22602@linux-os.sc.intel.com> (raw)
In-Reply-To: <20100820102646.7659b6b7@hyperion.delvare>

On Fri, Aug 20, 2010 at 01:26:46AM -0700, Jean Delvare wrote:
> Hi Fenghua,
> 
> On Wed, 18 Aug 2010 15:53:45 -0700, Fenghua Yu wrote:
> > From: Fenghua Yu <fenghua.yu@intel.com>
> > When a CPU is hot-removed, its core sensor should be still available to upper
> > level application as long as the hot-removed CPU's HT sibling is still running.
> > A core sensor is invisible to user level only when all of siblings in a core are
> > hot-removed.
> 
> Good point. I admit I didn't think about this scenario when fixing the
> duplicate HT entries. I thought both hyperthreads would go away at the
> same time, but since then I learned that individual HT can be removed
> using the sysfs "online" attributes.
> 
> That being said, I'm curious if this is really a problem in practice?
> Why would one disable only one hyperthread on a given core? I can't
> think of a real-world scenario.

Overall we need to keep state integrity for hot-removed CPU and shared core
sensor. Without fixing this issue, we end up with inconsistent system info.

As for usage scenario, I can think of some:
1. Power saving. Management application may offline some threads or all thread
siblings to save power. Image all of HT is disabled during run-time, less power
is consumed with less performance.
2. RAS. A bad thread may be offlined which its sibling is still running. This
could be becaused of logical CPU spcific state e.g. instruction TLB.

> 
> I don't mean to suggest that we don't have to fix the problem. I'm
> simply trying to figure out how fast we need to fix it, and whether the
> fix is worth adding to the stable kernel series or not.
> 
> As you can see, the switch of hyperthreads on Core 1 caused hwmon
> device coretemp-isa-0001 to be removed and be replaced with
> coretemp-isa-0005. There is also a change in the underlying
> directories, /sys/class/hwmon/hwmon1/device now points
> to /sys/devices/platform/coretemp.5 instead
> of /sys/devices/platform/coretemp.1. This has three drawbacks:
> 1* Configuration statements from /etc/sensors.conf will no longer be
>    applied.
> 2* Some monitoring applications may lose their path to the sensors.
>    Thankfully, libsensors uses hwmon device paths rather than physical
>    device paths, so the effect should be limited, but other tools (e.g.
>    the fancontrol script) tend to prefer physical device paths, so they
>    will break.
> 3* If you disable several HTs at once, you have no guarantee that the
>    new hwmon devices will be numbered in the same order as the old hwmon
>    devices. If you are unlucky and the number changes, then all
>    libsensors-based applications will start reporting garbage.
> 
> I admit that these issues are not critical ones, and are rather
> unlikely to happen in the real world, but so is the problem you are
> trying to solve in the first place.
> 
> Point 1* could be easily solved by changing the way the coretemp device
> ID is allocated. Instead of using the CPU ID directly, we would use the
> smallest CPU ID amongst all the siblings. This ensures a consistent ID
> no matter which sibling is used.
> 
> Points 2* and 3*, however, can't be solved without reworking the driver
> significantly. I think we should not only skip duplicate HT entries on
> driver registration as my naive patch did. We should instead keep track
> of them, i.e. all coretemp entries should know the list of CPU entries
> they are backed up by, and a coretemp device would be unregistered only
> when this list shrinks to zero elements (all HT have been removed.)
> 
> As you said you agree to give a try to a rework of the coretemp driver
> to keep all related cores into the same hwmon device, I think this
> additional constraint might fit well in the new driver design. What do
> you think?

Yes, I agree with you on that. Since I'm rewriting coretemp/pkgtemp, this issue
will be fixed in a new coding.

Thanks.

-Fenghua

WARNING: multiple messages have this Message-ID (diff)
From: Fenghua Yu <fenghua.yu@intel.com>
To: Jean Delvare <khali@nerim.net>
Cc: "Yu, Fenghua" <fenghua.yu@intel.com>,
	Rudolf Marek <r.marek@assembler.cz>,
	"Wan, Huaxu" <huaxu.wan@intel.com>, H Peter Anvin <hpa@zytor.com>,
	Chen Gong <gong.chen@linux.intel.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	lm-sensors <lm-sensors@lm-sensors.org>
Subject: Re: [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect
Date: Fri, 20 Aug 2010 21:53:53 +0000	[thread overview]
Message-ID: <20100820215353.GC22602@linux-os.sc.intel.com> (raw)
In-Reply-To: <20100820102646.7659b6b7@hyperion.delvare>

On Fri, Aug 20, 2010 at 01:26:46AM -0700, Jean Delvare wrote:
> Hi Fenghua,
> 
> On Wed, 18 Aug 2010 15:53:45 -0700, Fenghua Yu wrote:
> > From: Fenghua Yu <fenghua.yu@intel.com>
> > When a CPU is hot-removed, its core sensor should be still available to upper
> > level application as long as the hot-removed CPU's HT sibling is still running.
> > A core sensor is invisible to user level only when all of siblings in a core are
> > hot-removed.
> 
> Good point. I admit I didn't think about this scenario when fixing the
> duplicate HT entries. I thought both hyperthreads would go away at the
> same time, but since then I learned that individual HT can be removed
> using the sysfs "online" attributes.
> 
> That being said, I'm curious if this is really a problem in practice?
> Why would one disable only one hyperthread on a given core? I can't
> think of a real-world scenario.

Overall we need to keep state integrity for hot-removed CPU and shared core
sensor. Without fixing this issue, we end up with inconsistent system info.

As for usage scenario, I can think of some:
1. Power saving. Management application may offline some threads or all thread
siblings to save power. Image all of HT is disabled during run-time, less power
is consumed with less performance.
2. RAS. A bad thread may be offlined which its sibling is still running. This
could be becaused of logical CPU spcific state e.g. instruction TLB.

> 
> I don't mean to suggest that we don't have to fix the problem. I'm
> simply trying to figure out how fast we need to fix it, and whether the
> fix is worth adding to the stable kernel series or not.
> 
> As you can see, the switch of hyperthreads on Core 1 caused hwmon
> device coretemp-isa-0001 to be removed and be replaced with
> coretemp-isa-0005. There is also a change in the underlying
> directories, /sys/class/hwmon/hwmon1/device now points
> to /sys/devices/platform/coretemp.5 instead
> of /sys/devices/platform/coretemp.1. This has three drawbacks:
> 1* Configuration statements from /etc/sensors.conf will no longer be
>    applied.
> 2* Some monitoring applications may lose their path to the sensors.
>    Thankfully, libsensors uses hwmon device paths rather than physical
>    device paths, so the effect should be limited, but other tools (e.g.
>    the fancontrol script) tend to prefer physical device paths, so they
>    will break.
> 3* If you disable several HTs at once, you have no guarantee that the
>    new hwmon devices will be numbered in the same order as the old hwmon
>    devices. If you are unlucky and the number changes, then all
>    libsensors-based applications will start reporting garbage.
> 
> I admit that these issues are not critical ones, and are rather
> unlikely to happen in the real world, but so is the problem you are
> trying to solve in the first place.
> 
> Point 1* could be easily solved by changing the way the coretemp device
> ID is allocated. Instead of using the CPU ID directly, we would use the
> smallest CPU ID amongst all the siblings. This ensures a consistent ID
> no matter which sibling is used.
> 
> Points 2* and 3*, however, can't be solved without reworking the driver
> significantly. I think we should not only skip duplicate HT entries on
> driver registration as my naive patch did. We should instead keep track
> of them, i.e. all coretemp entries should know the list of CPU entries
> they are backed up by, and a coretemp device would be unregistered only
> when this list shrinks to zero elements (all HT have been removed.)
> 
> As you said you agree to give a try to a rework of the coretemp driver
> to keep all related cores into the same hwmon device, I think this
> additional constraint might fit well in the new driver design. What do
> you think?

Yes, I agree with you on that. Since I'm rewriting coretemp/pkgtemp, this issue
will be fixed in a new coding.

Thanks.

-Fenghua

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

  reply	other threads:[~2010-08-20 21:53 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-08-18 22:53 [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Fenghua Yu
2010-08-18 22:53 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Fenghua Yu
2010-08-18 22:53 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Fenghua Yu
2010-08-18 22:53   ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Fenghua Yu
2010-08-19  7:24   ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare
2010-08-19  7:24     ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare
2010-08-19 19:04     ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables H. Peter Anvin
2010-08-19 19:04       ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of H. Peter Anvin
2010-08-19 21:26       ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare
2010-08-19 21:26         ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare
2010-08-19 21:28         ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables H. Peter Anvin
2010-08-19 21:28           ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of H. Peter Anvin
2010-08-19 21:32           ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare
2010-08-19 21:32             ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare
2010-08-19 21:51         ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Guenter Roeck
2010-08-19 21:51           ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Guenter Roeck
2010-09-04 12:39           ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare
2010-09-04 12:39             ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare
2010-09-04 14:30             ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Guenter Roeck
2010-09-04 14:30               ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Guenter Roeck
2010-09-04 16:17               ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare
2010-09-04 16:17                 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare
2010-08-18 22:53 ` [PATCH] drivers/hwmon/pkgtemp: Fix improper locking in CPU hot remove Fenghua Yu
2010-08-18 22:53   ` [lm-sensors] [PATCH] drivers/hwmon/pkgtemp: Fix improper locking in Fenghua Yu
2010-08-19 17:07 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Guenter Roeck
2010-08-19 17:07   ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Guenter Roeck
2010-08-20  8:26 ` [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Jean Delvare
2010-08-20  8:26   ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Jean Delvare
2010-08-20 21:53   ` Fenghua Yu [this message]
2010-08-20 21:53     ` Fenghua Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100820215353.GC22602@linux-os.sc.intel.com \
    --to=fenghua.yu@intel.com \
    --cc=gong.chen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=huaxu.wan@intel.com \
    --cc=khali@nerim.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lm-sensors@lm-sensors.org \
    --cc=r.marek@assembler.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.