From: Fenghua Yu <fenghua.yu@intel.com> To: Jean Delvare <khali@nerim.net> Cc: "Yu, Fenghua" <fenghua.yu@intel.com>, Rudolf Marek <r.marek@assembler.cz>, "Wan, Huaxu" <huaxu.wan@intel.com>, H Peter Anvin <hpa@zytor.com>, Chen Gong <gong.chen@linux.intel.com>, linux-kernel <linux-kernel@vger.kernel.org>, lm-sensors <lm-sensors@lm-sensors.org> Subject: Re: [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Date: Fri, 20 Aug 2010 14:53:53 -0700 [thread overview] Message-ID: <20100820215353.GC22602@linux-os.sc.intel.com> (raw) In-Reply-To: <20100820102646.7659b6b7@hyperion.delvare> On Fri, Aug 20, 2010 at 01:26:46AM -0700, Jean Delvare wrote: > Hi Fenghua, > > On Wed, 18 Aug 2010 15:53:45 -0700, Fenghua Yu wrote: > > From: Fenghua Yu <fenghua.yu@intel.com> > > When a CPU is hot-removed, its core sensor should be still available to upper > > level application as long as the hot-removed CPU's HT sibling is still running. > > A core sensor is invisible to user level only when all of siblings in a core are > > hot-removed. > > Good point. I admit I didn't think about this scenario when fixing the > duplicate HT entries. I thought both hyperthreads would go away at the > same time, but since then I learned that individual HT can be removed > using the sysfs "online" attributes. > > That being said, I'm curious if this is really a problem in practice? > Why would one disable only one hyperthread on a given core? I can't > think of a real-world scenario. Overall we need to keep state integrity for hot-removed CPU and shared core sensor. Without fixing this issue, we end up with inconsistent system info. As for usage scenario, I can think of some: 1. Power saving. Management application may offline some threads or all thread siblings to save power. Image all of HT is disabled during run-time, less power is consumed with less performance. 2. RAS. A bad thread may be offlined which its sibling is still running. This could be becaused of logical CPU spcific state e.g. instruction TLB. > > I don't mean to suggest that we don't have to fix the problem. I'm > simply trying to figure out how fast we need to fix it, and whether the > fix is worth adding to the stable kernel series or not. > > As you can see, the switch of hyperthreads on Core 1 caused hwmon > device coretemp-isa-0001 to be removed and be replaced with > coretemp-isa-0005. There is also a change in the underlying > directories, /sys/class/hwmon/hwmon1/device now points > to /sys/devices/platform/coretemp.5 instead > of /sys/devices/platform/coretemp.1. This has three drawbacks: > 1* Configuration statements from /etc/sensors.conf will no longer be > applied. > 2* Some monitoring applications may lose their path to the sensors. > Thankfully, libsensors uses hwmon device paths rather than physical > device paths, so the effect should be limited, but other tools (e.g. > the fancontrol script) tend to prefer physical device paths, so they > will break. > 3* If you disable several HTs at once, you have no guarantee that the > new hwmon devices will be numbered in the same order as the old hwmon > devices. If you are unlucky and the number changes, then all > libsensors-based applications will start reporting garbage. > > I admit that these issues are not critical ones, and are rather > unlikely to happen in the real world, but so is the problem you are > trying to solve in the first place. > > Point 1* could be easily solved by changing the way the coretemp device > ID is allocated. Instead of using the CPU ID directly, we would use the > smallest CPU ID amongst all the siblings. This ensures a consistent ID > no matter which sibling is used. > > Points 2* and 3*, however, can't be solved without reworking the driver > significantly. I think we should not only skip duplicate HT entries on > driver registration as my naive patch did. We should instead keep track > of them, i.e. all coretemp entries should know the list of CPU entries > they are backed up by, and a coretemp device would be unregistered only > when this list shrinks to zero elements (all HT have been removed.) > > As you said you agree to give a try to a rework of the coretemp driver > to keep all related cores into the same hwmon device, I think this > additional constraint might fit well in the new driver design. What do > you think? Yes, I agree with you on that. Since I'm rewriting coretemp/pkgtemp, this issue will be fixed in a new coding. Thanks. -Fenghua
WARNING: multiple messages have this Message-ID (diff)
From: Fenghua Yu <fenghua.yu@intel.com> To: Jean Delvare <khali@nerim.net> Cc: "Yu, Fenghua" <fenghua.yu@intel.com>, Rudolf Marek <r.marek@assembler.cz>, "Wan, Huaxu" <huaxu.wan@intel.com>, H Peter Anvin <hpa@zytor.com>, Chen Gong <gong.chen@linux.intel.com>, linux-kernel <linux-kernel@vger.kernel.org>, lm-sensors <lm-sensors@lm-sensors.org> Subject: Re: [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Date: Fri, 20 Aug 2010 21:53:53 +0000 [thread overview] Message-ID: <20100820215353.GC22602@linux-os.sc.intel.com> (raw) In-Reply-To: <20100820102646.7659b6b7@hyperion.delvare> On Fri, Aug 20, 2010 at 01:26:46AM -0700, Jean Delvare wrote: > Hi Fenghua, > > On Wed, 18 Aug 2010 15:53:45 -0700, Fenghua Yu wrote: > > From: Fenghua Yu <fenghua.yu@intel.com> > > When a CPU is hot-removed, its core sensor should be still available to upper > > level application as long as the hot-removed CPU's HT sibling is still running. > > A core sensor is invisible to user level only when all of siblings in a core are > > hot-removed. > > Good point. I admit I didn't think about this scenario when fixing the > duplicate HT entries. I thought both hyperthreads would go away at the > same time, but since then I learned that individual HT can be removed > using the sysfs "online" attributes. > > That being said, I'm curious if this is really a problem in practice? > Why would one disable only one hyperthread on a given core? I can't > think of a real-world scenario. Overall we need to keep state integrity for hot-removed CPU and shared core sensor. Without fixing this issue, we end up with inconsistent system info. As for usage scenario, I can think of some: 1. Power saving. Management application may offline some threads or all thread siblings to save power. Image all of HT is disabled during run-time, less power is consumed with less performance. 2. RAS. A bad thread may be offlined which its sibling is still running. This could be becaused of logical CPU spcific state e.g. instruction TLB. > > I don't mean to suggest that we don't have to fix the problem. I'm > simply trying to figure out how fast we need to fix it, and whether the > fix is worth adding to the stable kernel series or not. > > As you can see, the switch of hyperthreads on Core 1 caused hwmon > device coretemp-isa-0001 to be removed and be replaced with > coretemp-isa-0005. There is also a change in the underlying > directories, /sys/class/hwmon/hwmon1/device now points > to /sys/devices/platform/coretemp.5 instead > of /sys/devices/platform/coretemp.1. This has three drawbacks: > 1* Configuration statements from /etc/sensors.conf will no longer be > applied. > 2* Some monitoring applications may lose their path to the sensors. > Thankfully, libsensors uses hwmon device paths rather than physical > device paths, so the effect should be limited, but other tools (e.g. > the fancontrol script) tend to prefer physical device paths, so they > will break. > 3* If you disable several HTs at once, you have no guarantee that the > new hwmon devices will be numbered in the same order as the old hwmon > devices. If you are unlucky and the number changes, then all > libsensors-based applications will start reporting garbage. > > I admit that these issues are not critical ones, and are rather > unlikely to happen in the real world, but so is the problem you are > trying to solve in the first place. > > Point 1* could be easily solved by changing the way the coretemp device > ID is allocated. Instead of using the CPU ID directly, we would use the > smallest CPU ID amongst all the siblings. This ensures a consistent ID > no matter which sibling is used. > > Points 2* and 3*, however, can't be solved without reworking the driver > significantly. I think we should not only skip duplicate HT entries on > driver registration as my naive patch did. We should instead keep track > of them, i.e. all coretemp entries should know the list of CPU entries > they are backed up by, and a coretemp device would be unregistered only > when this list shrinks to zero elements (all HT have been removed.) > > As you said you agree to give a try to a rework of the coretemp driver > to keep all related cores into the same hwmon device, I think this > additional constraint might fit well in the new driver design. What do > you think? Yes, I agree with you on that. Since I'm rewriting coretemp/pkgtemp, this issue will be fixed in a new coding. Thanks. -Fenghua _______________________________________________ lm-sensors mailing list lm-sensors@lm-sensors.org http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
next prev parent reply other threads:[~2010-08-20 21:53 UTC|newest] Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top 2010-08-18 22:53 [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Fenghua Yu 2010-08-18 22:53 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Fenghua Yu 2010-08-18 22:53 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Fenghua Yu 2010-08-18 22:53 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Fenghua Yu 2010-08-19 7:24 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare 2010-08-19 7:24 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare 2010-08-19 19:04 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables H. Peter Anvin 2010-08-19 19:04 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of H. Peter Anvin 2010-08-19 21:26 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare 2010-08-19 21:26 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare 2010-08-19 21:28 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables H. Peter Anvin 2010-08-19 21:28 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of H. Peter Anvin 2010-08-19 21:32 ` [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare 2010-08-19 21:32 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare 2010-08-19 21:51 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Guenter Roeck 2010-08-19 21:51 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Guenter Roeck 2010-09-04 12:39 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare 2010-09-04 12:39 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare 2010-09-04 14:30 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Guenter Roeck 2010-09-04 14:30 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Guenter Roeck 2010-09-04 16:17 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of unused variables Jean Delvare 2010-09-04 16:17 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Remove warnings of Jean Delvare 2010-08-18 22:53 ` [PATCH] drivers/hwmon/pkgtemp: Fix improper locking in CPU hot remove Fenghua Yu 2010-08-18 22:53 ` [lm-sensors] [PATCH] drivers/hwmon/pkgtemp: Fix improper locking in Fenghua Yu 2010-08-19 17:07 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Guenter Roeck 2010-08-19 17:07 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Guenter Roeck 2010-08-20 8:26 ` [PATCH] drivers/hwmon/coretemp: Fix incorrect hot-removed CPU's core sensor issue Jean Delvare 2010-08-20 8:26 ` [lm-sensors] [PATCH] drivers/hwmon/coretemp: Fix incorrect Jean Delvare 2010-08-20 21:53 ` Fenghua Yu [this message] 2010-08-20 21:53 ` Fenghua Yu
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20100820215353.GC22602@linux-os.sc.intel.com \ --to=fenghua.yu@intel.com \ --cc=gong.chen@linux.intel.com \ --cc=hpa@zytor.com \ --cc=huaxu.wan@intel.com \ --cc=khali@nerim.net \ --cc=linux-kernel@vger.kernel.org \ --cc=lm-sensors@lm-sensors.org \ --cc=r.marek@assembler.cz \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.