All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
@ 2013-04-02 21:42   ` Randy Dunlap
  2013-04-02 21:47   ` Randy Dunlap
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Randy Dunlap @ 2013-04-02 21:42 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On 04/02/13 15:15, Jacob Pan wrote:
> RAPL(Running Average Power Limit) interface provides platform software
> with the ability to monitor, control, and get notifications on SOC
> power consumptions. Since its first appearance on Sandy Bridge, more
> features have being added to extend its usage. In RAPL, platforms are
> divided into domains for fine grained control. These domains include
> package, DRAM controller, CPU core (Power Plane 0), graphics uncore
> (power plane 1), etc.
> 
> The purpose of this driver is to expose RAPL for userspace
> consumption. Overall, RAPL fits in the generic thermal layer in
> that platform level power capping and monitoring are mainly used for
> thermal management and thermal layer provides the abstracted interface
> needed to have portable applications.
> 
> Specifically, userspace is presented with per domain cooling device
> with sysfs links to its kobject. Although RAPL domain provides many
> parameters for fine tuning, long term power limit is exposed as the
> single knob via cooling device state. Whereas the rest of the
> parameters are still accessible via the linked kobject. This simplifies
> the interface for both simple and advanced use cases.
> 
> Eventfd is used to provide notifications to the userspace. At per domain
> level, use can choose any event capable parameters to register for
> threshold crossing notifications. This is shamelessly "borrowed" from
> cgroup with some trimming/fitting.
> 
> Zhang, Rui's initial RAPL driver was used as a reference and starting
> point. Many thanks.
> https://lkml.org/lkml/2011/5/26/93
> 
> Unlike the patch above, which is mainly for monitoring, this driver
> focus on the control and usability by user applications.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/platform/x86/Kconfig      |    8 +
>  drivers/platform/x86/Makefile     |    1 +
>  drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
>  drivers/platform/x86/intel_rapl.h |  249 +++++++
>  4 files changed, 1581 insertions(+)
>  create mode 100644 drivers/platform/x86/intel_rapl.c
>  create mode 100644 drivers/platform/x86/intel_rapl.h
> 
> diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig
> index 3338437..34bcd52 100644
> --- a/drivers/platform/x86/Kconfig
> +++ b/drivers/platform/x86/Kconfig
> @@ -781,4 +781,12 @@ config APPLE_GMUX
>  	  graphics as well as the backlight. Currently only backlight
>  	  control is supported by the driver.
>  
> +config INTEL_RAPL
> +	tristate "Intel RAPL Support"
> +	depends on X86 && THERMAL
> +	default y
> +	---help---
> +	  RAPL, AKA, Running Average Power Limit provides mechanisms to enforce

	  RAPL, aka Running Average Power Limit, provides mechanisms to enforce

or a.k.a., but not AKA

or even:
	  RAPL (Running Average Power Limit) provides mechanisms to enforce


> +	  and monitor per domain power consumption limits of supported Intel CPUs.
> +
>  endif # X86_PLATFORM_DEVICES


-- 
~Randy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
  2013-04-02 21:42   ` Randy Dunlap
@ 2013-04-02 21:47   ` Randy Dunlap
  2013-04-02 23:04     ` Greg Kroah-Hartman
  2013-04-02 22:35   ` Joe Perches
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Randy Dunlap @ 2013-04-02 21:47 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven,
	Greg Kroah-Hartman

On 04/02/13 15:15, Jacob Pan wrote:
> diff --git a/drivers/platform/x86/intel_rapl.c b/drivers/platform/x86/intel_rapl.c
> new file mode 100644
> index 0000000..56ee928
> --- /dev/null
> +++ b/drivers/platform/x86/intel_rapl.c
> @@ -0,0 +1,1323 @@
> +/*
> + *  intel_rapl.c - Intel Running Average Power Limit Driver for MSR based
> + *                 RAPL interface
> + *
> + *  Copyright (c) 2013, Intel Corporation.
> + * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +#include <linux/sysfs.h>
> +

> +#include "../../../fs/sysfs/sysfs.h"


What does this driver need from ^^^^^^ that file, which says:

 * fs/sysfs/sysfs.h - sysfs internal header file

and should that be moved to include/linux/sysfs.h ?


-- 
~Randy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 0/1] RAPL (Running Average Power Limit) driver
@ 2013-04-02 22:15 Jacob Pan
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-02 22:15 UTC (permalink / raw)
  To: LKML, Platform Driver, Matthew Garrett
  Cc: Zhang Rui, Rafael Wysocki, Len Brown, Srinivas Pandruvada,
	Arjan van de Ven, Jacob Pan

RAPL(Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumptions. Since its first appearance on Sandy Bridge, more
features have being added to extend its usage. In RAPL, platforms are
divided into domains for fine grained control. These domains include
package, DRAM controller, CPU core (Power Plane 0), graphics uncore
(power plane 1), etc.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform level power capping and monitoring are mainly used for
thermal management and thermal layer provides the abstracted interface
needed to have portable applications.

Specifically, userspace is presented with per domain cooling device
with sysfs links to its kobject. Although RAPL domain provides many
parameters for fine tuning, long term power limit is exposed as the
single knob via cooling device state. Whereas the rest of the
parameters are still accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

DETAILS
=======
1. sysfs layout

As an x86 platform driver, RAPL driver binds with supported CPU ids
during probing phase. Once domains are discovered, kobjets are created
for each domain which are also linked with cooling devices after its
registration with the generic thermal layer.

e.g.package RAPL domain registered as cooling device #15, link "device"
back to its kobject.

/sys/class/thermal/cooling_device15/
├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In driver's private sysfs area, domains kobjects are grouped under a
kset which exposes global data.
/sys/devices/platform/intel_rapl/
├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling
-> ../../../../virtual/thermal/cooling_device15  
│   ├── power_plane_0
│   │   └── thermal_cooling
-> ../../../../virtual/thermal/cooling_device16  
│   └── power_plane_1
│       └── thermal_cooling
-> ../../../../virtual/thermal/cooling_device18  
└── subsystem -> ../../../bus/platform


2. per domain parameters

These are the fine tuning parameters only used by advanced
power/thermal management applications. Refer to Intel SDM ch14 for
details.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:924228
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:1
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:2276
power_limit1:12000
power_limit2:31250
thermal_spec_power:17000
throttle_time:
time_window1:28000
time_window2:0

3. event notifications

RAPL driver uses eventfd to provide userspace notifications on selected
events. A file node called "event_control" is created for each RAPL
domain. User can write control file descriptor, eventfd descriptor, and
threshold to event_control file. Then, user application can use
poll/select or blocking read to get notifications from the driver.
Multiple events are allowed for each domain but only a single threshold
is accepted.

4. Usage Examples (assume the topology in the sysfs layout above)

- set power limit to package domain (whole SOC package) to 6w
root@chromoly:~# echo 6000
	> /sys/class/thermal/cooling_device15/cur_state  

- set power limit to pp1 domain (graphics) to 4w
root@chromoly:~# echo 4000
	> /sys/class/thermal/cooling_device18/cur_state  

- check the current power usage in mWatts of pp1 domain
root@chromoly:~# cat  /sys/class/thermal/cooling_device18/cur_state 
61

- set event notification when power consumption of graphics unit crosses
  5w.
root@chromoly:~#
  event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens control file power and creates an eventfd,
then write efd, cfd, threshold to event_control file of the given
domain)


Caveats:

1. Package power limit events are supported by legacy thermal reporting
mechanism, which uses local APIC thermal vector to generate interrupts
when targeted P-states are not honored by the HW/FW. This is tied to
machine check reporting. Until RAPL is used, this notification is a rare
exception. When RAPL power limit is set artifically low, this
notification could result in unwanted interrupts for each power limit
excursion. Therefore, RAPL driver attempts to turn off the power limit
notification interrupt when user sets a power limit.

2. By Intel Software Developer's Manual, RAPL interface can report
max/min power for certain domains. But in reality HW often reports 0
for max/min power. RAPL driver tackles this problem by using thermal
specification power or current power limit1 when max power information
is not available. The result is that the max_state of a RAPL cooling
device can be based on thermal spec power or power limit 1.

3. Since RAPL is backed by FW. In case of FW failure or plain lack of
support, setting RAPL power limit could result in silent failure. I
don't have a good solution for that.

4. Data polling starts only when the following items are set
	- power limit
	- events


Jacob Pan (1):
  Introduce Intel RAPL cooling device driver

 drivers/platform/x86/Kconfig      |    8 +
 drivers/platform/x86/Makefile     |    1 +
 drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
 drivers/platform/x86/intel_rapl.h |  249 +++++++
 4 files changed, 1581 insertions(+)
 create mode 100644 drivers/platform/x86/intel_rapl.c
 create mode 100644 drivers/platform/x86/intel_rapl.h

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 [PATCH 0/1] RAPL (Running Average Power Limit) driver Jacob Pan
@ 2013-04-02 22:15 ` Jacob Pan
  2013-04-02 21:42   ` Randy Dunlap
                     ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Jacob Pan @ 2013-04-02 22:15 UTC (permalink / raw)
  To: LKML, Platform Driver, Matthew Garrett
  Cc: Zhang Rui, Rafael Wysocki, Len Brown, Srinivas Pandruvada,
	Arjan van de Ven, Jacob Pan

RAPL(Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumptions. Since its first appearance on Sandy Bridge, more
features have being added to extend its usage. In RAPL, platforms are
divided into domains for fine grained control. These domains include
package, DRAM controller, CPU core (Power Plane 0), graphics uncore
(power plane 1), etc.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform level power capping and monitoring are mainly used for
thermal management and thermal layer provides the abstracted interface
needed to have portable applications.

Specifically, userspace is presented with per domain cooling device
with sysfs links to its kobject. Although RAPL domain provides many
parameters for fine tuning, long term power limit is exposed as the
single knob via cooling device state. Whereas the rest of the
parameters are still accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

Eventfd is used to provide notifications to the userspace. At per domain
level, use can choose any event capable parameters to register for
threshold crossing notifications. This is shamelessly "borrowed" from
cgroup with some trimming/fitting.

Zhang, Rui's initial RAPL driver was used as a reference and starting
point. Many thanks.
https://lkml.org/lkml/2011/5/26/93

Unlike the patch above, which is mainly for monitoring, this driver
focus on the control and usability by user applications.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/platform/x86/Kconfig      |    8 +
 drivers/platform/x86/Makefile     |    1 +
 drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
 drivers/platform/x86/intel_rapl.h |  249 +++++++
 4 files changed, 1581 insertions(+)
 create mode 100644 drivers/platform/x86/intel_rapl.c
 create mode 100644 drivers/platform/x86/intel_rapl.h

diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig
index 3338437..34bcd52 100644
--- a/drivers/platform/x86/Kconfig
+++ b/drivers/platform/x86/Kconfig
@@ -781,4 +781,12 @@ config APPLE_GMUX
 	  graphics as well as the backlight. Currently only backlight
 	  control is supported by the driver.
 
+config INTEL_RAPL
+	tristate "Intel RAPL Support"
+	depends on X86 && THERMAL
+	default y
+	---help---
+	  RAPL, AKA, Running Average Power Limit provides mechanisms to enforce
+	  and monitor per domain power consumption limits of supported Intel CPUs.
+
 endif # X86_PLATFORM_DEVICES
diff --git a/drivers/platform/x86/Makefile b/drivers/platform/x86/Makefile
index ace2b38..a80c0f4 100644
--- a/drivers/platform/x86/Makefile
+++ b/drivers/platform/x86/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_INTEL_OAKTRAIL)	+= intel_oaktrail.o
 obj-$(CONFIG_SAMSUNG_Q10)	+= samsung-q10.o
 obj-$(CONFIG_APPLE_GMUX)	+= apple-gmux.o
 obj-$(CONFIG_CHROMEOS_LAPTOP)	+= chromeos_laptop.o
+obj-$(CONFIG_INTEL_RAPL)	+= intel_rapl.o
diff --git a/drivers/platform/x86/intel_rapl.c b/drivers/platform/x86/intel_rapl.c
new file mode 100644
index 0000000..56ee928
--- /dev/null
+++ b/drivers/platform/x86/intel_rapl.c
@@ -0,0 +1,1323 @@
+/*
+ *  intel_rapl.c - Intel Running Average Power Limit Driver for MSR based
+ *                 RAPL interface
+ *
+ *  Copyright (c) 2013, Intel Corporation.
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or (at
+ *  your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful, but
+ *  WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License along
+ *  with this program; if not, write to the Free Software Foundation, Inc.,
+ *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ */
+
+/* #define DEBUG */
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/platform_device.h>
+#include <linux/thermal.h>
+#include <linux/slab.h>
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/log2.h>
+#include <linux/bitmap.h>
+#include <linux/delay.h>
+#include <linux/sysfs.h>
+
+#include <asm/processor.h>
+#include <asm/cpu_device_id.h>
+
+#include "intel_rapl.h"
+#include "../../../fs/sysfs/sysfs.h"
+#define DRIVER_NAME "intel_rapl"
+
+static void rapl_poll_data(struct work_struct *dummy);
+static DECLARE_DELAYED_WORK(rapl_polling_work, rapl_poll_data);
+static bool polling_started;
+static int start_periodic_polling(void);
+static int stop_periodic_polling(void);
+static struct kset *rapl_kset;
+
+static void rapl_init_domains(void);
+
+static struct rapl_domain *rapl_domains;
+static struct rapl_data rg_data; /* global data */
+static struct rapl_domain_data *rd_data;
+
+#define kobj_to_rapl_domain(k) container_of(k, struct rapl_domain, kobj)
+#define to_rapl_attr(a) container_of(a, struct rapl_attr, attr)
+
+static struct platform_device intel_rapl_device = {
+	.name		   = DRIVER_NAME,
+	.id		   = -1,
+};
+
+static char *rapl_domain_names[] = {
+	"package",
+	"power_plane_0",
+	"power_plane_1",
+	"dram",
+};
+
+/* called after domain detection and global data are set */
+static void rapl_init_domains(void)
+{
+	int i, j = 0;
+
+	for (i = 0; i < RAPL_DOMAIN_MAX; i++) {
+		unsigned int mask = rg_data.domain_map & (1 << i);
+
+		switch (mask) {
+		case 1 << RAPL_DOMAIN_PKG:
+			rapl_domains[j].name =
+				rapl_domain_names[RAPL_DOMAIN_PKG];
+			rapl_domains[j].id = RAPL_DOMAIN_PKG;
+			rapl_domains[j].msrs[0] = MSR_PKG_POWER_LIMIT;
+			rapl_domains[j].msrs[1] = MSR_PKG_ENERGY_STATUS;
+			rapl_domains[j].msrs[2] = MSR_PKG_PERF_STATUS;
+			rapl_domains[j].msrs[3] = 0;
+			rapl_domains[j].msrs[4] = MSR_PKG_POWER_INFO;
+			rapl_domains[j].attr_map = RAPL_STANDARD_ATTRS |
+				RAPL_ATTR_PL2 |
+				RAPL_ATTR_PL2_ENABLE | RAPL_ATTR_PL2_CLAMP |
+				RAPL_ATTR_THROTTLE_TIME |
+				RAPL_ATTR_TIME_WINDOW2 |
+				RAPL_ATTR_MAX_POWER | RAPL_ATTR_MIN_POWER |
+				RAPL_ATTR_MAX_WINDOW |
+				RAPL_ATTR_THERMAL_SPEC_POWER;
+			break;
+		case 1 << RAPL_DOMAIN_PP0:
+			rapl_domains[j].name =
+				rapl_domain_names[RAPL_DOMAIN_PP0];
+			rapl_domains[j].id = RAPL_DOMAIN_PP0;
+			rapl_domains[j].msrs[0] = MSR_PP0_POWER_LIMIT;
+			rapl_domains[j].msrs[1] = MSR_PP0_ENERGY_STATUS;
+			rapl_domains[j].msrs[2] = 0;
+			rapl_domains[j].msrs[3] = MSR_PP0_POLICY;
+			rapl_domains[j].msrs[4] = 0;
+			rapl_domains[j].attr_map = RAPL_STANDARD_ATTRS |
+				RAPL_ATTR_PRIO_LEVEL |
+				RAPL_ATTR_THROTTLE_TIME;
+			break;
+		case 1 << RAPL_DOMAIN_PP1:
+			rapl_domains[j].name =
+				rapl_domain_names[RAPL_DOMAIN_PP1];
+			rapl_domains[j].id = RAPL_DOMAIN_PP1;
+			rapl_domains[j].msrs[0] = MSR_PP1_POWER_LIMIT;
+			rapl_domains[j].msrs[1] = MSR_PP1_ENERGY_STATUS;
+			rapl_domains[j].msrs[2] = 0;
+			rapl_domains[j].msrs[3] = MSR_PP1_POLICY;
+			rapl_domains[j].msrs[4] = 0;
+			rapl_domains[j].attr_map = RAPL_STANDARD_ATTRS |
+				RAPL_ATTR_PRIO_LEVEL |
+				RAPL_ATTR_THROTTLE_TIME;
+			break;
+		case 1 << RAPL_DOMAIN_DRAM:
+			rapl_domains[j].name =
+				rapl_domain_names[RAPL_DOMAIN_DRAM];
+			rapl_domains[j].id = RAPL_DOMAIN_DRAM;
+			rapl_domains[j].msrs[0] = MSR_DRAM_POWER_LIMIT;
+			rapl_domains[j].msrs[1] = MSR_DRAM_ENERGY_STATUS;
+			rapl_domains[j].msrs[2] = MSR_DRAM_PERF_STATUS;
+			rapl_domains[j].msrs[3] = 0;
+			rapl_domains[j].msrs[4] = MSR_DRAM_POWER_INFO;
+			rapl_domains[j].attr_map = RAPL_STANDARD_ATTRS |
+				RAPL_ATTR_THROTTLE_TIME |
+				RAPL_ATTR_TIME_WINDOW2 |
+				RAPL_ATTR_MAX_POWER | RAPL_ATTR_MIN_POWER |
+				RAPL_ATTR_MAX_WINDOW |
+				RAPL_ATTR_THERMAL_SPEC_POWER;
+			break;
+		default:
+			pr_info("No rapl domain %s on this platform\n",
+				rapl_domain_names[i]);
+		}
+		if (mask)
+			j++;
+	}
+}
+
+static u64 rapl_unit_xlate(enum unit_type type, u64 value, int to_raw)
+{
+	u64 divisor = 1;
+	int scale = 1; /* scale to user friendly data without floating point */
+	int f, y; /* fraction and exp. used for time unit */
+
+	switch (type) {
+	case POWER_UNIT:
+		divisor = rg_data.power_unit_divisor;
+		scale = POWER_UNIT_SCALE;
+		break;
+	case ENERGY_UNIT:
+		scale = ENERGY_UNIT_SCALE;
+		divisor = rg_data.energy_unit_divisor;
+		break;
+	case TIME_UNIT:
+		divisor = rg_data.time_unit_divisor;
+		scale = TIME_UNIT_SCALE;
+		/* special processing based on 2^Y*(1+F)/4 = val/divisor */
+		if (!to_raw) {
+			f = (value & 0x60) >> 5;
+			y = value & 0x1f;
+			value = (1<<y)*(4+f)*scale/4;
+			return div64_u64(value, divisor);
+		} else {
+			do_div(value, scale);
+			value *= divisor;
+			y = ilog2(value);
+			f = div64_u64(4 * (value-(1<<y)), 1<<y);
+			value = (y & 0x1f) | ((f&0x3)<<5);
+			return value;
+		}
+		break;
+	case NA_UNIT:
+	default:
+		return value;
+	};
+
+	if (to_raw)
+		return div64_u64(value * divisor, scale);
+	else
+		return div64_u64(value * scale, divisor);
+}
+
+/* in the order of enum rapl_primitives */
+static struct rapl_primitive_info rpi[] = {
+	/* name, mask, shift, msr index, unit divisor*/
+	PRIMITIVE_INFO_INIT(energy, ENERGY_STATUS_MASK, 0,
+			RAPL_DOMAIN_MSR_STATUS, ENERGY_UNIT,
+			RAPL_PRIMITIVE_EVENT_CAP),
+	PRIMITIVE_INFO_INIT(power_limit1, POWER_LIMIT1_MASK, 0,
+			RAPL_DOMAIN_MSR_LIMIT, POWER_UNIT, 0),
+	PRIMITIVE_INFO_INIT(power_limit2, POWER_LIMIT2_MASK, 32,
+			RAPL_DOMAIN_MSR_LIMIT, POWER_UNIT, 0),
+	PRIMITIVE_INFO_INIT(lock, POWER_PP_LOCK, 31,
+			RAPL_DOMAIN_MSR_LIMIT, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(pl1_enable, POWER_LIMIT1_ENABLE, 15,
+			RAPL_DOMAIN_MSR_LIMIT, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(pl1_clamp, POWER_LIMIT1_CLAMP, 16,
+			RAPL_DOMAIN_MSR_LIMIT, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(pl2_enable, POWER_LIMIT2_ENABLE, 47,
+			RAPL_DOMAIN_MSR_LIMIT, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(pl2_clamp, POWER_LIMIT2_CLAMP, 48,
+			RAPL_DOMAIN_MSR_LIMIT, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(time_window1, TIME_WINDOW1_MASK, 17,
+			RAPL_DOMAIN_MSR_LIMIT, TIME_UNIT, 0),
+	PRIMITIVE_INFO_INIT(time_window2, TIME_WINDOW2_MASK, 49,
+			RAPL_DOMAIN_MSR_LIMIT, TIME_UNIT, 0),
+	PRIMITIVE_INFO_INIT(thermal_spec_power, POWER_INFO_THERMAL_SPEC_MASK, 0,
+			RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0),
+	PRIMITIVE_INFO_INIT(max_power, POWER_INFO_MAX_MASK, 32,
+			RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0),
+	PRIMITIVE_INFO_INIT(min_power, POWER_INFO_MIN_MASK, 16,
+			RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0),
+	PRIMITIVE_INFO_INIT(max_window, POWER_INFO_MAX_TIME_WIN_MASK, 48,
+			RAPL_DOMAIN_MSR_INFO, TIME_UNIT, 0),
+	PRIMITIVE_INFO_INIT(throttle_time, PERF_STATUS_THROTTLE_TIME_MASK, 0,
+			RAPL_DOMAIN_MSR_PERF, TIME_UNIT,
+			RAPL_PRIMITIVE_EVENT_CAP),
+	PRIMITIVE_INFO_INIT(prio_level, PP_POLICY_MASK, 0,
+			RAPL_DOMAIN_MSR_POLICY, NA_UNIT, 0),
+	PRIMITIVE_INFO_INIT(power, 0, 0, 0, POWER_UNIT,
+			RAPL_PRIMITIVE_DERIVED|RAPL_PRIMITIVE_EVENT_CAP),
+	/* non-hardware, used for sysfs attr */
+	PRIMITIVE_INFO_INIT(event_control, 0, 0, 0, 0, RAPL_PRIMITIVE_DUMMY),
+	PRIMITIVE_INFO_INIT(domain_name, 0, 0, 0, 0, RAPL_PRIMITIVE_DUMMY),
+	{NULL, 0, 0, 0},
+};
+
+static int primitive_name_to_entry(const char *name)
+{
+	int i;
+
+	for (i = 0; i < nr_rapl_primitives; i++) {
+		if (!strcmp(rpi[i].name, name))
+			return i;
+	}
+
+	return -EINVAL;
+}
+
+static int rapl_read_data_raw(struct rapl_domain *domain,
+			struct rapl_primitive_info *rp, bool xlate, u64 *data)
+{
+	u32 msr_l, msr_h;
+	u64 value, final;
+	u32 msr;
+	u32 mask_h, mask_l;
+
+	if (NULL == rp->name || rp->flag & RAPL_PRIMITIVE_DUMMY)
+		return -EINVAL;
+
+	msr = domain->msrs[rp->id];
+	if (!msr)
+		return -EINVAL;
+
+	/* specical-case pkg lock bit since pkg domain uses a different bit */
+	if (rp->pm_id == lock && domain->id == RAPL_DOMAIN_PKG) {
+		rp->mask = POWER_PKG_LOCK;
+		rp->shift = 63;
+	}
+	if (rp->flag & RAPL_PRIMITIVE_DERIVED) {
+		*data = domain->rdd->primitives[rp->pm_id];
+		return 0;
+	}
+
+	if (rdmsr_safe(msr, &msr_l, &msr_h)) {
+		pr_debug("failed to read msr 0x%x\n", msr);
+		return -EIO;
+	}
+
+	mask_h = rp->mask >> 32;
+	mask_l = rp->mask & 0xffffffff;
+
+	value = (u64)msr_h<<32 | (u64)msr_l;
+
+	final = value & rp->mask;
+	final = final >> rp->shift;
+	if (true == xlate)
+		*data = rapl_unit_xlate(rp->unit, final, 0);
+	else
+		*data = final;
+
+	return 0;
+}
+
+static int rapl_write_data_raw(struct rapl_domain *domain,
+			struct rapl_primitive_info *rp,
+			unsigned long long value)
+{
+	u32 msr_l, msr_h;
+	u32 mask_h, val_h;
+	u32 msr = domain->msrs[rp->id];
+
+	if (rdmsr_safe(msr, &msr_l, &msr_h)) {
+		pr_err("failed to read msr 0x%x\n", msr);
+		return -EIO;
+	}
+	value = rapl_unit_xlate(rp->unit, value, 1);
+	mask_h = rp->mask >> 32;
+	if (mask_h) {
+		msr_h &= ~mask_h;
+		val_h = (value << rp->shift) >> 32;
+		msr_h |= val_h;
+	}
+	msr_l &= ~(u32)rp->mask;
+	msr_l |= (u32)value << rp->shift;
+	if (wrmsr_safe(msr, msr_l, msr_h)) {
+		pr_err("failed to read msr 0x%x\n", msr);
+		return -EIO;
+	}
+
+	return value >> rp->shift;
+}
+
+#define SHOW_PRIMITIVE(n)						\
+	static ssize_t show_ ## n(struct rapl_domain *rd, char *buf)	\
+	{								\
+		u64 val;						\
+		int ret;						\
+		int i = primitive_name_to_entry(#n);			\
+		if (i >= 0) {						\
+			ret = rapl_read_data_raw(rd, &rpi[i], true, &val); \
+			if (ret)					\
+				return ret;				\
+			return sprintf(buf, "%llu\n", val);		\
+		}							\
+		return i;						\
+	}
+
+static int rapl_check_unit(void)
+{
+	u64 output;
+	u32 value;
+
+	if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &output)) {
+		pr_err("Failed to read power unit MSR 0x%x, exit.\n",
+			MSR_RAPL_POWER_UNIT);
+		return -ENODEV;
+	}
+	/* energy unit: 1/enery_unit_divisor Joules */
+	value = (output & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET;
+	rg_data.energy_unit_divisor = 1 << value;
+
+	/* power unit: 1/power_unit_divisor Watts */
+	value = (output & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET;
+	rg_data.power_unit_divisor = 1 << value;
+
+	/* time unit: 1/time_unit_divisor Seconds */
+	value = (output & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
+	rg_data.time_unit_divisor = 1 << value;
+
+	return 0;
+}
+
+static int rapl_get_max_state(struct thermal_cooling_device *cdev,
+			unsigned long *state)
+{
+	int ret;
+	u64 val;
+
+	struct rapl_domain *rd = (struct rapl_domain *)cdev->devdata;
+
+	/* TDP aka thermal design power is the max level rapl can set */
+	ret = rapl_read_data_raw(rd, &rpi[thermal_spec_power], true, &val);
+	if (ret)
+		goto default_max;
+	if (val)
+		goto done;
+	/* use pl1 setting as max if tdp is not available */
+default_max:
+	ret = rapl_read_data_raw(rd, &rpi[power_limit1], true, &val);
+	if (ret)
+		return ret;
+done:
+	*state = val;
+
+	return 0;
+}
+
+static int rapl_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
+			     *state)
+{
+	struct rapl_domain *rd = (struct rapl_domain *)cdev->devdata;
+
+	if (false == polling_started)
+		*state = 0;
+	else
+		*state = rd->rdd->primitives[power];
+
+	return 0;
+}
+
+static bool rapl_polling_should_cont(void)
+{
+	int i;
+	unsigned int all_state = 0;
+
+	/* remaining events or user set power limit will continue polling */
+	for (i = 0; i < rg_data.nr_domains; i++)
+		all_state += rapl_domains[i].state;
+
+	return !!all_state;
+}
+
+
+static void set_pkg_thermal_irq(bool enable)
+{
+	u32 l, h;
+
+	/* REVISIT:
+	 * When package power limit is set artificially low by RAPL, LVT
+	 * thermal interrupt for package power limit should be ignored
+	 * since we are not really exceeding the real limit. The intention
+	 * is to avoid interrupt storms while we are power limiting.
+	 * A useful feature will be routing the pkg_power_limit interrupt
+	 * to userspace via eventfd. once we have a usecase, this is simple
+	 * to do by adding an atomic notifier.
+	 */
+	if (boot_cpu_has(X86_FEATURE_PTS))
+		rdmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, l, h);
+	else
+		return;
+
+	if (false == enable)
+		l &= ~PACKAGE_THERM_INT_PLN_ENABLE;
+	else
+		l |= PACKAGE_THERM_INT_PLN_ENABLE;
+
+	if (boot_cpu_has(X86_FEATURE_PLN))
+		wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, l, h);
+
+}
+
+static int rapl_set_cur_state(struct thermal_cooling_device *cdev,
+			unsigned long state)
+{
+	struct rapl_domain *rd = (struct rapl_domain *)cdev->devdata;
+	unsigned long spec_power = rd->rdd->primitives[thermal_spec_power];
+	unsigned long minimum_power = rd->rdd->primitives[min_power];
+
+	if (state) {
+		/* in some cases, no spec power is provided. just do a basic
+		 * range check between 0 and max bits allowed.
+		 */
+		if (!spec_power || !minimum_power) {
+			minimum_power = 0;
+			spec_power = POWER_UNIT_SCALE*
+				POWER_INFO_THERMAL_SPEC_MASK/
+				rg_data.power_unit_divisor;
+		}
+		if (state < minimum_power || state >= spec_power) {
+			pr_err("Out of thermal spec power range! %lu- %lu\n",
+				minimum_power, spec_power);
+			state = clamp(state, minimum_power, spec_power);
+		}
+		/* REVISIT: there are correlations between RAPL parameters.
+		 * 1) set proportional pl2, e.g. 1.2xpl1
+		 * 2) use a short default tw1 such as 1 sec. some system has
+		 *    a very long default time window (20sec+) which results in
+		 *    slow response.
+		 * user can set these parameters via the device sysfs files, or
+		 * we can assign the best guessed value here.
+		 */
+		rapl_write_data_raw(rd, &rpi[pl1_enable], 1);
+		rapl_write_data_raw(rd, &rpi[pl1_clamp], 1);
+		rapl_write_data_raw(rd, &rpi[power_limit1], state);
+		rd->state |= DOMAIN_STATE_POWER_LIMIT_SET;
+		start_periodic_polling();
+		set_pkg_thermal_irq(false);
+	} else {
+		/* may stop polling if no pending events */
+		rapl_write_data_raw(rd, &rpi[pl1_enable], 0);
+		rapl_write_data_raw(rd, &rpi[pl1_clamp], 0);
+		rd->state &= ~DOMAIN_STATE_POWER_LIMIT_SET;
+		set_pkg_thermal_irq(true);
+	}
+
+	return 0;
+}
+
+static const struct thermal_cooling_device_ops rapl_cdev_ops = {
+	.get_max_state = rapl_get_max_state,
+	.get_cur_state = rapl_get_cur_state,
+	.set_cur_state = rapl_set_cur_state,
+};
+
+static const struct x86_cpu_id intel_rapl_ids[] = {
+	{ X86_VENDOR_INTEL, 6, 0x2a},/* SNB */
+	{ X86_VENDOR_INTEL, 6, 0x2d},
+	{ X86_VENDOR_INTEL, 6, 0x3a},/* IVB */
+	{ X86_VENDOR_INTEL, 6, 0x45},/* HSW */
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_rapl_ids);
+
+static ssize_t show_domain_name(struct rapl_domain *rd, char *buf)
+{
+	return sprintf(buf, "%s\n", rd->name);
+}
+
+static ssize_t show_power(struct rapl_domain *rd, char *buf)
+{
+	return sprintf(buf, "%lu\n", rd->rdd->primitives[power]);
+}
+
+static ssize_t show_event_control(struct rapl_domain *rd, char *buf)
+{
+	struct rapl_event *event;
+	struct rapl_event *tmp;
+	int ret = 0;
+	int i = 0;
+
+	/* show a list of active event name, threshold, counter */
+	list_for_each_entry_safe(event, tmp, &rd->event_list, list) {
+		ret += sprintf(buf, "%s%.16s %lu %lu\n",
+			buf, rpi[event->prim].name, event->thresholds.value,
+			event->counter);
+		if (++i > MAX_RAPL_THRESHOLDS)
+			break;
+	}
+
+	return ret;
+}
+
+/* Gets called on POLLHUP on eventfd when user closes it. */
+static int rapl_event_wake(wait_queue_t *wait, unsigned mode,
+			int sync, void *key)
+{
+	struct rapl_event *event = container_of(wait, struct rapl_event, wait);
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLHUP) {
+		pr_debug("user closed efd, remove event\n");
+		spin_lock(&event->rd->event_lock);
+		if (!list_empty(&event->list)) {
+			list_del_init(&event->list);
+			schedule_work(&event->remove);
+			event->rd->rdd->events[event->prim] = NULL;
+		}
+		spin_unlock(&event->rd->event_lock);
+	}
+
+	return 0;
+}
+
+static void rapl_event_ptable_queue(struct file *file,
+		wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct rapl_event *event = container_of(pt, struct rapl_event, pt);
+
+	event->wqh = wqh;
+	add_wait_queue(wqh, &event->wait);
+}
+
+static void rapl_event_remove(struct work_struct *work)
+{
+	struct rapl_event *event = container_of(work, struct rapl_event,
+						remove);
+
+	remove_wait_queue(event->wqh, &event->wait);
+	eventfd_signal(event->thresholds.eventfd, 1);
+	eventfd_ctx_put(event->thresholds.eventfd);
+	kfree(event);
+}
+
+
+/*
+ * Handles userspace writes to event_control file. For a given domain, the
+ * format is  <eventfd> <command file fd> <threshold value> in the buffer.
+ */
+static ssize_t store_event_control(struct rapl_domain *rd, const char *buf,
+				size_t size)
+{
+	unsigned int efd, cfd, new_threshold;
+	struct file *efile = NULL;
+	struct file *cfile = NULL;
+	int ret = 0;
+	int prim;
+	struct rapl_event *ep;
+	u64 val;
+
+	if (sscanf(buf, "%u %u %u", &efd, &cfd, &new_threshold) != 3)
+		return -EINVAL;
+
+	efile = eventfd_fget(efd);
+	if (IS_ERR(efile)) {
+		ret = PTR_ERR(efile);
+		pr_err("failed to get eventfd file %d\n", efd);
+		goto done;
+	}
+	cfile = fget(cfd);
+	if (!cfile) {
+		ret = -EBADF;
+		fput(efile);
+		goto done;
+	}
+	/* check if the cfile belongs to the same rapl domain */
+	if (strcmp(rd->kobj.sd->s_name,
+			cfile->f_dentry->d_parent->d_name.name)) {
+		pr_debug("cfile does not belong to domain %s\n",
+			rd->kobj.sd->s_name);
+		ret = -EINVAL;
+		goto exit_cleanup_fds;
+	}
+	prim = primitive_name_to_entry(
+		(const char *)cfile->f_dentry->d_name.name);
+	if (prim < 0) {
+		pr_err("failed lookup primitive id for control file %s\n",
+			cfile->f_dentry->d_name.name);
+		ret = -EINVAL;
+		goto exit_cleanup_fds;
+	}
+	if (!(rpi[prim].flag & RAPL_PRIMITIVE_EVENT_CAP)) {
+		pr_info("Invalid control file %d\n", prim);
+		ret = -EINVAL;
+		goto exit_cleanup_fds;
+	}
+
+	/*
+	 * Check if there is already an event registered to this control file.
+	 * If yes, we notify the current efd then replace the event with the
+	 * new efd/cfd/threshold. If new threshold value is 0, we delete the
+	 * event after the user gets notified.
+	 */
+	ep = rd->rdd->events[prim];
+	if (!new_threshold || ep) {
+		if (!ep) {
+			ret = -EINVAL;
+			goto exit_cleanup_fds;
+		}
+		pr_debug("delete: d:%s e:%d c:%d thrd:%s prim:%d val %lu\n",
+			rd->name, efd, cfd, cfile->f_dentry->d_name.name, prim,
+			ep->thresholds.value);
+		spin_lock(&rd->event_lock);
+		list_del_init(&ep->list);
+		spin_unlock(&rd->event_lock);
+
+		/* Notify user event is deleted, user has to figure out the
+		 * current state. i.e. check event_control file to see if
+		 * its threshold is in the active event list.
+		 */
+		rapl_event_remove(&ep->remove);
+		if (list_empty(&rd->event_list))
+			rd->state &= ~DOMAIN_STATE_EVENT_SET;
+		rd->rdd->events[prim] = NULL;
+
+		if (!new_threshold)
+			goto done;
+	}
+
+	ep = kzalloc(sizeof(struct rapl_event), GFP_KERNEL);
+	if (!ep) {
+		ret = -ENOMEM;
+		goto exit_cleanup_fds;
+	}
+	rd->rdd->events[prim] = ep;
+	ep->prim = prim;
+	ep->thresholds.eventfd = eventfd_ctx_fileget(efile);
+	if (IS_ERR(ep->thresholds.eventfd)) {
+		pr_err("failed to get eventfd ctx %d\n", efd);
+		ret = PTR_ERR(ep->thresholds.eventfd);
+		kfree(ep);
+		goto exit_cleanup_fds;
+	}
+
+	ret = rapl_read_data_raw(rd, &rpi[prim], true, &val);
+	if (ret) {
+		pr_debug(" failed to read event.\n");
+		rd->rdd->events[prim] = NULL;
+		kfree(ep);
+		goto exit_cleanup_fds;
+	}
+	init_poll_funcptr(&ep->pt, rapl_event_ptable_queue);
+	init_waitqueue_func_entry(&ep->wait, rapl_event_wake);
+	INIT_WORK(&ep->remove, rapl_event_remove);
+	if (efile->f_op->poll(efile, &ep->pt) & POLLHUP) {
+		schedule_work(&ep->remove);
+		goto exit_cleanup_fds;
+	}
+
+	ep->last_val = val;
+	ep->rd = rd;
+	INIT_LIST_HEAD(&ep->list);
+	spin_lock(&rd->event_lock);
+	list_add(&ep->list, &rd->event_list);
+	spin_unlock(&rd->event_lock);
+
+	ep->thresholds.value = new_threshold;
+	pr_debug("domain:%s efd:%d cfd:%d threshold:%s, prim:%d val %lu\n",
+		rd->name, efd, cfd, cfile->f_dentry->d_name.name, prim,
+		ep->thresholds.value);
+
+	/* start update all data */
+	rd->state |= DOMAIN_STATE_EVENT_SET;
+	smp_wmb();
+	start_periodic_polling();
+
+exit_cleanup_fds:
+	fput(cfile);
+	fput(efile);
+
+done:
+	if (ret)
+		return ret;
+
+	return size;
+}
+
+#define STORE_PRIMITIVE(n)						\
+	static ssize_t store_ ## n(struct rapl_domain *rd,		\
+				const char *buf, size_t size)		\
+	{								\
+	unsigned long long new;						\
+	int i = primitive_name_to_entry(#n);				\
+	if (i < 0)							\
+		return -EIO;						\
+	if (kstrtoull(buf, 0, &new) < 0)				\
+		return -EINVAL;						\
+	rapl_write_data_raw(rd, &rpi[i], new);				\
+	return size;							\
+}
+
+static ssize_t intel_rapl_show(struct kobject *kobj,
+			struct attribute *attr, char *buf)
+{
+	struct rapl_domain *rd = kobj_to_rapl_domain(kobj);
+	struct rapl_attr *ra = to_rapl_attr(attr);
+
+	return ra->show ? ra->show(rd, buf) : -EIO;
+}
+
+static ssize_t intel_rapl_store(struct kobject *kobj,
+	struct attribute *attr, const char *buf, size_t size)
+{
+	struct rapl_domain *rd = kobj_to_rapl_domain(kobj);
+	struct rapl_attr *ra = to_rapl_attr(attr);
+
+	return ra->store ? ra->store(rd, buf, size) : -EIO;
+}
+
+static const struct sysfs_ops intel_rapl_sysfs_ops = {
+	.show = intel_rapl_show,
+	.store = intel_rapl_store,
+};
+
+SHOW_PRIMITIVE(energy)
+SHOW_PRIMITIVE(power_limit1)
+SHOW_PRIMITIVE(power_limit2)
+SHOW_PRIMITIVE(lock)
+SHOW_PRIMITIVE(pl1_enable)
+SHOW_PRIMITIVE(pl2_enable)
+SHOW_PRIMITIVE(pl1_clamp)
+SHOW_PRIMITIVE(pl2_clamp)
+SHOW_PRIMITIVE(time_window1)
+SHOW_PRIMITIVE(time_window2)
+SHOW_PRIMITIVE(thermal_spec_power)
+SHOW_PRIMITIVE(max_power)
+SHOW_PRIMITIVE(min_power)
+SHOW_PRIMITIVE(max_window)
+SHOW_PRIMITIVE(throttle_time)
+SHOW_PRIMITIVE(prio_level)
+
+STORE_PRIMITIVE(power_limit1)
+STORE_PRIMITIVE(power_limit2)
+STORE_PRIMITIVE(time_window1)
+STORE_PRIMITIVE(time_window2)
+STORE_PRIMITIVE(pl1_enable)
+STORE_PRIMITIVE(pl2_enable)
+STORE_PRIMITIVE(pl1_clamp)
+STORE_PRIMITIVE(pl2_clamp)
+
+#define RW_ATTR(val)							\
+	static struct rapl_attr attr_## val = {				\
+		.attr	= {.name = __stringify(val), .mode = 0644 },	\
+		.show	= show_## val,					\
+		.store	= store_## val,					\
+	};
+
+#define RO_ATTR(val)							\
+	static struct rapl_attr attr_## val = {				\
+		.attr	= {.name = __stringify(val), .mode = 0444 },	\
+		.show	= show_## val,					\
+		.store	= NULL,						\
+	};
+
+RO_ATTR(energy);
+RO_ATTR(power);
+RO_ATTR(domain_name);
+RO_ATTR(lock);
+RO_ATTR(thermal_spec_power);
+RO_ATTR(max_power);
+RO_ATTR(min_power);
+RO_ATTR(max_window);
+RO_ATTR(throttle_time);
+RO_ATTR(prio_level);
+
+RW_ATTR(power_limit1);
+RW_ATTR(power_limit2);
+RW_ATTR(time_window1);
+RW_ATTR(time_window2);
+RW_ATTR(pl1_enable);
+RW_ATTR(pl2_enable);
+RW_ATTR(pl1_clamp);
+RW_ATTR(pl2_clamp);
+RW_ATTR(event_control);
+
+/* listed in the order of rapl_primitives such that attrs can be indexed and
+ * assigned to per domain attrs based on its availability.
+ */
+static struct attribute *all_attrs[] = {
+	&attr_energy.attr,
+	&attr_power_limit1.attr,
+	&attr_power_limit2.attr,
+	&attr_lock.attr,
+
+	&attr_pl1_enable.attr,
+	&attr_pl1_clamp.attr,
+	&attr_pl2_enable.attr,
+	&attr_pl2_clamp.attr,
+
+	&attr_time_window1.attr,
+	&attr_time_window2.attr,
+	&attr_thermal_spec_power.attr,
+	&attr_max_power.attr,
+
+	&attr_min_power.attr,
+	&attr_max_window.attr,
+	&attr_throttle_time.attr,
+	&attr_prio_level.attr,
+
+	&attr_power.attr,
+	&attr_event_control.attr,
+	&attr_domain_name.attr,
+	NULL,
+};
+
+#define SHOW_GLOBAL_DATA(n)						\
+	static ssize_t show_ ## n(struct kobject *s,			\
+				struct kobj_attribute *a,		\
+				char *buf)				\
+	{								\
+		return sprintf(buf, "%d\n", rg_data.n);			\
+	}
+
+#define STORE_GLOBAL_DATA(n)						\
+	static ssize_t store_ ## n(struct kobject *s,			\
+					struct kobj_attribute *a,	\
+					const char *buf, size_t size)	\
+	{								\
+		unsigned long new;					\
+		if (kstrtoul(buf, 0, &new) < 0)				\
+			return -EINVAL;					\
+		rg_data.n = new;					\
+		return size;						\
+	}
+
+SHOW_GLOBAL_DATA(polling_freq_hz)
+SHOW_GLOBAL_DATA(energy_unit_divisor)
+SHOW_GLOBAL_DATA(power_unit_divisor)
+SHOW_GLOBAL_DATA(time_unit_divisor)
+
+STORE_GLOBAL_DATA(polling_freq_hz)
+STORE_GLOBAL_DATA(energy_unit_divisor)
+STORE_GLOBAL_DATA(power_unit_divisor)
+STORE_GLOBAL_DATA(time_unit_divisor)
+
+#define GLOBAL_ATTR(n)				\
+	static struct kobj_attribute n =	\
+		__ATTR(n, 0644,			\
+			show_ ##n,		\
+			store_ ##n		\
+			);
+
+GLOBAL_ATTR(energy_unit_divisor);
+GLOBAL_ATTR(power_unit_divisor);
+GLOBAL_ATTR(time_unit_divisor);
+GLOBAL_ATTR(polling_freq_hz);
+
+static const struct attribute *global_attrs[] = {
+	&energy_unit_divisor.attr,
+	&power_unit_divisor.attr,
+	&time_unit_divisor.attr,
+	&polling_freq_hz.attr,
+	NULL,
+};
+
+static void rapl_domain_kobj_release(struct kobject *kobj)
+{
+	struct rapl_domain *rd = kobj_to_rapl_domain(kobj);
+
+	complete(&rd->kobj_unregister);
+}
+
+static struct kobj_type ktype_intel_rapl = {
+	.sysfs_ops = &intel_rapl_sysfs_ops,
+	.release = rapl_domain_kobj_release,
+};
+
+static void rapl_update_domain_data(void)
+{
+	int i, j;
+	u64 val;
+	bool xlate;
+
+	for (i = 0; i < rg_data.nr_domains; i++) {
+		/* exclude non-raw primitives */
+		for (j = 0; j < NR_RAW_PRIMITIVES; j++)
+			xlate = !!(rpi[j].unit);
+			if (!rapl_read_data_raw(&rapl_domains[i], &rpi[j],
+							xlate, &val))
+				rd_data[i].primitives[j] = val;
+	}
+}
+
+static struct attribute **rapl_domain_attrs(int id)
+{
+	struct attribute **attrs;
+	int i = 0, j = 0, n;
+	unsigned long map;
+
+	map = rapl_domains[id].attr_map;
+	n = bitmap_weight(&map, nr_rapl_primitives);
+	/* allocate an extra entry for NULL */
+	attrs = kzalloc((n + 1) * sizeof(struct attribute *), GFP_KERNEL);
+	if (NULL == attrs)
+		return NULL;
+
+	/* fill in attrs with bitmap selected entries from all attrs */
+	do {
+		if (map & 1)
+			attrs[i++] = all_attrs[j];
+		j++;
+		map = map >> 1;
+	} while (map && i < n);
+	rapl_domains[id].attrs = attrs;
+
+	return attrs;
+}
+
+static int intel_rapl_probe(struct platform_device *pdev)
+{
+	int id = 0;
+	int ret = 0;
+	struct thermal_cooling_device *cdev;
+
+	rapl_update_domain_data();
+	for (id = 0; id < rg_data.nr_domains; id++) {
+		/*
+		 * create kobj for each rapl domain then link them with
+		 * generic thermal sysfs
+		 */
+		init_completion(&rapl_domains[id].kobj_unregister);
+		if (rapl_kset)
+			rapl_domains[id].kobj.kset = rapl_kset;
+
+		ktype_intel_rapl.default_attrs = rapl_domain_attrs(id);
+		ret = kobject_init_and_add(&rapl_domains[id].kobj,
+					&ktype_intel_rapl, NULL,
+					"%.32s", rapl_domains[id].name);
+		if (ret) {
+			pr_err("unable to create kobj for domain %s\n",
+				rapl_domains[id].name);
+			goto end;
+		}
+		cdev = thermal_cooling_device_register(DRIVER_NAME,
+						&rapl_domains[id],
+						&rapl_cdev_ops);
+		if (IS_ERR(cdev)) {
+			kobject_del(&rapl_domains[id].kobj);
+			ret = PTR_ERR(cdev);
+			goto end;
+		}
+		kobject_uevent(&rapl_domains[id].kobj, KOBJ_ADD);
+
+		rapl_domains[id].cool_dev = cdev;
+		pr_info("registered RAPL domain %s as cooling device\n",
+			rapl_domains[id].name);
+
+		ret = sysfs_create_link(&rapl_domains[id].kobj,
+					&cdev->device.kobj, "thermal_cooling");
+		if (ret)
+			dev_err(&intel_rapl_device.dev,
+				"Failed to create thermal_cooling link for domain %d %s\n",
+				id, rapl_domains[id].name);
+
+		ret = sysfs_create_link(&cdev->device.kobj,
+					&rapl_domains[id].kobj,
+					"device");
+		if (ret)
+			dev_err(&intel_rapl_device.dev,
+				"Failed to create device link for domain %d %s\n",
+				id, rapl_domains[id].name);
+	}
+
+	if (rapl_kset)
+		ret = sysfs_create_files(&rapl_kset->kobj, global_attrs);
+
+end:
+
+	return ret;
+}
+
+static int intel_rapl_remove(struct platform_device *pdev)
+{
+	enum rapl_domain_id id;
+	struct rapl_event *event;
+	struct rapl_event *tmp;
+	struct rapl_domain *rd;
+
+	stop_periodic_polling();
+	set_pkg_thermal_irq(true);
+
+	for (id = 0; id < rg_data.nr_domains; id++) {
+		pr_debug("Remove %s device\n", rapl_domains[id].name);
+		rd = &rapl_domains[id];
+		rapl_write_data_raw(rd, &rpi[pl1_enable], 0);
+		sysfs_remove_files(&rd->kobj,
+				(const struct attribute **)rd->attrs);
+		list_for_each_entry_safe(event, tmp, &rd->event_list, list) {
+			pr_debug("free event %s, threshold %lu\n",
+				rd->name, event->thresholds.value);
+			spin_lock(&rd->event_lock);
+			schedule_work(&event->remove);
+			list_del(&event->list);
+			spin_unlock(&rd->event_lock);
+		}
+
+		sysfs_remove_link(&rd->cool_dev->device.kobj, "device");
+		sysfs_remove_link(&rd->kobj, "thermal_cooling");
+		thermal_cooling_device_unregister(rd->cool_dev);
+
+		kobject_uevent(&rd->kobj, KOBJ_REMOVE);
+		kobject_del(&rd->kobj);
+		kobject_put(&rd->kobj);
+		wait_for_completion(&rapl_domains[id].kobj_unregister);
+		kfree(rd->attrs);
+	}
+	sysfs_remove_files(&rapl_kset->kobj, global_attrs);
+
+	return 0;
+}
+
+static struct platform_driver intel_rapl_driver = {
+	.driver = {
+		.name = DRIVER_NAME,
+		.owner = THIS_MODULE,
+	},
+	.probe = intel_rapl_probe,
+	.remove = intel_rapl_remove,
+};
+
+static int rapl_update_data(void)
+{
+	int i, ret;
+	static u64 energy_last[RAPL_DOMAIN_MAX];
+	u64 energy_now[RAPL_DOMAIN_MAX];
+
+	/* collect raw data for calculation */
+	for (i = 0; i < rg_data.nr_domains; i++) {
+		ret = rapl_read_data_raw(&rapl_domains[i], &rpi[energy], false,
+					&energy_now[i]);
+		if (ret)
+			return ret;
+	}
+	/* first time reading */
+	if (energy_last[0] == 0)
+		goto exit;
+
+	for (i = 0; i < rg_data.nr_domains; i++) {
+		unsigned long energy_raw = energy_now[i] - energy_last[i];
+		rd_data[i].primitives[power] = div64_u64(
+			rapl_unit_xlate(ENERGY_UNIT, energy_raw, 0),
+			rg_data.polling_freq_hz);
+	}
+exit:
+	memcpy(energy_last, energy_now, sizeof(energy_now));
+
+	return 0;
+}
+
+/* check and send eventfd notifications to the userspace for all domains
+ */
+static void rapl_check_events(void)
+{
+	struct rapl_event *event, *tmp;
+	struct rapl_domain *rd;
+	int i;
+	u64 cur;
+
+	/* run through per domain, per primitive thresholds */
+	for (i = 0; i < rg_data.nr_domains; i++) {
+		rd = &rapl_domains[i];
+		if (!(rd->state & DOMAIN_STATE_EVENT_SET))
+			continue;
+		list_for_each_entry_safe(event, tmp, &rd->event_list, list) {
+			if (rapl_read_data_raw(rd, &rpi[event->prim],
+						true, &cur)) {
+				pr_err("efd failed dmn:%s prim %d\n", rd->name,
+					event->prim);
+				continue;
+			}
+
+			/* check if we crossed threshold in both direction */
+			if ((event->thresholds.value > event->last_val &&
+					event->thresholds.value < cur) ||
+				(event->thresholds.value < event->last_val &&
+					event->thresholds.value > cur)) {
+				eventfd_signal(event->thresholds.eventfd, 1);
+				event->counter++;
+			}
+			event->last_val = cur;
+		}
+	}
+}
+
+/* For time based data, e.g. average power */
+static void rapl_poll_data(struct work_struct *dummy)
+{
+	rapl_update_data();
+	rapl_check_events();
+	if (!rapl_polling_should_cont()) {
+		polling_started = false;
+		return;
+	}
+	schedule_delayed_work(&rapl_polling_work,
+			round_jiffies_relative(rg_data.polling_freq_hz*HZ));
+}
+
+static int start_periodic_polling(void)
+{
+	if (true == polling_started)
+		goto out;
+	schedule_delayed_work(&rapl_polling_work, 0);
+	polling_started = true;
+
+out:
+	return 0;
+}
+
+static int stop_periodic_polling(void)
+{
+	if (true == polling_started) {
+		cancel_delayed_work_sync(&rapl_polling_work);
+		polling_started = false;
+		pr_debug("stop polling rapl data\n");
+	}
+
+	return 0;
+}
+
+static void intel_rapl_dev_release(struct device *dev)
+{
+	return;
+}
+
+static int rapl_check_energy_cnt(enum rapl_domain_id id)
+{
+	unsigned msr;
+	unsigned long long val1, val2;
+	int retry = 0;
+
+	switch (id) {
+	case RAPL_DOMAIN_PKG:
+		msr = MSR_PKG_ENERGY_STATUS;
+		break;
+	case RAPL_DOMAIN_PP0:
+		msr = MSR_PP0_ENERGY_STATUS;
+		break;
+	case RAPL_DOMAIN_PP1:
+		msr = MSR_PP1_ENERGY_STATUS;
+		break;
+	case RAPL_DOMAIN_DRAM:
+		msr = MSR_DRAM_ENERGY_STATUS;
+		break;
+	default:
+		pr_err("invalid domain id %d\n", id);
+		return -EINVAL;
+	}
+	if (rdmsrl_safe(msr, &val1))
+		return -ENODEV;
+again:
+	/* energy counters roll slowly on some domains, do a few retries */
+	msleep(100);
+	rdmsrl_safe(msr, &val2);
+
+	/* if energy counter does not change, report as bad domain */
+	if ((val1 & ENERGY_STATUS_MASK) == (val2 & ENERGY_STATUS_MASK)) {
+		if (retry++ < 10)
+			goto again;
+		pr_info("domain %s exists but energy ctr %llu:%llu not working, skip\n",
+			rapl_domain_names[id], val1, val2);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static int rapl_detect_domains(void)
+{
+	int i;
+	int ret = 0;
+
+	if (!x86_match_cpu(intel_rapl_ids)) {
+		ret = -ENODEV;
+		goto done;
+	}
+
+	for (i = 0; i < RAPL_DOMAIN_MAX; i++) {
+		if (!rapl_check_energy_cnt(i))
+			rg_data.domain_map |= 1 << i;
+	}
+
+	rg_data.nr_domains = bitmap_weight(&rg_data.domain_map,
+					RAPL_DOMAIN_MAX);
+	if (!rg_data.nr_domains) {
+		pr_err("no valid rapl domains found\n");
+		ret = -ENODEV;
+		goto done;
+	}
+	rg_data.polling_freq_hz = RAPL_POLLING_FREQ_DEFAULT;
+
+	pr_info("Found %d vaild RAPL domains\n", rg_data.nr_domains);
+	rapl_domains = kzalloc(sizeof(struct rapl_domain) * rg_data.nr_domains,
+			GFP_KERNEL);
+	if (NULL == rapl_domains) {
+		pr_err("Failed to allocate memory for rapl domain\n");
+		ret = -ENOMEM;
+		goto done;
+	}
+
+	rapl_init_domains();
+
+done:
+	return ret;
+}
+
+static int __init intel_rapl_init(void)
+{
+	int i, ret;
+
+	ret = rapl_check_unit();
+	if (ret)
+		return ret;
+
+	ret = rapl_detect_domains();
+	if (ret)
+		return ret;
+	/* allocate per domain data */
+	rd_data = kzalloc(sizeof(struct rapl_domain_data) * rg_data.nr_domains,
+			GFP_KERNEL);
+	if (NULL == rd_data) {
+		pr_err("Failed to allocate memory for rapl domain data\n");
+		ret = -ENOMEM;
+		goto exit_free_domains;
+	}
+	for (i = 0; i < rg_data.nr_domains; i++) {
+		rapl_domains[i].rdd = &rd_data[i];
+		INIT_LIST_HEAD(&rapl_domains[i].event_list);
+		spin_lock_init(&rapl_domains[i].event_lock);
+	}
+	intel_rapl_device.dev.release = intel_rapl_dev_release;
+	intel_rapl_device.dev.platform_data = rapl_domains;
+
+	platform_device_register(&intel_rapl_device);
+	/* rapl kset used as base class to abstract common attrs */
+	rapl_kset = kset_create_and_add("rapl_domains", NULL,
+					&intel_rapl_device.dev.kobj);
+	if (!rapl_kset) {
+		ret = -ENOMEM;
+		goto exit_free_rdata;
+	}
+	ret = platform_driver_register(&intel_rapl_driver);
+	if (ret)
+		goto exit_unregister_device;
+
+	return 0;
+
+exit_unregister_device:
+	kset_unregister(rapl_kset);
+	platform_device_unregister(&intel_rapl_device);
+exit_free_rdata:
+	kfree(rd_data);
+exit_free_domains:
+	kfree(rapl_domains);
+
+	return ret;
+}
+
+static void __exit intel_rapl_exit(void)
+{
+	platform_device_unregister(&intel_rapl_device);
+	platform_driver_unregister(&intel_rapl_driver);
+	kfree(rd_data);
+	kfree(rapl_domains);
+	kset_unregister(rapl_kset);
+}
+
+
+module_init(intel_rapl_init);
+module_exit(intel_rapl_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@intel.com>");
+
+MODULE_DESCRIPTION("Driver for Intel RAPL (Running Average Power Limit) interface");
+MODULE_VERSION("0.1");
diff --git a/drivers/platform/x86/intel_rapl.h b/drivers/platform/x86/intel_rapl.h
new file mode 100644
index 0000000..e0a5125
--- /dev/null
+++ b/drivers/platform/x86/intel_rapl.h
@@ -0,0 +1,249 @@
+/*
+ *  Intel RAPL driver
+ *
+ *  intel_rapl.h
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or (at
+ *  your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful, but
+ *  WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License along
+ *  with this program; if not, write to the Free Software Foundation, Inc.,
+ *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ */
+#ifndef INTEL_RAPL_H
+#define INTEL_RAPL_H
+
+#define DRIVER_NAME "intel_rapl"
+
+/* RAPL UNIT BITMASK */
+#define ENERGY_STATUS_MASK      0xffffffff
+
+#define POWER_LIMIT1_MASK       0x7FFF
+#define POWER_LIMIT1_ENABLE     (0x1<<15)
+#define POWER_LIMIT1_CLAMP      (0x1<<16)
+
+#define POWER_LIMIT2_MASK       (0x7FFFULL<<32)
+#define POWER_LIMIT2_ENABLE     (0x1ULL<<47)
+#define POWER_LIMIT2_CLAMP      (0x1ULL<<48)
+#define POWER_PKG_LOCK          (0x1ULL<<63)
+#define POWER_PP_LOCK           (0x1<<31)
+
+#define TIME_WINDOW1_MASK       (0x7F<<17)
+#define TIME_WINDOW2_MASK       (0x7FULL<<49)
+
+#define POWER_UNIT_OFFSET	0
+#define POWER_UNIT_MASK		0x0F
+
+#define ENERGY_UNIT_OFFSET	0x08
+#define ENERGY_UNIT_MASK	0x1F00
+
+#define TIME_UNIT_OFFSET	0x10
+#define TIME_UNIT_MASK		0xF0000
+
+#define POWER_INFO_MAX_MASK     (0x7fffULL<<32)
+#define POWER_INFO_MIN_MASK     (0x7fffULL<<16)
+#define POWER_INFO_MAX_TIME_WIN_MASK     (0x3fULL<<48)
+#define POWER_INFO_THERMAL_SPEC_MASK     0x7fff
+
+#define PERF_STATUS_THROTTLE_TIME_MASK 0xffffffff
+#define PP_POLICY_MASK         0x1F
+/* Non HW constants */
+
+/* Event capable, allow assigning thresholds */
+#define RAPL_PRIMITIVE_EVENT_CAP     (1<<0)
+#define RAPL_PRIMITIVE_DERIVED       (1<<1) /* not from raw data */
+#define RAPL_PRIMITIVE_DUMMY         (1<<2)
+#define RAPL_POLLING_FREQ_DEFAULT    1
+
+/* scale RAPL units to avoid floating point math inside kernel */
+#define POWER_UNIT_SCALE  (1000)
+#define ENERGY_UNIT_SCALE (1000)
+#define TIME_UNIT_SCALE   (1000)
+
+enum unit_type {
+	NA_UNIT, /* no translation */
+	POWER_UNIT,
+	ENERGY_UNIT,
+	TIME_UNIT,
+};
+
+enum rapl_domain_id {
+	RAPL_DOMAIN_PKG,
+	RAPL_DOMAIN_PP0,
+	RAPL_DOMAIN_PP1,
+	RAPL_DOMAIN_DRAM,
+	RAPL_DOMAIN_MAX,
+};
+
+enum rapl_domain_msr_id {
+	RAPL_DOMAIN_MSR_LIMIT,
+	RAPL_DOMAIN_MSR_STATUS,
+	RAPL_DOMAIN_MSR_PERF,
+	RAPL_DOMAIN_MSR_POLICY,
+	RAPL_DOMAIN_MSR_INFO,
+	RAPL_DOMAIN_MSR_MAX,
+};
+
+struct rapl_domain_msr {
+	int	limit;
+	int	status;
+	/* optional msrs below */
+	int	perf;
+	int     policy;
+	int     info; /* power info */
+};
+
+
+#define	DOMAIN_STATE_INACTIVE           (0)
+#define	DOMAIN_STATE_POWER_LIMIT_SET    (1<<1)
+#define	DOMAIN_STATE_EVENT_SET          (1<<2)
+
+struct rapl_domain {
+	char *name;
+	enum rapl_domain_id id;
+	int msrs[RAPL_DOMAIN_MSR_MAX];
+	struct thermal_zone_device *tz_dev;
+	struct thermal_cooling_device *cool_dev;
+	struct completion kobj_unregister;
+	struct kobject kobj;
+	struct rapl_domain_data *rdd;
+	struct list_head event_list;
+	unsigned long attr_map; /* bitmap for per domain features */
+	struct attribute **attrs;
+	spinlock_t event_lock; /* protect event queue */
+	unsigned int state;
+};
+
+/*  global data */
+struct rapl_data {
+	unsigned int nr_domains;
+	unsigned long domain_map; /* bit map of active domains */
+	unsigned int power_unit_divisor;
+	unsigned int energy_unit_divisor;
+	unsigned int time_unit_divisor;
+	unsigned int polling_freq_hz;
+};
+
+enum rapl_primitives {
+	energy,
+	power_limit1,
+	power_limit2,
+	lock,
+
+	pl1_enable,
+	pl1_clamp,
+	pl2_enable,
+	pl2_clamp,
+
+	time_window1,
+	time_window2,
+	thermal_spec_power,
+	max_power,
+
+	min_power,
+	max_window,
+	throttle_time,
+	prio_level,
+
+	/* below are not raw primitive data */
+	power,
+	event_control,
+	domain_name,
+	nr_rapl_primitives,
+};
+
+#define NR_RAW_PRIMITIVES (nr_rapl_primitives - 3)
+/*
+ * RAPL domains have a base set of attrs then on top of that each domain
+ * may have optional attrs. We need to expose only relavent sysfs nodes
+ * for each attr per supported domain. Use ULL bit mask to represent both
+ * standard and optional features.
+ */
+
+/* standard attrs */
+#define RAPL_ATTR_ENERGY_CTR             BIT(energy)
+#define RAPL_ATTR_PL1                    BIT(power_limit1)
+#define RAPL_ATTR_LOCK                   BIT(lock)
+
+#define RAPL_ATTR_PL1_ENABLE             BIT(pl1_enable)
+#define RAPL_ATTR_PL1_CLAMP              BIT(pl1_clamp)
+#define RAPL_ATTR_DOMAIN_NAME            BIT(domain_name)
+#define RAPL_ATTR_EVENT_CONTROL          BIT(event_control)
+#define RAPL_ATTR_POWER                  BIT(power)
+#define RAPL_ATTR_TIME_WINDOW1           BIT(time_window1)
+
+/* optional attrs */
+#define RAPL_ATTR_PL2                    BIT(power_limit2)
+#define RAPL_ATTR_PL2_ENABLE             BIT(pl2_enable)
+#define RAPL_ATTR_PL2_CLAMP              BIT(pl2_clamp)
+#define RAPL_ATTR_TIME_WINDOW2           BIT(time_window2)
+#define RAPL_ATTR_THERMAL_SPEC_POWER     BIT(thermal_spec_power)
+#define RAPL_ATTR_MAX_POWER              BIT(max_power)
+#define RAPL_ATTR_MIN_POWER              BIT(min_power)
+#define RAPL_ATTR_MAX_WINDOW             BIT(max_window)
+#define RAPL_ATTR_THROTTLE_TIME          BIT(throttle_time)
+#define RAPL_ATTR_PRIO_LEVEL             BIT(prio_level)
+
+#define RAPL_STANDARD_ATTRS (RAPL_ATTR_ENERGY_CTR |		\
+				RAPL_ATTR_PL1 |			\
+				RAPL_ATTR_LOCK |		\
+				RAPL_ATTR_PL1_ENABLE |		\
+				RAPL_ATTR_PL1_CLAMP |		\
+				RAPL_ATTR_TIME_WINDOW1 |	\
+				RAPL_ATTR_POWER |		\
+				RAPL_ATTR_DOMAIN_NAME |		\
+				RAPL_ATTR_EVENT_CONTROL)
+
+struct rapl_attr {
+	struct attribute attr;
+	ssize_t (*show) (struct rapl_domain *, char *);
+	ssize_t (*store) (struct rapl_domain *, const char *, size_t count);
+};
+
+struct rapl_threshold {
+	unsigned long value;
+	struct eventfd_ctx *eventfd;
+};
+
+#define MAX_RAPL_THRESHOLDS 3
+
+struct rapl_event {
+	struct list_head list;
+	struct rapl_domain *rd;
+	struct rapl_threshold thresholds;
+	unsigned long counter; /* # of times threshold crossed */
+	unsigned long last_val;
+	enum rapl_primitives prim;
+	poll_table pt;
+	wait_queue_head_t *wqh;
+	wait_queue_t wait;
+	struct work_struct remove;
+};
+
+struct rapl_domain_data {
+	unsigned long primitives[nr_rapl_primitives];
+	struct rapl_event *events[nr_rapl_primitives];
+};
+
+struct rapl_primitive_info {
+	const char *name;
+	u64 mask;
+	int shift;
+	enum rapl_domain_msr_id id;
+	enum unit_type unit;
+	enum rapl_primitives pm_id;
+	u32 flag;
+};
+
+#define PRIMITIVE_INFO_INIT(p, m, s, i, u, f) {#p, m, s, i, u, p, f}
+#endif  /* INTEL_RAPL_H */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
  2013-04-02 21:42   ` Randy Dunlap
  2013-04-02 21:47   ` Randy Dunlap
@ 2013-04-02 22:35   ` Joe Perches
  2013-04-02 23:00   ` Greg KH
  2013-04-03 10:24   ` Paul Bolle
  4 siblings, 0 replies; 23+ messages in thread
From: Joe Perches @ 2013-04-02 22:35 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2013-04-02 at 15:15 -0700, Jacob Pan wrote:
> RAPL(Running Average Power Limit) interface provides platform software
> with the ability to monitor, control, and get notifications on SOC
> power consumptions. Since its first appearance on Sandy Bridge, more
> features have being added to extend its usage. In RAPL, platforms are
> divided into domains for fine grained control. These domains include
> package, DRAM controller, CPU core (Power Plane 0), graphics uncore
> (power plane 1), etc.

trivia only:

> diff --git a/drivers/platform/x86/intel_rapl.c b/drivers/platform/x86/intel_rapl.c
[]
> +static char *rapl_domain_names[] = {
> +	"package",
> +	"power_plane_0",
> +	"power_plane_1",
> +	"dram",
> +};

const

> +static void rapl_init_domains(void)
> +{
> +	int i, j = 0;


I think this below is unpleasant to read.
Maybe add a temporary pointer and get rid of j?

	typeof rapl_domains *t = rapl_domains;

(not typeof, but the actual type)

> +
> +	for (i = 0; i < RAPL_DOMAIN_MAX; i++) {
> +		unsigned int mask = rg_data.domain_map & (1 << i);
> +
> +		switch (mask) {
> +		case 1 << RAPL_DOMAIN_PKG:
> +			rapl_domains[j].name =
> +				rapl_domain_names[RAPL_DOMAIN_PKG];

			t->name = rapl_domain_names[RAPL_DOMAIN_PKG];
			etc...
[]
> +		if (mask)
> +			j++;

			t++;

> +/* in the order of enum rapl_primitives */
> +static struct rapl_primitive_info rpi[] = {

const?

> +	/* name, mask, shift, msr index, unit divisor*/
> +	PRIMITIVE_INFO_INIT(energy, ENERGY_STATUS_MASK, 0,
> +			RAPL_DOMAIN_MSR_STATUS, ENERGY_UNIT,
> +			RAPL_PRIMITIVE_EVENT_CAP),

[]
> +static int rapl_read_data_raw(struct rapl_domain *domain,
> +			struct rapl_primitive_info *rp, bool xlate, u64 *data)
> +{
[]
> +	/* specical-case pkg lock bit since pkg domain uses a different bit */

special-case

[]
> +static ssize_t store_event_control(struct rapl_domain *rd, const char *buf,
> +				size_t size)
> +{
> +	unsigned int efd, cfd, new_threshold;
> +	struct file *efile = NULL;
> +	struct file *cfile = NULL;
> +	int ret = 0;
> +	int prim;
> +	struct rapl_event *ep;
> +	u64 val;
> +
> +	if (sscanf(buf, "%u %u %u", &efd, &cfd, &new_threshold) != 3)
> +		return -EINVAL;
> +
> +	efile = eventfd_fget(efd);
> +	if (IS_ERR(efile)) {
> +		ret = PTR_ERR(efile);
> +		pr_err("failed to get eventfd file %d\n", efd);
> +		goto done;
> +	}
> +	cfile = fget(cfd);
> +	if (!cfile) {
> +		ret = -EBADF;
> +		fput(efile);
> +		goto done;
> +	}
> +	/* check if the cfile belongs to the same rapl domain */
> +	if (strcmp(rd->kobj.sd->s_name,
> +			cfile->f_dentry->d_parent->d_name.name)) {
> +		pr_debug("cfile does not belong to domain %s\n",
> +			rd->kobj.sd->s_name);
> +		ret = -EINVAL;
> +		goto exit_cleanup_fds;
> +	}
> +	prim = primitive_name_to_entry(
> +		(const char *)cfile->f_dentry->d_name.name);
> +	if (prim < 0) {
> +		pr_err("failed lookup primitive id for control file %s\n",
> +			cfile->f_dentry->d_name.name);
> +		ret = -EINVAL;
> +		goto exit_cleanup_fds;
> +	}
> +	if (!(rpi[prim].flag & RAPL_PRIMITIVE_EVENT_CAP)) {
> +		pr_info("Invalid control file %d\n", prim);
> +		ret = -EINVAL;
> +		goto exit_cleanup_fds;
> +	}

Perhaps all these pr_<levels> should be pr_err as they
all lead to a failure path.


[]

> +static int stop_periodic_polling(void)
> +{
> +	if (true == polling_started) {

You don't need a comparison,
	if (polling_started) {
is perhaps better.

[]
> +static int rapl_detect_domains(void)
> +{
[]
> +	pr_info("Found %d vaild RAPL domains\n", rg_data.nr_domains);

valid

> +	rapl_domains = kzalloc(sizeof(struct rapl_domain) * rg_data.nr_domains,
> +			GFP_KERNEL);

kcalloc

> +	if (NULL == rapl_domains) {
> +		pr_err("Failed to allocate memory for rapl domain\n");

You don't need OOM messages as the memory
subsystem already does a dump_stack on OOM.

[]

> +static int __init intel_rapl_init(void)
> +{

> +	rd_data = kzalloc(sizeof(struct rapl_domain_data) * rg_data.nr_domains,
> +			GFP_KERNEL);

kcalloc again.

> +	if (NULL == rd_data) {
> +		pr_err("Failed to allocate memory for rapl domain data\n");

OOM too.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
                     ` (2 preceding siblings ...)
  2013-04-02 22:35   ` Joe Perches
@ 2013-04-02 23:00   ` Greg KH
  2013-04-02 23:03     ` Greg KH
                       ` (2 more replies)
  2013-04-03 10:24   ` Paul Bolle
  4 siblings, 3 replies; 23+ messages in thread
From: Greg KH @ 2013-04-02 23:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 03:15:36PM -0700, Jacob Pan wrote:
> RAPL(Running Average Power Limit) interface provides platform software
> with the ability to monitor, control, and get notifications on SOC
> power consumptions. Since its first appearance on Sandy Bridge, more
> features have being added to extend its usage. In RAPL, platforms are
> divided into domains for fine grained control. These domains include
> package, DRAM controller, CPU core (Power Plane 0), graphics uncore
> (power plane 1), etc.
> 
> The purpose of this driver is to expose RAPL for userspace
> consumption. Overall, RAPL fits in the generic thermal layer in
> that platform level power capping and monitoring are mainly used for
> thermal management and thermal layer provides the abstracted interface
> needed to have portable applications.
> 
> Specifically, userspace is presented with per domain cooling device
> with sysfs links to its kobject. Although RAPL domain provides many
> parameters for fine tuning, long term power limit is exposed as the
> single knob via cooling device state. Whereas the rest of the
> parameters are still accessible via the linked kobject. This simplifies
> the interface for both simple and advanced use cases.
> 
> Eventfd is used to provide notifications to the userspace. At per domain
> level, use can choose any event capable parameters to register for
> threshold crossing notifications. This is shamelessly "borrowed" from
> cgroup with some trimming/fitting.
> 
> Zhang, Rui's initial RAPL driver was used as a reference and starting
> point. Many thanks.
> https://lkml.org/lkml/2011/5/26/93
> 
> Unlike the patch above, which is mainly for monitoring, this driver
> focus on the control and usability by user applications.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/platform/x86/Kconfig      |    8 +
>  drivers/platform/x86/Makefile     |    1 +
>  drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
>  drivers/platform/x86/intel_rapl.h |  249 +++++++
>  4 files changed, 1581 insertions(+)
>  create mode 100644 drivers/platform/x86/intel_rapl.c
>  create mode 100644 drivers/platform/x86/intel_rapl.h

You are creating new sysfs files, yet you forgot to document them in
Documentation/ABI, please do so as part of this patch.

> --- /dev/null
> +++ b/drivers/platform/x86/intel_rapl.c
> @@ -0,0 +1,1323 @@
> +/*
> + *  intel_rapl.c - Intel Running Average Power Limit Driver for MSR based
> + *                 RAPL interface
> + *
> + *  Copyright (c) 2013, Intel Corporation.
> + * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or (at
> + *  your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful, but
> + *  WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + *  General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License along
> + *  with this program; if not, write to the Free Software Foundation, Inc.,
> + *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

You really want to track the FSF's address for the next 40 years and
keep your file up to date?  I didn't think so...

> +#include "intel_rapl.h"
> +#include "../../../fs/sysfs/sysfs.h"

WTF?

Oh, that's a sure sign you are not doing something properly, if you
think it's ok to muck around with the internals of sysfs.

There's a reason that file is "private", why do you think it's ok to use
it directly?  Did you just think that I somehow "forgot" to put it in
the proper include directory?

Kids these days...

Surely one of the internal Intel code reviews caught this when you ran
it through that process?

I'm not even going to review the rest of this file, please fix that up
first, before sending it out again.  And, when you do so, please cc: me.

Oh, I couldn't help myself:

> +static void rapl_domain_kobj_release(struct kobject *kobj)
> +{
> +	struct rapl_domain *rd = kobj_to_rapl_domain(kobj);
> +
> +	complete(&rd->kobj_unregister);
> +}

No, that's not how you do a release function (hint, you can lock up your
driver from userspace by holding your file open...)  Free the memory
here, where it is supposed to be free.

And, one final complaint, never use "raw" kobjects, for loads of good
reasons, not the least being you just prevented userspace from seeing
what is happening with your devices.  Use the driver model, that's what
it is there for, if you need "sub children", or subdirectories.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:00   ` Greg KH
@ 2013-04-02 23:03     ` Greg KH
  2013-04-02 23:33     ` Jacob Pan
  2013-04-02 23:53     ` Jacob Pan
  2 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2013-04-02 23:03 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 04:00:42PM -0700, Greg KH wrote:
> I'm not even going to review the rest of this file, please fix that up
> first, before sending it out again.  And, when you do so, please cc: me.

It's like a train wreck, you just can't tear away your eyes...

You race with userspace in the creation of your sysfs files.  Your
device is announced to userspace and then, later on, you create the
sysfs files.  This means that tools like libudev never sees your files,
so the attributes aren't read properly by loads of userspace programs
that you want to have read them.

Please fix that before resending this as well.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 21:47   ` Randy Dunlap
@ 2013-04-02 23:04     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 23+ messages in thread
From: Greg Kroah-Hartman @ 2013-04-02 23:04 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Jacob Pan, LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 02:47:35PM -0700, Randy Dunlap wrote:
> On 04/02/13 15:15, Jacob Pan wrote:
> > diff --git a/drivers/platform/x86/intel_rapl.c b/drivers/platform/x86/intel_rapl.c
> > new file mode 100644
> > index 0000000..56ee928
> > --- /dev/null
> > +++ b/drivers/platform/x86/intel_rapl.c
> > @@ -0,0 +1,1323 @@
> > +/*
> > + *  intel_rapl.c - Intel Running Average Power Limit Driver for MSR based
> > + *                 RAPL interface
> > + *
> > + *  Copyright (c) 2013, Intel Corporation.
> > + * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +#include <linux/sysfs.h>
> > +
> 
> > +#include "../../../fs/sysfs/sysfs.h"
> 
> 
> What does this driver need from ^^^^^^ that file, which says:
> 
>  * fs/sysfs/sysfs.h - sysfs internal header file
> 
> and should that be moved to include/linux/sysfs.h ?

See my other complain about this already, I beat you to it :)

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:00   ` Greg KH
  2013-04-02 23:03     ` Greg KH
@ 2013-04-02 23:33     ` Jacob Pan
  2013-04-02 23:48       ` Greg KH
  2013-04-02 23:53     ` Jacob Pan
  2 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-02 23:33 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2 Apr 2013 16:00:42 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> > +#include "intel_rapl.h"
> > +#include "../../../fs/sysfs/sysfs.h"  
> 
> WTF?
> 
> Oh, that's a sure sign you are not doing something properly, if you
> think it's ok to muck around with the internals of sysfs.
> 
> There's a reason that file is "private", why do you think it's ok to
> use it directly?  Did you just think that I somehow "forgot" to put
> it in the proper include directory?
I did feel unsure about this but i saw some precedence in the kernel.
Anyway, I needed a way to validate a userspace file passed to rapl
driver belong to the same sysfs directory. I will look for alternative
ways.

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:33     ` Jacob Pan
@ 2013-04-02 23:48       ` Greg KH
  2013-04-03  0:17         ` Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-04-02 23:48 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 04:33:57PM -0700, Jacob Pan wrote:
> On Tue, 2 Apr 2013 16:00:42 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > > +#include "intel_rapl.h"
> > > +#include "../../../fs/sysfs/sysfs.h"  
> > 
> > WTF?
> > 
> > Oh, that's a sure sign you are not doing something properly, if you
> > think it's ok to muck around with the internals of sysfs.
> > 
> > There's a reason that file is "private", why do you think it's ok to
> > use it directly?  Did you just think that I somehow "forgot" to put
> > it in the proper include directory?
> I did feel unsure about this but i saw some precedence in the kernel.

Someone else is doing this with the sysfs api?  I don't see any other
code in Linus's tree doing this at the moment, where did you see this?
Let me know and I'll fix it up right away.

> Anyway, I needed a way to validate a userspace file passed to rapl
> driver belong to the same sysfs directory. I will look for alternative
> ways.

What do you mean by this?  What exactly are you trying to do?  No normal
driver code should _ever_ call sysfs functions directly, nor should they
ever care about sysfs internals.

And, odds are, you didn't test your code as a module, right, as any
internal sysfs function that you could get from this .h file, wouldn't
be exported for a module to use, unless I missed one somewhere?

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:00   ` Greg KH
  2013-04-02 23:03     ` Greg KH
  2013-04-02 23:33     ` Jacob Pan
@ 2013-04-02 23:53     ` Jacob Pan
  2013-04-03  0:13       ` Greg KH
  2 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-02 23:53 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2 Apr 2013 16:00:42 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> And, one final complaint, never use "raw" kobjects, for loads of good
> reasons, not the least being you just prevented userspace from seeing
> what is happening with your devices.  Use the driver model, that's
> what it is there for, if you need "sub children", or subdirectories.
I chose to use to kobjects for the reason that userspace can see the
device linking more clearly.
Let me try to paraphrase, I have two options:
1. if I use the platform device model instead of raw kobjects, I would
have one platform device for each rapl domain. Then link individual
platform device with the generic thermal layer sysfs.

2. In the current patch, I have one platform driver, then expose per
domain kobject that can be linked to the generic thermal layer. Common
attributes of all domains are grouped under the kset. 

I did consider both options. I thought using #2 option is better since
it allow user to discover the topologies easier by following the
sysfs link. If i use use #1, it would be hard to expose the common
attributes and more code too. Perhpas I misread
Documentation/kobjects.txt which i thought kobject/kset are perfect for
presenting situation like this.

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:53     ` Jacob Pan
@ 2013-04-03  0:13       ` Greg KH
  2013-04-03  4:48         ` Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-04-03  0:13 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 04:53:40PM -0700, Jacob Pan wrote:
> On Tue, 2 Apr 2013 16:00:42 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > And, one final complaint, never use "raw" kobjects, for loads of good
> > reasons, not the least being you just prevented userspace from seeing
> > what is happening with your devices.  Use the driver model, that's
> > what it is there for, if you need "sub children", or subdirectories.
> I chose to use to kobjects for the reason that userspace can see the
> device linking more clearly.

Then use a 'struct device'.

> Let me try to paraphrase, I have two options:
> 1. if I use the platform device model instead of raw kobjects, I would
> have one platform device for each rapl domain. Then link individual
> platform device with the generic thermal layer sysfs.

I don't understand.

> 2. In the current patch, I have one platform driver, then expose per
> domain kobject that can be linked to the generic thermal layer. Common
> attributes of all domains are grouped under the kset. 

Then use a 'struct device' that is attached to that kset, no need to use
a platform device, as that's not what is needed here, right?

> I did consider both options. I thought using #2 option is better since
> it allow user to discover the topologies easier by following the
> sysfs link. If i use use #1, it would be hard to expose the common
> attributes and more code too. Perhpas I misread
> Documentation/kobjects.txt which i thought kobject/kset are perfect for
> presenting situation like this.

No, kobjects are for things not using the device tree, like filesystems
or other stuff.  You are in the device tree, so use a 'struct device'.
If you use a "raw" kobject, you do not notify userspace what is going
on, and you loose out on a bunch of other stuff that the driver model
provides you.

But do you even need to use anything at all?

Let's step back and start over, what exactly are you trying to tell
userspace?  What data do you have that you need to express to it?  How
do you want userspace to see/use it?

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 23:48       ` Greg KH
@ 2013-04-03  0:17         ` Jacob Pan
  2013-04-03 16:30           ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-03  0:17 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2 Apr 2013 16:48:05 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Apr 02, 2013 at 04:33:57PM -0700, Jacob Pan wrote:
> > On Tue, 2 Apr 2013 16:00:42 -0700
> > Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> > > > +#include "intel_rapl.h"
> > > > +#include "../../../fs/sysfs/sysfs.h"  
> > > 
> > > WTF?
> > > 
> > > Oh, that's a sure sign you are not doing something properly, if
> > > you think it's ok to muck around with the internals of sysfs.
> > > 
> > > There's a reason that file is "private", why do you think it's ok
> > > to use it directly?  Did you just think that I somehow "forgot"
> > > to put it in the proper include directory?
> > I did feel unsure about this but i saw some precedence in the
> > kernel.
> 
> Someone else is doing this with the sysfs api?  I don't see any other
> code in Linus's tree doing this at the moment, where did you see this?
> Let me know and I'll fix it up right away.
> 
no, i did not mean sysfs api. I mean include internal header files via
#include ../../ 
e.g.in drivers/usb/image/microtek.c

#include "../../scsi/scsi.h"
#include <scsi/scsi_host.h>

> > Anyway, I needed a way to validate a userspace file passed to rapl
> > driver belong to the same sysfs directory. I will look for
> > alternative ways.
> 
> What do you mean by this?  What exactly are you trying to do?  No
> normal driver code should _ever_ call sysfs functions directly, nor
> should they ever care about sysfs internals.
> 
i did not call sysfs internal calls, just need to use 
struct sysfs_dirent {}

to do the following sanity check against user passed event control file,
it is still not a 100% strong check. 
	/* check if the cfile belongs to the same rapl domain */
	if (strcmp(rd->kobj.sd->s_name,
			cfile->f_dentry->d_parent->d_name.name)) {
		pr_debug("cfile does not belong to domain %s\n",
			rd->kobj.sd->s_name);
		ret = -EINVAL;
		goto exit_cleanup_fds;
	}

> And, odds are, you didn't test your code as a module, right, as any
> internal sysfs function that you could get from this .h file, wouldn't
> be exported for a module to use, unless I missed one somewhere?
> 
I did run the driver as module since i didn't use sysfs internal
functions, just the struct. I may be hitting a corner case here, but
for drivers who need to discover sysfs hierarchy would it be useful to
expose some info in struct sysfs_dirent{}?

> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe
> platform-driver-x86" in the body of a message to
> majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

[Jacob Pan]

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03  0:13       ` Greg KH
@ 2013-04-03  4:48         ` Jacob Pan
  2013-04-03 16:35           ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-03  4:48 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2 Apr 2013 17:13:00 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Apr 02, 2013 at 04:53:40PM -0700, Jacob Pan wrote:
> > On Tue, 2 Apr 2013 16:00:42 -0700
> > Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> > > And, one final complaint, never use "raw" kobjects, for loads of
> > > good reasons, not the least being you just prevented userspace
> > > from seeing what is happening with your devices.  Use the driver
> > > model, that's what it is there for, if you need "sub children",
> > > or subdirectories.
> > I chose to use to kobjects for the reason that userspace can see the
> > device linking more clearly.
> 
> Then use a 'struct device'.
> 
> > Let me try to paraphrase, I have two options:
> > 1. if I use the platform device model instead of raw kobjects, I
> > would have one platform device for each rapl domain. Then link
> > individual platform device with the generic thermal layer sysfs.
> 
> I don't understand.
> 
> > 2. In the current patch, I have one platform driver, then expose per
> > domain kobject that can be linked to the generic thermal layer.
> > Common attributes of all domains are grouped under the kset. 
> 
> Then use a 'struct device' that is attached to that kset, no need to
> use a platform device, as that's not what is needed here, right?
> 
> > I did consider both options. I thought using #2 option is better
> > since it allow user to discover the topologies easier by following
> > the sysfs link. If i use use #1, it would be hard to expose the
> > common attributes and more code too. Perhpas I misread
> > Documentation/kobjects.txt which i thought kobject/kset are perfect
> > for presenting situation like this.
> 
> No, kobjects are for things not using the device tree, like
> filesystems or other stuff.  You are in the device tree, so use a
> 'struct device'. If you use a "raw" kobject, you do not notify
> userspace what is going on, and you loose out on a bunch of other
> stuff that the driver model provides you.
> 
> But do you even need to use anything at all?
> 
> Let's step back and start over, what exactly are you trying to tell
> userspace?  What data do you have that you need to express to it?  How
> do you want userspace to see/use it?
> 
> greg k-h

It is a good idea to step back and let me explain what I wanted to
do here for userspace.

I have two kinds of applications that might use this driver.
1. simple use case where user sets a power limit for a RAPL domain.
e.g. set graphics unit power limit to 7w
2. advanced use case where use can do fine tuning on top of simple
power limit,e.g. the dynamic response parameters of power control
logic, event notifications, etc.

For #1, this driver register with the abstract generic thermal layer
(/sys/class/thermal) and presents itself as a set of cooling devices
with a single knob per domain for power limits.
root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 > cur_state 

For #2, to give userspace complete control of the RAPL interface, which
is not generic, I put them under the device private sysfs area.
root@chromoly:/sys/class/thermal/cooling_device15/device# echo 1000 > time_window1 

As you mentioned about using device tree vs. fs, and how kobject are
used for fs. I do have the need to go between a generic thermal sysfs
and the true device tree. This is the reason why I used kobjects and
link them between device tree and its thermal sysfs representation.

e.g. a RAPL package cooling device linked with its platform device
kobj. (device is linked with rapl_domains/package, the line is too long)

root@chromoly:/sys/class/thermal# ls -l cooling_device15/
total 0
-rw-r--r-- 1 root root 4096 Apr  2 15:03 cur_state
lrwxrwxrwx 1 root root    0 Apr  2 21:28 device
-> ../../../platform/intel_rapl/rapl_domains/package
-r--r--r-- 1 root root 4096 Apr  2 15:03 max_state
drwxr-xr-x 2 root root    0 Apr  2 21:28 power
lrwxrwxrwx 1 root root    0 Apr  2 15:03 subsystem
-> ../../../../class/thermal
-r--r--r-- 1 root root 4096 Apr  2 15:03 type
-rw-r--r-- 1 root root 4096 Apr  2 15:03 uevent

For userspace which is not satisfied with the simple use case of a
single knob for setting power limit, it can follow the link to find the
device tree entry. Then get access to the complete knobs, including
event notifications.

I wanted to use generic thermal fs in that it gives a simple
representation of the most usable feature of RAPL. Basically, RAPL
domains can be enumerated by thermal control apps just as ACPI fans or
processor cooling.

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
                     ` (3 preceding siblings ...)
  2013-04-02 23:00   ` Greg KH
@ 2013-04-03 10:24   ` Paul Bolle
  4 siblings, 0 replies; 23+ messages in thread
From: Paul Bolle @ 2013-04-03 10:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, 2013-04-02 at 15:15 -0700, Jacob Pan wrote:
> --- a/drivers/platform/x86/Kconfig
> +++ b/drivers/platform/x86/Kconfig
> @@ -781,4 +781,12 @@ config APPLE_GMUX
>  	  graphics as well as the backlight. Currently only backlight
>  	  control is supported by the driver.
>  
> +config INTEL_RAPL
> +	tristate "Intel RAPL Support"
> +	depends on X86 && THERMAL

drivers/platform/x86/Kconfig is sourced from drivers/platform/Kconfig.
The relevant line is wrapped in an "if X86" and "endif" pair. So X86 is
superfluous here.
"
> +	default y
> +	---help---
> +	  RAPL, AKA, Running Average Power Limit provides mechanisms to enforce
> +	  and monitor per domain power consumption limits of supported Intel CPUs.
> +
>  endif # X86_PLATFORM_DEVICES


Paul Bolle


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03  0:17         ` Jacob Pan
@ 2013-04-03 16:30           ` Greg KH
  2013-04-03 18:03             ` Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-04-03 16:30 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 05:17:14PM -0700, Jacob Pan wrote:
> On Tue, 2 Apr 2013 16:48:05 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > On Tue, Apr 02, 2013 at 04:33:57PM -0700, Jacob Pan wrote:
> > > On Tue, 2 Apr 2013 16:00:42 -0700
> > > Greg KH <gregkh@linuxfoundation.org> wrote:
> > > 
> > > > > +#include "intel_rapl.h"
> > > > > +#include "../../../fs/sysfs/sysfs.h"  
> > > > 
> > > > WTF?
> > > > 
> > > > Oh, that's a sure sign you are not doing something properly, if
> > > > you think it's ok to muck around with the internals of sysfs.
> > > > 
> > > > There's a reason that file is "private", why do you think it's ok
> > > > to use it directly?  Did you just think that I somehow "forgot"
> > > > to put it in the proper include directory?
> > > I did feel unsure about this but i saw some precedence in the
> > > kernel.
> > 
> > Someone else is doing this with the sysfs api?  I don't see any other
> > code in Linus's tree doing this at the moment, where did you see this?
> > Let me know and I'll fix it up right away.
> > 
> no, i did not mean sysfs api. I mean include internal header files via
> #include ../../ 
> e.g.in drivers/usb/image/microtek.c
> 
> #include "../../scsi/scsi.h"
> #include <scsi/scsi_host.h>

That is because this is a scsi host driver.  Your code is not part of
sysfs itself.

> > > Anyway, I needed a way to validate a userspace file passed to rapl
> > > driver belong to the same sysfs directory. I will look for
> > > alternative ways.
> > 
> > What do you mean by this?  What exactly are you trying to do?  No
> > normal driver code should _ever_ call sysfs functions directly, nor
> > should they ever care about sysfs internals.
> > 
> i did not call sysfs internal calls, just need to use 
> struct sysfs_dirent {}
> 
> to do the following sanity check against user passed event control file,
> it is still not a 100% strong check. 
> 	/* check if the cfile belongs to the same rapl domain */
> 	if (strcmp(rd->kobj.sd->s_name,
> 			cfile->f_dentry->d_parent->d_name.name)) {
> 		pr_debug("cfile does not belong to domain %s\n",
> 			rd->kobj.sd->s_name);
> 		ret = -EINVAL;
> 		goto exit_cleanup_fds;
> 	}

This made it through a code review at Intel?  Seriously?  Come on,
there's just so much wrong here, I don't know where to begin.

Hint, if you find yourself caring about the internals of sysfs in a
device driver, you are doing something so wrong it's not funny.  Do you
see _any_ other driver doing anything like this?  What makes this driver
so special that it can do unexpected, and totally different things with
sysfs?

> > And, odds are, you didn't test your code as a module, right, as any
> > internal sysfs function that you could get from this .h file, wouldn't
> > be exported for a module to use, unless I missed one somewhere?
> > 
> I did run the driver as module since i didn't use sysfs internal
> functions, just the struct. I may be hitting a corner case here, but
> for drivers who need to discover sysfs hierarchy would it be useful to
> expose some info in struct sysfs_dirent{}?

No, not at all, why would a driver ever care about that?  Somehow we
have gotten by for the past 10+ years without needing it, why is your
driver so different than the thousands of other Linux drivers?

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03  4:48         ` Jacob Pan
@ 2013-04-03 16:35           ` Greg KH
  2013-04-03 17:35             ` Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-04-03 16:35 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Tue, Apr 02, 2013 at 09:48:18PM -0700, Jacob Pan wrote:
> > Let's step back and start over, what exactly are you trying to tell
> > userspace?  What data do you have that you need to express to it?  How
> > do you want userspace to see/use it?
> 
> It is a good idea to step back and let me explain what I wanted to
> do here for userspace.
> 
> I have two kinds of applications that might use this driver.
> 1. simple use case where user sets a power limit for a RAPL domain.
> e.g. set graphics unit power limit to 7w
> 2. advanced use case where use can do fine tuning on top of simple
> power limit,e.g. the dynamic response parameters of power control
> logic, event notifications, etc.
> 
> For #1, this driver register with the abstract generic thermal layer
> (/sys/class/thermal) and presents itself as a set of cooling devices
> with a single knob per domain for power limits.
> root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 > cur_state 

Great, how about submitting that functionality as patch 1 of your
series?  That seems like a very "normal" thermal driver, right?

> For #2, to give userspace complete control of the RAPL interface, which
> is not generic, I put them under the device private sysfs area.
> root@chromoly:/sys/class/thermal/cooling_device15/device# echo 1000 > time_window1 

I totally fail to understand the difference.  What do you want to show
to userspace that can't be expressed through the thermal interface
today?  Perhaps the thermal interface could be expanded to provide more
functionality that you need?  Why create a one-off API that will never
be used again and require userspace programs to be written just to
handle this one type of device?

> As you mentioned about using device tree vs. fs, and how kobject are
> used for fs. I do have the need to go between a generic thermal sysfs
> and the true device tree. This is the reason why I used kobjects and
> link them between device tree and its thermal sysfs representation.

I don't understand your leap to using kobjects.

> e.g. a RAPL package cooling device linked with its platform device
> kobj. (device is linked with rapl_domains/package, the line is too long)
> 
> root@chromoly:/sys/class/thermal# ls -l cooling_device15/
> total 0
> -rw-r--r-- 1 root root 4096 Apr  2 15:03 cur_state
> lrwxrwxrwx 1 root root    0 Apr  2 21:28 device
> -> ../../../platform/intel_rapl/rapl_domains/package
> -r--r--r-- 1 root root 4096 Apr  2 15:03 max_state
> drwxr-xr-x 2 root root    0 Apr  2 21:28 power
> lrwxrwxrwx 1 root root    0 Apr  2 15:03 subsystem
> -> ../../../../class/thermal
> -r--r--r-- 1 root root 4096 Apr  2 15:03 type
> -rw-r--r-- 1 root root 4096 Apr  2 15:03 uevent

I still don't understand.  What are you adding here, the device symlink?
Or something else?

> For userspace which is not satisfied with the simple use case of a
> single knob for setting power limit, it can follow the link to find the
> device tree entry. Then get access to the complete knobs, including
> event notifications.

And what is in that device directory?  What is rapl_domains?  Why isn't
that a normal 'struct device'?

Still confused.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03 16:35           ` Greg KH
@ 2013-04-03 17:35             ` Jacob Pan
  2013-04-05 20:23               ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-03 17:35 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Wed, 3 Apr 2013 09:35:09 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Apr 02, 2013 at 09:48:18PM -0700, Jacob Pan wrote:
> > > Let's step back and start over, what exactly are you trying to
> > > tell userspace?  What data do you have that you need to express
> > > to it?  How do you want userspace to see/use it?
> > 
> > It is a good idea to step back and let me explain what I wanted to
> > do here for userspace.
> > 
> > I have two kinds of applications that might use this driver.
> > 1. simple use case where user sets a power limit for a RAPL domain.
> > e.g. set graphics unit power limit to 7w
> > 2. advanced use case where use can do fine tuning on top of simple
> > power limit,e.g. the dynamic response parameters of power control
> > logic, event notifications, etc.
> > 
> > For #1, this driver register with the abstract generic thermal layer
> > (/sys/class/thermal) and presents itself as a set of cooling devices
> > with a single knob per domain for power limits.
> > root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 >
> > cur_state 
> 
> Great, how about submitting that functionality as patch 1 of your
> series?  That seems like a very "normal" thermal driver, right?
> 
yes, that would be a normal thermal cooling device driver. I will do
that first. Thanks for the suggestion.
> > For #2, to give userspace complete control of the RAPL interface,
> > which is not generic, I put them under the device private sysfs
> > area. root@chromoly:/sys/class/thermal/cooling_device15/device#
> > echo 1000 > time_window1 
> 
> I totally fail to understand the difference.  What do you want to show
> to userspace that can't be expressed through the thermal interface
> today?
The difference is one single knob (long term power limit) in the thermal
interface vs multiple fine grained control and data in the complete RAPL
interface.

Here is what a complete RAPL interface for package domain looks like.
root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:22396031
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:0
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:7841
power_limit1:25000
power_limit2:31250
thermal_spec_power:17000
throttle_time: 
time_window1:28000
time_window2:0


>  Perhaps the thermal interface could be expanded to provide
> more functionality that you need?
yes, some of them such as limits. But not all the data in the list
above are suitable for thermal interface. That is why I am trying to
balance between abstracted generic data and RAPL specific data while
still allow linking between the two.

The way I envisioned how a thermal/power management app would use is:
1. go through generic thermal layer sysfs and find available RAPL
domains
2. if the app wants to do more fine grained control, it follows the
device symlink to locate the RAPL domain specific sysfs area.

>  Why create a one-off API that will
> never be used again and require userspace programs to be written just
> to handle this one type of device?
> 
why is that a one-off API? RAPL interface is maintained identical across
Intel CPUs after Sandy Bridges. I agree with you that it is still one
type of device with some of its data unique. Should i create a RAPL
class device?

> > As you mentioned about using device tree vs. fs, and how kobject are
> > used for fs. I do have the need to go between a generic thermal
> > sysfs and the true device tree. This is the reason why I used
> > kobjects and link them between device tree and its thermal sysfs
> > representation.
> 
> I don't understand your leap to using kobjects.
> 
I use kobjects mainly for its symlink to allow userspace locate the
'true' device behind generic thermal layers' cooling device.


> > e.g. a RAPL package cooling device linked with its platform device
> > kobj. (device is linked with rapl_domains/package, the line is too
> > long)
> > 
> > root@chromoly:/sys/class/thermal# ls -l cooling_device15/
> > total 0
> > -rw-r--r-- 1 root root 4096 Apr  2 15:03 cur_state
> > lrwxrwxrwx 1 root root    0 Apr  2 21:28 device
> > -> ../../../platform/intel_rapl/rapl_domains/package
> > -r--r--r-- 1 root root 4096 Apr  2 15:03 max_state
> > drwxr-xr-x 2 root root    0 Apr  2 21:28 power
> > lrwxrwxrwx 1 root root    0 Apr  2 15:03 subsystem
> > -> ../../../../class/thermal
> > -r--r--r-- 1 root root 4096 Apr  2 15:03 type
> > -rw-r--r-- 1 root root 4096 Apr  2 15:03 uevent
> 
> I still don't understand.  What are you adding here, the device
> symlink? Or something else?
> 
> > For userspace which is not satisfied with the simple use case of a
> > single knob for setting power limit, it can follow the link to find
> > the device tree entry. Then get access to the complete knobs,
> > including event notifications.
> 
> And what is in that device directory? 
 
the device directory contains the complete RAPL interface
representation. paste the example of package domain again.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:22396031
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:0
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:7841
power_limit1:25000
power_limit2:31250
thermal_spec_power:17000
grep: throttle_time: Input/output error
time_window1:28000
time_window2:0

> What is rapl_domains?  Why
> isn't that a normal 'struct device'?
>
RAPL domains can be viewed as sub devices of RAPL interface. On a given
platform they can be a complete or partial list of 
package, power plane 0 (processor core), power plane 1 (graphics),
dram controller, pch, etc.

Yes, I can create a RAPL class and create_device for each domain based
on 'struct device'. I can still use the kobj symlink to link to generic
thermal layer. I just thought it is an overkill as compared to
simply using kobjects and kset.

I don't understand the benefit of using 'struct device' in this
case. RAPL interface as a whole has its device, I am not creating
standalone kobjects.


> Still confused.
> 
> greg k-h

[Jacob Pan]

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03 16:30           ` Greg KH
@ 2013-04-03 18:03             ` Jacob Pan
  2013-04-05 20:24               ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-03 18:03 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Wed, 3 Apr 2013 09:30:52 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Apr 02, 2013 at 05:17:14PM -0700, Jacob Pan wrote:
> > On Tue, 2 Apr 2013 16:48:05 -0700
> > Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> > > On Tue, Apr 02, 2013 at 04:33:57PM -0700, Jacob Pan wrote:
> > > > On Tue, 2 Apr 2013 16:00:42 -0700
> > > > Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > 
> > > > > > +#include "intel_rapl.h"
> > > > > > +#include "../../../fs/sysfs/sysfs.h"  
> > > > > 
> > > > > WTF?
> > > > > 
> > > > > Oh, that's a sure sign you are not doing something properly,
> > > > > if you think it's ok to muck around with the internals of
> > > > > sysfs.
> > > > > 
> > > > > There's a reason that file is "private", why do you think
> > > > > it's ok to use it directly?  Did you just think that I
> > > > > somehow "forgot" to put it in the proper include directory?
> > > > I did feel unsure about this but i saw some precedence in the
> > > > kernel.
> > > 
> > > Someone else is doing this with the sysfs api?  I don't see any
> > > other code in Linus's tree doing this at the moment, where did
> > > you see this? Let me know and I'll fix it up right away.
> > > 
> > no, i did not mean sysfs api. I mean include internal header files
> > via #include ../../ 
> > e.g.in drivers/usb/image/microtek.c
> > 
> > #include "../../scsi/scsi.h"
> > #include <scsi/scsi_host.h>
> 
> That is because this is a scsi host driver.  Your code is not part of
> sysfs itself.
> 
> > > > Anyway, I needed a way to validate a userspace file passed to
> > > > rapl driver belong to the same sysfs directory. I will look for
> > > > alternative ways.
> > > 
> > > What do you mean by this?  What exactly are you trying to do?  No
> > > normal driver code should _ever_ call sysfs functions directly,
> > > nor should they ever care about sysfs internals.
> > > 
> > i did not call sysfs internal calls, just need to use 
> > struct sysfs_dirent {}
> > 
> > to do the following sanity check against user passed event control
> > file, it is still not a 100% strong check. 
> > 	/* check if the cfile belongs to the same rapl domain */
> > 	if (strcmp(rd->kobj.sd->s_name,
> > 			cfile->f_dentry->d_parent->d_name.name)) {
> > 		pr_debug("cfile does not belong to domain %s\n",
> > 			rd->kobj.sd->s_name);
> > 		ret = -EINVAL;
> > 		goto exit_cleanup_fds;
> > 	}
> 
> This made it through a code review at Intel?  Seriously?  Come on,
> there's just so much wrong here, I don't know where to begin.
> 
> Hint, if you find yourself caring about the internals of sysfs in a
> device driver, you are doing something so wrong it's not funny.  Do
> you see _any_ other driver doing anything like this?  What makes this
> driver so special that it can do unexpected, and totally different
> things with sysfs?
> 
I admit that my knowledge in this area are limited. I appreciate your
help and straightforward comments.

Perhaps the reason is that not many drivers use eventfd and its way to
arm event thresholds. The userspace passed an eventfd, a file
descriptor, and a threshold value to the RAPL driver. In order to
authenticate the control file descriptor is a valid, I need to check
1. the control file is capable of generating events, e.g. it can be an
energy counter but not a constant power_limit1
2. the control file belongs to the same RAPL domain since the name are
reused in all RAPL domains. e.g. all domains have energy counter and
events are per domain.

I used the sysfs directory check for #2. It is just an extra check. But
I think can drop that check and user pick events based on its name
string. The fact that user writes to the per domain control file
implies the domain specificness of the event.

In similar eventfd usage in cgroup, it has its own fs so i does the
check based on file ops of the control files.

> > > And, odds are, you didn't test your code as a module, right, as
> > > any internal sysfs function that you could get from this .h file,
> > > wouldn't be exported for a module to use, unless I missed one
> > > somewhere?
> > > 
> > I did run the driver as module since i didn't use sysfs internal
> > functions, just the struct. I may be hitting a corner case here, but
> > for drivers who need to discover sysfs hierarchy would it be useful
> > to expose some info in struct sysfs_dirent{}?
> 
> No, not at all, why would a driver ever care about that?  Somehow we
> have gotten by for the past 10+ years without needing it, why is your
> driver so different than the thousands of other Linux drivers?
> 
> greg k-h

ditto, I will drop that extra check.

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03 17:35             ` Jacob Pan
@ 2013-04-05 20:23               ` Greg KH
  2013-04-05 21:33                 ` Jacob Pan
  0 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-04-05 20:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Wed, Apr 03, 2013 at 10:35:51AM -0700, Jacob Pan wrote:
> On Wed, 3 Apr 2013 09:35:09 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > On Tue, Apr 02, 2013 at 09:48:18PM -0700, Jacob Pan wrote:
> > > > Let's step back and start over, what exactly are you trying to
> > > > tell userspace?  What data do you have that you need to express
> > > > to it?  How do you want userspace to see/use it?
> > > 
> > > It is a good idea to step back and let me explain what I wanted to
> > > do here for userspace.
> > > 
> > > I have two kinds of applications that might use this driver.
> > > 1. simple use case where user sets a power limit for a RAPL domain.
> > > e.g. set graphics unit power limit to 7w
> > > 2. advanced use case where use can do fine tuning on top of simple
> > > power limit,e.g. the dynamic response parameters of power control
> > > logic, event notifications, etc.
> > > 
> > > For #1, this driver register with the abstract generic thermal layer
> > > (/sys/class/thermal) and presents itself as a set of cooling devices
> > > with a single knob per domain for power limits.
> > > root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 >
> > > cur_state 
> > 
> > Great, how about submitting that functionality as patch 1 of your
> > series?  That seems like a very "normal" thermal driver, right?
> > 
> yes, that would be a normal thermal cooling device driver. I will do
> that first. Thanks for the suggestion.

Do that first, get it merged, and then let's work on the second part.
The patch for that will be much more obvious as to what you are
attempting to do.

> >  Perhaps the thermal interface could be expanded to provide
> > more functionality that you need?
> yes, some of them such as limits. But not all the data in the list
> above are suitable for thermal interface. That is why I am trying to
> balance between abstracted generic data and RAPL specific data while
> still allow linking between the two.

What is not in the existing interface?  And as this is a thermal device,
why can't you add them there?

> The way I envisioned how a thermal/power management app would use is:
> 1. go through generic thermal layer sysfs and find available RAPL
> domains
> 2. if the app wants to do more fine grained control, it follows the
> device symlink to locate the RAPL domain specific sysfs area.

So any application will have to know all of the device-specific
attributes?  That totally defeats the purpose of a generic api that the
kernel is providing.  You are creating device-specific apis that will
not work over the long-run (i.e. next 5-10 years.)  Please don't do that
unless you have exhausted _all_ other alternatives.

So, get your first driver accepted, using the in-kernel thermal api, and
then, if you still feel you wish to do device-specific extensions, we
can discuss that then.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-03 18:03             ` Jacob Pan
@ 2013-04-05 20:24               ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2013-04-05 20:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Wed, Apr 03, 2013 at 11:03:08AM -0700, Jacob Pan wrote:
> On Wed, 3 Apr 2013 09:30:52 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > On Tue, Apr 02, 2013 at 05:17:14PM -0700, Jacob Pan wrote:
> > > On Tue, 2 Apr 2013 16:48:05 -0700
> > > Greg KH <gregkh@linuxfoundation.org> wrote:
> > > 
> > > > On Tue, Apr 02, 2013 at 04:33:57PM -0700, Jacob Pan wrote:
> > > > > On Tue, 2 Apr 2013 16:00:42 -0700
> > > > > Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > 
> > > > > > > +#include "intel_rapl.h"
> > > > > > > +#include "../../../fs/sysfs/sysfs.h"  
> > > > > > 
> > > > > > WTF?
> > > > > > 
> > > > > > Oh, that's a sure sign you are not doing something properly,
> > > > > > if you think it's ok to muck around with the internals of
> > > > > > sysfs.
> > > > > > 
> > > > > > There's a reason that file is "private", why do you think
> > > > > > it's ok to use it directly?  Did you just think that I
> > > > > > somehow "forgot" to put it in the proper include directory?
> > > > > I did feel unsure about this but i saw some precedence in the
> > > > > kernel.
> > > > 
> > > > Someone else is doing this with the sysfs api?  I don't see any
> > > > other code in Linus's tree doing this at the moment, where did
> > > > you see this? Let me know and I'll fix it up right away.
> > > > 
> > > no, i did not mean sysfs api. I mean include internal header files
> > > via #include ../../ 
> > > e.g.in drivers/usb/image/microtek.c
> > > 
> > > #include "../../scsi/scsi.h"
> > > #include <scsi/scsi_host.h>
> > 
> > That is because this is a scsi host driver.  Your code is not part of
> > sysfs itself.
> > 
> > > > > Anyway, I needed a way to validate a userspace file passed to
> > > > > rapl driver belong to the same sysfs directory. I will look for
> > > > > alternative ways.
> > > > 
> > > > What do you mean by this?  What exactly are you trying to do?  No
> > > > normal driver code should _ever_ call sysfs functions directly,
> > > > nor should they ever care about sysfs internals.
> > > > 
> > > i did not call sysfs internal calls, just need to use 
> > > struct sysfs_dirent {}
> > > 
> > > to do the following sanity check against user passed event control
> > > file, it is still not a 100% strong check. 
> > > 	/* check if the cfile belongs to the same rapl domain */
> > > 	if (strcmp(rd->kobj.sd->s_name,
> > > 			cfile->f_dentry->d_parent->d_name.name)) {
> > > 		pr_debug("cfile does not belong to domain %s\n",
> > > 			rd->kobj.sd->s_name);
> > > 		ret = -EINVAL;
> > > 		goto exit_cleanup_fds;
> > > 	}
> > 
> > This made it through a code review at Intel?  Seriously?  Come on,
> > there's just so much wrong here, I don't know where to begin.
> > 
> > Hint, if you find yourself caring about the internals of sysfs in a
> > device driver, you are doing something so wrong it's not funny.  Do
> > you see _any_ other driver doing anything like this?  What makes this
> > driver so special that it can do unexpected, and totally different
> > things with sysfs?
> > 
> I admit that my knowledge in this area are limited. I appreciate your
> help and straightforward comments.
> 
> Perhaps the reason is that not many drivers use eventfd and its way to
> arm event thresholds.

sysfs should not be using eventfs, that's the issue here.  Don't do
that, sysfs has other apis to do this type of thing, if you somehow
think it is something that you really need to do (hint, I doubt it...)

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-05 20:23               ` Greg KH
@ 2013-04-05 21:33                 ` Jacob Pan
  2013-04-05 21:46                   ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Jacob Pan @ 2013-04-05 21:33 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Fri, 5 Apr 2013 13:23:09 -0700
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Wed, Apr 03, 2013 at 10:35:51AM -0700, Jacob Pan wrote:
> > On Wed, 3 Apr 2013 09:35:09 -0700
> > Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> > > On Tue, Apr 02, 2013 at 09:48:18PM -0700, Jacob Pan wrote:
> > > > > Let's step back and start over, what exactly are you trying to
> > > > > tell userspace?  What data do you have that you need to
> > > > > express to it?  How do you want userspace to see/use it?
> > > > 
> > > > It is a good idea to step back and let me explain what I wanted
> > > > to do here for userspace.
> > > > 
> > > > I have two kinds of applications that might use this driver.
> > > > 1. simple use case where user sets a power limit for a RAPL
> > > > domain. e.g. set graphics unit power limit to 7w
> > > > 2. advanced use case where use can do fine tuning on top of
> > > > simple power limit,e.g. the dynamic response parameters of
> > > > power control logic, event notifications, etc.
> > > > 
> > > > For #1, this driver register with the abstract generic thermal
> > > > layer (/sys/class/thermal) and presents itself as a set of
> > > > cooling devices with a single knob per domain for power limits.
> > > > root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 >
> > > > cur_state 
> > > 
> > > Great, how about submitting that functionality as patch 1 of your
> > > series?  That seems like a very "normal" thermal driver, right?
> > > 
> > yes, that would be a normal thermal cooling device driver. I will do
> > that first. Thanks for the suggestion.
> 
> Do that first, get it merged, and then let's work on the second part.
> The patch for that will be much more obvious as to what you are
> attempting to do.
> 
Sorry I was too busy to work on v2 before seeing this. I agree I need
to simplify the interface, I just need to come up with a more
intelligent way to abstract that and do the best guesses for the user.
Hopefully, v2 will serve as a confirmation on the comments I got from
v1. i.e. kobject->struct device, removed dependencies on sysfs internal
data struct, etc.

> > >  Perhaps the thermal interface could be expanded to provide
> > > more functionality that you need?
> > yes, some of them such as limits. But not all the data in the list
> > above are suitable for thermal interface. That is why I am trying to
> > balance between abstracted generic data and RAPL specific data while
> > still allow linking between the two.
> 
> What is not in the existing interface?  And as this is a thermal
> device, why can't you add them there?
existing interface has only cur_state and max_state, I have been
working with Rui (thermal maintainer) on adding more knobs for cooling
devices, such as limit low, limit high, event control. I believe they
can all be added.

What is not appropriate for thermal interface are the things like
energy counters, accumulated throttle time, time constants that used by
the internal control algorithm.
> 
> > The way I envisioned how a thermal/power management app would use
> > is: 1. go through generic thermal layer sysfs and find available
> > RAPL domains
> > 2. if the app wants to do more fine grained control, it follows the
> > device symlink to locate the RAPL domain specific sysfs area.
> 
> So any application will have to know all of the device-specific
> attributes?  That totally defeats the purpose of a generic api that
> the kernel is providing.  You are creating device-specific apis that
> will not work over the long-run (i.e. next 5-10 years.)  Please don't
> do that unless you have exhausted _all_ other alternatives.
> 
I agree we should improve the abstraction to avoid using device
specific attributes. Unfortunately, on the other side power tuning tends
to be very platform specific.
> So, get your first driver accepted, using the in-kernel thermal api,
> and then, if you still feel you wish to do device-specific
> extensions, we can discuss that then.
> 
Agreed. I just want to make sure the rapl class and per domain 'struct
device' layout in v2 is what you suggested. I avoided using raw kobjects
and sysfs internals.

The challenge for the simplified/abstract model is that the driver
would have to guess what is best for the user.
e.g. when user selects power limit of 8200mw for the core power plane.
Besides setting that limit, the driver will decide the following for
the user:
 - dynamics of the control, rise time, overshoot, steady state error
 - whether or not allow P/T states to go below OS request value
 - correlation with instantaneous power limit

I believe it is all doable, just need to strike a balance and for
different platforms. But I do believe it is a better interface for most
generic applications.

> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe
> platform-driver-x86" in the body of a message to
> majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

[Jacob Pan]

-- 
Thanks,

Jacob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/1] Introduce Intel RAPL cooling device driver
  2013-04-05 21:33                 ` Jacob Pan
@ 2013-04-05 21:46                   ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2013-04-05 21:46 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, Platform Driver, Matthew Garrett, Zhang Rui,
	Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven

On Fri, Apr 05, 2013 at 02:33:40PM -0700, Jacob Pan wrote:
> On Fri, 5 Apr 2013 13:23:09 -0700
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > On Wed, Apr 03, 2013 at 10:35:51AM -0700, Jacob Pan wrote:
> > > On Wed, 3 Apr 2013 09:35:09 -0700
> > > Greg KH <gregkh@linuxfoundation.org> wrote:
> > > 
> > > > On Tue, Apr 02, 2013 at 09:48:18PM -0700, Jacob Pan wrote:
> > > > > > Let's step back and start over, what exactly are you trying to
> > > > > > tell userspace?  What data do you have that you need to
> > > > > > express to it?  How do you want userspace to see/use it?
> > > > > 
> > > > > It is a good idea to step back and let me explain what I wanted
> > > > > to do here for userspace.
> > > > > 
> > > > > I have two kinds of applications that might use this driver.
> > > > > 1. simple use case where user sets a power limit for a RAPL
> > > > > domain. e.g. set graphics unit power limit to 7w
> > > > > 2. advanced use case where use can do fine tuning on top of
> > > > > simple power limit,e.g. the dynamic response parameters of
> > > > > power control logic, event notifications, etc.
> > > > > 
> > > > > For #1, this driver register with the abstract generic thermal
> > > > > layer (/sys/class/thermal) and presents itself as a set of
> > > > > cooling devices with a single knob per domain for power limits.
> > > > > root@chromoly:/sys/class/thermal/cooling_device15# echo 7000 >
> > > > > cur_state 
> > > > 
> > > > Great, how about submitting that functionality as patch 1 of your
> > > > series?  That seems like a very "normal" thermal driver, right?
> > > > 
> > > yes, that would be a normal thermal cooling device driver. I will do
> > > that first. Thanks for the suggestion.
> > 
> > Do that first, get it merged, and then let's work on the second part.
> > The patch for that will be much more obvious as to what you are
> > attempting to do.
> > 
> Sorry I was too busy to work on v2 before seeing this. I agree I need
> to simplify the interface, I just need to come up with a more
> intelligent way to abstract that and do the best guesses for the user.
> Hopefully, v2 will serve as a confirmation on the comments I got from
> v1. i.e. kobject->struct device, removed dependencies on sysfs internal
> data struct, etc.

I'm not going to review that part of the code, sorry, as it's about to
be ripped out anyway :)

> > > >  Perhaps the thermal interface could be expanded to provide
> > > > more functionality that you need?
> > > yes, some of them such as limits. But not all the data in the list
> > > above are suitable for thermal interface. That is why I am trying to
> > > balance between abstracted generic data and RAPL specific data while
> > > still allow linking between the two.
> > 
> > What is not in the existing interface?  And as this is a thermal
> > device, why can't you add them there?
> existing interface has only cur_state and max_state, I have been
> working with Rui (thermal maintainer) on adding more knobs for cooling
> devices, such as limit low, limit high, event control. I believe they
> can all be added.

Great, then you will not need any of the driver-specific sysfs files,
struct device usage, or kobject mess, so your code should be a lot
smaller.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-04-05 21:46 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-02 22:15 [PATCH 0/1] RAPL (Running Average Power Limit) driver Jacob Pan
2013-04-02 22:15 ` [PATCH 1/1] Introduce Intel RAPL cooling device driver Jacob Pan
2013-04-02 21:42   ` Randy Dunlap
2013-04-02 21:47   ` Randy Dunlap
2013-04-02 23:04     ` Greg Kroah-Hartman
2013-04-02 22:35   ` Joe Perches
2013-04-02 23:00   ` Greg KH
2013-04-02 23:03     ` Greg KH
2013-04-02 23:33     ` Jacob Pan
2013-04-02 23:48       ` Greg KH
2013-04-03  0:17         ` Jacob Pan
2013-04-03 16:30           ` Greg KH
2013-04-03 18:03             ` Jacob Pan
2013-04-05 20:24               ` Greg KH
2013-04-02 23:53     ` Jacob Pan
2013-04-03  0:13       ` Greg KH
2013-04-03  4:48         ` Jacob Pan
2013-04-03 16:35           ` Greg KH
2013-04-03 17:35             ` Jacob Pan
2013-04-05 20:23               ` Greg KH
2013-04-05 21:33                 ` Jacob Pan
2013-04-05 21:46                   ` Greg KH
2013-04-03 10:24   ` Paul Bolle

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.