linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] x86: Move away from /dev/cpu/*/msr
@ 2016-06-15 10:00 Borislav Petkov
  2016-06-15 10:22 ` chenyu
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 10:00 UTC (permalink / raw)
  To: lkml
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Greg Kroah-Hartman,
	Thomas Renninger, Kan Liang, Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, linux-pm

Hi people,

so we've been talking about this for a long time now - how loading
msr.ko is not a good thing and how userspace shouldn't poke at random
MSRs.

So my intention is to move away users in tools/ which did write to MSRs
through the char dev and replace it with proper sysfs et al interfaces.
Once that's done, we can start tainting the kernel when writing to MSRs
from that device or even forbid it completely at some point.

We'll see.

Anyway, here's a first attempt, please scream if something's not right.
Functionality-wise, it should be equivalent as I'm exporting the
pref_hint of the IA32_ENERGY_PERF_BIAS in sysfs and it lands under

/sys/devices/system/cpu/cpu?/energy_policy_pref_hint

where anything with sufficient perms can read/write it.

Comments are, as always, appreciated.

---
From: Borislav Petkov <bp@suse.de>
Date: Wed, 15 Jun 2016 11:20:41 +0200
Subject: [PATCH] Move away from msr.ko

Take care of MSR_IA32_ENERGY_PERF_BIAS.

Not-yet-signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/Makefile                       |  1 +
 arch/x86/kernel/cpu/intel.c                        |  2 +
 arch/x86/kernel/cpu/sysfs.c                        | 55 ++++++++++++++
 drivers/base/cpu.c                                 | 28 ++++++-
 tools/power/cpupower/Makefile                      |  2 +-
 tools/power/cpupower/utils/cpupower-info.c         |  4 +-
 tools/power/cpupower/utils/cpupower-set.c          |  4 +-
 tools/power/cpupower/utils/helpers/helpers.h       |  6 --
 tools/power/cpupower/utils/helpers/msr.c           | 57 --------------
 tools/power/utils/utils.c                          | 45 +++++++++++
 tools/power/utils/utils.h                          |  7 ++
 tools/power/x86/turbostat/Makefile                 | 16 +++-
 tools/power/x86/turbostat/turbostat.c              | 11 ++-
 tools/power/x86/x86_energy_perf_policy/Makefile    |  4 +-
 .../x86_energy_perf_policy.c                       | 86 +++++-----------------
 15 files changed, 182 insertions(+), 146 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sysfs.c
 create mode 100644 tools/power/utils/utils.c
 create mode 100644 tools/power/utils/utils.h

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f7d4ef..d8085857a0d1 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -20,6 +20,7 @@ obj-y			:= intel_cacheinfo.o scattered.o topology.o
 obj-y			+= common.o
 obj-y			+= rdrand.o
 obj-y			+= match.o
+obj-y			+= sysfs.o
 
 obj-$(CONFIG_PROC_FS)	+= proc.o
 obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 6e2ffbebbcdb..37d02df6803e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -210,6 +210,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	}
 
 	check_mpx_erratum(c);
+
+	setup_force_cpu_cap(X86_FEATURE_EPB);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/cpu/sysfs.c b/arch/x86/kernel/cpu/sysfs.c
new file mode 100644
index 000000000000..7114abdc9dac
--- /dev/null
+++ b/arch/x86/kernel/cpu/sysfs.c
@@ -0,0 +1,55 @@
+#include <linux/sysfs.h>
+#include <linux/device.h>
+
+static ssize_t
+energy_policy_pref_hint_show(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	u64 epb;
+
+	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
+
+	return sprintf(buf, "%d\n", (unsigned int)(epb & 0xFULL));
+}
+
+static ssize_t
+energy_policy_pref_hint_store(struct device *dev, struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	u32 val;
+	u64 epb;
+
+	if (kstrtou32(buf, 10, &val) < 0)
+		return -EINVAL;
+
+	if (val > 15)
+		return -EINVAL;
+
+	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
+
+	if ((epb & 0xf) == val)
+		return count;
+
+	wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, (epb & ~0xFULL) | val);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(energy_policy_pref_hint);
+
+static struct attribute *cpu_attrs[] = {
+	&dev_attr_energy_policy_pref_hint.attr,
+	NULL,
+};
+
+static struct attribute_group cpu_attr_group = {
+	.attrs = cpu_attrs,
+};
+
+const struct attribute_group * arch_get_cpu_group(void)
+{
+	if (static_cpu_has(X86_FEATURE_EPB))
+		return &cpu_attr_group;
+
+	return NULL;
+}
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..4b9a6e499dbd 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -344,6 +344,8 @@ static int cpu_uevent(struct device *dev, struct kobj_uevent_env *env)
 }
 #endif
 
+const struct attribute_group * __weak arch_get_cpu_group(void) { return NULL; }
+
 /*
  * register_cpu - Setup a sysfs device for a CPU.
  * @cpu - cpu->hotpluggable field set to 1 will generate a control file in
@@ -354,6 +356,8 @@ static int cpu_uevent(struct device *dev, struct kobj_uevent_env *env)
  */
 int register_cpu(struct cpu *cpu, int num)
 {
+	const struct attribute_group *arch_grp;
+	int grp_size;
 	int error;
 
 	cpu->node_id = cpu_to_node(num);
@@ -368,13 +372,31 @@ int register_cpu(struct cpu *cpu, int num)
 	cpu->dev.bus->uevent = cpu_uevent;
 #endif
 	cpu->dev.groups = common_cpu_attr_groups;
-	if (cpu->hotpluggable)
+	grp_size = ARRAY_SIZE(common_cpu_attr_groups);
+
+	if (cpu->hotpluggable) {
 		cpu->dev.groups = hotplugable_cpu_attr_groups;
+		grp_size = ARRAY_SIZE(hotplugable_cpu_attr_groups);
+	}
+
+	arch_grp = arch_get_cpu_group();
+	if (arch_grp) {
+		const struct attribute_group **tmp;
+
+		tmp = kcalloc(grp_size + 1, sizeof(struct attribute_group *), GFP_KERNEL);
+		if (tmp) {
+			memcpy(tmp, cpu->dev.groups, sizeof(*tmp) * grp_size);
+			tmp[grp_size - 1] = arch_grp;
+			cpu->dev.groups = tmp;
+		}
+	}
+
 	error = device_register(&cpu->dev);
-	if (!error)
+	if (!error) {
 		per_cpu(cpu_sys_devices, num) = &cpu->dev;
-	if (!error)
 		register_cpu_under_node(num, cpu_to_node(num));
+	}
+
 
 	return error;
 }
diff --git a/tools/power/cpupower/Makefile b/tools/power/cpupower/Makefile
index 8358863259c5..1d52a5f9dbaf 100644
--- a/tools/power/cpupower/Makefile
+++ b/tools/power/cpupower/Makefile
@@ -138,7 +138,7 @@ UTIL_OBJS =  utils/helpers/amd.o utils/helpers/msr.o \
 	utils/idle_monitor/mperf_monitor.o utils/idle_monitor/cpupower-monitor.o \
 	utils/cpupower.o utils/cpufreq-info.o utils/cpufreq-set.o \
 	utils/cpupower-set.o utils/cpupower-info.o utils/cpuidle-info.o \
-	utils/cpuidle-set.o
+	utils/cpuidle-set.o ../utils/utils.o
 
 UTIL_SRC := $(UTIL_OBJS:.o=.c)
 
diff --git a/tools/power/cpupower/utils/cpupower-info.c b/tools/power/cpupower/utils/cpupower-info.c
index c7caa8eaa6d0..d1e202debdc1 100644
--- a/tools/power/cpupower/utils/cpupower-info.c
+++ b/tools/power/cpupower/utils/cpupower-info.c
@@ -15,6 +15,8 @@
 #include "helpers/helpers.h"
 #include "helpers/sysfs.h"
 
+#include "../../utils/utils.h"
+
 static struct option set_opts[] = {
      {"perf-bias", optional_argument, NULL, 'b'},
      { },
@@ -93,7 +95,7 @@ int cmd_info(int argc, char **argv)
 		}
 
 		if (params.perf_bias) {
-			ret = msr_intel_get_perf_bias(cpu);
+			ret = get_pref_hint(cpu);
 			if (ret < 0) {
 				fprintf(stderr,
 			_("Could not read perf-bias value[%d]\n"), ret);
diff --git a/tools/power/cpupower/utils/cpupower-set.c b/tools/power/cpupower/utils/cpupower-set.c
index 532f46b9a335..00f5b4681d0d 100644
--- a/tools/power/cpupower/utils/cpupower-set.c
+++ b/tools/power/cpupower/utils/cpupower-set.c
@@ -16,6 +16,8 @@
 #include "helpers/sysfs.h"
 #include "helpers/bitmask.h"
 
+#include "../../utils/utils.h"
+
 static struct option set_opts[] = {
 	{"perf-bias", required_argument, NULL, 'b'},
 	{ },
@@ -87,7 +89,7 @@ int cmd_set(int argc, char **argv)
 		}
 
 		if (params.perf_bias) {
-			ret = msr_intel_set_perf_bias(cpu, perf_bias);
+			ret = set_pref_hint(cpu, perf_bias);
 			if (ret) {
 				fprintf(stderr, _("Error setting perf-bias "
 						  "value on CPU %d\n"), cpu);
diff --git a/tools/power/cpupower/utils/helpers/helpers.h b/tools/power/cpupower/utils/helpers/helpers.h
index afb66f80554e..107d2c7c1ace 100644
--- a/tools/power/cpupower/utils/helpers/helpers.h
+++ b/tools/power/cpupower/utils/helpers/helpers.h
@@ -102,8 +102,6 @@ extern struct cpupower_cpu_info cpupower_cpu_info;
 extern int read_msr(int cpu, unsigned int idx, unsigned long long *val);
 extern int write_msr(int cpu, unsigned int idx, unsigned long long val);
 
-extern int msr_intel_set_perf_bias(unsigned int cpu, unsigned int val);
-extern int msr_intel_get_perf_bias(unsigned int cpu);
 extern unsigned long long msr_intel_get_turbo_ratio(unsigned int cpu);
 
 /* Read/Write msr ****************************/
@@ -147,10 +145,6 @@ static inline int read_msr(int cpu, unsigned int idx, unsigned long long *val)
 { return -1; };
 static inline int write_msr(int cpu, unsigned int idx, unsigned long long val)
 { return -1; };
-static inline int msr_intel_set_perf_bias(unsigned int cpu, unsigned int val)
-{ return -1; };
-static inline int msr_intel_get_perf_bias(unsigned int cpu)
-{ return -1; };
 static inline unsigned long long msr_intel_get_turbo_ratio(unsigned int cpu)
 { return 0; };
 
diff --git a/tools/power/cpupower/utils/helpers/msr.c b/tools/power/cpupower/utils/helpers/msr.c
index 31a4b24a8bc6..6c076c26ef52 100644
--- a/tools/power/cpupower/utils/helpers/msr.c
+++ b/tools/power/cpupower/utils/helpers/msr.c
@@ -10,7 +10,6 @@
 /* Intel specific MSRs */
 #define MSR_IA32_PERF_STATUS		0x198
 #define MSR_IA32_MISC_ENABLES		0x1a0
-#define MSR_IA32_ENERGY_PERF_BIAS	0x1b0
 #define MSR_NEHALEM_TURBO_RATIO_LIMIT	0x1ad
 
 /*
@@ -43,62 +42,6 @@ int read_msr(int cpu, unsigned int idx, unsigned long long *val)
 	return -1;
 }
 
-/*
- * write_msr
- *
- * Will return 0 on success and -1 on failure.
- * Possible errno values could be:
- * EFAULT -If the read/write did not fully complete
- * EIO    -If the CPU does not support MSRs
- * ENXIO  -If the CPU does not exist
- */
-int write_msr(int cpu, unsigned int idx, unsigned long long val)
-{
-	int fd;
-	char msr_file_name[64];
-
-	sprintf(msr_file_name, "/dev/cpu/%d/msr", cpu);
-	fd = open(msr_file_name, O_WRONLY);
-	if (fd < 0)
-		return -1;
-	if (lseek(fd, idx, SEEK_CUR) == -1)
-		goto err;
-	if (write(fd, &val, sizeof val) != sizeof val)
-		goto err;
-	close(fd);
-	return 0;
- err:
-	close(fd);
-	return -1;
-}
-
-int msr_intel_get_perf_bias(unsigned int cpu)
-{
-	unsigned long long val;
-	int ret;
-
-	if (!(cpupower_cpu_info.caps & CPUPOWER_CAP_PERF_BIAS))
-		return -1;
-
-	ret = read_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS, &val);
-	if (ret)
-		return ret;
-	return val;
-}
-
-int msr_intel_set_perf_bias(unsigned int cpu, unsigned int val)
-{
-	int ret;
-
-	if (!(cpupower_cpu_info.caps & CPUPOWER_CAP_PERF_BIAS))
-		return -1;
-
-	ret = write_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS, val);
-	if (ret)
-		return ret;
-	return 0;
-}
-
 unsigned long long msr_intel_get_turbo_ratio(unsigned int cpu)
 {
 	unsigned long long val;
diff --git a/tools/power/utils/utils.c b/tools/power/utils/utils.c
new file mode 100644
index 000000000000..0377c5686704
--- /dev/null
+++ b/tools/power/utils/utils.c
@@ -0,0 +1,45 @@
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+int get_pref_hint(unsigned int cpu)
+{
+	char fname[64];
+	int val = -1;
+	int fd;
+
+	sprintf(fname, "/sys/devices/system/cpu/cpu%d/energy_policy_pref_hint", cpu);
+
+	fd = open(fname, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	if (read(fd, &val, sizeof(val)) != sizeof(val))
+		val = -1;
+
+	close(fd);
+	return val;
+}
+
+int set_pref_hint(unsigned int cpu, unsigned int val)
+{
+	char fname[64];
+	int ret = 0, fd;
+
+	sprintf(fname, "/sys/devices/system/cpu/cpu%d/energy_policy_pref_hint", cpu);
+
+	fd = open(fname, O_WRONLY);
+	if (fd < 0)
+		return -1;
+
+	if (write(fd, &val, sizeof(val)) != sizeof(val))
+		ret = -1;
+
+	close(fd);
+	return ret;
+}
+
+
diff --git a/tools/power/utils/utils.h b/tools/power/utils/utils.h
new file mode 100644
index 000000000000..d6989a0cecc8
--- /dev/null
+++ b/tools/power/utils/utils.h
@@ -0,0 +1,7 @@
+#ifndef __TOOLS_POWER_UTILS_H_
+#define __TOOLS_POWER_UTILS_H_
+
+int get_pref_hint(unsigned int cpu);
+int set_pref_hint(unsigned int cpu, unsigned int val);
+
+#endif /* __TOOLS_POWER_UTILS_H_ */
diff --git a/tools/power/x86/turbostat/Makefile b/tools/power/x86/turbostat/Makefile
index e367b1a85d70..a806b81e2e5e 100644
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -2,22 +2,30 @@ CC		= $(CROSS_COMPILE)gcc
 BUILD_OUTPUT	:= $(CURDIR)
 PREFIX		:= /usr
 DESTDIR		:=
+INCLUDE		:= -I../../utils
 
 ifeq ("$(origin O)", "command line")
 	BUILD_OUTPUT := $(O)
 endif
 
-turbostat : turbostat.c
+OBJS		:= turbostat.o ../../utils/utils.o
+
+turbostat : $(OBJS)
 CFLAGS +=	-Wall
 CFLAGS +=	-DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
+CFLAGS +=	$(INCLUDE)
+
+%.o: %.c
+	@mkdir -p $(BUILD_OUTPUT)
+	$(CC) -c $(CFLAGS) $< -o $(BUILD_OUTPUT)/$@
 
-%: %.c
+turbostat: $(OBJS)
 	@mkdir -p $(BUILD_OUTPUT)
-	$(CC) $(CFLAGS) $< -o $(BUILD_OUTPUT)/$@
+	$(CC) $(CFLAGS) $? -o $(BUILD_OUTPUT)/$@
 
 .PHONY : clean
 clean :
-	@rm -f $(BUILD_OUTPUT)/turbostat
+	@rm -f $(BUILD_OUTPUT)/turbostat $(BUILD_OUTPUT)/*.o $(BUILD_OUTPUT)/../../utils/utils.o
 
 install : turbostat
 	install -d  $(DESTDIR)$(PREFIX)/bin
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index acbf7ff2ee6e..1695afa634fb 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -43,6 +43,8 @@
 #include <linux/capability.h>
 #include <errno.h>
 
+#include "../../utils/utils.h"
+
 char *proc_stat = "/proc/stat";
 FILE *outf;
 int *fd_percpu;
@@ -2353,8 +2355,8 @@ dump_cstate_pstate_config_info(unsigned int family, unsigned int model)
  */
 int print_epb(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 {
-	unsigned long long msr;
 	char *epb_string;
+	int pref_hint;
 	int cpu;
 
 	if (!has_epb)
@@ -2371,10 +2373,11 @@ int print_epb(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 		return -1;
 	}
 
-	if (get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS, &msr))
+	pref_hint = get_pref_hint(cpu);
+	if (pref_hint < 0)
 		return 0;
 
-	switch (msr & 0xF) {
+	switch (pref_hint) {
 	case ENERGY_PERF_BIAS_PERFORMANCE:
 		epb_string = "performance";
 		break;
@@ -2388,7 +2391,7 @@ int print_epb(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 		epb_string = "custom";
 		break;
 	}
-	fprintf(outf, "cpu%d: MSR_IA32_ENERGY_PERF_BIAS: 0x%08llx (%s)\n", cpu, msr, epb_string);
+	fprintf(outf, "cpu%d: %s\n", cpu, epb_string);
 
 	return 0;
 }
diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile
index 971c9ffdcb50..8c194d2102a9 100644
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -1,9 +1,9 @@
 DESTDIR ?=
 
-x86_energy_perf_policy : x86_energy_perf_policy.c
+x86_energy_perf_policy : x86_energy_perf_policy.c ../../utils/utils.c
 
 clean :
-	rm -f x86_energy_perf_policy
+	rm -f x86_energy_perf_policy ../../utils/utils.o
 
 install :
 	install x86_energy_perf_policy ${DESTDIR}/usr/bin/
diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c
index 40b3e5482f8a..50baebb9aa6f 100644
--- a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c
+++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c
@@ -31,6 +31,8 @@
 #include <stdlib.h>
 #include <string.h>
 
+#include "../../utils/utils.h"
+
 unsigned int verbose;		/* set with -v */
 unsigned int read_only;		/* set with -r */
 char *progname;
@@ -69,8 +71,6 @@ void usage(void)
 	exit(1);
 }
 
-#define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
-
 #define	BIAS_PERFORMANCE		0
 #define BIAS_BALANCE			6
 #define	BIAS_POWERSAVE			15
@@ -184,79 +184,31 @@ void validate_cpuid(void)
 	return;	/* success */
 }
 
-unsigned long long get_msr(int cpu, int offset)
-{
-	unsigned long long msr;
-	char msr_path[32];
-	int retval;
-	int fd;
-
-	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
-	fd = open(msr_path, O_RDONLY);
-	if (fd < 0) {
-		printf("Try \"# modprobe msr\"\n");
-		perror(msr_path);
-		exit(1);
-	}
-
-	retval = pread(fd, &msr, sizeof msr, offset);
-
-	if (retval != sizeof msr) {
-		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
-		exit(-2);
-	}
-	close(fd);
-	return msr;
-}
-
-unsigned long long  put_msr(int cpu, unsigned long long new_msr, int offset)
+void print_bias(int cpu)
 {
-	unsigned long long old_msr;
-	char msr_path[32];
-	int retval;
-	int fd;
-
-	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
-	fd = open(msr_path, O_RDWR);
-	if (fd < 0) {
-		perror(msr_path);
-		exit(1);
-	}
-
-	retval = pread(fd, &old_msr, sizeof old_msr, offset);
-	if (retval != sizeof old_msr) {
-		perror("pwrite");
-		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
-		exit(-2);
-	}
+	unsigned int val;
 
-	retval = pwrite(fd, &new_msr, sizeof new_msr, offset);
-	if (retval != sizeof new_msr) {
-		perror("pwrite");
-		printf("pwrite cpu%d 0x%x = %d\n", cpu, offset, retval);
-		exit(-2);
-	}
-
-	close(fd);
-
-	return old_msr;
-}
+	val = get_pref_hint(cpu);
+	if (val < 0)
+		return;
 
-void print_msr(int cpu)
-{
-	printf("cpu%d: 0x%016llx\n",
-		cpu, get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS));
+	printf("cpu%d: 0x%08x\n", cpu, val);
 }
 
 void update_msr(int cpu)
 {
-	unsigned long long previous_msr;
+	unsigned int previous_val;
+
+	previous_val = get_pref_hint(cpu);
+	if (previous_val < 0)
+		return;
 
-	previous_msr = put_msr(cpu, new_bias, MSR_IA32_ENERGY_PERF_BIAS);
+	if (set_pref_hint(cpu, new_bias))
+		return;
 
 	if (verbose)
-		printf("cpu%d  msr0x%x 0x%016llx -> 0x%016llx\n",
-			cpu, MSR_IA32_ENERGY_PERF_BIAS, previous_msr, new_bias);
+		printf("cpu%d  pref hint: 0x%08x -> 0x%08x\n",
+			cpu, previous_val, new_bias);
 
 	return;
 }
@@ -310,12 +262,12 @@ int main(int argc, char **argv)
 
 	if (cpu != -1) {
 		if (read_only)
-			print_msr(cpu);
+			print_bias(cpu);
 		else
 			update_msr(cpu);
 	} else {
 		if (read_only)
-			for_every_cpu(print_msr);
+			for_every_cpu(print_bias);
 		else
 			for_every_cpu(update_msr);
 	}
-- 
2.7.3


-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 10:00 [RFC PATCH] x86: Move away from /dev/cpu/*/msr Borislav Petkov
@ 2016-06-15 10:22 ` chenyu
  2016-06-15 10:36   ` Borislav Petkov
  2016-06-15 16:41 ` Len Brown
  2016-06-20 21:12 ` Andi Kleen
  2 siblings, 1 reply; 11+ messages in thread
From: chenyu @ 2016-06-15 10:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 6:00 PM, Borislav Petkov <bp@alien8.de> wrote:
> Comments are, as always, appreciated.
>
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Wed, 15 Jun 2016 11:20:41 +0200
> Subject: [PATCH] Move away from msr.ko
>
> Take care of MSR_IA32_ENERGY_PERF_BIAS.
>
> Not-yet-signed-off-by: Borislav Petkov <bp@suse.de>
[cut]
> +static ssize_t
> +energy_policy_pref_hint_show(struct device *dev, struct device_attribute *attr,
> +                            char *buf)
> +{
> +       u64 epb;
> +
> +       rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
MSR_IA32_ENERGY_PERF_BIAS is of core scope, should we use rdmsr_on_cpu?
> +
> +       return sprintf(buf, "%d\n", (unsigned int)(epb & 0xFULL));
> +}
> +
> +static ssize_t
> +energy_policy_pref_hint_store(struct device *dev, struct device_attribute *attr,
> +                             const char *buf, size_t count)
> +{
> +       u32 val;
> +       u64 epb;
> +
> +       if (kstrtou32(buf, 10, &val) < 0)
> +               return -EINVAL;
> +
> +       if (val > 15)
> +               return -EINVAL;
> +
> +       rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
rdmsr_on_cpu?
> +
> +       if ((epb & 0xf) == val)
> +               return count;
> +
> +       wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, (epb & ~0xFULL) | val);
wrmsr_on_cpu?



thanks,
Yu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 10:22 ` chenyu
@ 2016-06-15 10:36   ` Borislav Petkov
  2016-06-15 14:00     ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 10:36 UTC (permalink / raw)
  To: chenyu
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 06:22:59PM +0800, chenyu wrote:
> MSR_IA32_ENERGY_PERF_BIAS is of core scope, should we use rdmsr_on_cpu?

Ah, yes, good point. Will change.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 10:36   ` Borislav Petkov
@ 2016-06-15 14:00     ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 14:00 UTC (permalink / raw)
  To: chenyu
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 12:36:22PM +0200, Borislav Petkov wrote:
> Ah, yes, good point. Will change.

Ok, got a bit more involved - I also need to disable preemption around
smp_call_function_single(). Here's what I have now. It seems to work
too.

static ssize_t
energy_policy_pref_hint_show(struct device *dev, struct device_attribute *attr,
                             char *buf)
{
        u64 epb;

        preempt_disable();
        if (rdmsrl_on_cpu(dev->id, MSR_IA32_ENERGY_PERF_BIAS, &epb))
                epb = -1;
        preempt_enable();

        return sprintf(buf, "%d\n", (unsigned int)(epb & 0xFULL));
}

static ssize_t
energy_policy_pref_hint_store(struct device *dev, struct device_attribute *attr,
                              const char *buf, size_t count)
{
        u32 val;
        u64 epb;

        if (kstrtou32(buf, 10, &val) < 0)
                return -EINVAL;

        if (val > 15)
                return -EINVAL;

        preempt_disable();

        if (rdmsrl_on_cpu(dev->id, MSR_IA32_ENERGY_PERF_BIAS, &epb)) {
                count = -EINVAL;
                goto out;
        }

        if ((epb & 0xf) == val)
                goto out;

        if (wrmsrl_on_cpu(dev->id, MSR_IA32_ENERGY_PERF_BIAS, (epb & ~0xFULL) | val))
                count = -EINVAL;

out:
        preempt_enable();
        return count;
}

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 10:00 [RFC PATCH] x86: Move away from /dev/cpu/*/msr Borislav Petkov
  2016-06-15 10:22 ` chenyu
@ 2016-06-15 16:41 ` Len Brown
  2016-06-15 16:56   ` Borislav Petkov
  2016-06-20 21:12 ` Andi Kleen
  2 siblings, 1 reply; 11+ messages in thread
From: Len Brown @ 2016-06-15 16:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 6:00 AM, Borislav Petkov <bp@alien8.de> wrote:
> Hi people,
>
> so we've been talking about this for a long time now - how loading
> msr.ko is not a good thing and how userspace shouldn't poke at random
> MSRs.
>
> So my intention is to move away users in tools/ which did write to MSRs
> through the char dev and replace it with proper sysfs et al interfaces.
> Once that's done, we can start tainting the kernel when writing to MSRs
> from that device or even forbid it completely at some point.
>
> We'll see.
>
> Anyway, here's a first attempt, please scream if something's not right.
> Functionality-wise, it should be equivalent as I'm exporting the
> pref_hint of the IA32_ENERGY_PERF_BIAS in sysfs and it lands under
>
> /sys/devices/system/cpu/cpu?/energy_policy_pref_hint
>
> where anything with sufficient perms can read/write it.

turbostat reads MSRs, but never writes.  And it will still
need /dev/msr for all kinds of counters it reads.  So updating
turbostat to use this new attribute for EPB reads is sort of
a demo, rather than a functional change.

I agree the kernel should be tainted if user-space uses
/dev/msr to scribble on MSRs behind the kernel's back.

When EPB was first invented, I proposed a sysfs attribute to
control it.  But that proposal was system-wide, and affected
more than EPB.  Maybe that was too ambitious.  The
energy_perf_policy utility was a "plan-b".

Recent hardware has an additional MSR field

MSR_IA32_HWP_REQUEST.ENERGY_PERFORMANCE_PREFERENCE
that replaces

MSR_IA32_ENERGY_PERF_BIAS
for the purpose of P-state control.

Both MSRs/fields exist and have effect at the same time.

so the API
energy_policy_pref_hint

will not work -- as it isn't clear which MSR it refers to.

I've updated x86_energy_perf_policy to talk to this MSR
and a number of others for the benefit of HWP.  The
patch is over 1000 lines.  I'll post it shortly.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 16:41 ` Len Brown
@ 2016-06-15 16:56   ` Borislav Petkov
  2016-06-15 17:21     ` Len Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 16:56 UTC (permalink / raw)
  To: Len Brown
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 12:41:21PM -0400, Len Brown wrote:
> Recent hardware has an additional MSR field
> 
> MSR_IA32_HWP_REQUEST.ENERGY_PERFORMANCE_PREFERENCE
> that replaces
> 
> MSR_IA32_ENERGY_PERF_BIAS
> for the purpose of P-state control.
>
> 
> Both MSRs/fields exist and have effect at the same time.
> 
> so the API
> energy_policy_pref_hint
> 
> will not work -- as it isn't clear which MSR it refers to.

Surely we can make the new interface work too - perhaps add a new sysfs
file for the new thing. The old MSR_IA32_ENERGY_PERF_BIAS would be
needed on those older boxes.

Or we can have a sysfs file which is called something like
"perf_preference" or whatnot and that thing either maps input to the old
MSR_IA32_ENERGY_PERF_BIAS or to the new thing.

> I've updated x86_energy_perf_policy to talk to this MSR
> and a number of others for the benefit of HWP.  The
> patch is over 1000 lines.  I'll post it shortly.

So we should *not* give ourselves the example that using msr.ko for
other things *besides* debugging is ok. It is very wrong to talk to
naked MSRs and we have done it by now because this thing was there and
well, sure, why not use it.

But poking at MSRs is dangerous and we need proper abstraction. And we
should work towards that instead perpetuating wrong use.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 16:56   ` Borislav Petkov
@ 2016-06-15 17:21     ` Len Brown
  2016-06-15 17:39       ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Len Brown @ 2016-06-15 17:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 12:56 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Jun 15, 2016 at 12:41:21PM -0400, Len Brown wrote:
>> Recent hardware has an additional MSR field
>>
>> MSR_IA32_HWP_REQUEST.ENERGY_PERFORMANCE_PREFERENCE
>> that replaces
>>
>> MSR_IA32_ENERGY_PERF_BIAS
>> for the purpose of P-state control.
>>
>>
>> Both MSRs/fields exist and have effect at the same time.
>>
>> so the API
>> energy_policy_pref_hint
>>
>> will not work -- as it isn't clear which MSR it refers to.
>
> Surely we can make the new interface work too - perhaps add a new sysfs
> file for the new thing. The old MSR_IA32_ENERGY_PERF_BIAS would be
> needed on those older boxes.
>
> Or we can have a sysfs file which is called something like
> "perf_preference" or whatnot and that thing either maps input to the old
> MSR_IA32_ENERGY_PERF_BIAS or to the new thing.

Maybe I wasn't clear.
The API -- the name -- must be clear about what MSR it talks to.
I suggest that the name exactly match the name of the actual MSR,
because you are about to need a 2nd one with a name so close
that it will otherwise be ambiguous.

>> I've updated x86_energy_perf_policy to talk to this MSR
>> and a number of others for the benefit of HWP.  The
>> patch is over 1000 lines.  I'll post it shortly.
>
> So we should *not* give ourselves the example that using msr.ko for
> other things *besides* debugging is ok. It is very wrong to talk to
> naked MSRs and we have done it by now because this thing was there and
> well, sure, why not use it.
>
> But poking at MSRs is dangerous and we need proper abstraction. And we
> should work towards that instead perpetuating wrong use.

Again, I support your direction.  I'm not trying to work against it,
I'm trying to tell you that you are just scratching the surface
and there will be more steps to complete the task -- because
there are more MSRs.  Your new API doesn't exist on the
installed base, and so the old /dev/msr method must be available
to the installed base.  Sure, in the future, when the new API
is available, we can update the utility to use it going forward.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 17:21     ` Len Brown
@ 2016-06-15 17:39       ` Borislav Petkov
  2016-06-15 17:42         ` Len Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 17:39 UTC (permalink / raw)
  To: Len Brown
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 01:21:01PM -0400, Len Brown wrote:
> The API -- the name -- must be clear about what MSR it talks to.

Didn't I say that?!

"Surely we can make the new interface work too - perhaps add a new sysfs
file for the new thing."

> I suggest that the name exactly match the name of the actual MSR,
> because you are about to need a 2nd one with a name so close
> that it will otherwise be ambiguous.

So from looking at IA32_HWP_REQUEST, it sounds like you'd need a whole
new dir:

hwp_req
|-> package_control
|-> energy_perf_pref
|-> ...
|-> min_perf

and both interfaces will be visible only when the CPUID bit is set.

I.e., for the energy_policy_pref_hint, I'm checking X86_FEATURE_EPB and
I'm sure the HWP ones have CPUID bits too.

> Again, I support your direction.  I'm not trying to work against it,
> I'm trying to tell you that you are just scratching the surface
> and there will be more steps to complete the task -- because
> there are more MSRs.

Oh, I know that. That's why this is a first RFC, to poke at people.

Also, I'm looking at the WRMSR use cases first. The reading can be taken
care of later.

> Your new API doesn't exist on the installed base, and so the old
> /dev/msr method must be available to the installed base. Sure, in the
> future, when the new API is available, we can update the utility to
> use it going forward.

Well, since the utility is part of tools/, it goes with the kernel
version. Just like perf.

Or are you dying to be able to use new tool on old kernels?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 17:39       ` Borislav Petkov
@ 2016-06-15 17:42         ` Len Brown
  2016-06-15 17:52           ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Len Brown @ 2016-06-15 17:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

> Well, since the utility is part of tools/, it goes with the kernel
> version. Just like perf.
>
> Or are you dying to be able to use new tool on old kernels?

Yes.  Enterprise customers use old kernels.  They need
at least a snapshot of the utility that works w/o the API update.
-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 17:42         ` Len Brown
@ 2016-06-15 17:52           ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2016-06-15 17:52 UTC (permalink / raw)
  To: Len Brown
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, Linux PM list

On Wed, Jun 15, 2016 at 01:42:41PM -0400, Len Brown wrote:
> Yes.  Enterprise customers use old kernels.  They need
> at least a snapshot of the utility that works w/o the API update.

Enterprise customers get the proper tool versions too. At least we -
SUSE - make sure that is the case.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] x86: Move away from /dev/cpu/*/msr
  2016-06-15 10:00 [RFC PATCH] x86: Move away from /dev/cpu/*/msr Borislav Petkov
  2016-06-15 10:22 ` chenyu
  2016-06-15 16:41 ` Len Brown
@ 2016-06-20 21:12 ` Andi Kleen
  2 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2016-06-20 21:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Greg Kroah-Hartman, Thomas Renninger, Kan Liang,
	Peter Zijlstra (Intel),
	Rafael J. Wysocki, Len Brown, linux-pm

Borislav Petkov <bp@alien8.de> writes:

> Comments are, as always, appreciated.

Seems like a waste of kernel code to me. The MSR interface works
perfectly fine. There are potentially hundreds of useful MSRs,
are you going to add new sysfs for each of them?

Even the more obscure ones can be very useful for debugging
and monitoring.

Most MSRs are model specific so this would end up with tons
of switch (x86_model) ... which are always difficult to maintain
and need to be updated all the time when new CPUs come out.

This will likely generate a really large ongoing number of patches,
and to solve what problem exactly?

The whole thing doesn't make any sense to me.

It's just a waste of code, maintainer time, patch review
capacity, which all could be far more usefully employed
to do something that actually solves real problems.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-06-20 21:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-15 10:00 [RFC PATCH] x86: Move away from /dev/cpu/*/msr Borislav Petkov
2016-06-15 10:22 ` chenyu
2016-06-15 10:36   ` Borislav Petkov
2016-06-15 14:00     ` Borislav Petkov
2016-06-15 16:41 ` Len Brown
2016-06-15 16:56   ` Borislav Petkov
2016-06-15 17:21     ` Len Brown
2016-06-15 17:39       ` Borislav Petkov
2016-06-15 17:42         ` Len Brown
2016-06-15 17:52           ` Borislav Petkov
2016-06-20 21:12 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).