* [PATCH V1 0/3] perf/x86/intel: Add Branch Monitoring support
@ 2017-11-11 21:20 Megha Dey
  2017-11-11 21:20 ` [PATCH V1 1/3] x86/cpu/intel: Add Cannonlake to Intel family Megha Dey
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Megha Dey @ 2017-11-11 21:20 UTC (permalink / raw)
  To: x86, linux-kernel, linux-doc
  Cc: tglx, mingo, hpa, andriy.shevchenko, kstewart, yu-cheng.yu,
	len.brown, gregkh, peterz, acme, alexander.shishkin, jolsa,
	namhyung, vikas.shivappa, pombredanne, me, bp,
	grzegorz.andrejczuk, tony.luck, corbet, ravi.v.shankar,
	megha.dey, Megha Dey

This patchset adds support for Intel's branch monitoring feature. This
feature uses heuristics to detect the occurrence of a ROP (Return Oriented
Programming) or ROP-like (JOP: Jump Oriented Programming) attack. These
heuristics are based on certain performance monitoring statistics,
measured dynamically over a short, configurable window period. ROP is a
malware technique in which the attacker compromises a return address held
on the stack to redirect execution to a different, attacker-chosen
instruction.

Currently, only the Cannonlake family of Intel processors supports this
feature. This feature is enabled by CONFIG_PERF_EVENTS_INTEL_BM.

Once the kernel is compiled with CONFIG_PERF_EVENTS_INTEL_BM=y on a
Cannonlake system, the following perf events are added which can be viewed
with perf list:
  intel_bm/branch-misp/                              [Kernel PMU event]
  intel_bm/call-ret/                                 [Kernel PMU event]
  intel_bm/far-branch/                               [Kernel PMU event]
  intel_bm/indirect-branch-misp/                     [Kernel PMU event]
  intel_bm/ret-misp/                                 [Kernel PMU event]
  intel_bm/rets/                                     [Kernel PMU event]

A perf-based kernel driver is used to monitor the occurrence of one of
the 6 branch monitoring events. There are 2 counters, each of which can
select one of these events for evaluation over a specified instruction
window size (0 to 1023). For each counter, a threshold value (0 to 127)
can be configured to set the point at which an interrupt is generated.
Each task can monitor a maximum of 2 events at any given time.

Apart from the kernel driver, this patchset adds the CPUID of Cannonlake
processors to the Intel family list and adds Documentation/x86/intel_bm.txt
with some information about Intel Branch Monitoring.

Changes V0->V1:
1. Used the 'is_sampling_event' function
2. Added support to monitor 2 events for every task
3. Corrected typos
4. Added a lock to prevent race conditions in concurrent perf_event_open()s
5. Got rid of start()/stop() and added their functionality in add()/del()
6. Removed the read() callback as it was not doing anything
7. Removed code for sampling events as we do not support sampling
8. Added an 'id' member to hw_perf_event::intel_bm to track which counter
   the event is using
9. Moved MSR accesses to the add()/del() callbacks

Megha Dey (3):
  x86/cpu/intel: Add Cannonlake to Intel family
  perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  x86, bm: Add documentation on Intel Branch Monitoring

 Documentation/x86/intel_bm.txt      | 216 +++++++++++++
 arch/x86/events/Kconfig             |  10 +
 arch/x86/events/intel/Makefile      |   2 +
 arch/x86/events/intel/bm.c          | 618 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/intel-family.h |   2 +
 arch/x86/include/asm/msr-index.h    |   5 +
 arch/x86/include/asm/processor.h    |   4 +
 include/linux/perf_event.h          |   9 +-
 kernel/events/core.c                |  16 +
 9 files changed, 881 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/x86/intel_bm.txt
 create mode 100644 arch/x86/events/intel/bm.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH V1 1/3] x86/cpu/intel: Add Cannonlake to Intel family
  2017-11-11 21:20 [PATCH V1 0/3] perf/x86/intel: Add Branch Monitoring support Megha Dey
@ 2017-11-11 21:20 ` Megha Dey
  2017-11-11 21:20 ` [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support Megha Dey
  2017-11-11 21:20 ` [PATCH V1 3/3] x86, bm: Add documentation on Intel Branch Monitoring Megha Dey
  2 siblings, 0 replies; 9+ messages in thread
From: Megha Dey @ 2017-11-11 21:20 UTC (permalink / raw)
  To: x86, linux-kernel, linux-doc
  Cc: tglx, mingo, hpa, andriy.shevchenko, kstewart, yu-cheng.yu,
	len.brown, gregkh, peterz, acme, alexander.shishkin, jolsa,
	namhyung, vikas.shivappa, pombredanne, me, bp,
	grzegorz.andrejczuk, tony.luck, corbet, ravi.v.shankar,
	megha.dey, Megha Dey

Add the CPUID of Cannonlake (CNL) processors to the Intel family list.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 arch/x86/include/asm/intel-family.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h
index 35a6bc4..056bd41 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -65,6 +65,8 @@
 #define INTEL_FAM6_ATOM_DENVERTON	0x5F /* Goldmont Microserver */
 #define INTEL_FAM6_ATOM_GEMINI_LAKE	0x7A
 
+#define INTEL_FAM6_CANNONLAKE_MOBILE	0x66
+
 /* Xeon Phi */
 
 #define INTEL_FAM6_XEON_PHI_KNL		0x57 /* Knights Landing */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  2017-11-11 21:20 [PATCH V1 0/3] perf/x86/intel: Add Branch Monitoring support Megha Dey
  2017-11-11 21:20 ` [PATCH V1 1/3] x86/cpu/intel: Add Cannonlake to Intel family Megha Dey
@ 2017-11-11 21:20 ` Megha Dey
  2017-11-13  9:00   ` Peter Zijlstra
  2017-11-11 21:20 ` [PATCH V1 3/3] x86, bm: Add documentation on Intel Branch Monitoring Megha Dey
  2 siblings, 1 reply; 9+ messages in thread
From: Megha Dey @ 2017-11-11 21:20 UTC (permalink / raw)
  To: x86, linux-kernel, linux-doc
  Cc: tglx, mingo, hpa, andriy.shevchenko, kstewart, yu-cheng.yu,
	len.brown, gregkh, peterz, acme, alexander.shishkin, jolsa,
	namhyung, vikas.shivappa, pombredanne, me, bp,
	grzegorz.andrejczuk, tony.luck, corbet, ravi.v.shankar,
	megha.dey, Megha Dey

Currently, the Cannonlake family of Intel processors supports the
branch monitoring feature. Intel's branch monitoring feature uses
heuristics to detect the occurrence of a ROP (Return Oriented
Programming) attack.

A perf-based kernel driver is used to monitor the occurrence of one of
the 6 branch monitoring events. There are 2 counters, each of which can
select one of these events for evaluation over a specified instruction
window size (0 to 1023). For each counter, a threshold value (0 to 127)
can be configured to set the point at which a ROP detection event action
is taken (determined by user-space). Each task can monitor a maximum of
2 events at any given time.

Apart from window_size (global) and threshold (per-counter), various sysfs
entries are provided for the user to configure: guest_disable, lbr_freeze,
window_cnt_sel, cnt_and_mode (all global) and mispred_evt_cnt (per-counter).
For all events belonging to the same task, the global parameters are
shared.
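
For example, the window size and threshold could be adjusted before opening
the events (illustrative values only; the sysfs paths are the attributes
exposed by this driver):

echo 512 > /sys/devices/intel_bm/window_size
echo 100 > /sys/devices/intel_bm/threshold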

Everytime a task is scheduled out, we save current window and count
associated with the event being monitored. When the task is scheduled
next, we start counting from previous count associated with this event.
Thus, a full context switch in this case is not necessary.

To monitor a user space application for ROP related events, the perf
command line can be used as follows:

perf stat -e <name of event> <application to be monitored>

e.g. for the following test program (test.c) with threshold = 100
(echo 100 > /sys/devices/intel_bm/threshold):

void func(void)
{
        return;
}

void main(void)
{
        int i;

        for (i = 0; i < 128; i++) {
                func();
        }

        return;
}

perf stat -e intel_bm/rets/ ./test

 Performance counter stats for './test':

                 1      intel_bm/rets/

       0.104705937 seconds time elapsed

perf returns the number of branch monitoring interrupts that occurred
during the execution of the user-space application.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Yu-Cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/events/Kconfig          |  10 +
 arch/x86/events/intel/Makefile   |   2 +
 arch/x86/events/intel/bm.c       | 618 +++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/msr-index.h |   5 +
 arch/x86/include/asm/processor.h |   4 +
 include/linux/perf_event.h       |   9 +-
 kernel/events/core.c             |  16 +
 7 files changed, 663 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/events/intel/bm.c

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index 9a7a144..40903ca 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -9,6 +9,16 @@ config PERF_EVENTS_INTEL_UNCORE
 	Include support for Intel uncore performance events. These are
 	available on NehalemEX and more modern processors.
 
+config PERF_EVENTS_INTEL_BM
+	bool "Intel Branch Monitoring support"
+	depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
+	---help---
+	  Include support for Intel Branch Monitoring. This feature utilizes
+	  heuristics for detecting ROP (Return Oriented Programming)-like
+	  attacks. These heuristics are based on certain performance
+	  monitoring statistics, measured dynamically over a short,
+	  configurable window period.
+
 config PERF_EVENTS_INTEL_RAPL
 	tristate "Intel rapl performance events"
 	depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3468b0c..14235ec 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -2,6 +2,8 @@
 obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= lbr.o p4.o p6.o pt.o
+obj-$(CONFIG_PERF_EVENTS_INTEL_BM)	+= intel-bm-perf.o
+intel-bm-perf-objs			:= bm.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)	+= intel-rapl-perf.o
 intel-rapl-perf-objs			:= rapl.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= intel-uncore.o
diff --git a/arch/x86/events/intel/bm.c b/arch/x86/events/intel/bm.c
new file mode 100644
index 0000000..923c6e9
--- /dev/null
+++ b/arch/x86/events/intel/bm.c
@@ -0,0 +1,618 @@
+/*
+ * Support for Intel branch monitoring counters
+ *
+ * Intel branch monitoring MSRs are specified in the Intel® 64 and IA-32
+ * Software Developer’s Manual Volume 4 section 2.16.2 (October 2017)
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * Contact Information:
+ * Megha Dey <megha.dey@linux.intel.com>
+ * Yu-Cheng Yu <yu-cheng.yu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/module.h>
+#include <linux/perf_event.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/err.h>
+#include <asm/apic.h>
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+#include <asm/nmi.h>
+
+#include "../perf_event.h"
+
+/* Branch Monitoring default and mask values */
+#define BM_MAX_WINDOW_SIZE		0x3ff
+#define BM_MAX_THRESHOLD		0x7f
+#define BM_MAX_EVENTS			6
+#define BM_WINDOW_SIZE_SHIFT		8
+#define BM_THRESHOLD_SHIFT		8
+#define BM_EVENT_TYPE_SHIFT		1
+#define BM_GUEST_DISABLE_SHIFT		3
+#define BM_LBR_FREEZE_SHIFT		2
+#define BM_WINDOW_CNT_SEL_SHIFT		24
+#define BM_CNT_AND_MODE_SHIFT		26
+#define BM_MISPRED_EVT_CNT_SHIFT	15
+#define BM_ENABLE			0x3
+
+static unsigned int bm_window_size = BM_MAX_WINDOW_SIZE;
+static unsigned int bm_guest_disable;
+static unsigned int bm_lbr_freeze;
+static unsigned int bm_window_cnt_sel;
+static unsigned int bm_cnt_and_mode;
+
+static unsigned int bm_threshold = BM_MAX_THRESHOLD;
+static unsigned int bm_mispred_evt_cnt;
+
+/* Branch monitoring counter owners */
+static struct perf_event **bm_counter_owner;
+
+static struct pmu intel_bm_pmu;
+
+DEFINE_PER_CPU(int, bm_unmask_apic);
+
+union bm_detect_status {
+	struct {
+		__u8 event: 1;
+		__u8 lbrs_valid: 1;
+		__u8 reserved0: 6;
+		__u8 ctrl_hit: 4;
+		__u8 reserved1: 4;
+		__u16 count_window: 10;
+		__u8 reserved2: 6;
+		__u8 count[4];
+	} __packed;
+	uint64_t raw;
+};
+
+static int intel_bm_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	struct perf_event *event;
+	union bm_detect_status stat;
+	int i;
+	unsigned long x;
+
+	rdmsrl(BR_DETECT_STATUS_MSR, stat.raw);
+
+	if (stat.event) {
+		wrmsrl(BR_DETECT_STATUS_MSR, 0);
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+		/*
+		 * Issue wake-up to corresponding polling event
+		 */
+		x = stat.ctrl_hit;
+		for_each_set_bit(i, &x, BM_MAX_COUNTERS) {
+			event = current->thread.bm_counter_owner[i];
+			local64_inc(&event->count);
+			atomic_set(&event->hw.bm_poll, POLLIN);
+			event->pending_wakeup = 1;
+			irq_work_queue(&event->pending);
+		}
+		return NMI_HANDLED;
+	}
+	return NMI_DONE;
+}
+
+/*
+ * Unmask the NMI bit of the local APIC the first time a task is scheduled
+ * on a particular CPU.
+ */
+static void intel_bm_unmask_nmi(void)
+{
+	this_cpu_write(bm_unmask_apic, 0);
+
+	if (!(this_cpu_read(bm_unmask_apic))) {
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+		this_cpu_inc(bm_unmask_apic);
+	}
+}
+
+static int intel_bm_event_add(struct perf_event *event, int mode)
+{
+	union bm_detect_status cur_stat, prev_stat;
+
+	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
+
+	prev_stat.raw = local64_read(&event->hw.prev_count);
+
+	/*
+	 * Start counting from previous count associated with this event
+	 */
+	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
+
+	cur_stat.count[event->hw.id] = prev_stat.count[event->hw.id];
+	cur_stat.count_window = prev_stat.count_window;
+	wrmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
+
+	wrmsrl(BR_DETECT_CONTROL_MSR, event->hw.bm_ctrl);
+
+	intel_bm_unmask_nmi();
+
+	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
+		(event->hw.bm_counter_conf | 1));
+
+	return 0;
+}
+
+static void intel_bm_event_update(struct perf_event *event)
+{
+	union bm_detect_status cur_stat;
+
+	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
+	local64_set(&event->hw.prev_count, (uint64_t)cur_stat.raw);
+}
+
+static void intel_bm_event_del(struct perf_event *event, int flags)
+{
+	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
+
+	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
+		(event->hw.bm_counter_conf & ~1));
+
+	intel_bm_event_update(event);
+}
+
+static void intel_bm_event_destroy(struct perf_event *event)
+{
+	bm_counter_owner[event->hw.id] = NULL;
+}
+
+static DEFINE_MUTEX(bm_counter_mutex);
+
+static int intel_bm_event_init(struct perf_event *event)
+{
+	u64 cfg;
+	int counter_to_use = -1, i;
+
+	local64_set(&event->hw.prev_count, 0);
+
+	if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	/*
+	 * Type is assigned by kernel, see /sys/devices/intel_bm/type
+	 */
+	if (event->attr.type != intel_bm_pmu.type)
+		return -ENOENT;
+
+	/*
+	 * Only per-task events are supported. It does not make sense to
+	 * monitor all tasks for an ROP attack. This could generate a lot
+	 * of false positives.
+	 */
+	if (event->hw.target == NULL)
+		return -EINVAL;
+
+	/* No sampling supported */
+	if (is_sampling_event(event))
+		return -EINVAL;
+
+	event->event_caps |= PERF_EV_CAP_BM;
+	/*
+	 * cfg contains one of the 6 possible Branch Monitoring events
+	 */
+	cfg = event->attr.config;
+	if (cfg >= BM_MAX_EVENTS)
+		return -EINVAL;
+
+	/*
+	 * Find a hardware counter for the target task
+	 */
+	bm_counter_owner = event->hw.target->thread.bm_counter_owner;
+
+	mutex_lock(&bm_counter_mutex);
+	for (i = 0; i < BM_MAX_COUNTERS; i++) {
+		if (bm_counter_owner[i] == NULL) {
+			counter_to_use = i;
+			bm_counter_owner[i] = event;
+			break;
+		}
+	}
+	mutex_unlock(&bm_counter_mutex);
+
+	if (counter_to_use == -1)
+		return -EBUSY;
+
+	event->hw.bm_ctrl = (bm_window_size << BM_WINDOW_SIZE_SHIFT) |
+			    (bm_guest_disable << BM_GUEST_DISABLE_SHIFT) |
+			    (bm_lbr_freeze << BM_LBR_FREEZE_SHIFT) |
+			    (bm_window_cnt_sel << BM_WINDOW_CNT_SEL_SHIFT) |
+			    (bm_cnt_and_mode << BM_CNT_AND_MODE_SHIFT) |
+								BM_ENABLE;
+	event->hw.bm_counter_conf = (bm_threshold << BM_THRESHOLD_SHIFT) |
+			(bm_mispred_evt_cnt << BM_MISPRED_EVT_CNT_SHIFT) |
+					(cfg << BM_EVENT_TYPE_SHIFT);
+
+	event->hw.id = counter_to_use;
+	local64_set(&event->count, 0);
+
+	event->destroy = intel_bm_event_destroy;
+
+	return 0;
+}
+
+EVENT_ATTR_STR(rets, rets, "event=0x0");
+EVENT_ATTR_STR(call-ret, call_ret, "event=0x01");
+EVENT_ATTR_STR(ret-misp, ret_misp, "event=0x02");
+EVENT_ATTR_STR(branch-misp, branch_mispredict, "event=0x03");
+EVENT_ATTR_STR(indirect-branch-misp, indirect_branch_mispredict, "event=0x04");
+EVENT_ATTR_STR(far-branch, far_branch, "event=0x05");
+
+static struct attribute *intel_bm_events_attr[] = {
+	EVENT_PTR(rets),
+	EVENT_PTR(call_ret),
+	EVENT_PTR(ret_misp),
+	EVENT_PTR(branch_mispredict),
+	EVENT_PTR(indirect_branch_mispredict),
+	EVENT_PTR(far_branch),
+	NULL,
+};
+
+static struct attribute_group intel_bm_events_group = {
+	.name = "events",
+	.attrs = intel_bm_events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+static struct attribute *intel_bm_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group intel_bm_format_group = {
+	.name = "format",
+	.attrs = intel_bm_formats_attr,
+};
+
+/*
+ * User can configure the BM MSRs using the corresponding sysfs entries
+ */
+
+static ssize_t
+threshold_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_threshold);
+
+	return rv;
+}
+
+static ssize_t
+threshold_store(struct device *dev,
+			   struct device_attribute *attr,
+			   const char *buf, size_t count)
+{
+	unsigned int threshold;
+	int err;
+
+	err = kstrtouint(buf, 0, &threshold);
+	if (err)
+		return err;
+
+	if (threshold > BM_MAX_THRESHOLD) {
+		pr_err("invalid threshold value\n");
+		return -EINVAL;
+	}
+
+	bm_threshold = threshold;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(threshold);
+
+static ssize_t
+window_size_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_window_size);
+
+	return rv;
+}
+
+static ssize_t
+window_size_store(struct device *dev,
+			    struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	unsigned int window_size;
+	int err;
+
+	err = kstrtouint(buf, 0, &window_size);
+	if (err)
+		return err;
+
+	if (window_size > BM_MAX_WINDOW_SIZE) {
+		pr_err("illegal window size\n");
+		return -EINVAL;
+	}
+
+	bm_window_size = window_size;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(window_size);
+
+static ssize_t
+lbr_freeze_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_lbr_freeze);
+
+	return rv;
+}
+
+static ssize_t
+lbr_freeze_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned int lbr_freeze;
+	int err;
+
+	err = kstrtouint(buf, 0, &lbr_freeze);
+	if (err)
+		return err;
+
+	if (lbr_freeze > 1) {
+		pr_err("lbr freeze can only be 0 or 1\n");
+		return -EINVAL;
+	}
+
+	bm_lbr_freeze = lbr_freeze;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(lbr_freeze);
+
+static ssize_t
+guest_disable_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_guest_disable);
+
+	return rv;
+}
+
+static ssize_t
+guest_disable_store(struct device *dev,
+			struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	unsigned int guest_disable;
+	int err;
+
+	err = kstrtouint(buf, 0, &guest_disable);
+	if (err)
+		return err;
+
+	if (guest_disable > 1) {
+		pr_err("guest disable can only be 0 or 1\n");
+		return -EINVAL;
+	}
+
+	bm_guest_disable = guest_disable;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(guest_disable);
+
+static ssize_t
+window_cnt_sel_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_window_cnt_sel);
+
+	return rv;
+}
+
+static ssize_t
+window_cnt_sel_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned int window_cnt_sel;
+	int err;
+
+	err = kstrtouint(buf, 0, &window_cnt_sel);
+	if (err)
+		return err;
+
+	if (window_cnt_sel > 3) {
+		pr_err("invalid window_cnt_sel value\n");
+		return -EINVAL;
+	}
+
+	bm_window_cnt_sel = window_cnt_sel;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(window_cnt_sel);
+
+static ssize_t
+cnt_and_mode_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_cnt_and_mode);
+
+	return rv;
+}
+
+static ssize_t
+cnt_and_mode_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned int cnt_and_mode;
+	int err;
+
+	err = kstrtouint(buf, 0, &cnt_and_mode);
+	if (err)
+		return err;
+
+	if (cnt_and_mode > 1) {
+		pr_err("invalid cnt_and_mode value\n");
+		return -EINVAL;
+	}
+
+	bm_cnt_and_mode = cnt_and_mode;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(cnt_and_mode);
+
+static ssize_t
+mispred_evt_cnt_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", bm_mispred_evt_cnt);
+
+	return rv;
+}
+
+static ssize_t
+mispred_evt_cnt_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	unsigned int mispred_evt_cnt;
+	int err;
+
+	err = kstrtouint(buf, 0, &mispred_evt_cnt);
+	if (err)
+		return err;
+
+	if (mispred_evt_cnt > 1) {
+		pr_err("invalid mispred_evt_cnt value\n");
+		return -EINVAL;
+	}
+
+	bm_mispred_evt_cnt = mispred_evt_cnt;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(mispred_evt_cnt);
+
+static ssize_t
+num_counters_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	ssize_t rv;
+
+	rv = sprintf(buf, "%d\n", BM_MAX_COUNTERS);
+
+	return rv;
+}
+
+static DEVICE_ATTR_RO(num_counters);
+
+static struct attribute *intel_bm_attrs[] = {
+	&dev_attr_window_size.attr,
+	&dev_attr_threshold.attr,
+	&dev_attr_lbr_freeze.attr,
+	&dev_attr_guest_disable.attr,
+	&dev_attr_window_cnt_sel.attr,
+	&dev_attr_cnt_and_mode.attr,
+	&dev_attr_mispred_evt_cnt.attr,
+	&dev_attr_num_counters.attr,
+	NULL,
+};
+
+static const struct attribute_group intel_bm_group = {
+	.attrs = intel_bm_attrs,
+};
+
+static const struct attribute_group *intel_bm_attr_groups[] = {
+	&intel_bm_events_group,
+	&intel_bm_format_group,
+	&intel_bm_group,
+	NULL,
+};
+
+static struct pmu intel_bm_pmu = {
+	.task_ctx_nr     = perf_sw_context,
+	.attr_groups     = intel_bm_attr_groups,
+	.event_init      = intel_bm_event_init,
+	.add             = intel_bm_event_add,
+	.del             = intel_bm_event_del,
+};
+
+#define X86_BM_MODEL_MATCH(model)       \
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY }
+
+static const struct x86_cpu_id bm_cpu_match[] __initconst = {
+	X86_BM_MODEL_MATCH(INTEL_FAM6_CANNONLAKE_MOBILE),
+	{},
+};
+
+MODULE_DEVICE_TABLE(x86cpu, bm_cpu_match);
+
+static __init int intel_bm_init(void)
+{
+	int ret, err;
+
+	/*
+	 * Only CNL supports branch monitoring
+	 */
+	if (!(x86_match_cpu(bm_cpu_match)))
+		return -ENODEV;
+
+	err = register_nmi_handler(NMI_LOCAL, intel_bm_event_nmi_handler,
+								0, "BM");
+	if (err)
+		return err;
+
+	ret = perf_pmu_register(&intel_bm_pmu, "intel_bm", -1);
+	if (ret) {
+		pr_err("Intel BM perf registration failed: %d\n", ret);
+		unregister_nmi_handler(NMI_LOCAL, "BM");
+		return ret;
+	}
+
+	return 0;
+}
+module_init(intel_bm_init);
+
+static void __exit intel_bm_exit(void)
+{
+	perf_pmu_unregister(&intel_bm_pmu);
+	unregister_nmi_handler(NMI_LOCAL, "BM");
+}
+module_exit(intel_bm_exit);
+
+MODULE_LICENSE("GPL");
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ab02261..f72de49 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -294,6 +294,11 @@
 /* Alternative perfctr range with full access. */
 #define MSR_IA32_PMC0			0x000004c1
 
+/* Intel Branch Monitoring MSRs */
+#define BR_DETECT_CONTROL_MSR		0x00000350
+#define BR_DETECT_STATUS_MSR		0x00000351
+#define BR_DETECT_COUNTER_CONFIG_BASE	0x00000354
+
 /* AMD64 MSRs. Not complete. See the architecture manual for a more
    complete list. */
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bdac19a..abaa22d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -42,6 +42,8 @@
 #define NET_IP_ALIGN	0
 
 #define HBP_NUM 4
+
+#define BM_MAX_COUNTERS 2
 /*
  * Default implementation of macro that returns current
  * instruction pointer ("program counter").
@@ -458,6 +460,8 @@ struct thread_struct {
 
 	/* Save middle states of ptrace breakpoints */
 	struct perf_event	*ptrace_bps[HBP_NUM];
+	/* Branch Monitoring counter owners */
+	struct perf_event	*bm_counter_owner[BM_MAX_COUNTERS];
 	/* Debug status used for traps, single steps, etc... */
 	unsigned long           debugreg6;
 	/* Keep track of the exact dr7 value set by the user */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..60dd625 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -168,6 +168,13 @@ struct hw_perf_event {
 	 */
 	struct task_struct		*target;
 
+	struct { /* intel_bm */
+			u64 bm_ctrl;
+			u64 bm_counter_conf;
+			atomic_t bm_poll;
+			u64 id;
+	};
+
 	/*
 	 * PMU would store hardware filter configuration
 	 * here.
@@ -191,7 +198,6 @@ struct hw_perf_event {
 	 * local64_cmpxchg() such that pmu::read() can be called nested.
 	 */
 	local64_t			prev_count;
-
 	/*
 	 * The period to start the next sample with.
 	 */
@@ -512,6 +518,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_BM			BIT(2)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 10cdb9c..61227ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4598,6 +4598,15 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, &event->waitq, wait);
 
+	/*
+	 * Branch monitoring events do not support ring buffer.
+	 * For users polling on these events, return appropriate poll state.
+	 */
+	if (event->event_caps & PERF_EV_CAP_BM) {
+		events = atomic_xchg(&event->hw.bm_poll, 0);
+		return events;
+	}
+
 	if (is_event_hup(event))
 		return events;
 
@@ -5500,6 +5509,13 @@ void perf_event_wakeup(struct perf_event *event)
 {
 	ring_buffer_wakeup(event);
 
+	/*
+	 * Since branch monitoring events do not have ring buffer, they
+	 * have to be woken up separately
+	 */
+	if (event->event_caps & PERF_EV_CAP_BM)
+		wake_up_all(&event->waitq);
+
 	if (event->pending_kill) {
 		kill_fasync(perf_event_fasync(event), SIGIO, event->pending_kill);
 		event->pending_kill = 0;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH V1 3/3] x86, bm: Add documentation on Intel Branch Monitoring
  2017-11-11 21:20 [PATCH V1 0/3] perf/x86/intel: Add Branch Monitoring support Megha Dey
  2017-11-11 21:20 ` [PATCH V1 1/3] x86/cpu/intel: Add Cannonlake to Intel family Megha Dey
  2017-11-11 21:20 ` [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support Megha Dey
@ 2017-11-11 21:20 ` Megha Dey
  2017-11-12  1:56   ` Randy Dunlap
  2 siblings, 1 reply; 9+ messages in thread
From: Megha Dey @ 2017-11-11 21:20 UTC (permalink / raw)
  To: x86, linux-kernel, linux-doc
  Cc: tglx, mingo, hpa, andriy.shevchenko, kstewart, yu-cheng.yu,
	len.brown, gregkh, peterz, acme, alexander.shishkin, jolsa,
	namhyung, vikas.shivappa, pombredanne, me, bp,
	grzegorz.andrejczuk, tony.luck, corbet, ravi.v.shankar,
	megha.dey, Megha Dey

This patch adds the Documentation/x86/intel_bm.txt file with some
information about Intel Branch monitoring.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 Documentation/x86/intel_bm.txt | 216 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 216 insertions(+)
 create mode 100644 Documentation/x86/intel_bm.txt

diff --git a/Documentation/x86/intel_bm.txt b/Documentation/x86/intel_bm.txt
new file mode 100644
index 0000000..25b7177
--- /dev/null
+++ b/Documentation/x86/intel_bm.txt
@@ -0,0 +1,216 @@
+Intel(R) Branch Monitoring
+
+Copyright (C) 2017 Intel Corporation
+
+Megha Dey <megha.dey@intel.com>
+Yu-Cheng Yu <yu-cheng.yu@intel.com>
+
+I. Overview
+===========
+
+The Cannonlake family of Intel processors supports the branch monitoring
+feature. This feature uses heuristics to detect the occurrence of a ROP
+(Return Oriented Programming) or ROP-like (JOP: Jump Oriented Programming)
+attack. These heuristics are based on certain performance monitoring
+statistics, measured dynamically over a short, configurable window period.
+ROP is a malware technique in which the attacker compromises a return
+address held on the stack to redirect execution to a different,
+attacker-chosen instruction.
+
+Support for branch monitoring has been added via Linux kernel perf event
+infrastructure. This feature is enabled by CONFIG_PERF_EVENTS_INTEL_BM.
+
+Once the kernel is compiled with CONFIG_PERF_EVENTS_INTEL_BM=y on a
+Cannonlake system, the following perf events are added which can be viewed
+with perf list:
+  intel_bm/branch-misp/                              [Kernel PMU event]
+  intel_bm/call-ret/                                 [Kernel PMU event]
+  intel_bm/far-branch/                               [Kernel PMU event]
+  intel_bm/indirect-branch-misp/                     [Kernel PMU event]
+  intel_bm/ret-misp/                                 [Kernel PMU event]
+  intel_bm/rets/                                     [Kernel PMU event]
+
+II. Hardware details
+====================
+
+The MSRs associated with branch monitoring are as follows:
+
+1. BR_DETECT_CTRL : Branch Monitoring Global Control
+   Used for enabling and configuring the global capability
+
+2. BR_DETECT_STATUS : Branch Monitoring Global Status
+   Used by the SW handler for determining detection status
+
+3. BR_DETECT_COUNTER_CONFIG_i : Branch Monitoring Counter Configuration
+   Per-CPU branch monitoring counter configuration
+
+There are 2 8-bit counters, each of which can select one of the following
+6 events:
+
+1. RET instructions: Counts the number of near return instructions retired
+
+2. CALL-RET instructions: Counts the difference between the number of near
+   return and call instructions retired
+
+3. RET mispredicts: Mispredicted return instructions retired
+
+4. Branch (all) mispredicts: Counts the number of mispredicted branches
+
+5. Indirect branch mispredicts: Counts the number of mispredicted indirect
+   near branch instructions. Includes indirect near jump/call instructions
+
+6. Far branch instructions: Counts the number of far branches retired
+
+Branch Monitoring hardware utilizes various existing performance-related
+counter events. Of the 6 events above, only call-ret is newly implemented.
+
+The events are evaluated over a specified 10-bit instruction window size
+(0 to 1023). For each counter, a threshold value (0 to 127) can be
+configured to set a point at which an interrupt is generated and a
+detection event action is taken (determined by user-space). This can take
+the form of signaling an interrupt and/or freezing the state of the last
+branch record information.
+
+The event counters are reset after every 'window size' instructions by the
+hardware.
+
+The feature is for user mode (privilege level > 0) operation only, which is
+the known malware security threat target environment. While in supervisor
+mode, this heuristic detection counter activity is suspended. This behavior
+(user mode) is independent of root vs. non-root with respect to
+virtualization technology execution.
+
+III. Software Implementation
+============================
+
+A perf-based kernel driver has been used to monitor the occurrence of
+one of the 6 branch monitoring events.
+
+If a branch monitoring interrupt is generated, the interrupt bit is set;
+it is cleared by the interrupt handler and the event counters are reset.
+
+The entire system can monitor a maximum of 2 events at any given time.
+These events can belong to the same or different tasks.
+
+Every time a task is scheduled out, we save the current window and count
+associated with the event being monitored. When the task is scheduled next,
+we start counting from the previous count associated with this event. Thus,
+not all of the branch monitoring MSRs need to be saved and restored across
+a context switch.
+
+The Branch Monitoring exception can be configured as a regular interrupt or
+an NMI. We chain an NMI handler after the PMU's handler because:
+1. It will not interfere with PMU events
+2. We only monitor user-mode events, so this will not delay the delivery of
+   branch monitoring events for user mode
+
+We monitor only per-task events. It does not make sense to monitor all
+tasks for an attack, as this could generate a lot of false positives.
+
+IV. User-configurable inputs
+============================
+
+Several sysfs entries are provided in /sys/devices/intel_bm/ to configure
+controls for the supported hardware heuristics.
+
+1. LBR freeze: /sys/devices/intel_bm/lbr_freeze
+   Possible values are 0 or 1. By default this is disabled (0). When enabled,
+   an LBR freeze is observed on threshold trip.
+
+2. Guest Disable: /sys/devices/intel_bm/guest_disable
+   Possible values are 0 or 1. By default it is 0. When set to ‘1’, branch
+   monitoring is disabled when operating in VMX non-root operation.
+
+3. Window size: /sys/devices/intel_bm/window_size
+   By default, the window size is 1023. It can take values from 0 to 1023.
+   This represents the number of instructions to be executed before the
+   event counters are reset.
+
+4. Window count select: /sys/devices/intel_bm/window_cnt_sel
+   Possible values are:
+   ‘00 = instructions retired
+   ‘01 = branches retired
+   ‘10 = returned instructions retired
+   ‘11 = indirect branch instructions retired
+   By default, it has a value of 0.
+
+5. Count and mode: /sys/devices/intel_bm/cnt_and_mode
+   Possible values are 0 or 1. By default it is 0. When set to ‘1’, the
+   overall event triggering condition is true only if both enabled
+   counters’ threshold conditions are true. When ‘0’, the threshold
+   tripping condition is true if either enabled counter’s threshold is
+   true. If a counter is not enabled, then it does not factor into the
+   ANDing logic.
+
+6. Threshold: /sys/devices/intel_bm/threshold
+   An unsigned value of 0 to 127 is supported. A counter threshold of 0
+   will result in a branch monitoring event being signaled after every
+   instruction. By default, it has a value of 127.
+
+7. Mispredict counting behaviour: /sys/devices/intel_bm/mispred_evt_cnt
+   Possible values are:
+   0 = mispredict events are counted within a window
+   1 = mispredict events are counted based on consecutive occurrences
+   By default, it has a value of 0.
+
+Threshold and mispredict counting behaviour are per-counter configurations,
+whereas the rest are global.
+
+V. Example usage
+================
+
+1. To monitor a user space application for branch monitoring events, the
+perf command line can be used as follows:
+
+perf stat -e intel_bm/rets/ ./test
+
+ Performance counter stats for './test':
+
+                 1      intel_bm/rets/
+
+       0.104705937 seconds time elapsed
+
+where test.c is:
+
+void func(void)
+{
+        return;
+}
+
+void main(void)
+{
+        int i;
+
+        for (i = 0; i < 128; i++) {
+                func();
+        }
+
+        return;
+}
+
+and threshold = 100 (echo 100 > /sys/devices/intel_bm/threshold)
+
+perf returns the number of branch monitoring interrupts that occurred while
+the user-space application was running.
+
+2. To monitor 2 events for a task:
+
+perf stat -e intel_bm/far-branch/,intel_bm/rets/ ./rets-128.bin
+
+ Performance counter stats for './rets-128.bin':
+
+                 0      intel_bm/far-branch/
+                 1      intel_bm/rets/
+
+       0.104057608 seconds time elapsed
+
+For the above example, the threshold and window size are shared.
+
+3. To monitor 2 events with different thresholds (same or different tasks):
+
+On terminal 1:
+echo <threshold1> > /sys/devices/intel_bm/threshold
+perf stat -e intel_bm/rets/ ./test.bin
+
+On terminal 2:
+echo <threshold2> > /sys/devices/intel_bm/threshold
+perf stat -e intel_bm/call-ret/ ./test.bin
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH V1 3/3] x86, bm: Add documentation on Intel Branch Monitoring
  2017-11-11 21:20 ` [PATCH V1 3/3] x86, bm: Add documentation on Intel Branch Monitoring Megha Dey
@ 2017-11-12  1:56   ` Randy Dunlap
  0 siblings, 0 replies; 9+ messages in thread
From: Randy Dunlap @ 2017-11-12  1:56 UTC (permalink / raw)
  To: Megha Dey, x86, linux-kernel, linux-doc
  Cc: tglx, mingo, hpa, andriy.shevchenko, kstewart, yu-cheng.yu,
	len.brown, gregkh, peterz, acme, alexander.shishkin, jolsa,
	namhyung, vikas.shivappa, pombredanne, me, bp,
	grzegorz.andrejczuk, tony.luck, corbet, ravi.v.shankar,
	megha.dey

On 11/11/17 13:20, Megha Dey wrote:
> This patch adds the Documentation/x86/intel_bm.txt file with some
> information about Intel Branch monitoring.

> +4. Window count select: /sys/devices/intel-bm/window_cnt_sel
> +   Possible values are:
> +   ‘00 = instructions retired
> +   ‘01 = branches retired
> +   ‘10 = returned instructions retired
> +   ‘11 = indirect branch instructions retired
> +   By default, it has a value of 0.

Hi,

Is the 'xx binary notation?  If so, it would be nice to say so..
or whatever it is.

thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  2017-11-11 21:20 ` [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support Megha Dey
@ 2017-11-13  9:00   ` Peter Zijlstra
  2017-11-13 19:22     ` Dey, Megha
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2017-11-13  9:00 UTC (permalink / raw)
  To: Megha Dey
  Cc: x86, linux-kernel, linux-doc, tglx, mingo, hpa,
	andriy.shevchenko, kstewart, yu-cheng.yu, len.brown, gregkh,
	acme, alexander.shishkin, jolsa, namhyung, vikas.shivappa,
	pombredanne, me, bp, grzegorz.andrejczuk, tony.luck, corbet,
	ravi.v.shankar, megha.dey

On Sat, Nov 11, 2017 at 01:20:05PM -0800, Megha Dey wrote:
> Currently, the cannonlake family of Intel processors support the
> branch monitoring feature. Intel's Branch monitoring feature is trying
> to utilize heuristics to detect the occurrence of an ROP (Return
> Oriented Programming) attack.
> 
> A perf-based kernel driver has been used to monitor the occurrence of
> one of the 6 branch monitoring events. There are 2 counters that each
> can select between one of these events for evaluation over a specified
> instruction window size (0 to 1023). For each counter, a threshold value
> (0 to 127) can be configured to set a point at which ROP detection event
> action is taken (determined by user-space). Each task can monitor
> a maximum of 2 events at any given time.
> 
> Apart from window_size(global) and threshold(per-counter), various sysfs
> entries are provided for the user to configure: guest_disable, lbr_freeze,
> window_cnt_sel, cnt_and_mode (all global) and mispred_evt_cnt(per-counter).
> For all events belonging to the same task, the global parameters are
> shared.

Is there any sensible documentation on this except the MSR listings?

> 
> Everytime a task is scheduled out, we save current window and count
> associated with the event being monitored. When the task is scheduled
> next, we start counting from previous count associated with this event.
> Thus, a full context switch in this case is not necessary.

What? That doesn't make any sense. The fact that we scheduled out and
then in again _is_ a full context switch no?

> 
> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
> Signed-off-by: Yu-Cheng Yu <yu-cheng.yu@intel.com>

That SoB chain is buggered.


> +static int intel_bm_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
> +{
> +	struct perf_event *event;
> +	union bm_detect_status stat;
> +	int i;
> +	unsigned long x;
> +
> +	rdmsrl(BR_DETECT_STATUS_MSR, stat.raw);

	if (!stat.event)
		return NMI_DONE;

saves you a whole bunch of indentation, no?

> +
> +	if (stat.event) {
> +		wrmsrl(BR_DETECT_STATUS_MSR, 0);
> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> +		/*
> +		 * Issue wake-up to corresponding polling event
> +		 */
> +		x = stat.ctrl_hit;
> +		for_each_set_bit(i, &x, BM_MAX_COUNTERS) {
> +			event = current->thread.bm_counter_owner[i];
> +			local64_inc(&event->count);
> +			atomic_set(&event->hw.bm_poll, POLLIN);
> +			event->pending_wakeup = 1;
> +			irq_work_queue(&event->pending);
> +		}
> +		return NMI_HANDLED;
> +	}
> +	return NMI_DONE;
> +}
> +
> +/*
> + * Unmask the NMI bit of the local APIC the first time task is scheduled
> + * on a particular CPU.
> + */
> +static void intel_bm_unmask_nmi(void)
> +{
> +	this_cpu_write(bm_unmask_apic, 0);
> +
> +	if (!(this_cpu_read(bm_unmask_apic))) {
> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> +		this_cpu_inc(bm_unmask_apic);
> +	}
> +}

What? Why?

> +static int intel_bm_event_add(struct perf_event *event, int mode)
> +{
> +	union bm_detect_status cur_stat, prev_stat;
> +
> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
> +
> +	prev_stat.raw = local64_read(&event->hw.prev_count);
> +
> +	/*
> +	 * Start counting from previous count associated with this event
> +	 */
> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
> +
> +	cur_stat.count[event->hw.id] = prev_stat.count[event->hw.id];
> +	cur_stat.count_window = prev_stat.count_window;
> +	wrmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);

Why are you writing back the value you read? Just to waste cycles?

> +	wrmsrl(BR_DETECT_CONTROL_MSR, event->hw.bm_ctrl);
> +
> +	intel_bm_unmask_nmi();
> +
> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
> +		(event->hw.bm_counter_conf | 1));

Please use a named construct for that enable bit.

> +
> +	return 0;
> +}
> +
> +static void intel_bm_event_update(struct perf_event *event)
> +{
> +	union bm_detect_status cur_stat;
> +
> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
> +	local64_set(&event->hw.prev_count, (uint64_t)cur_stat.raw);
> +}

That looks wrong... the general point of update functions is to update
the count, the above does not in fact do that.

> +
> +static void intel_bm_event_del(struct perf_event *event, int flags)
> +{
> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
> +
> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
> +		(event->hw.bm_counter_conf & ~1));

Either that EN bit is part of the bm_counter_conf, in which case you
didn't need to add it in _add(), or its not and you don't need to clear
it here. Make up your mind.

> +
> +	intel_bm_event_update(event);

Except of course, that does not in fact update...

> +}
> +
> +static void intel_bm_event_destroy(struct perf_event *event)
> +{
> +	bm_counter_owner[event->hw.id] = NULL;
> +}
> +
> +static DEFINE_MUTEX(bm_counter_mutex);
> +
> +static int intel_bm_event_init(struct perf_event *event)
> +{
> +	u64 cfg;
> +	int counter_to_use = -1, i;
> +
> +	local64_set(&event->hw.prev_count, 0);
> +

> +	/*
> +	 * Find a hardware counter for the target task
> +	 */
> +	bm_counter_owner = event->hw.target->thread.bm_counter_owner;
> +
> +	mutex_lock(&bm_counter_mutex);
> +	for (i = 0; i < BM_MAX_COUNTERS; i++) {
> +		if (bm_counter_owner[i] == NULL) {
> +			counter_to_use = i;
> +			bm_counter_owner[i] = event;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&bm_counter_mutex);
> +
> +	if (counter_to_use == -1)
> +		return -EBUSY;
> +
> +	event->hw.bm_ctrl = (bm_window_size << BM_WINDOW_SIZE_SHIFT) |
> +			    (bm_guest_disable << BM_GUEST_DISABLE_SHIFT) |
> +			    (bm_lbr_freeze << BM_LBR_FREEZE_SHIFT) |
> +			    (bm_window_cnt_sel << BM_WINDOW_CNT_SEL_SHIFT) |
> +			    (bm_cnt_and_mode << BM_CNT_AND_MODE_SHIFT) |
> +								BM_ENABLE;
> +	event->hw.bm_counter_conf = (bm_threshold << BM_THRESHOLD_SHIFT) |
> +			(bm_mispred_evt_cnt << BM_MISPRED_EVT_CNT_SHIFT) |
> +					(cfg << BM_EVENT_TYPE_SHIFT);
> +
> +	event->hw.id = counter_to_use;
> +	local64_set(&event->count, 0);

That is just a really ugly hack to work around:


> +static struct pmu intel_bm_pmu = {
> +	.task_ctx_nr     = perf_sw_context,


this. And you didn't bother to mention that atrocity in your Changelog.



NAK. 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  2017-11-13  9:00   ` Peter Zijlstra
@ 2017-11-13 19:22     ` Dey, Megha
  2017-11-13 20:25       ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Dey, Megha @ 2017-11-13 19:22 UTC (permalink / raw)
  To: Peter Zijlstra, Megha Dey
  Cc: x86, linux-kernel, linux-doc, tglx, mingo, hpa,
	andriy.shevchenko, kstewart, Yu, Yu-cheng, Brown, Len, gregkh,
	acme, alexander.shishkin, jolsa, namhyung, vikas.shivappa,
	pombredanne, me, bp, Andrejczuk, Grzegorz, Luck, Tony, corbet,
	Shankar, Ravi V



>-----Original Message-----
>From: Peter Zijlstra [mailto:peterz@infradead.org]
>Sent: Monday, November 13, 2017 1:00 AM
>To: Megha Dey <megha.dey@linux.intel.com>
>Cc: x86@kernel.org; linux-kernel@vger.kernel.org; linux-
>doc@vger.kernel.org; tglx@linutronix.de; mingo@redhat.com;
>hpa@zytor.com; andriy.shevchenko@linux.intel.com;
>kstewart@linuxfoundation.org; Yu, Yu-cheng <yu-cheng.yu@intel.com>;
>Brown, Len <len.brown@intel.com>; gregkh@linuxfoundation.org;
>acme@kernel.org; alexander.shishkin@linux.intel.com; jolsa@redhat.com;
>namhyung@kernel.org; vikas.shivappa@linux.intel.com;
>pombredanne@nexb.com; me@kylehuey.com; bp@suse.de; Andrejczuk,
>Grzegorz <grzegorz.andrejczuk@intel.com>; Luck, Tony
><tony.luck@intel.com>; corbet@lwn.net; Shankar, Ravi V
><ravi.v.shankar@intel.com>; Dey, Megha <megha.dey@intel.com>
>Subject: Re: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch
>Monitoring support
>
>On Sat, Nov 11, 2017 at 01:20:05PM -0800, Megha Dey wrote:
>> Currently, the cannonlake family of Intel processors support the
>> branch monitoring feature. Intel's Branch monitoring feature is trying
>> to utilize heuristics to detect the occurrence of an ROP (Return
>> Oriented Programming) attack.
>>
>> A perf-based kernel driver has been used to monitor the occurrence of
>> one of the 6 branch monitoring events. There are 2 counters that each
>> can select between one of these events for evaluation over a specified
>> instruction window size (0 to 1023). For each counter, a threshold
>> value
>> (0 to 127) can be configured to set a point at which ROP detection
>> event action is taken (determined by user-space). Each task can
>> monitor a maximum of 2 events at any given time.
>>
>> Apart from window_size(global) and threshold(per-counter), various
>> sysfs entries are provided for the user to configure: guest_disable,
>> lbr_freeze, window_cnt_sel, cnt_and_mode (all global) and
>mispred_evt_cnt(per-counter).
>> For all events belonging to the same task, the global parameters are
>> shared.
>
>Is there any sensible documentation on this except the MSR listings?

I have documented these sysfs entries in the next patch of this patch set:
Add Documentation for branch monitoring. Apart from that, unfortunately
there is only the MSR listings.

>
>>
>> Everytime a task is scheduled out, we save current window and count
>> associated with the event being monitored. When the task is scheduled
>> next, we start counting from previous count associated with this event.
>> Thus, a full context switch in this case is not necessary.
>
>What? That doesn't make any sense. The fact that we scheduled out and
>then in again _is_ a full context switch no?

What I meant was we need not save and restore all the branch monitoring 
MSRs during a context switch. I agree this is confusing. Will remove this line. 
>
>>
>> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
>> Signed-off-by: Yu-Cheng Yu <yu-cheng.yu@intel.com>
>
>That SoB chain is buggered.

Will change the ordering.
>
>
>> +static int intel_bm_event_nmi_handler(unsigned int cmd, struct
>> +pt_regs *regs) {
>> +	struct perf_event *event;
>> +	union bm_detect_status stat;
>> +	int i;
>> +	unsigned long x;
>> +
>> +	rdmsrl(BR_DETECT_STATUS_MSR, stat.raw);
>
>	if (!stat.event)
>		return NMI_DONE;
>
>saves you a whole bunch of indentation, no?

Yep, it does.
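
Something along these lines is what I have in mind for V2 (untested sketch;
same logic as before, only restructured with the early return):

static int intel_bm_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	union bm_detect_status stat;
	struct perf_event *event;
	unsigned long x;
	int i;

	rdmsrl(BR_DETECT_STATUS_MSR, stat.raw);
	if (!stat.event)
		return NMI_DONE;

	wrmsrl(BR_DETECT_STATUS_MSR, 0);
	apic_write(APIC_LVTPC, APIC_DM_NMI);

	/* Issue wake-up to the corresponding polling events */
	x = stat.ctrl_hit;
	for_each_set_bit(i, &x, BM_MAX_COUNTERS) {
		event = current->thread.bm_counter_owner[i];
		local64_inc(&event->count);
		atomic_set(&event->hw.bm_poll, POLLIN);
		event->pending_wakeup = 1;
		irq_work_queue(&event->pending);
	}
	return NMI_HANDLED;
}
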
>
>> +
>> +	if (stat.event) {
>> +		wrmsrl(BR_DETECT_STATUS_MSR, 0);
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +		/*
>> +		 * Issue wake-up to corresponding polling event
>> +		 */
>> +		x = stat.ctrl_hit;
>> +		for_each_set_bit(i, &x, BM_MAX_COUNTERS) {
>> +			event = current->thread.bm_counter_owner[i];
>> +			local64_inc(&event->count);
>> +			atomic_set(&event->hw.bm_poll, POLLIN);
>> +			event->pending_wakeup = 1;
>> +			irq_work_queue(&event->pending);
>> +		}
>> +		return NMI_HANDLED;
>> +	}
>> +	return NMI_DONE;
>> +}
>> +
>> +/*
>> + * Unmask the NMI bit of the local APIC the first time task is
>> +scheduled
>> + * on a particular CPU.
>> + */
>> +static void intel_bm_unmask_nmi(void) {
>> +	this_cpu_write(bm_unmask_apic, 0);
>> +
>> +	if (!(this_cpu_read(bm_unmask_apic))) {
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +		this_cpu_inc(bm_unmask_apic);
>> +	}
>> +}
>
>What? Why?

Normally, other drivers using perf create an event on every CPU (thereby
calling perf_init on every CPU), where this bit (APIC_DM_NMI) is explicitly
unmasked. In our driver, we do not do this (since we are worried only about
a particular task) and hence this bit is only unmasked on the local APIC of
the CPU where the perf event is initialized. As such, if the task is scheduled
out to some other CPU, the LVTPC entry there is still masked and hence would
stop the interrupt from reaching the processing core.

>
>> +static int intel_bm_event_add(struct perf_event *event, int mode) {
>> +	union bm_detect_status cur_stat, prev_stat;
>> +
>> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
>> +
>> +	prev_stat.raw = local64_read(&event->hw.prev_count);
>> +
>> +	/*
>> +	 * Start counting from previous count associated with this event
>> +	 */
>> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>> +
>> +	cur_stat.count[event->hw.id] = prev_stat.count[event->hw.id];
>> +	cur_stat.count_window = prev_stat.count_window;
>> +	wrmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>
>Why are you writing back the value you read? Just to waste cycles?

We only wanted to update the window count and event count associated with one of
the 2 event IDs. But you are right, we don't need to read the MSR; we will write to it directly.
>
>> +	wrmsrl(BR_DETECT_CONTROL_MSR, event->hw.bm_ctrl);
>> +
>> +	intel_bm_unmask_nmi();
>> +
>> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
>> +		(event->hw.bm_counter_conf | 1));
>
>Please use a named construct for that enable bit.

I will switch to using bitfields for this MSR, similar to the status register.
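
Roughly something like the below (sketch only; the field widths follow the
shifts used in this patch and still need to be checked against the SDM):

union bm_counter_config {
	struct {
		u64 enable:1;
		u64 event_type:7;
		u64 threshold:7;
		u64 mispred_evt_cnt:1;
		u64 reserved:48;
	};
	u64 raw;
};
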
>
>> +
>> +	return 0;
>> +}
>> +
>> +static void intel_bm_event_update(struct perf_event *event) {
>> +	union bm_detect_status cur_stat;
>> +
>> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>> +	local64_set(&event->hw.prev_count, (uint64_t)cur_stat.raw); }
>
>That looks wrong... the general point of update functions is to update the
>count, the above does not in fact do that.

Ok will remove this and add this functionality to event_del()
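
Folded together, del() would then look roughly like this (sketch):

static void intel_bm_event_del(struct perf_event *event, int flags)
{
	union bm_detect_status cur_stat;

	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);

	/* Stop the counter */
	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
	       (event->hw.bm_counter_conf & ~1));

	/* Save the current window and counts for the next add() */
	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
	local64_set(&event->hw.prev_count, cur_stat.raw);
}
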
>
>> +
>> +static void intel_bm_event_del(struct perf_event *event, int flags) {
>> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
>> +
>> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
>> +		(event->hw.bm_counter_conf & ~1));
>
>Either that EN bit is part of the bm_counter_conf, in which case you didn't
>need to add it in _add(), or its not and you don't need to clear it here. Make
>up your mind.

We are starting the count in add() (last bit of bm_counter_conf) and stopping it here.
I am not sure what you mean by this. Are you saying that we don't explicitly have to
enable or disable the count if the enable bit is a part of the MSR?
>
>> +
>> +	intel_bm_event_update(event);
>
>Except of course, that does not in fact update...

Will remove this.
>
>> +}
>> +
>> +static void intel_bm_event_destroy(struct perf_event *event) {
>> +	bm_counter_owner[event->hw.id] = NULL; }
>> +
>> +static DEFINE_MUTEX(bm_counter_mutex);
>> +
>> +static int intel_bm_event_init(struct perf_event *event) {
>> +	u64 cfg;
>> +	int counter_to_use = -1, i;
>> +
>> +	local64_set(&event->hw.prev_count, 0);
>> +
>
>> +	/*
>> +	 * Find a hardware counter for the target task
>> +	 */
>> +	bm_counter_owner = event->hw.target-
>>thread.bm_counter_owner;
>> +
>> +	mutex_lock(&bm_counter_mutex);
>> +	for (i = 0; i < BM_MAX_COUNTERS; i++) {
>> +		if (bm_counter_owner[i] == NULL) {
>> +			counter_to_use = i;
>> +			bm_counter_owner[i] = event;
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&bm_counter_mutex);
>> +
>> +	if (counter_to_use == -1)
>> +		return -EBUSY;
>> +
>> +	event->hw.bm_ctrl = (bm_window_size <<
>BM_WINDOW_SIZE_SHIFT) |
>> +			    (bm_guest_disable << BM_GUEST_DISABLE_SHIFT)
>|
>> +			    (bm_lbr_freeze << BM_LBR_FREEZE_SHIFT) |
>> +			    (bm_window_cnt_sel <<
>BM_WINDOW_CNT_SEL_SHIFT) |
>> +			    (bm_cnt_and_mode <<
>BM_CNT_AND_MODE_SHIFT) |
>> +								BM_ENABLE;
>> +	event->hw.bm_counter_conf = (bm_threshold <<
>BM_THRESHOLD_SHIFT) |
>> +			(bm_mispred_evt_cnt <<
>BM_MISPRED_EVT_CNT_SHIFT) |
>> +					(cfg << BM_EVENT_TYPE_SHIFT);
>> +
>> +	event->hw.id = counter_to_use;
>> +	local64_set(&event->count, 0);
>
>That is just a really ugly hack to work around:

Actually, this was intended to make sure event->count is always zero for a new
event. This wasn't intended to be a hack to work around the below, and hence it
was not mentioned in the changelog.
>
>
>> +static struct pmu intel_bm_pmu = {
>> +	.task_ctx_nr     = perf_sw_context,
>
>
>this. And you didn't bother to mention that atrocity in your Changelog.

We don't need this. Will remove it in the next patch series.
>
>
>
>NAK.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  2017-11-13 19:22     ` Dey, Megha
@ 2017-11-13 20:25       ` Thomas Gleixner
  2017-11-13 22:14         ` Megha Dey
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2017-11-13 20:25 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Peter Zijlstra, Megha Dey, x86, linux-kernel, linux-doc, mingo,
	hpa, andriy.shevchenko, kstewart, Yu, Yu-cheng, Brown, Len,
	gregkh, acme, alexander.shishkin, jolsa, namhyung,
	vikas.shivappa, pombredanne, me, bp, Andrejczuk, Grzegorz, Luck,
	Tony, corbet, Shankar, Ravi V

On Mon, 13 Nov 2017, Dey, Megha wrote:
> >-----Original Message-----
> >From: Peter Zijlstra [mailto:peterz@infradead.org]
> >Sent: Monday, November 13, 2017 1:00 AM
> >To: Megha Dey <megha.dey@linux.intel.com>
> >Cc: x86@kernel.org; linux-kernel@vger.kernel.org; linux-

Please fix your mail client so it does not add this complete useless
information to the reply.

> >On Sat, Nov 11, 2017 at 01:20:05PM -0800, Megha Dey wrote:
> >> +/*
> >> + * Unmask the NMI bit of the local APIC the first time task is
> >> +scheduled
> >> + * on a particular CPU.
> >> + */
> >> +static void intel_bm_unmask_nmi(void) {
> >> +	this_cpu_write(bm_unmask_apic, 0);
> >> +
> >> +	if (!(this_cpu_read(bm_unmask_apic))) {
> >> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> >> +		this_cpu_inc(bm_unmask_apic);
> >> +	}
> >> +}
> >
> >What? Why?
> 

> Normally, other drivers using perf create an event on every CPU (thereby
> calling perf_init on every CPU), where this bit(APIC_DM_NMI)is explicitly
> unmasked.  In our driver, we do not do this (since we are worried only
> about a particular task) and hence this bit is only disabled on the local
> APIC where the perf event is initialized.
>
> As such, if the task is scheduled out to some other CPU, this bit is set
> and hence would stop the interrupt from reaching the processing core.

Still that code makes no sense at all and certainly does not do what you
claim it does:

> >> +	this_cpu_write(bm_unmask_apic, 0);
> >> +
> >> +	if (!(this_cpu_read(bm_unmask_apic))) {

So first you write the per cpu variable to 0 and then you check whether it
is zero, which is pointless obviously.

> >
> >> +static int intel_bm_event_add(struct perf_event *event, int mode) {

Please move the opening bracket of the function into the next line. See the
kernel coding style documentation.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support
  2017-11-13 20:25       ` Thomas Gleixner
@ 2017-11-13 22:14         ` Megha Dey
  0 siblings, 0 replies; 9+ messages in thread
From: Megha Dey @ 2017-11-13 22:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, x86, linux-kernel, linux-doc, mingo, hpa,
	andriy.shevchenko, kstewart, Yu, Yu-cheng, Brown, Len, gregkh,
	acme, alexander.shishkin, jolsa, namhyung, vikas.shivappa,
	pombredanne, me, bp, Andrejczuk, Grzegorz, Luck, Tony, corbet,
	Shankar, Ravi V

On Mon, 2017-11-13 at 21:25 +0100, Thomas Gleixner wrote:
> On Mon, 13 Nov 2017, Dey, Megha wrote:
> > >-----Original Message-----
> > >From: Peter Zijlstra [mailto:peterz@infradead.org]
> > >Sent: Monday, November 13, 2017 1:00 AM
> > >To: Megha Dey <megha.dey@linux.intel.com>
> > >Cc: x86@kernel.org; linux-kernel@vger.kernel.org; linux-
> 
> Please fix your mail client so it does not add this complete useless
> information to the reply.

Will fix this.
> 
> > >On Sat, Nov 11, 2017 at 01:20:05PM -0800, Megha Dey wrote:
> > >> +/*
> > >> + * Unmask the NMI bit of the local APIC the first time task is
> > >> +scheduled
> > >> + * on a particular CPU.
> > >> + */
> > >> +static void intel_bm_unmask_nmi(void) {
> > >> +	this_cpu_write(bm_unmask_apic, 0);
> > >> +
> > >> +	if (!(this_cpu_read(bm_unmask_apic))) {
> > >> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> > >> +		this_cpu_inc(bm_unmask_apic);
> > >> +	}
> > >> +}
> > >
> > >What? Why?
> > 
> 
> > Normally, other drivers using perf create an event on every CPU (thereby
> > calling perf_init on every CPU), where this bit(APIC_DM_NMI)is explicitly
> > unmasked.  In our driver, we do not do this (since we are worried only
> > about a particular task) and hence this bit is only disabled on the local
> > APIC where the perf event is initialized.
> >
> > As such, if the task is scheduled out to some other CPU, this bit is set
> > and hence would stop the interrupt from reaching the processing core.
> 
> Still that code makes no sense at all and certainly does not do what you
> claim it does:
> 
> > >> +	this_cpu_write(bm_unmask_apic, 0);
> > >> +
> > >> +	if (!(this_cpu_read(bm_unmask_apic))) {
> 
> So first you write the per cpu variable to 0 and then you check whether it
> is zero, which is pointless obviously.

yes, I see your point. The logic is flawed. Will fix this.
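
The intent is to unmask only once per CPU; the fix would look roughly like
this (untested):

static void intel_bm_unmask_nmi(void)
{
	if (!this_cpu_read(bm_unmask_apic)) {
		apic_write(APIC_LVTPC, APIC_DM_NMI);
		this_cpu_inc(bm_unmask_apic);
	}
}
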
> 
> > >
> > >> +static int intel_bm_event_add(struct perf_event *event, int mode) {
> 
> Please move the opening bracket of the function into the next line. See the
> kernel coding style documentation.

Will do.
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread
