All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] perf/x86: Add perf text poke event
@ 2019-12-18 14:26 Adrian Hunter
  2019-12-18 14:26 ` [PATCH 1/3] " Adrian Hunter
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Adrian Hunter @ 2019-12-18 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Hi

Here are patches to add a text poke event to record changes to kernel text
(i.e. self-modifying code) in order to support tracers like Intel PT
decoding through jump labels.

The first patch makes the kernel change and the subsequent patches are
tools changes.

The second patch adds support for updating perf tools' data cache
with the changed bytes.

The last patch is an Intel PT specific tools change.

The patches are based on linux-next.


Changes since RFC:

  Dropped 'flags' from the new event.  The consensus seemed to be that text
  pokes should emply a scheme similar to x86's INT3 method instead.

  dropped tools patches already applied.


Example:

  For jump labels, the kernel needs
	CONFIG_JUMP_LABEL=y
  and also an easy to flip jump label is in sysctl_schedstats() which needs
	CONFIG_SCHEDSTATS=y
	CONFIG_PROC_SYSCTL=y

  Also note the 'sudo perf record' is put into the background which, as
  written, needs sudo credential caching (otherwise the background task
  will stop awaiting the sudo password), hence the 'sudo echo' to start.

Before:

  $ sudo echo
  $ sudo perf record -o perf.data.before --kcore -a -e intel_pt//k -m,64M &
  [1] 1640
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo bash -c 'echo 1 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  1
  $ sudo bash -c 'echo 0 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo kill 1640
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 16.635 MB perf.data.before ]
  $ perf script -i perf.data.before --itrace=e >/dev/null
  Warning:
  1946 instruction trace errors

After:

  $ sudo echo
  $ sudo perf record -o perf.data.after --kcore -a -e intel_pt//k -m,64M &
  [1] 1882
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo bash -c 'echo 1 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  1
  $ sudo bash -c 'echo 0 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo kill 1882
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 10.893 MB perf.data.after ]
  $ perf script -i perf.data.after --itrace=e
  $


Adrian Hunter (3):
      perf/x86: Add perf text poke event
      perf tools: Add support for PERF_RECORD_TEXT_POKE
      perf intel-pt: Add support for text poke events

 arch/x86/kernel/alternative.c             | 37 ++++++++++++-
 include/linux/perf_event.h                |  6 +++
 include/uapi/linux/perf_event.h           | 19 ++++++-
 kernel/events/core.c                      | 87 ++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h     | 19 ++++++-
 tools/perf/arch/x86/util/intel-pt.c       |  4 ++
 tools/perf/builtin-record.c               | 45 ++++++++++++++++
 tools/perf/lib/include/perf/event.h       |  8 +++
 tools/perf/util/event.c                   | 39 ++++++++++++++
 tools/perf/util/event.h                   |  5 ++
 tools/perf/util/evlist.h                  |  1 +
 tools/perf/util/intel-pt.c                | 71 +++++++++++++++++++++++++
 tools/perf/util/machine.c                 | 34 ++++++++++++
 tools/perf/util/machine.h                 |  3 ++
 tools/perf/util/perf_event_attr_fprintf.c |  1 +
 tools/perf/util/record.c                  | 10 ++++
 tools/perf/util/record.h                  |  1 +
 tools/perf/util/session.c                 | 20 +++++++
 tools/perf/util/tool.h                    |  3 +-
 19 files changed, 408 insertions(+), 5 deletions(-)


Regards
Adrian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] perf/x86: Add perf text poke event
  2019-12-18 14:26 [PATCH 0/3] perf/x86: Add perf text poke event Adrian Hunter
@ 2019-12-18 14:26 ` Adrian Hunter
  2019-12-19 13:09   ` Peter Zijlstra
  2019-12-18 14:26 ` [PATCH 2/3] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
  2019-12-18 14:26 ` [PATCH 3/3] perf intel-pt: Add support for text poke events Adrian Hunter
  2 siblings, 1 reply; 6+ messages in thread
From: Adrian Hunter @ 2019-12-18 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Record changes to kernel text (i.e. self-modifying code) in order to
support tracers like Intel PT decoding through jump labels.

A copy of the running kernel code is needed as a reference point
(e.g. from /proc/kcore). The text poke event records the old bytes
and the new bytes so that the event can be processed forwards or backwards.

In the case of Intel PT tracing, the executable code must be walked to
reconstruct the control flow. For x86 a jump label text poke begins:
  - write INT3 byte
  - IPI-SYNC
  - write instruction tail
At this point the actual control flow will be through the INT3 and handler
and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
for the INT3, so the flow can still be decoded. Subsequently:
  - emit RECORD_TEXT_POKE with the new instruction
  - IPI-SYNC
  - write first byte
  - IPI-SYNC
So before the text poke event timestamp, the decoder will see either the
old instruction flow or FUP/TIP of INT3. After the text poke event
timestamp, the decoder will see either the new instruction flow or FUP/TIP
of INT3. Thus decoders can use the timestamp as the point at which to
modify the executable code.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kernel/alternative.c   | 37 +++++++++++++-
 include/linux/perf_event.h      |  6 +++
 include/uapi/linux/perf_event.h | 19 ++++++-
 kernel/events/core.c            | 87 ++++++++++++++++++++++++++++++++-
 4 files changed, 146 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 30e86730655c..f932f8f4caf4 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3,6 +3,7 @@
 
 #include <linux/module.h>
 #include <linux/sched.h>
+#include <linux/perf_event.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/stringify.h>
@@ -946,6 +947,7 @@ struct text_poke_loc {
 	s32 rel32;
 	u8 opcode;
 	const u8 text[POKE_MAX_OPCODE_SIZE];
+	u8 old;
 };
 
 static struct bp_patching_desc {
@@ -1087,8 +1089,10 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	/*
 	 * First step: add a int3 trap to the address that will be patched.
 	 */
-	for (i = 0; i < nr_entries; i++)
+	for (i = 0; i < nr_entries; i++) {
+		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
 		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
+	}
 
 	text_poke_sync();
 
@@ -1096,14 +1100,45 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * Second step: update all but the first byte of the patched range.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
+		u8 old[POKE_MAX_OPCODE_SIZE] = { tp[i].old, };
 		int len = text_opcode_size(tp[i].opcode);
 
 		if (len - INT3_INSN_SIZE > 0) {
+			memcpy(old + INT3_INSN_SIZE,
+			       text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
+			       len - INT3_INSN_SIZE);
 			text_poke(text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
 				  (const char *)tp[i].text + INT3_INSN_SIZE,
 				  len - INT3_INSN_SIZE);
 			do_sync++;
 		}
+
+		/*
+		 * Emit a perf event to record the text poke, primarily to
+		 * support Intel PT decoding which must walk the executable code
+		 * to reconstruct the trace. The flow up to here is:
+		 *   - write INT3 byte
+		 *   - IPI-SYNC
+		 *   - write instruction tail
+		 * At this point the actual control flow will be through the
+		 * INT3 and handler and not hit the old or new instruction.
+		 * Intel PT outputs FUP/TIP packets for the INT3, so the flow
+		 * can still be decoded. Subsequently:
+		 *   - emit RECORD_TEXT_POKE with the new instruction
+		 *   - IPI-SYNC
+		 *   - write first byte
+		 *   - IPI-SYNC
+		 * So before the text poke event timestamp, the decoder will see
+		 * either the old instruction flow or FUP/TIP of INT3. After the
+		 * text poke event timestamp, the decoder will see either the
+		 * new instruction flow or FUP/TIP of INT3. Thus decoders can
+		 * use the timestamp as the point at which to modify the
+		 * executable code.
+		 * The old instruction is recorded so that the event can be
+		 * processed forwards or backwards.
+		 */
+		perf_event_text_poke((unsigned long)text_poke_addr(&tp[i]), old,
+				     tp[i].text, len);
 	}
 
 	if (do_sync) {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6d4c22aee384..2cfc127e3534 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1212,6 +1212,8 @@ extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_namespaces(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
+extern void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+				 const void *new_bytes, size_t len);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -1462,6 +1464,10 @@ static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline void perf_event_text_poke(unsigned long addr,
+					const void *old_bytes,
+					const void *new_bytes,
+					size_t len)			{ }
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 377d794d3105..d8422ebf128c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -377,7 +377,8 @@ struct perf_event_attr {
 				ksymbol        :  1, /* include ksymbol events */
 				bpf_event      :  1, /* include bpf events */
 				aux_output     :  1, /* generate AUX records instead of events */
-				__reserved_1   : 32;
+				text_poke      :  1, /* include text poke events */
+				__reserved_1   : 31;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1006,6 +1007,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_BPF_EVENT			= 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed immediately
+	 * by the new bytes.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				addr;
+	 *	u16				len;
+	 *	u8				bytes[];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c0b11b234290..4996f2bbf9b6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
 static atomic_t nr_switch_events __read_mostly;
 static atomic_t nr_ksymbol_events __read_mostly;
 static atomic_t nr_bpf_events __read_mostly;
+static atomic_t nr_text_poke_events __read_mostly;
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -4397,7 +4398,7 @@ static bool is_sb_event(struct perf_event *event)
 	if (attr->mmap || attr->mmap_data || attr->mmap2 ||
 	    attr->comm || attr->comm_exec ||
 	    attr->task || attr->ksymbol ||
-	    attr->context_switch ||
+	    attr->context_switch || attr->text_poke ||
 	    attr->bpf_event)
 		return true;
 	return false;
@@ -4471,6 +4472,8 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_dec(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_dec(&nr_text_poke_events);
 
 	if (dec) {
 		if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -8309,6 +8312,86 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_text_poke_event {
+	const void		*old_bytes;
+	const void		*new_bytes;
+	size_t			pad;
+	u16			len;
+
+	struct {
+		struct perf_event_header	header;
+
+		u64				addr;
+	} event_id;
+};
+
+static int perf_event_text_poke_match(struct perf_event *event)
+{
+	return event->attr.text_poke;
+}
+
+// TODO: go back and check peter's comments
+static void perf_event_text_poke_output(struct perf_event *event, void *data)
+{
+	struct perf_text_poke_event *text_poke_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	u64 padding = 0;
+	int ret;
+
+	if (!perf_event_text_poke_match(event))
+		return;
+
+	perf_event_header__init_id(&text_poke_event->event_id.header, &sample, event);
+
+	ret = perf_output_begin(&handle, event, text_poke_event->event_id.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, text_poke_event->event_id);
+	perf_output_put(&handle, text_poke_event->len);
+
+	__output_copy(&handle, text_poke_event->old_bytes, text_poke_event->len);
+	__output_copy(&handle, text_poke_event->new_bytes, text_poke_event->len);
+
+	if (text_poke_event->pad)
+		__output_copy(&handle, &padding, text_poke_event->pad);
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
+void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+			  const void *new_bytes, size_t len)
+{
+	struct perf_text_poke_event text_poke_event;
+	size_t tot, pad;
+
+	if (!atomic_read(&nr_text_poke_events))
+		return;
+
+	tot  = sizeof(text_poke_event.len) + (len * 2);
+	pad  = ALIGN(tot, sizeof(u64)) - tot;
+
+	text_poke_event = (struct perf_text_poke_event){
+		.old_bytes    = old_bytes,
+		.new_bytes    = new_bytes,
+		.pad          = pad,
+		.len          = len,
+		.event_id  = {
+			.header = {
+				.type = PERF_RECORD_TEXT_POKE,
+				.misc = PERF_RECORD_MISC_KERNEL,
+				.size = sizeof(text_poke_event.event_id) + tot + pad,
+			},
+			.addr = addr,
+		},
+	};
+
+	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
+}
+
 void perf_event_itrace_started(struct perf_event *event)
 {
 	event->attach_state |= PERF_ATTACH_ITRACE;
@@ -10623,6 +10706,8 @@ static void account_event(struct perf_event *event)
 		atomic_inc(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_inc(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_inc(&nr_text_poke_events);
 
 	if (inc) {
 		/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] perf tools: Add support for PERF_RECORD_TEXT_POKE
  2019-12-18 14:26 [PATCH 0/3] perf/x86: Add perf text poke event Adrian Hunter
  2019-12-18 14:26 ` [PATCH 1/3] " Adrian Hunter
@ 2019-12-18 14:26 ` Adrian Hunter
  2019-12-18 14:26 ` [PATCH 3/3] perf intel-pt: Add support for text poke events Adrian Hunter
  2 siblings, 0 replies; 6+ messages in thread
From: Adrian Hunter @ 2019-12-18 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Add processing for PERF_RECORD_TEXT_POKE events. When a text poke event is
processed, then the kernel dso data cache is updated with the poked bytes.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/include/uapi/linux/perf_event.h     | 19 +++++++++-
 tools/perf/builtin-record.c               | 45 +++++++++++++++++++++++
 tools/perf/lib/include/perf/event.h       |  8 ++++
 tools/perf/util/event.c                   | 39 ++++++++++++++++++++
 tools/perf/util/event.h                   |  5 +++
 tools/perf/util/evlist.h                  |  1 +
 tools/perf/util/machine.c                 | 34 +++++++++++++++++
 tools/perf/util/machine.h                 |  3 ++
 tools/perf/util/perf_event_attr_fprintf.c |  1 +
 tools/perf/util/record.c                  | 10 +++++
 tools/perf/util/record.h                  |  1 +
 tools/perf/util/session.c                 | 20 ++++++++++
 tools/perf/util/tool.h                    |  3 +-
 13 files changed, 187 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 377d794d3105..d8422ebf128c 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -377,7 +377,8 @@ struct perf_event_attr {
 				ksymbol        :  1, /* include ksymbol events */
 				bpf_event      :  1, /* include bpf events */
 				aux_output     :  1, /* generate AUX records instead of events */
-				__reserved_1   : 32;
+				text_poke      :  1, /* include text poke events */
+				__reserved_1   : 31;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1006,6 +1007,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_BPF_EVENT			= 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed immediately
+	 * by the new bytes.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				addr;
+	 *	u16				len;
+	 *	u8				bytes[];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4c301466101b..d56c3cbe97ae 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -723,6 +723,43 @@ static int record__auxtrace_init(struct record *rec __maybe_unused)
 
 #endif
 
+static int record__config_text_poke(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	int err;
+
+	/* Nothing to do if text poke is already configured */
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.text_poke)
+			return 0;
+	}
+
+	err = parse_events(evlist, "dummy:u", NULL);
+	if (err)
+		return err;
+
+	evsel = evlist__last(evlist);
+
+	evsel->core.attr.freq = 0;
+	evsel->core.attr.sample_period = 1;
+	evsel->core.attr.text_poke = 1;
+
+	evsel->core.system_wide = true;
+	evsel->no_aux_samples = true;
+	evsel->immediate = true;
+
+	/* Text poke must be collected on all CPUs */
+	perf_cpu_map__put(evsel->core.own_cpus);
+	evsel->core.own_cpus = perf_cpu_map__new(NULL);
+	perf_cpu_map__put(evsel->core.cpus);
+	evsel->core.cpus = perf_cpu_map__get(evsel->core.own_cpus);
+
+	perf_evsel__set_sample_bit(evsel, TIME);
+	perf_evsel__reset_sample_bit(evsel, BRANCH_STACK);
+
+	return 0;
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2610,6 +2647,14 @@ int cmd_record(int argc, const char **argv)
 	if (rec->opts.full_auxtrace)
 		rec->buildid_all = true;
 
+	if (rec->opts.text_poke) {
+		err = record__config_text_poke(rec->evlist);
+		if (err) {
+			pr_err("record__config_text_poke failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/lib/include/perf/event.h b/tools/perf/lib/include/perf/event.h
index 18106899cb4e..ecd2087e122e 100644
--- a/tools/perf/lib/include/perf/event.h
+++ b/tools/perf/lib/include/perf/event.h
@@ -105,6 +105,13 @@ struct perf_record_bpf_event {
 	__u8			 tag[BPF_TAG_SIZE];  // prog tag
 };
 
+struct perf_record_text_poke_event {
+	struct perf_event_header header;
+	__u64			addr;
+	__u16			len;
+	__u8			bytes[];
+};
+
 struct perf_record_sample {
 	struct perf_event_header header;
 	__u64			 array[];
@@ -360,6 +367,7 @@ union perf_event {
 	struct perf_record_sample		sample;
 	struct perf_record_bpf_event		bpf;
 	struct perf_record_ksymbol		ksymbol;
+	struct perf_record_text_poke_event	text_poke;
 	struct perf_record_header_attr		attr;
 	struct perf_record_event_update		event_update;
 	struct perf_record_header_event_type	event_type;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index c5447ff516a2..b0d934cdda31 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -31,6 +31,7 @@
 #include "stat.h"
 #include "session.h"
 #include "bpf-event.h"
+#include "print_binary.h"
 #include "tool.h"
 #include "../perf.h"
 
@@ -54,6 +55,7 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_NAMESPACES]		= "NAMESPACES",
 	[PERF_RECORD_KSYMBOL]			= "KSYMBOL",
 	[PERF_RECORD_BPF_EVENT]			= "BPF_EVENT",
+	[PERF_RECORD_TEXT_POKE]			= "TEXT_POKE",
 	[PERF_RECORD_HEADER_ATTR]		= "ATTR",
 	[PERF_RECORD_HEADER_EVENT_TYPE]		= "EVENT_TYPE",
 	[PERF_RECORD_HEADER_TRACING_DATA]	= "TRACING_DATA",
@@ -252,6 +254,14 @@ int perf_event__process_bpf(struct perf_tool *tool __maybe_unused,
 	return machine__process_bpf(machine, event, sample);
 }
 
+int perf_event__process_text_poke(struct perf_tool *tool __maybe_unused,
+				  union perf_event *event,
+				  struct perf_sample *sample,
+				  struct machine *machine)
+{
+	return machine__process_text_poke(machine, event, sample);
+}
+
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
 {
 	return fprintf(fp, " %d/%d: [%#" PRI_lx64 "(%#" PRI_lx64 ") @ %#" PRI_lx64 "]: %c %s\n",
@@ -398,6 +408,32 @@ size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp)
 		       event->bpf.type, event->bpf.flags, event->bpf.id);
 }
 
+static int text_poke_printer(enum binary_printer_ops op, unsigned int val,
+			     void *extra __maybe_unused, FILE *fp)
+{
+	if (op == BINARY_PRINT_NUM_DATA)
+		return fprintf(fp, " %02x", val);
+	if (op == BINARY_PRINT_LINE_END)
+		return fprintf(fp, "\n");
+	return 0;
+}
+
+size_t perf_event__fprintf_text_poke(union perf_event *event, FILE *fp)
+{
+	size_t ret = fprintf(fp, " addr %#" PRI_lx64 " len %u\n",
+			     event->text_poke.addr,
+			     event->text_poke.len);
+
+	ret += fprintf(fp, "     old bytes:");
+	ret += binary__fprintf(event->text_poke.bytes, event->text_poke.len, 16,
+			       text_poke_printer, NULL, fp);
+	ret += fprintf(fp, "     new bytes:");
+	ret += binary__fprintf(event->text_poke.bytes + event->text_poke.len,
+			       event->text_poke.len, 16, text_poke_printer,
+			       NULL, fp);
+	return ret;
+}
+
 size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 {
 	size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -439,6 +475,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 	case PERF_RECORD_BPF_EVENT:
 		ret += perf_event__fprintf_bpf(event, fp);
 		break;
+	case PERF_RECORD_TEXT_POKE:
+		ret += perf_event__fprintf_text_poke(event, fp);
+		break;
 	default:
 		ret += fprintf(fp, "\n");
 	}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 85223159737c..cbe98e22527f 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -345,6 +345,10 @@ int perf_event__process_bpf(struct perf_tool *tool,
 			    union perf_event *event,
 			    struct perf_sample *sample,
 			    struct machine *machine);
+int perf_event__process_text_poke(struct perf_tool *tool,
+				  union perf_event *event,
+				  struct perf_sample *sample,
+				  struct machine *machine);
 int perf_event__process(struct perf_tool *tool,
 			union perf_event *event,
 			struct perf_sample *sample,
@@ -378,6 +382,7 @@ size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_text_poke(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf(union perf_event *event, FILE *fp);
 
 int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index f5bd5c386df1..93e9b35e53b0 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -175,6 +175,7 @@ struct callchain_param;
 void perf_evlist__set_id_pos(struct evlist *evlist);
 bool perf_can_sample_identifier(void);
 bool perf_can_record_switch_events(void);
+bool perf_can_record_text_poke_events(void);
 bool perf_can_record_cpu_wide(void);
 bool perf_can_aux_sample(void);
 void perf_evlist__config(struct evlist *evlist, struct record_opts *opts,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index c8c5410315e8..3449af7f6b81 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -770,6 +770,38 @@ int machine__process_ksymbol(struct machine *machine __maybe_unused,
 	return machine__process_ksymbol_register(machine, event, sample);
 }
 
+int machine__process_text_poke(struct machine *machine, union perf_event *event,
+			       struct perf_sample *sample __maybe_unused)
+{
+	struct map *map = maps__find(&machine->kmaps, event->text_poke.addr);
+
+	if (dump_trace)
+		perf_event__fprintf_text_poke(event, stdout);
+
+	if (map) {
+		u8 *new_bytes = event->text_poke.bytes + event->text_poke.len;
+		int ret;
+
+		/*
+		 * Kernel maps might be changed when loading symbols so loading
+		 * must be done prior to using kernel maps.
+		 */
+		map__load(map);
+		ret = dso__data_write_cache_addr(map->dso, map, machine,
+						 event->text_poke.addr,
+						 new_bytes,
+						 event->text_poke.len);
+		if (ret != event->text_poke.len)
+			pr_err("Failed to write kernel text poke at %#" PRI_lx64 "\n",
+			       event->text_poke.addr);
+	} else {
+		pr_err("Failed to find kernel text poke address map for %#" PRI_lx64 "\n",
+		       event->text_poke.addr);
+	}
+
+	return 0;
+}
+
 static struct map *machine__addnew_module_map(struct machine *machine, u64 start,
 					      const char *filename)
 {
@@ -1901,6 +1933,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
 		ret = machine__process_ksymbol(machine, event, sample); break;
 	case PERF_RECORD_BPF_EVENT:
 		ret = machine__process_bpf(machine, event, sample); break;
+	case PERF_RECORD_TEXT_POKE:
+		ret = machine__process_text_poke(machine, event, sample); break;
 	default:
 		ret = -1;
 		break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index be0a930eca89..0bb086acb7ed 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -135,6 +135,9 @@ int machine__process_mmap2_event(struct machine *machine, union perf_event *even
 int machine__process_ksymbol(struct machine *machine,
 			     union perf_event *event,
 			     struct perf_sample *sample);
+int machine__process_text_poke(struct machine *machine,
+			       union perf_event *event,
+			       struct perf_sample *sample);
 int machine__process_event(struct machine *machine, union perf_event *event,
 				struct perf_sample *sample);
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 651203126c71..4c6c700335a6 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -144,6 +144,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(aux_watermark, p_unsigned);
 	PRINT_ATTRf(sample_max_stack, p_unsigned);
 	PRINT_ATTRf(aux_sample_size, p_unsigned);
+	PRINT_ATTRf(text_poke, p_unsigned);
 
 	return ret;
 }
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index 7def66168503..207ba2a65008 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -97,6 +97,11 @@ static void perf_probe_context_switch(struct evsel *evsel)
 	evsel->core.attr.context_switch = 1;
 }
 
+static void perf_probe_text_poke(struct evsel *evsel)
+{
+	evsel->core.attr.text_poke = 1;
+}
+
 bool perf_can_sample_identifier(void)
 {
 	return perf_probe_api(perf_probe_sample_identifier);
@@ -112,6 +117,11 @@ bool perf_can_record_switch_events(void)
 	return perf_probe_api(perf_probe_context_switch);
 }
 
+bool perf_can_record_text_poke_events(void)
+{
+	return perf_probe_api(perf_probe_text_poke);
+}
+
 bool perf_can_record_cpu_wide(void)
 {
 	struct perf_event_attr attr = {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 5421fd2ad383..127b4ca11fd3 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -46,6 +46,7 @@ struct record_opts {
 	bool	      sample_id;
 	bool	      no_bpf_event;
 	bool	      kcore;
+	bool	      text_poke;
 	unsigned int  freq;
 	unsigned int  mmap_pages;
 	unsigned int  auxtrace_mmap_pages;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index d0d7d25b23e3..c0cd316d8e28 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -489,6 +489,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
 		tool->ksymbol = perf_event__process_ksymbol;
 	if (tool->bpf == NULL)
 		tool->bpf = perf_event__process_bpf;
+	if (tool->text_poke == NULL)
+		tool->text_poke = perf_event__process_text_poke;
 	if (tool->read == NULL)
 		tool->read = process_event_sample_stub;
 	if (tool->throttle == NULL)
@@ -658,6 +660,21 @@ static void perf_event__switch_swap(union perf_event *event, bool sample_id_all)
 		swap_sample_id_all(event, &event->context_switch + 1);
 }
 
+static void perf_event__text_poke_swap(union perf_event *event, bool sample_id_all)
+{
+	event->text_poke.addr  = bswap_64(event->text_poke.addr);
+	event->text_poke.len   = bswap_16(event->text_poke.len);
+
+	if (sample_id_all) {
+		size_t len = sizeof(event->text_poke.len) +
+			     event->text_poke.len * 2;
+		void *data = &event->text_poke.len;
+
+		data += PERF_ALIGN(len, sizeof(u64));
+		swap_sample_id_all(event, data);
+	}
+}
+
 static void perf_event__throttle_swap(union perf_event *event,
 				      bool sample_id_all)
 {
@@ -931,6 +948,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_SWITCH]		  = perf_event__switch_swap,
 	[PERF_RECORD_SWITCH_CPU_WIDE]	  = perf_event__switch_swap,
 	[PERF_RECORD_NAMESPACES]	  = perf_event__namespaces_swap,
+	[PERF_RECORD_TEXT_POKE]		  = perf_event__text_poke_swap,
 	[PERF_RECORD_HEADER_ATTR]	  = perf_event__hdr_attr_swap,
 	[PERF_RECORD_HEADER_EVENT_TYPE]	  = perf_event__event_type_swap,
 	[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1470,6 +1488,8 @@ static int machines__deliver_event(struct machines *machines,
 		return tool->ksymbol(tool, event, sample, machine);
 	case PERF_RECORD_BPF_EVENT:
 		return tool->bpf(tool, event, sample, machine);
+	case PERF_RECORD_TEXT_POKE:
+		return tool->text_poke(tool, event, sample, machine);
 	default:
 		++evlist->stats.nr_unknown_events;
 		return -1;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 2abbf668b8de..006182923bbe 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -56,7 +56,8 @@ struct perf_tool {
 			throttle,
 			unthrottle,
 			ksymbol,
-			bpf;
+			bpf,
+			text_poke;
 
 	event_attr_op	attr;
 	event_attr_op	event_update;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] perf intel-pt: Add support for text poke events
  2019-12-18 14:26 [PATCH 0/3] perf/x86: Add perf text poke event Adrian Hunter
  2019-12-18 14:26 ` [PATCH 1/3] " Adrian Hunter
  2019-12-18 14:26 ` [PATCH 2/3] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
@ 2019-12-18 14:26 ` Adrian Hunter
  2 siblings, 0 replies; 6+ messages in thread
From: Adrian Hunter @ 2019-12-18 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Select text poke events when available and the kernel is being traced.
Process text poke events to invalidate entries in Intel PT's instruction
cache.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/arch/x86/util/intel-pt.c |  4 ++
 tools/perf/util/intel-pt.c          | 71 +++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index 20df442fdf36..ec75445ef7cf 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -827,6 +827,10 @@ static int intel_pt_recording_options(struct auxtrace_record *itr,
 		}
 	}
 
+	if (have_timing_info && !intel_pt_evsel->core.attr.exclude_kernel &&
+	    perf_can_record_text_poke_events() && perf_can_record_cpu_wide())
+		opts->text_poke = true;
+
 	if (intel_pt_evsel) {
 		/*
 		 * To obtain the auxtrace buffer file descriptor, the auxtrace
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 33cf8928cf05..9603217e437e 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -514,6 +514,17 @@ intel_pt_cache_lookup(struct dso *dso, struct machine *machine, u64 offset)
 	return auxtrace_cache__lookup(dso->auxtrace_cache, offset);
 }
 
+static void intel_pt_cache_invalidate(struct dso *dso, struct machine *machine,
+				      u64 offset)
+{
+	struct auxtrace_cache *c = intel_pt_cache(dso, machine);
+
+	if (!c)
+		return;
+
+	auxtrace_cache__remove(dso->auxtrace_cache, offset);
+}
+
 static inline u8 intel_pt_cpumode(struct intel_pt *pt, uint64_t ip)
 {
 	return ip >= pt->kernel_start ?
@@ -2592,6 +2603,63 @@ static int intel_pt_process_itrace_start(struct intel_pt *pt,
 					event->itrace_start.tid);
 }
 
+static int intel_pt_find_map(struct thread *thread, u8 cpumode, u64 addr,
+			     struct addr_location *al)
+{
+	if (!al->map || addr < al->map->start || addr >= al->map->end) {
+		if (!thread__find_map(thread, cpumode, addr, al) || !al->map->dso)
+			return -1;
+	}
+
+	return 0;
+}
+
+/* Invalidate all instruction cache entries that overlap the text poke */
+static int intel_pt_text_poke(struct intel_pt *pt, union perf_event *event)
+{
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	u64 addr = event->text_poke.addr + event->text_poke.len - 1;
+	/* Assume text poke begins in a basic block no more than 4096 bytes */
+	int cnt = 4096 + event->text_poke.len;
+	struct thread *thread = pt->unknown_thread;
+	struct addr_location al = { .map = NULL };
+	struct machine *machine = pt->machine;
+	struct intel_pt_cache_entry *e;
+	u64 offset;
+
+	for (; cnt; cnt--, addr--) {
+		if (intel_pt_find_map(thread, cpumode, addr, &al)) {
+			if (addr < event->text_poke.addr)
+				return 0;
+			pr_err("%s: failed to find map at %#"PRIx64"\n",
+			       __func__, addr);
+			return -EINVAL;
+		}
+
+		offset = al.map->map_ip(al.map, addr);
+
+		e = intel_pt_cache_lookup(al.map->dso, machine, offset);
+		if (!e)
+			continue;
+
+		if (addr + e->byte_cnt + e->length <= event->text_poke.addr) {
+			/*
+			 * No overlap. Working backwards there cannot be another
+			 * basic block that overlaps the text poke if there is a
+			 * branch instruction before the text poke address.
+			 */
+			if (e->branch != INTEL_PT_BR_NO_BRANCH)
+				return 0;
+		} else {
+			intel_pt_cache_invalidate(al.map->dso, machine, offset);
+			intel_pt_log("Invalidated instruction cache for %s at %#"PRIx64"\n",
+				     al.map->dso->long_name, addr);
+		}
+	}
+
+	return 0;
+}
+
 static int intel_pt_process_event(struct perf_session *session,
 				  union perf_event *event,
 				  struct perf_sample *sample,
@@ -2653,6 +2721,9 @@ static int intel_pt_process_event(struct perf_session *session,
 		 event->header.type == PERF_RECORD_SWITCH_CPU_WIDE)
 		err = intel_pt_context_switch(pt, event, sample);
 
+	if (!err && event->header.type == PERF_RECORD_TEXT_POKE)
+		err = intel_pt_text_poke(pt, event);
+
 	intel_pt_log("event %u: cpu %d time %"PRIu64" tsc %#"PRIx64" ",
 		     event->header.type, sample->cpu, sample->time, timestamp);
 	intel_pt_log_event(event);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3] perf/x86: Add perf text poke event
  2019-12-18 14:26 ` [PATCH 1/3] " Adrian Hunter
@ 2019-12-19 13:09   ` Peter Zijlstra
  2019-12-19 16:50     ` Hunter, Adrian
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2019-12-19 13:09 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

On Wed, Dec 18, 2019 at 04:26:16PM +0200, Adrian Hunter wrote:
> Record changes to kernel text (i.e. self-modifying code) in order to
> support tracers like Intel PT decoding through jump labels.

I don't get the obsession with just jump-labels, we need a solution all
modifying code. The fentry site usage is increasing, and optprobes are
also fairly popular with a bunch of people.

> A copy of the running kernel code is needed as a reference point
> (e.g. from /proc/kcore). The text poke event records the old bytes
> and the new bytes so that the event can be processed forwards or backwards.
> 
> In the case of Intel PT tracing, the executable code must be walked to
> reconstruct the control flow. For x86 a jump label text poke begins:
>   - write INT3 byte
>   - IPI-SYNC
>   - write instruction tail
> At this point the actual control flow will be through the INT3 and handler
> and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
> for the INT3, so the flow can still be decoded. Subsequently:
>   - emit RECORD_TEXT_POKE with the new instruction
>   - IPI-SYNC
>   - write first byte
>   - IPI-SYNC
> So before the text poke event timestamp, the decoder will see either the
> old instruction flow or FUP/TIP of INT3. After the text poke event
> timestamp, the decoder will see either the new instruction flow or FUP/TIP
> of INT3. Thus decoders can use the timestamp as the point at which to
> modify the executable code.

I feel a much better justification for the design can be found in the
discussion we've had around ARM-CS.

Basically SMP instruction coherency mandates something like this, it is
just a happy accident x86 already had all the bits in place.

How is something like:

"Record (single instruction) changes to the kernel text (i.e.
self-modifying code) in order to support tracers like Intel PT and
ARM CoreSight.

A copy of the running kernel code is needed as a reference point (e.g.
from /proc/kcore). The text poke event records the old bytes and the
new bytes so that the event can be processed forwards or backwards.

The basic problem is recording the modified instruction in an
unambiguous manner given SMP instruction cache (in)coherence. That is,
when modifying an instruction concurrently any solution with one or
multiple timestamps is not sufficient:

	CPU0				CPU1
 0
 1	write insn A
 2					execute insn A
 3	sync-I$
 4

Due to I$, CPU1 might execute either the old or new A. No matter where
we record tracepoints on CPU0, one simply cannot tell what CPU1 will
have observed, except that at 0 it must be the old one and at 4 it
must be the new one.

To solve this, take inspiration from x86 text poking, which has to
solve this exact problem due to variable length instruction encoding
and I-fetch windows.

 1) overwrite the instruction with a breakpoint and sync I$

This guarantees that that code flow will never hit the target
instruction anymore, on any CPU (or rather, it will cause an
exception).

 2) issue the TEXT_POKE event

 3) overwrite the breakpoint with the new instruction and sync I$

Now we know that any execution after the TEXT_POKE event will either
observe the breakpoint (and hit the exception) or the new instruction.

So by guarding the TEXT_POKE event with an exception on either side;
we can now tell, without doubt, which instruction another CPU will
have observed."

?

> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/kernel/alternative.c   | 37 +++++++++++++-

I'm thinking it might make sense to do this x86 part in a separate
patch, and just present the generic thing first:

>  include/linux/perf_event.h      |  6 +++
>  include/uapi/linux/perf_event.h | 19 ++++++-
>  kernel/events/core.c            | 87 ++++++++++++++++++++++++++++++++-
>  4 files changed, 146 insertions(+), 3 deletions(-)
> 

> @@ -1006,6 +1007,22 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_BPF_EVENT			= 18,
>  
> +	/*
> +	 * Records changes to kernel text i.e. self-modified code.
> +	 * 'len' is the number of old bytes, which is the same as the number
> +	 * of new bytes. 'bytes' contains the old bytes followed immediately
> +	 * by the new bytes.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u64				addr;
> +	 *	u16				len;
> +	 *	u8				bytes[];

Would it make sense to have something like:

	 *	u16				old_len;
	 *	u16				new_len;
	 *	u8				old_bytes[old_len];
	 *	u8				new_bytes[new_len];

That would allow using this for (short) trampolines (ftrace, optprobes
etc..). {old_len=0, new_len>0} would indicate a new trampoline, while
{old_len>0, new_len=0} would indicate the dissapearance of a trampoline.

> +	 *	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_TEXT_POKE			= 19,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };

Then the x86 patch, hooking up the event, can also cover kprobes and
stuff.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH 1/3] perf/x86: Add perf text poke event
  2019-12-19 13:09   ` Peter Zijlstra
@ 2019-12-19 16:50     ` Hunter, Adrian
  0 siblings, 0 replies; 6+ messages in thread
From: Hunter, Adrian @ 2019-12-19 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Thursday, December 19, 2019 3:09 PM
> To: Hunter, Adrian <adrian.hunter@intel.com>
> Cc: Ingo Molnar <mingo@redhat.com>; Borislav Petkov <bp@alien8.de>; H .
> Peter Anvin <hpa@zytor.com>; x86@kernel.org; Mark Rutland
> <mark.rutland@arm.com>; Alexander Shishkin
> <alexander.shishkin@linux.intel.com>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Leo Yan <leo.yan@linaro.org>; Arnaldo
> Carvalho de Melo <acme@kernel.org>; Jiri Olsa <jolsa@redhat.com>; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 1/3] perf/x86: Add perf text poke event
> 
> On Wed, Dec 18, 2019 at 04:26:16PM +0200, Adrian Hunter wrote:
> > Record changes to kernel text (i.e. self-modifying code) in order to
> > support tracers like Intel PT decoding through jump labels.
> 
> I don't get the obsession with just jump-labels, we need a solution all
> modifying code. The fentry site usage is increasing, and optprobes are also
> fairly popular with a bunch of people.

Yes we need a solution for all modifying code.  Jump labels are just a good
place to start because they are the biggest problem by far.

> 
> > A copy of the running kernel code is needed as a reference point (e.g.
> > from /proc/kcore). The text poke event records the old bytes and the
> > new bytes so that the event can be processed forwards or backwards.
> >
> > In the case of Intel PT tracing, the executable code must be walked to
> > reconstruct the control flow. For x86 a jump label text poke begins:
> >   - write INT3 byte
> >   - IPI-SYNC
> >   - write instruction tail
> > At this point the actual control flow will be through the INT3 and
> > handler and not hit the old or new instruction. Intel PT outputs
> > FUP/TIP packets for the INT3, so the flow can still be decoded.
> Subsequently:
> >   - emit RECORD_TEXT_POKE with the new instruction
> >   - IPI-SYNC
> >   - write first byte
> >   - IPI-SYNC
> > So before the text poke event timestamp, the decoder will see either
> > the old instruction flow or FUP/TIP of INT3. After the text poke event
> > timestamp, the decoder will see either the new instruction flow or
> > FUP/TIP of INT3. Thus decoders can use the timestamp as the point at
> > which to modify the executable code.
> 
> I feel a much better justification for the design can be found in the discussion
> we've had around ARM-CS.
> 
> Basically SMP instruction coherency mandates something like this, it is just a
> happy accident x86 already had all the bits in place.
> 
> How is something like:
> 
> "Record (single instruction) changes to the kernel text (i.e.
> self-modifying code) in order to support tracers like Intel PT and ARM
> CoreSight.
> 
> A copy of the running kernel code is needed as a reference point (e.g.
> from /proc/kcore). The text poke event records the old bytes and the new
> bytes so that the event can be processed forwards or backwards.
> 
> The basic problem is recording the modified instruction in an unambiguous
> manner given SMP instruction cache (in)coherence. That is, when modifying
> an instruction concurrently any solution with one or multiple timestamps is
> not sufficient:
> 
> 	CPU0				CPU1
>  0
>  1	write insn A
>  2					execute insn A
>  3	sync-I$
>  4
> 
> Due to I$, CPU1 might execute either the old or new A. No matter where we
> record tracepoints on CPU0, one simply cannot tell what CPU1 will have
> observed, except that at 0 it must be the old one and at 4 it must be the new
> one.
> 
> To solve this, take inspiration from x86 text poking, which has to solve this
> exact problem due to variable length instruction encoding and I-fetch
> windows.
> 
>  1) overwrite the instruction with a breakpoint and sync I$
> 
> This guarantees that that code flow will never hit the target instruction
> anymore, on any CPU (or rather, it will cause an exception).
> 
>  2) issue the TEXT_POKE event
> 
>  3) overwrite the breakpoint with the new instruction and sync I$
> 
> Now we know that any execution after the TEXT_POKE event will either
> observe the breakpoint (and hit the exception) or the new instruction.
> 
> So by guarding the TEXT_POKE event with an exception on either side; we
> can now tell, without doubt, which instruction another CPU will have
> observed."
> 
> ?

Ok

> 
> > Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> > ---
> >  arch/x86/kernel/alternative.c   | 37 +++++++++++++-
> 
> I'm thinking it might make sense to do this x86 part in a separate patch, and
> just present the generic thing first:
> 
> >  include/linux/perf_event.h      |  6 +++
> >  include/uapi/linux/perf_event.h | 19 ++++++-
> >  kernel/events/core.c            | 87 ++++++++++++++++++++++++++++++++-
> >  4 files changed, 146 insertions(+), 3 deletions(-)
> >
> 
> > @@ -1006,6 +1007,22 @@ enum perf_event_type {
> >  	 */
> >  	PERF_RECORD_BPF_EVENT			= 18,
> >
> > +	/*
> > +	 * Records changes to kernel text i.e. self-modified code.
> > +	 * 'len' is the number of old bytes, which is the same as the number
> > +	 * of new bytes. 'bytes' contains the old bytes followed immediately
> > +	 * by the new bytes.
> > +	 *
> > +	 * struct {
> > +	 *	struct perf_event_header	header;
> > +	 *	u64				addr;
> > +	 *	u16				len;
> > +	 *	u8				bytes[];
> 
> Would it make sense to have something like:
> 
> 	 *	u16				old_len;
> 	 *	u16				new_len;
> 	 *	u8				old_bytes[old_len];
> 	 *	u8				new_bytes[new_len];
> 
> That would allow using this for (short) trampolines (ftrace, optprobes etc..).
> {old_len=0, new_len>0} would indicate a new trampoline, while {old_len>0,
> new_len=0} would indicate the dissapearance of a trampoline.

Yes that makes sense.

> 
> > +	 *	struct sample_id		sample_id;
> > +	 * };
> > +	 */
> > +	PERF_RECORD_TEXT_POKE			= 19,
> > +
> >  	PERF_RECORD_MAX,			/* non-ABI */
> >  };
> 
> Then the x86 patch, hooking up the event, can also cover kprobes and stuff.

Ok


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-12-19 16:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-18 14:26 [PATCH 0/3] perf/x86: Add perf text poke event Adrian Hunter
2019-12-18 14:26 ` [PATCH 1/3] " Adrian Hunter
2019-12-19 13:09   ` Peter Zijlstra
2019-12-19 16:50     ` Hunter, Adrian
2019-12-18 14:26 ` [PATCH 2/3] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
2019-12-18 14:26 ` [PATCH 3/3] perf intel-pt: Add support for text poke events Adrian Hunter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.