* [PATCH RFC 0/6] perf/x86: Add perf text poke event
@ 2019-10-25 12:59 Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Hi

Here are patches to add a text poke event to record changes to kernel text
(i.e. self-modifying code) so that tracers like Intel PT can decode through
jump labels.

One question is whether this approach can also be of use to other
architectures, particularly ARM CoreSight.

The first patch makes the kernel change and the subsequent patches are
tools changes.

The first 3 tools patches add support for updating perf tools' data cache
with the changed bytes.

The last 2 patches are Intel PT specific tools changes.

The patches are based on Arnaldo's tree perf/core branch.

Example:

  For jump labels, the kernel needs
	CONFIG_JUMP_LABEL=y
  and an easy-to-flip jump label is controlled by sysctl_schedstats(), which needs
	CONFIG_SCHEDSTATS=y
	CONFIG_PROC_SYSCTL=y

  Also note that 'sudo perf record' is put into the background which, as
  written, needs sudo credential caching (otherwise the background task
  will stop, waiting for the sudo password), hence the initial 'sudo echo'
  to cache the credentials.

Before:

  $ sudo echo
  $ sudo perf record -o perf.data.before --kcore -a -e intel_pt//k -m,64M &
  [1] 1640
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo bash -c 'echo 1 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  1
  $ sudo bash -c 'echo 0 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo kill 1640
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 16.635 MB perf.data.before ]
  $ perf script -i perf.data.before --itrace=e >/dev/null
  Warning:
  1946 instruction trace errors

After:

  $ sudo echo
  $ sudo perf record -o perf.data.after --kcore -a -e intel_pt//k -m,64M &
  [1] 1882
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo bash -c 'echo 1 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  1
  $ sudo bash -c 'echo 0 > /proc/sys/kernel/sched_schedstats'
  $ cat /proc/sys/kernel/sched_schedstats
  0
  $ sudo kill 1882
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 10.893 MB perf.data.after ]
  $ perf script -i perf.data.after --itrace=e
  $


Adrian Hunter (6):
      perf/x86: Add perf text poke event
      perf dso: Refactor dso_cache__read()
      perf dso: Add dso__data_write_cache_addr()
      perf tools: Add support for PERF_RECORD_TEXT_POKE
      perf auxtrace: Add auxtrace_cache__remove()
      perf intel-pt: Add support for text poke events

 arch/x86/include/asm/text-patching.h      |   1 +
 arch/x86/kernel/alternative.c             |  39 ++++++++-
 include/linux/perf_event.h                |   6 ++
 include/uapi/linux/perf_event.h           |  28 ++++++-
 kernel/events/core.c                      |  90 +++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h     |  28 ++++++-
 tools/perf/arch/x86/util/intel-pt.c       |   5 ++
 tools/perf/builtin-record.c               |  45 ++++++++++
 tools/perf/lib/include/perf/event.h       |   9 ++
 tools/perf/util/auxtrace.c                |  28 +++++++
 tools/perf/util/auxtrace.h                |   1 +
 tools/perf/util/dso.c                     | 135 +++++++++++++++++++++---------
 tools/perf/util/dso.h                     |   7 ++
 tools/perf/util/event.c                   |  39 +++++++++
 tools/perf/util/event.h                   |   5 ++
 tools/perf/util/evlist.h                  |   1 +
 tools/perf/util/intel-pt.c                |  71 ++++++++++++++++
 tools/perf/util/machine.c                 |  38 +++++++++
 tools/perf/util/machine.h                 |   3 +
 tools/perf/util/perf_event_attr_fprintf.c |   1 +
 tools/perf/util/record.c                  |  10 +++
 tools/perf/util/record.h                  |   1 +
 tools/perf/util/session.c                 |  22 +++++
 tools/perf/util/tool.h                    |   3 +-
 24 files changed, 570 insertions(+), 46 deletions(-)


Regards
Adrian

* [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
@ 2019-10-25 12:59 ` Adrian Hunter
  2019-10-30 10:47   ` Leo Yan
  2019-11-04 10:40   ` Peter Zijlstra
  2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Record changes to kernel text (i.e. self-modifying code) so that tracers
like Intel PT can decode through jump labels.

A copy of the running kernel code is needed as a reference point
(e.g. from /proc/kcore). The text poke event records the old bytes
and the new bytes so that the event can be processed forwards or backwards.

The text poke event has 'flags' to describe when the event is sent. At
present the only flag is PERF_TEXT_POKE_UPDATE, which indicates the
point at which tools should update their reference kernel executable to
change the old bytes to the new bytes.

Other architectures may wish to emit a pair of text poke events: one before
and one after a text poke. In that case, the PERF_TEXT_POKE_UPDATE flag
would be set on only one of the pair.

In the case of Intel PT tracing, the executable code must be walked to
reconstruct the control flow. For x86 a jump label text poke begins:
  - write INT3 byte
  - IPI-SYNC
At this point the actual control flow will be through the INT3 and handler
and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
for the INT3, so the flow can still be decoded. Subsequently:
  - emit RECORD_TEXT_POKE with the new instruction
  - write instruction tail
  - IPI-SYNC
  - write first byte
  - IPI-SYNC
So before the text poke event timestamp, the decoder will see either the
old instruction flow or FUP/TIP of INT3. After the text poke event
timestamp, the decoder will see either the new instruction flow or FUP/TIP
of INT3. Thus decoders can use the timestamp as the point at which to
modify the executable code.
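
To make the expected tool-side handling concrete, here is a minimal sketch
(not part of the patches) of applying such a record to a reference copy of
the kernel text. handle_text_poke() and its 'image'/'image_base' arguments
are illustrative, and PERF_TEXT_POKE_UPDATE is the flag introduced below:

#include <linux/perf_event.h>
#include <stdbool.h>
#include <string.h>

/* Record layout as described in the uapi comment added by this patch. */
struct text_poke_record {
	struct perf_event_header header;
	__u64 addr;
	__u16 flags;
	__u16 len;
	__u8  bytes[];			/* old bytes followed by new bytes */
};

#ifndef PERF_TEXT_POKE_UPDATE
#define PERF_TEXT_POKE_UPDATE	0x01	/* from the updated uapi header */
#endif

/*
 * 'image' is assumed to be a flat reference copy of kernel text starting
 * at virtual address 'image_base' (e.g. read from /proc/kcore).
 */
static void handle_text_poke(const struct text_poke_record *rec,
			     __u8 *image, __u64 image_base, bool forwards)
{
	const __u8 *old_bytes = rec->bytes;
	const __u8 *new_bytes = rec->bytes + rec->len;

	if (!(rec->flags & PERF_TEXT_POKE_UPDATE))
		return;
	/*
	 * Going forwards in time, switch to the new bytes at the record's
	 * timestamp; going backwards, restore the old bytes instead.
	 */
	memcpy(image + (rec->addr - image_base),
	       forwards ? new_bytes : old_bytes, rec->len);
}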

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 39 +++++++++++-
 include/linux/perf_event.h           |  6 ++
 include/uapi/linux/perf_event.h      | 28 ++++++++-
 kernel/events/core.c                 | 90 +++++++++++++++++++++++++++-
 5 files changed, 161 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 5e8319bb207a..873141765d8e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -30,6 +30,7 @@ struct text_poke_loc {
 	void *addr;
 	size_t len;
 	const char opcode[POKE_MAX_OPCODE_SIZE];
+	u8 old;
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9d3a971ea364..9488f0fa7e22 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3,6 +3,7 @@
 
 #include <linux/module.h>
 #include <linux/sched.h>
+#include <linux/perf_event.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/stringify.h>
@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 	/*
 	 * First step: add a int3 trap to the address that will be patched.
 	 */
-	for (i = 0; i < nr_entries; i++)
+	for (i = 0; i < nr_entries; i++) {
+		tp[i].old = *(u8 *)tp[i].addr;
 		text_poke(tp[i].addr, &int3, sizeof(int3));
+	}
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
@@ -1054,12 +1057,46 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 	 * Second step: update all but the first byte of the patched range.
 	 */
 	for (i = 0; i < nr_entries; i++) {
+		u8 old[POKE_MAX_OPCODE_SIZE] = { tp[i].old, };
+
+		memcpy(old + sizeof(int3),
+		       tp[i].addr + sizeof(int3),
+		       tp[i].len - sizeof(int3));
+
 		if (tp[i].len - sizeof(int3) > 0) {
 			text_poke((char *)tp[i].addr + sizeof(int3),
 				  (const char *)tp[i].opcode + sizeof(int3),
 				  tp[i].len - sizeof(int3));
 			patched_all_but_first++;
 		}
+
+		/*
+		 * Emit a perf event to record the text poke, primarily to
+		 * support Intel PT decoding which must walk the executable code
+		 * to reconstruct the trace. The flow up to here is:
+		 *   - write INT3 byte
+		 *   - IPI-SYNC
+		 * At this point the actual control flow will be through the
+		 * INT3 and handler and not hit the old or new instruction.
+		 * Intel PT outputs FUP/TIP packets for the INT3, so the flow
+		 * can still be decoded. Subsequently:
+		 *   - emit RECORD_TEXT_POKE with the new instruction
+		 *   - write instruction tail
+		 *   - IPI-SYNC
+		 *   - write first byte
+		 *   - IPI-SYNC
+		 * So before the text poke event timestamp, the decoder will see
+		 * either the old instruction flow or FUP/TIP of INT3. After the
+		 * text poke event timestamp, the decoder will see either the
+		 * new instruction flow or FUP/TIP of INT3. Thus decoders can
+		 * use the timestamp as the point at which to modify the
+		 * executable code.
+		 * The old instruction is recorded so that the event can be
+		 * processed forwards or backwards.
+		 */
+		perf_event_text_poke((unsigned long)tp[i].addr,
+				     PERF_TEXT_POKE_UPDATE, old, tp[i].opcode,
+				     tp[i].len);
 	}
 
 	if (patched_all_but_first) {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..db88b7ade8c6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_namespaces(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
+extern void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
+				 const void *new_bytes, size_t len);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline void perf_event_text_poke(unsigned long addr, u16 flags,
+					const void *old_bytes,
+					const void *new_bytes,
+					size_t len)			{ }
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..c8d1f52a7fce 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,7 +375,8 @@ struct perf_event_attr {
 				ksymbol        :  1, /* include ksymbol events */
 				bpf_event      :  1, /* include bpf events */
 				aux_output     :  1, /* generate AUX records instead of events */
-				__reserved_1   : 32;
+				text_poke      :  1, /* include text poke events */
+				__reserved_1   : 31;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1000,6 +1001,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_BPF_EVENT			= 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'flags' has PERF_TEXT_POKE_UPDATE (i.e. bit 0) set to
+	 * indicate to tools to update old bytes to new bytes in the
+	 * executable image.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed immediately
+	 * by the new bytes.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				addr;
+	 *	u16				flags;
+	 *	u16				len;
+	 *	u8				bytes[];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1041,6 +1062,11 @@ enum perf_callchain_context {
 #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
 #define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
 
+/**
+ * PERF_RECORD_TEXT_POKE::flags bits
+ */
+#define PERF_TEXT_POKE_UPDATE		0x01	/* update old bytes to new bytes */
+
 #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ec0b0bfddbd..666e094e8ed9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
 static atomic_t nr_switch_events __read_mostly;
 static atomic_t nr_ksymbol_events __read_mostly;
 static atomic_t nr_bpf_events __read_mostly;
+static atomic_t nr_text_poke_events __read_mostly;
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -4351,7 +4352,7 @@ static bool is_sb_event(struct perf_event *event)
 	if (attr->mmap || attr->mmap_data || attr->mmap2 ||
 	    attr->comm || attr->comm_exec ||
 	    attr->task || attr->ksymbol ||
-	    attr->context_switch ||
+	    attr->context_switch || attr->text_poke ||
 	    attr->bpf_event)
 		return true;
 	return false;
@@ -4425,6 +4426,8 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_dec(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_dec(&nr_text_poke_events);
 
 	if (dec) {
 		if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -8070,6 +8073,89 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_text_poke_event {
+	const void		*old_bytes;
+	const void		*new_bytes;
+	size_t			pad;
+	u16			flags;
+	u16			len;
+
+	struct {
+		struct perf_event_header	header;
+
+		u64				addr;
+	} event_id;
+};
+
+static int perf_event_text_poke_match(struct perf_event *event)
+{
+	return event->attr.text_poke;
+}
+
+static void perf_event_text_poke_output(struct perf_event *event, void *data)
+{
+	struct perf_text_poke_event *text_poke_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	u64 padding = 0;
+	int ret;
+
+	if (!perf_event_text_poke_match(event))
+		return;
+
+	perf_event_header__init_id(&text_poke_event->event_id.header, &sample, event);
+
+	ret = perf_output_begin(&handle, event, text_poke_event->event_id.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, text_poke_event->event_id);
+	perf_output_put(&handle, text_poke_event->flags);
+	perf_output_put(&handle, text_poke_event->len);
+
+	__output_copy(&handle, text_poke_event->old_bytes, text_poke_event->len);
+	__output_copy(&handle, text_poke_event->new_bytes, text_poke_event->len);
+
+	if (text_poke_event->pad)
+		__output_copy(&handle, &padding, text_poke_event->pad);
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
+void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
+			  const void *new_bytes, size_t len)
+{
+	struct perf_text_poke_event text_poke_event;
+	size_t tot, pad;
+
+	if (!atomic_read(&nr_text_poke_events))
+		return;
+
+	tot  = sizeof(text_poke_event.flags) +
+	       sizeof(text_poke_event.len) + (len << 1);
+	pad  = ALIGN(tot, sizeof(u64)) - tot;
+
+	text_poke_event = (struct perf_text_poke_event){
+		.old_bytes    = old_bytes,
+		.new_bytes    = new_bytes,
+		.pad          = pad,
+		.flags        = flags,
+		.len          = len,
+		.event_id  = {
+			.header = {
+				.type = PERF_RECORD_TEXT_POKE,
+				.misc = PERF_RECORD_MISC_KERNEL,
+				.size = sizeof(text_poke_event.event_id) + tot + pad,
+			},
+			.addr = addr,
+		},
+	};
+
+	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
+}
+
 void perf_event_itrace_started(struct perf_event *event)
 {
 	event->attach_state |= PERF_ATTACH_ITRACE;
@@ -10356,6 +10442,8 @@ static void account_event(struct perf_event *event)
 		atomic_inc(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_inc(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_inc(&nr_text_poke_events);
 
 	if (inc) {
 		/*
-- 
2.17.1


* [PATCH RFC 2/6] perf dso: Refactor dso_cache__read()
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
@ 2019-10-25 12:59 ` Adrian Hunter
  2019-10-25 14:54   ` Arnaldo Carvalho de Melo
                     ` (2 more replies)
  2019-10-25 12:59 ` [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr() Adrian Hunter
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Refactor dso_cache__read() to separate populating the cache from copying
data from it.  This is preparation for adding a cache "write" that will
update the data in the cache.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/dso.c | 64 +++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index e11ddf86f2b3..460330d125b6 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -768,7 +768,7 @@ dso_cache__free(struct dso *dso)
 	pthread_mutex_unlock(&dso->lock);
 }
 
-static struct dso_cache *dso_cache__find(struct dso *dso, u64 offset)
+static struct dso_cache *__dso_cache__find(struct dso *dso, u64 offset)
 {
 	const struct rb_root *root = &dso->data.cache;
 	struct rb_node * const *p = &root->rb_node;
@@ -863,54 +863,64 @@ static ssize_t file_read(struct dso *dso, struct machine *machine,
 	return ret;
 }
 
-static ssize_t
-dso_cache__read(struct dso *dso, struct machine *machine,
-		u64 offset, u8 *data, ssize_t size)
+static struct dso_cache *dso_cache__populate(struct dso *dso,
+					     struct machine *machine,
+					     u64 offset, ssize_t *ret)
 {
 	u64 cache_offset = offset & DSO__DATA_CACHE_MASK;
 	struct dso_cache *cache;
 	struct dso_cache *old;
-	ssize_t ret;
 
 	cache = zalloc(sizeof(*cache) + DSO__DATA_CACHE_SIZE);
-	if (!cache)
-		return -ENOMEM;
+	if (!cache) {
+		*ret = -ENOMEM;
+		return NULL;
+	}
 
 	if (dso->binary_type == DSO_BINARY_TYPE__BPF_PROG_INFO)
-		ret = bpf_read(dso, cache_offset, cache->data);
+		*ret = bpf_read(dso, cache_offset, cache->data);
 	else
-		ret = file_read(dso, machine, cache_offset, cache->data);
+		*ret = file_read(dso, machine, cache_offset, cache->data);
 
-	if (ret > 0) {
-		cache->offset = cache_offset;
-		cache->size   = ret;
+	if (*ret <= 0) {
+		free(cache);
+		return NULL;
+	}
 
-		old = dso_cache__insert(dso, cache);
-		if (old) {
-			/* we lose the race */
-			free(cache);
-			cache = old;
-		}
+	cache->offset = cache_offset;
+	cache->size   = *ret;
 
-		ret = dso_cache__memcpy(cache, offset, data, size);
+	old = dso_cache__insert(dso, cache);
+	if (old) {
+		/* we lose the race */
+		free(cache);
+		cache = old;
 	}
 
-	if (ret <= 0)
-		free(cache);
+	return cache;
+}
 
-	return ret;
+static struct dso_cache *dso_cache__find(struct dso *dso,
+					 struct machine *machine,
+					 u64 offset,
+					 ssize_t *ret)
+{
+	struct dso_cache *cache = __dso_cache__find(dso, offset);
+
+	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
 }
 
 static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
 			      u64 offset, u8 *data, ssize_t size)
 {
 	struct dso_cache *cache;
+	ssize_t ret = 0;
 
-	cache = dso_cache__find(dso, offset);
-	if (cache)
-		return dso_cache__memcpy(cache, offset, data, size);
-	else
-		return dso_cache__read(dso, machine, offset, data, size);
+	cache = dso_cache__find(dso, machine, offset, &ret);
+	if (!cache)
+		return ret;
+
+	return dso_cache__memcpy(cache, offset, data, size);
 }
 
 /*
-- 
2.17.1


* [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr()
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
@ 2019-10-25 12:59 ` Adrian Hunter
  2019-10-28 15:45   ` Jiri Olsa
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 4/6] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Add functions to write into the dso file data cache without changing the
file itself.
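
As an illustration of the intended use (a sketch only, assuming the usual
perf tool headers for struct dso, struct map and struct machine; the
function name update_kernel_dso is hypothetical), a caller handling poked
bytes might do:

/*
 * Sketch only: write the poked bytes into the in-memory dso data cache;
 * the backing file (e.g. kcore) is never modified.
 */
static int update_kernel_dso(struct dso *dso, struct map *map,
			     struct machine *machine, u64 addr,
			     const u8 *bytes, u16 len)
{
	ssize_t written = dso__data_write_cache_addr(dso, map, machine,
						     addr, bytes, len);

	return written == len ? 0 : -1;
}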

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/dso.c | 73 ++++++++++++++++++++++++++++++++++---------
 tools/perf/util/dso.h |  7 +++++
 2 files changed, 65 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 460330d125b6..0f1b77275a86 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -827,14 +827,16 @@ dso_cache__insert(struct dso *dso, struct dso_cache *new)
 	return cache;
 }
 
-static ssize_t
-dso_cache__memcpy(struct dso_cache *cache, u64 offset,
-		  u8 *data, u64 size)
+static ssize_t dso_cache__memcpy(struct dso_cache *cache, u64 offset, u8 *data,
+				 u64 size, bool out)
 {
 	u64 cache_offset = offset - cache->offset;
 	u64 cache_size   = min(cache->size - cache_offset, size);
 
-	memcpy(data, cache->data + cache_offset, cache_size);
+	if (out)
+		memcpy(data, cache->data + cache_offset, cache_size);
+	else
+		memcpy(cache->data + cache_offset, data, cache_size);
 	return cache_size;
 }
 
@@ -910,8 +912,8 @@ static struct dso_cache *dso_cache__find(struct dso *dso,
 	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
 }
 
-static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
-			      u64 offset, u8 *data, ssize_t size)
+static ssize_t dso_cache_io(struct dso *dso, struct machine *machine,
+			    u64 offset, u8 *data, ssize_t size, bool out)
 {
 	struct dso_cache *cache;
 	ssize_t ret = 0;
@@ -920,16 +922,16 @@ static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
 	if (!cache)
 		return ret;
 
-	return dso_cache__memcpy(cache, offset, data, size);
+	return dso_cache__memcpy(cache, offset, data, size, out);
 }
 
 /*
  * Reads and caches dso data DSO__DATA_CACHE_SIZE size chunks
  * in the rb_tree. Any read to already cached data is served
- * by cached data.
+ * by cached data. Writes update the cache only, not the backing file.
  */
-static ssize_t cached_read(struct dso *dso, struct machine *machine,
-			   u64 offset, u8 *data, ssize_t size)
+static ssize_t cached_io(struct dso *dso, struct machine *machine,
+			 u64 offset, u8 *data, ssize_t size, bool out)
 {
 	ssize_t r = 0;
 	u8 *p = data;
@@ -937,7 +939,7 @@ static ssize_t cached_read(struct dso *dso, struct machine *machine,
 	do {
 		ssize_t ret;
 
-		ret = dso_cache_read(dso, machine, offset, p, size);
+		ret = dso_cache_io(dso, machine, offset, p, size, out);
 		if (ret < 0)
 			return ret;
 
@@ -1021,8 +1023,9 @@ off_t dso__data_size(struct dso *dso, struct machine *machine)
 	return dso->data.file_size;
 }
 
-static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
-				u64 offset, u8 *data, ssize_t size)
+static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
+				      u64 offset, u8 *data, ssize_t size,
+				      bool out)
 {
 	if (dso__data_file_size(dso, machine))
 		return -1;
@@ -1034,7 +1037,7 @@ static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
 	if (offset + size < offset)
 		return -1;
 
-	return cached_read(dso, machine, offset, data, size);
+	return cached_io(dso, machine, offset, data, size, out);
 }
 
 /**
@@ -1054,7 +1057,7 @@ ssize_t dso__data_read_offset(struct dso *dso, struct machine *machine,
 	if (dso->data.status == DSO_DATA_STATUS_ERROR)
 		return -1;
 
-	return data_read_offset(dso, machine, offset, data, size);
+	return data_read_write_offset(dso, machine, offset, data, size, true);
 }
 
 /**
@@ -1075,6 +1078,46 @@ ssize_t dso__data_read_addr(struct dso *dso, struct map *map,
 	return dso__data_read_offset(dso, machine, offset, data, size);
 }
 
+/**
+ * dso__data_write_cache_offs - Write data to dso data cache at file offset
+ * @dso: dso object
+ * @machine: machine object
+ * @offset: file offset
+ * @data: buffer to write
+ * @size: size of the @data buffer
+ *
+ * Write into the dso file data cache, but do not change the file itself.
+ */
+ssize_t dso__data_write_cache_offs(struct dso *dso, struct machine *machine,
+				   u64 offset, const u8 *data_in, ssize_t size)
+{
+	u8 *data = (u8 *)data_in; /* cast away const to use same fns for r/w */
+
+	if (dso->data.status == DSO_DATA_STATUS_ERROR)
+		return -1;
+
+	return data_read_write_offset(dso, machine, offset, data, size, false);
+}
+
+/**
+ * dso__data_write_cache_addr - Write data to dso data cache at dso address
+ * @dso: dso object
+ * @machine: machine object
+ * @addr: virtual memory address
+ * @data: buffer to write
+ * @size: size of the @data buffer
+ *
+ * External interface to write into the dso file data cache, but do not change
+ * the file itself.
+ */
+ssize_t dso__data_write_cache_addr(struct dso *dso, struct map *map,
+				   struct machine *machine, u64 addr,
+				   const u8 *data, ssize_t size)
+{
+	u64 offset = map->map_ip(map, addr);
+	return dso__data_write_cache_offs(dso, machine, offset, data, size);
+}
+
 struct map *dso__new_map(const char *name)
 {
 	struct map *map = NULL;
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index e4dddb76770d..2f1fcbc6fead 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -285,6 +285,8 @@ void dso__set_module_info(struct dso *dso, struct kmod_path *m,
  *   dso__data_size
  *   dso__data_read_offset
  *   dso__data_read_addr
+ *   dso__data_write_cache_offs
+ *   dso__data_write_cache_addr
  *
  * Please refer to the dso.c object code for each function and
  * arguments documentation. Following text tries to explain the
@@ -332,6 +334,11 @@ ssize_t dso__data_read_addr(struct dso *dso, struct map *map,
 			    struct machine *machine, u64 addr,
 			    u8 *data, ssize_t size);
 bool dso__data_status_seen(struct dso *dso, enum dso_data_status_seen by);
+ssize_t dso__data_write_cache_offs(struct dso *dso, struct machine *machine,
+				   u64 offset, const u8 *data, ssize_t size);
+ssize_t dso__data_write_cache_addr(struct dso *dso, struct map *map,
+				   struct machine *machine, u64 addr,
+				   const u8 *data, ssize_t size);
 
 struct map *dso__new_map(const char *name);
 struct dso *machine__findnew_kernel(struct machine *machine, const char *name,
-- 
2.17.1


* [PATCH RFC 4/6] perf tools: Add support for PERF_RECORD_TEXT_POKE
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
                   ` (2 preceding siblings ...)
  2019-10-25 12:59 ` [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr() Adrian Hunter
@ 2019-10-25 12:59 ` Adrian Hunter
  2019-10-25 12:59 ` [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove() Adrian Hunter
  2019-10-25 13:00 ` [PATCH RFC 6/6] perf intel-pt: Add support for text poke events Adrian Hunter
  5 siblings, 0 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Add processing for PERF_RECORD_TEXT_POKE events. When a text poke event is
processed, if the PERF_TEXT_POKE_UPDATE flag is set, then the kernel dso
data cache is updated with the poked bytes.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/include/uapi/linux/perf_event.h     | 28 +++++++++++++-
 tools/perf/arch/x86/util/intel-pt.c       |  1 +
 tools/perf/builtin-record.c               | 45 +++++++++++++++++++++++
 tools/perf/lib/include/perf/event.h       |  9 +++++
 tools/perf/util/event.c                   | 39 ++++++++++++++++++++
 tools/perf/util/event.h                   |  5 +++
 tools/perf/util/evlist.h                  |  1 +
 tools/perf/util/machine.c                 | 38 +++++++++++++++++++
 tools/perf/util/machine.h                 |  3 ++
 tools/perf/util/perf_event_attr_fprintf.c |  1 +
 tools/perf/util/record.c                  | 10 +++++
 tools/perf/util/record.h                  |  1 +
 tools/perf/util/session.c                 | 22 +++++++++++
 tools/perf/util/tool.h                    |  3 +-
 14 files changed, 204 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index bb7b271397a6..c8d1f52a7fce 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -375,7 +375,8 @@ struct perf_event_attr {
 				ksymbol        :  1, /* include ksymbol events */
 				bpf_event      :  1, /* include bpf events */
 				aux_output     :  1, /* generate AUX records instead of events */
-				__reserved_1   : 32;
+				text_poke      :  1, /* include text poke events */
+				__reserved_1   : 31;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1000,6 +1001,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_BPF_EVENT			= 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'flags' has PERF_TEXT_POKE_UPDATE (i.e. bit 0) set to
+	 * indicate to tools to update old bytes to new bytes in the
+	 * executable image.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed immediately
+	 * by the new bytes.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				addr;
+	 *	u16				flags;
+	 *	u16				len;
+	 *	u8				bytes[];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1041,6 +1062,11 @@ enum perf_callchain_context {
 #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
 #define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
 
+/**
+ * PERF_RECORD_TEXT_POKE::flags bits
+ */
+#define PERF_TEXT_POKE_UPDATE		0x01	/* update old bytes to new bytes */
+
 #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index d6d26256915f..a81ce60876f6 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -17,6 +17,7 @@
 #include "../../util/event.h"
 #include "../../util/evlist.h"
 #include "../../util/evsel.h"
+#include "../../util/evsel_config.h"
 #include "../../util/cpumap.h"
 #include "../../util/mmap.h"
 #include <subcmd/parse-options.h>
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f6664bb08b26..5a5a5e84b136 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -702,6 +702,43 @@ static int record__auxtrace_init(struct record *rec __maybe_unused)
 
 #endif
 
+static int record__config_text_poke(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	int err;
+
+	/* Nothing to do if text poke is already configured */
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.text_poke)
+			return 0;
+	}
+
+	err = parse_events(evlist, "dummy:u", NULL);
+	if (err)
+		return err;
+
+	evsel = evlist__last(evlist);
+
+	evsel->core.attr.freq = 0;
+	evsel->core.attr.sample_period = 1;
+	evsel->core.attr.text_poke = 1;
+
+	evsel->core.system_wide = true;
+	evsel->no_aux_samples = true;
+	evsel->immediate = true;
+
+	/* Text poke must be collected on all CPUs */
+	perf_cpu_map__put(evsel->core.own_cpus);
+	evsel->core.own_cpus = perf_cpu_map__new(NULL);
+	perf_cpu_map__put(evsel->core.cpus);
+	evsel->core.cpus = perf_cpu_map__get(evsel->core.own_cpus);
+
+	perf_evsel__set_sample_bit(evsel, TIME);
+	perf_evsel__reset_sample_bit(evsel, BRANCH_STACK);
+
+	return 0;
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2507,6 +2544,14 @@ int cmd_record(int argc, const char **argv)
 	if (rec->opts.full_auxtrace)
 		rec->buildid_all = true;
 
+	if (rec->opts.text_poke) {
+		err = record__config_text_poke(rec->evlist);
+		if (err) {
+			pr_err("record__config_text_poke failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/lib/include/perf/event.h b/tools/perf/lib/include/perf/event.h
index 18106899cb4e..0127125eefd7 100644
--- a/tools/perf/lib/include/perf/event.h
+++ b/tools/perf/lib/include/perf/event.h
@@ -105,6 +105,14 @@ struct perf_record_bpf_event {
 	__u8			 tag[BPF_TAG_SIZE];  // prog tag
 };
 
+struct perf_record_text_poke_event {
+	struct perf_event_header header;
+	__u64			addr;
+	__u16			flags;
+	__u16			len;
+	__u8			bytes[];
+};
+
 struct perf_record_sample {
 	struct perf_event_header header;
 	__u64			 array[];
@@ -360,6 +368,7 @@ union perf_event {
 	struct perf_record_sample		sample;
 	struct perf_record_bpf_event		bpf;
 	struct perf_record_ksymbol		ksymbol;
+	struct perf_record_text_poke_event	text_poke;
 	struct perf_record_header_attr		attr;
 	struct perf_record_event_update		event_update;
 	struct perf_record_header_event_type	event_type;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index fc1e5a991008..b1e59fc3bce9 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -31,6 +31,7 @@
 #include "stat.h"
 #include "session.h"
 #include "bpf-event.h"
+#include "print_binary.h"
 #include "tool.h"
 #include "../perf.h"
 
@@ -54,6 +55,7 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_NAMESPACES]		= "NAMESPACES",
 	[PERF_RECORD_KSYMBOL]			= "KSYMBOL",
 	[PERF_RECORD_BPF_EVENT]			= "BPF_EVENT",
+	[PERF_RECORD_TEXT_POKE]			= "TEXT_POKE",
 	[PERF_RECORD_HEADER_ATTR]		= "ATTR",
 	[PERF_RECORD_HEADER_EVENT_TYPE]		= "EVENT_TYPE",
 	[PERF_RECORD_HEADER_TRACING_DATA]	= "TRACING_DATA",
@@ -252,6 +254,14 @@ int perf_event__process_bpf(struct perf_tool *tool __maybe_unused,
 	return machine__process_bpf(machine, event, sample);
 }
 
+int perf_event__process_text_poke(struct perf_tool *tool __maybe_unused,
+				  union perf_event *event,
+				  struct perf_sample *sample,
+				  struct machine *machine)
+{
+	return machine__process_text_poke(machine, event, sample);
+}
+
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
 {
 	return fprintf(fp, " %d/%d: [%#" PRI_lx64 "(%#" PRI_lx64 ") @ %#" PRI_lx64 "]: %c %s\n",
@@ -398,6 +408,32 @@ size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp)
 		       event->bpf.type, event->bpf.flags, event->bpf.id);
 }
 
+static int text_poke_printer(enum binary_printer_ops op, unsigned int val,
+			     void *extra __maybe_unused, FILE *fp)
+{
+	if (op == BINARY_PRINT_NUM_DATA)
+		return fprintf(fp, " %02x", val);
+	if (op == BINARY_PRINT_LINE_END)
+		return fprintf(fp, "\n");
+	return 0;
+}
+
+size_t perf_event__fprintf_text_poke(union perf_event *event, FILE *fp)
+{
+	size_t ret = fprintf(fp, " addr %#" PRI_lx64 " flags %#x len %u\n",
+			     event->text_poke.addr, event->text_poke.flags,
+			     event->text_poke.len);
+
+	ret += fprintf(fp, "     old bytes:");
+	ret += binary__fprintf(event->text_poke.bytes, event->text_poke.len, 16,
+			       text_poke_printer, NULL, fp);
+	ret += fprintf(fp, "     new bytes:");
+	ret += binary__fprintf(event->text_poke.bytes + event->text_poke.len,
+			       event->text_poke.len, 16, text_poke_printer,
+			       NULL, fp);
+	return ret;
+}
+
 size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 {
 	size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -439,6 +475,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 	case PERF_RECORD_BPF_EVENT:
 		ret += perf_event__fprintf_bpf(event, fp);
 		break;
+	case PERF_RECORD_TEXT_POKE:
+		ret += perf_event__fprintf_text_poke(event, fp);
+		break;
 	default:
 		ret += fprintf(fp, "\n");
 	}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index a0a0c91cde4a..f0f4f4db6e22 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -339,6 +339,10 @@ int perf_event__process_bpf(struct perf_tool *tool,
 			    union perf_event *event,
 			    struct perf_sample *sample,
 			    struct machine *machine);
+int perf_event__process_text_poke(struct perf_tool *tool,
+				  union perf_event *event,
+				  struct perf_sample *sample,
+				  struct machine *machine);
 int perf_event__process(struct perf_tool *tool,
 			union perf_event *event,
 			struct perf_sample *sample,
@@ -372,6 +376,7 @@ size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_text_poke(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf(union perf_event *event, FILE *fp);
 
 int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 13051409fd22..a2b9fa8fb281 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -175,6 +175,7 @@ struct callchain_param;
 void perf_evlist__set_id_pos(struct evlist *evlist);
 bool perf_can_sample_identifier(void);
 bool perf_can_record_switch_events(void);
+bool perf_can_record_text_poke_events(void);
 bool perf_can_record_cpu_wide(void);
 void perf_evlist__config(struct evlist *evlist, struct record_opts *opts,
 			 struct callchain_param *callchain);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 70a9f8716a4b..edf0327093b8 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -767,6 +767,42 @@ int machine__process_ksymbol(struct machine *machine __maybe_unused,
 	return machine__process_ksymbol_register(machine, event, sample);
 }
 
+int machine__process_text_poke(struct machine *machine, union perf_event *event,
+			       struct perf_sample *sample __maybe_unused)
+{
+	struct map_groups *mg = &machine->kmaps;
+	struct map *map = map_groups__find(mg, event->text_poke.addr);
+
+	if (dump_trace)
+		perf_event__fprintf_text_poke(event, stdout);
+
+	if (!(event->text_poke.flags & PERF_TEXT_POKE_UPDATE))
+		return 0;
+
+	if (map) {
+		u8 *new_bytes = event->text_poke.bytes + event->text_poke.len;
+		int ret;
+
+		/*
+		 * Kernel maps might be changed when loading symbols so loading
+		 * must be done prior to using kernel maps.
+		 */
+		map__load(map);
+		ret = dso__data_write_cache_addr(map->dso, map, machine,
+						 event->text_poke.addr,
+						 new_bytes,
+						 event->text_poke.len);
+		if (ret != event->text_poke.len)
+			pr_err("Failed to write kernel text poke at %#" PRI_lx64 "\n",
+			       event->text_poke.addr);
+	} else {
+		pr_err("Failed to find kernel text poke address map for %#" PRI_lx64 "\n",
+		       event->text_poke.addr);
+	}
+
+	return 0;
+}
+
 static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
 {
 	const char *dup_filename;
@@ -1931,6 +1967,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
 		ret = machine__process_ksymbol(machine, event, sample); break;
 	case PERF_RECORD_BPF_EVENT:
 		ret = machine__process_bpf(machine, event, sample); break;
+	case PERF_RECORD_TEXT_POKE:
+		ret = machine__process_text_poke(machine, event, sample); break;
 	default:
 		ret = -1;
 		break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 18e13c0ccd6a..af936ec01c46 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -134,6 +134,9 @@ int machine__process_mmap2_event(struct machine *machine, union perf_event *even
 int machine__process_ksymbol(struct machine *machine,
 			     union perf_event *event,
 			     struct perf_sample *sample);
+int machine__process_text_poke(struct machine *machine,
+			       union perf_event *event,
+			       struct perf_sample *sample);
 int machine__process_event(struct machine *machine, union perf_event *event,
 				struct perf_sample *sample);
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index d4ad3f04923a..0134ffd27b23 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -143,6 +143,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(sample_regs_intr, p_hex);
 	PRINT_ATTRf(aux_watermark, p_unsigned);
 	PRINT_ATTRf(sample_max_stack, p_unsigned);
+	PRINT_ATTRf(text_poke, p_unsigned);
 
 	return ret;
 }
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index 8579505c29a4..a58d01b65078 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -97,6 +97,11 @@ static void perf_probe_context_switch(struct evsel *evsel)
 	evsel->core.attr.context_switch = 1;
 }
 
+static void perf_probe_text_poke(struct evsel *evsel)
+{
+	evsel->core.attr.text_poke = 1;
+}
+
 bool perf_can_sample_identifier(void)
 {
 	return perf_probe_api(perf_probe_sample_identifier);
@@ -112,6 +117,11 @@ bool perf_can_record_switch_events(void)
 	return perf_probe_api(perf_probe_context_switch);
 }
 
+bool perf_can_record_text_poke_events(void)
+{
+	return perf_probe_api(perf_probe_text_poke);
+}
+
 bool perf_can_record_cpu_wide(void)
 {
 	struct perf_event_attr attr = {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 948bbcf9aef3..acc04d873059 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -45,6 +45,7 @@ struct record_opts {
 	bool	      sample_id;
 	bool	      no_bpf_event;
 	bool	      kcore;
+	bool	      text_poke;
 	unsigned int  freq;
 	unsigned int  mmap_pages;
 	unsigned int  auxtrace_mmap_pages;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f07b8ecb91bc..e2c0e72701b0 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -489,6 +489,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
 		tool->ksymbol = perf_event__process_ksymbol;
 	if (tool->bpf == NULL)
 		tool->bpf = perf_event__process_bpf;
+	if (tool->text_poke == NULL)
+		tool->text_poke = perf_event__process_text_poke;
 	if (tool->read == NULL)
 		tool->read = process_event_sample_stub;
 	if (tool->throttle == NULL)
@@ -658,6 +660,23 @@ static void perf_event__switch_swap(union perf_event *event, bool sample_id_all)
 		swap_sample_id_all(event, &event->context_switch + 1);
 }
 
+static void perf_event__text_poke_swap(union perf_event *event, bool sample_id_all)
+{
+	event->text_poke.addr  = bswap_64(event->text_poke.addr);
+	event->text_poke.flags = bswap_16(event->text_poke.flags);
+	event->text_poke.len   = bswap_16(event->text_poke.len);
+
+	if (sample_id_all) {
+		size_t len = sizeof(event->text_poke.flags) +
+			     sizeof(event->text_poke.len) +
+			     event->text_poke.len * 2;
+		void *data = &event->text_poke.flags;
+
+		data += PERF_ALIGN(len, sizeof(u64));
+		swap_sample_id_all(event, data);
+	}
+}
+
 static void perf_event__throttle_swap(union perf_event *event,
 				      bool sample_id_all)
 {
@@ -930,6 +949,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_SWITCH]		  = perf_event__switch_swap,
 	[PERF_RECORD_SWITCH_CPU_WIDE]	  = perf_event__switch_swap,
 	[PERF_RECORD_NAMESPACES]	  = perf_event__namespaces_swap,
+	[PERF_RECORD_TEXT_POKE]		  = perf_event__text_poke_swap,
 	[PERF_RECORD_HEADER_ATTR]	  = perf_event__hdr_attr_swap,
 	[PERF_RECORD_HEADER_EVENT_TYPE]	  = perf_event__event_type_swap,
 	[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1469,6 +1489,8 @@ static int machines__deliver_event(struct machines *machines,
 		return tool->ksymbol(tool, event, sample, machine);
 	case PERF_RECORD_BPF_EVENT:
 		return tool->bpf(tool, event, sample, machine);
+	case PERF_RECORD_TEXT_POKE:
+		return tool->text_poke(tool, event, sample, machine);
 	default:
 		++evlist->stats.nr_unknown_events;
 		return -1;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 2abbf668b8de..006182923bbe 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -56,7 +56,8 @@ struct perf_tool {
 			throttle,
 			unthrottle,
 			ksymbol,
-			bpf;
+			bpf,
+			text_poke;
 
 	event_attr_op	attr;
 	event_attr_op	event_update;
-- 
2.17.1


* [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove()
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
                   ` (3 preceding siblings ...)
  2019-10-25 12:59 ` [PATCH RFC 4/6] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
@ 2019-10-25 12:59 ` Adrian Hunter
  2019-10-25 14:48   ` Arnaldo Carvalho de Melo
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  2019-10-25 13:00 ` [PATCH RFC 6/6] perf intel-pt: Add support for text poke events Adrian Hunter
  5 siblings, 2 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Add auxtrace_cache__remove(). Intel PT uses an auxtrace_cache to store the
results of code-walking, so that the same block of instructions does not
have to be decoded repeatedly. However, when there are text poke events,
the associated cache entries need to be removed.
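
For example (a sketch only; the real caller is the Intel PT change later in
this series), assuming 'c' is the dso's auxtrace_cache and 'key' identifies
a cached decode result that a text poke overlaps:

/*
 * Removing the entry forces the next auxtrace_cache__lookup() to miss, so
 * the block is re-decoded from the updated kernel text. The removed entry
 * is freed by auxtrace_cache__remove() itself.
 */
auxtrace_cache__remove(c, key);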

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/auxtrace.c | 28 ++++++++++++++++++++++++++++
 tools/perf/util/auxtrace.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 8470dfe9fe97..c555c3ccd79d 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1457,6 +1457,34 @@ int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
 	return 0;
 }
 
+static struct auxtrace_cache_entry *auxtrace_cache__rm(struct auxtrace_cache *c,
+						       u32 key)
+{
+	struct auxtrace_cache_entry *entry;
+	struct hlist_head *hlist;
+	struct hlist_node *n;
+
+	if (!c)
+		return NULL;
+
+	hlist = &c->hashtable[hash_32(key, c->bits)];
+	hlist_for_each_entry_safe(entry, n, hlist, hash) {
+		if (entry->key == key) {
+			hlist_del(&entry->hash);
+			return entry;
+		}
+	}
+
+	return NULL;
+}
+
+void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key)
+{
+	struct auxtrace_cache_entry *entry = auxtrace_cache__rm(c, key);
+
+	auxtrace_cache__free_entry(c, entry);
+}
+
 void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key)
 {
 	struct auxtrace_cache_entry *entry;
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index f201f36bc35f..3f4aa5427d76 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -489,6 +489,7 @@ void *auxtrace_cache__alloc_entry(struct auxtrace_cache *c);
 void auxtrace_cache__free_entry(struct auxtrace_cache *c, void *entry);
 int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
 			struct auxtrace_cache_entry *entry);
+void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key);
 void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key);
 
 struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
-- 
2.17.1


* [PATCH RFC 6/6] perf intel-pt: Add support for text poke events
  2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
                   ` (4 preceding siblings ...)
  2019-10-25 12:59 ` [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove() Adrian Hunter
@ 2019-10-25 13:00 ` Adrian Hunter
  5 siblings, 0 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-25 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Select text poke events when available and the kernel is being traced.
Process text poke events to invalidate entries in Intel PT's instruction
cache.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/arch/x86/util/intel-pt.c |  4 ++
 tools/perf/util/intel-pt.c          | 71 +++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index a81ce60876f6..b7d04698349c 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -754,6 +754,10 @@ static int intel_pt_recording_options(struct auxtrace_record *itr,
 		}
 	}
 
+	if (have_timing_info && !intel_pt_evsel->core.attr.exclude_kernel &&
+	    perf_can_record_text_poke_events() && perf_can_record_cpu_wide())
+		opts->text_poke = true;
+
 	if (intel_pt_evsel) {
 		/*
 		 * To obtain the auxtrace buffer file descriptor, the auxtrace
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index a1c9eb6d4f40..021d99f4a66d 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -504,6 +504,17 @@ intel_pt_cache_lookup(struct dso *dso, struct machine *machine, u64 offset)
 	return auxtrace_cache__lookup(dso->auxtrace_cache, offset);
 }
 
+static void intel_pt_cache_invalidate(struct dso *dso, struct machine *machine,
+				      u64 offset)
+{
+	struct auxtrace_cache *c = intel_pt_cache(dso, machine);
+
+	if (!c)
+		return;
+
+	auxtrace_cache__remove(dso->auxtrace_cache, offset);
+}
+
 static inline u8 intel_pt_cpumode(struct intel_pt *pt, uint64_t ip)
 {
 	return ip >= pt->kernel_start ?
@@ -2520,6 +2531,63 @@ static int intel_pt_process_itrace_start(struct intel_pt *pt,
 					event->itrace_start.tid);
 }
 
+static int intel_pt_find_map(struct thread *thread, u8 cpumode, u64 addr,
+			     struct addr_location *al)
+{
+	if (!al->map || addr < al->map->start || addr >= al->map->end) {
+		if (!thread__find_map(thread, cpumode, addr, al) || !al->map->dso)
+			return -1;
+	}
+
+	return 0;
+}
+
+/* Invalidate all instruction cache entries that overlap the text poke */
+static int intel_pt_text_poke(struct intel_pt *pt, union perf_event *event)
+{
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	u64 addr = event->text_poke.addr + event->text_poke.len - 1;
+	/* Assume text poke begins in a basic block no more than 4096 bytes */
+	int cnt = 4096 + event->text_poke.len;
+	struct thread *thread = pt->unknown_thread;
+	struct addr_location al = { .map = NULL };
+	struct machine *machine = pt->machine;
+	struct intel_pt_cache_entry *e;
+	u64 offset;
+
+	for (; cnt; cnt--, addr--) {
+		if (intel_pt_find_map(thread, cpumode, addr, &al)) {
+			if (addr < event->text_poke.addr)
+				return 0;
+			pr_err("%s: failed to find map at %#"PRIx64"\n",
+			       __func__, addr);
+			return -EINVAL;
+		}
+
+		offset = al.map->map_ip(al.map, addr);
+
+		e = intel_pt_cache_lookup(al.map->dso, machine, offset);
+		if (!e)
+			continue;
+
+		if (addr + e->byte_cnt + e->length <= event->text_poke.addr) {
+			/*
+			 * No overlap. Working backwards there cannot be another
+			 * basic block that overlaps the text poke if there is a
+			 * branch instruction before the text poke address.
+			 */
+			if (e->branch != INTEL_PT_BR_NO_BRANCH)
+				return 0;
+		} else {
+			intel_pt_cache_invalidate(al.map->dso, machine, offset);
+			intel_pt_log("Invalidated instruction cache for %s at %#"PRIx64"\n",
+				     al.map->dso->long_name, addr);
+		}
+	}
+
+	return 0;
+}
+
 static int intel_pt_process_event(struct perf_session *session,
 				  union perf_event *event,
 				  struct perf_sample *sample,
@@ -2577,6 +2645,9 @@ static int intel_pt_process_event(struct perf_session *session,
 		 event->header.type == PERF_RECORD_SWITCH_CPU_WIDE)
 		err = intel_pt_context_switch(pt, event, sample);
 
+	if (!err && event->header.type == PERF_RECORD_TEXT_POKE)
+		err = intel_pt_text_poke(pt, event);
+
 	intel_pt_log("event %u: cpu %d time %"PRIu64" tsc %#"PRIx64" ",
 		     event->header.type, sample->cpu, sample->time, timestamp);
 	intel_pt_log_event(event);
-- 
2.17.1


* Re: [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove()
  2019-10-25 12:59 ` [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove() Adrian Hunter
@ 2019-10-25 14:48   ` Arnaldo Carvalho de Melo
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  1 sibling, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-10-25 14:48 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Jiri Olsa, linux-kernel

Em Fri, Oct 25, 2019 at 03:59:59PM +0300, Adrian Hunter escreveu:
> Add auxtrace_cache__remove(). Intel PT uses an auxtrace_cache to store the
> results of code-walking, so that the same block of instructions does not
> have to be decoded repeatedly. However, when there are text poke events,
> the associated cache entries need to be removed.

Applied this one as it is just leg work for the rest, for which I'll wait a
bit for comments.

- Arnaldo
 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  tools/perf/util/auxtrace.c | 28 ++++++++++++++++++++++++++++
>  tools/perf/util/auxtrace.h |  1 +
>  2 files changed, 29 insertions(+)
> 
> diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
> index 8470dfe9fe97..c555c3ccd79d 100644
> --- a/tools/perf/util/auxtrace.c
> +++ b/tools/perf/util/auxtrace.c
> @@ -1457,6 +1457,34 @@ int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
>  	return 0;
>  }
>  
> +static struct auxtrace_cache_entry *auxtrace_cache__rm(struct auxtrace_cache *c,
> +						       u32 key)
> +{
> +	struct auxtrace_cache_entry *entry;
> +	struct hlist_head *hlist;
> +	struct hlist_node *n;
> +
> +	if (!c)
> +		return NULL;
> +
> +	hlist = &c->hashtable[hash_32(key, c->bits)];
> +	hlist_for_each_entry_safe(entry, n, hlist, hash) {
> +		if (entry->key == key) {
> +			hlist_del(&entry->hash);
> +			return entry;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key)
> +{
> +	struct auxtrace_cache_entry *entry = auxtrace_cache__rm(c, key);
> +
> +	auxtrace_cache__free_entry(c, entry);
> +}
> +
>  void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key)
>  {
>  	struct auxtrace_cache_entry *entry;
> diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
> index f201f36bc35f..3f4aa5427d76 100644
> --- a/tools/perf/util/auxtrace.h
> +++ b/tools/perf/util/auxtrace.h
> @@ -489,6 +489,7 @@ void *auxtrace_cache__alloc_entry(struct auxtrace_cache *c);
>  void auxtrace_cache__free_entry(struct auxtrace_cache *c, void *entry);
>  int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
>  			struct auxtrace_cache_entry *entry);
> +void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key);
>  void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key);
>  
>  struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
> -- 
> 2.17.1

-- 

- Arnaldo
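
As an illustration of the intended use, a decoder handling a
PERF_RECORD_TEXT_POKE event would drop any cached code-walking result
covering the poked bytes so they are re-decoded.  A minimal sketch only,
assuming a hypothetical example_insn_cache_key() mapping (the real Intel
PT key derivation in patch 6/6 may differ) and the auxtrace_cache API
added by this patch:

#include <linux/types.h>	/* u16, u32, u64 */
#include "auxtrace.h"		/* struct auxtrace_cache, auxtrace_cache__remove();
				   assumed include path inside tools/perf */

/* Assumed: one cache entry per 64-byte block of text; illustration only. */
static u32 example_insn_cache_key(u64 addr)
{
	return (u32)(addr >> 6);
}

/*
 * On PERF_RECORD_TEXT_POKE, remove any cached entries that cover the
 * poked range so the next lookup walks the updated bytes.
 */
static void example_invalidate_poked(struct auxtrace_cache *insn_cache,
				     u64 addr, u16 len)
{
	u64 a;

	for (a = addr & ~63ULL; a < addr + len; a += 64)
		auxtrace_cache__remove(insn_cache, example_insn_cache_key(a));
}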

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/6] perf dso: Refactor dso_cache__read()
  2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
@ 2019-10-25 14:54   ` Arnaldo Carvalho de Melo
  2019-10-28 15:39   ` Jiri Olsa
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  2 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-10-25 14:54 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Jiri Olsa, linux-kernel

On Fri, Oct 25, 2019 at 03:59:56PM +0300, Adrian Hunter wrote:
> Refactor dso_cache__read() to separate populating the cache from copying
> data from it.  This is preparation for adding a cache "write" that will
> update the data in the cache.

Ditto for 2/6 and 3/6, i.e. applying them now.

- Arnaldo
 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  tools/perf/util/dso.c | 64 +++++++++++++++++++++++++------------------
>  1 file changed, 37 insertions(+), 27 deletions(-)
> 
> diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
> index e11ddf86f2b3..460330d125b6 100644
> --- a/tools/perf/util/dso.c
> +++ b/tools/perf/util/dso.c
> @@ -768,7 +768,7 @@ dso_cache__free(struct dso *dso)
>  	pthread_mutex_unlock(&dso->lock);
>  }
>  
> -static struct dso_cache *dso_cache__find(struct dso *dso, u64 offset)
> +static struct dso_cache *__dso_cache__find(struct dso *dso, u64 offset)
>  {
>  	const struct rb_root *root = &dso->data.cache;
>  	struct rb_node * const *p = &root->rb_node;
> @@ -863,54 +863,64 @@ static ssize_t file_read(struct dso *dso, struct machine *machine,
>  	return ret;
>  }
>  
> -static ssize_t
> -dso_cache__read(struct dso *dso, struct machine *machine,
> -		u64 offset, u8 *data, ssize_t size)
> +static struct dso_cache *dso_cache__populate(struct dso *dso,
> +					     struct machine *machine,
> +					     u64 offset, ssize_t *ret)
>  {
>  	u64 cache_offset = offset & DSO__DATA_CACHE_MASK;
>  	struct dso_cache *cache;
>  	struct dso_cache *old;
> -	ssize_t ret;
>  
>  	cache = zalloc(sizeof(*cache) + DSO__DATA_CACHE_SIZE);
> -	if (!cache)
> -		return -ENOMEM;
> +	if (!cache) {
> +		*ret = -ENOMEM;
> +		return NULL;
> +	}
>  
>  	if (dso->binary_type == DSO_BINARY_TYPE__BPF_PROG_INFO)
> -		ret = bpf_read(dso, cache_offset, cache->data);
> +		*ret = bpf_read(dso, cache_offset, cache->data);
>  	else
> -		ret = file_read(dso, machine, cache_offset, cache->data);
> +		*ret = file_read(dso, machine, cache_offset, cache->data);
>  
> -	if (ret > 0) {
> -		cache->offset = cache_offset;
> -		cache->size   = ret;
> +	if (*ret <= 0) {
> +		free(cache);
> +		return NULL;
> +	}
>  
> -		old = dso_cache__insert(dso, cache);
> -		if (old) {
> -			/* we lose the race */
> -			free(cache);
> -			cache = old;
> -		}
> +	cache->offset = cache_offset;
> +	cache->size   = *ret;
>  
> -		ret = dso_cache__memcpy(cache, offset, data, size);
> +	old = dso_cache__insert(dso, cache);
> +	if (old) {
> +		/* we lose the race */
> +		free(cache);
> +		cache = old;
>  	}
>  
> -	if (ret <= 0)
> -		free(cache);
> +	return cache;
> +}
>  
> -	return ret;
> +static struct dso_cache *dso_cache__find(struct dso *dso,
> +					 struct machine *machine,
> +					 u64 offset,
> +					 ssize_t *ret)
> +{
> +	struct dso_cache *cache = __dso_cache__find(dso, offset);
> +
> +	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
>  }
>  
>  static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
>  			      u64 offset, u8 *data, ssize_t size)
>  {
>  	struct dso_cache *cache;
> +	ssize_t ret = 0;
>  
> -	cache = dso_cache__find(dso, offset);
> -	if (cache)
> -		return dso_cache__memcpy(cache, offset, data, size);
> -	else
> -		return dso_cache__read(dso, machine, offset, data, size);
> +	cache = dso_cache__find(dso, machine, offset, &ret);
> +	if (!cache)
> +		return ret;
> +
> +	return dso_cache__memcpy(cache, offset, data, size);
>  }
>  
>  /*
> -- 
> 2.17.1

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/6] perf dso: Refactor dso_cache__read()
  2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
  2019-10-25 14:54   ` Arnaldo Carvalho de Melo
@ 2019-10-28 15:39   ` Jiri Olsa
  2019-10-29  9:19     ` Adrian Hunter
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  2 siblings, 1 reply; 34+ messages in thread
From: Jiri Olsa @ 2019-10-28 15:39 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, linux-kernel

On Fri, Oct 25, 2019 at 03:59:56PM +0300, Adrian Hunter wrote:

SNIP

> +}
>  
> -	return ret;
> +static struct dso_cache *dso_cache__find(struct dso *dso,
> +					 struct machine *machine,
> +					 u64 offset,
> +					 ssize_t *ret)
> +{
> +	struct dso_cache *cache = __dso_cache__find(dso, offset);
> +
> +	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
>  }
>  
>  static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
>  			      u64 offset, u8 *data, ssize_t size)
>  {
>  	struct dso_cache *cache;
> +	ssize_t ret = 0;
>  
> -	cache = dso_cache__find(dso, offset);
> -	if (cache)
> -		return dso_cache__memcpy(cache, offset, data, size);
> -	else
> -		return dso_cache__read(dso, machine, offset, data, size);
> +	cache = dso_cache__find(dso, machine, offset, &ret);
> +	if (!cache)
> +		return ret;

why not use the ERR_* macros to get the error through the pointer
instead of adding an extra argument?

jirka


> +
> +	return dso_cache__memcpy(cache, offset, data, size);
>  }
>  
>  /*
> -- 
> 2.17.1
> 
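
For reference, the ERR_* idiom suggested above encodes a negative errno
inside the returned pointer, so no extra out-parameter is needed.  A
generic kernel-style sketch of the pattern (not the perf change itself;
Adrian's actual conversion follows in the next message):

#include <linux/err.h>		/* ERR_PTR, PTR_ERR, IS_ERR */
#include <linux/slab.h>

struct thing {
	int x;
};

/* Return a valid pointer on success, or an errno encoded in the pointer. */
static struct thing *thing_alloc(void)
{
	struct thing *t = kzalloc(sizeof(*t), GFP_KERNEL);

	if (!t)
		return ERR_PTR(-ENOMEM);
	return t;
}

static int thing_use(void)
{
	struct thing *t = thing_alloc();

	if (IS_ERR(t))
		return PTR_ERR(t);	/* recover the errno */
	/* ... use t ... */
	kfree(t);
	return 0;
}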


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr()
  2019-10-25 12:59 ` [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr() Adrian Hunter
@ 2019-10-28 15:45   ` Jiri Olsa
  2019-10-29  9:20     ` Adrian Hunter
  2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
  1 sibling, 1 reply; 34+ messages in thread
From: Jiri Olsa @ 2019-10-28 15:45 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, linux-kernel

On Fri, Oct 25, 2019 at 03:59:57PM +0300, Adrian Hunter wrote:
SNIP

>  
> -static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
> -				u64 offset, u8 *data, ssize_t size)
> +static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
> +				      u64 offset, u8 *data, ssize_t size,
> +				      bool out)
>  {
>  	if (dso__data_file_size(dso, machine))
>  		return -1;
> @@ -1034,7 +1037,7 @@ static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
>  	if (offset + size < offset)
>  		return -1;
>  
> -	return cached_read(dso, machine, offset, data, size);
> +	return cached_io(dso, machine, offset, data, size, out);

you renamed the function to _read_write_, so the bool choosing
the operation should be named either read or write... I had to
go all the way down to find out what 'out' means ;-)

jirka


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/6] perf dso: Refactor dso_cache__read()
  2019-10-28 15:39   ` Jiri Olsa
@ 2019-10-29  9:19     ` Adrian Hunter
  0 siblings, 0 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-29  9:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, linux-kernel

On 28/10/19 5:39 PM, Jiri Olsa wrote:
> On Fri, Oct 25, 2019 at 03:59:56PM +0300, Adrian Hunter wrote:
> 
> SNIP
> 
>> +}
>>  
>> -	return ret;
>> +static struct dso_cache *dso_cache__find(struct dso *dso,
>> +					 struct machine *machine,
>> +					 u64 offset,
>> +					 ssize_t *ret)
>> +{
>> +	struct dso_cache *cache = __dso_cache__find(dso, offset);
>> +
>> +	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
>>  }
>>  
>>  static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
>>  			      u64 offset, u8 *data, ssize_t size)
>>  {
>>  	struct dso_cache *cache;
>> +	ssize_t ret = 0;
>>  
>> -	cache = dso_cache__find(dso, offset);
>> -	if (cache)
>> -		return dso_cache__memcpy(cache, offset, data, size);
>> -	else
>> -		return dso_cache__read(dso, machine, offset, data, size);
>> +	cache = dso_cache__find(dso, machine, offset, &ret);
>> +	if (!cache)
>> +		return ret;
> 
> why not use the ERR_* macros to get error through the pointer
> instead of adding extra argument?
> 

OK, here's the diff for that:

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 460330d125b6..272545624fbe 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -3,6 +3,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/zalloc.h>
+#include <linux/err.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <sys/types.h>
@@ -865,30 +866,29 @@ static ssize_t file_read(struct dso *dso, struct machine *machine,
 
 static struct dso_cache *dso_cache__populate(struct dso *dso,
 					     struct machine *machine,
-					     u64 offset, ssize_t *ret)
+					     u64 offset)
 {
 	u64 cache_offset = offset & DSO__DATA_CACHE_MASK;
 	struct dso_cache *cache;
 	struct dso_cache *old;
+	ssize_t ret;
 
 	cache = zalloc(sizeof(*cache) + DSO__DATA_CACHE_SIZE);
-	if (!cache) {
-		*ret = -ENOMEM;
-		return NULL;
-	}
+	if (!cache)
+		return ERR_PTR(-ENOMEM);
 
 	if (dso->binary_type == DSO_BINARY_TYPE__BPF_PROG_INFO)
-		*ret = bpf_read(dso, cache_offset, cache->data);
+		ret = bpf_read(dso, cache_offset, cache->data);
 	else
-		*ret = file_read(dso, machine, cache_offset, cache->data);
+		ret = file_read(dso, machine, cache_offset, cache->data);
 
-	if (*ret <= 0) {
+	if (ret <= 0) {
 		free(cache);
-		return NULL;
+		return ERR_PTR(ret);
 	}
 
 	cache->offset = cache_offset;
-	cache->size   = *ret;
+	cache->size   = ret;
 
 	old = dso_cache__insert(dso, cache);
 	if (old) {
@@ -902,23 +902,20 @@ static struct dso_cache *dso_cache__populate(struct dso *dso,
 
 static struct dso_cache *dso_cache__find(struct dso *dso,
 					 struct machine *machine,
-					 u64 offset,
-					 ssize_t *ret)
+					 u64 offset)
 {
 	struct dso_cache *cache = __dso_cache__find(dso, offset);
 
-	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
+	return cache ? cache : dso_cache__populate(dso, machine, offset);
 }
 
 static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
 			      u64 offset, u8 *data, ssize_t size)
 {
-	struct dso_cache *cache;
-	ssize_t ret = 0;
+	struct dso_cache *cache = dso_cache__find(dso, machine, offset);
 
-	cache = dso_cache__find(dso, machine, offset, &ret);
-	if (!cache)
-		return ret;
+	if (IS_ERR_OR_NULL(cache))
+		return PTR_ERR(cache);
 
 	return dso_cache__memcpy(cache, offset, data, size);
 }


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr()
  2019-10-28 15:45   ` Jiri Olsa
@ 2019-10-29  9:20     ` Adrian Hunter
  0 siblings, 0 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-10-29  9:20 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, linux-kernel

On 28/10/19 5:45 PM, Jiri Olsa wrote:
> On Fri, Oct 25, 2019 at 03:59:57PM +0300, Adrian Hunter wrote:
> SNIP
> 
>>  
>> -static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
>> -				u64 offset, u8 *data, ssize_t size)
>> +static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
>> +				      u64 offset, u8 *data, ssize_t size,
>> +				      bool out)
>>  {
>>  	if (dso__data_file_size(dso, machine))
>>  		return -1;
>> @@ -1034,7 +1037,7 @@ static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
>>  	if (offset + size < offset)
>>  		return -1;
>>  
>> -	return cached_read(dso, machine, offset, data, size);
>> +	return cached_io(dso, machine, offset, data, size, out);
> 
> you renamed the function as _read_write_ so the bool choosing
> the opration should be named either read or write.. I had to
> go all the way down to find out what 'out' means ;-)
> 

Arnaldo already applied it, so here is the diff:

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 12ab26baba44..505ba78eda3c 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -829,12 +829,12 @@ dso_cache__insert(struct dso *dso, struct dso_cache *new)
 }
 
 static ssize_t dso_cache__memcpy(struct dso_cache *cache, u64 offset, u8 *data,
-				 u64 size, bool out)
+				 u64 size, bool read)
 {
 	u64 cache_offset = offset - cache->offset;
 	u64 cache_size   = min(cache->size - cache_offset, size);
 
-	if (out)
+	if (read)
 		memcpy(data, cache->data + cache_offset, cache_size);
 	else
 		memcpy(cache->data + cache_offset, data, cache_size);
@@ -912,14 +912,14 @@ static struct dso_cache *dso_cache__find(struct dso *dso,
 }
 
 static ssize_t dso_cache_io(struct dso *dso, struct machine *machine,
-			    u64 offset, u8 *data, ssize_t size, bool out)
+			    u64 offset, u8 *data, ssize_t size, bool read)
 {
 	struct dso_cache *cache = dso_cache__find(dso, machine, offset);
 
 	if (IS_ERR_OR_NULL(cache))
 		return PTR_ERR(cache);
 
-	return dso_cache__memcpy(cache, offset, data, size, out);
+	return dso_cache__memcpy(cache, offset, data, size, read);
 }
 
 /*
@@ -928,7 +928,7 @@ static ssize_t dso_cache_io(struct dso *dso, struct machine *machine,
  * by cached data. Writes update the cache only, not the backing file.
  */
 static ssize_t cached_io(struct dso *dso, struct machine *machine,
-			 u64 offset, u8 *data, ssize_t size, bool out)
+			 u64 offset, u8 *data, ssize_t size, bool read)
 {
 	ssize_t r = 0;
 	u8 *p = data;
@@ -936,7 +936,7 @@ static ssize_t cached_io(struct dso *dso, struct machine *machine,
 	do {
 		ssize_t ret;
 
-		ret = dso_cache_io(dso, machine, offset, p, size, out);
+		ret = dso_cache_io(dso, machine, offset, p, size, read);
 		if (ret < 0)
 			return ret;
 
@@ -1022,7 +1022,7 @@ off_t dso__data_size(struct dso *dso, struct machine *machine)
 
 static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
 				      u64 offset, u8 *data, ssize_t size,
-				      bool out)
+				      bool read)
 {
 	if (dso__data_file_size(dso, machine))
 		return -1;
@@ -1034,7 +1034,7 @@ static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
 	if (offset + size < offset)
 		return -1;
 
-	return cached_io(dso, machine, offset, data, size, out);
+	return cached_io(dso, machine, offset, data, size, read);
 }
 
 /**



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
@ 2019-10-30 10:47   ` Leo Yan
  2019-10-30 12:46     ` Peter Zijlstra
  2019-11-04 10:40   ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Leo Yan @ 2019-10-30 10:47 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Hi Adrian,

On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
> Record changes to kernel text (i.e. self-modifying code) in order to
> support tracers like Intel PT decoding through jump labels.
> 
> A copy of the running kernel code is needed as a reference point
> (e.g. from /proc/kcore). The text poke event records the old bytes
> and the new bytes so that the event can be processed forwards or backwards.
> 
> The text poke event has 'flags' to describe when the event is sent. At
> present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
> point at which tools should update their reference kernel executable to
> change the old bytes to the new bytes.
> 
> Other architectures may wish to emit a pair of text poke events. One before
> and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
> be set on only one of the pair.

Thanks a lot for the patch set.

Below is my understanding of the 'emit a pair of text poke events'
implementation you mentioned:

Since Arm64 instructions are fixed size, it doesn't need to rely on an
INT3-like mechanism to replace instructions and can directly use two
operations to alter an instruction (modify the instruction and flush the
icache line).  So Arm64 has no chance to send a perf event in the middle
of altering an instruction.

Thus we can send a pair of text poke events in the kernel:

  perf_event_text_poke(PERF_TEXT_POKE_UPDATE_PREV)

    Change instruction
    Flush icache

  perf_event_text_poke(PERF_TEXT_POKE_UPDATE_POST)

In userspace, if the perf tool detects that the instruction changed
from nop to branch, we need to update the dso cache for the
'PERF_TEXT_POKE_UPDATE_PREV' event; if it detects that the instruction
changed from branch to nop, we need to update the dso cache for the
'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
branch instructions are safely contained in the dso file and any
branch samples read out the correct branch instruction.

Could you confirm this matches your understanding?  Or did I miss
anything?  I personally even think the paired events can be used for
different arches (e.g. the solution can be reused on Arm64/x86, etc.).
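
In rough code form, the proposal would look something like the
hypothetical arm64 helper below.  The PERF_TEXT_POKE_UPDATE_PREV/POST
flags do not exist in this series; they only illustrate the paired
events, and the exact patching call is an assumption for the sketch:

#include <linux/perf_event.h>	/* perf_event_text_poke() from patch 1/6 */
#include <asm/insn.h>		/* aarch64_insn_patch_text_nosync() */

/* Hypothetical sketch only: emit one event before and one after the patch. */
static int example_arm64_text_poke(void *addr, u32 old_insn, u32 new_insn)
{
	int ret;

	perf_event_text_poke((unsigned long)addr, PERF_TEXT_POKE_UPDATE_PREV,
			     &old_insn, &new_insn, sizeof(new_insn));

	/* modify the instruction and flush the icache line */
	ret = aarch64_insn_patch_text_nosync(addr, new_insn);

	perf_event_text_poke((unsigned long)addr, PERF_TEXT_POKE_UPDATE_POST,
			     &old_insn, &new_insn, sizeof(new_insn));

	return ret;
}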

> In the case of Intel PT tracing, the executable code must be walked to
> reconstruct the control flow. For x86 a jump label text poke begins:
>   - write INT3 byte
>   - IPI-SYNC
> At this point the actual control flow will be through the INT3 and handler
> and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
> for the INT3, so the flow can still be decoded. Subsequently:
>   - emit RECORD_TEXT_POKE with the new instruction

s/RECORD_TEXT_POKE/PERF_TEXT_POKE_UPDATE

And I think the perf event is emitted after the item 'write instruction
tail', not before it.  So the commit log is not consistent with
the change in the patch.

>   - write instruction tail
>   - IPI-SYNC
>   - write first byte
>   - IPI-SYNC
> So before the text poke event timestamp, the decoder will see either the
> old instruction flow or FUP/TIP of INT3. After the text poke event
> timestamp, the decoder will see either the new instruction flow or FUP/TIP

I will check internally for timestamp on Arm/Arm64 ...

Thanks,
Leo Yan

> of INT3. Thus decoders can use the timestamp as the point at which to
> modify the executable code.
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/include/asm/text-patching.h |  1 +
>  arch/x86/kernel/alternative.c        | 39 +++++++++++-
>  include/linux/perf_event.h           |  6 ++
>  include/uapi/linux/perf_event.h      | 28 ++++++++-
>  kernel/events/core.c                 | 90 +++++++++++++++++++++++++++-
>  5 files changed, 161 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 5e8319bb207a..873141765d8e 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -30,6 +30,7 @@ struct text_poke_loc {
>  	void *addr;
>  	size_t len;
>  	const char opcode[POKE_MAX_OPCODE_SIZE];
> +	u8 old;
>  };
>  
>  extern void text_poke_early(void *addr, const void *opcode, size_t len);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 9d3a971ea364..9488f0fa7e22 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -3,6 +3,7 @@
>  
>  #include <linux/module.h>
>  #include <linux/sched.h>
> +#include <linux/perf_event.h>
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/stringify.h>
> @@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
>  	/*
>  	 * First step: add a int3 trap to the address that will be patched.
>  	 */
> -	for (i = 0; i < nr_entries; i++)
> +	for (i = 0; i < nr_entries; i++) {
> +		tp[i].old = *(u8 *)tp[i].addr;
>  		text_poke(tp[i].addr, &int3, sizeof(int3));
> +	}
>  
>  	on_each_cpu(do_sync_core, NULL, 1);
>  
> @@ -1054,12 +1057,46 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
>  	 * Second step: update all but the first byte of the patched range.
>  	 */
>  	for (i = 0; i < nr_entries; i++) {
> +		u8 old[POKE_MAX_OPCODE_SIZE] = { tp[i].old, };
> +
> +		memcpy(old + sizeof(int3),
> +		       tp[i].addr + sizeof(int3),
> +		       tp[i].len - sizeof(int3));
> +
>  		if (tp[i].len - sizeof(int3) > 0) {
>  			text_poke((char *)tp[i].addr + sizeof(int3),
>  				  (const char *)tp[i].opcode + sizeof(int3),
>  				  tp[i].len - sizeof(int3));
>  			patched_all_but_first++;
>  		}
> +
> +		/*
> +		 * Emit a perf event to record the text poke, primarily to
> +		 * support Intel PT decoding which must walk the executable code
> +		 * to reconstruct the trace. The flow up to here is:
> +		 *   - write INT3 byte
> +		 *   - IPI-SYNC
> +		 * At this point the actual control flow will be through the
> +		 * INT3 and handler and not hit the old or new instruction.
> +		 * Intel PT outputs FUP/TIP packets for the INT3, so the flow
> +		 * can still be decoded. Subsequently:
> +		 *   - emit RECORD_TEXT_POKE with the new instruction
> +		 *   - write instruction tail
> +		 *   - IPI-SYNC
> +		 *   - write first byte
> +		 *   - IPI-SYNC
> +		 * So before the text poke event timestamp, the decoder will see
> +		 * either the old instruction flow or FUP/TIP of INT3. After the
> +		 * text poke event timestamp, the decoder will see either the
> +		 * new instruction flow or FUP/TIP of INT3. Thus decoders can
> +		 * use the timestamp as the point at which to modify the
> +		 * executable code.
> +		 * The old instruction is recorded so that the event can be
> +		 * processed forwards or backwards.
> +		 */
> +		perf_event_text_poke((unsigned long)tp[i].addr,
> +				     PERF_TEXT_POKE_UPDATE, old, tp[i].opcode,
> +				     tp[i].len);
>  	}
>  
>  	if (patched_all_but_first) {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 61448c19a132..db88b7ade8c6 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
>  extern void perf_event_comm(struct task_struct *tsk, bool exec);
>  extern void perf_event_namespaces(struct task_struct *tsk);
>  extern void perf_event_fork(struct task_struct *tsk);
> +extern void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
> +				 const void *new_bytes, size_t len);
>  
>  /* Callchains */
>  DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
> @@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)				{ }
>  static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
>  static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
>  static inline void perf_event_fork(struct task_struct *tsk)		{ }
> +static inline void perf_event_text_poke(unsigned long addr, u16 flags,
> +					const void *old_bytes,
> +					const void *new_bytes,
> +					size_t len)			{ }
>  static inline void perf_event_init(void)				{ }
>  static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
>  static inline void perf_swevent_put_recursion_context(int rctx)		{ }
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index bb7b271397a6..c8d1f52a7fce 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -375,7 +375,8 @@ struct perf_event_attr {
>  				ksymbol        :  1, /* include ksymbol events */
>  				bpf_event      :  1, /* include bpf events */
>  				aux_output     :  1, /* generate AUX records instead of events */
> -				__reserved_1   : 32;
> +				text_poke      :  1, /* include text poke events */
> +				__reserved_1   : 31;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */
> @@ -1000,6 +1001,26 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_BPF_EVENT			= 18,
>  
> +	/*
> +	 * Records changes to kernel text i.e. self-modified code.
> +	 * 'flags' has PERF_TEXT_POKE_UPDATE (i.e. bit 0) set to
> +	 * indicate to tools to update old bytes to new bytes in the
> +	 * executable image.
> +	 * 'len' is the number of old bytes, which is the same as the number
> +	 * of new bytes. 'bytes' contains the old bytes followed immediately
> +	 * by the new bytes.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u64				addr;
> +	 *	u16				flags;
> +	 *	u16				len;
> +	 *	u8				bytes[];
> +	 *	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_TEXT_POKE			= 19,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
>  
> @@ -1041,6 +1062,11 @@ enum perf_callchain_context {
>  #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
>  #define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
>  
> +/**
> + * PERF_RECORD_TEXT_POKE::flags bits
> + */
> +#define PERF_TEXT_POKE_UPDATE		0x01	/* update old bytes to new bytes */
> +
>  #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
>  #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
>  #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 9ec0b0bfddbd..666e094e8ed9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
>  static atomic_t nr_switch_events __read_mostly;
>  static atomic_t nr_ksymbol_events __read_mostly;
>  static atomic_t nr_bpf_events __read_mostly;
> +static atomic_t nr_text_poke_events __read_mostly;
>  
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
> @@ -4351,7 +4352,7 @@ static bool is_sb_event(struct perf_event *event)
>  	if (attr->mmap || attr->mmap_data || attr->mmap2 ||
>  	    attr->comm || attr->comm_exec ||
>  	    attr->task || attr->ksymbol ||
> -	    attr->context_switch ||
> +	    attr->context_switch || attr->text_poke ||
>  	    attr->bpf_event)
>  		return true;
>  	return false;
> @@ -4425,6 +4426,8 @@ static void unaccount_event(struct perf_event *event)
>  		atomic_dec(&nr_ksymbol_events);
>  	if (event->attr.bpf_event)
>  		atomic_dec(&nr_bpf_events);
> +	if (event->attr.text_poke)
> +		atomic_dec(&nr_text_poke_events);
>  
>  	if (dec) {
>  		if (!atomic_add_unless(&perf_sched_count, -1, 1))
> @@ -8070,6 +8073,89 @@ void perf_event_bpf_event(struct bpf_prog *prog,
>  	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
>  }
>  
> +struct perf_text_poke_event {
> +	const void		*old_bytes;
> +	const void		*new_bytes;
> +	size_t			pad;
> +	u16			flags;
> +	u16			len;
> +
> +	struct {
> +		struct perf_event_header	header;
> +
> +		u64				addr;
> +	} event_id;
> +};
> +
> +static int perf_event_text_poke_match(struct perf_event *event)
> +{
> +	return event->attr.text_poke;
> +}
> +
> +static void perf_event_text_poke_output(struct perf_event *event, void *data)
> +{
> +	struct perf_text_poke_event *text_poke_event = data;
> +	struct perf_output_handle handle;
> +	struct perf_sample_data sample;
> +	u64 padding = 0;
> +	int ret;
> +
> +	if (!perf_event_text_poke_match(event))
> +		return;
> +
> +	perf_event_header__init_id(&text_poke_event->event_id.header, &sample, event);
> +
> +	ret = perf_output_begin(&handle, event, text_poke_event->event_id.header.size);
> +	if (ret)
> +		return;
> +
> +	perf_output_put(&handle, text_poke_event->event_id);
> +	perf_output_put(&handle, text_poke_event->flags);
> +	perf_output_put(&handle, text_poke_event->len);
> +
> +	__output_copy(&handle, text_poke_event->old_bytes, text_poke_event->len);
> +	__output_copy(&handle, text_poke_event->new_bytes, text_poke_event->len);
> +
> +	if (text_poke_event->pad)
> +		__output_copy(&handle, &padding, text_poke_event->pad);
> +
> +	perf_event__output_id_sample(event, &handle, &sample);
> +
> +	perf_output_end(&handle);
> +}
> +
> +void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
> +			  const void *new_bytes, size_t len)
> +{
> +	struct perf_text_poke_event text_poke_event;
> +	size_t tot, pad;
> +
> +	if (!atomic_read(&nr_text_poke_events))
> +		return;
> +
> +	tot  = sizeof(text_poke_event.flags) +
> +	       sizeof(text_poke_event.len) + (len << 1);
> +	pad  = ALIGN(tot, sizeof(u64)) - tot;
> +
> +	text_poke_event = (struct perf_text_poke_event){
> +		.old_bytes    = old_bytes,
> +		.new_bytes    = new_bytes,
> +		.pad          = pad,
> +		.flags        = flags,
> +		.len          = len,
> +		.event_id  = {
> +			.header = {
> +				.type = PERF_RECORD_TEXT_POKE,
> +				.misc = PERF_RECORD_MISC_KERNEL,
> +				.size = sizeof(text_poke_event.event_id) + tot + pad,
> +			},
> +			.addr = addr,
> +		},
> +	};
> +
> +	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
> +}
> +
>  void perf_event_itrace_started(struct perf_event *event)
>  {
>  	event->attach_state |= PERF_ATTACH_ITRACE;
> @@ -10356,6 +10442,8 @@ static void account_event(struct perf_event *event)
>  		atomic_inc(&nr_ksymbol_events);
>  	if (event->attr.bpf_event)
>  		atomic_inc(&nr_bpf_events);
> +	if (event->attr.text_poke)
> +		atomic_inc(&nr_text_poke_events);
>  
>  	if (inc) {
>  		/*
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-30 10:47   ` Leo Yan
@ 2019-10-30 12:46     ` Peter Zijlstra
  2019-10-30 14:19       ` Leo Yan
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2019-10-30 12:46 UTC (permalink / raw)
  To: Leo Yan
  Cc: Adrian Hunter, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

On Wed, Oct 30, 2019 at 06:47:47PM +0800, Leo Yan wrote:
> Hi Adrian,
> 
> On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
> > Record changes to kernel text (i.e. self-modifying code) in order to
> > support tracers like Intel PT decoding through jump labels.
> > 
> > A copy of the running kernel code is needed as a reference point
> > (e.g. from /proc/kcore). The text poke event records the old bytes
> > and the new bytes so that the event can be processed forwards or backwards.
> > 
> > The text poke event has 'flags' to describe when the event is sent. At
> > present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
> > point at which tools should update their reference kernel executable to
> > change the old bytes to the new bytes.
> > 
> > Other architectures may wish to emit a pair of text poke events. One before
> > and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
> > be set on only one of the pair.
> 
> Thanks a lot for the patch set.
> 
> Below is my understanding for implementation 'emit a pair of text poke
> events' as you mentioned:
> 
> Since Arm64 instruction is fixed size, it doesn't need to rely on INT3
> liked mechanism to replace instructions and directly uses two operations
> to alter instruction (modify instruction and flush icache line).  So
> Arm64 has no chance to send perf event in the middle of altering
> instruction.

Right.

> Thus we can send pair of text poke events in the kernel:
> 
>   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_PREV)
> 
>     Change instruction
>     Flush icache
> 
>   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_POST)
> 
> In the userspace, if perf tool detects the instruction is changed
> from nop to branch,

There is _far_ more text poking than just jump_label's NOP/JMP
transitions. Ftrace also does NOP/CALL, CALL/CALL, and the static_call
infrastructure that I posted here:

  https://lkml.kernel.org/r/20191007082708.013939311@infradead.org

Has: JMP/RET and JMP/JMP patching, and, if it has inline patching
support, NOP/CALL and CALL/CALL patching.

Anyway, the argument below doesn't care much either way; it works for
NOP/JMP just fine.

> we need to update dso cache for the
> 'PERF_TEXT_POKE_UPDATE_PREV' event; if detect the instruction is
> changed from branch to nop, we need to update dso cache for
> 'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
> branch instructions can be safely contained in the dso file and any
> branch samples can read out correct branch instruction.
> 
> Could you confirm this is the same with your understanding?  Or I miss
> anything?  I personally even think the pair events can be used for
> different arches (e.g. the solution can be reused on Arm64/x86, etc).

So the problem we have with PT is that it is a bit-stream of
branch taken/not-taken decisions. In order to decode that we need to
have an accurate view of the unconditional code flow.

Both NOP/JMP are unconditional and we need to exactly know which of the
two was encountered.

With your scheme, I don't see how we can ever actually know that. When
we get the PRE event, all we really know is that we're going to change
a specific instruction into another. And at the POST event we know it
has been done. But in between these two events, we have no clue which of
the two instructions is live on which CPU (two CPUs might in fact have a
different live instruction at the same time).

This means we _cannot_ unambiguously decode a taken/not-taken decision
stream.

Does CS have this same problem, and how would the PRE/POST events help
with that?

So our (x86) horrible (variable instruction length induced) complexity
for text poking is actually in our advantage this one time. The
exception and exception-return paths allow us to unambiguously know what
happens around the time of poking.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-30 12:46     ` Peter Zijlstra
@ 2019-10-30 14:19       ` Leo Yan
  2019-10-30 15:00         ` Mike Leach
  2019-10-30 16:23         ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Leo Yan @ 2019-10-30 14:19 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Leach
  Cc: Adrian Hunter, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Mark Rutland, Alexander Shishkin, Mathieu Poirier,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

Hi Peter,

[ + Mike Leach ]

On Wed, Oct 30, 2019 at 01:46:59PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2019 at 06:47:47PM +0800, Leo Yan wrote:
> > Hi Adrian,
> > 
> > On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
> > > Record changes to kernel text (i.e. self-modifying code) in order to
> > > support tracers like Intel PT decoding through jump labels.
> > > 
> > > A copy of the running kernel code is needed as a reference point
> > > (e.g. from /proc/kcore). The text poke event records the old bytes
> > > and the new bytes so that the event can be processed forwards or backwards.
> > > 
> > > The text poke event has 'flags' to describe when the event is sent. At
> > > present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
> > > point at which tools should update their reference kernel executable to
> > > change the old bytes to the new bytes.
> > > 
> > > Other architectures may wish to emit a pair of text poke events. One before
> > > and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
> > > be set on only one of the pair.
> > 
> > Thanks a lot for the patch set.
> > 
> > Below is my understanding for implementation 'emit a pair of text poke
> > events' as you mentioned:
> > 
> > Since Arm64 instruction is fixed size, it doesn't need to rely on INT3
> > liked mechanism to replace instructions and directly uses two operations
> > to alter instruction (modify instruction and flush icache line).  So
> > Arm64 has no chance to send perf event in the middle of altering
> > instruction.
> 
> Right.
> 
> > Thus we can send pair of text poke events in the kernel:
> > 
> >   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_PREV)
> > 
> >     Change instruction
> >     Flush icache
> > 
> >   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_POST)
> > 
> > In the userspace, if perf tool detects the instruction is changed
> > from nop to branch,
> 
> There is _far_ more text poking than just jump_label's NOP/JMP
> transitions. Ftrace also does NOP/CALL, CALL/CALL, and the static_call
> infrastructure that I posted here:
> 
>   https://lkml.kernel.org/r/20191007082708.013939311@infradead.org
> 
> Has: JMP/RET, JMP/JMP and if it has inline patching support: NOP/CALL,
> CALL/CALL, patching.

Thanks for the info.  I took a bit of time to look at your patch set and
Steven's patch set for dynamic functions; though I don't completely
understand them, I get more sense of the context.

> Anyway, the below argument doesn't care much, it works for NOP/JMP just
> fine.

We can support the NOP/JMP case as the first step, and later we can
extend it to support other transitions.

> > we need to update dso cache for the
> > 'PERF_TEXT_POKE_UPDATE_PREV' event; if detect the instruction is
> > changed from branch to nop, we need to update dso cache for
> > 'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
> > branch instructions can be safely contained in the dso file and any
> > branch samples can read out correct branch instruction.
> > 
> > Could you confirm this is the same with your understanding?  Or I miss
> > anything?  I personally even think the pair events can be used for
> > different arches (e.g. the solution can be reused on Arm64/x86, etc).
> 
> So the problem we have with PT is that it is a bit-stream of
> branch taken/not-taken decisions. In order to decode that we need to
> have an accurate view of the unconditional code flow.
> 
> Both NOP/JMP are unconditional and we need to exactly know which of the
> two was encountered.

If I understand correctly, the PT decoder needs to read out instructions
from the dso and decide the instruction type (NOP or JMP), and finally
generate the accurate code flow.

So the PT decoder relies on the (cached) DSO for decoding.  As far as I
know, this might be different from Arm CS, since the Arm CS decoder
merely generates packets and doesn't need to rely on the DSO for
decoding.

I think my answer is not very convincing; in case I mislead, I'm looping
in Mike to confirm this.

> With your scheme, I don't see how we can ever actually know that. When
> we get the PRE event, all we really know is that we're going to change
> a specific instruction into another. And at the POST event we know it
> has been done. But in between these two events, we have no clue which of
> the two instructions is live on which CPU (two CPUs might in fact have a
> different live instruction at the same time).
>
> This means we _cannot_ unambiguously decode a taken/not-taken decision
> stream.
> 
> Does CS have this same problem, and how would the PRE/POST events help
> with that?

My purpose is to use the PRE event and POST event to update the cached
DSO, so the perf tool can read out 'correct' instructions and fill them
into instruction/branch samples.  On the other hand, as mentioned, I am
not aware that the instructions read out from the DSO are used by the
Arm CS decoder; Arm CS also reads the instructions from the DSO, usually
to decide the instruction size or the sample flags (flags for
CALL/RETURN/hw int, etc.), but they aren't used for decoding.

@Mike, please help confirm this as well, from the Arm CoreSight
decoder's perspective.

> So our (x86) horrible (variable instruction length induced) complexity
> for text poking is actually in our advantage this one time. The
> exception and exception-return paths allow us to unambiguously know what
> happens around the time of poking.

Thanks a lot for the info!
Leo.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-30 14:19       ` Leo Yan
@ 2019-10-30 15:00         ` Mike Leach
  2019-10-30 16:23         ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Leach @ 2019-10-30 15:00 UTC (permalink / raw)
  To: Leo Yan
  Cc: Peter Zijlstra, Adrian Hunter, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Mark Rutland, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

Hi Leo,

On Wed, 30 Oct 2019 at 14:20, Leo Yan <leo.yan@linaro.org> wrote:
>
> Hi Peter,
>
> [ + Mike Leach ]
>
> On Wed, Oct 30, 2019 at 01:46:59PM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 30, 2019 at 06:47:47PM +0800, Leo Yan wrote:
> > > Hi Adrian,
> > >
> > > On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
> > > > Record changes to kernel text (i.e. self-modifying code) in order to
> > > > support tracers like Intel PT decoding through jump labels.
> > > >
> > > > A copy of the running kernel code is needed as a reference point
> > > > (e.g. from /proc/kcore). The text poke event records the old bytes
> > > > and the new bytes so that the event can be processed forwards or backwards.
> > > >
> > > > The text poke event has 'flags' to describe when the event is sent. At
> > > > present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
> > > > point at which tools should update their reference kernel executable to
> > > > change the old bytes to the new bytes.
> > > >
> > > > Other architectures may wish to emit a pair of text poke events. One before
> > > > and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
> > > > be set on only one of the pair.
> > >
> > > Thanks a lot for the patch set.
> > >
> > > Below is my understanding for implementation 'emit a pair of text poke
> > > events' as you mentioned:
> > >
> > > Since Arm64 instruction is fixed size, it doesn't need to rely on INT3
> > > liked mechanism to replace instructions and directly uses two operations
> > > to alter instruction (modify instruction and flush icache line).  So
> > > Arm64 has no chance to send perf event in the middle of altering
> > > instruction.
> >
> > Right.
> >
> > > Thus we can send pair of text poke events in the kernel:
> > >
> > >   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_PREV)
> > >
> > >     Change instruction
> > >     Flush icache
> > >
> > >   perf_event_text_poke(PERF_TEXT_POKE_UPDATE_POST)
> > >
> > > In the userspace, if perf tool detects the instruction is changed
> > > from nop to branch,
> >
> > There is _far_ more text poking than just jump_label's NOP/JMP
> > transitions. Ftrace also does NOP/CALL, CALL/CALL, and the static_call
> > infrastructure that I posted here:
> >
> >   https://lkml.kernel.org/r/20191007082708.013939311@infradead.org
> >
> > Has: JMP/RET, JMP/JMP and if it has inline patching support: NOP/CALL,
> > CALL/CALL, patching.
>
> Thanks for the info.  I took a bit time to look your patch set and
> Steven's patch set for dynamic function, though don't completely
> understand, but get more sense for the context.
>
> > Anyway, the below argument doesn't care much, it works for NOP/JMP just
> > fine.
>
> We can support NOP/JMP case as the first step, but later should can
> extend to support other transitions.
>
> > > we need to update dso cache for the
> > > 'PERF_TEXT_POKE_UPDATE_PREV' event; if detect the instruction is
> > > changed from branch to nop, we need to update dso cache for
> > > 'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
> > > branch instructions can be safely contained in the dso file and any
> > > branch samples can read out correct branch instruction.
> > >
> > > Could you confirm this is the same with your understanding?  Or I miss
> > > anything?  I personally even think the pair events can be used for
> > > different arches (e.g. the solution can be reused on Arm64/x86, etc).
> >
> > So the problem we have with PT is that it is a bit-stream of
> > branch taken/not-taken decisions. In order to decode that we need to
> > have an accurate view of the unconditional code flow.
> >
> > Both NOP/JMP are unconditional and we need to exactly know which of the
> > two was encountered.
>
> If I understand correctly, PT decoder needs to read out instructions
> from dso and decide the instruction type (NOP or JMP), and finally
> generate the accurate code flow.
>
> So PT decoder relies on (cached) DSO for decoding.  As I know, this
> might be different from Arm CS, since Arm CS decoder is merely
> generate packets and it doesn't need to rely on DSO for decoding.
>
> I think my answer is not very convinced, in case I mislead, loop Mike
> to confirm for this.
>
The CoreSight decoder needs the same information. When a branch / call
instruction is traced, an atom is generated in the trace to determine
if that call was taken or not.
During the decode process the decoder walks the code image to match up
atoms to call / branch instructions - which means for correct trace
decode we must have an accurate code image as executed, as otherwise
trace decode will go wrong.
Within perf, the OpenCSD decoder will call back into perf to ask for
the memory image for a given address - which perf will supply from the
list of .so / executable files it has in the .debug directory.

Self modifying code presents an issue in this respect.

Correlation of the modification information with the captured trace
becomes the primary concern.

Regards

Mike


> > With your scheme, I don't see how we can ever actually know that. When
> > we get the PRE event, all we really know is that we're going to change
> > a specific instruction into another. And at the POST event we know it
> > has been done. But in between these two events, we have no clue which of
> > the two instructions is live on which CPU (two CPUs might in fact have a
> > different live instruction at the same time).
> >
> > This means we _cannot_ unambiguously decode a taken/not-taken decision
> > stream.
> >
> > Does CS have this same problem, and how would the PRE/POST events help
> > with that?
>
> My purpose is to use PRE event and POST event to update cached DSO,
> thus perf tool can read out 'correct' instructions and fill them into
> instruction/branch samples.  On the other hand, as mentioned, I don't
> aware the instruction read out from DSO will be used for Arm CS decoder;
> Arm CS also reads the instructions from DSO, usually it's used to decide
> instruction size or decide the sample flags (flags for CALL/RETURN/hw
> int, etc...), but it isn't used for decoder.
>
> @Mike, please help confirm for this as well, from Arm CoreSight's
> decoder pespective.
>
> > So our (x86) horrible (variable instruction length induced) complexity
> > for text poking is actually in our advantage this one time. The
> > exception and exception-return paths allow us to unambiguously know what
> > happens around the time of poking.
>
> Thanks a lot for the info!
> Leo.



-- 
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-30 14:19       ` Leo Yan
  2019-10-30 15:00         ` Mike Leach
@ 2019-10-30 16:23         ` Peter Zijlstra
  2019-10-31  7:31           ` Leo Yan
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2019-10-30 16:23 UTC (permalink / raw)
  To: Leo Yan
  Cc: Mike Leach, Adrian Hunter, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Mark Rutland, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

On Wed, Oct 30, 2019 at 10:19:50PM +0800, Leo Yan wrote:
> On Wed, Oct 30, 2019 at 01:46:59PM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 30, 2019 at 06:47:47PM +0800, Leo Yan wrote:

> > Anyway, the below argument doesn't care much, it works for NOP/JMP just
> > fine.
> 
> We can support NOP/JMP case as the first step, but later should can
> extend to support other transitions.

Since all the instructions involved (with the possible exception of RET)
are unconditional (NOP, JMP, CALL), it makes no real difference to the
argument below.

( I'm thinking RET might be special in that it reads the return address
from the stack and therefore must emit the whole IP into the stream, as
we cannot know the stack state )

> > > we need to update dso cache for the
> > > 'PERF_TEXT_POKE_UPDATE_PREV' event; if detect the instruction is
> > > changed from branch to nop, we need to update dso cache for
> > > 'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
> > > branch instructions can be safely contained in the dso file and any
> > > branch samples can read out correct branch instruction.
> > > 
> > > Could you confirm this is the same with your understanding?  Or I miss
> > > anything?  I personally even think the pair events can be used for
> > > different arches (e.g. the solution can be reused on Arm64/x86, etc).
> > 
> > So the problem we have with PT is that it is a bit-stream of
> > branch taken/not-taken decisions. In order to decode that we need to
> > have an accurate view of the unconditional code flow.
> > 
> > Both NOP/JMP are unconditional and we need to exactly know which of the
> > two was encountered.
> 
> If I understand correctly, PT decoder needs to read out instructions
> from dso and decide the instruction type (NOP or JMP), and finally
> generate the accurate code flow.
> 
> So PT decoder relies on (cached) DSO for decoding.  As I know, this
> might be different from Arm CS, since Arm CS decoder is merely
> generate packets and it doesn't need to rely on DSO for decoding.

Given a start point (from a start or sync packet) we scan the
instruction stream forward until the first conditional branch
instruction. Then we consume the next available branch decision bit to
know where to continue.

So yes, we need to have a correct text image available for this to work.
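
Schematically, the decode step just described looks like the loop below.
This is an illustration only, not the Intel PT decoder; read_insn(),
next_tnt_bit() and struct insn_info are assumptions for the sketch:

#include <stdbool.h>
#include <stdint.h>

struct insn_info {
	bool		is_branch;
	bool		conditional;
	uint64_t	target;		/* branch target */
	uint64_t	next_ip;	/* fall-through address */
};

/*
 * Walk the text image from 'ip' until the first conditional branch, then
 * use one taken/not-taken bit from the trace to pick the next IP.  If the
 * image bytes are wrong (e.g. a poked NOP vs JMP), the walk diverges and
 * the bit is applied to the wrong branch.
 */
static uint64_t decode_to_next_cond_branch(uint64_t ip,
					   void (*read_insn)(uint64_t ip, struct insn_info *insn),
					   bool (*next_tnt_bit)(void))
{
	struct insn_info insn;

	for (;;) {
		read_insn(ip, &insn);	/* needs the *correct* text image */
		if (insn.is_branch && insn.conditional)
			return next_tnt_bit() ? insn.target : insn.next_ip;
		/* unconditional branches continue at the target, others fall through */
		ip = insn.is_branch ? insn.target : insn.next_ip;
	}
}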

> > With your scheme, I don't see how we can ever actually know that. When
> > we get the PRE event, all we really know is that we're going to change
> > a specific instruction into another. And at the POST event we know it
> > has been done. But in between these two events, we have no clue which of
> > the two instructions is live on which CPU (two CPUs might in fact have a
> > different live instruction at the same time).
> >
> > This means we _cannot_ unambiguously decode a taken/not-taken decision
> > stream.
> > 
> > Does CS have this same problem, and how would the PRE/POST events help
> > with that?
> 
> My purpose is to use PRE event and POST event to update cached DSO,
> thus perf tool can read out 'correct' instructions and fill them into
> instruction/branch samples.

The thing is, as I argued, the instruction state between PRE and POST is
ambiguous. This makes it impossible to decode the branch decision
stream.

Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
have CPU1 covering the instruction at T3.

How do you decide where CPU1 goes and what the next conditional branch
is?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-30 16:23         ` Peter Zijlstra
@ 2019-10-31  7:31           ` Leo Yan
  2019-11-01 10:04             ` Peter Zijlstra
       [not found]             ` <CAJ9a7VgZH7g=rFDpKf=FzEcyBVLS_WjqbrqtRnjOi7WOY4st+w@mail.gmail.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Leo Yan @ 2019-10-31  7:31 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon, Mark Rutland
  Cc: Mike Leach, Adrian Hunter, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Mark Rutland, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

[ + Will, Mark ]

On Wed, Oct 30, 2019 at 05:23:25PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2019 at 10:19:50PM +0800, Leo Yan wrote:
> > On Wed, Oct 30, 2019 at 01:46:59PM +0100, Peter Zijlstra wrote:
> > > On Wed, Oct 30, 2019 at 06:47:47PM +0800, Leo Yan wrote:
> 
> > > Anyway, the below argument doesn't care much, it works for NOP/JMP just
> > > fine.
> > 
> > We can support NOP/JMP case as the first step, but later should can
> > extend to support other transitions.
> 
> Since all instructions (with the possible exception of RET) are
> unconditional branch instructions: NOP, JMP, CALL. It makes no read
> difference to the argument below.
> 
> ( I'm thinking RET might be special in that it reads the return address
> from the stack and therefore must emit the whole IP into the stream, as
> we cannot know the stack state )

To be honest, I don't know the exact format for 'ret' in a CoreSight
trace, so I would like to leave this to Mike.

Since Mike has confirmed that Arm CoreSight also needs the accurate
code for decoding branch/call instructions, there is no doubt that we
need to implement the same mechanism here to update the DSO for an
accurate code flow.  So the question is how to do this on Arm/Arm64 :)

Before moving forward, I'd like to step back to describe clearly the
current problem on Arm64 and check one question about jump labels:

I checked the kernel code: both kprobe and ftrace use stop_machine()
to alter instructions.  Since all CPUs run into stop_machine()'s
synchronization, there is no race condition between the instruction
transition and CPUs executing the altered instruction; thus it's safe
for kprobe and ftrace to use the perf event PERF_TEXT_POKE_UPDATE to
notify of the instruction transition, and it allows us to read out the
'correct' instruction for the decoder.

But jump label doesn't use stop_machine(), so the perf event
PERF_TEXT_POKE_UPDATE will introduce a race condition as below (let's
look at the example of a transition from nop to branch):

              CPU0                                      CPU1
  NOP instruction
   `-> static_key_enable()
        `-> aarch64_insn_patch_text_nosync()
             `-> perf event PERF_TEXT_POKE_UPDATE
                                                     -> Execute nop
                                                        instruction
             `-> aarch64_insn_write()
             `-> __flush_icache_range()

Since the x86 platform has INT3 as an intermediate state, it can avoid
the race condition between CPU0 (which does the transition) and other
CPUs (which may execute the nop/branch).

> > > > we need to update dso cache for the
> > > > 'PERF_TEXT_POKE_UPDATE_PREV' event; if detect the instruction is
> > > > changed from branch to nop, we need to update dso cache for
> > > > 'PERF_TEXT_POKE_UPDATE_POST' event.  The main idea is to ensure the
> > > > branch instructions can be safely contained in the dso file and any
> > > > branch samples can read out correct branch instruction.
> > > > 
> > > > Could you confirm this is the same with your understanding?  Or I miss
> > > > anything?  I personally even think the pair events can be used for
> > > > different arches (e.g. the solution can be reused on Arm64/x86, etc).
> > > 
> > > So the problem we have with PT is that it is a bit-stream of
> > > branch taken/not-taken decisions. In order to decode that we need to
> > > have an accurate view of the unconditional code flow.
> > > 
> > > Both NOP/JMP are unconditional and we need to exactly know which of the
> > > two was encountered.
> > 
> > If I understand correctly, PT decoder needs to read out instructions
> > from dso and decide the instruction type (NOP or JMP), and finally
> > generate the accurate code flow.
> > 
> > So PT decoder relies on (cached) DSO for decoding.  As I know, this
> > might be different from Arm CS, since Arm CS decoder is merely
> > generate packets and it doesn't need to rely on DSO for decoding.
> 
> Given a start point (from a start or sync packet) we scan the
> instruction stream forward until the first conditional branch
> instruction. Then we consume the next available branch decision bit to
> know where to continue.
> 
> So yes, we need to have a correct text image available for this to work.
> 
> > > With your scheme, I don't see how we can ever actually know that. When
> > > we get the PRE event, all we really know is that we're going to change
> > > a specific instruction into another. And at the POST event we know it
> > > has been done. But in between these two events, we have no clue which of
> > > the two instructions is live on which CPU (two CPUs might in fact have a
> > > different live instruction at the same time).
> > >
> > > This means we _cannot_ unambiguously decode a taken/not-taken decision
> > > stream.
> > > 
> > > Does CS have this same problem, and how would the PRE/POST events help
> > > with that?
> > 
> > My purpose is to use PRE event and POST event to update cached DSO,
> > thus perf tool can read out 'correct' instructions and fill them into
> > instruction/branch samples.
> 
> The thing is, as I argued, the instruction state between PRE and POST is
> ambiguous. This makes it impossible to decode the branch decision
> stream.
> 
> Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
> have CPU1 covering the instruction at T3.
> 
> How do you decide where CPU1 goes and what the next conditional branch
> is?

Sorry, that was not well thought through on my part.

I agree that T3 is an uncertain state in the flow below:

      CPU0                                             CPU1
  perf event PERF_TEXT_POKE_UPDATE_PRE   -> T1

    Int3 / NOP                                       -> T3

    Int3 / branch                                    -> T3'

  perf event PERF_TEXT_POKE_UPDATE_POST  -> T5

Unless the trace has extra info and can use the old/new instruction
combination for analysis, the PRE/POST pair of events isn't helpful for
resolving this issue (if the trace decoder can do this, then the change
in the kernel will be much simpler).

Below are two potential options we could use on the Arm64 platform:

- Change jump label to use stop_machine(); this might introduce a
  performance issue if jump labels are altered frequently.

  To mitigate the impact, we could use stop_machine() only when we
  detect that the perf events are enabled, and otherwise fall back to
  the old code path (see the rough sketch after this list).

- We could use a breakpoint to emulate a flow similar to x86's INT3,
  thus avoiding the race condition between the CPU that alters the
  instruction and the other CPUs that may run into the altered
  instruction.
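
As a rough sketch of the first option (illustrative only: the
perf_text_poke_update_enabled() helper is hypothetical, the wrapper name
is made up, and the two aarch64_insn_patch_text* functions already exist
on arm64):

	static int jump_label_patch_insn(void *addr, u32 insn)
	{
		/*
		 * When someone is recording text poke events, take the
		 * existing stop_machine()-based path so there is no window
		 * where other CPUs can execute an unknown instruction.
		 */
		if (perf_text_poke_update_enabled())
			return aarch64_insn_patch_text(&addr, &insn, 1);

		/* otherwise keep the cheap, unsynchronized path */
		return aarch64_insn_patch_text_nosync(addr, insn);
	}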

@Will, @Mark, could you help review this?  I'd appreciate any comments
and suggestions.  And please let me know if you want to consolidate
related work on your side (or if you know of any ongoing discussion or
anyone already working on this).

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-31  7:31           ` Leo Yan
@ 2019-11-01 10:04             ` Peter Zijlstra
  2019-11-01 10:09               ` Peter Zijlstra
  2019-11-04  2:23               ` Leo Yan
       [not found]             ` <CAJ9a7VgZH7g=rFDpKf=FzEcyBVLS_WjqbrqtRnjOi7WOY4st+w@mail.gmail.com>
  1 sibling, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-01 10:04 UTC (permalink / raw)
  To: Leo Yan
  Cc: Will Deacon, Mark Rutland, Mike Leach, Adrian Hunter,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86,
	Alexander Shishkin, Mathieu Poirier, Arnaldo Carvalho de Melo,
	Jiri Olsa, linux-kernel

On Thu, Oct 31, 2019 at 03:31:36PM +0800, Leo Yan wrote:

> Before move farward, I'd like to step back to describe clearly what's
> current problem on Arm64 and check one question for jump label:
> 
> I checked the kernel code, both kprobe and ftrace both uses
> stop_machine() to alter instructions,

That's not correct for Aargh64, see aarch64_insn_patch_text_nosync(),
which is used in both ftrace and jump_label.

> since all CPUs run into stop
> machine's synchronization, there have no race condition between
> instructions transition and CPUs execte the altered instruction; thus
> it's safe for kprobe and ftrace to use perf event PERF_TEXT_POKE_UPDATE
> to notify instruction transition and can allow us to read out 'correct'
> instruction for decoder.

Agreed, IFF patching happens using stop_machine(), things are easy. ARM
is (so far) exclusively using stop_machine() based text_poking, although
the last time I spoke to Will about this, he said the _nosync stuff is
possible on 32bit too, just nobody has bothered implementing it.

> But for jump label, it doesn't use the stop_machine() and perf event
> PERF_TEXT_POKE_UPDATE will introduce race condition as below (Let's see
> the example for transition from nop to branch):
> 
>               CPU0                                      CPU1
>   NOP instruction
>    `-> static_key_enable()
>         `-> aarch64_insn_patch_text_nosync()
>              `-> perf event PERF_TEXT_POKE_UPDATE
>                                                      -> Execute nop
>                                                         instruction
>              `-> aarch64_insn_write()
>              `-> __flush_icache_range()
> 
> Since x86 platform have INT3 as a mediate state, it can avoid the
> race condition between CPU0 (who is do transition) and other CPUs (who
> is possible to execute nop/branch).

Ah, you found the _nosync thing in jump_label, here's the one in ftrace:

arch/arm64/kernel/ftrace.c:     if (aarch64_insn_patch_text_nosync((void *)pc, new))

And yes, this is racy.

> > The thing is, as I argued, the instruction state between PRE and POST is
> > ambiguous. This makes it impossible to decode the branch decision
> > stream.
> > 
> > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
> > have CPU1 covering the instruction at T3.
> > 
> > How do you decide where CPU1 goes and what the next conditional branch
> > is?
> 
> Sorry for my not well thought.
> 
> I agree that T3 is an uncertain state with below flow:
> 
>       CPU0                                             CPU1
>   perf event PERF_TEXT_POKE_UPDATE_PRE   -> T1
> 
>     Int3 / NOP                                       -> T3
> 
>     Int3 / branch                                    -> T3'
> 
>   perf event PERF_TEXT_POKE_UPDATE_POST  -> T5
> 
> Except if the trace has extra info and can use old/new instructions
> combination for analysis, otherwise PRE/POST pair events aren't helpful
> for resolve this issue (if trace decoder can do this, then the change in
> kernel will be much simpler).
> 
> Below are two potential options we can use on Arm64 platform:
> 
> - Change to use stop_machine() for jump label; this might introduce
>   performance issue if jump label is altered frequently.
> 
>   To mitigate the impaction, we can only use stop_machine() when
>   detect the perf events are enabled, otherwise will rollback to use
>   the old code path.
> 
> - We can use breakpoint to emulate the similiar flow with x86's int3,
>   thus we can dismiss the race condition between one CPU alters
>   instruction and other CPUs run into the alternative instruction.
> 
> @Will, @Mark, could you help review this?  Appreciate any comments
> and suggestions.  And please let me know if you want to consolidate
> related works with your side (or as you know if there have ongoing
> discussion or someone works on this).

Given people are building larger Aargh64 machines (I've heard about 100+
CPUs already), I'm thinking the 3rd option is the most performant.

But yes, as you mentioned earlier, we can make this optional on the
TEXT_POKE_UPDATE event being in use.

I'm thinking something along the lines of:

static uintptr_t nosync_addr;
static u32 nosync_insn;

int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
	const u32 brk = 0; /* some_breakpoint_insn */
	uintptr_t tp = (uintptr_t)addr;
	int ret;

	lockdep_assert_held(&text_mutex);

	/* A64 instructions must be word aligned */
	if (tp & 0x3)
		return -EINVAL;

	if (perf_text_poke_update_enabled()) {

		nosync_insn = insn;
		smp_store_release(&nosync_addr, tp);

		ret = aarch64_insn_write(addr, brk);
		if (ret == 0)
			__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);

		perf_event_text_poke(....);
	}

	ret = aarch64_insn_write(addr, insn);
	if (ret == 0)
		__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);

	return ret;
}

And have the 'break' handler do:

aarch64_insn_break_handler(struct pt_regs *regs)
{
	unsigned long addr = smp_load_acquire(&nosync_addr);
	u32 insn = nosync_insn;

	if (regs->ip != addr)
		return;

	// emulate @insn
}

I understood from Will the whole nosync scheme only works for a limited
set of instructions, but you only have to implement emulation for the
actual instructions used of course.

(which is what we do on x86)

Does this sound workable?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
       [not found]             ` <CAJ9a7VgZH7g=rFDpKf=FzEcyBVLS_WjqbrqtRnjOi7WOY4st+w@mail.gmail.com>
@ 2019-11-01 10:06               ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-01 10:06 UTC (permalink / raw)
  To: Mike Leach
  Cc: Leo Yan, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Borislav Petkov, H . Peter Anvin,
	Ingo Molnar, Jiri Olsa, Mark Rutland, Mathieu Poirier,
	Will Deacon, linux-kernel, x86

On Thu, Oct 31, 2019 at 12:29:15PM +0000, Mike Leach wrote:
> On Thu, 31 Oct 2019 at 07:31, Leo Yan <leo.yan@linaro.org> wrote:

> > > Since all instructions (with the possible exception of RET) are
> > > unconditional branch instructions: NOP, JMP, CALL. It makes no read
> > > difference to the argument below.
> > >
> > > ( I'm thinking RET might be special in that it reads the return address
> > > from the stack and therefore must emit the whole IP into the stream, as
> > > we cannot know the stack state )
> >
> > To be honest, I don't have knowledge what's the exactly format for 'ret'
> > in CoreSight trace; so would like to leave this to Mike.
> >
> 
> For ETM trace we do not have to output the entire address into the stream if:
> - address compression allows us to emit only the changed LS bits from the
> last address.
> - the address is identical to one of the last three addresses emitted (we
> just emit a 'same address' encoding)
> - we are using return stack compression and the address is top of the
> return stack (we emit nothing and the decoder gets the address from its own
> model of the return stack)

Cute. I don't actually know what PT does, but I figured there had to at
least be the option to provide more information.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-01 10:04             ` Peter Zijlstra
@ 2019-11-01 10:09               ` Peter Zijlstra
  2019-11-04  2:23               ` Leo Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-01 10:09 UTC (permalink / raw)
  To: Leo Yan
  Cc: Will Deacon, Mark Rutland, Mike Leach, Adrian Hunter,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86,
	Alexander Shishkin, Mathieu Poirier, Arnaldo Carvalho de Melo,
	Jiri Olsa, linux-kernel

On Fri, Nov 01, 2019 at 11:04:40AM +0100, Peter Zijlstra wrote:

> I'm thinking something along the lines of:
> 
> static uintptr_t nosync_addr;
> static u32 nosync_insn;
> 
> int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> {
> 	const u32 break = // some_breakpoint_insn;
> 	uintptr_t tp = (uintptr_t)addr;
> 	int ret;
> 
> 	lockdep_assert_held(&text_mutex);
> 
> 	/* A64 instructions must be word aligned */
> 	if (tp & 0x3)
> 		return -EINVAL;
> 
> 	if (perf_text_poke_update_enabled()) {
> 
> 		nosync_insn = insn;
> 		smp_store_release(&nosync_addr, tp);
> 
> 		ret = aarch64_insn_write(addr, break);
> 		if (ret == 0)
> 			__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);
> 
> 		perf_event_text_poke(....);
> 	}
> 
> 	ret = aarch64_insn_write(addr, insn);
> 	if (ret == 0)
> 		__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);

	// barrier that guarantees iflush completion ? dsb(osh) ?

	WRITE_ONCE(nosync_addr, NULL);

> 	return ret;
> }
> 
> And have the 'break' handler do:
> 
> aarch64_insn_break_handler(struct pt_regs *regs)
> {
> 	unsigned long addr = smp_load_acquire(&nosync_addr);
> 	u32 insn = nosync_insn;
> 
> 	if (regs->ip != addr)
> 		return;
> 
> 	// emulate @insn
> }
> 
> I understood from Will the whole nosync scheme only works for a limited
> set of instructions, but you only have to implement emulation for the
> actual instructions used of course.
> 
> (which is what we do on x86)
> 
> Does this sound workable?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-01 10:04             ` Peter Zijlstra
  2019-11-01 10:09               ` Peter Zijlstra
@ 2019-11-04  2:23               ` Leo Yan
  2019-11-08 15:05                 ` Leo Yan
  1 sibling, 1 reply; 34+ messages in thread
From: Leo Yan @ 2019-11-04  2:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Mark Rutland, Mike Leach, Adrian Hunter,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86,
	Alexander Shishkin, Mathieu Poirier, Arnaldo Carvalho de Melo,
	Jiri Olsa, linux-kernel

Hi Peter,

On Fri, Nov 01, 2019 at 11:04:40AM +0100, Peter Zijlstra wrote:
> On Thu, Oct 31, 2019 at 03:31:36PM +0800, Leo Yan wrote:
> 
> > Before move farward, I'd like to step back to describe clearly what's
> > current problem on Arm64 and check one question for jump label:
> > 
> > I checked the kernel code, both kprobe and ftrace both uses
> > stop_machine() to alter instructions,
> 
> That's not currect for Aargh64, see aarch64_insn_patch_text_nosync(),
> which is used in both ftrace and jump_label.

Thanks for pointing this out.

Agreed; for ftrace it's complex and there are multiple paths that use
aarch64_insn_patch_text_nosync().

Below flow uses stop_machine() in ftrace:

  ftrace_run_stop_machine()
   stop_machine(__ftrace_modify_code, &command, NULL);
     __ftrace_modify_code()
       ftrace_modify_all_code()
         ftrace_update_ftrace_func()
           ftrace_modify_code()
             aarch64_insn_patch_text_nosync()

Below flow doesn't use stop_machine() in ftrace:

  prepare_coming_module()
    ftrace_module_enable()
      process_cached_mods()
        process_mod_list()
          ftrace_hash_move_and_update_ops()
            ftrace_ops_update_code()
              ftrace_run_modify_code()
                ftrace_run_update_code()
                  arch_ftrace_update_code()
                    ftrace_modify_all_code()
                      ftrace_update_ftrace_func()
                        ftrace_modify_code()
                          aarch64_insn_patch_text_nosync()

Actually, there are other flows that also call into
aarch64_insn_patch_text_nosync(), so at least we cannot say that all
ftrace flows use stop_machine() to alter instructions.

> > since all CPUs run into stop
> > machine's synchronization, there have no race condition between
> > instructions transition and CPUs execte the altered instruction; thus
> > it's safe for kprobe and ftrace to use perf event PERF_TEXT_POKE_UPDATE
> > to notify instruction transition and can allow us to read out 'correct'
> > instruction for decoder.
> 
> Agreed, IFF patching happens using stop_machine(), things are easy. ARM
> is (so far) exclusively using stop_machine() based text_poking, although
> the last time I spoke to Will about this, he said the _nosync stuff is
> possible on 32bit too, just nobody has bothered implementing it.
> 
> > But for jump label, it doesn't use the stop_machine() and perf event
> > PERF_TEXT_POKE_UPDATE will introduce race condition as below (Let's see
> > the example for transition from nop to branch):
> > 
> >               CPU0                                      CPU1
> >   NOP instruction
> >    `-> static_key_enable()
> >         `-> aarch64_insn_patch_text_nosync()
> >              `-> perf event PERF_TEXT_POKE_UPDATE
> >                                                      -> Execute nop
> >                                                         instruction
> >              `-> aarch64_insn_write()
> >              `-> __flush_icache_range()
> > 
> > Since x86 platform have INT3 as a mediate state, it can avoid the
> > race condition between CPU0 (who is do transition) and other CPUs (who
> > is possible to execute nop/branch).
> 
> Ah, you found the _nosync thing in jump_label, here's the one in ftrace:
> 
> arch/arm64/kernel/ftrace.c:     if (aarch64_insn_patch_text_nosync((void *)pc, new))
> 
> And yes, this is racy.
> 
> > > The thing is, as I argued, the instruction state between PRE and POST is
> > > ambiguous. This makes it impossible to decode the branch decision
> > > stream.
> > > 
> > > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
> > > have CPU1 covering the instruction at T3.
> > > 
> > > How do you decide where CPU1 goes and what the next conditional branch
> > > is?
> > 
> > Sorry for my not well thought.
> > 
> > I agree that T3 is an uncertain state with below flow:
> > 
> >       CPU0                                             CPU1
> >   perf event PERF_TEXT_POKE_UPDATE_PRE   -> T1
> > 
> >     Int3 / NOP                                       -> T3
> > 
> >     Int3 / branch                                    -> T3'
> > 
> >   perf event PERF_TEXT_POKE_UPDATE_POST  -> T5
> > 
> > Except if the trace has extra info and can use old/new instructions
> > combination for analysis, otherwise PRE/POST pair events aren't helpful
> > for resolve this issue (if trace decoder can do this, then the change in
> > kernel will be much simpler).
> > 
> > Below are two potential options we can use on Arm64 platform:
> > 
> > - Change to use stop_machine() for jump label; this might introduce
> >   performance issue if jump label is altered frequently.
> > 
> >   To mitigate the impaction, we can only use stop_machine() when
> >   detect the perf events are enabled, otherwise will rollback to use
> >   the old code path.
> > 
> > - We can use breakpoint to emulate the similiar flow with x86's int3,
> >   thus we can dismiss the race condition between one CPU alters
> >   instruction and other CPUs run into the alternative instruction.
> > 
> > @Will, @Mark, could you help review this?  Appreciate any comments
> > and suggestions.  And please let me know if you want to consolidate
> > related works with your side (or as you know if there have ongoing
> > discussion or someone works on this).
> 
> Given people are building larger Aargh64 machines (I've heard about 100+
> CPUs already), I'm thinking the 3rd option is the most performant.
> 
> But yes, as you mention earlier, we can make this optional on the
> TEXT_POKE_UPDATE event being in use.
> 
> I'm thinking something along the lines of:
> 
> static uintptr_t nosync_addr;
> static u32 nosync_insn;
> 
> int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> {
> 	const u32 break = // some_breakpoint_insn;
> 	uintptr_t tp = (uintptr_t)addr;
> 	int ret;
> 
> 	lockdep_assert_held(&text_mutex);
> 
> 	/* A64 instructions must be word aligned */
> 	if (tp & 0x3)
> 		return -EINVAL;
> 
> 	if (perf_text_poke_update_enabled()) {
> 
> 		nosync_insn = insn;
> 		smp_store_release(&nosync_addr, tp);
> 
> 		ret = aarch64_insn_write(addr, break);
> 		if (ret == 0)
> 			__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);
> 
> 		perf_event_text_poke(....);
> 	}
> 
> 	ret = aarch64_insn_write(addr, insn);
> 	if (ret == 0)
> 		__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);
> 
> 	return ret;
> }
> 
> And have the 'break' handler do:
> 
> aarch64_insn_break_handler(struct pt_regs *regs)
> {
> 	unsigned long addr = smp_load_acquire(&nosync_addr);
> 	u32 insn = nosync_insn;
> 
> 	if (regs->ip != addr)
> 		return;
> 
> 	// emulate @insn
> }
> 
> I understood from Will the whole nosync scheme only works for a limited
> set of instructions, but you only have to implement emulation for the
> actual instructions used of course.
> 
> (which is what we do on x86)
> 
> Does this sound workable?

Much appreciated for the posted code (and the other minor fix in your
next reply); the logic is quite clear.

I will do a prototype for this; in the meantime, I'd like to give a bit
more time for Will (or other Arm maintainers) to review it.

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
  2019-10-30 10:47   ` Leo Yan
@ 2019-11-04 10:40   ` Peter Zijlstra
  2019-11-04 12:32     ` Adrian Hunter
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-04 10:40 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
> Record changes to kernel text (i.e. self-modifying code) in order to
> support tracers like Intel PT decoding through jump labels.
> 
> A copy of the running kernel code is needed as a reference point
> (e.g. from /proc/kcore). The text poke event records the old bytes
> and the new bytes so that the event can be processed forwards or backwards.
> 
> The text poke event has 'flags' to describe when the event is sent. At
> present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
> point at which tools should update their reference kernel executable to
> change the old bytes to the new bytes.
> 
> Other architectures may wish to emit a pair of text poke events. One before
> and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
> be set on only one of the pair.
> 
> In the case of Intel PT tracing, the executable code must be walked to
> reconstruct the control flow. For x86 a jump label text poke begins:
>   - write INT3 byte
>   - IPI-SYNC
> At this point the actual control flow will be through the INT3 and handler
> and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
> for the INT3, so the flow can still be decoded. Subsequently:
>   - emit RECORD_TEXT_POKE with the new instruction
>   - write instruction tail
>   - IPI-SYNC
>   - write first byte
>   - IPI-SYNC
> So before the text poke event timestamp, the decoder will see either the
> old instruction flow or FUP/TIP of INT3. After the text poke event
> timestamp, the decoder will see either the new instruction flow or FUP/TIP
> of INT3. Thus decoders can use the timestamp as the point at which to
> modify the executable code.
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

> @@ -1000,6 +1001,26 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_BPF_EVENT			= 18,
>  
> +	/*
> +	 * Records changes to kernel text i.e. self-modified code.
> +	 * 'flags' has PERF_TEXT_POKE_UPDATE (i.e. bit 0) set to
> +	 * indicate to tools to update old bytes to new bytes in the
> +	 * executable image.
> +	 * 'len' is the number of old bytes, which is the same as the number
> +	 * of new bytes. 'bytes' contains the old bytes followed immediately
> +	 * by the new bytes.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u64				addr;
> +	 *	u16				flags;

What's the purpose of the 'flags' field? We currently only have the 1
flag defined, but what possible other flags are you thinking of?

> +	 *	u16				len;
> +	 *	u8				bytes[];
> +	 *	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_TEXT_POKE			= 19,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
>  
> @@ -1041,6 +1062,11 @@ enum perf_callchain_context {
>  #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
>  #define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
>  
> +/**
> + * PERF_RECORD_TEXT_POKE::flags bits
> + */
> +#define PERF_TEXT_POKE_UPDATE		0x01	/* update old bytes to new bytes */
> +
>  #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
>  #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
>  #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */


> +void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
> +			  const void *new_bytes, size_t len)
> +{
> +	struct perf_text_poke_event text_poke_event;
> +	size_t tot, pad;
> +
> +	if (!atomic_read(&nr_text_poke_events))
> +		return;
> +
> +	tot  = sizeof(text_poke_event.flags) +
> +	       sizeof(text_poke_event.len) + (len << 1);

Maybe use: 'len * 2', compiler should be smart enough.

> +	pad  = ALIGN(tot, sizeof(u64)) - tot;
> +
> +	text_poke_event = (struct perf_text_poke_event){
> +		.old_bytes    = old_bytes,
> +		.new_bytes    = new_bytes,
> +		.pad          = pad,
> +		.flags        = flags,
> +		.len          = len,
> +		.event_id  = {
> +			.header = {
> +				.type = PERF_RECORD_TEXT_POKE,
> +				.misc = PERF_RECORD_MISC_KERNEL,
> +				.size = sizeof(text_poke_event.event_id) + tot + pad,
> +			},
> +			.addr = addr,
> +		},
> +	};
> +
> +	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
> +}
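
(Illustrative arithmetic, not from the patch itself: for a typical
5-byte jump-label poke, tot = sizeof(flags) + sizeof(len) + 2*len =
2 + 2 + 10 = 14, pad = ALIGN(14, 8) - 14 = 2, so event_id.header.size =
16 (header + addr) + 14 + 2 = 32, with the sample_id presumably
accounted for when the record is actually emitted.)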

Also, I'm thinking this patch has a problem with
arch_unoptimize_kprobe() as it stands today.

I've got a patch pending for that:

  https://lkml.kernel.org/r/20191018074634.629386219@infradead.org

That rewrites that a little, but it will be slightly differently broken
after that.

The problem is that while we can (and do) ignore regular kprobes (they
use text_poke() which isn't instrumented, and they don't need to be,
because they're always going through INT3), we cannot ignore optprobes.

Installing the optprobe works as expected, but unoptimizing is difficult,
and as it stands your patch will not report the old text (you'll see
that first byte overwritten with 0xCC); after my patch you'll not see
it at all.

Now, I suppose we can stick an explicit perf_event_text_poke() in there
after my patch.

We have to delay this patch until after my x86/ftrace rewrite anyway,
because otherwise ftrace is hidden too.

And even then, this will then notify us of all text modifications, but
what about all the extra text, like ftrace trampolines, k(ret)probe
trampolines, optprobe slots, bpf-jit, etc.?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-04 10:40   ` Peter Zijlstra
@ 2019-11-04 12:32     ` Adrian Hunter
  0 siblings, 0 replies; 34+ messages in thread
From: Adrian Hunter @ 2019-11-04 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86, Mark Rutland,
	Alexander Shishkin, Mathieu Poirier, Leo Yan,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel

On 4/11/19 12:40 PM, Peter Zijlstra wrote:
> On Fri, Oct 25, 2019 at 03:59:55PM +0300, Adrian Hunter wrote:
>> Record changes to kernel text (i.e. self-modifying code) in order to
>> support tracers like Intel PT decoding through jump labels.
>>
>> A copy of the running kernel code is needed as a reference point
>> (e.g. from /proc/kcore). The text poke event records the old bytes
>> and the new bytes so that the event can be processed forwards or backwards.
>>
>> The text poke event has 'flags' to describe when the event is sent. At
>> present the only flag is PERF_TEXT_POKE_UPDATE which indicates the
>> point at which tools should update their reference kernel executable to
>> change the old bytes to the new bytes.
>>
>> Other architectures may wish to emit a pair of text poke events. One before
>> and one after a text poke. In that case, PERF_TEXT_POKE_UPDATE flag would
>> be set on only one of the pair.
>>
>> In the case of Intel PT tracing, the executable code must be walked to
>> reconstruct the control flow. For x86 a jump label text poke begins:
>>   - write INT3 byte
>>   - IPI-SYNC
>> At this point the actual control flow will be through the INT3 and handler
>> and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
>> for the INT3, so the flow can still be decoded. Subsequently:
>>   - emit RECORD_TEXT_POKE with the new instruction
>>   - write instruction tail
>>   - IPI-SYNC
>>   - write first byte
>>   - IPI-SYNC
>> So before the text poke event timestamp, the decoder will see either the
>> old instruction flow or FUP/TIP of INT3. After the text poke event
>> timestamp, the decoder will see either the new instruction flow or FUP/TIP
>> of INT3. Thus decoders can use the timestamp as the point at which to
>> modify the executable code.
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> 
>> @@ -1000,6 +1001,26 @@ enum perf_event_type {
>>  	 */
>>  	PERF_RECORD_BPF_EVENT			= 18,
>>  
>> +	/*
>> +	 * Records changes to kernel text i.e. self-modified code.
>> +	 * 'flags' has PERF_TEXT_POKE_UPDATE (i.e. bit 0) set to
>> +	 * indicate to tools to update old bytes to new bytes in the
>> +	 * executable image.
>> +	 * 'len' is the number of old bytes, which is the same as the number
>> +	 * of new bytes. 'bytes' contains the old bytes followed immediately
>> +	 * by the new bytes.
>> +	 *
>> +	 * struct {
>> +	 *	struct perf_event_header	header;
>> +	 *	u64				addr;
>> +	 *	u16				flags;
> 
> What's the purpose of the 'flags' field? We currently only have the 1
> flag defined, but what possible other flags are you thinking of?

That was to allow for identifying 'before' and 'after' text poke events.
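
(For what it's worth, on the tool side the record layout quoted above
would be consumed along these lines; this is only a sketch, and
update_ref_image() is a made-up placeholder, not anything in the perf
tools:

	#include <linux/perf_event.h>
	#include <linux/types.h>

	/* PERF_TEXT_POKE_UPDATE comes from the proposed uapi header above */

	struct text_poke_record {
		struct perf_event_header header;
		__u64	addr;
		__u16	flags;
		__u16	len;
		__u8	bytes[];	/* 'len' old bytes, then 'len' new bytes */
	};

	/* hypothetical tool-side helper, not part of the perf tools */
	void update_ref_image(__u64 addr, const __u8 *old_bytes,
			      const __u8 *new_bytes, __u16 len);

	static void handle_text_poke(struct text_poke_record *rec)
	{
		const __u8 *old_bytes = rec->bytes;
		const __u8 *new_bytes = rec->bytes + rec->len;

		/* act at the point the event says to flip the reference image */
		if (rec->flags & PERF_TEXT_POKE_UPDATE)
			update_ref_image(rec->addr, old_bytes, new_bytes,
					 rec->len);
	}
)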

> 
>> +	 *	u16				len;
>> +	 *	u8				bytes[];
>> +	 *	struct sample_id		sample_id;
>> +	 * };
>> +	 */
>> +	PERF_RECORD_TEXT_POKE			= 19,
>> +
>>  	PERF_RECORD_MAX,			/* non-ABI */
>>  };
>>  
>> @@ -1041,6 +1062,11 @@ enum perf_callchain_context {
>>  #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
>>  #define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
>>  
>> +/**
>> + * PERF_RECORD_TEXT_POKE::flags bits
>> + */
>> +#define PERF_TEXT_POKE_UPDATE		0x01	/* update old bytes to new bytes */
>> +
>>  #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
>>  #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
>>  #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
> 
> 
>> +void perf_event_text_poke(unsigned long addr, u16 flags, const void *old_bytes,
>> +			  const void *new_bytes, size_t len)
>> +{
>> +	struct perf_text_poke_event text_poke_event;
>> +	size_t tot, pad;
>> +
>> +	if (!atomic_read(&nr_text_poke_events))
>> +		return;
>> +
>> +	tot  = sizeof(text_poke_event.flags) +
>> +	       sizeof(text_poke_event.len) + (len << 1);
> 
> Maybe use: 'len * 2', compiler should be smart enough.
> 
>> +	pad  = ALIGN(tot, sizeof(u64)) - tot;
>> +
>> +	text_poke_event = (struct perf_text_poke_event){
>> +		.old_bytes    = old_bytes,
>> +		.new_bytes    = new_bytes,
>> +		.pad          = pad,
>> +		.flags        = flags,
>> +		.len          = len,
>> +		.event_id  = {
>> +			.header = {
>> +				.type = PERF_RECORD_TEXT_POKE,
>> +				.misc = PERF_RECORD_MISC_KERNEL,
>> +				.size = sizeof(text_poke_event.event_id) + tot + pad,
>> +			},
>> +			.addr = addr,
>> +		},
>> +	};
>> +
>> +	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
>> +}
> 
> Also, I'm thinking this patch has a problem with
> arch_unoptimize_kprobe() as it stands today.
> 
> I've got a patch pending for that:
> 
>   https://lkml.kernel.org/r/20191018074634.629386219@infradead.org
> 
> That rewrites that a little, but it will be slightly differently broken
> after that.
> 
> The problem is that while we can (and do) ignore regular kprobes (they
> use text_poke() which isn't instrumented, and they don't need to be,
> because they're always going through INT3), we cannot ignore optprobes.
> 
> Installing the optprobe works as expected, but unoptimizing is difficult
> and as it stands your patch will not report the old text (you'll see
> that first byte overwriten with 0xCC) and after my patch you'll not see
> it at all.
> 
> Now, I suppose we can stick an explicit perf_event_text_poke() in there
> after my patch.
> 
> We have to delay this patch until after my x86/ftrace rewrite anyway,
> because otherwise ftrace is hidden too.
> 
> And even then, this will then notify us of all text modifications, but
> what about all the extra text, like ftrace trampolines, k(ret)probe
> trampolines, optprobe slots, bpf-jit, etc.?

Generally jump labels are used by default, whereas ftrace and probes are
more under the control of the user.  So just having jump labels done is the
main thing.

The tools have some support for copying out bpf-jit, although I am not sure
it is guaranteed to capture short-lived code.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-04  2:23               ` Leo Yan
@ 2019-11-08 15:05                 ` Leo Yan
  2019-11-11 14:46                   ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Leo Yan @ 2019-11-08 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Mark Rutland, Mike Leach, Adrian Hunter,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86,
	Alexander Shishkin, Mathieu Poirier, Arnaldo Carvalho de Melo,
	Jiri Olsa, linux-kernel

On Mon, Nov 04, 2019 at 10:23:46AM +0800, Leo Yan wrote:

[...]

> > > But for jump label, it doesn't use the stop_machine() and perf event
> > > PERF_TEXT_POKE_UPDATE will introduce race condition as below (Let's see
> > > the example for transition from nop to branch):
> > > 
> > >               CPU0                                      CPU1
> > >   NOP instruction
> > >    `-> static_key_enable()
> > >         `-> aarch64_insn_patch_text_nosync()
> > >              `-> perf event PERF_TEXT_POKE_UPDATE
> > >                                                      -> Execute nop
> > >                                                         instruction
> > >              `-> aarch64_insn_write()
> > >              `-> __flush_icache_range()
> > > 
> > > Since x86 platform have INT3 as a mediate state, it can avoid the
> > > race condition between CPU0 (who is do transition) and other CPUs (who
> > > is possible to execute nop/branch).
> > 
> > Ah, you found the _nosync thing in jump_label, here's the one in ftrace:
> > 
> > arch/arm64/kernel/ftrace.c:     if (aarch64_insn_patch_text_nosync((void *)pc, new))
> > 
> > And yes, this is racy.
> > 
> > > > The thing is, as I argued, the instruction state between PRE and POST is
> > > > ambiguous. This makes it impossible to decode the branch decision
> > > > stream.
> > > > 
> > > > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
> > > > have CPU1 covering the instruction at T3.
> > > > 
> > > > How do you decide where CPU1 goes and what the next conditional branch
> > > > is?
> > > 
> > > Sorry for my not well thought.
> > > 
> > > I agree that T3 is an uncertain state with below flow:
> > > 
> > >       CPU0                                             CPU1
> > >   perf event PERF_TEXT_POKE_UPDATE_PRE   -> T1
> > > 
> > >     Int3 / NOP                                       -> T3
> > > 
> > >     Int3 / branch                                    -> T3'
> > > 
> > >   perf event PERF_TEXT_POKE_UPDATE_POST  -> T5
> > > 
> > > Except if the trace has extra info and can use old/new instructions
> > > combination for analysis, otherwise PRE/POST pair events aren't helpful
> > > for resolve this issue (if trace decoder can do this, then the change in
> > > kernel will be much simpler).
> > > 
> > > Below are two potential options we can use on Arm64 platform:
> > > 
> > > - Change to use stop_machine() for jump label; this might introduce
> > >   performance issue if jump label is altered frequently.
> > > 
> > >   To mitigate the impaction, we can only use stop_machine() when
> > >   detect the perf events are enabled, otherwise will rollback to use
> > >   the old code path.
> > > 
> > > - We can use breakpoint to emulate the similiar flow with x86's int3,
> > >   thus we can dismiss the race condition between one CPU alters
> > >   instruction and other CPUs run into the alternative instruction.
> > > 
> > > @Will, @Mark, could you help review this?  Appreciate any comments
> > > and suggestions.  And please let me know if you want to consolidate
> > > related works with your side (or as you know if there have ongoing
> > > discussion or someone works on this).
> > 
> > Given people are building larger Aargh64 machines (I've heard about 100+
> > CPUs already), I'm thinking the 3rd option is the most performant.
> > 
> > But yes, as you mention earlier, we can make this optional on the
> > TEXT_POKE_UPDATE event being in use.
> > 
> > I'm thinking something along the lines of:
> > 
> > static uintptr_t nosync_addr;
> > static u32 nosync_insn;
> > 
> > int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> > {
> > 	const u32 break = // some_breakpoint_insn;
> > 	uintptr_t tp = (uintptr_t)addr;
> > 	int ret;
> > 
> > 	lockdep_assert_held(&text_mutex);
> > 
> > 	/* A64 instructions must be word aligned */
> > 	if (tp & 0x3)
> > 		return -EINVAL;
> > 
> > 	if (perf_text_poke_update_enabled()) {
> > 
> > 		nosync_insn = insn;
> > 		smp_store_release(&nosync_addr, tp);
> > 
> > 		ret = aarch64_insn_write(addr, break);
> > 		if (ret == 0)
> > 			__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);
> > 
> > 		perf_event_text_poke(....);
> > 	}
> > 
> > 	ret = aarch64_insn_write(addr, insn);
> > 	if (ret == 0)
> > 		__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);
> > 
> > 	return ret;
> > }
> > 
> > And have the 'break' handler do:
> > 
> > aarch64_insn_break_handler(struct pt_regs *regs)
> > {
> > 	unsigned long addr = smp_load_acquire(&nosync_addr);
> > 	u32 insn = nosync_insn;
> > 
> > 	if (regs->ip != addr)
> > 		return;
> > 
> > 	// emulate @insn
> > }
> > 
> > I understood from Will the whole nosync scheme only works for a limited
> > set of instructions, but you only have to implement emulation for the
> > actual instructions used of course.
> > 
> > (which is what we do on x86)
> > 
> > Does this sound workable?

I will give a status update for the prototype (the code has been uploaded
to the git server [1]); I found some issues for the text poke perf event
on arm64.  These issues are mainly related to the arm64 architecture.

- The first issue is the nosync instruction emulation.  On arm64, some
  instructions need to be single stepped and some instructions are
  emulated; this follows from reading the kprobe implementation on
  Arm64.  So the main idea for the prototype is to use almost the same
  method as kprobes for the nosync instructions.

  As a result, there is duplicated code between the kprobe and text
  poke handling.  If this is the right way to move forward, we need to
  refactor the common code so that kprobes and text poke can share it.

- The second issue is a race condition between the CPU that alters
  instructions and other CPUs that hit the altered instructions (and
  take the breakpoint).

  Peter's suggestion uses the global variables 'nosync_insn' and
  'nosync_addr' to record the altered instruction.  But from testing I
  found that a single static key usually changes multiple addresses at
  once.

  So this might cause the following issue: CPU(a) loops to alter
  instructions at different addresses (sometimes the opcode is also
  different for different addresses); in the meantime, CPU(b) hits an
  altered instruction and handles the exception for the breakpoint; if
  CPU(a) has continued to alter the instruction at the next address,
  then CPU(b) might wrongly use the values from 'nosync_insn' and
  'nosync_addr'.

  Simply put, we cannot support only a single nosync instruction; we
  need to support multiple nosync instructions in the loop (a rough
  sketch of one possible shape follows after this list).

- The third issue is possibly a nesting issue.  E.g. for the injected
  break instruction, we have no method to emit a perf event for this
  instruction; and, connecting with the first issue, if the
  instruction is single stepped from a slot (the slot is allocated with
  get_insn_slot()), we have no way to let perf user space know about
  the instructions which are executed in the slots.

  I am not sure if I am overthinking this.  But it seems to me that we
  inject extra instructions for the text poke event, and if these
  injected instructions are executed, the userspace decoder has no way
  to get the related info.
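
Regarding the second issue, one possible shape (purely hypothetical; all
names here are invented for illustration and this is not tested code) is
to keep a small table of in-flight patch sites instead of a single
'nosync_addr'/'nosync_insn' pair; the patching side still runs under
text_mutex, so only the breakpoint handlers race with it:

	#define MAX_NOSYNC_SITES	8

	struct nosync_site {
		unsigned long	addr;	/* 0 when the slot is free */
		u32		insn;
	};

	static struct nosync_site nosync_sites[MAX_NOSYNC_SITES];

	/* patching CPU, text_mutex held */
	static struct nosync_site *nosync_site_install(unsigned long addr, u32 insn)
	{
		int i;

		for (i = 0; i < MAX_NOSYNC_SITES; i++) {
			struct nosync_site *s = &nosync_sites[i];

			if (!READ_ONCE(s->addr)) {
				s->insn = insn;
				/* publish insn before addr becomes visible */
				smp_store_release(&s->addr, addr);
				return s;
			}
		}
		return NULL;	/* caller waits for a slot or falls back */
	}

	/* breakpoint handler side */
	bool aarch64_insn_break_handler(struct pt_regs *regs)
	{
		int i;

		for (i = 0; i < MAX_NOSYNC_SITES; i++) {
			unsigned long addr = smp_load_acquire(&nosync_sites[i].addr);

			if (addr && regs->pc == addr) {
				/* emulate nosync_sites[i].insn as in Peter's sketch */
				return true;
			}
		}
		return false;
	}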

Just to clarify, I am fine with merging this patch set, but we might
want to consider further what the best way is on Arm64.  Any public
comments and suggestions are welcome; I will sync internally on how to
follow up on this functionality.

Thanks,
Leo Yan

[1] https://git.linaro.org/people/leo.yan/linux-coresight.git/log/?h=arm64_perf_text_pork_prototype

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-08 15:05                 ` Leo Yan
@ 2019-11-11 14:46                   ` Peter Zijlstra
  2019-11-11 15:39                     ` Will Deacon
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-11 14:46 UTC (permalink / raw)
  To: Leo Yan
  Cc: Will Deacon, Mark Rutland, Mike Leach, Adrian Hunter,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, x86,
	Alexander Shishkin, Mathieu Poirier, Arnaldo Carvalho de Melo,
	Jiri Olsa, linux-kernel

On Fri, Nov 08, 2019 at 11:05:30PM +0800, Leo Yan wrote:

> I will update some status for prototype (the code has been uploaded into
> git server [1]) and found some issues for text poke perf event on arm64.
> These issues are mainly related with arch64 architecture.
> 
> - The first issue is for the nosync instruction emulation.  On arm64,
>   some instructions need to be single stepped and some instructions
>   is emulated.  Especially, after I read the code for kprobe
>   implementation on Arm64.  So the main idea for prototyping is to use
>   the almos same method with kprobe for nosync instruction.

This makes no sense to me whatsoever. What actual instructions are
patched with _nosync()?  ftrace/jump_label only use 'NOP/JMP/CALL'.

For NOP you can advance regs->ip, for JMP you can adjust regs->ip, for
CALL you adjust regs->ip and regs->r14 (IIUC you do something like:
regs->r14 = regs->ip+4; regs->ip = func;)

(FWIW emulating CALL on x86 is fun because we get to PUSH something on
the stack, let me know if you want to see the patches that were required
to make that happen :-)
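
On arm64 the equivalent would presumably be regs->pc and the link
register x30 rather than regs->ip/regs->r14.  As an illustrative sketch
(reusing the existing aarch64_insn_* helpers; the function itself is
made up, not existing kernel code):

	static void emulate_patched_insn(struct pt_regs *regs, u32 insn)
	{
		if (aarch64_insn_is_nop(insn)) {
			/* NOP: just step over it */
			regs->pc += AARCH64_INSN_SIZE;
		} else if (aarch64_insn_is_b(insn)) {
			/* B: pc-relative unconditional branch */
			regs->pc += aarch64_get_branch_offset(insn);
		} else if (aarch64_insn_is_bl(insn)) {
			/* BL: set the link register, then branch */
			regs->regs[30] = regs->pc + AARCH64_INSN_SIZE;
			regs->pc += aarch64_get_branch_offset(insn);
		}
	}

which is essentially what the kprobes simulate_b_bl() code already does.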

> - The second issue is race condition between the CPU alters
>   instructions and other CPUs hit the altered instructions (and
>   breakpointed).
> 
>   Peter's suggestion uses global variables 'nosync_insn' and
>   'nosync_addr' to record the altered instruction.  But from the
>   testing I found for single static key, usually it will change for
>   multiple address at once.
> 
>   So this might cause the issue is: CPU(a) will loop to alter
>   instructions for different address (sometimes the opcode also is
>   different for different address), at the meantime, CPU(b) hits an
>   altered instruction and handle exception for the breakpoint, if
>   CPU(a) is continuing to alter instruction for the next address, thne
>   CPU(a) might wrongly to use the value from 'nosync_insn' and
>   'nosync_addr'.
> 
>   Simply to say, we cannot only support single nosync instruction but
>   need to support multiple nosync instructions in the loop.

On x86 all actual text poking is serialized by text_mutex.

> - The third issue is might be nested issue.  E.g. for the injected
>   break instruction, we have no method to pass perf event for this
>   instruction; and if we connect with the first issue, if the
>   instruction is single stepped with slot (the slot is allocated with
>   get_insn_slot()), we cannot to allow the perf user space to know
>   the instructions which are executed in the slots.
> 
>   I am not sure if I think too complex here.  But seems to me, we
>   inject more instructions for poke text event, and if these injected
>   instructions are executed but the userspace decoder has no way to
>   pass the related info.

That's a misunderstanding: the text_poke event is a side-band event and
as such is delivered to all events that expressed interest in it. You
don't need any exception-to-event mapping yourself.

> Just clarify, I am fine for merging this patch set, but we might
> consider more what's the best way on Arm64.  Welcome any public
> comments and suggestions; I will sync internally for how to follow up
> this functionality.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-11 14:46                   ` Peter Zijlstra
@ 2019-11-11 15:39                     ` Will Deacon
  2019-11-11 16:05                       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Will Deacon @ 2019-11-11 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Leo Yan, Mark Rutland, Mike Leach, Adrian Hunter, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, x86, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

As a disclaimer: I'm having a hard time understanding what this thread is
about.

On Mon, Nov 11, 2019 at 03:46:42PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2019 at 11:05:30PM +0800, Leo Yan wrote:
> 
> > I will update some status for prototype (the code has been uploaded into
> > git server [1]) and found some issues for text poke perf event on arm64.
> > These issues are mainly related with arch64 architecture.
> > 
> > - The first issue is for the nosync instruction emulation.  On arm64,
> >   some instructions need to be single stepped and some instructions
> >   is emulated.  Especially, after I read the code for kprobe
> >   implementation on Arm64.  So the main idea for prototyping is to use
> >   the almos same method with kprobe for nosync instruction.
> 
> This makes no sense to me what so ever. What actual instructions are
> patched with _nosync() ? ftrace/jump_label only use 'NOP/JMP/CALL'

'_nosync()' can be used to patch the following instructions:

	B		(JMP)
	BL		(CALL)
	BRK		(Breakpoint)
	SVC, HVC, SMC	(System calls)
	NOP
	ISB		(Pipe flush)

Our kprobes implementation prefers to single-step the instruction in an
XOL buffer (we should do this by placing a BRK at the end of the buffer,
but we currently use hardware step which is overkill imo). For instructions
that are performing a PC-relative operation (e.g. an immediate branch),
we can't run from the XOL buffer because we'd go to the wrong place; these
instructions are therefore emulated in software...

> For NOP you can advance regs->ip, for JMP you can adjust regs->ip, for
> CALL you adjust regs->ip and regs->r14 (IIUC you do something like:
> regs->r14 = regs->ip+4; regs->ip = func;)

... in a manner similar to what you describe here. See simulate_b_bl(),
although it's thankfully simpler than what x86 seems to have to do. This
approach means we can avoid emulating the vast majority of the instruction
set.

> (FWIW emulating CALL on x86 is fun because we get to PUSH something on
> the stack, let me know if you want to see the patches that were required
> to make that happen :-)

No thanks ;)

> > - The second issue is race condition between the CPU alters
> >   instructions and other CPUs hit the altered instructions (and
> >   breakpointed).
> > 
> >   Peter's suggestion uses global variables 'nosync_insn' and
> >   'nosync_addr' to record the altered instruction.  But from the
> >   testing I found for single static key, usually it will change for
> >   multiple address at once.
> > 
> >   So this might cause the issue is: CPU(a) will loop to alter
> >   instructions for different address (sometimes the opcode also is
> >   different for different address), at the meantime, CPU(b) hits an
> >   altered instruction and handle exception for the breakpoint, if
> >   CPU(a) is continuing to alter instruction for the next address, thne
> >   CPU(a) might wrongly to use the value from 'nosync_insn' and
> >   'nosync_addr'.
> > 
> >   Simply to say, we cannot only support single nosync instruction but
> >   need to support multiple nosync instructions in the loop.
> 
> On x86 all actual text poking is serialized by text_mutex.

On arm64, we patch a *lot* and I think stop_machine() is the only safe
choice for things like the alternatives, where we patch all sorts of things
(cache maintenance, I/O accessors, atomics (until recently), user accessors).
Even then, we have to provide our own synchronisation in
__apply_alternatives_multi_stop() and use special cache-flushing
(clean_dcache_range_nopatch()) to avoid the possibility of running the code
that we're modifying. Outside of the alternatives, I'd still be wary about
recursive debug exceptions if we were to patch jump labels with breakpoints.

Backing up though, I think I'm missing details about what this thread is
trying to achieve. You're adding perf events so that coresight trace can
take into account modifications of the kernel text, right? If so:

  * Does this need to take into account more than just jump_label()?
  * What consistency guarantees are needed by userspace? It's not clear to
    me how it correlates the new event with execution on other CPUs. Is this
    using timestamps or something else?
  * What about module loading?

Finally, the whole point of the '_nosync()' stuff is that we don't have
to synchronise. Therefore, there is a window where you really can't tell
whether the instruction has been updated from the perspective of another
CPU.

Will

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-11 15:39                     ` Will Deacon
@ 2019-11-11 16:05                       ` Peter Zijlstra
  2019-11-11 17:29                         ` Will Deacon
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-11 16:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Leo Yan, Mark Rutland, Mike Leach, Adrian Hunter, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, x86, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:

> Backing up though, I think I'm missing details about what this thread is
> trying to achieve. You're adding perf events so that coresight trace can
> take into account modifications of the kernel text, right?

Yes, because ARM-CS / Intel-PT need to know the exact text at any one
time in order to correctly decode their traces.

> If so:
>
>   * Does this need to take into account more than just jump_label()?

jump_label seems to be the initial target Adrian set, but yes, it needs
to cover 'everything'.

That is, all single instruction patching crud around:

 - optimized kprobes
   (regular kprobes are exempt because they rely on exceptions)
 - jump_labels
 - static_call (still pending but still)
 - ftrace fentry

We also need a solution for whole new chunks of text:

 - modules
 - ftrace trampolines
 - optprobe trampolines
 - JIT stuff

but that is so far not included; I had some ideas about a /dev/mem based
interface that would report new ranges and wait for acks (from open
file-desc) before freeing them.

>   * What consistency guarantees are needed by userspace? It's not clear to
>     me how it correlates the new event with execution on other CPUs. Is this
>     using timestamps or something else?

Something else :-)

Both CS/PT basically have a bit-stream encoding of taken / not-taken
decisions. To decode they read the text until a conditional
branch-point, then consume one bit from the stream and continue.

(note how unconditional branches -- jump_labels -- are expected to be
correct in this scheme)

This means they need an exact representation of the text to be able to
accurately decode.
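
In pseudo-code the decode loop is roughly the following (a sketch only;
every helper and type name here is made up, this is not the actual PT or
CS decoder):

	struct decoded_insn insn;		/* made-up type */
	u64 ip = last_sync_point();		/* start/sync packet gives the IP */

	while (have_trace_bits()) {
		insn = read_text_image(ip);	/* must match what actually ran */

		if (is_conditional_branch(insn)) {
			/* consume one taken/not-taken bit from the bit-stream */
			if (consume_bit())
				ip = branch_target(ip, insn);
			else
				ip += insn_length(insn);
		} else if (is_unconditional_branch(insn)) {
			/* e.g. a jump_label site: NOP vs JMP matters here */
			ip = branch_target(ip, insn);
		} else {
			ip += insn_length(insn);
		}
	}

so a stale view of even a single patched NOP/JMP desynchronizes
everything that follows.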

This means timestamps or PRE/POST modify events are not acceptable,
because until the I-FLUSH has completed we do not know which CPU has
which version of the instruction.

Instead we rely on exceptions; exceptions are differently encoded in the
CS/PT data streams.

The scheme used is:

 - overwrite target instruction with an exception (INT3 on x86, BRK on arm)
 - sync (IPI broadcast CPUID or I-FLUSH completion)

at this point we know the instruction _will_ trap and CS/PT can observe
this alternate flow. That is, the exception handler will emulate the
instruction.

 - emit the TEXT_POKE event with both the old and new instruction
   included

 - overwrite the target instruction with the new instruction
 - sync

at this point the new instruction should be valid.

Using this scheme we can at all times follow the instruction flow.
Either it is an exception and the exception encoding helps us navigate,
or, on either side, we'll know the old/new instruction.

>   * What about module loading?

I mentioned that above; that is still pending.

> Finally, the whole point of the '_nosync()' stuff is that we don't have
> to synchronise. Therefore, there is a window where you really can't tell
> whether the instruction has been updated from the perspective of another
> CPU.

Right, and much of that is preserved with the above scheme, nowhere do
you have to wait/disturb other CPUs, except for the I-FLUSH completion.

Does this help?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-11 16:05                       ` Peter Zijlstra
@ 2019-11-11 17:29                         ` Will Deacon
  2019-11-11 20:32                           ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Will Deacon @ 2019-11-11 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Leo Yan, Mark Rutland, Mike Leach, Adrian Hunter, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, x86, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

On Mon, Nov 11, 2019 at 05:05:05PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:
> 
> > Backing up though, I think I'm missing details about what this thread is
> > trying to achieve. You're adding perf events so that coresight trace can
> > take into account modifications of the kernel text, right?
> 
> Yes, because ARM-CS / Intel-PT need to know the exact text at any one
> time in order to correctly decode their traces.
> 
> > If so:
> >
> >   * Does this need to take into account more than just jump_label()?
> 
> jump_label seems to be the initial target Adrian set, but yes, it needs
> to cover 'everything'.

Including alternatives, which are what get me worried since the potential
for recursion is pretty high there (on arm64, at least).

> That is, all single instruction patching crud around:
> 
>  - optimized kprobes
>    (regular kprobes are exempt because they rely on exceptions)
>  - jump_labels
>  - static_call (still pending but still)
>  - ftrace fentry
> 
> We also need a solution for whole new chunks of text:
> 
>  - modules
>  - ftrace trampolines
>  - optprobe trampolines
>  - JIT stuff
> 
> but that is so far not included; I had some ideas about a /dev/mem based
> interface that would report new ranges and wait for acks (from open
> file-desc) before freeing them.

I think that it would be nice not to end up with two very different
interfaces for this. But I see this is still just an RFC, so maybe the
full picture will emerge once we solve more of these use-cases.

> >   * What consistency guarantees are needed by userspace? It's not clear to
> >     me how it correlates the new event with execution on other CPUs. Is this
> >     using timestamps or something else?
> 
> Something else :-)
> 
> Both CS/PT basically have a bit-stream encoding of taken / not-taken
> decisions. To decode they read the text until a conditional
> branch-point, then consume one bit from the stream and continue.
> 
> (note how unconditional branches -- jump_labels -- are expected to be
> correct in this scheme)
> 
> This means they need an exact representation of the text to be able to
> accurately decode.
> 
> This means timestamps or PRE/POST modify events are not acceptible.
> Because until the I-FLUSH has completed we do not know which CPU has
> which version of the instruction.
> 
> Instead we rely on exceptions; exceptions are differently encoded in the
> CS/PT data streams.
> 
> The scheme used is:
> 
>  - overwrite target instruction with an exception (INT3 on x86, BRK on arm)
>  - sync (IPI broadcast CPUID or I-FLUSH completion)

Hmm. Wouldn't this sync also need to drain the trace buffers for all other
CPUs so that we make sure that the upcoming TEXT_POKE event occurs after
all prior trace data, which could've been from before the breakpoint was
installed?

> at this point we know the instruction _will_ trap and CS/PT can observe
> this alternate flow. That is, the exception handler will emulate the
> instruction.
> 
>  - emit the TEXT_POKE event with both the old and new instruction
>    included
> 
>  - overwrite the target instruction with the new instruction
>  - sync
> 
> at this point the new instruction should be valid.
> 
> Using this scheme we can at all times follow the instruction flow.
> Either it is an exception and the exception encoding helps us navigate,
> or, on either size, we'll know the old/new instruction.
> 
> >   * What about module loading?
> 
> I mentioned that above; that is still pending.
> 
> > Finally, the whole point of the '_nosync()' stuff is that we don't have
> > to synchronise. Therefore, there is a window where you really can't tell
> > whether the instruction has been updated from the perspective of another
> > CPU.
> 
> Right, and much of that is preserved with the above scheme, nowhere do
> you have to wait/disturb other CPUs, except for the I-FLUSH completion.
> 
> Does this help?

Yes, thank you for explaining it!

Will

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
  2019-11-11 17:29                         ` Will Deacon
@ 2019-11-11 20:32                           ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2019-11-11 20:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Leo Yan, Mark Rutland, Mike Leach, Adrian Hunter, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, x86, Alexander Shishkin,
	Mathieu Poirier, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel

On Mon, Nov 11, 2019 at 05:29:35PM +0000, Will Deacon wrote:
> On Mon, Nov 11, 2019 at 05:05:05PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:
> > 
> > > Backing up though, I think I'm missing details about what this thread is
> > > trying to achieve. You're adding perf events so that coresight trace can
> > > take into account modifications of the kernel text, right?
> > 
> > Yes, because ARM-CS / Intel-PT need to know the exact text at any one
> > time in order to correctly decode their traces.
> > 
> > > If so:
> > >
> > >   * Does this need to take into account more than just jump_label()?
> > 
> > jump_label seems to be the initial target Adrian set, but yes, it needs
> > to cover 'everything'.
> 
> Including alternatives, which are what gets me worried, since the potential
> for recursion is pretty high there (on arm64, at least).

So I had not considered alternatives because they're typically run once
at boot (and module load) and never seen again. That would make them
just part of loading new text.

But you mentioned wanting to run them at hotplug time... which is more
'interesting'.

> > That is, all single instruction patching crud around:
> > 
> >  - optimized kprobes
> >    (regular kprobes are exempt because they rely on exceptions)
> >  - jump_labels
> >  - static_call (still pending but still)
> >  - ftrace fentry
> > 
> > We also need a solution for whole new chunks of text:
> > 
> >  - modules
> >  - ftrace trampolines
> >  - optprobe trampolines
> >  - JIT stuff
> > 
> > but that is so far not included; I had some ideas about a /dev/mem based
> > interface that would report new ranges and wait for acks (from open
> > file-desc) before freeing them.
> 
> I think that it would be nice not to end up with two very different
> interfaces for this. But I see this is still just an RFC, so maybe the
> full picture will emerge once we solve more of these use-cases.

The general distinction is between new text mappings and changing them
once they exist.

In general a text mapping is large and doesn't change (much). Once you
get an event that it exists, you can copy it out at your convenience; all
you really need to make sure of is that it doesn't disappear before you've
completed your copy.

OTOH dynamic text updates like jump_labels can happen quite frequently, and
we cannot wait for all observers to have observed/copied the new state
before we allow changing it again -- i.e. we need a buffered event.

So we don't want to stick whole new text things into a buffer (a module
might be larger than the buffer) but we cannot be lazy with text
updates.

That is, yes it sucks, but these are two different cases.

> > Instead we rely on exceptions; exceptions are differently encoded in the
> > CS/PT data streams.
> > 
> > The scheme used is:
> > 
> >  - overwrite target instruction with an exception (INT3 on x86, BRK on arm)
> >  - sync (IPI broadcast CPUID or I-FLUSH completion)
> 
> Hmm. Wouldn't this sync also need to drain the trace buffers for all other
> CPUs so that we make sure that the upcoming TEXT_POKE event occurs after
> all prior trace data, which could've been from before the breakpoint was
> installed?

All we need to ensure is that the breakpoint is visible before the
event. That way we know that before the event we have the old
instruction, after the event we have the new instruction, and any
ambiguity must be resolved with exception packets.

That is, if there is concurrency, the trace buffer will be 'flushed' by
the exception. If there is no concurrency, we don't care and can assume
old/new depending on timestamps relative to the event.
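
Put another way, the decoder-side rule this implies looks roughly like the
following (a sketch; struct text_poke_record and its fields are illustrative):

  /*
   * Trace data ordered before the TEXT_POKE event decodes against the old
   * bytes, trace data after it against the new bytes; anything overlapping
   * the transition window is resolved by the exception packets themselves.
   */
  static const u8 *bytes_for_poke_site(const struct text_poke_record *rec,
                                       bool trace_before_event,
                                       bool hit_exception_packet)
  {
          if (hit_exception_packet)
                  return NULL;    /* flow is explicit in the trace */

          return trace_before_event ? rec->old_bytes : rec->new_bytes;
  }
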

> > at this point we know the instruction _will_ trap and CS/PT can observe
> > this alternate flow. That is, the exception handler will emulate the
> > instruction.
> > 
> >  - emit the TEXT_POKE event with both the old and new instruction
> >    included
> > 
> >  - overwrite the target instruction with the new instruction
> >  - sync
> > 
> > at this point the new instruction should be valid.
> > 
> > Using this scheme we can at all times follow the instruction flow.
> > Either it is an exception and the exception encoding helps us navigate,
> > or, on either side, we'll know the old/new instruction.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [tip: perf/core] perf dso: Add dso__data_write_cache_addr()
  2019-10-25 12:59 ` [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr() Adrian Hunter
  2019-10-28 15:45   ` Jiri Olsa
@ 2019-11-12 11:18   ` tip-bot2 for Adrian Hunter
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Adrian Hunter @ 2019-11-12 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Adrian Hunter, Alexander Shishkin, Borislav Petkov,
	H. Peter Anvin, Jiri Olsa, Leo Yan, Mark Rutland,
	Mathieu Poirier, Peter Zijlstra, x86, Arnaldo Carvalho de Melo,
	Ingo Molnar, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     b86a9d918a389162803d833d4dc491fde9b62fa2
Gitweb:        https://git.kernel.org/tip/b86a9d918a389162803d833d4dc491fde9b62fa2
Author:        Adrian Hunter <adrian.hunter@intel.com>
AuthorDate:    Fri, 25 Oct 2019 15:59:57 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Wed, 06 Nov 2019 15:43:06 -03:00

perf dso: Add dso__data_write_cache_addr()

Add functions to write into the dso file data cache, but not change the
file itself.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lore.kernel.org/lkml/20191025130000.13032-4-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/dso.c | 73 +++++++++++++++++++++++++++++++++---------
 tools/perf/util/dso.h |  7 ++++-
 2 files changed, 65 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 460330d..0f1b772 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -827,14 +827,16 @@ out:
 	return cache;
 }
 
-static ssize_t
-dso_cache__memcpy(struct dso_cache *cache, u64 offset,
-		  u8 *data, u64 size)
+static ssize_t dso_cache__memcpy(struct dso_cache *cache, u64 offset, u8 *data,
+				 u64 size, bool out)
 {
 	u64 cache_offset = offset - cache->offset;
 	u64 cache_size   = min(cache->size - cache_offset, size);
 
-	memcpy(data, cache->data + cache_offset, cache_size);
+	if (out)
+		memcpy(data, cache->data + cache_offset, cache_size);
+	else
+		memcpy(cache->data + cache_offset, data, cache_size);
 	return cache_size;
 }
 
@@ -910,8 +912,8 @@ static struct dso_cache *dso_cache__find(struct dso *dso,
 	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
 }
 
-static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
-			      u64 offset, u8 *data, ssize_t size)
+static ssize_t dso_cache_io(struct dso *dso, struct machine *machine,
+			    u64 offset, u8 *data, ssize_t size, bool out)
 {
 	struct dso_cache *cache;
 	ssize_t ret = 0;
@@ -920,16 +922,16 @@ static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
 	if (!cache)
 		return ret;
 
-	return dso_cache__memcpy(cache, offset, data, size);
+	return dso_cache__memcpy(cache, offset, data, size, out);
 }
 
 /*
  * Reads and caches dso data DSO__DATA_CACHE_SIZE size chunks
  * in the rb_tree. Any read to already cached data is served
- * by cached data.
+ * by cached data. Writes update the cache only, not the backing file.
  */
-static ssize_t cached_read(struct dso *dso, struct machine *machine,
-			   u64 offset, u8 *data, ssize_t size)
+static ssize_t cached_io(struct dso *dso, struct machine *machine,
+			 u64 offset, u8 *data, ssize_t size, bool out)
 {
 	ssize_t r = 0;
 	u8 *p = data;
@@ -937,7 +939,7 @@ static ssize_t cached_read(struct dso *dso, struct machine *machine,
 	do {
 		ssize_t ret;
 
-		ret = dso_cache_read(dso, machine, offset, p, size);
+		ret = dso_cache_io(dso, machine, offset, p, size, out);
 		if (ret < 0)
 			return ret;
 
@@ -1021,8 +1023,9 @@ off_t dso__data_size(struct dso *dso, struct machine *machine)
 	return dso->data.file_size;
 }
 
-static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
-				u64 offset, u8 *data, ssize_t size)
+static ssize_t data_read_write_offset(struct dso *dso, struct machine *machine,
+				      u64 offset, u8 *data, ssize_t size,
+				      bool out)
 {
 	if (dso__data_file_size(dso, machine))
 		return -1;
@@ -1034,7 +1037,7 @@ static ssize_t data_read_offset(struct dso *dso, struct machine *machine,
 	if (offset + size < offset)
 		return -1;
 
-	return cached_read(dso, machine, offset, data, size);
+	return cached_io(dso, machine, offset, data, size, out);
 }
 
 /**
@@ -1054,7 +1057,7 @@ ssize_t dso__data_read_offset(struct dso *dso, struct machine *machine,
 	if (dso->data.status == DSO_DATA_STATUS_ERROR)
 		return -1;
 
-	return data_read_offset(dso, machine, offset, data, size);
+	return data_read_write_offset(dso, machine, offset, data, size, true);
 }
 
 /**
@@ -1075,6 +1078,46 @@ ssize_t dso__data_read_addr(struct dso *dso, struct map *map,
 	return dso__data_read_offset(dso, machine, offset, data, size);
 }
 
+/**
+ * dso__data_write_cache_offs - Write data to dso data cache at file offset
+ * @dso: dso object
+ * @machine: machine object
+ * @offset: file offset
+ * @data: buffer to write
+ * @size: size of the @data buffer
+ *
+ * Write into the dso file data cache, but do not change the file itself.
+ */
+ssize_t dso__data_write_cache_offs(struct dso *dso, struct machine *machine,
+				   u64 offset, const u8 *data_in, ssize_t size)
+{
+	u8 *data = (u8 *)data_in; /* cast away const to use same fns for r/w */
+
+	if (dso->data.status == DSO_DATA_STATUS_ERROR)
+		return -1;
+
+	return data_read_write_offset(dso, machine, offset, data, size, false);
+}
+
+/**
+ * dso__data_write_cache_addr - Write data to dso data cache at dso address
+ * @dso: dso object
+ * @machine: machine object
+ * @addr: virtual memory address
+ * @data: buffer to write
+ * @size: size of the @data buffer
+ *
+ * External interface to write into the dso file data cache, but do not change
+ * the file itself.
+ */
+ssize_t dso__data_write_cache_addr(struct dso *dso, struct map *map,
+				   struct machine *machine, u64 addr,
+				   const u8 *data, ssize_t size)
+{
+	u64 offset = map->map_ip(map, addr);
+	return dso__data_write_cache_offs(dso, machine, offset, data, size);
+}
+
 struct map *dso__new_map(const char *name)
 {
 	struct map *map = NULL;
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index e4dddb7..2f1fcbc 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -285,6 +285,8 @@ void dso__set_module_info(struct dso *dso, struct kmod_path *m,
  *   dso__data_size
  *   dso__data_read_offset
  *   dso__data_read_addr
+ *   dso__data_write_cache_offs
+ *   dso__data_write_cache_addr
  *
  * Please refer to the dso.c object code for each function and
  * arguments documentation. Following text tries to explain the
@@ -332,6 +334,11 @@ ssize_t dso__data_read_addr(struct dso *dso, struct map *map,
 			    struct machine *machine, u64 addr,
 			    u8 *data, ssize_t size);
 bool dso__data_status_seen(struct dso *dso, enum dso_data_status_seen by);
+ssize_t dso__data_write_cache_offs(struct dso *dso, struct machine *machine,
+				   u64 offset, const u8 *data, ssize_t size);
+ssize_t dso__data_write_cache_addr(struct dso *dso, struct map *map,
+				   struct machine *machine, u64 addr,
+				   const u8 *data, ssize_t size);
 
 struct map *dso__new_map(const char *name);
 struct dso *machine__findnew_kernel(struct machine *machine, const char *name,
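
As a rough usage sketch (the real caller arrives later in this series;
machine__kernel_map() is an existing perf helper, the rest is illustrative):

  /*
   * Sketch: update perf's cached view of kernel text at 'addr' with 'len'
   * new bytes; the backing file (e.g. a copy of kcore) is left untouched.
   */
  static int write_poked_bytes(struct machine *machine, u64 addr,
                               const u8 *new_bytes, size_t len)
  {
          struct map *map = machine__kernel_map(machine);
          struct dso *dso = map ? map->dso : NULL;
          ssize_t ret;

          if (!dso)
                  return -1;

          ret = dso__data_write_cache_addr(dso, map, machine, addr,
                                           new_bytes, len);
          return ret == (ssize_t)len ? 0 : -1;
  }
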

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: perf/core] perf auxtrace: Add auxtrace_cache__remove()
  2019-10-25 12:59 ` [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove() Adrian Hunter
  2019-10-25 14:48   ` Arnaldo Carvalho de Melo
@ 2019-11-12 11:18   ` tip-bot2 for Adrian Hunter
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Adrian Hunter @ 2019-11-12 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Adrian Hunter, Alexander Shishkin, Borislav Petkov,
	H. Peter Anvin, Jiri Olsa, Leo Yan, Mark Rutland,
	Mathieu Poirier, Peter Zijlstra, x86, Arnaldo Carvalho de Melo,
	Ingo Molnar, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     fd62c1097a0700484fc2cbc9a182f341f30890cd
Gitweb:        https://git.kernel.org/tip/fd62c1097a0700484fc2cbc9a182f341f30890cd
Author:        Adrian Hunter <adrian.hunter@intel.com>
AuthorDate:    Fri, 25 Oct 2019 15:59:59 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Wed, 06 Nov 2019 15:43:06 -03:00

perf auxtrace: Add auxtrace_cache__remove()

Add auxtrace_cache__remove(). Intel PT uses an auxtrace_cache to store
the results of code-walking, so that the same block of instructions does
not have to be decoded repeatedly. However, when there are text poke
events, the associated cache entries need to be removed.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lore.kernel.org/lkml/20191025130000.13032-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/auxtrace.c | 28 ++++++++++++++++++++++++++++
 tools/perf/util/auxtrace.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 8470dfe..c555c3c 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1457,6 +1457,34 @@ int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
 	return 0;
 }
 
+static struct auxtrace_cache_entry *auxtrace_cache__rm(struct auxtrace_cache *c,
+						       u32 key)
+{
+	struct auxtrace_cache_entry *entry;
+	struct hlist_head *hlist;
+	struct hlist_node *n;
+
+	if (!c)
+		return NULL;
+
+	hlist = &c->hashtable[hash_32(key, c->bits)];
+	hlist_for_each_entry_safe(entry, n, hlist, hash) {
+		if (entry->key == key) {
+			hlist_del(&entry->hash);
+			return entry;
+		}
+	}
+
+	return NULL;
+}
+
+void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key)
+{
+	struct auxtrace_cache_entry *entry = auxtrace_cache__rm(c, key);
+
+	auxtrace_cache__free_entry(c, entry);
+}
+
 void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key)
 {
 	struct auxtrace_cache_entry *entry;
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index f201f36..3f4aa54 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -489,6 +489,7 @@ void *auxtrace_cache__alloc_entry(struct auxtrace_cache *c);
 void auxtrace_cache__free_entry(struct auxtrace_cache *c, void *entry);
 int auxtrace_cache__add(struct auxtrace_cache *c, u32 key,
 			struct auxtrace_cache_entry *entry);
+void auxtrace_cache__remove(struct auxtrace_cache *c, u32 key);
 void *auxtrace_cache__lookup(struct auxtrace_cache *c, u32 key);
 
 struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
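
A rough usage sketch (the actual Intel PT hook-up is in the last patch of
the series; the key derivation below is purely illustrative):

  /*
   * Sketch: when text covered by an already-decoded, cached block is
   * poked, drop that entry so the block gets re-walked next time.
   */
  static void drop_decoded_block(struct auxtrace_cache *c, u64 block_offset)
  {
          u32 key = (u32)(block_offset >> 6);     /* illustrative key scheme */

          /* frees the entry via auxtrace_cache__free_entry() if it exists */
          auxtrace_cache__remove(c, key);
  }
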

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: perf/core] perf dso: Refactor dso_cache__read()
  2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
  2019-10-25 14:54   ` Arnaldo Carvalho de Melo
  2019-10-28 15:39   ` Jiri Olsa
@ 2019-11-12 11:18   ` tip-bot2 for Adrian Hunter
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Adrian Hunter @ 2019-11-12 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Adrian Hunter, Alexander Shishkin, Borislav Petkov,
	H. Peter Anvin, Jiri Olsa, Leo Yan, Mark Rutland,
	Mathieu Poirier, Peter Zijlstra, x86, Arnaldo Carvalho de Melo,
	Ingo Molnar, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     366df72657e0cd6bd072b56a48e63b8d89718f70
Gitweb:        https://git.kernel.org/tip/366df72657e0cd6bd072b56a48e63b8d89718f70
Author:        Adrian Hunter <adrian.hunter@intel.com>
AuthorDate:    Fri, 25 Oct 2019 15:59:56 +03:00
Committer:     Arnaldo Carvalho de Melo <acme@redhat.com>
CommitterDate: Wed, 06 Nov 2019 15:43:06 -03:00

perf dso: Refactor dso_cache__read()

Refactor dso_cache__read() to separate populating the cache from copying
data from it.  This is preparation for adding a cache "write" that will
update the data in the cache.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lore.kernel.org/lkml/20191025130000.13032-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/dso.c | 64 ++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index e11ddf8..460330d 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -768,7 +768,7 @@ dso_cache__free(struct dso *dso)
 	pthread_mutex_unlock(&dso->lock);
 }
 
-static struct dso_cache *dso_cache__find(struct dso *dso, u64 offset)
+static struct dso_cache *__dso_cache__find(struct dso *dso, u64 offset)
 {
 	const struct rb_root *root = &dso->data.cache;
 	struct rb_node * const *p = &root->rb_node;
@@ -863,54 +863,64 @@ out:
 	return ret;
 }
 
-static ssize_t
-dso_cache__read(struct dso *dso, struct machine *machine,
-		u64 offset, u8 *data, ssize_t size)
+static struct dso_cache *dso_cache__populate(struct dso *dso,
+					     struct machine *machine,
+					     u64 offset, ssize_t *ret)
 {
 	u64 cache_offset = offset & DSO__DATA_CACHE_MASK;
 	struct dso_cache *cache;
 	struct dso_cache *old;
-	ssize_t ret;
 
 	cache = zalloc(sizeof(*cache) + DSO__DATA_CACHE_SIZE);
-	if (!cache)
-		return -ENOMEM;
+	if (!cache) {
+		*ret = -ENOMEM;
+		return NULL;
+	}
 
 	if (dso->binary_type == DSO_BINARY_TYPE__BPF_PROG_INFO)
-		ret = bpf_read(dso, cache_offset, cache->data);
+		*ret = bpf_read(dso, cache_offset, cache->data);
 	else
-		ret = file_read(dso, machine, cache_offset, cache->data);
+		*ret = file_read(dso, machine, cache_offset, cache->data);
 
-	if (ret > 0) {
-		cache->offset = cache_offset;
-		cache->size   = ret;
+	if (*ret <= 0) {
+		free(cache);
+		return NULL;
+	}
 
-		old = dso_cache__insert(dso, cache);
-		if (old) {
-			/* we lose the race */
-			free(cache);
-			cache = old;
-		}
+	cache->offset = cache_offset;
+	cache->size   = *ret;
 
-		ret = dso_cache__memcpy(cache, offset, data, size);
+	old = dso_cache__insert(dso, cache);
+	if (old) {
+		/* we lose the race */
+		free(cache);
+		cache = old;
 	}
 
-	if (ret <= 0)
-		free(cache);
+	return cache;
+}
 
-	return ret;
+static struct dso_cache *dso_cache__find(struct dso *dso,
+					 struct machine *machine,
+					 u64 offset,
+					 ssize_t *ret)
+{
+	struct dso_cache *cache = __dso_cache__find(dso, offset);
+
+	return cache ? cache : dso_cache__populate(dso, machine, offset, ret);
 }
 
 static ssize_t dso_cache_read(struct dso *dso, struct machine *machine,
 			      u64 offset, u8 *data, ssize_t size)
 {
 	struct dso_cache *cache;
+	ssize_t ret = 0;
 
-	cache = dso_cache__find(dso, offset);
-	if (cache)
-		return dso_cache__memcpy(cache, offset, data, size);
-	else
-		return dso_cache__read(dso, machine, offset, data, size);
+	cache = dso_cache__find(dso, machine, offset, &ret);
+	if (!cache)
+		return ret;
+
+	return dso_cache__memcpy(cache, offset, data, size);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-11-12 11:20 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-25 12:59 [PATCH RFC 0/6] perf/x86: Add perf text poke event Adrian Hunter
2019-10-25 12:59 ` [PATCH RFC 1/6] " Adrian Hunter
2019-10-30 10:47   ` Leo Yan
2019-10-30 12:46     ` Peter Zijlstra
2019-10-30 14:19       ` Leo Yan
2019-10-30 15:00         ` Mike Leach
2019-10-30 16:23         ` Peter Zijlstra
2019-10-31  7:31           ` Leo Yan
2019-11-01 10:04             ` Peter Zijlstra
2019-11-01 10:09               ` Peter Zijlstra
2019-11-04  2:23               ` Leo Yan
2019-11-08 15:05                 ` Leo Yan
2019-11-11 14:46                   ` Peter Zijlstra
2019-11-11 15:39                     ` Will Deacon
2019-11-11 16:05                       ` Peter Zijlstra
2019-11-11 17:29                         ` Will Deacon
2019-11-11 20:32                           ` Peter Zijlstra
     [not found]             ` <CAJ9a7VgZH7g=rFDpKf=FzEcyBVLS_WjqbrqtRnjOi7WOY4st+w@mail.gmail.com>
2019-11-01 10:06               ` Peter Zijlstra
2019-11-04 10:40   ` Peter Zijlstra
2019-11-04 12:32     ` Adrian Hunter
2019-10-25 12:59 ` [PATCH RFC 2/6] perf dso: Refactor dso_cache__read() Adrian Hunter
2019-10-25 14:54   ` Arnaldo Carvalho de Melo
2019-10-28 15:39   ` Jiri Olsa
2019-10-29  9:19     ` Adrian Hunter
2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
2019-10-25 12:59 ` [PATCH RFC 3/6] perf dso: Add dso__data_write_cache_addr() Adrian Hunter
2019-10-28 15:45   ` Jiri Olsa
2019-10-29  9:20     ` Adrian Hunter
2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
2019-10-25 12:59 ` [PATCH RFC 4/6] perf tools: Add support for PERF_RECORD_TEXT_POKE Adrian Hunter
2019-10-25 12:59 ` [PATCH RFC 5/6] perf auxtrace: Add auxtrace_cache__remove() Adrian Hunter
2019-10-25 14:48   ` Arnaldo Carvalho de Melo
2019-11-12 11:18   ` [tip: perf/core] " tip-bot2 for Adrian Hunter
2019-10-25 13:00 ` [PATCH RFC 6/6] perf intel-pt: Add support for text poke events Adrian Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).