linux-trace-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Initial trace-cmd perf support
@ 2020-12-03  6:02 Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 1/5] trace-cmd: Internal refactoring of pid address map logic Tzvetomir Stoyanov (VMware)
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

Initial implementation of trace-cmd perf subcommand. It collects call traces
of a given PID and resolves collected addresses to function names. Still
work in progress.
What it has till now:
 - Configures and runs perf to collect call traces of the given PID.
 - Runs per CPU thread to read perf buffers and store traces in temp files.
 - Resolves collected addresses to function names (well, at least tries to do
   it). This includes also dynamic libraries, loaded runtime during the
   trace.
What should be implemented:
 - Store the results in a trace.dat file. Should think about how to extend
   its format to hold these traces.
 - Visualise the call stack from trace.dat file with trace-cmd report and
   KernelShark.

Tzvetomir Stoyanov (VMware) (5):
  trace-cmd: Internal refactoring of pid address map logic
  trace-cmd: Make read_file_string() non static
  trace-cmd: New internal APIs for reading ELF header
  trace-cmd: Add a new option in trace.dat file for the address to
    function name mapping
  [WIP] trace-cmd: Add new subcomand "trace-cmd perf"

 Makefile                                      |  10 +
 .../include/private/trace-cmd-private.h       |   5 +
 lib/trace-cmd/trace-input.c                   | 144 +++
 tracecmd/Makefile                             |   2 +
 tracecmd/include/trace-local.h                |  46 +-
 tracecmd/trace-cmd.c                          |   1 +
 tracecmd/trace-dump.c                         |   9 +-
 tracecmd/trace-obj-debug.c                    | 907 ++++++++++++++++++
 tracecmd/trace-perf.c                         | 726 ++++++++++++++
 tracecmd/trace-record.c                       | 107 +--
 10 files changed, 1841 insertions(+), 116 deletions(-)
 create mode 100644 tracecmd/trace-obj-debug.c
 create mode 100644 tracecmd/trace-perf.c

-- 
2.28.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/5] trace-cmd: Internal refactoring of pid address map logic
  2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
@ 2020-12-03  6:02 ` Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 2/5] trace-cmd: Make read_file_string() non static Tzvetomir Stoyanov (VMware)
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

Make functions for collecting file to memory map of a PID non static,
so they can be reused in the trace-cmd application context. Created new
file trace-obj-debug.c with these functions.

Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
---
 tracecmd/Makefile              |   1 +
 tracecmd/include/trace-local.h |  19 +++--
 tracecmd/trace-obj-debug.c     | 138 +++++++++++++++++++++++++++++++++
 tracecmd/trace-record.c        | 107 +------------------------
 4 files changed, 152 insertions(+), 113 deletions(-)
 create mode 100644 tracecmd/trace-obj-debug.c

diff --git a/tracecmd/Makefile b/tracecmd/Makefile
index 01f36c61..5a16a5e5 100644
--- a/tracecmd/Makefile
+++ b/tracecmd/Makefile
@@ -32,6 +32,7 @@ TRACE_CMD_OBJS += trace-list.o
 TRACE_CMD_OBJS += trace-usage.o
 TRACE_CMD_OBJS += trace-dump.o
 TRACE_CMD_OBJS += trace-clear.o
+TRACE_CMD_OBJS += trace-obj-debug.o
 ifeq ($(VSOCK_DEFINED), 1)
 TRACE_CMD_OBJS += trace-tsync.o
 endif
diff --git a/tracecmd/include/trace-local.h b/tracecmd/include/trace-local.h
index 85c7e03e..8b2d4c21 100644
--- a/tracecmd/include/trace-local.h
+++ b/tracecmd/include/trace-local.h
@@ -179,14 +179,6 @@ struct func_list {
 	const char *mod;
 };
 
-struct pid_addr_maps {
-	struct pid_addr_maps		*next;
-	struct tracecmd_proc_addr_map	*lib_maps;
-	unsigned int			nr_lib_maps;
-	char				*proc_name;
-	int				pid;
-};
-
 struct opt_list {
 	struct opt_list *next;
 	const char	*option;
@@ -315,4 +307,15 @@ void *malloc_or_die(unsigned int size); /* Can be overridden */
 void __noreturn __die(const char *fmt, ...);
 void __noreturn _vdie(const char *fmt, va_list ap);
 
+/* --- Debug symbols--- */
+struct pid_addr_maps {
+	struct pid_addr_maps	*next;
+	struct tracecmd_proc_addr_map	*lib_maps;
+	unsigned int			nr_lib_maps;
+	char				*proc_name;
+	int				pid;
+};
+int trace_debug_get_filemap(struct pid_addr_maps **file_maps, int pid);
+void trace_debug_free_filemap(struct pid_addr_maps *maps);
+
 #endif /* __TRACE_LOCAL_H */
diff --git a/tracecmd/trace-obj-debug.c b/tracecmd/trace-obj-debug.c
new file mode 100644
index 00000000..9aa9baae
--- /dev/null
+++ b/tracecmd/trace-obj-debug.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Copyright (C) 2020, VMware, Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
+ *
+ */
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "trace-local.h"
+
+#define _STRINGIFY(x) #x
+#define STRINGIFY(x) _STRINGIFY(x)
+int trace_debug_get_filemap(struct pid_addr_maps **pid_maps, int pid)
+{
+	struct pid_addr_maps *maps = *pid_maps;
+	struct tracecmd_proc_addr_map *map;
+	unsigned long long begin, end;
+	struct pid_addr_maps *m;
+	char mapname[PATH_MAX+1];
+	char fname[PATH_MAX+1];
+	char buf[PATH_MAX+100];
+	FILE *f;
+	int ret;
+	int res;
+	int i;
+
+	sprintf(fname, "/proc/%d/exe", pid);
+	ret = readlink(fname, mapname, PATH_MAX);
+	if (ret >= PATH_MAX || ret < 0)
+		return -ENOENT;
+	mapname[ret] = 0;
+
+	sprintf(fname, "/proc/%d/maps", pid);
+	f = fopen(fname, "r");
+	if (!f)
+		return -ENOENT;
+
+	while (maps) {
+		if (pid == maps->pid)
+			break;
+		maps = maps->next;
+	}
+
+	ret = -ENOMEM;
+	if (!maps) {
+		maps = calloc(1, sizeof(*maps));
+		if (!maps)
+			goto out_fail;
+		maps->pid = pid;
+		maps->next = *pid_maps;
+		*pid_maps = maps;
+	} else {
+		for (i = 0; i < maps->nr_lib_maps; i++)
+			free(maps->lib_maps[i].lib_name);
+		free(maps->lib_maps);
+		maps->lib_maps = NULL;
+		maps->nr_lib_maps = 0;
+		free(maps->proc_name);
+	}
+
+	maps->proc_name = strdup(mapname);
+	if (!maps->proc_name)
+		goto out;
+
+	while (fgets(buf, sizeof(buf), f)) {
+		mapname[0] = '\0';
+		res = sscanf(buf, "%llx-%llx %*s %*x %*s %*d %"STRINGIFY(PATH_MAX)"s",
+			     &begin, &end, mapname);
+		if (res == 3 && mapname[0] != '\0') {
+			map = realloc(maps->lib_maps,
+				      (maps->nr_lib_maps + 1) * sizeof(*map));
+			if (!map)
+				goto out_fail;
+			map[maps->nr_lib_maps].end = end;
+			map[maps->nr_lib_maps].start = begin;
+			map[maps->nr_lib_maps].lib_name = strdup(mapname);
+			if (!map[maps->nr_lib_maps].lib_name)
+				goto out_fail;
+			maps->lib_maps = map;
+			maps->nr_lib_maps++;
+		}
+	}
+out:
+	fclose(f);
+	return 0;
+
+out_fail:
+	fclose(f);
+	if (maps) {
+		for (i = 0; i < maps->nr_lib_maps; i++)
+			free(maps->lib_maps[i].lib_name);
+		if (*pid_maps != maps) {
+			m = *pid_maps;
+			while (m) {
+				if (m->next == maps) {
+					m->next = maps->next;
+					break;
+				}
+				m = m->next;
+			}
+		} else
+			*pid_maps = maps->next;
+		free(maps->lib_maps);
+		maps->lib_maps = NULL;
+		maps->nr_lib_maps = 0;
+		free(maps->proc_name);
+		maps->proc_name = NULL;
+		free(maps);
+	}
+	return ret;
+}
+
+static void procmap_free(struct pid_addr_maps *maps)
+{
+	int i;
+
+	if (!maps)
+		return;
+	if (maps->lib_maps) {
+		for (i = 0; i < maps->nr_lib_maps; i++)
+			free(maps->lib_maps[i].lib_name);
+		free(maps->lib_maps);
+	}
+	free(maps->proc_name);
+	free(maps);
+}
+
+void trace_debug_free_filemap(struct pid_addr_maps *maps)
+{
+	struct pid_addr_maps *del;
+
+	while (maps) {
+		del = maps;
+		maps = maps->next;
+		procmap_free(del);
+	}
+}
diff --git a/tracecmd/trace-record.c b/tracecmd/trace-record.c
index 3a63f1bd..17d5474c 100644
--- a/tracecmd/trace-record.c
+++ b/tracecmd/trace-record.c
@@ -1075,109 +1075,6 @@ static char *make_pid_filter(struct buffer_instance *instance,
 	return filter;
 }
 
-#define _STRINGIFY(x) #x
-#define STRINGIFY(x) _STRINGIFY(x)
-
-static int get_pid_addr_maps(struct buffer_instance *instance, int pid)
-{
-	struct pid_addr_maps *maps = instance->pid_maps;
-	struct tracecmd_proc_addr_map *map;
-	unsigned long long begin, end;
-	struct pid_addr_maps *m;
-	char mapname[PATH_MAX+1];
-	char fname[PATH_MAX+1];
-	char buf[PATH_MAX+100];
-	FILE *f;
-	int ret;
-	int res;
-	int i;
-
-	sprintf(fname, "/proc/%d/exe", pid);
-	ret = readlink(fname, mapname, PATH_MAX);
-	if (ret >= PATH_MAX || ret < 0)
-		return -ENOENT;
-	mapname[ret] = 0;
-
-	sprintf(fname, "/proc/%d/maps", pid);
-	f = fopen(fname, "r");
-	if (!f)
-		return -ENOENT;
-
-	while (maps) {
-		if (pid == maps->pid)
-			break;
-		maps = maps->next;
-	}
-
-	ret = -ENOMEM;
-	if (!maps) {
-		maps = calloc(1, sizeof(*maps));
-		if (!maps)
-			goto out_fail;
-		maps->pid = pid;
-		maps->next = instance->pid_maps;
-		instance->pid_maps = maps;
-	} else {
-		for (i = 0; i < maps->nr_lib_maps; i++)
-			free(maps->lib_maps[i].lib_name);
-		free(maps->lib_maps);
-		maps->lib_maps = NULL;
-		maps->nr_lib_maps = 0;
-		free(maps->proc_name);
-	}
-
-	maps->proc_name = strdup(mapname);
-	if (!maps->proc_name)
-		goto out;
-
-	while (fgets(buf, sizeof(buf), f)) {
-		mapname[0] = '\0';
-		res = sscanf(buf, "%llx-%llx %*s %*x %*s %*d %"STRINGIFY(PATH_MAX)"s",
-			     &begin, &end, mapname);
-		if (res == 3 && mapname[0] != '\0') {
-			map = realloc(maps->lib_maps,
-				      (maps->nr_lib_maps + 1) * sizeof(*map));
-			if (!map)
-				goto out_fail;
-			map[maps->nr_lib_maps].end = end;
-			map[maps->nr_lib_maps].start = begin;
-			map[maps->nr_lib_maps].lib_name = strdup(mapname);
-			if (!map[maps->nr_lib_maps].lib_name)
-				goto out_fail;
-			maps->lib_maps = map;
-			maps->nr_lib_maps++;
-		}
-	}
-out:
-	fclose(f);
-	return 0;
-
-out_fail:
-	fclose(f);
-	if (maps) {
-		for (i = 0; i < maps->nr_lib_maps; i++)
-			free(maps->lib_maps[i].lib_name);
-		if (instance->pid_maps != maps) {
-			m = instance->pid_maps;
-			while (m) {
-				if (m->next == maps) {
-					m->next = maps->next;
-					break;
-				}
-				m = m->next;
-			}
-		} else
-			instance->pid_maps = maps->next;
-		free(maps->lib_maps);
-		maps->lib_maps = NULL;
-		maps->nr_lib_maps = 0;
-		free(maps->proc_name);
-		maps->proc_name = NULL;
-		free(maps);
-	}
-	return ret;
-}
-
 static void get_filter_pid_maps(void)
 {
 	struct buffer_instance *instance;
@@ -1189,7 +1086,7 @@ static void get_filter_pid_maps(void)
 		for (p = instance->filter_pids; p; p = p->next) {
 			if (p->exclude)
 				continue;
-			get_pid_addr_maps(instance, p->pid);
+			trace_debug_get_filemap(&instance->pid_maps, p->pid);
 		}
 	}
 }
@@ -1519,7 +1416,7 @@ static void ptrace_wait(enum trace_type type)
 			case PTRACE_EVENT_EXIT:
 				instance = get_intance_fpid(pid);
 				if (instance && instance->get_procmap)
-					get_pid_addr_maps(instance, pid);
+					trace_debug_get_filemap(&instance->pid_maps, pid);
 				ptrace(PTRACE_GETEVENTMSG, pid, NULL, &cstatus);
 				ptrace(PTRACE_DETACH, pid, NULL, NULL);
 				break;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/5] trace-cmd: Make read_file_string() non static
  2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 1/5] trace-cmd: Internal refactoring of pid address map logic Tzvetomir Stoyanov (VMware)
@ 2020-12-03  6:02 ` Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 3/5] trace-cmd: New internal APIs for reading ELF header Tzvetomir Stoyanov (VMware)
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

This helper function is used to read strings from a binary file locally
in trace-dump.c context. The same functionality could be useful for
other trace-cmd logics. Made the function non static and prefix it with
"trace_", so it could be used in trace-cmd application context.

Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
---
 tracecmd/include/trace-local.h | 1 +
 tracecmd/trace-dump.c          | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/tracecmd/include/trace-local.h b/tracecmd/include/trace-local.h
index 8b2d4c21..109c3328 100644
--- a/tracecmd/include/trace-local.h
+++ b/tracecmd/include/trace-local.h
@@ -318,4 +318,5 @@ struct pid_addr_maps {
 int trace_debug_get_filemap(struct pid_addr_maps **file_maps, int pid);
 void trace_debug_free_filemap(struct pid_addr_maps *maps);
 
+int trace_read_file_string(int fd, char *dst, int len);
 #endif /* __TRACE_LOCAL_H */
diff --git a/tracecmd/trace-dump.c b/tracecmd/trace-dump.c
index 5642f12a..c55e2389 100644
--- a/tracecmd/trace-dump.c
+++ b/tracecmd/trace-dump.c
@@ -49,7 +49,7 @@ enum dump_items verbosity;
 			tracecmd_plog(fmt, ##__VA_ARGS__);	\
 	} while (0)
 
-static int read_file_string(int fd, char *dst, int len)
+int trace_read_file_string(int fd, char *dst, int len)
 {
 	size_t size = 0;
 	int r;
@@ -160,7 +160,7 @@ static void dump_initial_format(int fd)
 		die("wrong tracing string: %s", buf);
 
 	/* get file version */
-	if (read_file_string(fd, buf, DUMP_SIZE))
+	if (trace_read_file_string(fd, buf, DUMP_SIZE))
 		die("no version string");
 
 	do_print(SUMMARY, "\t\t%s\t[Version]\n", buf);
@@ -260,7 +260,7 @@ static void dump_events_format(int fd)
 
 	while (systems) {
 
-		if (read_file_string(fd, buf, DUMP_SIZE))
+		if (trace_read_file_string(fd, buf, DUMP_SIZE))
 			die("cannot read the name of the $dth system", systems);
 		if (read_file_number(fd, &events, 4))
 			die("cannot read the count of the events in system %s",
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/5] trace-cmd: New internal APIs for reading ELF header
  2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 1/5] trace-cmd: Internal refactoring of pid address map logic Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 2/5] trace-cmd: Make read_file_string() non static Tzvetomir Stoyanov (VMware)
@ 2020-12-03  6:02 ` Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 4/5] trace-cmd: Add a new option in trace.dat file for the address to function name mapping Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf" Tzvetomir Stoyanov (VMware)
  4 siblings, 0 replies; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

Implemented new trace-cmd internal APIs for parsing ELF header and
resolving VMA to function name and function name to VMA and file offset.
The bfd library is used to read and parse the binary file. The new APIs
are part of trace-cmd application and are visible only inside its
context.

Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
---
 Makefile                       |  10 +
 tracecmd/include/trace-local.h |  24 +
 tracecmd/trace-obj-debug.c     | 769 +++++++++++++++++++++++++++++++++
 3 files changed, 803 insertions(+)

diff --git a/Makefile b/Makefile
index 8626b91b..1ca933a7 100644
--- a/Makefile
+++ b/Makefile
@@ -246,6 +246,16 @@ endif
 CUNIT_INSTALLED := $(shell if (printf "$(pound)include <CUnit/Basic.h>\n void main(){CU_initialize_registry();}" | $(CC) -x c - -lcunit >/dev/null 2>&1) ; then echo 1; else echo 0 ; fi)
 export CUNIT_INSTALLED
 
+BFD_INSTALLED := $(shell if (echo -e "$(pound)include <bfd.h>\n void main(){bfd_init();}" | $(CC) -o /dev/null -xc  - -lbfd >/dev/null 2>&1) ; then echo 1; else echo 0 ; fi)
+
+export BFD_INSTALLED
+ifeq ($(BFD_INSTALLED), 1)
+CFLAGS += -DBFD_INSTALLED
+LIBS += -lbfd
+else
+$(warning libbfd is not installed)
+endif
+
 export CFLAGS
 export INCLUDES
 
diff --git a/tracecmd/include/trace-local.h b/tracecmd/include/trace-local.h
index 109c3328..cd2d34a4 100644
--- a/tracecmd/include/trace-local.h
+++ b/tracecmd/include/trace-local.h
@@ -318,5 +318,29 @@ struct pid_addr_maps {
 int trace_debug_get_filemap(struct pid_addr_maps **file_maps, int pid);
 void trace_debug_free_filemap(struct pid_addr_maps *maps);
 
+struct tracecmd_debug_symbols {
+	char *name;			/* symbol's name */
+	char *fname;			/* symbol's file */
+	unsigned long long vma_start;	/* symbol's start VMA */
+	unsigned long long vma_near;	/* symbol's requested VMA */
+	unsigned long long foffset;	/* symbol's offset in the binary file*/
+};
+struct trace_debug_object;
+struct trace_debug_object *trace_debug_obj_create_file(char *file, bool libs);
+struct trace_debug_object *trace_debug_obj_create_pid(int pid);
+void trace_debug_obj_destroy(struct trace_debug_object *debug);
+int trace_debug_obj_add_file(struct trace_debug_object *dbg, char *file_name,
+			     unsigned long long vmem_start,
+			     unsigned long long vmem_end,
+			     unsigned long long pgoff);
+
+int trace_debug_resolve_symbols(struct trace_debug_object *obj);
+int trace_debug_add_resolve_symbol(struct trace_debug_object *obj,
+				   unsigned long long vma, char *name);
+
+void trace_debug_walk_resolved_symbols(struct trace_debug_object *obj,
+				       int (*callback)(struct tracecmd_debug_symbols *, void *),
+				       void *context);
+
 int trace_read_file_string(int fd, char *dst, int len);
 #endif /* __TRACE_LOCAL_H */
diff --git a/tracecmd/trace-obj-debug.c b/tracecmd/trace-obj-debug.c
index 9aa9baae..6a4ae1da 100644
--- a/tracecmd/trace-obj-debug.c
+++ b/tracecmd/trace-obj-debug.c
@@ -4,10 +4,779 @@
  *
  */
 #include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <limits.h>
 #include <errno.h>
+#include <bfd.h>
 #include <unistd.h>
+#include <dlfcn.h>
+#include <fnmatch.h>
 
 #include "trace-local.h"
+#include "trace-cmd.h"
+
+//#define DEBUG_INTERNALS
+
+#ifdef BFD_INSTALLED
+
+struct trace_debug_handle {
+	bfd *bfd;
+	unsigned long long addr_offset;
+};
+
+enum match_type {
+	MATCH_EXACT	= 0,
+	MATH_WILDCARD	= 1,
+};
+struct debug_symbols {
+	struct debug_symbols		*next;
+	struct tracecmd_debug_symbols	symbol;
+	enum match_type			match;
+};
+
+struct trace_debug_file {
+	struct trace_debug_file		*next;
+	char				*file_name;
+	unsigned long long		vmem_start;
+	unsigned long long		vmem_end;
+	struct trace_debug_handle	*dbg;
+	int				sym_count;
+	struct debug_symbols		*sym; /* symbols to resolve,
+					       * look in this file only
+					       */
+};
+
+struct trace_debug_object {
+	int				pid;
+	char				*fname;
+	struct pid_addr_maps		*fmaps;
+	int				sym_count;
+	struct debug_symbols		*sym;	/* symbols to resolve,
+						 * look into all files
+						 */
+	struct trace_debug_file		*files;
+};
+
+#define RESOLVE_NAME		(1 << 0)
+#define RESOLVE_VMA		(1 << 1)
+#define RESOLVE_FOFFSET		(1 << 2)
+struct trace_obj_job {
+	unsigned int flags;
+	unsigned long long addr_offset;
+	struct debug_symbols *symbols;
+};
+
+struct dwarf_bfd_context {
+	asymbol **table;
+	struct trace_obj_job *job;
+};
+
+static void process_bfd_section(bfd *abfd, asection *section, void *param)
+{
+	struct dwarf_bfd_context *context = (struct dwarf_bfd_context *)param;
+	unsigned int discriminator;
+	const char *functionname;
+	struct debug_symbols *s;
+	unsigned long long vma;
+	const char *filename;
+	unsigned int line;
+	bfd_boolean found;
+
+	if (!(section->flags & SEC_CODE))
+		return;
+
+	for (s = context->job->symbols; s; s = s->next) {
+		if (s->symbol.vma_near)
+			vma = s->symbol.vma_near;
+		else if (s->symbol.vma_start)
+			vma = s->symbol.vma_start;
+		else
+			continue;
+
+		if (abfd->flags & DYNAMIC)
+			vma -=  context->job->addr_offset;
+		if (vma && section->vma <= vma &&
+		    (section->vma + section->size) > vma) {
+			if (!s->symbol.fname)
+				s->symbol.fname = strdup(abfd->filename);
+			if (context->job->flags & RESOLVE_FOFFSET)
+				s->symbol.foffset = section->filepos + (vma - section->vma);
+			if (!s->symbol.name && (context->job->flags & RESOLVE_NAME)) {
+				found = bfd_find_nearest_line_discriminator(abfd, section, context->table,
+									    vma - section->vma, &filename,
+									    &functionname, &line, &discriminator);
+				if (found)
+					s->symbol.name = strdup(functionname);
+			}
+		}
+	}
+}
+
+static asymbol **get_sym_table(bfd *handle)
+{
+	bfd_boolean dyn = FALSE;
+	asymbol **symtable;
+	long count;
+	long size;
+
+	if ((bfd_get_file_flags(handle) & HAS_SYMS) == 0)
+		return NULL;
+	size = bfd_get_symtab_upper_bound(handle);
+	if (size == 0) {
+		size = bfd_get_dynamic_symtab_upper_bound(handle);
+		dyn = TRUE;
+	}
+	if (size <= 0)
+		return NULL;
+
+	symtable = (asymbol **) calloc(1, size);
+	if (!symtable)
+		return NULL;
+	if (dyn)
+		count = bfd_canonicalize_dynamic_symtab(handle, symtable);
+	else
+		count = bfd_canonicalize_symtab(handle, symtable);
+	if (count <= 0) {
+		free(symtable);
+		return NULL;
+	}
+/*
+ *	 alloc = bfd_demangle(cur_bfd, name, demangle_flags);
+ */
+
+	return symtable;
+}
+
+static bool sym_match(char *pattern, enum match_type match, const char *symbol)
+{
+	bool ret = false;
+
+	switch (match) {
+	case MATCH_EXACT:
+		if (strlen(pattern) == strlen(symbol) &&
+		    !strcmp(pattern, symbol))
+			ret = true;
+		break;
+	case MATH_WILDCARD:
+		if (!fnmatch(pattern, symbol, 0))
+			ret = true;
+		break;
+	}
+
+	return ret;
+}
+
+static int lookup_bfd_sym(struct dwarf_bfd_context *context)
+{
+	struct debug_symbols *s, *last = NULL;
+	struct debug_symbols *new, *new_list = NULL;
+	unsigned long long vma;
+	asymbol **sp;
+	int res = 0;
+
+	for (sp = context->table; *sp != NULL; sp++) {
+		if (!((*sp)->flags & BSF_FUNCTION))
+			continue;
+		for (s = context->job->symbols; s; s = s->next) {
+			if (!s->symbol.name)
+				continue;
+			last = s;
+			if (!sym_match(s->symbol.name, s->match, (*sp)->name))
+				continue;
+#ifdef DEBUG_INTERNALS
+			printf("Matched %s, pattern %s\n\r", (*sp)->name, s->symbol.name);
+#endif
+			vma = (*sp)->value + (*sp)->section->vma;
+			if ((*sp)->the_bfd->flags & DYNAMIC)
+				vma += context->job->addr_offset;
+			if (s->match == MATCH_EXACT) { /* exact match */
+				s->symbol.vma_start = vma;
+			} else if (s->match == MATH_WILDCARD) { /* wildcard pattern match */
+				new = calloc(1, sizeof(struct debug_symbols));
+				if (!new)
+					break;
+				new->symbol.name = strdup((*sp)->name);
+				new->symbol.vma_start = vma;
+				new->symbol.vma_near = s->symbol.vma_near;
+				new->symbol.foffset = s->symbol.foffset;
+				if (s->symbol.fname)
+					new->symbol.fname = strdup(s->symbol.fname);
+				new->next = new_list;
+				new_list = new;
+			}
+			res++;
+		}
+	}
+	if (last && !last->next)
+		last->next = new_list;
+
+	return res;
+}
+
+static int process_bfd_object(bfd *abfd, struct trace_obj_job *job)
+{
+	struct dwarf_bfd_context context;
+	int ret = 0;
+
+	memset(&context, 0, sizeof(context));
+	context.job = job;
+
+	if (bfd_check_format_matches(abfd, bfd_object, NULL) ||
+	    bfd_check_format_matches(abfd, bfd_core, NULL)) {
+		context.table = get_sym_table(abfd);
+		if (!context.table)
+			return -1;
+		if (job->flags & RESOLVE_VMA)
+			lookup_bfd_sym(&context);
+		if ((job->flags & RESOLVE_NAME) || (job->flags & RESOLVE_FOFFSET))
+			bfd_map_over_sections(abfd, process_bfd_section, &context);
+		free(context.table);
+	} else {
+		ret = -1;
+	}
+
+	return ret;
+}
+
+static int read_all_bfd(bfd *abfd, struct trace_obj_job *job)
+{
+	bfd *last_arfile = NULL;
+	bfd *arfile = NULL;
+	int ret = 0;
+
+	if (bfd_check_format(abfd, bfd_archive)) {
+		for (;;) {
+			bfd_set_error(bfd_error_no_error);
+			arfile = bfd_openr_next_archived_file(abfd, arfile);
+			if (!arfile) {
+				if (bfd_get_error() != bfd_error_no_more_archived_files)
+					break;
+			}
+			ret = read_all_bfd(arfile, job);
+			if (last_arfile)
+				bfd_close(last_arfile);
+			last_arfile = arfile;
+		}
+		if (last_arfile)
+			bfd_close(last_arfile);
+	} else
+		ret = process_bfd_object(abfd, job);
+
+	return ret;
+}
+
+/**
+ * resolve_symbol_vma - name -> (vma, file offset) resolving
+ * @obj - pointer to object, returned by trace_obj_debug_create()
+ * @symbols - link list with desired symbols, with given name
+ *
+ * Get VMA and file offset of the symbols with given name
+ * Return 0 on success, -1 on error
+ */
+static int resolve_symbol_vma(struct trace_debug_handle *obj,
+			      struct debug_symbols *symbols)
+{
+	struct trace_obj_job job;
+	int ret;
+
+	memset(&job, 0, sizeof(job));
+	job.flags |= RESOLVE_VMA;
+	job.flags |= RESOLVE_FOFFSET;
+	job.symbols = symbols;
+	job.addr_offset = obj->addr_offset;
+	ret = read_all_bfd(obj->bfd, &job);
+
+	return ret;
+}
+
+/**
+ * resolve_symbol_name - vma -> name resolving
+ * @obj - pointer to object, returned by trace_obj_debug_create()
+ * @symbols - link list with desired symbols, with given VMA
+ *
+ * Get names of the symbols with given VMA, look for nearest symbol to that VMA
+ * Return 0 on success, -1 on error
+ */
+static int resolve_symbol_name(struct trace_debug_handle *obj,
+			       struct debug_symbols *symbols)
+{
+	struct trace_obj_job job;
+
+	if (!obj || !obj->bfd)
+		return -1;
+	memset(&job, 0, sizeof(job));
+	job.flags |= RESOLVE_NAME;
+	job.addr_offset = obj->addr_offset;
+	job.symbols = symbols;
+	return read_all_bfd(obj->bfd, &job);
+}
+
+/**
+ * debug_handle_destroy - Close file opened with trace_obj_debug_create()
+ * @obj - pointer to object, returned by trace_obj_debug_create()
+ *
+ * Close the file and free any allocated resources, related to file's debug
+ * information
+ */
+static void debug_handle_destroy(struct trace_debug_handle *obj)
+{
+	if (obj && obj->bfd)
+		bfd_close(obj->bfd);
+	free(obj);
+}
+
+/**
+ * debug_handle_create - Open binary file for parsing ELF and DWARF information
+ * @name: Name of the binary ELF file.
+ *
+ * Return pointer to trace_obj_debug structure, that can be passed to other APIs
+ * for extracting debug information from the file. NULL in case of an error.
+ */
+static struct trace_debug_handle *debug_handle_create(char *file)
+{
+	struct trace_debug_handle *obj = NULL;
+
+	obj = calloc(1, sizeof(*obj));
+	if (!obj)
+		return NULL;
+
+	bfd_init();
+	obj->bfd = bfd_openr(file, NULL);
+	if (!obj->bfd)
+		goto error;
+	obj->bfd->flags |= BFD_DECOMPRESS;
+
+	return obj;
+
+error:
+	debug_handle_destroy(obj);
+	return NULL;
+}
+
+static void set_vma_offset(struct trace_debug_handle *obj,
+				unsigned long long addr_offset)
+{
+	if (obj)
+		obj->addr_offset = addr_offset;
+}
+
+static char *get_full_name(int pid)
+{
+	char mapname[PATH_MAX+1];
+	char fname[PATH_MAX+1];
+	int ret;
+
+	sprintf(fname, "/proc/%d/exe", pid);
+	ret = readlink(fname, mapname, PATH_MAX);
+	if (ret >= PATH_MAX || ret < 0)
+		return NULL;
+	mapname[ret] = 0;
+
+	return strdup(mapname);
+}
+
+static struct trace_debug_file *get_mapped_file(struct trace_debug_object *dbg,
+						char *fname,
+						unsigned long long vmem_start)
+{
+	struct trace_debug_file *file = dbg->files;
+
+	while (file) {
+		if (!strcmp(fname, file->file_name) &&
+		    vmem_start && file->vmem_end == vmem_start)
+			break;
+		file = file->next;
+	}
+	if (file)
+		return file;
+
+	file = calloc(1, sizeof(*file));
+	if (!file)
+		return NULL;
+	file->file_name = strdup(fname);
+	file->dbg = debug_handle_create(fname);
+	file->next = dbg->files;
+	dbg->files = file;
+	return file;
+}
+
+void trace_debug_obj_destroy(struct trace_debug_object *dbg)
+{
+	struct trace_debug_file *fdel;
+	struct debug_symbols *sdel;
+
+	while (dbg->sym) {
+		sdel = dbg->sym;
+		dbg->sym = dbg->sym->next;
+		free(sdel->symbol.name);
+		free(sdel->symbol.fname);
+		free(sdel);
+	}
+	while (dbg->files) {
+		fdel = dbg->files;
+		dbg->files = dbg->files->next;
+		debug_handle_destroy(fdel->dbg);
+		while (fdel->sym) {
+			sdel = fdel->sym;
+			fdel->sym = fdel->sym->next;
+			free(sdel->symbol.name);
+			free(sdel->symbol.fname);
+			free(sdel);
+		}
+		free(fdel);
+	}
+
+	free(dbg->fname);
+	trace_debug_free_filemap(dbg->fmaps);
+	free(dbg);
+}
+
+int trace_debug_obj_add_file(struct trace_debug_object *dbg, char *file_name,
+			     unsigned long long vmem_start,
+			     unsigned long long vmem_end,
+			     unsigned long long pgoff)
+{
+	struct trace_debug_file *file;
+
+	file = get_mapped_file(dbg, file_name, vmem_start);
+	if (!file)
+		return -1;
+	if (file->vmem_end == vmem_start) {
+		file->vmem_end = vmem_end;
+	} else {
+		file->vmem_start = vmem_start;
+		file->vmem_end = vmem_end;
+		set_vma_offset(file->dbg, vmem_start - pgoff);
+	}
+
+	return 0;
+}
+
+struct trace_debug_object *trace_debug_obj_create_pid(int pid)
+{
+	struct trace_debug_object *dbg;
+	int i;
+
+	dbg = calloc(1, sizeof(*dbg));
+	if (!dbg)
+		return NULL;
+
+	dbg->pid = pid;
+	dbg->fname = get_full_name(pid);
+	if (!dbg->fname)
+		return NULL;
+	trace_debug_get_filemap(&dbg->fmaps, pid);
+
+	for (i = 0; i < dbg->fmaps->nr_lib_maps; i++) {
+		if (trace_debug_obj_add_file(dbg, dbg->fmaps->lib_maps[i].lib_name,
+				dbg->fmaps->lib_maps[i].start,
+				dbg->fmaps->lib_maps[i].end, 0) < 0)
+			break;
+	}
+	if (i < dbg->fmaps->nr_lib_maps)
+		goto error;
+
+	return dbg;
+
+error:
+	trace_debug_obj_destroy(dbg);
+	return NULL;
+}
+
+static char *get_lib_full_path(char *libname)
+{
+	void *h = dlmopen(LM_ID_NEWLM, libname, RTLD_LAZY);
+	char dldir[PATH_MAX+1];
+	char *fname = NULL;
+	int ret;
+
+	if (!h)
+		return NULL;
+	ret = dlinfo(h, RTLD_DI_ORIGIN, dldir);
+	dlclose(h);
+
+	if (!ret) {
+		ret = asprintf(&fname, "%s/%s", dldir, libname);
+		if (ret > 0)
+			return fname;
+	}
+
+	free(fname);
+	return NULL;
+
+
+}
+
+static int debug_obj_file_add_libs(struct trace_debug_object *dbg,
+				   struct trace_debug_file *file)
+{
+	char line[PATH_MAX];
+	char *libname;
+	char *trimmed;
+	char *fullname;
+	char *cmd = NULL;
+	FILE *fp = NULL;
+	int ret = -1;
+
+	if (asprintf(&cmd, "LD_TRACE_LOADED_OBJECTS=1 %s", file->file_name) < 0)
+		return -1;
+	fp = popen(cmd, "r");
+	if (!fp)
+		goto out;
+
+	while (fgets(line, sizeof(line), fp) != NULL) {
+		libname = strchr(line, ' ');
+		trimmed = line;
+		if (libname) {
+			*libname = '\0';
+			while (isspace(*trimmed))
+				trimmed++;
+			if (*trimmed != '/') {
+				fullname = get_lib_full_path(trimmed);
+				if (fullname) {
+					get_mapped_file(dbg, fullname, 0);
+					free(fullname);
+				}
+			} else {
+				get_mapped_file(dbg, trimmed, 0);
+			}
+		}
+	}
+
+out:
+	if (fp)
+		pclose(fp);
+	free(cmd);
+	return ret;
+}
+
+struct trace_debug_object *trace_debug_obj_create_file(char *fname, bool libs)
+{
+	struct trace_debug_object *dbg;
+	struct trace_debug_file *file;
+
+	dbg = calloc(1, sizeof(*dbg));
+	if (!dbg)
+		return NULL;
+
+	dbg->fname = strdup(fname);
+	file = get_mapped_file(dbg, fname, 0);
+	if (!file)
+		goto error;
+	if (libs)
+		debug_obj_file_add_libs(dbg, file);
+
+#ifdef DEBUG_INTERNALS
+	printf("Created debug object for %s:\n\r", dbg->fname);
+	file = dbg->files;
+	while (file) {
+		printf("\t%s\n\r", file->file_name);
+		file = file->next;
+	}
+#endif
+	return dbg;
+
+error:
+	trace_debug_obj_destroy(dbg);
+	return NULL;
+}
+
+static void set_unknown(struct debug_symbols *sym, char *file)
+{
+	while (sym) {
+		if (!sym->symbol.fname)
+			sym->symbol.fname = strdup(file);
+		sym = sym->next;
+	}
+}
+
+int trace_debug_resolve_symbols(struct trace_debug_object *obj)
+{
+	struct trace_debug_file *file;
+
+	for (file = obj->files; file; file = file->next) {
+		if (!file->dbg) {
+			set_unknown(file->sym, file->file_name);
+			continue;
+		}
+		/* near VMA -> name resolving */
+		resolve_symbol_name(file->dbg, file->sym);
+		/* name -> exact VMA resolving */
+		resolve_symbol_vma(file->dbg, file->sym);
+		resolve_symbol_vma(file->dbg, obj->sym);
+	}
+
+	return 0;
+}
+
+static int add_resolve_vma2name(struct trace_debug_object *obj,
+				unsigned long long vma)
+{
+	struct debug_symbols *s = NULL;
+	struct trace_debug_file *file;
+
+	file = obj->files;
+	while (file) {
+		if (vma >= file->vmem_start && vma <= file->vmem_end)
+			break;
+		file = file->next;
+	}
+	if (file) {
+		s = file->sym;
+		while (s) {
+			if (s->symbol.vma_near == vma)
+				break;
+			s = s->next;
+		}
+		if (!s) {
+			s = calloc(1, sizeof(*s));
+			if (!s)
+				return -1;
+			s->symbol.vma_near = vma;
+			s->symbol.fname = strdup(file->file_name);
+			s->next = file->sym;
+			file->sym = s;
+			file->sym_count++;
+		}
+	}
+
+	if (s)
+		return 0;
+	return -1;
+}
+
+static int add_resolve_name2vma(struct trace_debug_object *obj, char *name)
+{
+	struct debug_symbols *s = NULL;
+
+	s = obj->sym;
+	while (s) {
+		if (s->symbol.name && !strcmp(name, s->symbol.name))
+			break;
+		s = s->next;
+	}
+	if (!s) {
+		s = calloc(1, sizeof(*s));
+		if (!s)
+			return -1;
+		s->symbol.name = strdup(name);
+		if (!s->symbol.name)
+			goto error;
+		if (strchr(name, '*') || strchr(name, '?'))
+			s->match = MATH_WILDCARD;
+
+		s->next = obj->sym;
+		obj->sym = s;
+		obj->sym_count++;
+	}
+
+	return 0;
+
+error:
+	if (s) {
+		free(s->symbol.name);
+		free(s);
+	}
+	return -1;
+}
+
+int trace_debug_add_resolve_symbol(struct trace_debug_object *obj,
+				   unsigned long long vma, char *name)
+{
+	int ret = -1;
+
+	if (!obj)
+		return -1;
+
+	if (!name && vma) /* vma -> name resolving */
+		ret = add_resolve_vma2name(obj, vma);
+	else if (name) /* name -> vma resolving */
+		ret = add_resolve_name2vma(obj, name);
+	return ret;
+}
+
+static int walk_symbols(struct debug_symbols *sym,
+			int (*callback)(struct tracecmd_debug_symbols *, void *),
+			void *context)
+{
+	while (sym) {
+		if (callback(&sym->symbol, context))
+			return -1;
+		sym = sym->next;
+	}
+
+	return 0;
+}
+
+void trace_debug_walk_resolved_symbols(struct trace_debug_object *obj,
+				       int (*callback)(struct tracecmd_debug_symbols *, void *),
+				       void *context)
+{
+	struct trace_debug_file *file;
+
+	walk_symbols(obj->sym, callback, context);
+	file = obj->files;
+	while (file) {
+		walk_symbols(file->sym, callback, context);
+		file = file->next;
+	}
+}
+
+
+void trace_debug_free_symbols(struct tracecmd_debug_symbols *symbols, int count)
+{
+	int i;
+
+	if (!symbols)
+		return;
+
+	for (i = 0; i < count; i++) {
+		free(symbols[i].name);
+		free(symbols[i].fname);
+	}
+	free(symbols);
+
+}
+#else
+int trace_debug_resolve_symbols(struct trace_debug_object *obj)
+{
+	return -1;
+}
+
+int trace_debug_add_resolve_symbol(struct trace_debug_object *obj,
+				   unsigned long long vma, char *name)
+{
+	return -1;
+}
+
+void trace_debug_walk_resolved_symbols(struct trace_debug_object *obj,
+				       int (*callback)(struct tracecmd_debug_symbols *, void *),
+				       void *context)
+{
+
+}
+
+void trace_debug_free_symbols(struct tracecmd_debug_symbols *symbols, int count)
+{
+
+}
+
+void trace_debug_obj_destroy(struct trace_debug_object *debug)
+{
+
+}
+
+struct trace_debug_object *trace_debug_obj_create_file(char *file, bool libs)
+{
+	return NULL;
+}
+struct trace_debug_object *trace_debug_obj_create_pid(int pid)
+{
+	return NULL;
+}
+
+#endif
 
 #define _STRINGIFY(x) #x
 #define STRINGIFY(x) _STRINGIFY(x)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/5] trace-cmd: Add a new option in trace.dat file for the address to function name mapping
  2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
                   ` (2 preceding siblings ...)
  2020-12-03  6:02 ` [PATCH 3/5] trace-cmd: New internal APIs for reading ELF header Tzvetomir Stoyanov (VMware)
@ 2020-12-03  6:02 ` Tzvetomir Stoyanov (VMware)
  2020-12-03  6:02 ` [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf" Tzvetomir Stoyanov (VMware)
  4 siblings, 0 replies; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

There are cases where resolving a user space address recorded in the
trace file to function name is required, when this file is visualised.
This mapping is available during the trace. A new trace.dat file option
TRACECMD_OPTION_PID_SYMBOLS is added, for storing the mapping in the
file. A new API is introduced, to resolve address to name, using
information stored in the TRACECMD_OPTION_PID_SYMBOLS option:

 struct tracecmd_debug_symbols *
 tracecmd_search_task_symbol(struct tracecmd_input *handle,
			     int pid, unsigned long long addr);

Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
---
 .../include/private/trace-cmd-private.h       |   5 +
 lib/trace-cmd/trace-input.c                   | 144 ++++++++++++++++++
 tracecmd/trace-dump.c                         |   3 +
 3 files changed, 152 insertions(+)

diff --git a/lib/trace-cmd/include/private/trace-cmd-private.h b/lib/trace-cmd/include/private/trace-cmd-private.h
index a0dac5da..8c99faa3 100644
--- a/lib/trace-cmd/include/private/trace-cmd-private.h
+++ b/lib/trace-cmd/include/private/trace-cmd-private.h
@@ -112,6 +112,7 @@ enum {
 	TRACECMD_OPTION_TRACEID,
 	TRACECMD_OPTION_TIME_SHIFT,
 	TRACECMD_OPTION_GUEST,
+	TRACECMD_OPTION_PID_SYMBOLS,
 };
 
 enum {
@@ -231,6 +232,10 @@ unsigned int tracecmd_record_ts_delta(struct tracecmd_input *handle,
 struct tracecmd_proc_addr_map *
 tracecmd_search_task_map(struct tracecmd_input *handle,
 			 int pid, unsigned long long addr);
+
+struct tracecmd_debug_symbols *
+tracecmd_search_task_symbol(struct tracecmd_input *handle,
+			    int pid, unsigned long long addr);
 #ifndef SWIG
 /* hack for function graph work around */
 extern __thread struct tracecmd_input *tracecmd_curr_thread_handle;
diff --git a/lib/trace-cmd/trace-input.c b/lib/trace-cmd/trace-input.c
index af97883e..39da0cb8 100644
--- a/lib/trace-cmd/trace-input.c
+++ b/lib/trace-cmd/trace-input.c
@@ -19,6 +19,8 @@
 #include "trace-local.h"
 #include "kbuffer.h"
 #include "list.h"
+#include "trace-hash.h"
+#include "trace-hash-local.h"
 
 #define _STRINGIFY(x) #x
 #define STRINGIFY(x) _STRINGIFY(x)
@@ -98,6 +100,18 @@ struct host_trace_info {
 	struct ts_offset_sample	*ts_samples;
 };
 
+struct pid_function_map {
+	struct trace_hash_item hash;
+	struct tracecmd_debug_symbols symb;
+};
+
+struct pid_symbol_maps {
+	struct pid_symbol_maps *next;
+	int pid;
+	int symb_count;
+	struct trace_hash symbols;
+};
+
 struct tracecmd_input {
 	struct tep_handle	*pevent;
 	struct tep_plugin_list	*plugin_list;
@@ -130,6 +144,7 @@ struct tracecmd_input {
 
 	struct hook_list	*hooks;
 	struct pid_addr_maps	*pid_maps;
+	struct pid_symbol_maps	*pid_symbols;
 	/* file information */
 	size_t			header_files_start;
 	size_t			ftrace_files_start;
@@ -2430,6 +2445,92 @@ static int trace_pid_map_search(const void *a, const void *b)
 	return 0;
 }
 
+static void trace_pid_symbols_free(struct pid_symbol_maps *maps)
+{
+	struct pid_symbol_maps *del;
+	struct pid_function_map *fmap;
+	struct trace_hash_item **bucket;
+	struct trace_hash_item *item;
+
+	while (maps) {
+		del = maps;
+		maps = maps->next;
+		trace_hash_for_each_bucket(bucket, &del->symbols) {
+			trace_hash_while_item(item, bucket) {
+				trace_hash_del(item);
+				fmap = (struct pid_function_map *)item;
+				free(fmap->symb.name);
+				free(fmap->symb.fname);
+				free(fmap);
+			}
+		}
+		trace_hash_free(&del->symbols);
+	}
+}
+
+#define MAX_FUNC_NAME	256
+static int trace_pid_symbols_load(struct tracecmd_input *handle, char *buf)
+{
+	struct pid_symbol_maps *maps = NULL;
+	struct pid_function_map *s;
+	char file[PATH_MAX];
+	char fname[MAX_FUNC_NAME];
+	char *line;
+	int res;
+	int ret;
+	int i;
+
+	maps = calloc(1, sizeof(*maps));
+	if (!maps)
+		return -ENOMEM;
+
+	ret  = -EINVAL;
+	line = strchr(buf, '\n');
+	if (!line)
+		goto out_fail;
+
+	*line = '\0';
+	res = sscanf(buf, "%x %d", &maps->pid, &maps->symb_count);
+	if (res != 2)
+		goto out_fail;
+
+	ret  = -ENOMEM;
+	if (trace_hash_init(&maps->symbols, 64) < 0)
+		goto out_fail;
+
+	buf = line + 1;
+	line = strchr(buf, '\n');
+	for (i = 0; i < maps->symb_count; i++) {
+		if (!line)
+			break;
+		*line = '\0';
+		s = calloc(1, sizeof(*s));
+		if (!s)
+			goto out_fail;
+		res = sscanf(buf, "%"STRINGIFY(PATH_MAX)"s %"STRINGIFY(MAX_FUNC_NAME)"s %llx %llx",
+				file, fname, &s->symb.vma_start, &s->symb.vma_near);
+		if (res != 4)
+			break;
+		s->symb.fname = strdup(file);
+		s->symb.name = strdup(fname);
+		if (!s->symb.fname || !s->symb.name)
+			goto out_fail;
+		s->hash.key = trace_hash(s->symb.vma_near);
+		trace_hash_add(&maps->symbols, (struct trace_hash_item *)s);
+		buf = line + 1;
+		line = strchr(buf, '\n');
+	}
+
+	maps->next = handle->pid_symbols;
+	handle->pid_symbols = maps;
+
+	return 0;
+
+out_fail:
+	trace_pid_symbols_free(maps);
+	return ret;
+}
+
 /**
  * tracecmd_search_task_map - Search task memory address map
  * @handle: input handle to the trace.dat file
@@ -2469,6 +2570,43 @@ tracecmd_search_task_map(struct tracecmd_input *handle,
 	return lib;
 }
 
+/**
+ * tracecmd_search_task_symbol - Resolve address to function name
+ * @handle: input handle to the trace.dat file
+ * @pid: pid of the task, which function information is stored in the trace.dat
+ * @addr: address of the function
+ *
+ * Mapping of some functions to addresses of traced PIDs can be saved in the
+ * trace.dat file, using the "perf" sub command. If there is such information,
+ * this API can be used to look up into this function maps to find the name of
+ * the function and the name of the file where that function is loaded.
+ *
+ * A pointer to struct tracecmd_debug_symbols is returned, containing information
+ * about the reuquested @addr: the name of the function, its start address, the
+ * name of the binary file that contains the function.
+ */
+struct tracecmd_debug_symbols *
+tracecmd_search_task_symbol(struct tracecmd_input *handle,
+			    int pid, unsigned long long addr)
+{
+	struct pid_function_map *symb = NULL;
+	struct pid_symbol_maps	*map;
+
+	for (map = handle->pid_symbols; map; map = map->next) {
+		if (pid == map->pid)
+			break;
+	}
+
+	if (map)
+		symb = (struct pid_function_map *)trace_hash_find(&map->symbols,
+								  trace_hash(addr), NULL, NULL);
+
+	if (symb)
+		return &symb->symb;
+
+	return NULL;
+}
+
 static int handle_options(struct tracecmd_input *handle)
 {
 	long long offset;
@@ -2610,6 +2748,10 @@ static int handle_options(struct tracecmd_input *handle)
 		case TRACECMD_OPTION_GUEST:
 			trace_guest_load(handle, buf, size);
 			break;
+		case TRACECMD_OPTION_PID_SYMBOLS:
+			if (buf[size-1] == '\0')
+				trace_pid_symbols_load(handle, buf);
+			break;
 		default:
 			warning("unknown option %d", option);
 			break;
@@ -3352,6 +3494,8 @@ void tracecmd_close(struct tracecmd_input *handle)
 
 	trace_pid_map_free(handle->pid_maps);
 	handle->pid_maps = NULL;
+	trace_pid_symbols_free(handle->pid_symbols);
+	handle->pid_symbols = NULL;
 
 	trace_tsync_offset_free(&handle->host);
 	trace_guests_free(handle);
diff --git a/tracecmd/trace-dump.c b/tracecmd/trace-dump.c
index c55e2389..a3e7acf7 100644
--- a/tracecmd/trace-dump.c
+++ b/tracecmd/trace-dump.c
@@ -530,6 +530,9 @@ static void dump_options(int fd)
 		case TRACECMD_OPTION_GUEST:
 			dump_option_guest(fd, size);
 			break;
+		case TRACECMD_OPTION_PID_SYMBOLS:
+			dump_option_string(fd, size, "PIDSYMBOLS");
+			break;
 		default:
 			do_print(OPTIONS, " %d %d\t[Unknown option, size - skipping]\n",
 				 option, size);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
                   ` (3 preceding siblings ...)
  2020-12-03  6:02 ` [PATCH 4/5] trace-cmd: Add a new option in trace.dat file for the address to function name mapping Tzvetomir Stoyanov (VMware)
@ 2020-12-03  6:02 ` Tzvetomir Stoyanov (VMware)
  2021-02-19  2:03   ` Steven Rostedt
  4 siblings, 1 reply; 13+ messages in thread
From: Tzvetomir Stoyanov (VMware) @ 2020-12-03  6:02 UTC (permalink / raw)
  To: rostedt; +Cc: linux-trace-devel

The new sub command uses perf to collect performance information of the
selected process. It has one mandatory argument:
trace-cmd perf --pid <PID>

Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
---
 tracecmd/Makefile              |   1 +
 tracecmd/include/trace-local.h |   2 +
 tracecmd/trace-cmd.c           |   1 +
 tracecmd/trace-perf.c          | 726 +++++++++++++++++++++++++++++++++
 4 files changed, 730 insertions(+)
 create mode 100644 tracecmd/trace-perf.c

diff --git a/tracecmd/Makefile b/tracecmd/Makefile
index 5a16a5e5..3fda6de8 100644
--- a/tracecmd/Makefile
+++ b/tracecmd/Makefile
@@ -33,6 +33,7 @@ TRACE_CMD_OBJS += trace-usage.o
 TRACE_CMD_OBJS += trace-dump.o
 TRACE_CMD_OBJS += trace-clear.o
 TRACE_CMD_OBJS += trace-obj-debug.o
+TRACE_CMD_OBJS += trace-perf.o
 ifeq ($(VSOCK_DEFINED), 1)
 TRACE_CMD_OBJS += trace-tsync.o
 endif
diff --git a/tracecmd/include/trace-local.h b/tracecmd/include/trace-local.h
index cd2d34a4..b0d03900 100644
--- a/tracecmd/include/trace-local.h
+++ b/tracecmd/include/trace-local.h
@@ -102,6 +102,8 @@ void trace_usage(int argc, char **argv);
 
 void trace_dump(int argc, char **argv);
 
+void trace_perf(int argc, char **argv);
+
 int trace_record_agent(struct tracecmd_msg_handle *msg_handle,
 		       int cpus, int *fds,
 		       int argc, char **argv, bool use_fifos,
diff --git a/tracecmd/trace-cmd.c b/tracecmd/trace-cmd.c
index 7376c5a5..0c5b324f 100644
--- a/tracecmd/trace-cmd.c
+++ b/tracecmd/trace-cmd.c
@@ -104,6 +104,7 @@ struct command commands[] = {
 	{"list", trace_list},
 	{"help", trace_usage},
 	{"dump", trace_dump},
+	{"perf", trace_perf},
 	{"-h", trace_usage},
 };
 
diff --git a/tracecmd/trace-perf.c b/tracecmd/trace-perf.c
new file mode 100644
index 00000000..9b9f3519
--- /dev/null
+++ b/tracecmd/trace-perf.c
@@ -0,0 +1,726 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Copyright (C) 2020, VMware, Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
+ *
+ */
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <sys/ioctl.h>
+#include <pthread.h>
+#include <linux/perf_event.h>
+#include <asm/unistd.h>
+#include <sys/mman.h>
+#include <signal.h>
+
+#include "trace-local.h"
+#include "trace-cmd-local.h"
+
+#define SLEEP_USEC		1000
+#define SAMPLE_PER_SECOND	8000
+
+static volatile bool done;
+
+struct perf_stat {
+	__u64 count;
+	__u64 time_enabled;
+	__u64 time_running;
+	__u64 id;
+};
+
+struct perf_split_event {
+	int size;
+	int part_size;
+	char *buf;
+};
+
+struct perf_cpu {
+	int cpu;
+	int pid;
+	unsigned int prev_pos;
+	int perf_fd;
+	int sample_file_fd;
+	int mmap_file_fd;
+	char *sample_file;
+	char *mmap_file;
+	pthread_t tid;
+	struct perf_split_event split;
+	struct perf_event_mmap_page *mpage;
+};
+
+struct perf_session {
+	int pid;
+	int cpu_count;
+	struct perf_cpu *cpus;
+	struct trace_debug_object *fdebug;
+};
+
+static long perf_event_open(struct perf_event_attr *event, pid_t pid,
+			    int cpu, int group_fd, unsigned long flags)
+{
+	return syscall(__NR_perf_event_open, event, pid, cpu, group_fd, flags);
+}
+
+/*
+ * struct {
+ *	struct perf_event_header	header;
+ *
+ *	#
+ *	# Note that PERF_SAMPLE_IDENTIFIER duplicates PERF_SAMPLE_ID.
+ *	# The advantage of PERF_SAMPLE_IDENTIFIER is that its position
+ *	# is fixed relative to header.
+ *	#
+ *
+ *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
+ *	{ u64			ip;	  } && PERF_SAMPLE_IP
+ *	{ u32			pid, tid; } && PERF_SAMPLE_TID
+ *	{ u64			time;     } && PERF_SAMPLE_TIME
+ *	{ u64			addr;     } && PERF_SAMPLE_ADDR
+ *	{ u64			id;	  } && PERF_SAMPLE_ID
+ *	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
+ *	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+ *	{ u64			period;   } && PERF_SAMPLE_PERIOD
+ *
+ *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
+ *
+ *	{ u64			nr,
+ *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
+ *
+ *	#
+ *	# The RAW record below is opaque data wrt the ABI
+ *	#
+ *	# That is, the ABI doesn't make any promises wrt to
+ *	# the stability of its content, it may vary depending
+ *	# on event, hardware, kernel version and phase of
+ *	# the moon.
+ *	#
+ *	# In other words, PERF_SAMPLE_RAW contents are not an ABI.
+ *	#
+ *
+ *	{ u32			size;
+ *	  char                  data[size];}&& PERF_SAMPLE_RAW
+ *
+ *	{ u64                   nr;
+ *	  { u64	hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
+ *        { u64 from, to, flags } lbr[nr];
+ *      } && PERF_SAMPLE_BRANCH_STACK
+ *
+ *	{ u64			abi; # enum perf_sample_regs_abi
+ *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+ *
+ *	{ u64			size;
+ *	  char			data[size];
+ *	  u64			dyn_size; } && PERF_SAMPLE_STACK_USER
+ *
+ *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
+ *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
+ *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+ *	{ u64			abi; # enum perf_sample_regs_abi
+ *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+ *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+ *	{ u64			size;
+ *	  char			data[size]; } && PERF_SAMPLE_AUX
+ * };
+ */
+
+struct perf_event_sample {
+	struct perf_event_header head;
+	uint64_t id;		/* PERF_SAMPLE_IDENTIFIER */
+	uint32_t pid, tid;	/* PERF_SAMPLE_TID */
+	uint64_t time;		/* PERF_SAMPLE_TIME */
+	uint32_t cpu, res;	/* PERF_SAMPLE_CPU */
+	uint64_t period;	/* PERF_SAMPLE_PERIOD */
+	uint64_t nr;
+	uint64_t ips[];		/* PERF_SAMPLE_CALLCHAIN */
+} __attribute__((packed));
+
+struct perf_event_mmap {
+	struct perf_event_header	header;
+
+	uint32_t			pid, tid;
+	uint64_t			addr;
+	uint64_t			len;
+	uint64_t			pgoff;
+	char				filename[];
+} __attribute__((packed));
+
+static int handle_event_sample(struct perf_event_sample *sample, struct perf_cpu *cpu)
+{
+	int i;
+
+	/* check if sample is for our PID */
+	if (sample->pid != cpu->pid)
+		return 0;
+
+	if (write(cpu->sample_file_fd, &sample->time, sizeof(uint64_t)) < 0)
+		return -1;
+	if (write(cpu->sample_file_fd, &sample->nr, sizeof(uint64_t)) < 0)
+		return -1;
+	for (i = 0; i < sample->nr; i++) {
+		if (write(cpu->sample_file_fd, &sample->ips[i], sizeof(uint64_t)) < 0)
+			return -1;
+	}
+
+	return 1;
+}
+
+static int handle_event_mmap(struct perf_event_mmap *event, struct perf_cpu *cpu)
+{
+	/* check if mmap is for our PID */
+	if (event->pid != cpu->pid)
+		return 0;
+	/* check if executable memory is mapped */
+	if (event->header.misc & PERF_RECORD_MISC_MMAP_DATA)
+		return 0;
+
+	if (write(cpu->mmap_file_fd, &event->addr, sizeof(uint64_t)) < 0)
+		return -1;
+	if (write(cpu->mmap_file_fd, &event->len, sizeof(uint64_t)) < 0)
+		return -1;
+	if (write(cpu->mmap_file_fd, &event->pgoff, sizeof(uint64_t)) < 0)
+		return -1;
+	if (write(cpu->mmap_file_fd, event->filename, strlen(event->filename) + 1) < 0)
+		return -1;
+
+	return 1;
+}
+
+#define NUM_PAGES 8
+static unsigned int mmap_mask;
+
+static int handle_event(struct perf_event_header *event, struct perf_cpu *cpu)
+{
+	int ret;
+
+	if (event->size == 0)
+		return -1;
+
+	switch (event->type) {
+	case PERF_RECORD_SAMPLE: /* new callstack sample */
+		ret = handle_event_sample((struct perf_event_sample *)event, cpu);
+		break;
+	case PERF_RECORD_MMAP: /* new memory mapped, may be a library is loaded ? */
+		ret = handle_event_mmap((struct perf_event_mmap *)event, cpu);
+		break;
+	default:
+		/* Got unknown perf event */
+		ret = 0;
+		break;
+	}
+	return ret;
+}
+
+static int perf_event_read_samples(struct perf_cpu *cpu, unsigned char *buf, int size)
+{
+	unsigned char *end = buf + size;
+	struct perf_event_header *h;
+	int csize = size;
+	int ret;
+
+	if (cpu->split.buf) {
+		/* partial read, check if there is enough data to complete it */
+		if ((cpu->split.size - cpu->split.part_size) > size)
+			return -1;
+		memcpy(cpu->split.buf + cpu->split.part_size,
+		       buf, cpu->split.size - cpu->split.part_size);
+		ret = handle_event((struct perf_event_header *)cpu->split.buf, cpu);
+		buf += (cpu->split.size - cpu->split.part_size);
+		free(cpu->split.buf);
+		memset(&cpu->split, 0, sizeof(struct perf_split_event));
+		if (ret < 0)
+			return -1; /* Failed to process the partial event */
+	}
+
+	while (buf < end) {
+		h = (struct perf_event_header *)buf;
+		if (h->size == 0)
+			return -1;
+		if (h->size > csize) {
+			/* part of the event is in the buffer, do partial read */
+			if (cpu->split.buf != NULL)
+				return -1; /* Previous partial read pending */
+			cpu->split.buf = calloc(1, h->size);
+			if (!cpu->split.buf)
+				return -1;
+			cpu->split.size = h->size;
+			cpu->split.part_size = csize;
+			memcpy(cpu->split.buf, buf, csize);
+			return csize;
+		}
+		if (handle_event(h, cpu) < 0)
+			return -1; /* Failed to process the event */
+		csize -= h->size;
+		buf += h->size;
+	}
+
+	return 0;
+}
+
+static int perf_event_read(struct perf_cpu *cpu)
+{
+	unsigned char *buf;
+	unsigned char *data;
+	unsigned int head;
+	int diff, size;
+	unsigned int old;
+
+	data = (unsigned char *)cpu->mpage + getpagesize();
+
+	head = cpu->mpage->data_head;
+	/* On SMP-capable platforms, after reading the data_head value,
+	 * user space should issue an rmb().
+	 */
+	__sync_synchronize();
+
+	old = cpu->prev_pos;
+	diff = head - old;
+	if (diff < 0)
+		old = head; /* buffer overrun, events lost */
+
+	size = head - old;
+	if ((old & mmap_mask) + size != (head & mmap_mask)) {
+		buf = data + (old & mmap_mask);
+		size = mmap_mask + 1 - (old & mmap_mask);
+		old += size;
+		perf_event_read_samples(cpu, buf, size);
+	}
+	buf = data + (old & mmap_mask);
+	size = head - old;
+	if (size)
+		perf_event_read_samples(cpu, buf, size);
+
+	old += size;
+	cpu->prev_pos = old;
+	/* reset the buffer */
+	cpu->mpage->data_tail = old;
+
+	return size;
+}
+
+static void perf_init_pe(struct perf_event_attr *pe)
+{
+	memset(pe, 0, sizeof(struct perf_event_attr));
+	pe->type = PERF_TYPE_HARDWARE;
+	pe->sample_type = PERF_SAMPLE_CALLCHAIN |
+			  PERF_SAMPLE_IDENTIFIER |
+			  PERF_SAMPLE_TID |
+			  PERF_SAMPLE_TIME |
+			  PERF_SAMPLE_CPU |
+			  PERF_SAMPLE_PERIOD;
+	pe->size = sizeof(struct perf_event_attr);
+	pe->config = PERF_COUNT_HW_CPU_CYCLES;
+	pe->disabled = 1;
+	pe->exclude_kernel = 1;
+	pe->freq = 1;
+	pe->sample_freq = SAMPLE_PER_SECOND;
+	pe->inherit = 1;
+	pe->mmap = 1;
+	pe->comm = 1;
+	pe->task = 1;
+	pe->precise_ip = 1;
+	pe->sample_id_all = 1;
+//	pe->mmap2 = 1;
+	pe->comm_exec = 1;
+	pe->ksymbol = 1;
+	pe->bpf_event = 1;
+	pe->read_format = PERF_FORMAT_ID |
+			PERF_FORMAT_TOTAL_TIME_ENABLED|
+			PERF_FORMAT_TOTAL_TIME_RUNNING;
+
+}
+
+static int perf_enable(struct perf_cpu *perf, bool enable)
+{
+	int ret;
+
+	if (enable) {
+		ret = ioctl(perf->perf_fd, PERF_EVENT_IOC_RESET, 0);
+		ret = ioctl(perf->perf_fd, PERF_EVENT_IOC_ENABLE, 0);
+	} else
+		ret = ioctl(perf->perf_fd, PERF_EVENT_IOC_DISABLE, 0);
+
+	return ret;
+}
+
+static int perf_mmap(struct perf_cpu *perf)
+{
+	mmap_mask = NUM_PAGES * getpagesize() - 1;
+
+	/* associate a buffer with the file */
+	perf->mpage = mmap(NULL, (NUM_PAGES + 1) * getpagesize(),
+			PROT_READ | PROT_WRITE, MAP_SHARED, perf->perf_fd, 0);
+	if (perf->mpage == MAP_FAILED)
+		return -1;
+	return 0;
+}
+
+static void perf_session_destroy(struct perf_session *perf)
+{
+	int i;
+
+	if (!perf)
+		return;
+
+	if (perf->cpu_count && perf->cpus) {
+		for (i = 0; i < perf->cpu_count; i++) {
+			if (perf->cpus[i].perf_fd >= 0)
+				close(perf->cpus[i].perf_fd);
+			if (perf->cpus[i].sample_file_fd >= 0)
+				close(perf->cpus[i].sample_file_fd);
+			if (perf->cpus[i].mmap_file_fd >= 0)
+				close(perf->cpus[i].mmap_file_fd);
+			if (perf->cpus[i].mpage)
+				munmap(perf->cpus[i].mpage,
+				       (NUM_PAGES + 1) * getpagesize());
+			if (perf->cpus[i].sample_file) {
+				remove(perf->cpus[i].sample_file);
+				free(perf->cpus[i].sample_file);
+			}
+			if (perf->cpus[i].mmap_file) {
+				remove(perf->cpus[i].mmap_file);
+				free(perf->cpus[i].mmap_file);
+			}
+			free(perf->cpus[i].split.buf);
+		}
+		free(perf->cpus);
+	}
+
+	if (perf->fdebug)
+		trace_debug_obj_destroy(perf->fdebug);
+
+	free(perf);
+}
+
+#define TMP_FILE "/tmp/perf_temp_data.XXXXXX"
+static struct perf_session *perf_session_new(int pid)
+{
+	int cpus = tracecmd_count_cpus();
+	struct perf_event_attr pe;
+	struct perf_session *perf;
+	char *template;
+	int i;
+
+	perf = calloc(cpus, sizeof(struct perf_session));
+	if (!perf)
+		return NULL;
+	template = strdup(TMP_FILE);
+	if (!template)
+		goto error;
+	perf->fdebug = trace_debug_obj_create_pid(pid);
+	if (!perf->fdebug)
+		goto error;
+	perf->cpus = calloc(cpus, sizeof(struct perf_cpu));
+	if (!cpus)
+		goto error;
+	for (i = 0; i < cpus; i++) {
+		perf->cpus[i].perf_fd = -1;
+		perf->cpus[i].sample_file_fd = -1;
+		perf->cpus[i].mmap_file_fd = -1;
+	}
+
+	perf_init_pe(&pe);
+	for (i = 0; i < cpus; i++) {
+		perf->cpus[i].perf_fd = perf_event_open(&pe, pid, i, -1, 0);
+		if (perf->cpus[i].perf_fd < 0)
+			goto error;
+		fcntl(perf->cpus[i].perf_fd, F_SETFL, O_NONBLOCK);
+		perf->cpus[i].cpu = i;
+		if (perf_mmap(&perf->cpus[i]) < 0)
+			goto error;
+		strcpy(template, TMP_FILE);
+		mktemp(template);
+		perf->cpus[i].sample_file = strdup(template);
+		strcpy(template, TMP_FILE);
+		mktemp(template);
+		perf->cpus[i].mmap_file = strdup(template);
+		perf->cpus[i].pid = pid;
+	}
+	perf->cpu_count = cpus;
+	perf->pid = pid;
+	free(template);
+	return perf;
+error:
+	free(template);
+	perf_session_destroy(perf);
+	return NULL;
+}
+
+void *perf_read_thread(void *context)
+{
+	struct perf_cpu *cpu = (struct perf_cpu *)context;
+	struct timespec swait;
+	sigset_t sig_set;
+	int ret;
+
+	swait.tv_sec = SLEEP_USEC / 1000000,
+	swait.tv_nsec = (SLEEP_USEC % 1000000) * 1000,
+
+	sigemptyset(&sig_set);
+	sigaddset(&sig_set, SIGINT);
+	sigaddset(&sig_set, SIGTERM);
+	pthread_sigmask(SIG_BLOCK, &sig_set, NULL);
+
+	cpu->sample_file_fd = open(cpu->sample_file, O_WRONLY|O_CREAT|O_TRUNC, 0600);
+	if (cpu->sample_file_fd < 0)
+		return NULL;
+	cpu->mmap_file_fd = open(cpu->mmap_file, O_WRONLY|O_CREAT|O_TRUNC, 0600);
+	if (cpu->mmap_file_fd < 0)
+		return NULL;
+
+	perf_enable(cpu, true);
+
+	while (1) {
+		ret = perf_event_read(cpu);
+		if (!ret) /* sleep if the buffer was empty */
+			nanosleep(&swait, NULL);
+	}
+
+	return NULL;
+}
+
+#define PERF_READER_SCHEDULER SCHED_RR
+static int perf_run_readers(struct perf_session *perf)
+{
+	struct sched_param sched;
+	pthread_attr_t attr;
+	int i;
+
+	sched.sched_priority = sched_get_priority_max(PERF_READER_SCHEDULER);
+	pthread_attr_init(&attr);
+	pthread_attr_setschedpolicy(&attr, PERF_READER_SCHEDULER);
+	pthread_attr_setschedparam(&attr, &sched);
+	pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
+
+	for (i = 0; i < perf->cpu_count; i++) {
+		if (pthread_create(&perf->cpus[i].tid, &attr,
+				   perf_read_thread, &perf->cpus[i]))
+			return -1;
+	}
+
+	return 0;
+}
+
+static int perf_stop_readers(struct perf_session *perf)
+{
+	int i;
+
+	for (i = 0; i < perf->cpu_count; i++)
+		pthread_cancel(perf->cpus[i].tid);
+
+	for (i = 0; i < perf->cpu_count; i++)
+		pthread_join(perf->cpus[i].tid, NULL);
+
+	return 0;
+}
+
+static void perf_collect_results(struct perf_session *perf, int cpu)
+{
+	unsigned long long addr, len, pgoff;
+	unsigned long long time, count, ip;
+	char file[PATH_MAX];
+	int sfd, mfd, i;
+	int sall = 0;
+	int mall = 0;
+	int ret;
+
+	if (!perf->cpus[cpu].mmap_file || !perf->cpus[cpu].sample_file)
+		return;
+	mfd = open(perf->cpus[cpu].mmap_file, O_RDONLY, 0600);
+	if (mfd < 0)
+		return;
+	sfd = open(perf->cpus[cpu].sample_file, O_RDONLY, 0600);
+	if (sfd < 0)
+		return;
+	do {
+		/* Loop through all dynamically loaded libraries during the trace
+		 * and attach them to the debug object, to be used for symbols resolve.
+		 */
+		ret = read(mfd, &addr, sizeof(addr));
+		if (ret != sizeof(addr))
+			break;
+		ret = read(mfd, &len, sizeof(len));
+		if (ret != sizeof(len))
+			break;
+		ret = read(mfd, &pgoff, sizeof(pgoff));
+		if (ret != sizeof(pgoff))
+			break;
+		ret = trace_read_file_string(mfd, file, PATH_MAX);
+		if (ret < 0)
+			break;
+		trace_debug_obj_add_file(perf->fdebug, file, addr, addr + len, pgoff);
+		mall++;
+	} while (1);
+
+	do {
+		/* Loop through all collected calltraces and add addresses to
+		 * the debug object for function name resolving.
+		 */
+		ret = read(sfd, &time, sizeof(time));
+		if (ret != sizeof(time))
+			break;
+
+		ret = read(sfd, &count, sizeof(count));
+		if (ret != sizeof(count))
+			break;
+
+		for (i = 0; i < count; i++) {
+			ret = read(sfd, &ip, sizeof(ip));
+			if (ret != sizeof(ip))
+				break;
+			ret = trace_debug_add_resolve_symbol(perf->fdebug, ip, NULL);
+		}
+		sall++;
+		if (i < count)
+			break;
+	} while (1);
+	close(sfd);
+	close(mfd);
+
+	printf("Got %d samples and %d mmaps for CPU %d\n", sall, mall, cpu);
+}
+
+static void stop_perf(int sig)
+{
+	done = true;
+}
+
+static int perf_collect(struct perf_session *perf)
+{
+	int i;
+
+	signal(SIGINT, stop_perf);
+	signal(SIGTERM, stop_perf);
+
+	if (perf_run_readers(perf))
+		return -1;
+
+	printf("Collecting callstack of pid %d, hit Ctrl^C to stop recording\n",
+		perf->pid);
+
+	while (!done)
+		usleep(500000);
+
+	for (i = 0; i < perf->cpu_count; i++)
+		perf_enable(&perf->cpus[i], false);
+
+	perf_stop_readers(perf);
+
+	for (i = 0; i < perf->cpu_count; i++)
+		perf_collect_results(perf, i);
+
+	return 0;
+}
+
+struct symb_walk_context {
+	struct trace_seq *s;
+	int count;
+};
+
+static int perf_symbols_walk(struct tracecmd_debug_symbols *symb, void *data)
+{
+	struct symb_walk_context *context = (struct symb_walk_context *)data;
+
+	if (!symb->name)
+		printf("\t Failed to resolve %llx in %s\n",
+			symb->vma_near, symb->fname);
+
+	trace_seq_printf(context->s, "%s %s %llx %llx\n",
+			 symb->fname, symb->name,
+			 symb->vma_start, symb->vma_near);
+	context->count++;
+
+	return 0;
+}
+
+static void
+add_pid_symbols(struct tracecmd_output *handle, struct perf_session *perf)
+{
+	struct symb_walk_context context;
+	struct trace_seq body, head;
+
+	trace_seq_init(&body);
+	trace_seq_init(&head);
+	context.s = &body;
+	context.count = 0;
+	trace_debug_walk_resolved_symbols(perf->fdebug, perf_symbols_walk, &context);
+	trace_seq_printf(&head, "%x %d\n", perf->pid, context.count);
+	trace_seq_terminate(&head);
+	trace_seq_terminate(&body);
+	trace_seq_puts(&head, body.buffer);
+	if (handle && context.count)
+		tracecmd_add_option(handle, TRACECMD_OPTION_PID_SYMBOLS,
+				    head.len + 1, head.buffer);
+	else
+		trace_seq_do_printf(&head);
+	trace_seq_destroy(&body);
+}
+
+static int perf_resolve_trace(struct perf_session *perf)
+{
+	trace_debug_resolve_symbols(perf->fdebug);
+	add_pid_symbols(NULL, perf);
+
+	return 0;
+}
+
+int perf_run(int pid)
+{
+	struct perf_session *perf;
+
+	perf = perf_session_new(pid);
+	if (perf == NULL)
+		return -1;
+
+	perf_collect(perf);
+
+	perf_resolve_trace(perf);
+
+	perf_session_destroy(perf);
+	return 0;
+}
+
+enum {
+	OPT_pid	= 255,
+};
+
+void trace_perf(int argc, char **argv)
+{
+	int pid = 0;
+	int c;
+
+
+	if (strcmp(argv[1], "perf") != 0)
+		usage(argv);
+	for (;;) {
+		int option_index = 0;
+		static struct option long_options[] = {
+			{"pid", required_argument, NULL, OPT_pid},
+			{"help", no_argument, NULL, '?'},
+			{NULL, 0, NULL, 0}
+		};
+
+		c = getopt_long (argc-1, argv+1, "+hp:",
+				long_options, &option_index);
+		if (c == -1)
+			break;
+		switch (c) {
+
+		case OPT_pid:
+		case 'p':
+			pid = atoi(optarg);
+			break;
+		case 'h':
+		case '?':
+		default:
+			usage(argv);
+		}
+	}
+
+	if (perf_run(pid) < 0)
+		printf("Failed to run perf\n");
+}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2020-12-03  6:02 ` [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf" Tzvetomir Stoyanov (VMware)
@ 2021-02-19  2:03   ` Steven Rostedt
  2021-02-19  3:53     ` Steven Rostedt
  2021-02-19  7:16     ` Tzvetomir Stoyanov
  0 siblings, 2 replies; 13+ messages in thread
From: Steven Rostedt @ 2021-02-19  2:03 UTC (permalink / raw)
  To: Tzvetomir Stoyanov (VMware); +Cc: linux-trace-devel

On Thu,  3 Dec 2020 08:02:26 +0200
"Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> wrote:

> +static int perf_mmap(struct perf_cpu *perf)
> +{
> +	mmap_mask = NUM_PAGES * getpagesize() - 1;
> +
> +	/* associate a buffer with the file */
> +	perf->mpage = mmap(NULL, (NUM_PAGES + 1) * getpagesize(),
> +			PROT_READ | PROT_WRITE, MAP_SHARED, perf->perf_fd, 0);
> +	if (perf->mpage == MAP_FAILED)
> +		return -1;
> +	return 0;
> +}

BTW, I found that the above holds the conversions we need for the local
clock!

        printf("time_shift=%d\n", perf->mpage->time_shift);
        printf("time_mult=%d\n", perf->mpage->time_mult);
        printf("time_offset=%lld\n", perf->mpage->time_offset);

Which gives me:

time_shift=31
time_mult=633046315
time_offset=-115773323084683

[ one for each CPU ]

The ftrace local clock is defined by:

u64 notrace trace_clock_local(void)
{
	u64 clock;
	preempt_disable_notrace();
	clock = sched_clock();
	preempt_enable_notrace();
	return clock;
}

Where

u64 sched_clock(void)
{
	if (static_branch_likely(&__use_tsc)) { // true
		u64 tsc_now = rdtsc();

		/* return the value in ns */
		return cycles_2_ns(tsc_now);
	}

and

static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
	struct cyc2ns_data data;
	unsigned long long ns;

	cyc2ns_read_begin(&data); // <- this is where the data comes from

	ns = data.cyc2ns_offset;
	ns += mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);

	cyc2ns_read_end();

	return ns;
}

__always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
{
	int seq, idx;

	preempt_disable_notrace();

	do {
		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
		idx = seq & 1;

		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);

	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
}

The offset, multiplier and shift are from the cyc2ns.data[idx] (per
cpu) is what determines the conversion from x86 cycles to nanoseconds.

Does user space have access to that? Yes! Via perf!

void arch_perf_update_userpage(struct perf_event *event,
			       struct perf_event_mmap_page *userpg, u64 now)
{
[..]
	cyc2ns_read_begin(&data);

	offset = data.cyc2ns_offset + __sched_clock_offset;

	/*
	 * Internal timekeeping for enabled/running/stopped times
	 * is always in the local_clock domain.
	 */
	userpg->cap_user_time = 1;
	userpg->time_mult = data.cyc2ns_mul;
	userpg->time_shift = data.cyc2ns_shift;
	userpg->time_offset = offset - now;

Those above values are the ones I printed at the beginning of this
email.

Hence, we can use x86-tsc as the clock for both the host and guest, and
then using perf find out how to convert that to what the 'local' clock
would produce. At least the multiplier and the shfit.

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19  2:03   ` Steven Rostedt
@ 2021-02-19  3:53     ` Steven Rostedt
  2021-02-19 17:51       ` Dario Faggioli
  2021-02-19  7:16     ` Tzvetomir Stoyanov
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2021-02-19  3:53 UTC (permalink / raw)
  To: Tzvetomir Stoyanov (VMware); +Cc: linux-trace-devel

On Thu, 18 Feb 2021 21:03:52 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Hence, we can use x86-tsc as the clock for both the host and guest, and
> then using perf find out how to convert that to what the 'local' clock
> would produce. At least the multiplier and the shfit.

I just tried this out:

 http://rostedt.org/private/perf-trace.c

Where I did:

 # trace-cmd record -M 1 -C local -e irq -e sched taskset -c 0 ./perf-trace
time_shift=31
time_mult=633046315
time_offset=-125757047487632
 # trace-cmd report |grep print         taskset-34857 [000] 125742.260276: print:                tracing_mark_write: [426606557411358] [125757.283123254] - <idle>-0 [0] softirq_raise: vec=9 [action=RCU]
         taskset-34857 [000] 125742.260283: print:                tracing_mark_write: [426606557431758] [125757.283129268] - <idle>-0 [0] softirq_entry: vec=9 [action=RCU]
         taskset-34857 [000] 125742.260289: print:                tracing_mark_write: [426606557580560] [125757.283173133] - <idle>-0 [0] softirq_exit: vec=9 [action=RCU]
         taskset-34857 [000] 125742.260295: print:                tracing_mark_write: [426606560784758] [125757.284117683] - <idle>-0 [0] softirq_raise: vec=7 [action=SCHED]

The time offset is slightly different, but we don't care. We only want
to convert the cycles into nanoseconds, and this appears to do the job!

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19  2:03   ` Steven Rostedt
  2021-02-19  3:53     ` Steven Rostedt
@ 2021-02-19  7:16     ` Tzvetomir Stoyanov
  2021-02-19 14:36       ` Steven Rostedt
  1 sibling, 1 reply; 13+ messages in thread
From: Tzvetomir Stoyanov @ 2021-02-19  7:16 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Linux Trace Devel

Hi Steven,

On Fri, Feb 19, 2021 at 4:03 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Dec 2020 08:02:26 +0200
> "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> wrote:
>
> > +static int perf_mmap(struct perf_cpu *perf)
> > +{
> > +     mmap_mask = NUM_PAGES * getpagesize() - 1;
> > +
> > +     /* associate a buffer with the file */
> > +     perf->mpage = mmap(NULL, (NUM_PAGES + 1) * getpagesize(),
> > +                     PROT_READ | PROT_WRITE, MAP_SHARED, perf->perf_fd, 0);
> > +     if (perf->mpage == MAP_FAILED)
> > +             return -1;
> > +     return 0;
> > +}
>
> BTW, I found that the above holds the conversions we need for the local
> clock!
>
>         printf("time_shift=%d\n", perf->mpage->time_shift);
>         printf("time_mult=%d\n", perf->mpage->time_mult);
>         printf("time_offset=%lld\n", perf->mpage->time_offset);
>
> Which gives me:
>
> time_shift=31
> time_mult=633046315
> time_offset=-115773323084683
>
> [ one for each CPU ]

This will give us time shift/mult/offset for each host CPU, right ? Is
the local trace clock
different for each CPU ? Currently, the time offset is calculated per
VCPU, assuming
that the host CPU on which this VCPU runs has no impact on the
timestamp synchronization.
If the local clock depends on the CPU, then we should calculate the
time offset of each guest
event individually, depending on host CPU and VCPU the event happened
- as the host task which runs
the VCPU can migrate between CPUs at any time. So, we need to:
  1. Add timesync information for each host CPU in the trace.dat file.
  2. Track the migration between CPUs of each task that runs VCPU and
save that information
    in the trace.dat file.
  2. When calculating the new timestamp of each guest event
(individually) - somehow find out on
     which host CPU that guest event happened ?

Points 1 and 2 are doable, but will break the current trace.dat file
option that holds the timesync information.
Point 3 is not clear to me, how we can get such information before the
host and guest events are synchronised ?

>
> The ftrace local clock is defined by:
>
> u64 notrace trace_clock_local(void)
> {
>         u64 clock;
>         preempt_disable_notrace();
>         clock = sched_clock();
>         preempt_enable_notrace();
>         return clock;
> }
>
> Where
>
> u64 sched_clock(void)
> {
>         if (static_branch_likely(&__use_tsc)) { // true
>                 u64 tsc_now = rdtsc();
>
>                 /* return the value in ns */
>                 return cycles_2_ns(tsc_now);
>         }
>
> and
>
> static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
> {
>         struct cyc2ns_data data;
>         unsigned long long ns;
>
>         cyc2ns_read_begin(&data); // <- this is where the data comes from
>
>         ns = data.cyc2ns_offset;
>         ns += mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
>
>         cyc2ns_read_end();
>
>         return ns;
> }
>
> __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
> {
>         int seq, idx;
>
>         preempt_disable_notrace();
>
>         do {
>                 seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
>                 idx = seq & 1;
>
>                 data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>                 data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>                 data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
>         } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> The offset, multiplier and shift are from the cyc2ns.data[idx] (per
> cpu) is what determines the conversion from x86 cycles to nanoseconds.
>
> Does user space have access to that? Yes! Via perf!
>
> void arch_perf_update_userpage(struct perf_event *event,
>                                struct perf_event_mmap_page *userpg, u64 now)
> {
> [..]
>         cyc2ns_read_begin(&data);
>
>         offset = data.cyc2ns_offset + __sched_clock_offset;
>
>         /*
>          * Internal timekeeping for enabled/running/stopped times
>          * is always in the local_clock domain.
>          */
>         userpg->cap_user_time = 1;
>         userpg->time_mult = data.cyc2ns_mul;
>         userpg->time_shift = data.cyc2ns_shift;
>         userpg->time_offset = offset - now;
>
> Those above values are the ones I printed at the beginning of this
> email.
>
> Hence, we can use x86-tsc as the clock for both the host and guest, and
> then using perf find out how to convert that to what the 'local' clock
> would produce. At least the multiplier and the shfit.
>
> -- Steve



-- 
Tzvetomir (Ceco) Stoyanov
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19  7:16     ` Tzvetomir Stoyanov
@ 2021-02-19 14:36       ` Steven Rostedt
  2021-02-19 17:56         ` Tzvetomir Stoyanov
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2021-02-19 14:36 UTC (permalink / raw)
  To: Tzvetomir Stoyanov; +Cc: Linux Trace Devel

On Fri, 19 Feb 2021 09:16:26 +0200
Tzvetomir Stoyanov <tz.stoyanov@gmail.com> wrote:

> Hi Steven,
> 
> On Fri, Feb 19, 2021 at 4:03 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu,  3 Dec 2020 08:02:26 +0200
> > "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> wrote:
> >  
> > > +static int perf_mmap(struct perf_cpu *perf)
> > > +{
> > > +     mmap_mask = NUM_PAGES * getpagesize() - 1;
> > > +
> > > +     /* associate a buffer with the file */
> > > +     perf->mpage = mmap(NULL, (NUM_PAGES + 1) * getpagesize(),
> > > +                     PROT_READ | PROT_WRITE, MAP_SHARED, perf->perf_fd, 0);
> > > +     if (perf->mpage == MAP_FAILED)
> > > +             return -1;
> > > +     return 0;
> > > +}  
> >
> > BTW, I found that the above holds the conversions we need for the local
> > clock!
> >
> >         printf("time_shift=%d\n", perf->mpage->time_shift);
> >         printf("time_mult=%d\n", perf->mpage->time_mult);
> >         printf("time_offset=%lld\n", perf->mpage->time_offset);
> >
> > Which gives me:
> >
> > time_shift=31
> > time_mult=633046315
> > time_offset=-115773323084683
> >
> > [ one for each CPU ]  
> 
> This will give us time shift/mult/offset for each host CPU, right ? Is
> the local trace clock
> different for each CPU ? 

It can be. Note, the above offset is basically useless. That injects the
current time into the value and we can't rely on it. But the shift and mult
is needed.

But, usually, the shift and offset are identical on most systems across
CPUs, but there's no guarantee that it will always be the case.

>Currently, the time offset is calculated per
> VCPU, assuming
> that the host CPU on which this VCPU runs has no impact on the
> timestamp synchronization.
> If the local clock depends on the CPU, then we should calculate the
> time offset of each guest
> event individually, depending on host CPU and VCPU the event happened
> - as the host task which runs
> the VCPU can migrate between CPUs at any time. So, we need to:
>   1. Add timesync information for each host CPU in the trace.dat file.
>   2. Track the migration between CPUs of each task that runs VCPU and
> save that information
>     in the trace.dat file.

I was thinking about this too. And perhaps we can hold off until we find
systems that have different values for mult and shift.

That said, we can easily add this information by recording the sched_switch
events in a separate buffer. And I've been thinking about doing this by
default anyway. More below.

>   2. When calculating the new timestamp of each guest event
> (individually) - somehow find out on
>      which host CPU that guest event happened ?
> 
> Points 1 and 2 are doable, but will break the current trace.dat file
> option that holds the timesync information.

I don't think we need to have it in the timesync option. I think we can
create another option to hold guest event data.

> Point 3 is not clear to me, how we can get such information before the
> host and guest events are synchronised ?
> 


My thoughts about this is. When we enable tracing of a guest (-A), we then
create an instance on the host that records only kvm enter / exit events as
well as sched switch events. Basically, enable all the events that we need
to synchronize and show entering and exiting of the guest.

The synchronization logic already shows us what host thread controls each
guest VCPU. If we record the kvm enter/exit and sched_switch events in a
separate buffer, we can see when a host thread that runs a guest VCPU
migrates to another CPU. Since the timestamps of those events are recorded
in the meta events themselves (sched_switch), we know exactly where we need
to use the new mult and shift values for the guest events.

Make sense?

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19  3:53     ` Steven Rostedt
@ 2021-02-19 17:51       ` Dario Faggioli
  0 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2021-02-19 17:51 UTC (permalink / raw)
  To: Steven Rostedt, Tzvetomir Stoyanov (VMware); +Cc: linux-trace-devel

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]

Hey guys,

On Thu, 2021-02-18 at 22:53 -0500, Steven Rostedt wrote:
> On Thu, 18 Feb 2021 21:03:52 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Hence, we can use x86-tsc as the clock for both the host and guest,
> > and
> > then using perf find out how to convert that to what the 'local'
> > clock
> > would produce. At least the multiplier and the shfit.
> 
> I just tried this out:
> 
>  http://rostedt.org/private/perf-trace.c
> 
> Where I did:
> 
>  # trace-cmd record -M 1 -C local -e irq -e sched taskset -c 0
> ./perf-trace
> time_shift=31
> time_mult=633046315
> time_offset=-125757047487632
>
Let me ask something really really stupid: where was this run?

Host? Gust? Both?

This seems very interesting... But I definitely need to stare more at
your code (while scratching my head, at least for now :-P).

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19 14:36       ` Steven Rostedt
@ 2021-02-19 17:56         ` Tzvetomir Stoyanov
  2021-02-19 19:11           ` Steven Rostedt
  0 siblings, 1 reply; 13+ messages in thread
From: Tzvetomir Stoyanov @ 2021-02-19 17:56 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Linux Trace Devel

On Fri, Feb 19, 2021 at 4:36 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 19 Feb 2021 09:16:26 +0200
> Tzvetomir Stoyanov <tz.stoyanov@gmail.com> wrote:
>
> > Hi Steven,
> >
> > On Fri, Feb 19, 2021 at 4:03 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Thu,  3 Dec 2020 08:02:26 +0200
> > > "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> wrote:
> > >
> > > > +static int perf_mmap(struct perf_cpu *perf)
> > > > +{
> > > > +     mmap_mask = NUM_PAGES * getpagesize() - 1;
> > > > +
> > > > +     /* associate a buffer with the file */
> > > > +     perf->mpage = mmap(NULL, (NUM_PAGES + 1) * getpagesize(),
> > > > +                     PROT_READ | PROT_WRITE, MAP_SHARED, perf->perf_fd, 0);
> > > > +     if (perf->mpage == MAP_FAILED)
> > > > +             return -1;
> > > > +     return 0;
> > > > +}
> > >
> > > BTW, I found that the above holds the conversions we need for the local
> > > clock!
> > >
> > >         printf("time_shift=%d\n", perf->mpage->time_shift);
> > >         printf("time_mult=%d\n", perf->mpage->time_mult);
> > >         printf("time_offset=%lld\n", perf->mpage->time_offset);
> > >
> > > Which gives me:
> > >
> > > time_shift=31
> > > time_mult=633046315
> > > time_offset=-115773323084683
> > >
> > > [ one for each CPU ]
> >
> > This will give us time shift/mult/offset for each host CPU, right ? Is
> > the local trace clock
> > different for each CPU ?
>
> It can be. Note, the above offset is basically useless. That injects the
> current time into the value and we can't rely on it. But the shift and mult
> is needed.
>
> But, usually, the shift and offset are identical on most systems across
> CPUs, but there's no guarantee that it will always be the case.
>
> >Currently, the time offset is calculated per
> > VCPU, assuming
> > that the host CPU on which this VCPU runs has no impact on the
> > timestamp synchronization.
> > If the local clock depends on the CPU, then we should calculate the
> > time offset of each guest
> > event individually, depending on host CPU and VCPU the event happened
> > - as the host task which runs
> > the VCPU can migrate between CPUs at any time. So, we need to:
> >   1. Add timesync information for each host CPU in the trace.dat file.
> >   2. Track the migration between CPUs of each task that runs VCPU and
> > save that information
> >     in the trace.dat file.
>
> I was thinking about this too. And perhaps we can hold off until we find
> systems that have different values for mult and shift.
>
> That said, we can easily add this information by recording the sched_switch
> events in a separate buffer. And I've been thinking about doing this by
> default anyway. More below.
>
> >   2. When calculating the new timestamp of each guest event
> > (individually) - somehow find out on
> >      which host CPU that guest event happened ?
> >
> > Points 1 and 2 are doable, but will break the current trace.dat file
> > option that holds the timesync information.
>
> I don't think we need to have it in the timesync option. I think we can
> create another option to hold guest event data.
>
> > Point 3 is not clear to me, how we can get such information before the
> > host and guest events are synchronised ?
> >
>
>
> My thoughts about this is. When we enable tracing of a guest (-A), we then
> create an instance on the host that records only kvm enter / exit events as
> well as sched switch events. Basically, enable all the events that we need
> to synchronize and show entering and exiting of the guest.
>
> The synchronization logic already shows us what host thread controls each
> guest VCPU. If we record the kvm enter/exit and sched_switch events in a
> separate buffer, we can see when a host thread that runs a guest VCPU
> migrates to another CPU. Since the timestamps of those events are recorded
> in the meta events themselves (sched_switch), we know exactly where we need
> to use the new mult and shift values for the guest events.
>
> Make sense?
Yes, but I have two concerns:
 1. Recording kvm enter / exit events will make this implementation
KVM specific,
     how about other hypervisors ?
 2. The problem with finding guest event -> host CPU relation. We now only the
      timestamp of the guest event, not yet synchronized with the host
time. How will
      find on what host CPU that guest event happened, as the host
task CPU migration
      mapping is recorded with host time ?
>
> -- Steve



-- 
Tzvetomir (Ceco) Stoyanov
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf"
  2021-02-19 17:56         ` Tzvetomir Stoyanov
@ 2021-02-19 19:11           ` Steven Rostedt
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2021-02-19 19:11 UTC (permalink / raw)
  To: Tzvetomir Stoyanov; +Cc: Linux Trace Devel

On Fri, 19 Feb 2021 19:56:59 +0200
Tzvetomir Stoyanov <tz.stoyanov@gmail.com> wrote:


> > Make sense?  
> Yes, but I have two concerns:
>  1. Recording kvm enter / exit events will make this implementation
> KVM specific,
>      how about other hypervisors ?

I think this should all be part of the protocol. The option can be
TRACECMD_OPTION_GUEST_METADATA, where that has a field for "enter/exit"
events with CPU, vCPU, timestamp, and the also a note of sched switch
events.

That is, the implementation will enable kvm events, but other trace-cmd
plugins can implement other events. But what gets recorded into the
trace.dat file can be post processed into a generic format.


>  2. The problem with finding guest event -> host CPU relation. We now only the
>       timestamp of the guest event, not yet synchronized with the host
> time. How will
>       find on what host CPU that guest event happened, as the host
> task CPU migration
>       mapping is recorded with host time ?

It would be done all post processed. Here's the workflow I envision:

 1. Run pre synchronization protocol between host/guest (ie. kvm, ptp)

 2. Start the meta data events.
	Start tracing guest enter/exit events along with scheduling.
	We may be able to even filter specifically if we know the threads a
	head of time, but that's just an optimization.
	These events will be recorded raw into a file (with host threads
	saving the events to disk like it does for the tracing data).
	All this is in a separate instance.

 3. Run tracing
	Recorded events on the guest are sent to the host and stored in raw
	format on the host.
	Optionally, events on the host is also recorded.

 4. At the end of tracing
	Stop the recording of tracing.
	Stop the recording of meta data.
	Run post synchronization protocol between host and guest.

 5. Read the meta data events and convert them into a form of timestamp,
    enter/exit events, migration events. Yes, the timestamp will be that of
    the host.

A couple of things to understand with step 5, and assumptions we can make.

The scale factor between the host and guest for each VCPU should not
change, even if the VCPU migrates. If it does, then there had to have been a
kernel event where the kernel knew about this change, and we need to find a
way to extract that. For now, we shouldn't worry about it, as I'm guessing
that could be rather complex to manage even on the kernel end, and it would
also be expensive for migrations.

I'm making a big assumption here, that either the different VCPUS have
different scaling factors but the host scaling between physical CPUS are
the same, or the physical CPUS have different scaling factors, but the VCPUS
do not. So far I ran this on 4 different machines (including my laptop) and
all had the same mult and shift on the host. Perhaps we just state that we
do not support machines that do not have this.

Anyway, when all is done, the meta data events will map to the host. Using
this mapping and the synchronization between the host and guest, we can map
these migrations and such to the guest itself. And if the vCPUS are
different (and we assume the host is the same for all), and we know when
VCPU scheduled away from a physical CPU (from the host time), we can then
figure out how to find the guest time when that happened.

Or did I miss something?

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-02-19 19:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-03  6:02 [PATCH 0/5] Initial trace-cmd perf support Tzvetomir Stoyanov (VMware)
2020-12-03  6:02 ` [PATCH 1/5] trace-cmd: Internal refactoring of pid address map logic Tzvetomir Stoyanov (VMware)
2020-12-03  6:02 ` [PATCH 2/5] trace-cmd: Make read_file_string() non static Tzvetomir Stoyanov (VMware)
2020-12-03  6:02 ` [PATCH 3/5] trace-cmd: New internal APIs for reading ELF header Tzvetomir Stoyanov (VMware)
2020-12-03  6:02 ` [PATCH 4/5] trace-cmd: Add a new option in trace.dat file for the address to function name mapping Tzvetomir Stoyanov (VMware)
2020-12-03  6:02 ` [PATCH 5/5] [WIP] trace-cmd: Add new subcomand "trace-cmd perf" Tzvetomir Stoyanov (VMware)
2021-02-19  2:03   ` Steven Rostedt
2021-02-19  3:53     ` Steven Rostedt
2021-02-19 17:51       ` Dario Faggioli
2021-02-19  7:16     ` Tzvetomir Stoyanov
2021-02-19 14:36       ` Steven Rostedt
2021-02-19 17:56         ` Tzvetomir Stoyanov
2021-02-19 19:11           ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).