All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V3 0/6] event synthesization multithreading for perf record
@ 2017-10-20 20:05 kan.liang
  2017-10-20 20:05 ` [PATCH V3 1/6] perf tools: pass thread info to process function kan.liang
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The event synthesization multithreading is introduced in
("perf top optimization") https://lkml.org/lkml/2017/9/29/269
But it was not enabled for perf record. Because the process function
process_synthesized_event was not multithreading friendly.

The patch series temporarily stores the process result in per-thread file,
which make the processing in parallel. Then it dumps the file one by one to
the perf.data at the end of event synthesization.

The source code is also available at
https://github.com/kliang2/perf.git perf_record_opt

Usually, the event synthesization only happens once on either start or end.
With the snapshotting code, we synthesize events multiple times, once per
each new perf.data file. Both of the cases are verified.

Here are the latency test result on Knights Mill and Skylake server

The workload is to compile Linux kernel as below
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then, "sudo perf record -e cycles -a -- sleep 1"

The latency is the time cost of __machine__synthesize_threads or
its multithreading replacement, record__multithread_synthesize.

Original:              original single thread synthesize
With patch(not merge): multithread synthesize without final file merge
                       (intermediate results for scalability measurement)
With patch(merge):     multithread synthesize with file merge
                       (final result)

- Latency on Knights Mill (272 CPUs)

Original(s)	With patch(not merge)(s)	With patch(merge)(s)
12.7		6.6				7.76

- Latency on Skylake server (192 CPUs)

Original(s)	With patch(not merge)(s)	With patch(merge)(s)
0.34		0.21				0.23

Changes since V2:
 - Introduce a new interface to automatically generate tmp file.(Patch 4/6)
   Remove the tmp file when it close. (jirka)
 - Move all checks to record__multithread_synthesize (jirka)
 - Minor changes for record__multithread_synthesize
 - Update test data (Ingo)

Changes since V1:
 - Dump the synthesized result to per-thread file and merge them to perf.data
   at the end. (Arnaldo)

Kan Liang (6):
  perf tools: pass thread info to process function
  perf tools: pass thread info in event synthesization
  perf tools: expose copyfile_offset()
  perf tools: add perf_data_file__open_tmp
  perf record: synthesize event multithreading support
  perf record: add option to set the number of thread for event
    synthesize

 tools/perf/Documentation/perf-record.txt |   4 ++
 tools/perf/arch/x86/util/tsc.c           |   2 +-
 tools/perf/builtin-inject.c              |  12 +++-
 tools/perf/builtin-record.c              | 114 ++++++++++++++++++++++++++++---
 tools/perf/builtin-sched.c               |  12 ++--
 tools/perf/builtin-stat.c                |   3 +-
 tools/perf/builtin-trace.c               |   3 +-
 tools/perf/tests/cpumap.c                |   6 +-
 tools/perf/tests/dwarf-unwind.c          |   6 +-
 tools/perf/tests/event_update.c          |  12 ++--
 tools/perf/tests/stat.c                  |   9 ++-
 tools/perf/tests/thread-map.c            |   3 +-
 tools/perf/util/auxtrace.c               |   2 +-
 tools/perf/util/data.c                   |  26 +++++++
 tools/perf/util/data.h                   |   2 +
 tools/perf/util/event.c                  | 111 ++++++++++++++++++------------
 tools/perf/util/event.h                  |  19 ++++--
 tools/perf/util/header.c                 |  16 ++---
 tools/perf/util/intel-bts.c              |   3 +-
 tools/perf/util/intel-pt.c               |   3 +-
 tools/perf/util/session.c                |   4 +-
 tools/perf/util/util.c                   |   2 +-
 tools/perf/util/util.h                   |   2 +
 23 files changed, 284 insertions(+), 92 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH V3 1/6] perf tools: pass thread info to process function
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-20 20:05 ` [PATCH V3 2/6] perf tools: pass thread info in event synthesization kan.liang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

For multithreading, the process function needs to know the thread
related information. E.g. saving the process result to the buffer or
file which belongs to specific thread.

Add struct thread_info parameter for process function.
Currently, it only includes thread index.

perf_event__repipe is shared by process function and event_op of
perf_tool in builtin-inject.c. Add dedicated process function
perf_event__repipe_threads.

No functional change.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/arch/x86/util/tsc.c  |  2 +-
 tools/perf/builtin-inject.c     | 12 +++++++++++-
 tools/perf/builtin-record.c     |  3 ++-
 tools/perf/builtin-sched.c      | 12 ++++++++----
 tools/perf/builtin-stat.c       |  3 ++-
 tools/perf/builtin-trace.c      |  3 ++-
 tools/perf/tests/cpumap.c       |  6 ++++--
 tools/perf/tests/dwarf-unwind.c |  3 ++-
 tools/perf/tests/event_update.c | 12 ++++++++----
 tools/perf/tests/stat.c         |  9 ++++++---
 tools/perf/tests/thread-map.c   |  3 ++-
 tools/perf/util/auxtrace.c      |  2 +-
 tools/perf/util/event.c         | 15 ++++++++-------
 tools/perf/util/event.h         | 10 ++++++++--
 tools/perf/util/header.c        | 16 ++++++++--------
 tools/perf/util/intel-bts.c     |  3 ++-
 tools/perf/util/intel-pt.c      |  3 ++-
 tools/perf/util/session.c       |  4 ++--
 18 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index 2e5567c..0affc0f 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -76,5 +76,5 @@ int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc,
 	event.time_conv.time_shift = tc.time_shift;
 	event.time_conv.time_zero  = tc.time_zero;
 
-	return process(tool, &event, NULL, machine);
+	return process(tool, &event, NULL, machine, NULL);
 }
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 2b80329..67a6701 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -191,6 +191,15 @@ static int perf_event__repipe(struct perf_tool *tool,
 	return perf_event__repipe_synth(tool, event);
 }
 
+static int perf_event__repipe_threads(struct perf_tool *tool,
+				      union perf_event *event,
+				      struct perf_sample *sample,
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
+{
+	return perf_event__repipe(tool, event, sample, machine);
+}
+
 static int perf_event__drop(struct perf_tool *tool __maybe_unused,
 			    union perf_event *event __maybe_unused,
 			    struct perf_sample *sample __maybe_unused,
@@ -413,7 +422,8 @@ static int dso__inject_build_id(struct dso *dso, struct perf_tool *tool,
 	if (dso->kernel)
 		misc = PERF_RECORD_MISC_KERNEL;
 
-	err = perf_event__synthesize_build_id(tool, dso, misc, perf_event__repipe,
+	err = perf_event__synthesize_build_id(tool, dso, misc,
+					      perf_event__repipe_threads,
 					      machine);
 	if (err) {
 		pr_err("Can't synthesize build_id event for %s\n", dso->long_name);
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a6cbf16..f53c1163 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -123,7 +123,8 @@ static int record__write(struct record *rec, void *bf, size_t size)
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct record *rec = container_of(tool, struct record, tool);
 	return record__write(rec, event, event->header.size);
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index b7e8812..ed34a14 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1432,7 +1432,8 @@ static void perf_sched__sort_lat(struct perf_sched *sched)
 static int process_sched_wakeup_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1603,7 +1604,8 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 static int process_sched_switch_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 	int this_cpu = sample->cpu, err = 0;
@@ -1629,7 +1631,8 @@ static int process_sched_switch_event(struct perf_tool *tool,
 static int process_sched_runtime_event(struct perf_tool *tool,
 				       struct perf_evsel *evsel,
 				       struct perf_sample *sample,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1659,7 +1662,8 @@ static int perf_sched__process_fork_event(struct perf_tool *tool,
 static int process_sched_migrate_task_event(struct perf_tool *tool,
 					    struct perf_evsel *evsel,
 					    struct perf_sample *sample,
-					    struct machine *machine)
+					    struct machine *machine,
+					    struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index dd52541..80d5add 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -293,7 +293,8 @@ static inline int nsec_counter(struct perf_evsel *evsel)
 static int process_synthesized_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	if (perf_data_file__write(&perf_stat.file, event, event->header.size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index afef6fe..f737416 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1091,7 +1091,8 @@ static int trace__process_event(struct trace *trace, struct machine *machine,
 static int trace__tool_process(struct perf_tool *tool,
 			       union perf_event *event,
 			       struct perf_sample *sample,
-			       struct machine *machine)
+			       struct machine *machine,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct trace *trace = container_of(tool, struct trace, tool);
 	return trace__process_event(trace, machine, event, sample);
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 1997022..fec51c7 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -11,7 +11,8 @@ struct machine;
 static int process_event_mask(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_mask *mask;
@@ -45,7 +46,8 @@ static int process_event_mask(struct perf_tool *tool __maybe_unused,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_entries *cpus;
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 9ba1d21..5ed2271 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -22,7 +22,8 @@
 static int mmap_handler(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_mmap2_event(machine, event, sample);
 }
diff --git a/tools/perf/tests/event_update.c b/tools/perf/tests/event_update.c
index 9484da2..b5f4ab1 100644
--- a/tools/perf/tests/event_update.c
+++ b/tools/perf/tests/event_update.c
@@ -8,7 +8,8 @@
 static int process_event_unit(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 
@@ -21,7 +22,8 @@ static int process_event_unit(struct perf_tool *tool __maybe_unused,
 static int process_event_scale(struct perf_tool *tool __maybe_unused,
 			       union perf_event *event,
 			       struct perf_sample *sample __maybe_unused,
-			       struct machine *machine __maybe_unused)
+			       struct machine *machine __maybe_unused,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 	struct event_update_event_scale *ev_data;
@@ -42,7 +44,8 @@ struct event_name {
 static int process_event_name(struct perf_tool *tool,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_name *tmp = container_of(tool, struct event_name, tool);
 	struct event_update_event *ev = (struct event_update_event*) event;
@@ -56,7 +59,8 @@ static int process_event_name(struct perf_tool *tool,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event*) event;
 	struct event_update_event_cpus *ev_data;
diff --git a/tools/perf/tests/stat.c b/tools/perf/tests/stat.c
index 7f988a9..846cbec 100644
--- a/tools/perf/tests/stat.c
+++ b/tools/perf/tests/stat.c
@@ -22,7 +22,8 @@ static bool has_term(struct stat_config_event *config,
 static int process_stat_config_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct stat_config_event *config = &event->stat_config;
 	struct perf_stat_config stat_config;
@@ -62,7 +63,8 @@ int test__synthesize_stat_config(struct test *test __maybe_unused, int subtest _
 static int process_stat_event(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct stat_event *st = &event->stat;
 
@@ -92,7 +94,8 @@ int test__synthesize_stat(struct test *test __maybe_unused, int subtest __maybe_
 static int process_stat_round_event(struct perf_tool *tool __maybe_unused,
 				    union perf_event *event,
 				    struct perf_sample *sample __maybe_unused,
-				    struct machine *machine __maybe_unused)
+				    struct machine *machine __maybe_unused,
+				    struct thread_info *thread __maybe_unused)
 {
 	struct stat_round_event *stat_round = &event->stat_round;
 
diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index b3423c7..d2f42ad 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -52,7 +52,8 @@ int test__thread_map(struct test *test __maybe_unused, int subtest __maybe_unuse
 static int process_event(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct thread_map_event *map = &event->thread_map;
 	struct thread_map *threads;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 5547457..c4ab2c8 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -887,7 +887,7 @@ int perf_event__synthesize_auxtrace_info(struct auxtrace_record *itr,
 	if (err)
 		goto out_free;
 
-	err = process(tool, ev, NULL, NULL);
+	err = process(tool, ev, NULL, NULL, NULL);
 out_free:
 	free(ev);
 	return err;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 47eff47..fd523ca7 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -102,7 +102,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine);
+	return process(tool, event, &synth_sample, machine, NULL);
 };
 
 /*
@@ -976,7 +976,7 @@ int perf_event__synthesize_thread_map2(struct perf_tool *tool,
 		strncpy((char *) &entry->comm, comm, sizeof(entry->comm));
 	}
 
-	err = process(tool, event, NULL, machine);
+	err = process(tool, event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1107,7 +1107,7 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
 	if (!event)
 		return -ENOMEM;
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1145,7 +1145,7 @@ int perf_event__synthesize_stat_config(struct perf_tool *tool,
 		  "stat config terms unbalanced\n");
 #undef ADD
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1170,7 +1170,7 @@ int perf_event__synthesize_stat(struct perf_tool *tool,
 	event.ena       = count->ena;
 	event.run       = count->run;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 int perf_event__synthesize_stat_round(struct perf_tool *tool,
@@ -1187,7 +1187,7 @@ int perf_event__synthesize_stat_round(struct perf_tool *tool,
 	event.time = evtime;
 	event.type = type;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 void perf_event__read_stat_config(struct perf_stat_config *config,
@@ -1476,7 +1476,8 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 int perf_event__process(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_event(machine, event, sample);
 }
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d6cbb0a..200f3f8 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -659,10 +659,15 @@ struct cpu_map;
 struct perf_stat_config;
 struct perf_counts_values;
 
+struct thread_info {
+	int	idx;
+};
+
 typedef int (*perf_event__handler_t)(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample,
-				     struct machine *machine);
+				     struct machine *machine,
+				     struct thread_info *thread);
 
 int perf_event__synthesize_thread_map(struct perf_tool *tool,
 				      struct thread_map *threads,
@@ -751,7 +756,8 @@ int perf_event__process_exit(struct perf_tool *tool,
 int perf_event__process(struct perf_tool *tool,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine);
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused);
 
 struct addr_location;
 
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 605bbd5..c0183fb 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2986,7 +2986,7 @@ int perf_event__synthesize_attr(struct perf_tool *tool,
 	ev->attr.header.size = (u16)size;
 
 	if (ev->attr.header.size == size)
-		err = process(tool, ev, NULL, NULL);
+		err = process(tool, ev, NULL, NULL, NULL);
 	else
 		err = -E2BIG;
 
@@ -3040,7 +3040,7 @@ int perf_event__synthesize_features(struct perf_tool *tool,
 		fe->header.type = PERF_RECORD_HEADER_FEATURE;
 		fe->header.size = ff.offset;
 
-		ret = process(tool, ff.buf, NULL, NULL);
+		ret = process(tool, ff.buf, NULL, NULL, NULL);
 		if (ret) {
 			free(ff.buf);
 			return ret;
@@ -3124,7 +3124,7 @@ perf_event__synthesize_event_update_unit(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->unit, size);
-	err = process(tool, (union perf_event *)ev, NULL, NULL);
+	err = process(tool, (union perf_event *)ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3144,7 +3144,7 @@ perf_event__synthesize_event_update_scale(struct perf_tool *tool,
 
 	ev_data = (struct event_update_event_scale *) ev->data;
 	ev_data->scale = evsel->scale;
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3163,7 +3163,7 @@ perf_event__synthesize_event_update_name(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->name, len);
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3194,7 +3194,7 @@ perf_event__synthesize_event_update_cpus(struct perf_tool *tool,
 				 evsel->own_cpus,
 				 type, max);
 
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3377,7 +3377,7 @@ int perf_event__synthesize_tracing_data(struct perf_tool *tool, int fd,
 	ev.tracing_data.header.size = sizeof(ev.tracing_data);
 	ev.tracing_data.size = aligned_size;
 
-	process(tool, &ev, NULL, NULL);
+	process(tool, &ev, NULL, NULL, NULL);
 
 	/*
 	 * The put function will copy all the tracing data
@@ -3455,7 +3455,7 @@ int perf_event__synthesize_build_id(struct perf_tool *tool,
 	ev.build_id.header.size = sizeof(ev.build_id) + len;
 	memcpy(&ev.build_id.filename, pos->long_name, pos->long_name_len);
 
-	err = process(tool, &ev, NULL, machine);
+	err = process(tool, &ev, NULL, machine, NULL);
 
 	return err;
 }
diff --git a/tools/perf/util/intel-bts.c b/tools/perf/util/intel-bts.c
index 218ee2b..8c15309 100644
--- a/tools/perf/util/intel-bts.c
+++ b/tools/perf/util/intel-bts.c
@@ -755,7 +755,8 @@ struct intel_bts_synth {
 static int intel_bts_event_synth(struct perf_tool *tool,
 				 union perf_event *event,
 				 struct perf_sample *sample __maybe_unused,
-				 struct machine *machine __maybe_unused)
+				 struct machine *machine __maybe_unused,
+				 struct thread_info *thread __maybe_unused)
 {
 	struct intel_bts_synth *intel_bts_synth =
 			container_of(tool, struct intel_bts_synth, dummy_tool);
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index b58f9fd..4858634 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2121,7 +2121,8 @@ struct intel_pt_synth {
 static int intel_pt_event_synth(struct perf_tool *tool,
 				union perf_event *event,
 				struct perf_sample *sample __maybe_unused,
-				struct machine *machine __maybe_unused)
+				struct machine *machine __maybe_unused,
+				struct thread_info *thread __maybe_unused)
 {
 	struct intel_pt_synth *intel_pt_synth =
 			container_of(tool, struct intel_pt_synth, dummy_tool);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ceac084..f044bad 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2211,7 +2211,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 			struct perf_sample_id *sid;
 
 			if (i >= n) {
-				err = process(tool, ev, NULL, machine);
+				err = process(tool, ev, NULL, machine, NULL);
 				if (err)
 					goto out_err;
 				nr -= n;
@@ -2238,7 +2238,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 	ev->id_index.header.size = sz;
 	ev->id_index.nr = nr;
 
-	err = process(tool, ev, NULL, machine);
+	err = process(tool, ev, NULL, machine, NULL);
 out_err:
 	free(ev);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 2/6] perf tools: pass thread info in event synthesization
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
  2017-10-20 20:05 ` [PATCH V3 1/6] perf tools: pass thread info to process function kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-20 20:05 ` [PATCH V3 3/6] perf tools: expose copyfile_offset() kan.liang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Pass the thread idx to process function, which is used by the following
patch.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c     |  4 +-
 tools/perf/tests/dwarf-unwind.c |  3 +-
 tools/perf/util/event.c         | 98 ++++++++++++++++++++++++++---------------
 tools/perf/util/event.h         |  9 ++--
 4 files changed, 72 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f53c1163..4ede9bf 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -922,7 +922,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		tgid = perf_event__synthesize_comm(tool, event,
 						   rec->evlist->workload.pid,
 						   process_synthesized_event,
-						   machine);
+						   machine, NULL);
 		free(event);
 
 		if (tgid == -1)
@@ -942,7 +942,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		perf_event__synthesize_namespaces(tool, event,
 						  rec->evlist->workload.pid,
 						  tgid, process_synthesized_event,
-						  machine);
+						  machine, NULL);
 		free(event);
 
 		perf_evlist__start_workload(rec->evlist);
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 5ed2271..792c277 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -34,7 +34,8 @@ static int init_live_machine(struct machine *machine)
 	pid_t pid = getpid();
 
 	return perf_event__synthesize_mmap_events(NULL, &event, pid, pid,
-						  mmap_handler, machine, true, 500);
+						  mmap_handler, machine,
+						  true, 500, NULL);
 }
 
 #define MAX_STACK 8
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index fd523ca7..43e1dfa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -90,7 +90,8 @@ static const char *perf_ns__name(unsigned int id)
 static int perf_tool__process_synth_event(struct perf_tool *tool,
 					  union perf_event *event,
 					  struct machine *machine,
-					  perf_event__handler_t process)
+					  perf_event__handler_t process,
+					  struct thread_info *thread)
 {
 	struct perf_sample synth_sample = {
 	.pid	   = -1,
@@ -102,7 +103,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine, NULL);
+	return process(tool, event, &synth_sample, machine, thread);
 };
 
 /*
@@ -219,14 +220,16 @@ static int perf_event__prepare_comm(union perf_event *event, pid_t pid,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 					 union perf_event *event, pid_t pid,
 					 perf_event__handler_t process,
-					 struct machine *machine)
+					 struct machine *machine,
+					 struct thread_info *thread)
 {
 	pid_t tgid, ppid;
 
 	if (perf_event__prepare_comm(event, pid, machine, &tgid, &ppid) != 0)
 		return -1;
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return tgid;
@@ -249,7 +252,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread)
 {
 	u32 idx;
 	struct perf_ns_link_info *ns_link_info;
@@ -278,7 +282,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 			(NR_NAMESPACES * sizeof(struct perf_ns_link_info)) +
 			machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -288,7 +293,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 				       union perf_event *event,
 				       pid_t pid, pid_t tgid, pid_t ppid,
 				       perf_event__handler_t process,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread)
 {
 	memset(&event->fork, 0, sizeof(event->fork) + machine->id_hdr_size);
 
@@ -310,7 +316,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 
 	event->fork.header.size = (sizeof(event->fork) + machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -322,7 +329,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout)
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	FILE *fp;
@@ -444,7 +452,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		event->mmap2.pid = tgid;
 		event->mmap2.tid = pid;
 
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, thread) != 0) {
 			rc = -1;
 			break;
 		}
@@ -502,7 +511,8 @@ int perf_event__synthesize_modules(struct perf_tool *tool,
 
 		memcpy(event->mmap.filename, pos->dso->long_name,
 		       pos->dso->long_name_len + 1);
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, NULL) != 0) {
 			rc = -1;
 			break;
 		}
@@ -521,7 +531,8 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 				      struct perf_tool *tool,
 				      struct machine *machine,
 				      bool mmap_data,
-				      unsigned int proc_map_timeout)
+				      unsigned int proc_map_timeout,
+				      struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	DIR *tasks;
@@ -532,19 +543,22 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 	/* special case: only send one comm event using passed in pid */
 	if (!full) {
 		tgid = perf_event__synthesize_comm(tool, comm_event, pid,
-						   process, machine);
+						   process, machine, thread);
 
 		if (tgid == -1)
 			return -1;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      pid, tgid, process,
+						      machine, thread) < 0)
 			return -1;
 
 
-		return perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-							  process, machine, mmap_data,
-							  proc_map_timeout);
+		return perf_event__synthesize_mmap_events(tool, mmap_event,
+							  pid, tgid, process,
+							  machine, mmap_data,
+							  proc_map_timeout,
+							  thread);
 	}
 
 	if (machine__is_default_guest(machine))
@@ -572,25 +586,30 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 					     &tgid, &ppid) != 0)
 			break;
 
-		if (perf_event__synthesize_fork(tool, fork_event, _pid, tgid,
-						ppid, process, machine) < 0)
+		if (perf_event__synthesize_fork(tool, fork_event, _pid,
+						tgid, ppid, process,
+						machine, thread) < 0)
 			break;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, _pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      _pid, tgid, process,
+						      machine, thread) < 0)
 			break;
 
 		/*
 		 * Send the prepared comm event
 		 */
-		if (perf_tool__process_synth_event(tool, comm_event, machine, process) != 0)
+		if (perf_tool__process_synth_event(tool, comm_event, machine,
+						   process, thread) != 0)
 			break;
 
 		rc = 0;
 		if (_pid == pid) {
 			/* process the parent's maps too */
-			rc = perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-						process, machine, mmap_data, proc_map_timeout);
+			rc = perf_event__synthesize_mmap_events(tool,
+						mmap_event, pid, tgid,
+						process, machine, mmap_data,
+						proc_map_timeout, thread);
 			if (rc)
 				break;
 		}
@@ -633,9 +652,10 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 	for (thread = 0; thread < threads->nr; ++thread) {
 		if (__event__synthesize_thread(comm_event, mmap_event,
 					       fork_event, namespaces_event,
-					       thread_map__pid(threads, thread), 0,
-					       process, tool, machine,
-					       mmap_data, proc_map_timeout)) {
+					       thread_map__pid(threads, thread),
+					       0, process, tool, machine,
+					       mmap_data, proc_map_timeout,
+					       NULL)) {
 			err = -1;
 			break;
 		}
@@ -658,10 +678,13 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 			/* if not, generate events for it */
 			if (need_leader &&
 			    __event__synthesize_thread(comm_event, mmap_event,
-						       fork_event, namespaces_event,
+						       fork_event,
+						       namespaces_event,
 						       comm_event->comm.pid, 0,
 						       process, tool, machine,
-						       mmap_data, proc_map_timeout)) {
+						       mmap_data,
+						       proc_map_timeout,
+						       NULL)) {
 				err = -1;
 				break;
 			}
@@ -684,8 +707,8 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 					    bool mmap_data,
 					    unsigned int proc_map_timeout,
 					    struct dirent **dirent,
-					    int start,
-					    int num)
+					    int start, int num,
+					    struct thread_info *thread)
 {
 	union perf_event *comm_event, *mmap_event, *fork_event;
 	union perf_event *namespaces_event;
@@ -727,7 +750,7 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 		__event__synthesize_thread(comm_event, mmap_event, fork_event,
 					   namespaces_event, pid, 1, process,
 					   tool, machine, mmap_data,
-					   proc_map_timeout);
+					   proc_map_timeout, thread);
 	}
 	err = 0;
 
@@ -751,6 +774,7 @@ struct synthesize_threads_arg {
 	struct dirent **dirent;
 	int num;
 	int start;
+	struct thread_info thread;
 };
 
 static void *synthesize_threads_worker(void *arg)
@@ -760,7 +784,7 @@ static void *synthesize_threads_worker(void *arg)
 	__perf_event__synthesize_threads(args->tool, args->process,
 					 args->machine, args->mmap_data,
 					 args->proc_map_timeout, args->dirent,
-					 args->start, args->num);
+					 args->start, args->num, &args->thread);
 	return NULL;
 }
 
@@ -799,7 +823,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		err = __perf_event__synthesize_threads(tool, process,
 						       machine, mmap_data,
 						       proc_map_timeout,
-						       dirent, base, n);
+						       dirent, base, n, NULL);
 		goto free_dirent;
 	}
 	if (thread_nr > n)
@@ -822,6 +846,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		args[i].mmap_data = mmap_data;
 		args[i].proc_map_timeout = proc_map_timeout;
 		args[i].dirent = dirent;
+		args[i].thread.idx = i;
 	}
 	for (i = 0; i < m; i++) {
 		args[i].num = num_per_thread + 1;
@@ -940,7 +965,8 @@ int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
 	event->mmap.len   = map->end - event->mmap.start;
 	event->mmap.pid   = machine->pid;
 
-	err = perf_tool__process_synth_event(tool, event, machine, process);
+	err = perf_tool__process_synth_event(tool, event, machine,
+					     process, NULL);
 	free(event);
 
 	return err;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 200f3f8..6e7b08b3 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -785,13 +785,15 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 				  union perf_event *event, pid_t pid,
 				  perf_event__handler_t process,
-				  struct machine *machine);
+				  struct machine *machine,
+				  struct thread_info *thread);
 
 int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine);
+				      struct machine *machine,
+				      struct thread_info *thread);
 
 int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       union perf_event *event,
@@ -799,7 +801,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout);
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread);
 
 size_t perf_event__fprintf_comm(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 3/6] perf tools: expose copyfile_offset()
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
  2017-10-20 20:05 ` [PATCH V3 1/6] perf tools: pass thread info to process function kan.liang
  2017-10-20 20:05 ` [PATCH V3 2/6] perf tools: pass thread info in event synthesization kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-20 20:05 ` [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp kan.liang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

copyfile_offset could be used to merge per thread file to perf.data in
the following patch.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/util/util.c | 2 +-
 tools/perf/util/util.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 97e0c8e..a003ce4 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -188,7 +188,7 @@ static int slow_copyfile(const char *from, const char *to, struct nsinfo *nsi)
 	return err;
 }
 
-static int copyfile_offset(int ifd, loff_t off_in, int ofd, loff_t off_out, u64 size)
+int copyfile_offset(int ifd, loff_t off_in, int ofd, loff_t off_out, u64 size)
 {
 	void *ptr;
 	loff_t pgoff;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 6c7e6cc..0e1358f 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -5,6 +5,7 @@
 /* glibc 2.20 deprecates _BSD_SOURCE in favour of _DEFAULT_SOURCE */
 #define _DEFAULT_SOURCE 1
 
+#include <fcntl.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdlib.h>
@@ -34,6 +35,7 @@ bool lsdir_no_dot_filter(const char *name, struct dirent *d);
 int copyfile(const char *from, const char *to);
 int copyfile_mode(const char *from, const char *to, mode_t mode);
 int copyfile_ns(const char *from, const char *to, struct nsinfo *nsi);
+int copyfile_offset(int fromfd, loff_t from_ofs, int tofd, loff_t to_ofs, u64 size);
 
 ssize_t readn(int fd, void *buf, size_t n);
 ssize_t writen(int fd, const void *buf, size_t n);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
                   ` (2 preceding siblings ...)
  2017-10-20 20:05 ` [PATCH V3 3/6] perf tools: expose copyfile_offset() kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-23 16:13   ` Jiri Olsa
  2017-10-24  7:26   ` Jiri Olsa
  2017-10-20 20:05 ` [PATCH V3 5/6] perf record: synthesize event multithreading support kan.liang
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

And an interface for perf_data_file to open tmp file.
Automatically generate the tmp file name if it's not assigned. The name
cannot be const char. Introduce char *tmp_path for it.
The tmp file will be deleted after close.

It will be used as per-thread file to temporarily store event synthesizd
result in the following patch.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/util/data.c | 26 ++++++++++++++++++++++++++
 tools/perf/util/data.h |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
index 1123b30..cd6fdf9 100644
--- a/tools/perf/util/data.c
+++ b/tools/perf/util/data.c
@@ -139,9 +139,35 @@ int perf_data_file__open(struct perf_data_file *file)
 	return open_file(file);
 }
 
+int perf_data_file__open_tmp(struct perf_data_file *file)
+{
+	int fd;
+
+	if (!file->tmp_path &&
+	    (asprintf(&file->tmp_path, "perf.tmp.XXXXXX") < 0))
+		goto err;
+
+	fd = mkstemp(file->tmp_path);
+	if (fd < 0) {
+		free(file->tmp_path);
+		goto err;
+	}
+
+	file->fd = fd;
+	return 0;
+err:
+	file->tmp_path = NULL;
+	return -1;
+}
+
 void perf_data_file__close(struct perf_data_file *file)
 {
 	close(file->fd);
+	if (file->tmp_path) {
+		unlink(file->tmp_path);
+		free(file->tmp_path);
+		file->tmp_path = NULL;
+	}
 }
 
 ssize_t perf_data_file__write(struct perf_data_file *file,
diff --git a/tools/perf/util/data.h b/tools/perf/util/data.h
index ae510ce..892b3d5 100644
--- a/tools/perf/util/data.h
+++ b/tools/perf/util/data.h
@@ -10,6 +10,7 @@ enum perf_data_mode {
 
 struct perf_data_file {
 	const char		*path;
+	char			*tmp_path;
 	int			 fd;
 	bool			 is_pipe;
 	bool			 force;
@@ -43,6 +44,7 @@ static inline unsigned long perf_data_file__size(struct perf_data_file *file)
 }
 
 int perf_data_file__open(struct perf_data_file *file);
+int perf_data_file__open_tmp(struct perf_data_file *file);
 void perf_data_file__close(struct perf_data_file *file);
 ssize_t perf_data_file__write(struct perf_data_file *file,
 			      void *buf, size_t size);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 5/6] perf record: synthesize event multithreading support
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
                   ` (3 preceding siblings ...)
  2017-10-20 20:05 ` [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-20 20:05 ` [PATCH V3 6/6] perf record: add option to set the number of thread for event synthesize kan.liang
  2017-10-23 11:48 ` [PATCH V3 0/6] event synthesization multithreading for perf record Ingo Molnar
  6 siblings, 0 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The process function process_synthesized_event writes the synthesized
result to perf.data, which is not multithreading friendly.

Create per thread file to temporarily keep the processing result.
Write them to the perf.data at the end of event synthesization.
The new method doesn't impact the final result, because the order of the
synthesized event is not important.

The threads number hard code to online CPU number. The following patch
will introduce an option to set it.

The multithreading synthesize is only available for per cpu monitoring.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c | 103 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 94 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4ede9bf..283fc21 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -50,6 +50,7 @@
 #include <signal.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
+#include <sys/stat.h>
 #include <asm/bug.h>
 #include <linux/time64.h>
 
@@ -80,6 +81,7 @@ struct record {
 	bool			timestamp_filename;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
+	struct perf_data_file	*synthesized_file;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -105,6 +107,14 @@ static bool switch_output_time(struct record *rec)
 	       trigger_is_ready(&switch_output_trigger);
 }
 
+static void update_bytes_written(struct record *rec, size_t size)
+{
+	rec->bytes_written += size;
+
+	if (switch_output_size(rec))
+		trigger_hit(&switch_output_trigger);
+}
+
 static int record__write(struct record *rec, void *bf, size_t size)
 {
 	if (perf_data_file__write(rec->session->file, bf, size) < 0) {
@@ -112,10 +122,7 @@ static int record__write(struct record *rec, void *bf, size_t size)
 		return -1;
 	}
 
-	rec->bytes_written += size;
-
-	if (switch_output_size(rec))
-		trigger_hit(&switch_output_trigger);
+	update_bytes_written(rec, size);
 
 	return 0;
 }
@@ -124,10 +131,15 @@ static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
 				     struct machine *machine __maybe_unused,
-				     struct thread_info *thread __maybe_unused)
+				     struct thread_info *thread)
 {
 	struct record *rec = container_of(tool, struct record, tool);
-	return record__write(rec, event, event->header.size);
+
+	if (!perf_singlethreaded && thread)
+		return (perf_data_file__write(&rec->synthesized_file[thread->idx],
+					      event, event->header.size) < 0) ? -1 : 0;
+	else
+		return record__write(rec, event, event->header.size);
 }
 
 static int record__pushfn(void *to, void *bf, size_t size)
@@ -690,6 +702,81 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
 	return NULL;
 }
 
+static int record__multithread_synthesize(struct record *rec,
+					  struct machine *machine,
+					  struct perf_tool *tool,
+					  struct record_opts *opts)
+{
+	int i, j, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	int fd_from, fd_to;
+	struct stat st;
+
+again:
+	/* multithreading synthesize is only available for cpu monitoring */
+	if (target__has_task(&opts->target) || (nr_thread <= 1))
+		return __machine__synthesize_threads(machine, tool,
+						     &opts->target,
+						     rec->evlist->threads,
+						     process_synthesized_event,
+						     opts->sample_address,
+						     opts->proc_map_timeout,
+						     1);
+
+	rec->synthesized_file = calloc(nr_thread, sizeof(struct perf_data_file));
+	if (rec->synthesized_file == NULL) {
+		pr_debug("Could not do multithread synthesize."
+			 "Roll back to single thread\n");
+		nr_thread = 1;
+		goto again;
+	}
+
+	for (i = 0; i < nr_thread; i++) {
+		if (perf_data_file__open_tmp(&rec->synthesized_file[i])) {
+			/* Roll back if failed to open tmp file */
+			for (j = 0; j < i; j++)
+				perf_data_file__close(&rec->synthesized_file[j]);
+			free(rec->synthesized_file);
+			nr_thread = 1;
+			goto again;
+		}
+	}
+
+	/* now start multithreading */
+	perf_set_multithreaded();
+
+	err = __machine__synthesize_threads(machine, tool, &opts->target,
+					    rec->evlist->threads,
+					    process_synthesized_event,
+					    opts->sample_address,
+					    opts->proc_map_timeout, nr_thread);
+	if (err < 0)
+		goto free;
+
+	fd_to = rec->session->file->fd;
+	for (i = 0; i < nr_thread; i++) {
+		fd_from = rec->synthesized_file[i].fd;
+
+		fstat(fd_from, &st);
+		if (st.st_size == 0)
+			continue;
+		err = copyfile_offset(fd_from, 0, fd_to,
+				      lseek(fd_to, 0, SEEK_END),
+				      st.st_size);
+		if (err < 0)
+			goto free;
+		/* pwrite in copyfile_offset doesn't change the file pointer */
+		lseek(fd_to, 0, SEEK_END);
+		update_bytes_written(rec, st.st_size);
+	}
+free:
+	for (i = 0; i < nr_thread; i++)
+		perf_data_file__close(&rec->synthesized_file[i]);
+	free(rec->synthesized_file);
+	perf_set_singlethreaded();
+
+	return err;
+}
+
 static int record__synthesize(struct record *rec, bool tail)
 {
 	struct perf_session *session = rec->session;
@@ -766,9 +853,7 @@ static int record__synthesize(struct record *rec, bool tail)
 					 perf_event__synthesize_guest_os, tool);
 	}
 
-	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
-					    process_synthesized_event, opts->sample_address,
-					    opts->proc_map_timeout, 1);
+	err = record__multithread_synthesize(rec, machine, tool, opts);
 out:
 	return err;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH V3 6/6] perf record: add option to set the number of thread for event synthesize
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
                   ` (4 preceding siblings ...)
  2017-10-20 20:05 ` [PATCH V3 5/6] perf record: synthesize event multithreading support kan.liang
@ 2017-10-20 20:05 ` kan.liang
  2017-10-23 11:48 ` [PATCH V3 0/6] event synthesization multithreading for perf record Ingo Molnar
  6 siblings, 0 replies; 27+ messages in thread
From: kan.liang @ 2017-10-20 20:05 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Using UINT_MAX to indicate the default thread#, which is the number of
online CPU.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/Documentation/perf-record.txt |  4 ++++
 tools/perf/builtin-record.c              | 10 +++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 68a1ffb..f759dc4 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -483,6 +483,10 @@ config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'.
 
 Implies --tail-synthesize.
 
+--num-thread-synthesize::
+The number of threads to run event synthesize.
+By default, the number of threads equals to the online CPU number.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 283fc21..4b65ebd 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -82,6 +82,7 @@ struct record {
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	struct perf_data_file	*synthesized_file;
+	unsigned int		nr_threads_synthesize;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -707,10 +708,14 @@ static int record__multithread_synthesize(struct record *rec,
 					  struct perf_tool *tool,
 					  struct record_opts *opts)
 {
-	int i, j, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	int i, j, err, nr_thread;
 	int fd_from, fd_to;
 	struct stat st;
 
+	if (rec->nr_threads_synthesize == UINT_MAX)
+		nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	else
+		nr_thread = rec->nr_threads_synthesize;
 again:
 	/* multithreading synthesize is only available for cpu monitoring */
 	if (target__has_task(&opts->target) || (nr_thread <= 1))
@@ -1530,6 +1535,7 @@ static struct record record = {
 		.mmap2		= perf_event__process_mmap2,
 		.ordered_events	= true,
 	},
+	.nr_threads_synthesize = UINT_MAX,
 };
 
 const char record_callchain_help[] = CALLCHAIN_RECORD_HELP
@@ -1671,6 +1677,8 @@ static struct option __record_options[] = {
 			  "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_UINTEGER(0, "num-thread-synthesize", &record.nr_threads_synthesize,
+			"number of thread to run event synthesize"),
 	OPT_END()
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
                   ` (5 preceding siblings ...)
  2017-10-20 20:05 ` [PATCH V3 6/6] perf record: add option to set the number of thread for event synthesize kan.liang
@ 2017-10-23 11:48 ` Ingo Molnar
  2017-10-23 13:43   ` Liang, Kan
  6 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2017-10-23 11:48 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak


* kan.liang@intel.com <kan.liang@intel.com> wrote:

> The latency is the time cost of __machine__synthesize_threads or
> its multithreading replacement, record__multithread_synthesize.
> 
> Original:              original single thread synthesize
> With patch(not merge): multithread synthesize without final file merge
>                        (intermediate results for scalability measurement)
> With patch(merge):     multithread synthesize with file merge
>                        (final result)
> 
> - Latency on Knights Mill (272 CPUs)
> 
> Original(s)	With patch(not merge)(s)	With patch(merge)(s)
> 12.7		6.6				7.76
> 
> - Latency on Skylake server (192 CPUs)
> 
> Original(s)	With patch(not merge)(s)	With patch(merge)(s)
> 0.34		0.21				0.23

Ok, I think I mis-understood some aspects of the patch series.

It multi-threads a certain stage of processing (synthesizing), but not the _whole_ 
process of recording events, right?

So I'm wondering, in the context of 'perf record -a' and 'perf top' CPU-granular 
profiling at least (but maybe also in the context of inherited workload 'perf 
record' profiling), could we simply record with per-CPU recording threads created 
early on, which would record into the percpu files quite naturally, which would 
also offer natural multithreading of any 'synthesizing' steps later on?

I.e. instead of multithreading perf record piecemeal wise, why not multithread it 
all - and win big in terms of scalable, low overhead profiling?

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-23 11:48 ` [PATCH V3 0/6] event synthesization multithreading for perf record Ingo Molnar
@ 2017-10-23 13:43   ` Liang, Kan
  2017-10-23 14:25     ` acme
  2017-10-24  9:22     ` Ingo Molnar
  0 siblings, 2 replies; 27+ messages in thread
From: Liang, Kan @ 2017-10-23 13:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

> * kan.liang@intel.com <kan.liang@intel.com> wrote:
> 
> > The latency is the time cost of __machine__synthesize_threads or its
> > multithreading replacement, record__multithread_synthesize.
> >
> > Original:              original single thread synthesize
> > With patch(not merge): multithread synthesize without final file merge
> >                        (intermediate results for scalability measurement)
> > With patch(merge):     multithread synthesize with file merge
> >                        (final result)
> >
> > - Latency on Knights Mill (272 CPUs)
> >
> > Original(s)	With patch(not merge)(s)	With patch(merge)(s)
> > 12.7		6.6				7.76
> >
> > - Latency on Skylake server (192 CPUs)
> >
> > Original(s)	With patch(not merge)(s)	With patch(merge)(s)
> > 0.34		0.21				0.23
> 
> Ok, I think I mis-understood some aspects of the patch series.
> 
> It multi-threads a certain stage of processing (synthesizing), but not the
> _whole_ process of recording events, right?

Right.

> 
> So I'm wondering, in the context of 'perf record -a' and 'perf top' CPU-
> granular profiling at least (but maybe also in the context of inherited
> workload 'perf record' profiling), could we simply record with per-CPU
> recording threads created early on, which would record into the percpu files
> quite naturally, which would also offer natural multithreading of any
> 'synthesizing' steps later on?
> 
> I.e. instead of multithreading perf record piecemeal wise, why not
> multithread it all - and win big in terms of scalable, low overhead profiling?
>

For 'all', do you mean the whole process?
I think that's the ultimate goal.  Eventually there will be per-CPU recording
threads created at the beginning of perf record and go through the whole process.
The plan is to do the multithreading step by step from the simplest case.
Synthesizing stage is just a start.

Only for synthesizing stage, I think the patch series should already cover all the
'synthesizing' steps which can do multithreading. For the rest 'synthesizing' steps,
it only need to be done by single thread.

Since there is only multithreading for 'synthesizing' step, the threads creation related
code is event.c for now. It's better to move it to a dedicate file and make it generic for
recording threads. I think we can do it later separately. 


Thanks,
Kan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-23 13:43   ` Liang, Kan
@ 2017-10-23 14:25     ` acme
  2017-10-23 18:45       ` Liang, Kan
  2017-10-24  9:22     ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: acme @ 2017-10-23 14:25 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Ingo Molnar, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak

Em Mon, Oct 23, 2017 at 01:43:39PM +0000, Liang, Kan escreveu:
> The plan is to do the multithreading step by step from the simplest case.
> Synthesizing stage is just a start.
 
> Only for synthesizing stage, I think the patch series should already cover all the
> 'synthesizing' steps which can do multithreading. For the rest 'synthesizing' steps,
> it only need to be done by single thread.
 
> Since there is only multithreading for 'synthesizing' step, the threads creation related
> code is event.c for now. It's better to move it to a dedicate file and make it generic for
> recording threads. I think we can do it later separately. 

Yes, and I'm happy that you plan to continue working in this area,
right? :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-20 20:05 ` [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp kan.liang
@ 2017-10-23 16:13   ` Jiri Olsa
  2017-10-23 16:19     ` Liang, Kan
  2017-10-24  7:26   ` Jiri Olsa
  1 sibling, 1 reply; 27+ messages in thread
From: Jiri Olsa @ 2017-10-23 16:13 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak

On Fri, Oct 20, 2017 at 01:05:32PM -0700, kan.liang@intel.com wrote:
> From: Kan Liang <Kan.liang@intel.com>
> 
> And an interface for perf_data_file to open tmp file.
> Automatically generate the tmp file name if it's not assigned. The name
> cannot be const char. Introduce char *tmp_path for it.
> The tmp file will be deleted after close.
> 
> It will be used as per-thread file to temporarily store event synthesizd
> result in the following patch.
> 
> Signed-off-by: Kan Liang <Kan.liang@intel.com>
> ---
>  tools/perf/util/data.c | 26 ++++++++++++++++++++++++++
>  tools/perf/util/data.h |  2 ++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index 1123b30..cd6fdf9 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -139,9 +139,35 @@ int perf_data_file__open(struct perf_data_file *file)
>  	return open_file(file);
>  }
>  
> +int perf_data_file__open_tmp(struct perf_data_file *file)
> +{
> +	int fd;
> +
> +	if (!file->tmp_path &&
> +	    (asprintf(&file->tmp_path, "perf.tmp.XXXXXX") < 0))
> +		goto err;
> +
> +	fd = mkstemp(file->tmp_path);
> +	if (fd < 0) {
> +		free(file->tmp_path);
> +		goto err;
> +	}
> +
> +	file->fd = fd;
> +	return 0;
> +err:
> +	file->tmp_path = NULL;
> +	return -1;
> +}
> +
>  void perf_data_file__close(struct perf_data_file *file)
>  {
>  	close(file->fd);
> +	if (file->tmp_path) {
> +		unlink(file->tmp_path);
> +		free(file->tmp_path);
> +		file->tmp_path = NULL;
> +	}
>  }
>  
>  ssize_t perf_data_file__write(struct perf_data_file *file,
> diff --git a/tools/perf/util/data.h b/tools/perf/util/data.h
> index ae510ce..892b3d5 100644
> --- a/tools/perf/util/data.h
> +++ b/tools/perf/util/data.h
> @@ -10,6 +10,7 @@ enum perf_data_mode {
>  
>  struct perf_data_file {
>  	const char		*path;
> +	char			*tmp_path;

could we add is_tmp instead of new path pointer
and keep the path for the name..?

thanks,
jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-23 16:13   ` Jiri Olsa
@ 2017-10-23 16:19     ` Liang, Kan
  2017-10-23 16:26       ` Jiri Olsa
  0 siblings, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2017-10-23 16:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

> > From: Kan Liang <Kan.liang@intel.com>
> >
> > And an interface for perf_data_file to open tmp file.
> > Automatically generate the tmp file name if it's not assigned. The
> > name cannot be const char. Introduce char *tmp_path for it.
> > The tmp file will be deleted after close.
> >
> > It will be used as per-thread file to temporarily store event
> > synthesizd result in the following patch.
> >
> > Signed-off-by: Kan Liang <Kan.liang@intel.com>
> > ---
> >  tools/perf/util/data.c | 26 ++++++++++++++++++++++++++
> > tools/perf/util/data.h |  2 ++
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c index
> > 1123b30..cd6fdf9 100644
> > --- a/tools/perf/util/data.c
> > +++ b/tools/perf/util/data.c
> > @@ -139,9 +139,35 @@ int perf_data_file__open(struct perf_data_file
> *file)
> >  	return open_file(file);
> >  }
> >
> > +int perf_data_file__open_tmp(struct perf_data_file *file) {
> > +	int fd;
> > +
> > +	if (!file->tmp_path &&
> > +	    (asprintf(&file->tmp_path, "perf.tmp.XXXXXX") < 0))
> > +		goto err;
> > +
> > +	fd = mkstemp(file->tmp_path);
> > +	if (fd < 0) {
> > +		free(file->tmp_path);
> > +		goto err;
> > +	}
> > +
> > +	file->fd = fd;
> > +	return 0;
> > +err:
> > +	file->tmp_path = NULL;
> > +	return -1;
> > +}
> > +
> >  void perf_data_file__close(struct perf_data_file *file)  {
> >  	close(file->fd);
> > +	if (file->tmp_path) {
> > +		unlink(file->tmp_path);
> > +		free(file->tmp_path);
> > +		file->tmp_path = NULL;
> > +	}
> >  }
> >
> >  ssize_t perf_data_file__write(struct perf_data_file *file, diff --git
> > a/tools/perf/util/data.h b/tools/perf/util/data.h index
> > ae510ce..892b3d5 100644
> > --- a/tools/perf/util/data.h
> > +++ b/tools/perf/util/data.h
> > @@ -10,6 +10,7 @@ enum perf_data_mode {
> >
> >  struct perf_data_file {
> >  	const char		*path;
> > +	char			*tmp_path;
> 
> could we add is_tmp instead of new path pointer and keep the path for the
> name..?

The 'path' is const char. I think it's not good for tmp file which
generate the file name in real time.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-23 16:19     ` Liang, Kan
@ 2017-10-23 16:26       ` Jiri Olsa
  2017-10-23 18:05         ` Liang, Kan
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Olsa @ 2017-10-23 16:26 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

On Mon, Oct 23, 2017 at 04:19:19PM +0000, Liang, Kan wrote:

SNIP

> > >  ssize_t perf_data_file__write(struct perf_data_file *file, diff --git
> > > a/tools/perf/util/data.h b/tools/perf/util/data.h index
> > > ae510ce..892b3d5 100644
> > > --- a/tools/perf/util/data.h
> > > +++ b/tools/perf/util/data.h
> > > @@ -10,6 +10,7 @@ enum perf_data_mode {
> > >
> > >  struct perf_data_file {
> > >  	const char		*path;
> > > +	char			*tmp_path;
> > 
> > could we add is_tmp instead of new path pointer and keep the path for the
> > name..?
> 
> The 'path' is const char. I think it's not good for tmp file which
> generate the file name in real time.

then change path to 'char *' ? I just dont think having
2 name pointers for path will keep this simple

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-23 16:26       ` Jiri Olsa
@ 2017-10-23 18:05         ` Liang, Kan
  2017-10-24  7:26           ` Jiri Olsa
  0 siblings, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2017-10-23 18:05 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

> SNIP
> 
> > > >  ssize_t perf_data_file__write(struct perf_data_file *file, diff
> > > > --git a/tools/perf/util/data.h b/tools/perf/util/data.h index
> > > > ae510ce..892b3d5 100644
> > > > --- a/tools/perf/util/data.h
> > > > +++ b/tools/perf/util/data.h
> > > > @@ -10,6 +10,7 @@ enum perf_data_mode {
> > > >
> > > >  struct perf_data_file {
> > > >  	const char		*path;
> > > > +	char			*tmp_path;
> > >
> > > could we add is_tmp instead of new path pointer and keep the path
> > > for the name..?
> >
> > The 'path' is const char. I think it's not good for tmp file which
> > generate the file name in real time.
> 
> then change path to 'char *' ? I just dont think having
> 2 name pointers for path will keep this simple
>

I tried, but it will impact almost all the perf tools.
The input/output file name is const char*, which means that the file
name should not be change.
I just don't think it's right to cast it to char *.

How about introducing a new dedicated struct for tmp files only?
struct perf_data_tmp_file {
	chat *tmp_patch
	int fd
}

And three interfaces to open, write and close.
perf_data_file__open_tmp()
perf_data_file__write_tmp()
perf_data_file__close_tmp()

Thanks,
Kan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-23 14:25     ` acme
@ 2017-10-23 18:45       ` Liang, Kan
  0 siblings, 0 replies; 27+ messages in thread
From: Liang, Kan @ 2017-10-23 18:45 UTC (permalink / raw)
  To: acme
  Cc: Ingo Molnar, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak

> Em Mon, Oct 23, 2017 at 01:43:39PM +0000, Liang, Kan escreveu:
> > The plan is to do the multithreading step by step from the simplest case.
> > Synthesizing stage is just a start.
> 
> > Only for synthesizing stage, I think the patch series should already
> > cover all the 'synthesizing' steps which can do multithreading. For
> > the rest 'synthesizing' steps, it only need to be done by single thread.
> 
> > Since there is only multithreading for 'synthesizing' step, the
> > threads creation related code is event.c for now. It's better to move
> > it to a dedicate file and make it generic for recording threads. I think we can
> do it later separately.
> 
> Yes, and I'm happy that you plan to continue working in this area, right? :-)
> 

After reviewing my schedule, I'm afraid that I will not have the bandwidth
in this quarter to continue to support full multithreading.
I still have some other high priority items not finished yet.
I may revisit it after that. I'm sorry about that.

Thanks,
Kan 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-23 18:05         ` Liang, Kan
@ 2017-10-24  7:26           ` Jiri Olsa
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Olsa @ 2017-10-24  7:26 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

On Mon, Oct 23, 2017 at 06:05:12PM +0000, Liang, Kan wrote:
> > SNIP
> > 
> > > > >  ssize_t perf_data_file__write(struct perf_data_file *file, diff
> > > > > --git a/tools/perf/util/data.h b/tools/perf/util/data.h index
> > > > > ae510ce..892b3d5 100644
> > > > > --- a/tools/perf/util/data.h
> > > > > +++ b/tools/perf/util/data.h
> > > > > @@ -10,6 +10,7 @@ enum perf_data_mode {
> > > > >
> > > > >  struct perf_data_file {
> > > > >  	const char		*path;
> > > > > +	char			*tmp_path;
> > > >
> > > > could we add is_tmp instead of new path pointer and keep the path
> > > > for the name..?
> > >
> > > The 'path' is const char. I think it's not good for tmp file which
> > > generate the file name in real time.
> > 
> > then change path to 'char *' ? I just dont think having
> > 2 name pointers for path will keep this simple
> >
> 
> I tried, but it will impact almost all the perf tools.
> The input/output file name is const char*, which means that the file
> name should not be change.
> I just don't think it's right to cast it to char *.

ok, then the original approach does not look that bad now ;-)

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp
  2017-10-20 20:05 ` [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp kan.liang
  2017-10-23 16:13   ` Jiri Olsa
@ 2017-10-24  7:26   ` Jiri Olsa
  1 sibling, 0 replies; 27+ messages in thread
From: Jiri Olsa @ 2017-10-24  7:26 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak

On Fri, Oct 20, 2017 at 01:05:32PM -0700, kan.liang@intel.com wrote:
> From: Kan Liang <Kan.liang@intel.com>
> 
> And an interface for perf_data_file to open tmp file.
> Automatically generate the tmp file name if it's not assigned. The name
> cannot be const char. Introduce char *tmp_path for it.
> The tmp file will be deleted after close.
> 
> It will be used as per-thread file to temporarily store event synthesizd
> result in the following patch.
> 
> Signed-off-by: Kan Liang <Kan.liang@intel.com>
> ---
>  tools/perf/util/data.c | 26 ++++++++++++++++++++++++++
>  tools/perf/util/data.h |  2 ++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index 1123b30..cd6fdf9 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -139,9 +139,35 @@ int perf_data_file__open(struct perf_data_file *file)
>  	return open_file(file);
>  }
>  
> +int perf_data_file__open_tmp(struct perf_data_file *file)
> +{
> +	int fd;
> +
> +	if (!file->tmp_path &&
> +	    (asprintf(&file->tmp_path, "perf.tmp.XXXXXX") < 0))
> +		goto err;
> +
> +	fd = mkstemp(file->tmp_path);
> +	if (fd < 0) {
> +		free(file->tmp_path);
> +		goto err;
> +	}
> +
> +	file->fd = fd;
> +	return 0;
> +err:
> +	file->tmp_path = NULL;
> +	return -1;
> +}
> +
>  void perf_data_file__close(struct perf_data_file *file)
>  {
>  	close(file->fd);
> +	if (file->tmp_path) {
> +		unlink(file->tmp_path);
> +		free(file->tmp_path);
> +		file->tmp_path = NULL;

we have a zfree for that

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-23 13:43   ` Liang, Kan
  2017-10-23 14:25     ` acme
@ 2017-10-24  9:22     ` Ingo Molnar
  2017-10-24 11:47       ` Jiri Olsa
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2017-10-24  9:22 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak


* Liang, Kan <kan.liang@intel.com> wrote:

> For 'all', do you mean the whole process?

Yeah.

> I think that's the ultimate goal.  Eventually there will be per-CPU recording
> threads created at the beginning of perf record and go through the whole process.
> The plan is to do the multithreading step by step from the simplest case.
> Synthesizing stage is just a start.

So, why not do it like the kernel did: add all the threads, create the percpu 
files, and introduce a 'big perf lock' (big mutex) that is taken for all the 
current non-threaded perf functionality. This should be fairly straightforward to 
do and should be 'obviously correct'.

_Then_ start doing the hard threading work on top of this, like threading the 
synthesizing phase.

Doing the whole per CPU thread setup/teardown for just the synthesizing part of it 
looks like the wrong design.

I.e. what I'm suggesting is no extra threading work, just organizing it in a 
different fashion and increasing the life-time of the per CPU threads from 'perf 
startup' to 'perf shutdown'.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24  9:22     ` Ingo Molnar
@ 2017-10-24 11:47       ` Jiri Olsa
  2017-10-24 12:47         ` Liang, Kan
  2017-10-24 12:59         ` Ingo Molnar
  0 siblings, 2 replies; 27+ messages in thread
From: Jiri Olsa @ 2017-10-24 11:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Liang, Kan, acme, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak

On Tue, Oct 24, 2017 at 11:22:00AM +0200, Ingo Molnar wrote:
> 
> * Liang, Kan <kan.liang@intel.com> wrote:
> 
> > For 'all', do you mean the whole process?
> 
> Yeah.
> 
> > I think that's the ultimate goal.  Eventually there will be per-CPU recording
> > threads created at the beginning of perf record and go through the whole process.
> > The plan is to do the multithreading step by step from the simplest case.
> > Synthesizing stage is just a start.
> 
> So, why not do it like the kernel did: add all the threads, create the percpu 
> files, and introduce a 'big perf lock' (big mutex) that is taken for all the 
> current non-threaded perf functionality. This should be fairly straightforward to 
> do and should be 'obviously correct'.
> 
> _Then_ start doing the hard threading work on top of this, like threading the 
> synthesizing phase.
> 
> Doing the whole per CPU thread setup/teardown for just the synthesizing part of it 
> looks like the wrong design.
> 
> I.e. what I'm suggesting is no extra threading work, just organizing it in a 
> different fashion and increasing the life-time of the per CPU threads from 'perf 
> startup' to 'perf shutdown'.

I recently made some changes on threaded record, which are based
on Namhyungs time* API, which is needed to read/sort the data afterwards

but I wasn't able to get any substantial and constant reduce of LOST events
and then I got sidetracked and did not finish, but it's in here:

git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git perf/data

I'll try to rebase and send it out for comments

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 11:47       ` Jiri Olsa
@ 2017-10-24 12:47         ` Liang, Kan
  2017-10-24 12:59         ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Liang, Kan @ 2017-10-24 12:47 UTC (permalink / raw)
  To: Jiri Olsa, Ingo Molnar
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, Hunter, Adrian, ak

> On Tue, Oct 24, 2017 at 11:22:00AM +0200, Ingo Molnar wrote:
> >
> > * Liang, Kan <kan.liang@intel.com> wrote:
> >
> > > For 'all', do you mean the whole process?
> >
> > Yeah.
> >
> > > I think that's the ultimate goal.  Eventually there will be per-CPU
> > > recording threads created at the beginning of perf record and go through
> the whole process.
> > > The plan is to do the multithreading step by step from the simplest case.
> > > Synthesizing stage is just a start.
> >
> > So, why not do it like the kernel did: add all the threads, create the
> > percpu files, and introduce a 'big perf lock' (big mutex) that is
> > taken for all the current non-threaded perf functionality. This should
> > be fairly straightforward to do and should be 'obviously correct'.
> >
> > _Then_ start doing the hard threading work on top of this, like
> > threading the synthesizing phase.
> >
> > Doing the whole per CPU thread setup/teardown for just the
> > synthesizing part of it looks like the wrong design.
> >
> > I.e. what I'm suggesting is no extra threading work, just organizing
> > it in a different fashion and increasing the life-time of the per CPU
> > threads from 'perf startup' to 'perf shutdown'.
> 
> I recently made some changes on threaded record, which are based on
> Namhyungs time* API, which is needed to read/sort the data afterwards
> 
> but I wasn't able to get any substantial and constant reduce of LOST events
> and then I got sidetracked and did not finish, but it's in here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git perf/data
> 
> I'll try to rebase and send it out for comments
> 

I think I will wait for your patches, and rebase this series. :) 

Thanks,
Kan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 11:47       ` Jiri Olsa
  2017-10-24 12:47         ` Liang, Kan
@ 2017-10-24 12:59         ` Ingo Molnar
  2017-10-24 13:08           ` Arnaldo Carvalho de Melo
  2017-10-25  9:00           ` Jiri Olsa
  1 sibling, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2017-10-24 12:59 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Liang, Kan, acme, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak


* Jiri Olsa <jolsa@redhat.com> wrote:

> I recently made some changes on threaded record, which are based
> on Namhyungs time* API, which is needed to read/sort the data afterwards
> 
> but I wasn't able to get any substantial and constant reduce of LOST events
> and then I got sidetracked and did not finish, but it's in here:

So, in the context of system-wide profiling, the way that would work best I think 
is the following:

  thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
  thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
  thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2

etc.

Is this how you implemented it?

If the threads in the thread pool are just free-running then the scheduler might 
not migrate it to the 'right' CPU that is streaming the perf events and there will 
be a lot of cross-talking between CPUs.

Inherited events (default 'perf record') is tougher.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 12:59         ` Ingo Molnar
@ 2017-10-24 13:08           ` Arnaldo Carvalho de Melo
  2017-10-24 13:25             ` Ingo Molnar
  2017-10-25  9:00           ` Jiri Olsa
  1 sibling, 1 reply; 27+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-10-24 13:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Liang, Kan, mingo, linux-kernel, peterz, jolsa,
	wangnan0, hekuang, namhyung, alexander.shishkin, Hunter, Adrian,
	ak

Em Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar escreveu:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > I recently made some changes on threaded record, which are based
> > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > 
> > but I wasn't able to get any substantial and constant reduce of LOST events
> > and then I got sidetracked and did not finish, but it's in here:
> 
> So, in the context of system-wide profiling, the way that would work best I think 
> is the following:
> 
>   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
>   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
>   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2

Right, that is how I think it should be done as well, and those will
just dump on separate files, in a per session directory, with an extra
file for the session details, in what is now the header.

Later, the same thing happens at processing time, this time we'll have
contention to access global thread state, the need for rounds of
PERF_SAMPLE_TIME based ordering, like what we have now in the
tools/perf/util/ordered-events.[ch] code, etc.

This works for both 'report', 'script', 'top', 'trace', etc, as is
basically the model we already have. All the work that was done for
refcounting the thread, map, etc as well as locking those rbtrees would
finally be taken full advantage of.

- Arnaldo
 
> etc.
> 
> Is this how you implemented it?

> If the threads in the thread pool are just free-running then the scheduler might 
> not migrate it to the 'right' CPU that is streaming the perf events and there will 
> be a lot of cross-talking between CPUs.
> 
> Inherited events (default 'perf record') is tougher.
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 13:08           ` Arnaldo Carvalho de Melo
@ 2017-10-24 13:25             ` Ingo Molnar
  2017-10-25  2:35               ` Namhyung Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2017-10-24 13:25 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Liang, Kan, mingo, linux-kernel, peterz, jolsa,
	wangnan0, hekuang, namhyung, alexander.shishkin, Hunter, Adrian,
	ak


* Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> Em Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar escreveu:
> > 
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > I recently made some changes on threaded record, which are based
> > > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > > 
> > > but I wasn't able to get any substantial and constant reduce of LOST events
> > > and then I got sidetracked and did not finish, but it's in here:
> > 
> > So, in the context of system-wide profiling, the way that would work best I think 
> > is the following:
> > 
> >   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
> >   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
> >   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2
> 
> Right, that is how I think it should be done as well, and those will
> just dump on separate files, in a per session directory, with an extra
> file for the session details, in what is now the header.

Yes. Also note how easy to examine such a directory structure is - I'd suggest 
making all the session details textual eventually. I.e. only the ring-buffers 
should be binary, the rest should be arch-independent text encoding.

It's also very extensible.

> Later, the same thing happens at processing time, this time we'll have
> contention to access global thread state, the need for rounds of
> PERF_SAMPLE_TIME based ordering, like what we have now in the
> tools/perf/util/ordered-events.[ch] code, etc.
> 
> This works for both 'report', 'script', 'top', 'trace', etc, as is
> basically the model we already have. All the work that was done for
> refcounting the thread, map, etc as well as locking those rbtrees would
> finally be taken full advantage of.

Yeah, cool!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 13:25             ` Ingo Molnar
@ 2017-10-25  2:35               ` Namhyung Kim
  2017-10-25  9:02                 ` Jiri Olsa
  0 siblings, 1 reply; 27+ messages in thread
From: Namhyung Kim @ 2017-10-25  2:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Liang, Kan, mingo,
	linux-kernel, peterz, jolsa, wangnan0, hekuang,
	alexander.shishkin, Hunter, Adrian, ak, kernel-team

Hi Ingo,

On Tue, Oct 24, 2017 at 03:25:23PM +0200, Ingo Molnar wrote:
> 
> * Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> > Em Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar escreveu:
> > > 
> > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > > > I recently made some changes on threaded record, which are based
> > > > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > > > 
> > > > but I wasn't able to get any substantial and constant reduce of LOST events
> > > > and then I got sidetracked and did not finish, but it's in here:
> > > 
> > > So, in the context of system-wide profiling, the way that would work best I think 
> > > is the following:
> > > 
> > >   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
> > >   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
> > >   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2
> > 
> > Right, that is how I think it should be done as well, and those will
> > just dump on separate files, in a per session directory, with an extra
> > file for the session details, in what is now the header.
> 
> Yes. Also note how easy to examine such a directory structure is - I'd suggest 
> making all the session details textual eventually. I.e. only the ring-buffers 
> should be binary, the rest should be arch-independent text encoding.
> 
> It's also very extensible.

Agreed.

Also for multithread work, conversion to directory should be the first
step IMHO.

Thanks,
Namhyung


> 
> > Later, the same thing happens at processing time, this time we'll have
> > contention to access global thread state, the need for rounds of
> > PERF_SAMPLE_TIME based ordering, like what we have now in the
> > tools/perf/util/ordered-events.[ch] code, etc.
> > 
> > This works for both 'report', 'script', 'top', 'trace', etc, as is
> > basically the model we already have. All the work that was done for
> > refcounting the thread, map, etc as well as locking those rbtrees would
> > finally be taken full advantage of.
> 
> Yeah, cool!
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-24 12:59         ` Ingo Molnar
  2017-10-24 13:08           ` Arnaldo Carvalho de Melo
@ 2017-10-25  9:00           ` Jiri Olsa
  2017-10-25  9:07             ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Jiri Olsa @ 2017-10-25  9:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Liang, Kan, acme, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak

On Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > I recently made some changes on threaded record, which are based
> > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > 
> > but I wasn't able to get any substantial and constant reduce of LOST events
> > and then I got sidetracked and did not finish, but it's in here:
> 
> So, in the context of system-wide profiling, the way that would work best I think 
> is the following:
> 
>   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
>   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
>   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2
> 
> etc.
> 
> Is this how you implemented it?

in a way ;-) but I made it more generic and let record create just
few threads and let them share cpu subset.. and so there was no binding

> 
> If the threads in the thread pool are just free-running then the scheduler might 
> not migrate it to the 'right' CPU that is streaming the perf events and there will 
> be a lot of cross-talking between CPUs.

ok it's easy to add binding now and 1:1 thread:cpu mapping.. I'll retry

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-25  2:35               ` Namhyung Kim
@ 2017-10-25  9:02                 ` Jiri Olsa
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Olsa @ 2017-10-25  9:02 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Liang, Kan, mingo,
	linux-kernel, peterz, jolsa, wangnan0, hekuang,
	alexander.shishkin, Hunter, Adrian, ak, kernel-team

On Wed, Oct 25, 2017 at 11:35:39AM +0900, Namhyung Kim wrote:
> Hi Ingo,
> 
> On Tue, Oct 24, 2017 at 03:25:23PM +0200, Ingo Molnar wrote:
> > 
> > * Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > 
> > > Em Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar escreveu:
> > > > 
> > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > 
> > > > > I recently made some changes on threaded record, which are based
> > > > > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > > > > 
> > > > > but I wasn't able to get any substantial and constant reduce of LOST events
> > > > > and then I got sidetracked and did not finish, but it's in here:
> > > > 
> > > > So, in the context of system-wide profiling, the way that would work best I think 
> > > > is the following:
> > > > 
> > > >   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
> > > >   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
> > > >   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2
> > > 
> > > Right, that is how I think it should be done as well, and those will
> > > just dump on separate files, in a per session directory, with an extra
> > > file for the session details, in what is now the header.
> > 
> > Yes. Also note how easy to examine such a directory structure is - I'd suggest 
> > making all the session details textual eventually. I.e. only the ring-buffers 
> > should be binary, the rest should be arch-independent text encoding.
> > 
> > It's also very extensible.
> 
> Agreed.
> 
> Also for multithread work, conversion to directory should be the first
> step IMHO.

yes, but thats what we already have with yours (Namhyung's) HEADER_DATA_INDEX

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 0/6] event synthesization multithreading for perf record
  2017-10-25  9:00           ` Jiri Olsa
@ 2017-10-25  9:07             ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2017-10-25  9:07 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Liang, Kan, acme, mingo, linux-kernel, peterz, jolsa, wangnan0,
	hekuang, namhyung, alexander.shishkin, Hunter, Adrian, ak


* Jiri Olsa <jolsa@redhat.com> wrote:

> On Tue, Oct 24, 2017 at 02:59:44PM +0200, Ingo Molnar wrote:
> > 
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > I recently made some changes on threaded record, which are based
> > > on Namhyungs time* API, which is needed to read/sort the data afterwards
> > > 
> > > but I wasn't able to get any substantial and constant reduce of LOST events
> > > and then I got sidetracked and did not finish, but it's in here:
> > 
> > So, in the context of system-wide profiling, the way that would work best I think 
> > is the following:
> > 
> >   thread #0 binds itself to CPU#0 (via sched_setaffinity) and creates a per-CPU event on CPU#0
> >   thread #1 binds itself to CPU#1 (via sched_setaffinity) and creates a per-CPU event on CPU#1
> >   thread #2 binds itself to CPU#2 (via sched_setaffinity) and creates a per-CPU event on CPU#2
> > 
> > etc.
> > 
> > Is this how you implemented it?
> 
> in a way ;-) but I made it more generic and let record create just
> few threads and let them share cpu subset.. and so there was no binding
> 
> > 
> > If the threads in the thread pool are just free-running then the scheduler might 
> > not migrate it to the 'right' CPU that is streaming the perf events and there will 
> > be a lot of cross-talking between CPUs.
> 
> ok it's easy to add binding now and 1:1 thread:cpu mapping.. I'll retry

Please Cc: me - this is a really interesting aspect of perf scalability!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-10-25  9:07 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-20 20:05 [PATCH V3 0/6] event synthesization multithreading for perf record kan.liang
2017-10-20 20:05 ` [PATCH V3 1/6] perf tools: pass thread info to process function kan.liang
2017-10-20 20:05 ` [PATCH V3 2/6] perf tools: pass thread info in event synthesization kan.liang
2017-10-20 20:05 ` [PATCH V3 3/6] perf tools: expose copyfile_offset() kan.liang
2017-10-20 20:05 ` [PATCH V3 4/6] perf tools: add perf_data_file__open_tmp kan.liang
2017-10-23 16:13   ` Jiri Olsa
2017-10-23 16:19     ` Liang, Kan
2017-10-23 16:26       ` Jiri Olsa
2017-10-23 18:05         ` Liang, Kan
2017-10-24  7:26           ` Jiri Olsa
2017-10-24  7:26   ` Jiri Olsa
2017-10-20 20:05 ` [PATCH V3 5/6] perf record: synthesize event multithreading support kan.liang
2017-10-20 20:05 ` [PATCH V3 6/6] perf record: add option to set the number of thread for event synthesize kan.liang
2017-10-23 11:48 ` [PATCH V3 0/6] event synthesization multithreading for perf record Ingo Molnar
2017-10-23 13:43   ` Liang, Kan
2017-10-23 14:25     ` acme
2017-10-23 18:45       ` Liang, Kan
2017-10-24  9:22     ` Ingo Molnar
2017-10-24 11:47       ` Jiri Olsa
2017-10-24 12:47         ` Liang, Kan
2017-10-24 12:59         ` Ingo Molnar
2017-10-24 13:08           ` Arnaldo Carvalho de Melo
2017-10-24 13:25             ` Ingo Molnar
2017-10-25  2:35               ` Namhyung Kim
2017-10-25  9:02                 ` Jiri Olsa
2017-10-25  9:00           ` Jiri Olsa
2017-10-25  9:07             ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.