* [PATCH RFC V4 0/6] perf top optimization
@ 2017-09-29 14:47 kan.liang
  2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

This patch series fixes a severe performance issue with perf top on
Knights Landing/Mill when monitoring a heavily loaded system: perf
top takes a few minutes to show the first result, which is
unacceptable.
With the patch series applied, the latency drops to a few seconds.

machine__synthesize_threads and perf_top__mmap_read account for most
of the perf top time (> 99%).
Patches 1-4 optimize machine__synthesize_threads.
Patches 5-6 optimize perf_top__mmap_read.

Optimization for machine__synthesize_threads
  - Multithread the whole process. A minimal sketch of the
    partitioning scheme follows this list.
  - The number of threads is set to the number of online CPUs by
    default. The user can change it through a new option.
  - Introduce a hashtable for machine threads to reduce lock
    contention.
  - The optimization can also benefit other platforms and other perf
    tools, like perf record. This patch series doesn't optimize the
    other tools; that can be done separately later.
  - With this optimization applied, there is a 1.56x speedup on
    Knights Mill under a heavy workload.
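
The sketch below is condensed and simplified from patch 3: the sorted
/proc entries are split evenly across worker threads, with the
remainder spread over the first workers. The helper names are made up
for the sketch, and the event synthesis in the worker body is elided.

#include <dirent.h>
#include <pthread.h>

struct worker_arg {
	struct dirent **dirent;	/* shared, sorted /proc listing */
	int start;		/* first entry owned by this worker */
	int num;		/* number of entries owned by this worker */
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	int i;

	for (i = arg->start; i < arg->start + arg->num; i++)
		;	/* synthesize COMM/MMAP/FORK events for dirent[i] */
	return NULL;
}

static int synthesize_all(struct dirent **dirent, int n, int thread_nr)
{
	pthread_t tid[thread_nr];
	struct worker_arg args[thread_nr];
	int per = n / thread_nr, rem = n % thread_nr;
	int i, pos = 0;

	for (i = 0; i < thread_nr; i++) {
		args[i].dirent = dirent;
		args[i].start = pos;
		args[i].num = per + (i < rem);	/* spread the remainder */
		pos += args[i].num;
		if (pthread_create(&tid[i], NULL, worker, &args[i]))
			break;	/* join whatever was already created */
	}
	while (i--)
		pthread_join(tid[i], NULL);
	return 0;
}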

Optimization for perf_top__mmap_read
  - Switch to backward overwrite mode. A condensed sketch of the
    resulting read cycle follows this list.
    In non-overwrite mode, perf top tries to read everything in the
    ring buffer and does not check whether the writer has overwritten
    data that is still being read. When lots of samples are delivered
    in a short time, the processing time can be very long.
    Given the real-time requirement of perf top, it should switch to
    backward overwrite mode.
  - With this optimization applied, there is an 8.98x speedup on
    Knights Mill under a heavy workload.
  - However, the latency of perf_top__mmap_read is still higher than
    the default perf top refresh time (2s) on Knights Mill under a
    heavy workload. A check is introduced to give some hints on
    reducing the overhead.
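
The read cycle after the switch looks roughly like this (condensed
from patch 5; the function and state names are the ones used there,
and sample parsing is elided):

static void perf_top__mmap_read(struct perf_top *top)
{
	union perf_event *event;
	int i;

	/* Pause the ring buffers so the kernel stops writing. */
	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);

	for (i = 0; i < top->evlist->nr_mmaps; i++) {
		struct perf_mmap *md = &top->evlist->backward_mmap[i];

		perf_mmap__read_catchup(md);
		while ((event = perf_mmap__read_backward(md)) != NULL)
			;	/* parse the sample and queue the event */
	}

	/* Declare the buffers consumed and resume writing. */
	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
}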

The source code is also available at
https://github.com/kliang2/perf.git perf_top_opt

Here are the perf top latency test results on Knights Mill and a
Skylake server.

The heavy workload is a Linux kernel compile:
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then, "sudo perf top"

The latency is the time between launching perf top and the first
profiling result being shown.

- Latency on Knights Mill (272 CPUs)

Original(s)     With patch(s)   Speedup
272.68          40.89           6.67x

- Latency on Skylake server (192 CPUs)

Original(s)     With patch(s)   Speedup
12.28           2.05            5.99x

Changes since V3:
 - Switch to backward overwrite mode (jirka)
   Backward mode avoids some problems found with forward mode, but
   performance drops compared with V3 on Knights Mill.
   Introduce a new patch to check the latency of perf_top__mmap_read.
 - Handle thread_nr == 1 specially (jirka)

Changes since V2:
 - Patches 1 and 2 of V2, for the hashtable and scandir, have been
   merged.
 - Patches 3, 4 and 7 of V2 are dropped, because the optimization code
   doesn't touch that code; the protection is not needed.
   (Arnaldo & jirka)
 - Use mutex wrappers for the multithread-only locks. (jirka)
 - Move struct synthesize_threads_arg to event.c (jirka)

Changes since V1:
 - Patch 1: machine threads and hashtable related renaming (Arnaldo)
 - Patch 6: use a smaller locked section for comm_str__put
   add a locked wrapper for comm_str__findnew              (Arnaldo)

Kan Liang (6):
  perf tools: lock to protect namespaces and comm list
  perf tools: lock to protect comm_str rb tree
  perf top: implement multithreading for perf_event__synthesize_threads
  perf top: add option to set the number of thread for event synthesize
  perf top: switch to backward overwrite mode
  perf top: check the cost of perf_top__mmap_read

 tools/perf/Documentation/perf-top.txt |   3 +
 tools/perf/builtin-kvm.c              |   3 +-
 tools/perf/builtin-record.c           |   2 +-
 tools/perf/builtin-top.c              |  43 ++++++---
 tools/perf/builtin-trace.c            |   2 +-
 tools/perf/tests/mmap-thread-lookup.c |   2 +-
 tools/perf/ui/browsers/hists.c        |  12 ++-
 tools/perf/util/comm.c                |  18 +++-
 tools/perf/util/event.c               | 163 +++++++++++++++++++++++++++-------
 tools/perf/util/event.h               |   3 +-
 tools/perf/util/machine.c             |   8 +-
 tools/perf/util/machine.h             |   9 +-
 tools/perf/util/thread.c              |  53 +++++++++--
 tools/perf/util/thread.h              |   3 +
 tools/perf/util/top.h                 |   1 +
 15 files changed, 262 insertions(+), 63 deletions(-)

-- 
2.5.5

* [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
@ 2017-09-29 14:47 ` kan.liang
  2017-10-02 18:05   ` Arnaldo Carvalho de Melo
  2017-10-03 16:44   ` [tip:perf/core] perf tools: Lock " tip-bot for Kan Liang
  2017-09-29 14:47 ` [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree kan.liang
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Add two locks to protect namespaces_list and comm_list.

The locks are only needed by multithreaded code, so use the mutex
wrappers provided by the perf tools.

Not all comm_list/namespaces_list accesses are protected, e.g.
thread__exec_comm. The multithreaded code for perf top event
synthesis does not touch them, so they don't need a lock.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/util/thread.c | 53 +++++++++++++++++++++++++++++++++++++++++++-----
 tools/perf/util/thread.h |  3 +++
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c09bdb5..bf73117 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -45,6 +45,8 @@ struct thread *thread__new(pid_t pid, pid_t tid)
 		thread->cpu = -1;
 		INIT_LIST_HEAD(&thread->namespaces_list);
 		INIT_LIST_HEAD(&thread->comm_list);
+		init_rwsem(&thread->namespaces_lock);
+		init_rwsem(&thread->comm_lock);
 
 		comm_str = malloc(32);
 		if (!comm_str)
@@ -83,18 +85,26 @@ void thread__delete(struct thread *thread)
 		map_groups__put(thread->mg);
 		thread->mg = NULL;
 	}
+	down_write(&thread->namespaces_lock);
 	list_for_each_entry_safe(namespaces, tmp_namespaces,
 				 &thread->namespaces_list, list) {
 		list_del(&namespaces->list);
 		namespaces__free(namespaces);
 	}
+	up_write(&thread->namespaces_lock);
+
+	down_write(&thread->comm_lock);
 	list_for_each_entry_safe(comm, tmp_comm, &thread->comm_list, list) {
 		list_del(&comm->list);
 		comm__free(comm);
 	}
+	up_write(&thread->comm_lock);
+
 	unwind__finish_access(thread);
 	nsinfo__zput(thread->nsinfo);
 
+	exit_rwsem(&thread->namespaces_lock);
+	exit_rwsem(&thread->comm_lock);
 	free(thread);
 }
 
@@ -125,8 +135,8 @@ struct namespaces *thread__namespaces(const struct thread *thread)
 	return list_first_entry(&thread->namespaces_list, struct namespaces, list);
 }
 
-int thread__set_namespaces(struct thread *thread, u64 timestamp,
-			   struct namespaces_event *event)
+static int __thread__set_namespaces(struct thread *thread, u64 timestamp,
+				    struct namespaces_event *event)
 {
 	struct namespaces *new, *curr = thread__namespaces(thread);
 
@@ -149,6 +159,17 @@ int thread__set_namespaces(struct thread *thread, u64 timestamp,
 	return 0;
 }
 
+int thread__set_namespaces(struct thread *thread, u64 timestamp,
+			   struct namespaces_event *event)
+{
+	int ret;
+
+	down_write(&thread->namespaces_lock);
+	ret = __thread__set_namespaces(thread, timestamp, event);
+	up_write(&thread->namespaces_lock);
+	return ret;
+}
+
 struct comm *thread__comm(const struct thread *thread)
 {
 	if (list_empty(&thread->comm_list))
@@ -170,8 +191,8 @@ struct comm *thread__exec_comm(const struct thread *thread)
 	return last;
 }
 
-int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
-		       bool exec)
+static int ____thread__set_comm(struct thread *thread, const char *str,
+				u64 timestamp, bool exec)
 {
 	struct comm *new, *curr = thread__comm(thread);
 
@@ -195,6 +216,17 @@ int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
 	return 0;
 }
 
+int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
+		       bool exec)
+{
+	int ret;
+
+	down_write(&thread->comm_lock);
+	ret = ____thread__set_comm(thread, str, timestamp, exec);
+	up_write(&thread->comm_lock);
+	return ret;
+}
+
 int thread__set_comm_from_proc(struct thread *thread)
 {
 	char path[64];
@@ -212,7 +244,7 @@ int thread__set_comm_from_proc(struct thread *thread)
 	return err;
 }
 
-const char *thread__comm_str(const struct thread *thread)
+static const char *__thread__comm_str(const struct thread *thread)
 {
 	const struct comm *comm = thread__comm(thread);
 
@@ -222,6 +254,17 @@ const char *thread__comm_str(const struct thread *thread)
 	return comm__str(comm);
 }
 
+const char *thread__comm_str(const struct thread *thread)
+{
+	const char *str;
+
+	down_read((struct rw_semaphore *)&thread->comm_lock);
+	str = __thread__comm_str(thread);
+	up_read((struct rw_semaphore *)&thread->comm_lock);
+
+	return str;
+}
+
 /* CHECKME: it should probably better return the max comm len from its comm list */
 int thread__comm_len(struct thread *thread)
 {
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index cb1a5dd..10555d6 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -9,6 +9,7 @@
 #include "symbol.h"
 #include <strlist.h>
 #include <intlist.h>
+#include "rwsem.h"
 
 struct thread_stack;
 struct unwind_libunwind_ops;
@@ -29,7 +30,9 @@ struct thread {
 	int			comm_len;
 	bool			dead; /* if set thread has exited */
 	struct list_head	namespaces_list;
+	struct rw_semaphore	namespaces_lock;
 	struct list_head	comm_list;
+	struct rw_semaphore	comm_lock;
 	u64			db_id;
 
 	void			*priv;
-- 
2.5.5

* [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
  2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
@ 2017-09-29 14:47 ` kan.liang
  2017-10-03 16:45   ` [tip:perf/core] perf tools: Lock " tip-bot for Kan Liang
  2017-09-29 14:47 ` [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads kan.liang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Add comm_str_lock to protect the comm_str rb tree.

The lock is only needed by multithreaded code, so use the mutex
wrappers provided by the perf tools.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/util/comm.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/comm.c b/tools/perf/util/comm.c
index 7bc981b..756a9c1 100644
--- a/tools/perf/util/comm.c
+++ b/tools/perf/util/comm.c
@@ -5,6 +5,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <linux/refcount.h>
+#include "rwsem.h"
 
 struct comm_str {
 	char *str;
@@ -14,6 +15,7 @@ struct comm_str {
 
 /* Should perhaps be moved to struct machine */
 static struct rb_root comm_str_root;
+static struct rw_semaphore comm_str_lock = {.lock = PTHREAD_RWLOCK_INITIALIZER,};
 
 static struct comm_str *comm_str__get(struct comm_str *cs)
 {
@@ -25,7 +27,9 @@ static struct comm_str *comm_str__get(struct comm_str *cs)
 static void comm_str__put(struct comm_str *cs)
 {
 	if (cs && refcount_dec_and_test(&cs->refcnt)) {
+		down_write(&comm_str_lock);
 		rb_erase(&cs->rb_node, &comm_str_root);
+		up_write(&comm_str_lock);
 		zfree(&cs->str);
 		free(cs);
 	}
@@ -50,7 +54,8 @@ static struct comm_str *comm_str__alloc(const char *str)
 	return cs;
 }
 
-static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+static
+struct comm_str *__comm_str__findnew(const char *str, struct rb_root *root)
 {
 	struct rb_node **p = &root->rb_node;
 	struct rb_node *parent = NULL;
@@ -81,6 +86,17 @@ static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
 	return new;
 }
 
+static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+{
+	struct comm_str *cs;
+
+	down_write(&comm_str_lock);
+	cs = __comm_str__findnew(str, root);
+	up_write(&comm_str_lock);
+
+	return cs;
+}
+
 struct comm *comm__new(const char *str, u64 timestamp, bool exec)
 {
 	struct comm *comm = zalloc(sizeof(*comm));
-- 
2.5.5

* [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
  2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
  2017-09-29 14:47 ` [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree kan.liang
@ 2017-09-29 14:47 ` kan.liang
  2017-10-03 16:45   ` [tip:perf/core] perf top: Implement " tip-bot for Kan Liang
  2017-09-29 14:47 ` [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize kan.liang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

The proc files, sorted in alphabetical order, are evenly assigned to
several synthesize threads and processed in parallel.

For perf top, the number of threads is hardcoded to the number of
online CPUs. The following patch will introduce an option to set it.
For other perf tools, the thread number is 1, because their process
functions are not ready for multithreading, e.g.
process_synthesized_event.
This patch series only supports multithreaded event synthesis for perf
top. For other tools, it can be done separately later.

With multithreading applied, the total processing time gets up to a
1.56x speedup on Knights Mill for perf top.

The processing time for a specific single event could increase because
of lock contention, so proc_map_timeout may need to be increased;
otherwise some proc maps will be truncated.
Based on my tests, increasing proc_map_timeout has a small impact on
the total processing time, which still gets a 1.49x speedup on Knights
Mill. The patch itself doesn't increase proc_map_timeout.

There is no need to implement multithreading for per-task monitoring
(perf_event__synthesize_thread_map); it doesn't have a performance
issue.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-kvm.c              |   3 +-
 tools/perf/builtin-record.c           |   2 +-
 tools/perf/builtin-top.c              |   8 +-
 tools/perf/builtin-trace.c            |   2 +-
 tools/perf/tests/mmap-thread-lookup.c |   2 +-
 tools/perf/util/event.c               | 160 +++++++++++++++++++++++++++-------
 tools/perf/util/event.h               |   3 +-
 tools/perf/util/machine.c             |   8 +-
 tools/perf/util/machine.h             |   9 +-
 9 files changed, 155 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index c747a1a..721f4f9 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1441,7 +1441,8 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
 	perf_session__set_id_hdr_size(kvm->session);
 	ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
 	machine__synthesize_threads(&kvm->session->machines.host, &kvm->opts.target,
-				    kvm->evlist->threads, false, kvm->opts.proc_map_timeout);
+				    kvm->evlist->threads, false,
+				    kvm->opts.proc_map_timeout, 1);
 	err = kvm_live_open_events(kvm);
 	if (err)
 		goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9b379f3..234fdf4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -863,7 +863,7 @@ static int record__synthesize(struct record *rec, bool tail)
 
 	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
 					    process_synthesized_event, opts->sample_address,
-					    opts->proc_map_timeout);
+					    opts->proc_map_timeout, 1);
 out:
 	return err;
 }
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index ee954bd..bc31b93 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,8 +958,14 @@ static int __cmd_top(struct perf_top *top)
 	if (perf_session__register_idle_thread(top->session) < 0)
 		goto out_delete;
 
+	perf_set_multithreaded();
+
 	machine__synthesize_threads(&top->session->machines.host, &opts->target,
-				    top->evlist->threads, false, opts->proc_map_timeout);
+				    top->evlist->threads, false,
+				    opts->proc_map_timeout,
+				    (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+
+	perf_set_singlethreaded();
 
 	if (perf_hpp_list.socket) {
 		ret = perf_env__read_cpu_topology_map(&perf_env);
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 967bd35..afef6fe 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1131,7 +1131,7 @@ static int trace__symbols_init(struct trace *trace, struct perf_evlist *evlist)
 
 	err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
 					    evlist->threads, trace__tool_process, false,
-					    trace->opts.proc_map_timeout);
+					    trace->opts.proc_map_timeout, 1);
 	if (err)
 		symbol__exit();
 
diff --git a/tools/perf/tests/mmap-thread-lookup.c b/tools/perf/tests/mmap-thread-lookup.c
index f94a419..2a0068a 100644
--- a/tools/perf/tests/mmap-thread-lookup.c
+++ b/tools/perf/tests/mmap-thread-lookup.c
@@ -131,7 +131,7 @@ static int synth_all(struct machine *machine)
 {
 	return perf_event__synthesize_threads(NULL,
 					      perf_event__process,
-					      machine, 0, 500);
+					      machine, 0, 500, 1);
 }
 
 static int synth_process(struct machine *machine)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 10366b8..0e678dd 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -678,23 +678,21 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 	return err;
 }
 
-int perf_event__synthesize_threads(struct perf_tool *tool,
-				   perf_event__handler_t process,
-				   struct machine *machine,
-				   bool mmap_data,
-				   unsigned int proc_map_timeout)
+static int __perf_event__synthesize_threads(struct perf_tool *tool,
+					    perf_event__handler_t process,
+					    struct machine *machine,
+					    bool mmap_data,
+					    unsigned int proc_map_timeout,
+					    struct dirent **dirent,
+					    int start,
+					    int num)
 {
 	union perf_event *comm_event, *mmap_event, *fork_event;
 	union perf_event *namespaces_event;
-	char proc_path[PATH_MAX];
-	struct dirent **dirent;
 	int err = -1;
 	char *end;
 	pid_t pid;
-	int n, i;
-
-	if (machine__is_default_guest(machine))
-		return 0;
+	int i;
 
 	comm_event = malloc(sizeof(comm_event->comm) + machine->id_hdr_size);
 	if (comm_event == NULL)
@@ -714,34 +712,25 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 	if (namespaces_event == NULL)
 		goto out_free_fork;
 
-	snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
-	n = scandir(proc_path, &dirent, 0, alphasort);
-
-	if (n < 0)
-		goto out_free_namespaces;
-
-	for (i = 0; i < n; i++) {
+	for (i = start; i < start + num; i++) {
 		if (!isdigit(dirent[i]->d_name[0]))
 			continue;
 
 		pid = (pid_t)strtol(dirent[i]->d_name, &end, 10);
 		/* only interested in proper numerical dirents */
-		if (!*end) {
-			/*
-			 * We may race with exiting thread, so don't stop just because
-			 * one thread couldn't be synthesized.
-			 */
-			__event__synthesize_thread(comm_event, mmap_event, fork_event,
-						   namespaces_event, pid, 1, process,
-						   tool, machine, mmap_data,
-						   proc_map_timeout);
-		}
-		free(dirent[i]);
+		if (*end)
+			continue;
+		/*
+		 * We may race with exiting thread, so don't stop just because
+		 * one thread couldn't be synthesized.
+		 */
+		__event__synthesize_thread(comm_event, mmap_event, fork_event,
+					   namespaces_event, pid, 1, process,
+					   tool, machine, mmap_data,
+					   proc_map_timeout);
 	}
-	free(dirent);
 	err = 0;
 
-out_free_namespaces:
 	free(namespaces_event);
 out_free_fork:
 	free(fork_event);
@@ -753,6 +742,115 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 	return err;
 }
 
+struct synthesize_threads_arg {
+	struct perf_tool *tool;
+	perf_event__handler_t process;
+	struct machine *machine;
+	bool mmap_data;
+	unsigned int proc_map_timeout;
+	struct dirent **dirent;
+	int num;
+	int start;
+};
+
+static void *synthesize_threads_worker(void *arg)
+{
+	struct synthesize_threads_arg *args = arg;
+
+	__perf_event__synthesize_threads(args->tool, args->process,
+					 args->machine, args->mmap_data,
+					 args->proc_map_timeout, args->dirent,
+					 args->start, args->num);
+	return NULL;
+}
+
+int perf_event__synthesize_threads(struct perf_tool *tool,
+				   perf_event__handler_t process,
+				   struct machine *machine,
+				   bool mmap_data,
+				   unsigned int proc_map_timeout,
+				   unsigned int nr_threads_synthesize)
+{
+	struct synthesize_threads_arg *args = NULL;
+	pthread_t *synthesize_threads = NULL;
+	char proc_path[PATH_MAX];
+	struct dirent **dirent;
+	int num_per_thread;
+	int m, n, i, j;
+	int thread_nr;
+	int base = 0;
+	int err = -1;
+
+
+	if (machine__is_default_guest(machine))
+		return 0;
+
+	snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
+	n = scandir(proc_path, &dirent, 0, alphasort);
+	if (n < 0)
+		return err;
+
+	thread_nr = nr_threads_synthesize;
+
+	if (thread_nr <= 1) {
+		err = __perf_event__synthesize_threads(tool, process,
+						       machine, mmap_data,
+						       proc_map_timeout,
+						       dirent, base, n);
+		goto free_dirent;
+	}
+	if (thread_nr > n)
+		thread_nr = n;
+
+	synthesize_threads = calloc(sizeof(pthread_t), thread_nr);
+	if (synthesize_threads == NULL)
+		goto free_dirent;
+
+	args = calloc(sizeof(*args), thread_nr);
+	if (args == NULL)
+		goto free_threads;
+
+	num_per_thread = n / thread_nr;
+	m = n % thread_nr;
+	for (i = 0; i < thread_nr; i++) {
+		args[i].tool = tool;
+		args[i].process = process;
+		args[i].machine = machine;
+		args[i].mmap_data = mmap_data;
+		args[i].proc_map_timeout = proc_map_timeout;
+		args[i].dirent = dirent;
+	}
+	for (i = 0; i < m; i++) {
+		args[i].num = num_per_thread + 1;
+		args[i].start = i * args[i].num;
+	}
+	if (i != 0)
+		base = args[i-1].start + args[i-1].num;
+	for (j = i; j < thread_nr; j++) {
+		args[j].num = num_per_thread;
+		args[j].start = base + (j - i) * args[i].num;
+	}
+
+	for (i = 0; i < thread_nr; i++) {
+		if (pthread_create(&synthesize_threads[i], NULL,
+				   synthesize_threads_worker, &args[i]))
+			goto out_join;
+	}
+	err = 0;
+out_join:
+	for (i = 0; i < thread_nr; i++)
+		pthread_join(synthesize_threads[i], NULL);
+	free(args);
+free_threads:
+	free(synthesize_threads);
+free_dirent:
+	for (i = 0; i < n; i++)
+		free(dirent[i]);
+	free(dirent);
+
+	return err;
+}
+
 struct process_symbol_args {
 	const char *name;
 	u64	   start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ee7bcc8..d6cbb0a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -680,7 +680,8 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
 int perf_event__synthesize_threads(struct perf_tool *tool,
 				   perf_event__handler_t process,
 				   struct machine *machine, bool mmap_data,
-				   unsigned int proc_map_timeout);
+				   unsigned int proc_map_timeout,
+				   unsigned int nr_threads_synthesize);
 int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 585b4a3..7c3aa47 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2218,12 +2218,16 @@ int machines__for_each_thread(struct machines *machines,
 int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
 				  struct target *target, struct thread_map *threads,
 				  perf_event__handler_t process, bool data_mmap,
-				  unsigned int proc_map_timeout)
+				  unsigned int proc_map_timeout,
+				  unsigned int nr_threads_synthesize)
 {
 	if (target__has_task(target))
 		return perf_event__synthesize_thread_map(tool, threads, process, machine, data_mmap, proc_map_timeout);
 	else if (target__has_cpu(target))
-		return perf_event__synthesize_threads(tool, process, machine, data_mmap, proc_map_timeout);
+		return perf_event__synthesize_threads(tool, process,
+						      machine, data_mmap,
+						      proc_map_timeout,
+						      nr_threads_synthesize);
 	/* command specified */
 	return 0;
 }
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b1cd516..c6a299e 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -257,15 +257,18 @@ int machines__for_each_thread(struct machines *machines,
 int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
 				  struct target *target, struct thread_map *threads,
 				  perf_event__handler_t process, bool data_mmap,
-				  unsigned int proc_map_timeout);
+				  unsigned int proc_map_timeout,
+				  unsigned int nr_threads_synthesize);
 static inline
 int machine__synthesize_threads(struct machine *machine, struct target *target,
 				struct thread_map *threads, bool data_mmap,
-				unsigned int proc_map_timeout)
+				unsigned int proc_map_timeout,
+				unsigned int nr_threads_synthesize)
 {
 	return __machine__synthesize_threads(machine, NULL, target, threads,
 					     perf_event__process, data_mmap,
-					     proc_map_timeout);
+					     proc_map_timeout,
+					     nr_threads_synthesize);
 }
 
 pid_t machine__get_current_tid(struct machine *machine, int cpu);
-- 
2.5.5

* [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
                   ` (2 preceding siblings ...)
  2017-09-29 14:47 ` [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads kan.liang
@ 2017-09-29 14:47 ` kan.liang
  2017-10-03 16:45   ` [tip:perf/core] perf top: Add " tip-bot for Kan Liang
  2017-09-29 14:47 ` [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode kan.liang
  2017-09-29 14:47 ` [PATCH RFC V4 6/6] perf top: check the cost of perf_top__mmap_read kan.liang
  5 siblings, 1 reply; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Use UINT_MAX to indicate the default number of threads, which is the
number of online CPUs.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/Documentation/perf-top.txt |  3 +++
 tools/perf/builtin-top.c              | 11 ++++++++---
 tools/perf/util/event.c               |  5 ++++-
 tools/perf/util/top.h                 |  1 +
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index d864ea6..c3cc9a3 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -240,6 +240,9 @@ Default is to monitor all CPUS.
 --force::
 	Don't do ownership validation.
 
+--num-thread-synthesize::
+	The number of threads to run event synthesize.
+	By default, the number of threads equals to the online CPU number.
 
 INTERACTIVE PROMPTING KEYS
 --------------------------
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index bc31b93..477a869 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,14 +958,16 @@ static int __cmd_top(struct perf_top *top)
 	if (perf_session__register_idle_thread(top->session) < 0)
 		goto out_delete;
 
-	perf_set_multithreaded();
+	if (top->nr_threads_synthesize > 1)
+		perf_set_multithreaded();
 
 	machine__synthesize_threads(&top->session->machines.host, &opts->target,
 				    top->evlist->threads, false,
 				    opts->proc_map_timeout,
-				    (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+				    top->nr_threads_synthesize);
 
-	perf_set_singlethreaded();
+	if (top->nr_threads_synthesize > 1)
+		perf_set_singlethreaded();
 
 	if (perf_hpp_list.socket) {
 		ret = perf_env__read_cpu_topology_map(&perf_env);
@@ -1118,6 +1120,7 @@ int cmd_top(int argc, const char **argv)
 		},
 		.max_stack	     = sysctl_perf_event_max_stack,
 		.sym_pcnt_filter     = 5,
+		.nr_threads_synthesize = UINT_MAX,
 	};
 	struct record_opts *opts = &top.record_opts;
 	struct target *target = &opts->target;
@@ -1227,6 +1230,8 @@ int cmd_top(int argc, const char **argv)
 	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
 		    "Show entries in a hierarchy"),
 	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
+	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
+			"number of thread to run event synthesize"),
 	OPT_END()
 	};
 	const char * const top_usage[] = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0e678dd..47eff47 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -790,7 +790,10 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 	if (n < 0)
 		return err;
 
-	thread_nr = nr_threads_synthesize;
+	if (nr_threads_synthesize == UINT_MAX)
+		thread_nr = sysconf(_SC_NPROCESSORS_ONLN);
+	else
+		thread_nr = nr_threads_synthesize;
 
 	if (thread_nr <= 1) {
 		err = __perf_event__synthesize_threads(tool, process,
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 9bdfb78..f4296e1 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -37,6 +37,7 @@ struct perf_top {
 	int		   sym_pcnt_filter;
 	const char	   *sym_filter;
 	float		   min_percent;
+	unsigned int	   nr_threads_synthesize;
 };
 
 #define CONSOLE_CLEAR "^[[H^[[2J"
-- 
2.5.5

* [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
                   ` (3 preceding siblings ...)
  2017-09-29 14:47 ` [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize kan.liang
@ 2017-09-29 14:47 ` kan.liang
  2017-09-30 19:48   ` Jiri Olsa
  2017-09-29 14:47 ` [PATCH RFC V4 6/6] perf top: check the cost of perf_top__mmap_read kan.liang
  5 siblings, 1 reply; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

perf_top__mmap_read has a severe performance issue on Knights
Landing/Mill when monitoring a heavily loaded system. It takes
several minutes to finish, which is unacceptable.

perf top used to be in overwrite mode, but was changed to
non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
overwrite mode"). In non-overwrite mode, it tries to read everything
in the ring buffer and does not check whether the writer has
overwritten data that is still being read. When lots of samples are
delivered in a short time, the processing time can be very long.
Knights Landing/Mill is a manycore processor with a large number of
small cores. Because of the huge core count, it generates lots of
samples on a heavily loaded system. With such a huge number of
samples, the mmap writer will probably bite the tail and mess up the
samples.

Also, to avoid the problems described in commit 9ecda41acb97
("perf/core: Add ::write_backward attribute to perf event"), switch
to backward overwrite mode.
Pause the ring buffer during perf_top__mmap_read to ensure it is
stable.
Some records may be lost in backward overwrite mode, so remove the
lost-events check.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-top.c       | 21 +++++++++------------
 tools/perf/ui/browsers/hists.c | 12 +++++++++---
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 477a869..03090d0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -276,16 +276,6 @@ static void perf_top__print_sym_table(struct perf_top *top)
 
 	printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
 
-	if (hists->stats.nr_lost_warned !=
-	    hists->stats.nr_events[PERF_RECORD_LOST]) {
-		hists->stats.nr_lost_warned =
-			      hists->stats.nr_events[PERF_RECORD_LOST];
-		color_fprintf(stdout, PERF_COLOR_RED,
-			      "WARNING: LOST %d chunks, Check IO/CPU overload",
-			      hists->stats.nr_lost_warned);
-		++printed;
-	}
-
 	if (top->sym_filter_entry) {
 		perf_top__show_details(top);
 		return;
@@ -802,6 +792,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
 
 static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 {
+	struct perf_mmap *md = &top->evlist->backward_mmap[idx];
 	struct perf_sample sample;
 	struct perf_evsel *evsel;
 	struct perf_session *session = top->session;
@@ -809,7 +800,9 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 	struct machine *machine;
 	int ret;
 
-	while ((event = perf_evlist__mmap_read(top->evlist, idx)) != NULL) {
+	perf_mmap__read_catchup(md);
+
+	while ((event = perf_mmap__read_backward(md)) != NULL) {
 		ret = perf_evlist__parse_sample(top->evlist, event, &sample);
 		if (ret) {
 			pr_err("Can't parse sample, err = %d\n", ret);
@@ -872,8 +865,11 @@ static void perf_top__mmap_read(struct perf_top *top)
 {
 	int i;
 
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);
 	for (i = 0; i < top->evlist->nr_mmaps; i++)
 		perf_top__mmap_read_idx(top, i);
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
 }
 
 static int perf_top__start_counters(struct perf_top *top)
@@ -902,7 +898,7 @@ static int perf_top__start_counters(struct perf_top *top)
 		}
 	}
 
-	if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
+	if (perf_evlist__mmap(evlist, opts->mmap_pages, opts->overwrite) < 0) {
 		ui__error("Failed to mmap with %d (%s)\n",
 			    errno, str_error_r(errno, msg, sizeof(msg)));
 		goto out_err;
@@ -1117,6 +1113,7 @@ int cmd_top(int argc, const char **argv)
 				.uses_mmap   = true,
 			},
 			.proc_map_timeout    = 500,
+			.overwrite	= 1,
 		},
 		.max_stack	     = sysctl_perf_event_max_stack,
 		.sym_pcnt_filter     = 5,
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 13dfb0a..7419c05 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -712,8 +712,13 @@ int hist_browser__run(struct hist_browser *browser, const char *help)
 			nr_entries = hist_browser__nr_entries(browser);
 			ui_browser__update_nr_entries(&browser->b, nr_entries);
 
-			if (browser->hists->stats.nr_lost_warned !=
-			    browser->hists->stats.nr_events[PERF_RECORD_LOST]) {
+			/*
+			 * Don't print lost events warning for perf top,
+			 * because it is backward overwrite mode.
+			 * Perf top is the only tool which has hbt timer.
+			 */
+			if ((browser->hists->stats.nr_lost_warned !=
+			    browser->hists->stats.nr_events[PERF_RECORD_LOST]) && !hbt) {
 				browser->hists->stats.nr_lost_warned =
 					browser->hists->stats.nr_events[PERF_RECORD_LOST];
 				ui_browser__warn_lost_events(&browser->b);
@@ -3358,7 +3363,8 @@ static int perf_evsel_menu__run(struct perf_evsel_menu *menu,
 		case K_TIMER:
 			hbt->timer(hbt->arg);
 
-			if (!menu->lost_events_warned && menu->lost_events) {
+			if (!menu->lost_events_warned && menu->lost_events &&
+			    !hbt) {
 				ui_browser__warn_lost_events(&menu->b);
 				menu->lost_events_warned = true;
 			}
-- 
2.5.5

* [PATCH RFC V4 6/6] perf top: check the cost of perf_top__mmap_read
  2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
                   ` (4 preceding siblings ...)
  2017-09-29 14:47 ` [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode kan.liang
@ 2017-09-29 14:47 ` kan.liang
  5 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2017-09-29 14:47 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, namhyung, adrian.hunter, lukasz.odzioba, wangnan0,
	hekuang, ast, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

The latency of perf_top__mmap_read should be lower than the refresh
time. If it is not, give some hints on reducing the overhead.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-top.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 03090d0..b5143c9 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -863,11 +863,20 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 
 static void perf_top__mmap_read(struct perf_top *top)
 {
+	unsigned long long s, e;
 	int i;
 
 	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);
+	s = rdclock();
 	for (i = 0; i < top->evlist->nr_mmaps; i++)
 		perf_top__mmap_read_idx(top, i);
+	e = rdclock();
+	if ((e - s) > (unsigned long long)(top->delay_secs * NSEC_PER_SEC)) {
+		ui__warning("Too slow to read ring buffer.\n"
+			    "Please try increasing the period (-c) or\n"
+			    "decreasing the freq (-F) or\n"
+			    "limiting the number of CPUs (-C)\n");
+	}
 	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
 	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
 }
-- 
2.5.5

* Re: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode
  2017-09-29 14:47 ` [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode kan.liang
@ 2017-09-30 19:48   ` Jiri Olsa
  2017-10-02 17:19     ` Liang, Kan
  0 siblings, 1 reply; 18+ messages in thread
From: Jiri Olsa @ 2017-09-30 19:48 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, peterz, mingo, linux-kernel, jolsa, namhyung,
	adrian.hunter, lukasz.odzioba, wangnan0, hekuang, ast, ak

On Fri, Sep 29, 2017 at 07:47:56AM -0700, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> perf_top__mmap_read has a severe performance issue on Knights
> Landing/Mill when monitoring a heavily loaded system. It takes
> several minutes to finish, which is unacceptable.
>
> perf top used to be in overwrite mode, but was changed to
> non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
> overwrite mode"). In non-overwrite mode, it tries to read everything
> in the ring buffer and does not check whether the writer has
> overwritten data that is still being read. When lots of samples are
> delivered in a short time, the processing time can be very long.
> Knights Landing/Mill is a manycore processor with a large number of
> small cores. Because of the huge core count, it generates lots of
> samples on a heavily loaded system. With such a huge number of
> samples, the mmap writer will probably bite the tail and mess up the
> samples.
>
> Also, to avoid the problems described in commit 9ecda41acb97
> ("perf/core: Add ::write_backward attribute to perf event"), switch
> to backward overwrite mode.
> Pause the ring buffer during perf_top__mmap_read to ensure it is
> stable.
> Some records may be lost in backward overwrite mode, so remove the
> lost-events check.

I'm getting perf top hogging the cpu completely with this change

jirka

* RE: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode
  2017-09-30 19:48   ` Jiri Olsa
@ 2017-10-02 17:19     ` Liang, Kan
  2017-10-02 18:03       ` acme
  2017-10-03  8:24       ` Jiri Olsa
  0 siblings, 2 replies; 18+ messages in thread
From: Liang, Kan @ 2017-10-02 17:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme, peterz, mingo, linux-kernel, jolsa, namhyung, Hunter,
	Adrian, Odzioba, Lukasz, wangnan0, hekuang, ast, ak



> On Fri, Sep 29, 2017 at 07:47:56AM -0700, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > perf_top__mmap_read has a severe performance issue on Knights
> > Landing/Mill when monitoring a heavily loaded system. It takes
> > several minutes to finish, which is unacceptable.
> >
> > perf top used to be in overwrite mode, but was changed to
> > non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
> > overwrite mode"). In non-overwrite mode, it tries to read everything
> > in the ring buffer and does not check whether the writer has
> > overwritten data that is still being read. When lots of samples are
> > delivered in a short time, the processing time can be very long.
> > Knights Landing/Mill is a manycore processor with a large number of
> > small cores. Because of the huge core count, it generates lots of
> > samples on a heavily loaded system. With such a huge number of
> > samples, the mmap writer will probably bite the tail and mess up the
> > samples.
> >
> > Also, to avoid the problems described in commit 9ecda41acb97
> > ("perf/core: Add ::write_backward attribute to perf event"), switch
> > to backward overwrite mode.
> > Pause the ring buffer during perf_top__mmap_read to ensure it is
> > stable.
> > Some records may be lost in backward overwrite mode, so remove the
> > lost-events check.
> 
> I'm getting perf top hogging the cpu completely with this change
> 

I think I found the root cause of the CPU hogging.
perf_mmap__read_catchup discards the md->prev from the previous
mmap_read. The current mmap_read doesn't know which data has already
been processed by the previous mmap_read, so it has to go through all
the valid data in the ring buffer, even though most of it has already
been processed.

Also, it looks like perf record has a similar issue.
The previous location is discarded as well in backward overwrite mode.
That is an issue when --overwrite and --switch-output are enabled:
the new output will always include the old data from the previous
output, which is wrong.

I think I will rewrite perf_mmap__read_backward and
perf_mmap__read_catchup to fix this issue in a separate thread. Those
functions should become common backward mmap_read functions for all
tools and tests.
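
A toy model of the fix idea (hypothetical code, not the actual perf
API; positions grow monotonically here just to keep the model
trivial):

#include <stdio.h>

struct rb_reader {
	unsigned long long prev;	/* position reached by the last pass */
};

/* One read pass: walk only the records written since the previous
 * pass, then remember the new head. Resetting 'prev' before every
 * pass (what perf_mmap__read_catchup effectively does today) would
 * make each pass reprocess everything still in the buffer. */
static void read_pass(struct rb_reader *r, unsigned long long head)
{
	unsigned long long pos;

	for (pos = r->prev; pos < head; pos++)
		printf("process record %llu\n", pos);
	r->prev = head;
}

int main(void)
{
	struct rb_reader r = { .prev = 0 };

	read_pass(&r, 5);	/* processes records 0..4 */
	read_pass(&r, 8);	/* processes only records 5..7 */
	return 0;
}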

BTW, are you OK with patches 1-4?
Those patches multithread machine__synthesize_threads, which is
independent of the overwrite mode.
I think they can be merged separately.

Thanks,
Kan

* Re: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode
  2017-10-02 17:19     ` Liang, Kan
@ 2017-10-02 18:03       ` acme
  2017-10-03  8:24       ` Jiri Olsa
  1 sibling, 0 replies; 18+ messages in thread
From: acme @ 2017-10-02 18:03 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Jiri Olsa, peterz, mingo, linux-kernel, jolsa, namhyung, Hunter,
	Adrian, Odzioba, Lukasz, wangnan0, hekuang, ast, ak

Em Mon, Oct 02, 2017 at 05:19:41PM +0000, Liang, Kan escreveu:
> 
> 
> > On Fri, Sep 29, 2017 at 07:47:56AM -0700, kan.liang@intel.com wrote:
> > > From: Kan Liang <kan.liang@intel.com>
> > >
> > > perf_top__mmap_read has a severe performance issue on Knights
> > > Landing/Mill when monitoring a heavily loaded system. It takes
> > > several minutes to finish, which is unacceptable.
> > >
> > > perf top used to be in overwrite mode, but was changed to
> > > non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
> > > overwrite mode"). In non-overwrite mode, it tries to read everything
> > > in the ring buffer and does not check whether the writer has
> > > overwritten data that is still being read. When lots of samples are
> > > delivered in a short time, the processing time can be very long.
> > > Knights Landing/Mill is a manycore processor with a large number
> > > of small cores. Because of the huge core count, it generates lots
> > > of samples on a heavily loaded system. With such a huge number of
> > > samples, the mmap writer will probably bite the tail and mess up
> > > the samples.
> > >
> > > Also, to avoid the problems described in commit 9ecda41acb97
> > > ("perf/core: Add ::write_backward attribute to perf event"), switch
> > > to backward overwrite mode.
> > > Pause the ring buffer during perf_top__mmap_read to ensure it is
> > > stable.
> > > Some records may be lost in backward overwrite mode, so remove the
> > > lost-events check.
> > 
> > I'm getting perf top hogging the cpu completely with this change
> > 
> 
> I think I found the root cause of the CPU hogging.
> perf_mmap__read_catchup discards the md->prev from the previous
> mmap_read. The current mmap_read doesn't know which data has already
> been processed by the previous mmap_read, so it has to go through all
> the valid data in the ring buffer, even though most of it has already
> been processed.
>
> Also, it looks like perf record has a similar issue.
> The previous location is discarded as well in backward overwrite mode.
> That is an issue when --overwrite and --switch-output are enabled:
> the new output will always include the old data from the previous
> output, which is wrong.
>
> I think I will rewrite perf_mmap__read_backward and
> perf_mmap__read_catchup to fix this issue in a separate thread. Those
> functions should become common backward mmap_read functions for all
> tools and tests.
>
> BTW, are you OK with patches 1-4?
> Those patches multithread machine__synthesize_threads, which is
> independent of the overwrite mode.
> I think they can be merged separately.

I'll try and process 1-4 while you work on the read_backward case.

- Arnaldo

* Re: [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list
  2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
@ 2017-10-02 18:05   ` Arnaldo Carvalho de Melo
  2017-10-02 19:14     ` Liang, Kan
  2017-10-03 16:44   ` [tip:perf/core] perf tools: Lock " tip-bot for Kan Liang
  1 sibling, 1 reply; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-10-02 18:05 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, linux-kernel, jolsa, namhyung, adrian.hunter,
	lukasz.odzioba, wangnan0, hekuang, ast, ak

Em Fri, Sep 29, 2017 at 07:47:52AM -0700, kan.liang@intel.com escreveu:
> From: Kan Liang <kan.liang@intel.com>
> 
> Add two locks to protect namespaces_list and comm_list.
> 
> The locks are only needed by multithreaded code, so use the mutex
> wrappers provided by the perf tools.
>
> Not all comm_list/namespaces_list accesses are protected, e.g.
> thread__exec_comm. The multithreaded code for perf top event
> synthesis does not touch them, so they don't need a lock.

They don't need a lock _now_. OK, I think we can proceed that way;
this cset should serve as a warning to people working on further
multithreading of perf.

- Arnaldo
 

* RE: [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list
  2017-10-02 18:05   ` Arnaldo Carvalho de Melo
@ 2017-10-02 19:14     ` Liang, Kan
  2017-10-02 19:34       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 18+ messages in thread
From: Liang, Kan @ 2017-10-02 19:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: peterz, mingo, linux-kernel, jolsa, namhyung, Hunter, Adrian,
	Odzioba, Lukasz, wangnan0, hekuang, ast, ak

> Em Fri, Sep 29, 2017 at 07:47:52AM -0700, kan.liang@intel.com escreveu:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > Add two locks to protect namespaces_list and comm_list.
> >
> > The locks are only needed by multithreaded code, so use the mutex
> > wrappers provided by the perf tools.
> >
> > Not all comm_list/namespaces_list accesses are protected, e.g.
> > thread__exec_comm. The multithreaded code for perf top event
> > synthesis does not touch them, so they don't need a lock.
> 
> They don't need a lock _now_. OK, I think we can proceed that way;
> this cset should serve as a warning to people working on further
> multithreading of perf.
> 

Do I need to add some comments in the code to give a warning?

Thanks,
Kan

* Re: [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list
  2017-10-02 19:14     ` Liang, Kan
@ 2017-10-02 19:34       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-10-02 19:34 UTC (permalink / raw)
  To: Liang, Kan
  Cc: peterz, mingo, linux-kernel, jolsa, namhyung, Hunter, Adrian,
	Odzioba, Lukasz, wangnan0, hekuang, ast, ak

Em Mon, Oct 02, 2017 at 07:14:10PM +0000, Liang, Kan escreveu:
> > Em Fri, Sep 29, 2017 at 07:47:52AM -0700, kan.liang@intel.com escreveu:
> > > From: Kan Liang <kan.liang@intel.com>
> > >
> > > Add two locks to protect namespaces_list and comm_list.
> > >
> > > The locks are only needed by multithreaded code, so use the mutex
> > > wrappers provided by the perf tools.
> > >
> > > Not all comm_list/namespaces_list accesses are protected, e.g.
> > > thread__exec_comm. The multithreaded code for perf top event
> > > synthesis does not touch them, so they don't need a lock.
> > 
> > They don't need a lock _now_. OK, I think we can proceed that way;
> > this cset should serve as a warning to people working on further
> > multithreading of perf.
> > 
> 
> Do I need to add some comments in the code to give a warning?

Maybe, but then one can use 'git blame' and see the comments on the
patch introducing the locking.

BTW, I've applied and tested it, and pushed it to acme/perf/core.

- Arnaldo

* Re: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode
  2017-10-02 17:19     ` Liang, Kan
  2017-10-02 18:03       ` acme
@ 2017-10-03  8:24       ` Jiri Olsa
  1 sibling, 0 replies; 18+ messages in thread
From: Jiri Olsa @ 2017-10-03  8:24 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, peterz, mingo, linux-kernel, jolsa, namhyung, Hunter,
	Adrian, Odzioba, Lukasz, wangnan0, hekuang, ast, ak

On Mon, Oct 02, 2017 at 05:19:41PM +0000, Liang, Kan wrote:
> 
> 
> > On Fri, Sep 29, 2017 at 07:47:56AM -0700, kan.liang@intel.com wrote:
> > > From: Kan Liang <kan.liang@intel.com>
> > >
> > > perf_top__mmap_read has a severe performance issue on Knights
> > > Landing/Mill when monitoring a heavily loaded system. It takes
> > > several minutes to finish, which is unacceptable.
> > >
> > > perf top used to be in overwrite mode, but was changed to
> > > non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
> > > overwrite mode"). In non-overwrite mode, it tries to read everything
> > > in the ring buffer and does not check whether the writer has
> > > overwritten data that is still being read. When lots of samples are
> > > delivered in a short time, the processing time can be very long.
> > > Knights Landing/Mill is a manycore processor with a large number
> > > of small cores. Because of the huge core count, it generates lots
> > > of samples on a heavily loaded system. With such a huge number of
> > > samples, the mmap writer will probably bite the tail and mess up
> > > the samples.
> > >
> > > Also, to avoid the problems described in commit 9ecda41acb97
> > > ("perf/core: Add ::write_backward attribute to perf event"), switch
> > > to backward overwrite mode.
> > > Pause the ring buffer during perf_top__mmap_read to ensure it is
> > > stable.
> > > Some records may be lost in backward overwrite mode, so remove the
> > > lost-events check.
> > 
> > I'm getting perf top hogging the CPU completely with this change
> > 
> 
> I think I've found the root cause of the CPU hogging.
> perf_mmap__read_catchup discards the md->prev from the previous mmap_read.
> The current mmap_read doesn't know which data has already been processed by
> the previous mmap_read, so it has to go through all the valid data in the ring
> buffer, even though most of it was already processed by the previous mmap_read.
> 
> Also, it looks like perf record has a similar issue.
> The previous location is discarded as well in backward overwrite mode.
> That will be an issue when --overwrite and --switch-output are enabled:
> the new output will always include the old data from the previous output,
> which is wrong.
> 
> I think I will rewrite perf_mmap__read_backward and perf_mmap__read_catchup
> to fix this issue in a separate patch series. Those functions should become the
> common backward mmap_read functions for all tools and tests.
> 
> BTW, are you OK with patches 1-4?
> Those patches multithread machine__synthesize_threads, which is
> unrelated to the overwrite mode.
> I think they can be merged separately.

yep, for patches 1-4:

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka
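
For reference, here is a minimal sketch of the bookkeeping Kan describes
above: remember where the previous pass started, so that the next pass over
the paused overwrite buffer stops there instead of re-walking all the valid
data. The structure and helper name are illustrative, not the interfaces
that eventually landed:

  #include <stdint.h>

  typedef uint64_t u64;

  struct mmap_sketch {
  	u64 mask;	/* buffer size - 1; size is a power of two */
  	u64 prev;	/* head observed when the previous pass started */
  	u64 start;	/* newest byte to process in this pass */
  	u64 end;	/* where this pass should stop */
  };

  /*
   * Begin a pass over a paused backward (overwrite) ring buffer.
   * head only moves down in backward mode, so [head, prev) holds the
   * records that arrived since the previous pass.  Clamp the window
   * to the buffer size in case the writer wrapped and overwrote prev.
   */
  static void read_init_sketch(struct mmap_sketch *m, u64 head)
  {
  	m->start = head;
  	m->end = m->prev;
  	if (m->end - m->start > m->mask + 1)
  		m->end = m->start + m->mask + 1;	/* buffer wrapped */
  	m->prev = head;		/* remember for the next pass */
  }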

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tip:perf/core] perf tools: Lock to protect namespaces and comm list
  2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
  2017-10-02 18:05   ` Arnaldo Carvalho de Melo
@ 2017-10-03 16:44   ` tip-bot for Kan Liang
  1 sibling, 0 replies; 18+ messages in thread
From: tip-bot for Kan Liang @ 2017-10-03 16:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kan.liang, linux-kernel, ak, namhyung, ast, acme, tglx, mingo,
	peterz, adrian.hunter, lukasz.odzioba, wangnan0, hekuang, jolsa,
	hpa

Commit-ID:  b32ee9e522f7ba26339856a047cfe9efc0be0ff3
Gitweb:     https://git.kernel.org/tip/b32ee9e522f7ba26339856a047cfe9efc0be0ff3
Author:     Kan Liang <kan.liang@intel.com>
AuthorDate: Fri, 29 Sep 2017 07:47:52 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 3 Oct 2017 09:27:27 -0300

perf tools: Lock to protect namespaces and comm list

Add two locks to protect namespaces_list and comm_list.

The locks are only needed for multithreaded code, so use the mutex
wrappers provided by the perf tool.

Not all comm_list/namespaces_list accesses are protected, e.g.
thread__exec_comm, because the multithreaded code for perf top event
synthesis does not touch them. They don't need a lock.
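
For context, the mutex wrappers referred to here live in
tools/perf/util/rwsem.[ch]. Roughly (a from-memory sketch, not the
verbatim upstream file), they wrap a pthread rwlock and become no-ops
while perf runs single threaded:

  #include <pthread.h>
  #include <stdbool.h>

  extern bool perf_singlethreaded; /* see perf_set_{single,multi}threaded() */

  struct rw_semaphore {
  	pthread_rwlock_t lock;
  };

  int init_rwsem(struct rw_semaphore *sem)
  {
  	return pthread_rwlock_init(&sem->lock, NULL);
  }

  int down_read(struct rw_semaphore *sem)
  {
  	return perf_singlethreaded ? 0 : pthread_rwlock_rdlock(&sem->lock);
  }

  int up_read(struct rw_semaphore *sem)
  {
  	return perf_singlethreaded ? 0 : pthread_rwlock_unlock(&sem->lock);
  }

  /* down_write()/up_write() mirror these with pthread_rwlock_wrlock(),
   * and exit_rwsem() calls pthread_rwlock_destroy(). */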

Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1506696477-146932-2-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/thread.c | 53 +++++++++++++++++++++++++++++++++++++++++++-----
 tools/perf/util/thread.h |  3 +++
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c09bdb5..bf73117 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -45,6 +45,8 @@ struct thread *thread__new(pid_t pid, pid_t tid)
 		thread->cpu = -1;
 		INIT_LIST_HEAD(&thread->namespaces_list);
 		INIT_LIST_HEAD(&thread->comm_list);
+		init_rwsem(&thread->namespaces_lock);
+		init_rwsem(&thread->comm_lock);
 
 		comm_str = malloc(32);
 		if (!comm_str)
@@ -83,18 +85,26 @@ void thread__delete(struct thread *thread)
 		map_groups__put(thread->mg);
 		thread->mg = NULL;
 	}
+	down_write(&thread->namespaces_lock);
 	list_for_each_entry_safe(namespaces, tmp_namespaces,
 				 &thread->namespaces_list, list) {
 		list_del(&namespaces->list);
 		namespaces__free(namespaces);
 	}
+	up_write(&thread->namespaces_lock);
+
+	down_write(&thread->comm_lock);
 	list_for_each_entry_safe(comm, tmp_comm, &thread->comm_list, list) {
 		list_del(&comm->list);
 		comm__free(comm);
 	}
+	up_write(&thread->comm_lock);
+
 	unwind__finish_access(thread);
 	nsinfo__zput(thread->nsinfo);
 
+	exit_rwsem(&thread->namespaces_lock);
+	exit_rwsem(&thread->comm_lock);
 	free(thread);
 }
 
@@ -125,8 +135,8 @@ struct namespaces *thread__namespaces(const struct thread *thread)
 	return list_first_entry(&thread->namespaces_list, struct namespaces, list);
 }
 
-int thread__set_namespaces(struct thread *thread, u64 timestamp,
-			   struct namespaces_event *event)
+static int __thread__set_namespaces(struct thread *thread, u64 timestamp,
+				    struct namespaces_event *event)
 {
 	struct namespaces *new, *curr = thread__namespaces(thread);
 
@@ -149,6 +159,17 @@ int thread__set_namespaces(struct thread *thread, u64 timestamp,
 	return 0;
 }
 
+int thread__set_namespaces(struct thread *thread, u64 timestamp,
+			   struct namespaces_event *event)
+{
+	int ret;
+
+	down_write(&thread->namespaces_lock);
+	ret = __thread__set_namespaces(thread, timestamp, event);
+	up_write(&thread->namespaces_lock);
+	return ret;
+}
+
 struct comm *thread__comm(const struct thread *thread)
 {
 	if (list_empty(&thread->comm_list))
@@ -170,8 +191,8 @@ struct comm *thread__exec_comm(const struct thread *thread)
 	return last;
 }
 
-int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
-		       bool exec)
+static int ____thread__set_comm(struct thread *thread, const char *str,
+				u64 timestamp, bool exec)
 {
 	struct comm *new, *curr = thread__comm(thread);
 
@@ -195,6 +216,17 @@ int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
 	return 0;
 }
 
+int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
+		       bool exec)
+{
+	int ret;
+
+	down_write(&thread->comm_lock);
+	ret = ____thread__set_comm(thread, str, timestamp, exec);
+	up_write(&thread->comm_lock);
+	return ret;
+}
+
 int thread__set_comm_from_proc(struct thread *thread)
 {
 	char path[64];
@@ -212,7 +244,7 @@ int thread__set_comm_from_proc(struct thread *thread)
 	return err;
 }
 
-const char *thread__comm_str(const struct thread *thread)
+static const char *__thread__comm_str(const struct thread *thread)
 {
 	const struct comm *comm = thread__comm(thread);
 
@@ -222,6 +254,17 @@ const char *thread__comm_str(const struct thread *thread)
 	return comm__str(comm);
 }
 
+const char *thread__comm_str(const struct thread *thread)
+{
+	const char *str;
+
+	down_read((struct rw_semaphore *)&thread->comm_lock);
+	str = __thread__comm_str(thread);
+	up_read((struct rw_semaphore *)&thread->comm_lock);
+
+	return str;
+}
+
 /* CHECKME: it should probably better return the max comm len from its comm list */
 int thread__comm_len(struct thread *thread)
 {
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index cb1a5dd..10555d6 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -9,6 +9,7 @@
 #include "symbol.h"
 #include <strlist.h>
 #include <intlist.h>
+#include "rwsem.h"
 
 struct thread_stack;
 struct unwind_libunwind_ops;
@@ -29,7 +30,9 @@ struct thread {
 	int			comm_len;
 	bool			dead; /* if set thread has exited */
 	struct list_head	namespaces_list;
+	struct rw_semaphore	namespaces_lock;
 	struct list_head	comm_list;
+	struct rw_semaphore	comm_lock;
 	u64			db_id;
 
 	void			*priv;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip:perf/core] perf tools: Lock to protect comm_str rb tree
  2017-09-29 14:47 ` [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree kan.liang
@ 2017-10-03 16:45   ` tip-bot for Kan Liang
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Kan Liang @ 2017-10-03 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, ak, lukasz.odzioba, acme, wangnan0, ast, kan.liang,
	linux-kernel, adrian.hunter, tglx, hekuang, jolsa, hpa, peterz,
	namhyung

Commit-ID:  f988e71bc6220d8b404dbd43c0e0962e30305795
Gitweb:     https://git.kernel.org/tip/f988e71bc6220d8b404dbd43c0e0962e30305795
Author:     Kan Liang <kan.liang@intel.com>
AuthorDate: Fri, 29 Sep 2017 07:47:53 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 3 Oct 2017 09:27:36 -0300

perf tools: Lock to protect comm_str rb tree

Add comm_str_lock to protect the comm_str rb tree.

The lock is only needed for multithreaded code, so use the mutex
wrappers provided by the perf tool.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1506696477-146932-3-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/comm.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/comm.c b/tools/perf/util/comm.c
index 7bc981b..756a9c1 100644
--- a/tools/perf/util/comm.c
+++ b/tools/perf/util/comm.c
@@ -5,6 +5,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <linux/refcount.h>
+#include "rwsem.h"
 
 struct comm_str {
 	char *str;
@@ -14,6 +15,7 @@ struct comm_str {
 
 /* Should perhaps be moved to struct machine */
 static struct rb_root comm_str_root;
+static struct rw_semaphore comm_str_lock = {.lock = PTHREAD_RWLOCK_INITIALIZER,};
 
 static struct comm_str *comm_str__get(struct comm_str *cs)
 {
@@ -25,7 +27,9 @@ static struct comm_str *comm_str__get(struct comm_str *cs)
 static void comm_str__put(struct comm_str *cs)
 {
 	if (cs && refcount_dec_and_test(&cs->refcnt)) {
+		down_write(&comm_str_lock);
 		rb_erase(&cs->rb_node, &comm_str_root);
+		up_write(&comm_str_lock);
 		zfree(&cs->str);
 		free(cs);
 	}
@@ -50,7 +54,8 @@ static struct comm_str *comm_str__alloc(const char *str)
 	return cs;
 }
 
-static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+static
+struct comm_str *__comm_str__findnew(const char *str, struct rb_root *root)
 {
 	struct rb_node **p = &root->rb_node;
 	struct rb_node *parent = NULL;
@@ -81,6 +86,17 @@ static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
 	return new;
 }
 
+static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+{
+	struct comm_str *cs;
+
+	down_write(&comm_str_lock);
+	cs = __comm_str__findnew(str, root);
+	up_write(&comm_str_lock);
+
+	return cs;
+}
+
 struct comm *comm__new(const char *str, u64 timestamp, bool exec)
 {
 	struct comm *comm = zalloc(sizeof(*comm));

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip:perf/core] perf top: Implement multithreading for perf_event__synthesize_threads
  2017-09-29 14:47 ` [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads kan.liang
@ 2017-10-03 16:45   ` tip-bot for Kan Liang
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Kan Liang @ 2017-10-03 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hekuang, acme, kan.liang, tglx, lukasz.odzioba, hpa, jolsa,
	peterz, ast, namhyung, adrian.hunter, wangnan0, ak, linux-kernel,
	mingo

Commit-ID:  340b47f510bbe55a76b7309107276f02ea11f117
Gitweb:     https://git.kernel.org/tip/340b47f510bbe55a76b7309107276f02ea11f117
Author:     Kan Liang <kan.liang@intel.com>
AuthorDate: Fri, 29 Sep 2017 07:47:54 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 3 Oct 2017 09:27:46 -0300

perf top: Implement multithreading for perf_event__synthesize_threads

The proc files, which are sorted in alphabetical order, are evenly
assigned to several synthesizing threads to be processed in parallel.

For 'perf top', the thread number is hardcoded to the number of online
CPUs. The following patch will introduce an option to set it.

For other perf tools, the thread number is 1, because their process
functions are not ready for multithreading, e.g.
process_synthesized_event.

This patch series only supports multithreaded event synthesis for 'perf
top'. For other tools, it can be done separately later.

With multithreading applied, the total processing time gets up to a
1.56x speedup on Knights Mill for 'perf top'.

For the processing of a specific single event, the time could increase
because of lock contention, so proc_map_timeout may need to be
increased. Otherwise some proc maps will be truncated.

Based on my tests, increasing the proc_map_timeout has a small impact
on the total processing time: it still gets a 1.49x speedup on Knights
Mill after increasing the proc_map_timeout. The patch itself doesn't
increase the proc_map_timeout.

There is no need to implement multithreading for per-task monitoring
(perf_event__synthesize_thread_map), which doesn't have a performance
issue.

Committer testing:

  # getconf _NPROCESSORS_ONLN
  4
  # perf trace --no-inherit -e clone -o /tmp/output perf top
  # tail -4 /tmp/output
     0.124 ( 0.041 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3eb3a8f30, parent_tidptr: 0x7fc3eb3a99d0, child_tidptr: 0x7fc3eb3a99d0, tls: 0x7fc3eb3a9700) = 9548 (perf)
     0.246 ( 0.023 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3eaba7f30, parent_tidptr: 0x7fc3eaba89d0, child_tidptr: 0x7fc3eaba89d0, tls: 0x7fc3eaba8700) = 9549 (perf)
     0.286 ( 0.019 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3ea3a6f30, parent_tidptr: 0x7fc3ea3a79d0, child_tidptr: 0x7fc3ea3a79d0, tls: 0x7fc3ea3a7700) = 9550 (perf)
   246.540 ( 0.047 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3ea3a6f30, parent_tidptr: 0x7fc3ea3a79d0, child_tidptr: 0x7fc3ea3a79d0, tls: 0x7fc3ea3a7700) = 9551 (perf)
  #

Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1506696477-146932-4-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-kvm.c              |   3 +-
 tools/perf/builtin-record.c           |   2 +-
 tools/perf/builtin-top.c              |   8 +-
 tools/perf/builtin-trace.c            |   2 +-
 tools/perf/tests/mmap-thread-lookup.c |   2 +-
 tools/perf/util/event.c               | 160 +++++++++++++++++++++++++++-------
 tools/perf/util/event.h               |   3 +-
 tools/perf/util/machine.c             |   8 +-
 tools/perf/util/machine.h             |   9 +-
 9 files changed, 155 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index c747a1a..721f4f9 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1441,7 +1441,8 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
 	perf_session__set_id_hdr_size(kvm->session);
 	ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
 	machine__synthesize_threads(&kvm->session->machines.host, &kvm->opts.target,
-				    kvm->evlist->threads, false, kvm->opts.proc_map_timeout);
+				    kvm->evlist->threads, false,
+				    kvm->opts.proc_map_timeout, 1);
 	err = kvm_live_open_events(kvm);
 	if (err)
 		goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9b379f3..234fdf4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -863,7 +863,7 @@ static int record__synthesize(struct record *rec, bool tail)
 
 	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
 					    process_synthesized_event, opts->sample_address,
-					    opts->proc_map_timeout);
+					    opts->proc_map_timeout, 1);
 out:
 	return err;
 }
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index ee954bd..bc31b93 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,8 +958,14 @@ static int __cmd_top(struct perf_top *top)
 	if (perf_session__register_idle_thread(top->session) < 0)
 		goto out_delete;
 
+	perf_set_multithreaded();
+
 	machine__synthesize_threads(&top->session->machines.host, &opts->target,
-				    top->evlist->threads, false, opts->proc_map_timeout);
+				    top->evlist->threads, false,
+				    opts->proc_map_timeout,
+				    (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+
+	perf_set_singlethreaded();
 
 	if (perf_hpp_list.socket) {
 		ret = perf_env__read_cpu_topology_map(&perf_env);
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 967bd35..afef6fe 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1131,7 +1131,7 @@ static int trace__symbols_init(struct trace *trace, struct perf_evlist *evlist)
 
 	err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
 					    evlist->threads, trace__tool_process, false,
-					    trace->opts.proc_map_timeout);
+					    trace->opts.proc_map_timeout, 1);
 	if (err)
 		symbol__exit();
 
diff --git a/tools/perf/tests/mmap-thread-lookup.c b/tools/perf/tests/mmap-thread-lookup.c
index f94a419..2a0068a 100644
--- a/tools/perf/tests/mmap-thread-lookup.c
+++ b/tools/perf/tests/mmap-thread-lookup.c
@@ -131,7 +131,7 @@ static int synth_all(struct machine *machine)
 {
 	return perf_event__synthesize_threads(NULL,
 					      perf_event__process,
-					      machine, 0, 500);
+					      machine, 0, 500, 1);
 }
 
 static int synth_process(struct machine *machine)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 10366b8..0e678dd 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -678,23 +678,21 @@ out:
 	return err;
 }
 
-int perf_event__synthesize_threads(struct perf_tool *tool,
-				   perf_event__handler_t process,
-				   struct machine *machine,
-				   bool mmap_data,
-				   unsigned int proc_map_timeout)
+static int __perf_event__synthesize_threads(struct perf_tool *tool,
+					    perf_event__handler_t process,
+					    struct machine *machine,
+					    bool mmap_data,
+					    unsigned int proc_map_timeout,
+					    struct dirent **dirent,
+					    int start,
+					    int num)
 {
 	union perf_event *comm_event, *mmap_event, *fork_event;
 	union perf_event *namespaces_event;
-	char proc_path[PATH_MAX];
-	struct dirent **dirent;
 	int err = -1;
 	char *end;
 	pid_t pid;
-	int n, i;
-
-	if (machine__is_default_guest(machine))
-		return 0;
+	int i;
 
 	comm_event = malloc(sizeof(comm_event->comm) + machine->id_hdr_size);
 	if (comm_event == NULL)
@@ -714,34 +712,25 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 	if (namespaces_event == NULL)
 		goto out_free_fork;
 
-	snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
-	n = scandir(proc_path, &dirent, 0, alphasort);
-
-	if (n < 0)
-		goto out_free_namespaces;
-
-	for (i = 0; i < n; i++) {
+	for (i = start; i < start + num; i++) {
 		if (!isdigit(dirent[i]->d_name[0]))
 			continue;
 
 		pid = (pid_t)strtol(dirent[i]->d_name, &end, 10);
 		/* only interested in proper numerical dirents */
-		if (!*end) {
-			/*
-			 * We may race with exiting thread, so don't stop just because
-			 * one thread couldn't be synthesized.
-			 */
-			__event__synthesize_thread(comm_event, mmap_event, fork_event,
-						   namespaces_event, pid, 1, process,
-						   tool, machine, mmap_data,
-						   proc_map_timeout);
-		}
-		free(dirent[i]);
+		if (*end)
+			continue;
+		/*
+		 * We may race with exiting thread, so don't stop just because
+		 * one thread couldn't be synthesized.
+		 */
+		__event__synthesize_thread(comm_event, mmap_event, fork_event,
+					   namespaces_event, pid, 1, process,
+					   tool, machine, mmap_data,
+					   proc_map_timeout);
 	}
-	free(dirent);
 	err = 0;
 
-out_free_namespaces:
 	free(namespaces_event);
 out_free_fork:
 	free(fork_event);
@@ -753,6 +742,115 @@ out:
 	return err;
 }
 
+struct synthesize_threads_arg {
+	struct perf_tool *tool;
+	perf_event__handler_t process;
+	struct machine *machine;
+	bool mmap_data;
+	unsigned int proc_map_timeout;
+	struct dirent **dirent;
+	int num;
+	int start;
+};
+
+static void *synthesize_threads_worker(void *arg)
+{
+	struct synthesize_threads_arg *args = arg;
+
+	__perf_event__synthesize_threads(args->tool, args->process,
+					 args->machine, args->mmap_data,
+					 args->proc_map_timeout, args->dirent,
+					 args->start, args->num);
+	return NULL;
+}
+
+int perf_event__synthesize_threads(struct perf_tool *tool,
+				   perf_event__handler_t process,
+				   struct machine *machine,
+				   bool mmap_data,
+				   unsigned int proc_map_timeout,
+				   unsigned int nr_threads_synthesize)
+{
+	struct synthesize_threads_arg *args = NULL;
+	pthread_t *synthesize_threads = NULL;
+	char proc_path[PATH_MAX];
+	struct dirent **dirent;
+	int num_per_thread;
+	int m, n, i, j;
+	int thread_nr;
+	int base = 0;
+	int err = -1;
+
+
+	if (machine__is_default_guest(machine))
+		return 0;
+
+	snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
+	n = scandir(proc_path, &dirent, 0, alphasort);
+	if (n < 0)
+		return err;
+
+	thread_nr = nr_threads_synthesize;
+
+	if (thread_nr <= 1) {
+		err = __perf_event__synthesize_threads(tool, process,
+						       machine, mmap_data,
+						       proc_map_timeout,
+						       dirent, base, n);
+		goto free_dirent;
+	}
+	if (thread_nr > n)
+		thread_nr = n;
+
+	synthesize_threads = calloc(sizeof(pthread_t), thread_nr);
+	if (synthesize_threads == NULL)
+		goto free_dirent;
+
+	args = calloc(sizeof(*args), thread_nr);
+	if (args == NULL)
+		goto free_threads;
+
+	num_per_thread = n / thread_nr;
+	m = n % thread_nr;
+	for (i = 0; i < thread_nr; i++) {
+		args[i].tool = tool;
+		args[i].process = process;
+		args[i].machine = machine;
+		args[i].mmap_data = mmap_data;
+		args[i].proc_map_timeout = proc_map_timeout;
+		args[i].dirent = dirent;
+	}
+	for (i = 0; i < m; i++) {
+		args[i].num = num_per_thread + 1;
+		args[i].start = i * args[i].num;
+	}
+	if (i != 0)
+		base = args[i-1].start + args[i-1].num;
+	for (j = i; j < thread_nr; j++) {
+		args[j].num = num_per_thread;
+		args[j].start = base + (j - i) * args[i].num;
+	}
+
+	for (i = 0; i < thread_nr; i++) {
+		if (pthread_create(&synthesize_threads[i], NULL,
+				   synthesize_threads_worker, &args[i]))
+			goto out_join;
+	}
+	err = 0;
+out_join:
+	for (i = 0; i < thread_nr; i++)
+		pthread_join(synthesize_threads[i], NULL);
+	free(args);
+free_threads:
+	free(synthesize_threads);
+free_dirent:
+	for (i = 0; i < n; i++)
+		free(dirent[i]);
+	free(dirent);
+
+	return err;
+}
+
 struct process_symbol_args {
 	const char *name;
 	u64	   start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ee7bcc8..d6cbb0a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -680,7 +680,8 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
 int perf_event__synthesize_threads(struct perf_tool *tool,
 				   perf_event__handler_t process,
 				   struct machine *machine, bool mmap_data,
-				   unsigned int proc_map_timeout);
+				   unsigned int proc_map_timeout,
+				   unsigned int nr_threads_synthesize);
 int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 585b4a3..7c3aa47 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2218,12 +2218,16 @@ int machines__for_each_thread(struct machines *machines,
 int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
 				  struct target *target, struct thread_map *threads,
 				  perf_event__handler_t process, bool data_mmap,
-				  unsigned int proc_map_timeout)
+				  unsigned int proc_map_timeout,
+				  unsigned int nr_threads_synthesize)
 {
 	if (target__has_task(target))
 		return perf_event__synthesize_thread_map(tool, threads, process, machine, data_mmap, proc_map_timeout);
 	else if (target__has_cpu(target))
-		return perf_event__synthesize_threads(tool, process, machine, data_mmap, proc_map_timeout);
+		return perf_event__synthesize_threads(tool, process,
+						      machine, data_mmap,
+						      proc_map_timeout,
+						      nr_threads_synthesize);
 	/* command specified */
 	return 0;
 }
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b1cd516..c6a299e 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -257,15 +257,18 @@ int machines__for_each_thread(struct machines *machines,
 int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
 				  struct target *target, struct thread_map *threads,
 				  perf_event__handler_t process, bool data_mmap,
-				  unsigned int proc_map_timeout);
+				  unsigned int proc_map_timeout,
+				  unsigned int nr_threads_synthesize);
 static inline
 int machine__synthesize_threads(struct machine *machine, struct target *target,
 				struct thread_map *threads, bool data_mmap,
-				unsigned int proc_map_timeout)
+				unsigned int proc_map_timeout,
+				unsigned int nr_threads_synthesize)
 {
 	return __machine__synthesize_threads(machine, NULL, target, threads,
 					     perf_event__process, data_mmap,
-					     proc_map_timeout);
+					     proc_map_timeout,
+					     nr_threads_synthesize);
 }
 
 pid_t machine__get_current_tid(struct machine *machine, int cpu);
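
To make the split in perf_event__synthesize_threads() above concrete
(illustrative numbers, not taken from the patch): for n = 10 /proc
entries and thread_nr = 4, num_per_thread = 2 and m = 2, so the first
two workers get 3 entries each and the remaining two get 2, covering
the ranges [0,3), [3,6), [6,8) and [8,10).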

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip:perf/core] perf top: Add option to set the number of thread for event synthesize
  2017-09-29 14:47 ` [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize kan.liang
@ 2017-10-03 16:45   ` tip-bot for Kan Liang
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Kan Liang @ 2017-10-03 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: namhyung, tglx, peterz, hpa, ast, hekuang, mingo, lukasz.odzioba,
	jolsa, ak, wangnan0, linux-kernel, kan.liang, acme,
	adrian.hunter

Commit-ID:  0c6b499495e928777c41ca2de4fbb58788269690
Gitweb:     https://git.kernel.org/tip/0c6b499495e928777c41ca2de4fbb58788269690
Author:     Kan Liang <kan.liang@intel.com>
AuthorDate: Fri, 29 Sep 2017 07:47:55 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 3 Oct 2017 09:27:54 -0300

perf top: Add option to set the number of thread for event synthesize

Use UINT_MAX to indicate the default thread number, which is the number
of online CPUs.

Committer testing:

  # perf trace --no-inherit -e clone -o /tmp/output perf top --num-thread-synthesize 9
  # cat /tmp/output
         ? (     ?   ):  ... [continued]: clone()) = 26651 (perf)
     0.059 ( 0.010 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5bfac44f30, parent_tidptr: 0x7f5bfac459d0, child_tidptr: 0x7f5bfac459d0, tls: 0x7f5bfac45700) = 26652 (perf)
     0.116 ( 0.014 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5bfa443f30, parent_tidptr: 0x7f5bfa4449d0, child_tidptr: 0x7f5bfa4449d0, tls: 0x7f5bfa444700) = 26653 (perf)
     0.141 ( 0.009 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5bf9c42f30, parent_tidptr: 0x7f5bf9c439d0, child_tidptr: 0x7f5bf9c439d0, tls: 0x7f5bf9c43700) = 26654 (perf)
     0.160 ( 0.012 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5bf9441f30, parent_tidptr: 0x7f5bf94429d0, child_tidptr: 0x7f5bf94429d0, tls: 0x7f5bf9442700) = 26655 (perf)
     0.232 ( 0.013 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5bf8c40f30, parent_tidptr: 0x7f5bf8c419d0, child_tidptr: 0x7f5bf8c419d0, tls: 0x7f5bf8c41700) = 26656 (perf)
     0.393 ( 0.011 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5be3ffef30, parent_tidptr: 0x7f5be3fff9d0, child_tidptr: 0x7f5be3fff9d0, tls: 0x7f5be3fff700) = 26657 (perf)
     0.802 ( 0.012 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5be37fdf30, parent_tidptr: 0x7f5be37fe9d0, child_tidptr: 0x7f5be37fe9d0, tls: 0x7f5be37fe700) = 26658 (perf)
     1.411 ( 0.022 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5be2ffcf30, parent_tidptr: 0x7f5be2ffd9d0, child_tidptr: 0x7f5be2ffd9d0, tls: 0x7f5be2ffd700) = 26659 (perf)
   246.422 ( 0.042 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7f5be2ffcf30, parent_tidptr: 0x7f5be2ffd9d0, child_tidptr: 0x7f5be2ffd9d0, tls: 0x7f5be2ffd700) = 26660 (perf)
  #

Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1506696477-146932-5-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt |  3 +++
 tools/perf/builtin-top.c              | 11 ++++++++---
 tools/perf/util/event.c               |  5 ++++-
 tools/perf/util/top.h                 |  1 +
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index d864ea6..4353262 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -240,6 +240,9 @@ Default is to monitor all CPUS.
 --force::
 	Don't do ownership validation.
 
+--num-thread-synthesize::
+	The number of threads to run when synthesizing events for existing processes.
+	By default, the number of threads equals to the number of online CPUs.
 
 INTERACTIVE PROMPTING KEYS
 --------------------------
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index bc31b93..477a869 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,14 +958,16 @@ static int __cmd_top(struct perf_top *top)
 	if (perf_session__register_idle_thread(top->session) < 0)
 		goto out_delete;
 
-	perf_set_multithreaded();
+	if (top->nr_threads_synthesize > 1)
+		perf_set_multithreaded();
 
 	machine__synthesize_threads(&top->session->machines.host, &opts->target,
 				    top->evlist->threads, false,
 				    opts->proc_map_timeout,
-				    (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+				    top->nr_threads_synthesize);
 
-	perf_set_singlethreaded();
+	if (top->nr_threads_synthesize > 1)
+		perf_set_singlethreaded();
 
 	if (perf_hpp_list.socket) {
 		ret = perf_env__read_cpu_topology_map(&perf_env);
@@ -1118,6 +1120,7 @@ int cmd_top(int argc, const char **argv)
 		},
 		.max_stack	     = sysctl_perf_event_max_stack,
 		.sym_pcnt_filter     = 5,
+		.nr_threads_synthesize = UINT_MAX,
 	};
 	struct record_opts *opts = &top.record_opts;
 	struct target *target = &opts->target;
@@ -1227,6 +1230,8 @@ int cmd_top(int argc, const char **argv)
 	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
 		    "Show entries in a hierarchy"),
 	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
+	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
+			"number of thread to run event synthesize"),
 	OPT_END()
 	};
 	const char * const top_usage[] = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0e678dd..47eff47 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -790,7 +790,10 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 	if (n < 0)
 		return err;
 
-	thread_nr = nr_threads_synthesize;
+	if (nr_threads_synthesize == UINT_MAX)
+		thread_nr = sysconf(_SC_NPROCESSORS_ONLN);
+	else
+		thread_nr = nr_threads_synthesize;
 
 	if (thread_nr <= 1) {
 		err = __perf_event__synthesize_threads(tool, process,
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 9bdfb78..f4296e1 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -37,6 +37,7 @@ struct perf_top {
 	int		   sym_pcnt_filter;
 	const char	   *sym_filter;
 	float		   min_percent;
+	unsigned int	   nr_threads_synthesize;
 };
 
 #define CONSOLE_CLEAR "^[[H^[[2J"

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-10-03 16:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-29 14:47 [PATCH RFC V4 0/6] perf top optimization kan.liang
2017-09-29 14:47 ` [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list kan.liang
2017-10-02 18:05   ` Arnaldo Carvalho de Melo
2017-10-02 19:14     ` Liang, Kan
2017-10-02 19:34       ` Arnaldo Carvalho de Melo
2017-10-03 16:44   ` [tip:perf/core] perf tools: Lock " tip-bot for Kan Liang
2017-09-29 14:47 ` [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree kan.liang
2017-10-03 16:45   ` [tip:perf/core] perf tools: Lock " tip-bot for Kan Liang
2017-09-29 14:47 ` [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads kan.liang
2017-10-03 16:45   ` [tip:perf/core] perf top: Implement " tip-bot for Kan Liang
2017-09-29 14:47 ` [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize kan.liang
2017-10-03 16:45   ` [tip:perf/core] perf top: Add " tip-bot for Kan Liang
2017-09-29 14:47 ` [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode kan.liang
2017-09-30 19:48   ` Jiri Olsa
2017-10-02 17:19     ` Liang, Kan
2017-10-02 18:03       ` acme
2017-10-03  8:24       ` Jiri Olsa
2017-09-29 14:47 ` [PATCH RFC V4 6/6] perf top: check the cost of perf_top__mmap_read kan.liang
