linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/6] Thread memory improvements and fixes
@ 2024-02-14  6:37 Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

The next 6 patches from:
https://lore.kernel.org/lkml/20240202061532.1939474-1-irogers@google.com/
now that the initial maps fixes have landed:
https://lore.kernel.org/all/20240210031746.4057262-1-irogers@google.com/

Separate out and reimplement threads to use a hashmap for lower memory
consumption and faster lookup. This fixes a regression in memory usage
introduced when reference count checking switched to using
non-invasive tree nodes. Reduce the threads table's default size by 32
times and improve the locking discipline. Also, fix regressions where
tids had become unordered, to make `perf report --tasks` and `perf
trace --summary` output easier to read.
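The central change is replacing the per-bucket rbtrees with a hashmap
keyed by tid (patches 5 and 6, which also shrink the table from 256
buckets to 8). A toy sketch of the idea, with illustrative names
rather than perf's actual API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy chained hash table keyed by tid; a sketch of the approach, not
 * perf's implementation. 8 buckets, matching the final patch's table. */
#define TABLE_SIZE 8

struct entry {
	int tid;
	struct entry *next;
};

static struct entry *table[TABLE_SIZE];

static unsigned int hash_tid(int tid)
{
	return (unsigned int)tid % TABLE_SIZE;
}

/* O(1) average lookup, versus O(log n) for the rbtree it replaces. */
struct entry *threads_find(int tid)
{
	struct entry *e;

	for (e = table[hash_tid(tid)]; e; e = e->next)
		if (e->tid == tid)
			return e;
	return NULL;
}

struct entry *threads_findnew(int tid)
{
	struct entry *e = threads_find(tid);

	if (e)
		return e;
	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->tid = tid;
	e->next = table[hash_tid(tid)];
	table[hash_tid(tid)] = e;
	return e;
}
```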

Ian Rogers (6):
  perf report: Sort child tasks by tid
  perf trace: Ignore thread hashing in summary
  perf machine: Move fprintf to for_each loop and a callback
  perf threads: Move threads to its own files
  perf threads: Switch from rbtree to hashmap
  perf threads: Reduce table size from 256 to 8

 tools/perf/builtin-report.c           | 203 ++++++++-------
 tools/perf/builtin-trace.c            |  41 +--
 tools/perf/util/Build                 |   1 +
 tools/perf/util/bpf_lock_contention.c |   8 +-
 tools/perf/util/machine.c             | 344 +++++++-------------------
 tools/perf/util/machine.h             |  30 +--
 tools/perf/util/rb_resort.h           |   5 -
 tools/perf/util/thread.c              |   2 +-
 tools/perf/util/thread.h              |   6 -
 tools/perf/util/threads.c             | 186 ++++++++++++++
 tools/perf/util/threads.h             |  35 +++
 11 files changed, 464 insertions(+), 397 deletions(-)
 create mode 100644 tools/perf/util/threads.c
 create mode 100644 tools/perf/util/threads.h

-- 
2.43.0.687.g38aa6559b0-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-14 17:24   ` Arnaldo Carvalho de Melo
  2024-02-27  6:39   ` Namhyung Kim
  2024-02-14  6:37 ` [PATCH v1 2/6] perf trace: Ignore thread hashing in summary Ian Rogers
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

Commit 91e467bc568f ("perf machine: Use hashtable for machine
threads") made the iteration of thread tids unordered. The perf report
--tasks output now shows child threads in an order determined by the
hashing. For example, in this snippet tid 3 appears after tid 256 even
though they have the same ppid 2:

```
$ perf report --tasks
%      pid      tid     ppid  comm
         0        0       -1 |swapper
         2        2        0 | kthreadd
       256      256        2 |  kworker/12:1H-k
    693761   693761        2 |  kworker/10:1-mm
   1301762  1301762        2 |  kworker/1:1-mm_
   1302530  1302530        2 |  kworker/u32:0-k
         3        3        2 |  rcu_gp
...
```

The output is easier to read if threads appear in numerically
increasing order. To allow for this, read all threads into a list,
then sort with a comparator that orders by the tids of the children of
the first common parent. The list creation and deletion are added as
utilities on machine. The indentation is derived by counting the
number of parents a child has.

With this change, the output for the same data file now looks like:
```
$ perf report --tasks
%      pid      tid     ppid  comm
         0        0       -1 |swapper
         1        1        0 | systemd
       823      823        1 |  systemd-journal
       853      853        1 |  systemd-udevd
      3230     3230        1 |  systemd-timesyn
      3236     3236        1 |  auditd
      3239     3239     3236 |   audisp-syslog
      3321     3321        1 |  accounts-daemon
...
```
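The parent-walk in the comparator can be sketched with a toy ppid
table (hypothetical data; simplified from the patch's task_list_cmp,
with the reference counting omitted):

```c
#include <assert.h>

/* Toy process tree: ppid[tid] is the parent tid, -1 for the root. */
static const int ppid[8] = { -1, 0, 1, 1, 0, 4, 2, 2 };

/* Depth of a task: the number of parents above it. Drives indentation. */
int thread_level(int tid)
{
	int level = 0;

	while (ppid[tid] != -1) {
		tid = ppid[tid];
		level++;
	}
	return level;
}

/* Order as in the patch: lift the deeper task to the shallower one's
 * depth, walk both up to a common parent, then compare tids there.
 * An ancestor sorts before its descendants. */
int task_cmp(int a, int b)
{
	int level_a = thread_level(a), level_b = thread_level(b);
	int da = level_a, db = level_b;

	while (da > db) { a = ppid[a]; da--; }
	while (db > da) { b = ppid[b]; db--; }
	while (a != b && ppid[a] != ppid[b]) {
		a = ppid[a];
		b = ppid[b];
	}
	if (a == b)	/* one task is an ancestor of the other */
		return level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
	return a < b ? -1 : 1;
}
```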

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
 tools/perf/util/machine.c   |  30 ++++++
 tools/perf/util/machine.h   |  10 ++
 3 files changed, 155 insertions(+), 88 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 8e16fa261e6f..b48f1d5309e3 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -59,6 +59,7 @@
 #include <linux/ctype.h>
 #include <signal.h>
 #include <linux/bitmap.h>
+#include <linux/list_sort.h>
 #include <linux/string.h>
 #include <linux/stringify.h>
 #include <linux/time64.h>
@@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
 	rep->tool.no_warn = true;
 }
 
-struct task {
-	struct thread		*thread;
-	struct list_head	 list;
-	struct list_head	 children;
-};
-
-static struct task *tasks_list(struct task *task, struct machine *machine)
-{
-	struct thread *parent_thread, *thread = task->thread;
-	struct task   *parent_task;
-
-	/* Already listed. */
-	if (!list_empty(&task->list))
-		return NULL;
-
-	/* Last one in the chain. */
-	if (thread__ppid(thread) == -1)
-		return task;
-
-	parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
-	if (!parent_thread)
-		return ERR_PTR(-ENOENT);
-
-	parent_task = thread__priv(parent_thread);
-	thread__put(parent_thread);
-	list_add_tail(&task->list, &parent_task->children);
-	return tasks_list(parent_task, machine);
-}
-
 struct maps__fprintf_task_args {
 	int indent;
 	FILE *fp;
@@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
 	return args.printed;
 }
 
-static void task__print_level(struct task *task, FILE *fp, int level)
+static int thread_level(struct machine *machine, const struct thread *thread)
 {
-	struct thread *thread = task->thread;
-	struct task *child;
-	int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
-				  thread__pid(thread), thread__tid(thread),
-				  thread__ppid(thread), level, "");
+	struct thread *parent_thread;
+	int res;
 
-	fprintf(fp, "%s\n", thread__comm_str(thread));
+	if (thread__tid(thread) <= 0)
+		return 0;
 
-	maps__fprintf_task(thread__maps(thread), comm_indent, fp);
+	if (thread__ppid(thread) <= 0)
+		return 1;
 
-	if (!list_empty(&task->children)) {
-		list_for_each_entry(child, &task->children, list)
-			task__print_level(child, fp, level + 1);
+	parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
+	if (!parent_thread) {
+		pr_err("Missing parent thread of %d\n", thread__tid(thread));
+		return 0;
 	}
+	res = 1 + thread_level(machine, parent_thread);
+	thread__put(parent_thread);
+	return res;
 }
 
-static int tasks_print(struct report *rep, FILE *fp)
+static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
 {
-	struct perf_session *session = rep->session;
-	struct machine      *machine = &session->machines.host;
-	struct task *tasks, *task;
-	unsigned int nr = 0, itask = 0, i;
-	struct rb_node *nd;
-	LIST_HEAD(list);
+	int level = thread_level(machine, thread);
+	int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
+				  thread__pid(thread), thread__tid(thread),
+				  thread__ppid(thread), level, "");
 
-	/*
-	 * No locking needed while accessing machine->threads,
-	 * because --tasks is single threaded command.
-	 */
+	fprintf(fp, "%s\n", thread__comm_str(thread));
 
-	/* Count all the threads. */
-	for (i = 0; i < THREADS__TABLE_SIZE; i++)
-		nr += machine->threads[i].nr;
+	maps__fprintf_task(thread__maps(thread), comm_indent, fp);
+}
 
-	tasks = malloc(sizeof(*tasks) * nr);
-	if (!tasks)
-		return -ENOMEM;
+static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
+{
+	struct machine *machine = priv;
+	struct thread_list *task_a = list_entry(la, struct thread_list, list);
+	struct thread_list *task_b = list_entry(lb, struct thread_list, list);
+	struct thread *a = task_a->thread;
+	struct thread *b = task_b->thread;
+	int level_a, level_b, res;
+
+	/* Compare a and b to root. */
+	if (thread__tid(a) == thread__tid(b))
+		return 0;
 
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		struct threads *threads = &machine->threads[i];
+	if (thread__tid(a) == 0)
+		return -1;
 
-		for (nd = rb_first_cached(&threads->entries); nd;
-		     nd = rb_next(nd)) {
-			task = tasks + itask++;
+	if (thread__tid(b) == 0)
+		return 1;
 
-			task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
-			INIT_LIST_HEAD(&task->children);
-			INIT_LIST_HEAD(&task->list);
-			thread__set_priv(task->thread, task);
-		}
+	/* If parents match sort by tid. */
+	if (thread__ppid(a) == thread__ppid(b)) {
+		return thread__tid(a) < thread__tid(b)
+			? -1
+			: (thread__tid(a) > thread__tid(b) ? 1 : 0);
 	}
 
 	/*
-	 * Iterate every task down to the unprocessed parent
-	 * and link all in task children list. Task with no
-	 * parent is added into 'list'.
+	 * Find a and b such that if they are a child of each other a and b's
+	 * tid's match, otherwise a and b have a common parent and distinct
+	 * tid's to sort by. First make the depths of the threads match.
 	 */
-	for (itask = 0; itask < nr; itask++) {
-		task = tasks + itask;
-
-		if (!list_empty(&task->list))
-			continue;
-
-		task = tasks_list(task, machine);
-		if (IS_ERR(task)) {
-			pr_err("Error: failed to process tasks\n");
-			free(tasks);
-			return PTR_ERR(task);
+	level_a = thread_level(machine, a);
+	level_b = thread_level(machine, b);
+	a = thread__get(a);
+	b = thread__get(b);
+	for (int i = level_a; i > level_b; i--) {
+		struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
+
+		thread__put(a);
+		if (!parent) {
+			pr_err("Missing parent thread of %d\n", thread__tid(a));
+			thread__put(b);
+			return -1;
 		}
+		a = parent;
+	}
+	for (int i = level_b; i > level_a; i--) {
+		struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
 
-		if (task)
-			list_add_tail(&task->list, &list);
+		thread__put(b);
+		if (!parent) {
+			pr_err("Missing parent thread of %d\n", thread__tid(b));
+			thread__put(a);
+			return 1;
+		}
+		b = parent;
+	}
+	/* Search up to a common parent. */
+	while (thread__ppid(a) != thread__ppid(b)) {
+		struct thread *parent;
+
+		parent = machine__find_thread(machine, -1, thread__ppid(a));
+		thread__put(a);
+		if (!parent)
+			pr_err("Missing parent thread of %d\n", thread__tid(a));
+		a = parent;
+		parent = machine__find_thread(machine, -1, thread__ppid(b));
+		thread__put(b);
+		if (!parent)
+			pr_err("Missing parent thread of %d\n", thread__tid(b));
+		b = parent;
+		if (!a || !b)
+			return !a && !b ? 0 : (!a ? -1 : 1);
+	}
+	if (thread__tid(a) == thread__tid(b)) {
+		/* a is a child of b or vice-versa, deeper levels appear later. */
+		res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
+	} else {
+		/* Sort by tid now the parent is the same. */
+		res = thread__tid(a) < thread__tid(b) ? -1 : 1;
 	}
+	thread__put(a);
+	thread__put(b);
+	return res;
+}
+
+static int tasks_print(struct report *rep, FILE *fp)
+{
+	struct machine *machine = &rep->session->machines.host;
+	LIST_HEAD(tasks);
+	int ret;
 
-	fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
+	ret = machine__thread_list(machine, &tasks);
+	if (!ret) {
+		struct thread_list *task;
 
-	list_for_each_entry(task, &list, list)
-		task__print_level(task, fp, 0);
+		list_sort(machine, &tasks, task_list_cmp);
 
-	free(tasks);
-	return 0;
+		fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
+
+		list_for_each_entry(task, &tasks, list)
+			task__print_level(machine, task->thread, fp);
+	}
+	thread_list__delete(&tasks);
+	return ret;
 }
 
 static int __cmd_report(struct report *rep)
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 3da92f18814a..7872ce92c9fc 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
 	return rc;
 }
 
+
+static int thread_list_cb(struct thread *thread, void *data)
+{
+	struct list_head *list = data;
+	struct thread_list *entry = malloc(sizeof(*entry));
+
+	if (!entry)
+		return -ENOMEM;
+
+	entry->thread = thread__get(thread);
+	list_add_tail(&entry->list, list);
+	return 0;
+}
+
+int machine__thread_list(struct machine *machine, struct list_head *list)
+{
+	return machine__for_each_thread(machine, thread_list_cb, list);
+}
+
+void thread_list__delete(struct list_head *list)
+{
+	struct thread_list *pos, *next;
+
+	list_for_each_entry_safe(pos, next, list, list) {
+		thread__zput(pos->thread);
+		list_del(&pos->list);
+		free(pos);
+	}
+}
+
 pid_t machine__get_current_tid(struct machine *machine, int cpu)
 {
 	if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 1279acda6a8a..b738ce84817b 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
 			      int (*fn)(struct thread *thread, void *p),
 			      void *priv);
 
+struct thread_list {
+	struct list_head	 list;
+	struct thread		*thread;
+};
+
+/* Make a list of struct thread_list based on threads in the machine. */
+int machine__thread_list(struct machine *machine, struct list_head *list);
+/* Free up the nodes within the thread_list list. */
+void thread_list__delete(struct list_head *list);
+
 pid_t machine__get_current_tid(struct machine *machine, int cpu);
 int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
 			     pid_t tid);
-- 
2.43.0.687.g38aa6559b0-goog



* [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-14 17:25   ` Arnaldo Carvalho de Melo
  2024-02-27  6:55   ` Namhyung Kim
  2024-02-14  6:37 ` [PATCH v1 3/6] perf machine: Move fprintf to for_each loop and a callback Ian Rogers
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

Commit 91e467bc568f ("perf machine: Use hashtable for machine
threads") made the iteration of thread tids unordered. The perf trace
--summary output sorts and prints each hash bucket, rather than all
threads globally. Change this behavior by turning all threads into a
list, sorting the list by the number of trace events and then by tid,
and finally printing the list. This also means the rbtree in threads
is no longer accessed outside of machine.
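The sort key can be sketched with qsort over toy summary rows
(illustrative types; the patch itself uses list_sort over struct
thread_list):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical summary row: one thread and its trace event count. */
struct row {
	int tid;
	unsigned long nr_events;
};

/* As in trace_nr_events_cmp: fewer events first, ties broken by tid. */
int row_cmp(const void *pa, const void *pb)
{
	const struct row *a = pa, *b = pb;

	if (a->nr_events != b->nr_events)
		return a->nr_events < b->nr_events ? -1 : 1;
	return a->tid < b->tid ? -1 : (a->tid > b->tid ? 1 : 0);
}

/* Demo: returns 1 if a fixed sample sorts to the expected order. */
int demo_sorted_ok(void)
{
	struct row rows[] = { {30, 5}, {10, 5}, {20, 2} };

	qsort(rows, 3, sizeof(rows[0]), row_cmp);
	return rows[0].tid == 20 && rows[1].tid == 10 && rows[2].tid == 30;
}
```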

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-trace.c  | 41 +++++++++++++++++++++----------------
 tools/perf/util/rb_resort.h |  5 -----
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 109b8e64fe69..90eaff8c0f6e 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -74,6 +74,7 @@
 #include <linux/err.h>
 #include <linux/filter.h>
 #include <linux/kernel.h>
+#include <linux/list_sort.h>
 #include <linux/random.h>
 #include <linux/stringify.h>
 #include <linux/time64.h>
@@ -4312,34 +4313,38 @@ static unsigned long thread__nr_events(struct thread_trace *ttrace)
 	return ttrace ? ttrace->nr_events : 0;
 }
 
-DEFINE_RESORT_RB(threads,
-		(thread__nr_events(thread__priv(a->thread)) <
-		 thread__nr_events(thread__priv(b->thread))),
-	struct thread *thread;
-)
+static int trace_nr_events_cmp(void *priv __maybe_unused,
+			       const struct list_head *la,
+			       const struct list_head *lb)
 {
-	entry->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
+	struct thread_list *a = list_entry(la, struct thread_list, list);
+	struct thread_list *b = list_entry(lb, struct thread_list, list);
+	unsigned long a_nr_events = thread__nr_events(thread__priv(a->thread));
+	unsigned long b_nr_events = thread__nr_events(thread__priv(b->thread));
+
+	if (a_nr_events != b_nr_events)
+		return a_nr_events < b_nr_events ? -1 : 1;
+
+	/* Identical number of events, place smaller tids first. */
+	return thread__tid(a->thread) < thread__tid(b->thread)
+		? -1
+		: (thread__tid(a->thread) > thread__tid(b->thread) ? 1 : 0);
 }
 
 static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp)
 {
 	size_t printed = trace__fprintf_threads_header(fp);
-	struct rb_node *nd;
-	int i;
-
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		DECLARE_RESORT_RB_MACHINE_THREADS(threads, trace->host, i);
+	LIST_HEAD(threads);
 
-		if (threads == NULL) {
-			fprintf(fp, "%s", "Error sorting output by nr_events!\n");
-			return 0;
-		}
+	if (machine__thread_list(trace->host, &threads) == 0) {
+		struct thread_list *pos;
 
-		resort_rb__for_each_entry(nd, threads)
-			printed += trace__fprintf_thread(fp, threads_entry->thread, trace);
+		list_sort(NULL, &threads, trace_nr_events_cmp);
 
-		resort_rb__delete(threads);
+		list_for_each_entry(pos, &threads, list)
+			printed += trace__fprintf_thread(fp, pos->thread, trace);
 	}
+	thread_list__delete(&threads);
 	return printed;
 }
 
diff --git a/tools/perf/util/rb_resort.h b/tools/perf/util/rb_resort.h
index 376e86cb4c3c..d927a0d25052 100644
--- a/tools/perf/util/rb_resort.h
+++ b/tools/perf/util/rb_resort.h
@@ -143,9 +143,4 @@ struct __name##_sorted *__name = __name##_sorted__new
 	DECLARE_RESORT_RB(__name)(&__ilist->rblist.entries.rb_root,		\
 				  __ilist->rblist.nr_entries)
 
-/* For 'struct machine->threads' */
-#define DECLARE_RESORT_RB_MACHINE_THREADS(__name, __machine, hash_bucket)    \
- DECLARE_RESORT_RB(__name)(&__machine->threads[hash_bucket].entries.rb_root, \
-			   __machine->threads[hash_bucket].nr)
-
 #endif /* _PERF_RESORT_RB_H_ */
-- 
2.43.0.687.g38aa6559b0-goog



* [PATCH v1 3/6] perf machine: Move fprintf to for_each loop and a callback
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 2/6] perf trace: Ignore thread hashing in summary Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 4/6] perf threads: Move threads to its own files Ian Rogers
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

Avoid exposing the threads data structure by switching to the callback
machine__for_each_thread approach. machine__fprintf is only used in
tests and with verbose >3 output, so don't convert to a list and
sort. Add machine__threads_nr, to be refactored later.

Note, all existing *_fprintf routines ignore fprintf errors.
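The callback pattern can be sketched as follows (illustrative names;
machine__for_each_thread is the real iterator, the rest is a toy
stand-in):

```c
#include <assert.h>
#include <stddef.h>

/* Accumulator threaded through the iteration as an opaque pointer,
 * mirroring machine_fprintf_cb_args in the patch. */
struct print_args {
	size_t printed;
};

/* Stand-in callback: the real one adds thread__fprintf()'s return. */
int count_cb(int tid, void *data)
{
	struct print_args *args = data;

	(void)tid;
	args->printed += 10;
	return 0;	/* non-zero would abort the walk */
}

/* Toy iterator over a tid array; the container owns the traversal, so
 * callers never see the underlying data structure. */
int for_each_thread(const int *tids, size_t n,
		    int (*fn)(int tid, void *data), void *priv)
{
	for (size_t i = 0; i < n; i++) {
		int rc = fn(tids[i], priv);

		if (rc)
			return rc;
	}
	return 0;
}

/* Demo: total "printed" bytes across three threads. */
size_t demo_total(void)
{
	int tids[] = { 1, 2, 3 };
	struct print_args args = { .printed = 0 };

	for_each_thread(tids, 3, count_cb, &args);
	return args.printed;
}
```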

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/machine.c | 43 ++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 7872ce92c9fc..e072b2115b64 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1113,29 +1113,40 @@ size_t machine__fprintf_vmlinux_path(struct machine *machine, FILE *fp)
 	return printed;
 }
 
-size_t machine__fprintf(struct machine *machine, FILE *fp)
+struct machine_fprintf_cb_args {
+	FILE *fp;
+	size_t printed;
+};
+
+static int machine_fprintf_cb(struct thread *thread, void *data)
 {
-	struct rb_node *nd;
-	size_t ret;
-	int i;
+	struct machine_fprintf_cb_args *args = data;
 
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		struct threads *threads = &machine->threads[i];
+	/* TODO: handle fprintf errors. */
+	args->printed += thread__fprintf(thread, args->fp);
+	return 0;
+}
 
-		down_read(&threads->lock);
+static size_t machine__threads_nr(const struct machine *machine)
+{
+	size_t nr = 0;
 
-		ret = fprintf(fp, "Threads: %u\n", threads->nr);
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++)
+		nr += machine->threads[i].nr;
 
-		for (nd = rb_first_cached(&threads->entries); nd;
-		     nd = rb_next(nd)) {
-			struct thread *pos = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
+	return nr;
+}
 
-			ret += thread__fprintf(pos, fp);
-		}
+size_t machine__fprintf(struct machine *machine, FILE *fp)
+{
+	struct machine_fprintf_cb_args args = {
+		.fp = fp,
+		.printed = 0,
+	};
+	size_t ret = fprintf(fp, "Threads: %zu\n", machine__threads_nr(machine));
 
-		up_read(&threads->lock);
-	}
-	return ret;
+	machine__for_each_thread(machine, machine_fprintf_cb, &args);
+	return ret + args.printed;
 }
 
 static struct dso *machine__get_kernel(struct machine *machine)
-- 
2.43.0.687.g38aa6559b0-goog



* [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
                   ` (2 preceding siblings ...)
  2024-02-14  6:37 ` [PATCH v1 3/6] perf machine: Move fprintf to for_each loop and a callback Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-27  7:07   ` Namhyung Kim
  2024-02-14  6:37 ` [PATCH v1 5/6] perf threads: Switch from rbtree to hashmap Ian Rogers
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

Move threads out of machine and move thread_rb_node into the C
file. This hides the implementation of threads from the rest of the
code, allowing it to be refactored.

Locking discipline is tightened up in this change.
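Hiding thread_rb_node in the C file follows the standard C opaque-type
idiom: the header only forward-declares the node type, and its
definition stays private to one translation unit, so the backing
container can change without touching callers. A minimal sketch with
illustrative names, not the series' actual API:

```c
#include <assert.h>
#include <stdlib.h>

/* "Header" half: callers only ever see a pointer to the node, so the
 * backing structure (an rbtree today, a hashmap after patch 5) can be
 * swapped without touching any caller. */
struct task_node;
struct task_node *tasks_findnew(int tid);
int task_node_tid(const struct task_node *node);

/* "C file" half: the private definition. */
struct task_node {
	int tid;
	struct task_node *next;	/* a list here; any container works */
};

static struct task_node *head;

struct task_node *tasks_findnew(int tid)
{
	struct task_node *n;

	for (n = head; n; n = n->next)
		if (n->tid == tid)
			return n;
	n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->tid = tid;
	n->next = head;
	head = n;
	return n;
}

int task_node_tid(const struct task_node *node)
{
	return node->tid;
}
```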

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/Build                 |   1 +
 tools/perf/util/bpf_lock_contention.c |   8 +-
 tools/perf/util/machine.c             | 287 ++++----------------------
 tools/perf/util/machine.h             |  20 +-
 tools/perf/util/thread.c              |   2 +-
 tools/perf/util/thread.h              |   6 -
 tools/perf/util/threads.c             | 244 ++++++++++++++++++++++
 tools/perf/util/threads.h             |  35 ++++
 8 files changed, 325 insertions(+), 278 deletions(-)
 create mode 100644 tools/perf/util/threads.c
 create mode 100644 tools/perf/util/threads.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 8027f450fa3e..a0e8cd68d490 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -71,6 +71,7 @@ perf-y += ordered-events.o
 perf-y += namespaces.o
 perf-y += comm.o
 perf-y += thread.o
+perf-y += threads.o
 perf-y += thread_map.o
 perf-y += parse-events-flex.o
 perf-y += parse-events-bison.o
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 31ff19afc20c..3992c8a9fd96 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -210,7 +210,7 @@ static const char *lock_contention_get_name(struct lock_contention *con,
 
 		/* do not update idle comm which contains CPU number */
 		if (pid) {
-			struct thread *t = __machine__findnew_thread(machine, /*pid=*/-1, pid);
+			struct thread *t = machine__findnew_thread(machine, /*pid=*/-1, pid);
 
 			if (t == NULL)
 				return name;
@@ -302,9 +302,9 @@ int lock_contention_read(struct lock_contention *con)
 		return -1;
 
 	if (con->aggr_mode == LOCK_AGGR_TASK) {
-		struct thread *idle = __machine__findnew_thread(machine,
-								/*pid=*/0,
-								/*tid=*/0);
+		struct thread *idle = machine__findnew_thread(machine,
+							      /*pid=*/0,
+							      /*tid=*/0);
 		thread__set_comm(idle, "swapper", /*timestamp=*/0);
 	}
 
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index e072b2115b64..e668a97255f8 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -43,9 +43,6 @@
 #include <linux/string.h>
 #include <linux/zalloc.h>
 
-static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
-				     struct thread *th, bool lock);
-
 static struct dso *machine__kernel_dso(struct machine *machine)
 {
 	return map__dso(machine->vmlinux_map);
@@ -58,35 +55,6 @@ static void dsos__init(struct dsos *dsos)
 	init_rwsem(&dsos->lock);
 }
 
-static void machine__threads_init(struct machine *machine)
-{
-	int i;
-
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		struct threads *threads = &machine->threads[i];
-		threads->entries = RB_ROOT_CACHED;
-		init_rwsem(&threads->lock);
-		threads->nr = 0;
-		threads->last_match = NULL;
-	}
-}
-
-static int thread_rb_node__cmp_tid(const void *key, const struct rb_node *nd)
-{
-	int to_find = (int) *((pid_t *)key);
-
-	return to_find - (int)thread__tid(rb_entry(nd, struct thread_rb_node, rb_node)->thread);
-}
-
-static struct thread_rb_node *thread_rb_node__find(const struct thread *th,
-						   struct rb_root *tree)
-{
-	pid_t to_find = thread__tid(th);
-	struct rb_node *nd = rb_find(&to_find, tree, thread_rb_node__cmp_tid);
-
-	return rb_entry(nd, struct thread_rb_node, rb_node);
-}
-
 static int machine__set_mmap_name(struct machine *machine)
 {
 	if (machine__is_host(machine))
@@ -120,7 +88,7 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
 	RB_CLEAR_NODE(&machine->rb_node);
 	dsos__init(&machine->dsos);
 
-	machine__threads_init(machine);
+	threads__init(&machine->threads);
 
 	machine->vdso_info = NULL;
 	machine->env = NULL;
@@ -221,27 +189,11 @@ static void dsos__exit(struct dsos *dsos)
 
 void machine__delete_threads(struct machine *machine)
 {
-	struct rb_node *nd;
-	int i;
-
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		struct threads *threads = &machine->threads[i];
-		down_write(&threads->lock);
-		nd = rb_first_cached(&threads->entries);
-		while (nd) {
-			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
-
-			nd = rb_next(nd);
-			__machine__remove_thread(machine, trb, trb->thread, false);
-		}
-		up_write(&threads->lock);
-	}
+	threads__remove_all_threads(&machine->threads);
 }
 
 void machine__exit(struct machine *machine)
 {
-	int i;
-
 	if (machine == NULL)
 		return;
 
@@ -254,12 +206,7 @@ void machine__exit(struct machine *machine)
 	zfree(&machine->current_tid);
 	zfree(&machine->kallsyms_filename);
 
-	machine__delete_threads(machine);
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		struct threads *threads = &machine->threads[i];
-
-		exit_rwsem(&threads->lock);
-	}
+	threads__exit(&machine->threads);
 }
 
 void machine__delete(struct machine *machine)
@@ -526,7 +473,7 @@ static void machine__update_thread_pid(struct machine *machine,
 	if (thread__pid(th) == thread__tid(th))
 		return;
 
-	leader = __machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
+	leader = machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
 	if (!leader)
 		goto out_err;
 
@@ -560,160 +507,55 @@ static void machine__update_thread_pid(struct machine *machine,
 	goto out_put;
 }
 
-/*
- * Front-end cache - TID lookups come in blocks,
- * so most of the time we dont have to look up
- * the full rbtree:
- */
-static struct thread*
-__threads__get_last_match(struct threads *threads, struct machine *machine,
-			  int pid, int tid)
-{
-	struct thread *th;
-
-	th = threads->last_match;
-	if (th != NULL) {
-		if (thread__tid(th) == tid) {
-			machine__update_thread_pid(machine, th, pid);
-			return thread__get(th);
-		}
-		thread__put(threads->last_match);
-		threads->last_match = NULL;
-	}
-
-	return NULL;
-}
-
-static struct thread*
-threads__get_last_match(struct threads *threads, struct machine *machine,
-			int pid, int tid)
-{
-	struct thread *th = NULL;
-
-	if (perf_singlethreaded)
-		th = __threads__get_last_match(threads, machine, pid, tid);
-
-	return th;
-}
-
-static void
-__threads__set_last_match(struct threads *threads, struct thread *th)
-{
-	thread__put(threads->last_match);
-	threads->last_match = thread__get(th);
-}
-
-static void
-threads__set_last_match(struct threads *threads, struct thread *th)
-{
-	if (perf_singlethreaded)
-		__threads__set_last_match(threads, th);
-}
-
 /*
  * Caller must eventually drop thread->refcnt returned with a successful
  * lookup/new thread inserted.
  */
-static struct thread *____machine__findnew_thread(struct machine *machine,
-						  struct threads *threads,
-						  pid_t pid, pid_t tid,
-						  bool create)
+static struct thread *__machine__findnew_thread(struct machine *machine,
+						pid_t pid,
+						pid_t tid,
+						bool create)
 {
-	struct rb_node **p = &threads->entries.rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct thread *th;
-	struct thread_rb_node *nd;
-	bool leftmost = true;
+	struct thread *th = threads__find(&machine->threads, tid);
+	bool created;
 
-	th = threads__get_last_match(threads, machine, pid, tid);
-	if (th)
+	if (th) {
+		machine__update_thread_pid(machine, th, pid);
 		return th;
-
-	while (*p != NULL) {
-		parent = *p;
-		th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
-
-		if (thread__tid(th) == tid) {
-			threads__set_last_match(threads, th);
-			machine__update_thread_pid(machine, th, pid);
-			return thread__get(th);
-		}
-
-		if (tid < thread__tid(th))
-			p = &(*p)->rb_left;
-		else {
-			p = &(*p)->rb_right;
-			leftmost = false;
-		}
 	}
-
 	if (!create)
 		return NULL;
 
-	th = thread__new(pid, tid);
-	if (th == NULL)
-		return NULL;
-
-	nd = malloc(sizeof(*nd));
-	if (nd == NULL) {
-		thread__put(th);
-		return NULL;
-	}
-	nd->thread = th;
-
-	rb_link_node(&nd->rb_node, parent, p);
-	rb_insert_color_cached(&nd->rb_node, &threads->entries, leftmost);
-	/*
-	 * We have to initialize maps separately after rb tree is updated.
-	 *
-	 * The reason is that we call machine__findnew_thread within
-	 * thread__init_maps to find the thread leader and that would screwed
-	 * the rb tree.
-	 */
-	if (thread__init_maps(th, machine)) {
-		pr_err("Thread init failed thread %d\n", pid);
-		rb_erase_cached(&nd->rb_node, &threads->entries);
-		RB_CLEAR_NODE(&nd->rb_node);
-		free(nd);
-		thread__put(th);
-		return NULL;
-	}
-	/*
-	 * It is now in the rbtree, get a ref
-	 */
-	threads__set_last_match(threads, th);
-	++threads->nr;
-
-	return thread__get(th);
-}
+	th = threads__findnew(&machine->threads, pid, tid, &created);
+	if (created) {
+		/*
+		 * We have to initialize maps separately after rb tree is
+		 * updated.
+		 *
+		 * The reason is that we call machine__findnew_thread within
+		 * thread__init_maps to find the thread leader and that would
+		 * screw up the rb tree.
+		 */
+		if (thread__init_maps(th, machine)) {
+			pr_err("Thread init failed thread %d\n", pid);
+			threads__remove(&machine->threads, th);
+			thread__put(th);
+			return NULL;
+		}
+	} else
+		machine__update_thread_pid(machine, th, pid);
 
-struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
-{
-	return ____machine__findnew_thread(machine, machine__threads(machine, tid), pid, tid, true);
+	return th;
 }
 
-struct thread *machine__findnew_thread(struct machine *machine, pid_t pid,
-				       pid_t tid)
+struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
 {
-	struct threads *threads = machine__threads(machine, tid);
-	struct thread *th;
-
-	down_write(&threads->lock);
-	th = __machine__findnew_thread(machine, pid, tid);
-	up_write(&threads->lock);
-	return th;
+	return __machine__findnew_thread(machine, pid, tid, /*create=*/true);
 }
 
-struct thread *machine__find_thread(struct machine *machine, pid_t pid,
-				    pid_t tid)
+struct thread *machine__find_thread(struct machine *machine, pid_t pid, pid_t tid)
 {
-	struct threads *threads = machine__threads(machine, tid);
-	struct thread *th;
-
-	down_read(&threads->lock);
-	th =  ____machine__findnew_thread(machine, threads, pid, tid, false);
-	up_read(&threads->lock);
-	return th;
+	return __machine__findnew_thread(machine, pid, tid, /*create=*/false);
 }
 
 /*
@@ -1127,23 +969,13 @@ static int machine_fprintf_cb(struct thread *thread, void *data)
 	return 0;
 }
 
-static size_t machine__threads_nr(const struct machine *machine)
-{
-	size_t nr = 0;
-
-	for (int i = 0; i < THREADS__TABLE_SIZE; i++)
-		nr += machine->threads[i].nr;
-
-	return nr;
-}
-
 size_t machine__fprintf(struct machine *machine, FILE *fp)
 {
 	struct machine_fprintf_cb_args args = {
 		.fp = fp,
 		.printed = 0,
 	};
-	size_t ret = fprintf(fp, "Threads: %zu\n", machine__threads_nr(machine));
+	size_t ret = fprintf(fp, "Threads: %zu\n", threads__nr(&machine->threads));
 
 	machine__for_each_thread(machine, machine_fprintf_cb, &args);
 	return ret + args.printed;
@@ -2069,36 +1901,9 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
 	return 0;
 }
 
-static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
-				     struct thread *th, bool lock)
-{
-	struct threads *threads = machine__threads(machine, thread__tid(th));
-
-	if (!nd)
-		nd = thread_rb_node__find(th, &threads->entries.rb_root);
-
-	if (threads->last_match && RC_CHK_EQUAL(threads->last_match, th))
-		threads__set_last_match(threads, NULL);
-
-	if (lock)
-		down_write(&threads->lock);
-
-	BUG_ON(refcount_read(thread__refcnt(th)) == 0);
-
-	thread__put(nd->thread);
-	rb_erase_cached(&nd->rb_node, &threads->entries);
-	RB_CLEAR_NODE(&nd->rb_node);
-	--threads->nr;
-
-	free(nd);
-
-	if (lock)
-		up_write(&threads->lock);
-}
-
 void machine__remove_thread(struct machine *machine, struct thread *th)
 {
-	return __machine__remove_thread(machine, NULL, th, true);
+	return threads__remove(&machine->threads, th);
 }
 
 int machine__process_fork_event(struct machine *machine, union perf_event *event,
@@ -3232,23 +3037,7 @@ int machine__for_each_thread(struct machine *machine,
 			     int (*fn)(struct thread *thread, void *p),
 			     void *priv)
 {
-	struct threads *threads;
-	struct rb_node *nd;
-	int rc = 0;
-	int i;
-
-	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
-		threads = &machine->threads[i];
-		for (nd = rb_first_cached(&threads->entries); nd;
-		     nd = rb_next(nd)) {
-			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
-
-			rc = fn(trb->thread, priv);
-			if (rc != 0)
-				return rc;
-		}
-	}
-	return rc;
+	return threads__for_each_thread(&machine->threads, fn, priv);
 }
 
 int machines__for_each_thread(struct machines *machines,
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b738ce84817b..e28c787616fe 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -7,6 +7,7 @@
 #include "maps.h"
 #include "dsos.h"
 #include "rwsem.h"
+#include "threads.h"
 
 struct addr_location;
 struct branch_stack;
@@ -28,16 +29,6 @@ extern const char *ref_reloc_sym_names[];
 
 struct vdso_info;
 
-#define THREADS__TABLE_BITS	8
-#define THREADS__TABLE_SIZE	(1 << THREADS__TABLE_BITS)
-
-struct threads {
-	struct rb_root_cached  entries;
-	struct rw_semaphore    lock;
-	unsigned int	       nr;
-	struct thread	       *last_match;
-};
-
 struct machine {
 	struct rb_node	  rb_node;
 	pid_t		  pid;
@@ -48,7 +39,7 @@ struct machine {
 	char		  *root_dir;
 	char		  *mmap_name;
 	char		  *kallsyms_filename;
-	struct threads    threads[THREADS__TABLE_SIZE];
+	struct threads    threads;
 	struct vdso_info  *vdso_info;
 	struct perf_env   *env;
 	struct dsos	  dsos;
@@ -69,12 +60,6 @@ struct machine {
 	bool		  trampolines_mapped;
 };
 
-static inline struct threads *machine__threads(struct machine *machine, pid_t tid)
-{
-	/* Cast it to handle tid == -1 */
-	return &machine->threads[(unsigned int)tid % THREADS__TABLE_SIZE];
-}
-
 /*
  * The main kernel (vmlinux) map
  */
@@ -220,7 +205,6 @@ bool machine__is(struct machine *machine, const char *arch);
 bool machine__normalized_is(struct machine *machine, const char *arch);
 int machine__nr_cpus_avail(struct machine *machine);
 
-struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
 struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
 
 struct dso *machine__findnew_dso_id(struct machine *machine, const char *filename, struct dso_id *id);
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c59ab4d79163..1aa8962dcf52 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -26,7 +26,7 @@ int thread__init_maps(struct thread *thread, struct machine *machine)
 	if (pid == thread__tid(thread) || pid == -1) {
 		thread__set_maps(thread, maps__new(machine));
 	} else {
-		struct thread *leader = __machine__findnew_thread(machine, pid, pid);
+		struct thread *leader = machine__findnew_thread(machine, pid, pid);
 
 		if (leader) {
 			thread__set_maps(thread, maps__get(thread__maps(leader)));
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 0df775b5c110..4b8f3e9e513b 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -3,7 +3,6 @@
 #define __PERF_THREAD_H
 
 #include <linux/refcount.h>
-#include <linux/rbtree.h>
 #include <linux/list.h>
 #include <stdio.h>
 #include <unistd.h>
@@ -30,11 +29,6 @@ struct lbr_stitch {
 	struct callchain_cursor_node	*prev_lbr_cursor;
 };
 
-struct thread_rb_node {
-	struct rb_node rb_node;
-	struct thread *thread;
-};
-
 DECLARE_RC_STRUCT(thread) {
 	/** @maps: mmaps associated with this thread. */
 	struct maps		*maps;
diff --git a/tools/perf/util/threads.c b/tools/perf/util/threads.c
new file mode 100644
index 000000000000..d984ec939c7b
--- /dev/null
+++ b/tools/perf/util/threads.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "threads.h"
+#include "machine.h"
+#include "thread.h"
+
+struct thread_rb_node {
+	struct rb_node rb_node;
+	struct thread *thread;
+};
+
+static struct threads_table_entry *threads__table(struct threads *threads, pid_t tid)
+{
+	/* Cast it to handle tid == -1 */
+	return &threads->table[(unsigned int)tid % THREADS__TABLE_SIZE];
+}
+
+void threads__init(struct threads *threads)
+{
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+		struct threads_table_entry *table = &threads->table[i];
+
+		table->entries = RB_ROOT_CACHED;
+		init_rwsem(&table->lock);
+		table->nr = 0;
+		table->last_match = NULL;
+	}
+}
+
+void threads__exit(struct threads *threads)
+{
+	threads__remove_all_threads(threads);
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+		struct threads_table_entry *table = &threads->table[i];
+
+		exit_rwsem(&table->lock);
+	}
+}
+
+size_t threads__nr(struct threads *threads)
+{
+	size_t nr = 0;
+
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+		struct threads_table_entry *table = &threads->table[i];
+
+		down_read(&table->lock);
+		nr += table->nr;
+		up_read(&table->lock);
+	}
+	return nr;
+}
+
+/*
+ * Front-end cache - TID lookups come in blocks,
+ * so most of the time we don't have to look up
+ * the full rbtree:
+ */
+static struct thread *__threads_table_entry__get_last_match(struct threads_table_entry *table,
+							    pid_t tid)
+{
+	struct thread *th, *res = NULL;
+
+	th = table->last_match;
+	if (th != NULL) {
+		if (thread__tid(th) == tid)
+			res = thread__get(th);
+	}
+	return res;
+}
+
+static void __threads_table_entry__set_last_match(struct threads_table_entry *table,
+						  struct thread *th)
+{
+	thread__put(table->last_match);
+	table->last_match = thread__get(th);
+}
+
+static void threads_table_entry__set_last_match(struct threads_table_entry *table,
+						struct thread *th)
+{
+	down_write(&table->lock);
+	__threads_table_entry__set_last_match(table, th);
+	up_write(&table->lock);
+}
+
+struct thread *threads__find(struct threads *threads, pid_t tid)
+{
+	struct threads_table_entry *table  = threads__table(threads, tid);
+	struct rb_node **p;
+	struct thread *res = NULL;
+
+	down_read(&table->lock);
+	res = __threads_table_entry__get_last_match(table, tid);
+	if (res)
+		return res;
+
+	p = &table->entries.rb_root.rb_node;
+	while (*p != NULL) {
+		struct rb_node *parent = *p;
+		struct thread *th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
+
+		if (thread__tid(th) == tid) {
+			res = thread__get(th);
+			break;
+		}
+
+		if (tid < thread__tid(th))
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+	up_read(&table->lock);
+	if (res)
+		threads_table_entry__set_last_match(table, res);
+	return res;
+}
+
+struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created)
+{
+	struct threads_table_entry *table  = threads__table(threads, tid);
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct thread *res = NULL;
+	struct thread_rb_node *nd;
+	bool leftmost = true;
+
+	*created = false;
+	down_write(&table->lock);
+	p = &table->entries.rb_root.rb_node;
+	while (*p != NULL) {
+		struct thread *th;
+
+		parent = *p;
+		th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
+
+		if (thread__tid(th) == tid) {
+			__threads_table_entry__set_last_match(table, th);
+			res = thread__get(th);
+			goto out_unlock;
+		}
+
+		if (tid < thread__tid(th))
+			p = &(*p)->rb_left;
+		else {
+			leftmost = false;
+			p = &(*p)->rb_right;
+		}
+	}
+	nd = malloc(sizeof(*nd));
+	if (nd == NULL)
+		goto out_unlock;
+	res = thread__new(pid, tid);
+	if (!res)
+		free(nd);
+	else {
+		*created = true;
+		nd->thread = thread__get(res);
+		rb_link_node(&nd->rb_node, parent, p);
+		rb_insert_color_cached(&nd->rb_node, &table->entries, leftmost);
+		++table->nr;
+		__threads_table_entry__set_last_match(table, res);
+	}
+out_unlock:
+	up_write(&table->lock);
+	return res;
+}
+
+void threads__remove_all_threads(struct threads *threads)
+{
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+		struct threads_table_entry *table = &threads->table[i];
+		struct rb_node *nd;
+
+		down_write(&table->lock);
+		__threads_table_entry__set_last_match(table, NULL);
+		nd = rb_first_cached(&table->entries);
+		while (nd) {
+			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
+
+			nd = rb_next(nd);
+			thread__put(trb->thread);
+			rb_erase_cached(&trb->rb_node, &table->entries);
+			RB_CLEAR_NODE(&trb->rb_node);
+			--table->nr;
+
+			free(trb);
+		}
+		assert(table->nr == 0);
+		up_write(&table->lock);
+	}
+}
+
+void threads__remove(struct threads *threads, struct thread *thread)
+{
+	struct rb_node **p;
+	struct threads_table_entry *table  = threads__table(threads, thread__tid(thread));
+	pid_t tid = thread__tid(thread);
+
+	down_write(&table->lock);
+	if (table->last_match && RC_CHK_EQUAL(table->last_match, thread))
+		__threads_table_entry__set_last_match(table, NULL);
+
+	p = &table->entries.rb_root.rb_node;
+	while (*p != NULL) {
+		struct rb_node *parent = *p;
+		struct thread_rb_node *nd = rb_entry(parent, struct thread_rb_node, rb_node);
+		struct thread *th = nd->thread;
+
+		if (RC_CHK_EQUAL(th, thread)) {
+			thread__put(nd->thread);
+			rb_erase_cached(&nd->rb_node, &table->entries);
+			RB_CLEAR_NODE(&nd->rb_node);
+			--table->nr;
+			free(nd);
+			break;
+		}
+
+		if (tid < thread__tid(th))
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+	up_write(&table->lock);
+}
+
+int threads__for_each_thread(struct threads *threads,
+			     int (*fn)(struct thread *thread, void *data),
+			     void *data)
+{
+	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+		struct threads_table_entry *table = &threads->table[i];
+		struct rb_node *nd;
+
+		for (nd = rb_first_cached(&table->entries); nd; nd = rb_next(nd)) {
+			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
+			int rc = fn(trb->thread, data);
+
+			if (rc != 0)
+				return rc;
+		}
+	}
+	return 0;
+}
diff --git a/tools/perf/util/threads.h b/tools/perf/util/threads.h
new file mode 100644
index 000000000000..ed67de627578
--- /dev/null
+++ b/tools/perf/util/threads.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __PERF_THREADS_H
+#define __PERF_THREADS_H
+
+#include <linux/rbtree.h>
+#include "rwsem.h"
+
+struct thread;
+
+#define THREADS__TABLE_BITS	8
+#define THREADS__TABLE_SIZE	(1 << THREADS__TABLE_BITS)
+
+struct threads_table_entry {
+	struct rb_root_cached  entries;
+	struct rw_semaphore    lock;
+	unsigned int	       nr;
+	struct thread	       *last_match;
+};
+
+struct threads {
+	struct threads_table_entry table[THREADS__TABLE_SIZE];
+};
+
+void threads__init(struct threads *threads);
+void threads__exit(struct threads *threads);
+size_t threads__nr(struct threads *threads);
+struct thread *threads__find(struct threads *threads, pid_t tid);
+struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created);
+void threads__remove_all_threads(struct threads *threads);
+void threads__remove(struct threads *threads, struct thread *thread);
+int threads__for_each_thread(struct threads *threads,
+			     int (*fn)(struct thread *thread, void *data),
+			     void *data);
+
+#endif	/* __PERF_THREADS_H */
-- 
2.43.0.687.g38aa6559b0-goog


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 5/6] perf threads: Switch from rbtree to hashmap
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
                   ` (3 preceding siblings ...)
  2024-02-14  6:37 ` [PATCH v1 4/6] perf threads: Move threads to its own files Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-14  6:37 ` [PATCH v1 6/6] perf threads: Reduce table size from 256 to 8 Ian Rogers
  2024-02-25 18:50 ` [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
  6 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

The rbtree keeps its entries sorted, but this ordering is unused.
Switch to using a hashmap for O(1) rather than O(log n)
find/insert/remove complexity.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/threads.c | 146 ++++++++++++--------------------------
 tools/perf/util/threads.h |   6 +-
 2 files changed, 47 insertions(+), 105 deletions(-)

diff --git a/tools/perf/util/threads.c b/tools/perf/util/threads.c
index d984ec939c7b..55923be53180 100644
--- a/tools/perf/util/threads.c
+++ b/tools/perf/util/threads.c
@@ -3,25 +3,30 @@
 #include "machine.h"
 #include "thread.h"
 
-struct thread_rb_node {
-	struct rb_node rb_node;
-	struct thread *thread;
-};
-
 static struct threads_table_entry *threads__table(struct threads *threads, pid_t tid)
 {
 	/* Cast it to handle tid == -1 */
 	return &threads->table[(unsigned int)tid % THREADS__TABLE_SIZE];
 }
 
+static size_t key_hash(long key, void *ctx __maybe_unused)
+{
+	/* The table lookup removes low bit entropy, but this is just ignored here. */
+	return key;
+}
+
+static bool key_equal(long key1, long key2, void *ctx __maybe_unused)
+{
+	return key1 == key2;
+}
+
 void threads__init(struct threads *threads)
 {
 	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
 		struct threads_table_entry *table = &threads->table[i];
 
-		table->entries = RB_ROOT_CACHED;
+		hashmap__init(&table->shard, key_hash, key_equal, NULL);
 		init_rwsem(&table->lock);
-		table->nr = 0;
 		table->last_match = NULL;
 	}
 }
@@ -32,6 +37,7 @@ void threads__exit(struct threads *threads)
 	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
 		struct threads_table_entry *table = &threads->table[i];
 
+		hashmap__clear(&table->shard);
 		exit_rwsem(&table->lock);
 	}
 }
@@ -44,7 +50,7 @@ size_t threads__nr(struct threads *threads)
 		struct threads_table_entry *table = &threads->table[i];
 
 		down_read(&table->lock);
-		nr += table->nr;
+		nr += hashmap__size(&table->shard);
 		up_read(&table->lock);
 	}
 	return nr;
@@ -86,28 +92,13 @@ static void threads_table_entry__set_last_match(struct threads_table_entry *tabl
 struct thread *threads__find(struct threads *threads, pid_t tid)
 {
 	struct threads_table_entry *table  = threads__table(threads, tid);
-	struct rb_node **p;
-	struct thread *res = NULL;
+	struct thread *res;
 
 	down_read(&table->lock);
 	res = __threads_table_entry__get_last_match(table, tid);
-	if (res)
-		return res;
-
-	p = &table->entries.rb_root.rb_node;
-	while (*p != NULL) {
-		struct rb_node *parent = *p;
-		struct thread *th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
-
-		if (thread__tid(th) == tid) {
-			res = thread__get(th);
-			break;
-		}
-
-		if (tid < thread__tid(th))
-			p = &(*p)->rb_left;
-		else
-			p = &(*p)->rb_right;
+	if (!res) {
+		if (hashmap__find(&table->shard, tid, &res))
+			res = thread__get(res);
 	}
 	up_read(&table->lock);
 	if (res)
@@ -118,49 +109,25 @@ struct thread *threads__find(struct threads *threads, pid_t tid)
 struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created)
 {
 	struct threads_table_entry *table  = threads__table(threads, tid);
-	struct rb_node **p;
-	struct rb_node *parent = NULL;
 	struct thread *res = NULL;
-	struct thread_rb_node *nd;
-	bool leftmost = true;
 
 	*created = false;
 	down_write(&table->lock);
-	p = &table->entries.rb_root.rb_node;
-	while (*p != NULL) {
-		struct thread *th;
-
-		parent = *p;
-		th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
-
-		if (thread__tid(th) == tid) {
-			__threads_table_entry__set_last_match(table, th);
-			res = thread__get(th);
-			goto out_unlock;
-		}
-
-		if (tid < thread__tid(th))
-			p = &(*p)->rb_left;
-		else {
-			leftmost = false;
-			p = &(*p)->rb_right;
-		}
-	}
-	nd = malloc(sizeof(*nd));
-	if (nd == NULL)
-		goto out_unlock;
 	res = thread__new(pid, tid);
-	if (!res)
-		free(nd);
-	else {
-		*created = true;
-		nd->thread = thread__get(res);
-		rb_link_node(&nd->rb_node, parent, p);
-		rb_insert_color_cached(&nd->rb_node, &table->entries, leftmost);
-		++table->nr;
-		__threads_table_entry__set_last_match(table, res);
+	if (res) {
+		if (hashmap__add(&table->shard, tid, res)) {
+			/* Add failed. Assume a race, so find the existing entry. */
+			thread__put(res);
+			res = NULL;
+			if (hashmap__find(&table->shard, tid, &res))
+				res = thread__get(res);
+		} else {
+			res = thread__get(res);
+			*created = true;
+		}
+		if (res)
+			__threads_table_entry__set_last_match(table, res);
 	}
-out_unlock:
 	up_write(&table->lock);
 	return res;
 }
@@ -169,57 +136,32 @@ void threads__remove_all_threads(struct threads *threads)
 {
 	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
 		struct threads_table_entry *table = &threads->table[i];
-		struct rb_node *nd;
+		struct hashmap_entry *cur, *tmp;
+		size_t bkt;
 
 		down_write(&table->lock);
 		__threads_table_entry__set_last_match(table, NULL);
-		nd = rb_first_cached(&table->entries);
-		while (nd) {
-			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
-
-			nd = rb_next(nd);
-			thread__put(trb->thread);
-			rb_erase_cached(&trb->rb_node, &table->entries);
-			RB_CLEAR_NODE(&trb->rb_node);
-			--table->nr;
+		hashmap__for_each_entry_safe((&table->shard), cur, tmp, bkt) {
+			struct thread *old_value;
 
-			free(trb);
+			hashmap__delete(&table->shard, cur->key, /*old_key=*/NULL, &old_value);
+			thread__put(old_value);
 		}
-		assert(table->nr == 0);
 		up_write(&table->lock);
 	}
 }
 
 void threads__remove(struct threads *threads, struct thread *thread)
 {
-	struct rb_node **p;
 	struct threads_table_entry *table  = threads__table(threads, thread__tid(thread));
-	pid_t tid = thread__tid(thread);
+	struct thread *old_value;
 
 	down_write(&table->lock);
 	if (table->last_match && RC_CHK_EQUAL(table->last_match, thread))
 		__threads_table_entry__set_last_match(table, NULL);
 
-	p = &table->entries.rb_root.rb_node;
-	while (*p != NULL) {
-		struct rb_node *parent = *p;
-		struct thread_rb_node *nd = rb_entry(parent, struct thread_rb_node, rb_node);
-		struct thread *th = nd->thread;
-
-		if (RC_CHK_EQUAL(th, thread)) {
-			thread__put(nd->thread);
-			rb_erase_cached(&nd->rb_node, &table->entries);
-			RB_CLEAR_NODE(&nd->rb_node);
-			--table->nr;
-			free(nd);
-			break;
-		}
-
-		if (tid < thread__tid(th))
-			p = &(*p)->rb_left;
-		else
-			p = &(*p)->rb_right;
-	}
+	hashmap__delete(&table->shard, thread__tid(thread), /*old_key=*/NULL, &old_value);
+	thread__put(old_value);
 	up_write(&table->lock);
 }
 
@@ -229,11 +171,11 @@ int threads__for_each_thread(struct threads *threads,
 {
 	for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
 		struct threads_table_entry *table = &threads->table[i];
-		struct rb_node *nd;
+		struct hashmap_entry *cur;
+		size_t bkt;
 
-		for (nd = rb_first_cached(&table->entries); nd; nd = rb_next(nd)) {
-			struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
-			int rc = fn(trb->thread, data);
+		hashmap__for_each_entry((&table->shard), cur, bkt) {
+			int rc = fn((struct thread *)cur->pvalue, data);
 
 			if (rc != 0)
 				return rc;
diff --git a/tools/perf/util/threads.h b/tools/perf/util/threads.h
index ed67de627578..d03bd91a7769 100644
--- a/tools/perf/util/threads.h
+++ b/tools/perf/util/threads.h
@@ -2,7 +2,7 @@
 #ifndef __PERF_THREADS_H
 #define __PERF_THREADS_H
 
-#include <linux/rbtree.h>
+#include "hashmap.h"
 #include "rwsem.h"
 
 struct thread;
@@ -11,9 +11,9 @@ struct thread;
 #define THREADS__TABLE_SIZE	(1 << THREADS__TABLE_BITS)
 
 struct threads_table_entry {
-	struct rb_root_cached  entries;
+	/* Key is tid, value is struct thread. */
+	struct hashmap	       shard;
 	struct rw_semaphore    lock;
-	unsigned int	       nr;
 	struct thread	       *last_match;
 };
 
-- 
2.43.0.687.g38aa6559b0-goog



* [PATCH v1 6/6] perf threads: Reduce table size from 256 to 8
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
                   ` (4 preceding siblings ...)
  2024-02-14  6:37 ` [PATCH v1 5/6] perf threads: Switch from rbtree to hashmap Ian Rogers
@ 2024-02-14  6:37 ` Ian Rogers
  2024-02-25 18:50 ` [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
  6 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14  6:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

The threads data structure is an array of hashmaps, previously
rbtrees. The two levels allow for a fixed outer array where access is
guarded by rw_semaphores. Commit 91e467bc568f ("perf machine: Use
hashtable for machine threads") sized the outer table at 256 entries
to avoid future scalability problems; however, this means the threads
struct is sized at 30,720 bytes. As the hashmaps allow O(1) access for
the common find/insert/remove operations, lower the number of entries
to 8. This reduces the size overhead to 960 bytes.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/threads.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/threads.h b/tools/perf/util/threads.h
index d03bd91a7769..da68d2223f18 100644
--- a/tools/perf/util/threads.h
+++ b/tools/perf/util/threads.h
@@ -7,7 +7,7 @@
 
 struct thread;
 
-#define THREADS__TABLE_BITS	8
+#define THREADS__TABLE_BITS	3
 #define THREADS__TABLE_SIZE	(1 << THREADS__TABLE_BITS)
 
 struct threads_table_entry {
-- 
2.43.0.687.g38aa6559b0-goog



* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
@ 2024-02-14 17:24   ` Arnaldo Carvalho de Melo
  2024-02-14 17:42     ` Ian Rogers
  2024-02-27  6:39   ` Namhyung Kim
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-02-14 17:24 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37:03PM -0800, Ian Rogers wrote:
> Commit 91e467bc568f ("perf machine: Use hashtable for machine
> threads") made the iteration of thread tids unordered. The perf report
> --tasks output now shows child threads in an order determined by the
> hashing. For example, in this snippet tid 3 appears after tid 256 even
> though they have the same ppid 2:
> 
> ```
> $ perf report --tasks
> %      pid      tid     ppid  comm
>          0        0       -1 |swapper
>          2        2        0 | kthreadd
>        256      256        2 |  kworker/12:1H-k
>     693761   693761        2 |  kworker/10:1-mm
>    1301762  1301762        2 |  kworker/1:1-mm_
>    1302530  1302530        2 |  kworker/u32:0-k
>          3        3        2 |  rcu_gp
> ...
> ```
> 
> The output is easier to read if threads appear numerically
> increasing. To allow for this, read all threads into a list then sort
> with a comparator that orders by the child task's of the first common
> parent. The list creation and deletion are created as utilities on
> machine.  The indentation is possible by counting the number of
> parents a child has.
> 
> With this change the output for the same data file is now like:
> ```
> $ perf report --tasks
> %      pid      tid     ppid  comm
>          0        0       -1 |swapper
>          1        1        0 | systemd
>        823      823        1 |  systemd-journal
>        853      853        1 |  systemd-udevd
>       3230     3230        1 |  systemd-timesyn
>       3236     3236        1 |  auditd
>       3239     3239     3236 |   audisp-syslog
>       3321     3321        1 |  accounts-daemon


Since we're adding extra code for sorting, wouldn't it be more
convenient to have this done in a graphically hierarchical output?

But maybe to make this honour asking for a CSV output the above is
enough? Or can we have both, i.e. the hierarchical way for people just
doing --tasks, and for CSV then like the above, with the comma separator.

But then perf stat has -x to ask for CSV, while in perf report -x is
used by the more obscure --exclude-other option :-\

Maybe we need a --csv option that is consistent across all tools.

- Arnaldo

⬢[acme@toolbox b]$ perf stat -x, ls
perf.data
0.65,msec,task-clock:u,648266,100.00,0.534,CPUs utilized
0,,context-switches:u,648266,100.00,0.000,/sec
0,,cpu-migrations:u,648266,100.00,0.000,/sec
91,,page-faults:u,648266,100.00,140.374,K/sec
775564,,cpu_atom/cycles/u,276334,42.00,1.196,GHz
<not counted>,,cpu_core/cycles/u,0,0.00,,
508381,,cpu_atom/instructions/u,648266,100.00,0.66,insn per cycle
<not counted>,,cpu_core/instructions/u,0,0.00,,
99137,,cpu_atom/branches/u,648266,100.00,152.926,M/sec
<not counted>,,cpu_core/branches/u,0,0.00,,
6238,,cpu_atom/branch-misses/u,648266,100.00,6.29,of all branches
<not counted>,,cpu_core/branch-misses/u,0,0.00,,
,648266,100.00,,,,TopdownL1 (cpu_atom)
,,,,,87.9,%  tma_bad_speculation
,648266,100.00,,,24.0,%  tma_frontend_bound
,648266,100.00,,,31.5,%  tma_backend_bound
,,,,,31.5,%  tma_backend_bound_aux
,371932,57.00,,,0.0,%  tma_retiring
⬢[acme@toolbox b]$ perf report -h -x

 Usage: perf report [<options>]

    -x, --exclude-other   Only display entries with parent-match

⬢[acme@toolbox b]$

- Arnaldo

> ...
> ```
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
>  tools/perf/util/machine.c   |  30 ++++++
>  tools/perf/util/machine.h   |  10 ++
>  3 files changed, 155 insertions(+), 88 deletions(-)
> 
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 8e16fa261e6f..b48f1d5309e3 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -59,6 +59,7 @@
>  #include <linux/ctype.h>
>  #include <signal.h>
>  #include <linux/bitmap.h>
> +#include <linux/list_sort.h>
>  #include <linux/string.h>
>  #include <linux/stringify.h>
>  #include <linux/time64.h>
> @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
>  	rep->tool.no_warn = true;
>  }
>  
> -struct task {
> -	struct thread		*thread;
> -	struct list_head	 list;
> -	struct list_head	 children;
> -};
> -
> -static struct task *tasks_list(struct task *task, struct machine *machine)
> -{
> -	struct thread *parent_thread, *thread = task->thread;
> -	struct task   *parent_task;
> -
> -	/* Already listed. */
> -	if (!list_empty(&task->list))
> -		return NULL;
> -
> -	/* Last one in the chain. */
> -	if (thread__ppid(thread) == -1)
> -		return task;
> -
> -	parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> -	if (!parent_thread)
> -		return ERR_PTR(-ENOENT);
> -
> -	parent_task = thread__priv(parent_thread);
> -	thread__put(parent_thread);
> -	list_add_tail(&task->list, &parent_task->children);
> -	return tasks_list(parent_task, machine);
> -}
> -
>  struct maps__fprintf_task_args {
>  	int indent;
>  	FILE *fp;
> @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
>  	return args.printed;
>  }
>  
> -static void task__print_level(struct task *task, FILE *fp, int level)
> +static int thread_level(struct machine *machine, const struct thread *thread)
>  {
> -	struct thread *thread = task->thread;
> -	struct task *child;
> -	int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> -				  thread__pid(thread), thread__tid(thread),
> -				  thread__ppid(thread), level, "");
> +	struct thread *parent_thread;
> +	int res;
>  
> -	fprintf(fp, "%s\n", thread__comm_str(thread));
> +	if (thread__tid(thread) <= 0)
> +		return 0;
>  
> -	maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> +	if (thread__ppid(thread) <= 0)
> +		return 1;
>  
> -	if (!list_empty(&task->children)) {
> -		list_for_each_entry(child, &task->children, list)
> -			task__print_level(child, fp, level + 1);
> +	parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> +	if (!parent_thread) {
> +		pr_err("Missing parent thread of %d\n", thread__tid(thread));
> +		return 0;
>  	}
> +	res = 1 + thread_level(machine, parent_thread);
> +	thread__put(parent_thread);
> +	return res;
>  }
>  
> -static int tasks_print(struct report *rep, FILE *fp)
> +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
>  {
> -	struct perf_session *session = rep->session;
> -	struct machine      *machine = &session->machines.host;
> -	struct task *tasks, *task;
> -	unsigned int nr = 0, itask = 0, i;
> -	struct rb_node *nd;
> -	LIST_HEAD(list);
> +	int level = thread_level(machine, thread);
> +	int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> +				  thread__pid(thread), thread__tid(thread),
> +				  thread__ppid(thread), level, "");
>  
> -	/*
> -	 * No locking needed while accessing machine->threads,
> -	 * because --tasks is single threaded command.
> -	 */
> +	fprintf(fp, "%s\n", thread__comm_str(thread));
>  
> -	/* Count all the threads. */
> -	for (i = 0; i < THREADS__TABLE_SIZE; i++)
> -		nr += machine->threads[i].nr;
> +	maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> +}
>  
> -	tasks = malloc(sizeof(*tasks) * nr);
> -	if (!tasks)
> -		return -ENOMEM;
> +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
> +{
> +	struct machine *machine = priv;
> +	struct thread_list *task_a = list_entry(la, struct thread_list, list);
> +	struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> +	struct thread *a = task_a->thread;
> +	struct thread *b = task_b->thread;
> +	int level_a, level_b, res;
> +
> +	/* Compare a and b to root. */
> +	if (thread__tid(a) == thread__tid(b))
> +		return 0;
>  
> -	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -		struct threads *threads = &machine->threads[i];
> +	if (thread__tid(a) == 0)
> +		return -1;
>  
> -		for (nd = rb_first_cached(&threads->entries); nd;
> -		     nd = rb_next(nd)) {
> -			task = tasks + itask++;
> +	if (thread__tid(b) == 0)
> +		return 1;
>  
> -			task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> -			INIT_LIST_HEAD(&task->children);
> -			INIT_LIST_HEAD(&task->list);
> -			thread__set_priv(task->thread, task);
> -		}
> +	/* If parents match sort by tid. */
> +	if (thread__ppid(a) == thread__ppid(b)) {
> +		return thread__tid(a) < thread__tid(b)
> +			? -1
> +			: (thread__tid(a) > thread__tid(b) ? 1 : 0);
>  	}
>  
>  	/*
> -	 * Iterate every task down to the unprocessed parent
> -	 * and link all in task children list. Task with no
> -	 * parent is added into 'list'.
> +	 * Find a and b such that if they are a child of each other, a and b's
> +	 * tids match; otherwise a and b have a common parent and distinct
> +	 * tids to sort by. First make the depths of the threads match.
>  	 */
> -	for (itask = 0; itask < nr; itask++) {
> -		task = tasks + itask;
> -
> -		if (!list_empty(&task->list))
> -			continue;
> -
> -		task = tasks_list(task, machine);
> -		if (IS_ERR(task)) {
> -			pr_err("Error: failed to process tasks\n");
> -			free(tasks);
> -			return PTR_ERR(task);
> +	level_a = thread_level(machine, a);
> +	level_b = thread_level(machine, b);
> +	a = thread__get(a);
> +	b = thread__get(b);
> +	for (int i = level_a; i > level_b; i--) {
> +		struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> +
> +		thread__put(a);
> +		if (!parent) {
> +			pr_err("Missing parent thread of %d\n", thread__tid(a));
> +			thread__put(b);
> +			return -1;
>  		}
> +		a = parent;
> +	}
> +	for (int i = level_b; i > level_a; i--) {
> +		struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
>  
> -		if (task)
> -			list_add_tail(&task->list, &list);
> +		thread__put(b);
> +		if (!parent) {
> +			pr_err("Missing parent thread of %d\n", thread__tid(b));
> +			thread__put(a);
> +			return 1;
> +		}
> +		b = parent;
> +	}
> +	/* Search up to a common parent. */
> +	while (thread__ppid(a) != thread__ppid(b)) {
> +		struct thread *parent;
> +
> +		parent = machine__find_thread(machine, -1, thread__ppid(a));
> +		thread__put(a);
> +		if (!parent)
> +			pr_err("Missing parent thread of %d\n", thread__tid(a));
> +		a = parent;
> +		parent = machine__find_thread(machine, -1, thread__ppid(b));
> +		thread__put(b);
> +		if (!parent)
> +			pr_err("Missing parent thread of %d\n", thread__tid(b));
> +		b = parent;
> +		if (!a || !b)
> +			return !a && !b ? 0 : (!a ? -1 : 1);
> +	}
> +	if (thread__tid(a) == thread__tid(b)) {
> +		/* a is a child of b or vice-versa, deeper levels appear later. */
> +		res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> +	} else {
> +		/* Sort by tid now the parent is the same. */
> +		res = thread__tid(a) < thread__tid(b) ? -1 : 1;
>  	}
> +	thread__put(a);
> +	thread__put(b);
> +	return res;
> +}
> +
> +static int tasks_print(struct report *rep, FILE *fp)
> +{
> +	struct machine *machine = &rep->session->machines.host;
> +	LIST_HEAD(tasks);
> +	int ret;
>  
> -	fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> +	ret = machine__thread_list(machine, &tasks);
> +	if (!ret) {
> +		struct thread_list *task;
>  
> -	list_for_each_entry(task, &list, list)
> -		task__print_level(task, fp, 0);
> +		list_sort(machine, &tasks, task_list_cmp);
>  
> -	free(tasks);
> -	return 0;
> +		fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> +
> +		list_for_each_entry(task, &tasks, list)
> +			task__print_level(machine, task->thread, fp);
> +	}
> +	thread_list__delete(&tasks);
> +	return ret;
>  }
>  
>  static int __cmd_report(struct report *rep)
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 3da92f18814a..7872ce92c9fc 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
>  	return rc;
>  }
>  
> +
> +static int thread_list_cb(struct thread *thread, void *data)
> +{
> +	struct list_head *list = data;
> +	struct thread_list *entry = malloc(sizeof(*entry));
> +
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	entry->thread = thread__get(thread);
> +	list_add_tail(&entry->list, list);
> +	return 0;
> +}
> +
> +int machine__thread_list(struct machine *machine, struct list_head *list)
> +{
> +	return machine__for_each_thread(machine, thread_list_cb, list);
> +}
> +
> +void thread_list__delete(struct list_head *list)
> +{
> +	struct thread_list *pos, *next;
> +
> +	list_for_each_entry_safe(pos, next, list, list) {
> +		thread__zput(pos->thread);
> +		list_del(&pos->list);
> +		free(pos);
> +	}
> +}
> +
>  pid_t machine__get_current_tid(struct machine *machine, int cpu)
>  {
>  	if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
> diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> index 1279acda6a8a..b738ce84817b 100644
> --- a/tools/perf/util/machine.h
> +++ b/tools/perf/util/machine.h
> @@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
>  			      int (*fn)(struct thread *thread, void *p),
>  			      void *priv);
>  
> +struct thread_list {
> +	struct list_head	 list;
> +	struct thread		*thread;
> +};
> +
> +/* Make a list of struct thread_list based on threads in the machine. */
> +int machine__thread_list(struct machine *machine, struct list_head *list);
> +/* Free up the nodes within the thread_list list. */
> +void thread_list__delete(struct list_head *list);
> +
>  pid_t machine__get_current_tid(struct machine *machine, int cpu);
>  int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
>  			     pid_t tid);
> -- 
> 2.43.0.687.g38aa6559b0-goog
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread
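[Editor's note: the replacement task_list_cmp in the hunk above equalizes the depths of the two threads, then walks both up to the first common parent and compares sibling tids. The same scheme can be sketched in isolation with a hypothetical parent-pointer node type; this is plain C under the assumption of a single-rooted tree, not the perf data structures:]

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int id;
	struct node *parent;	/* NULL for the root */
};

/* Number of ancestors above this node (the root has depth 0). */
static int node_depth(const struct node *n)
{
	int d = 0;

	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/*
 * Order two nodes by the ids of their ancestors under the first common
 * parent; a node sorts before any of its descendants.
 */
static int node_cmp(const struct node *a, const struct node *b)
{
	int level_a = node_depth(a), level_b = node_depth(b);
	int da = level_a, db = level_b;

	if (a == b)
		return 0;
	/* Bring both nodes to the same depth. */
	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	/* One is an ancestor of the other: the shallower sorts first. */
	if (a == b)
		return level_a < level_b ? -1 : 1;
	/* Walk up until the parents match, then compare sibling ids. */
	while (a->parent != b->parent) {
		a = a->parent;
		b = b->parent;
	}
	return a->id < b->id ? -1 : 1;
}
```

The real code additionally reference counts each lookup (thread__get/thread__put around machine__find_thread); the sketch keeps plain pointers, so that bookkeeping is omitted.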

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14  6:37 ` [PATCH v1 2/6] perf trace: Ignore thread hashing in summary Ian Rogers
@ 2024-02-14 17:25   ` Arnaldo Carvalho de Melo
  2024-02-14 18:27     ` Ian Rogers
  2024-02-27  6:55   ` Namhyung Kim
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-02-14 17:25 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> Commit 91e467bc568f ("perf machine: Use hashtable for machine
> threads") made the iteration of thread tids unordered. The perf trace
> --summary output sorts and prints each hash bucket, rather than all
> threads globally. Change this behavior by turning all threads into a
> list, sorting the list by number of trace events then by tid, and
> finally printing the list. This also keeps the rbtree in threads from
> being accessed outside of machine.

Can you please provide a refresh of the output that is changed by your patch?

- Arnaldo
 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-trace.c  | 41 +++++++++++++++++++++----------------
>  tools/perf/util/rb_resort.h |  5 -----
>  2 files changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index 109b8e64fe69..90eaff8c0f6e 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -74,6 +74,7 @@
>  #include <linux/err.h>
>  #include <linux/filter.h>
>  #include <linux/kernel.h>
> +#include <linux/list_sort.h>
>  #include <linux/random.h>
>  #include <linux/stringify.h>
>  #include <linux/time64.h>
> @@ -4312,34 +4313,38 @@ static unsigned long thread__nr_events(struct thread_trace *ttrace)
>  	return ttrace ? ttrace->nr_events : 0;
>  }
>  
> -DEFINE_RESORT_RB(threads,
> -		(thread__nr_events(thread__priv(a->thread)) <
> -		 thread__nr_events(thread__priv(b->thread))),
> -	struct thread *thread;
> -)
> +static int trace_nr_events_cmp(void *priv __maybe_unused,
> +			       const struct list_head *la,
> +			       const struct list_head *lb)
>  {
> -	entry->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> +	struct thread_list *a = list_entry(la, struct thread_list, list);
> +	struct thread_list *b = list_entry(lb, struct thread_list, list);
> +	unsigned long a_nr_events = thread__nr_events(thread__priv(a->thread));
> +	unsigned long b_nr_events = thread__nr_events(thread__priv(b->thread));
> +
> +	if (a_nr_events != b_nr_events)
> +		return a_nr_events < b_nr_events ? -1 : 1;
> +
> +	/* Identical number of events, place smaller tids first. */
> +	return thread__tid(a->thread) < thread__tid(b->thread)
> +		? -1
> +		: (thread__tid(a->thread) > thread__tid(b->thread) ? 1 : 0);
>  }
>  
>  static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp)
>  {
>  	size_t printed = trace__fprintf_threads_header(fp);
> -	struct rb_node *nd;
> -	int i;
> -
> -	for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -		DECLARE_RESORT_RB_MACHINE_THREADS(threads, trace->host, i);
> +	LIST_HEAD(threads);
>  
> -		if (threads == NULL) {
> -			fprintf(fp, "%s", "Error sorting output by nr_events!\n");
> -			return 0;
> -		}
> +	if (machine__thread_list(trace->host, &threads) == 0) {
> +		struct thread_list *pos;
>  
> -		resort_rb__for_each_entry(nd, threads)
> -			printed += trace__fprintf_thread(fp, threads_entry->thread, trace);
> +		list_sort(NULL, &threads, trace_nr_events_cmp);
>  
> -		resort_rb__delete(threads);
> +		list_for_each_entry(pos, &threads, list)
> +			printed += trace__fprintf_thread(fp, pos->thread, trace);
>  	}
> +	thread_list__delete(&threads);
>  	return printed;
>  }
>  
> diff --git a/tools/perf/util/rb_resort.h b/tools/perf/util/rb_resort.h
> index 376e86cb4c3c..d927a0d25052 100644
> --- a/tools/perf/util/rb_resort.h
> +++ b/tools/perf/util/rb_resort.h
> @@ -143,9 +143,4 @@ struct __name##_sorted *__name = __name##_sorted__new
>  	DECLARE_RESORT_RB(__name)(&__ilist->rblist.entries.rb_root,		\
>  				  __ilist->rblist.nr_entries)
>  
> -/* For 'struct machine->threads' */
> -#define DECLARE_RESORT_RB_MACHINE_THREADS(__name, __machine, hash_bucket)    \
> - DECLARE_RESORT_RB(__name)(&__machine->threads[hash_bucket].entries.rb_root, \
> -			   __machine->threads[hash_bucket].nr)
> -
>  #endif /* _PERF_RESORT_RB_H_ */
> -- 
> 2.43.0.687.g38aa6559b0-goog

^ permalink raw reply	[flat|nested] 35+ messages in thread
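[Editor's note: the patch replaces the per-bucket rbtree resort with one flat list and a single two-key comparator, ascending event count with tid as the tie-breaker. The shape of that comparator, sketched with qsort and a hypothetical summary_entry struct rather than the perf types:]

```c
#include <assert.h>
#include <stdlib.h>

struct summary_entry {
	int tid;
	unsigned long nr_events;
};

/* Ascending by event count, then ascending by tid on ties. */
static int summary_cmp(const void *pa, const void *pb)
{
	const struct summary_entry *a = pa, *b = pb;

	if (a->nr_events != b->nr_events)
		return a->nr_events < b->nr_events ? -1 : 1;
	/* Avoids subtracting counts, which could overflow int. */
	return (a->tid > b->tid) - (a->tid < b->tid);
}
```

Returning only -1/0/1 rather than a subtraction matches the style of the patch and sidesteps overflow when the counts are large.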

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-14 17:24   ` Arnaldo Carvalho de Melo
@ 2024-02-14 17:42     ` Ian Rogers
  2024-02-16 20:25       ` Namhyung Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-14 17:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 9:24 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Tue, Feb 13, 2024 at 10:37:03PM -0800, Ian Rogers wrote:
> > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > threads") made the iteration of thread tids unordered. The perf report
> > --tasks output now shows child threads in an order determined by the
> > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > though they have the same ppid 2:
> >
> > ```
> > $ perf report --tasks
> > %      pid      tid     ppid  comm
> >          0        0       -1 |swapper
> >          2        2        0 | kthreadd
> >        256      256        2 |  kworker/12:1H-k
> >     693761   693761        2 |  kworker/10:1-mm
> >    1301762  1301762        2 |  kworker/1:1-mm_
> >    1302530  1302530        2 |  kworker/u32:0-k
> >          3        3        2 |  rcu_gp
> > ...
> > ```
> >
> > The output is easier to read if threads appear numerically
> > increasing. To allow for this, read all threads into a list, then sort
> > it with a comparator that orders by the child tasks of the first common
> > parent. The list creation and deletion are added as utilities on
> > machine. The indentation is derived by counting the number of
> > parents a child has.
> >
> > With this change the output for the same data file is now like:
> > ```
> > $ perf report --tasks
> > %      pid      tid     ppid  comm
> >          0        0       -1 |swapper
> >          1        1        0 | systemd
> >        823      823        1 |  systemd-journal
> >        853      853        1 |  systemd-udevd
> >       3230     3230        1 |  systemd-timesyn
> >       3236     3236        1 |  auditd
> >       3239     3239     3236 |   audisp-syslog
> >       3321     3321        1 |  accounts-daemon
>
>
> Since we're adding extra code for sorting, wouldn't it be more convenient
> to have this done in a graphically hierarchical output?
>
> But maybe to make this honour asking for a CSV output the above is
> enough? Or can we have both, i.e. for people just doing --tasks, the
> hierarchical way, for CSV, then like above, with the comma separator.
>
> But then perf stat has -x to ask for CSV that is used by the more
> obscure --exclude-other option :-\
>
> Maybe we need a --csv that is consistent across all tools.

I've no objection to a graphical/CSV output, I was in this code as I
was restructuring it for memory savings. Fixing the output ordering
was a side-effect, the "graphical" sorting/indentation is mentioned as
it is carrying forward a behavior from the previous code but done in a
somewhat different way. Let's have other output things as follow up
work.

Thanks,
Ian

> - Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14 17:25   ` Arnaldo Carvalho de Melo
@ 2024-02-14 18:27     ` Ian Rogers
  2024-02-14 21:15       ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-14 18:27 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 9:25 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > threads") made the iteration of thread tids unordered. The perf trace
> > --summary output sorts and prints each hash bucket, rather than all
> > threads globally. Change this behavior by turning all threads into a
> > list, sorting the list by number of trace events then by tid, and
> > finally printing the list. This also keeps the rbtree in threads from
> > being accessed outside of machine.
>
> Can you please provide a refresh of the output that is changed by your patch?

Hmm.. looks like perf trace record has broken and doesn't produce
output in newer perfs. It works on 6.5 and so a bisect is necessary.

Without record it is hard to be repeatable but here is a before:
```
$ perf trace -a --summary sleep 1 2>&1 |grep events
Summary of events:
process1 (976898), 10 events, 0.1%
process2 (79368), 2 events, 0.0%
process3 (3759625), 11 events, 0.1%
...
```
Note how the event counts (10, 2, 11) aren't sorted.

After:
```
$ perf trace -a --summary sleep 1 2>&1 |grep events
Summary of events:
process1 (3190), 2 events, 0.0%
process1 (3191), 2 events, 0.0%
process2 (3681), 2 events, 0.0%
process3 (4034), 2 events, 0.0%
process3 (4035), 2 events, 0.0%
...
```
I've anonymized the process names.

Thanks,
Ian

> - Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14 18:27     ` Ian Rogers
@ 2024-02-14 21:15       ` Ian Rogers
  2024-02-14 21:36         ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-14 21:15 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Feb 14, 2024 at 9:25 AM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > threads") made the iteration of thread tids unordered. The perf trace
> > > --summary output sorts and prints each hash bucket, rather than all
> > > threads globally. Change this behavior by turning all threads into a
> > > list, sorting the list by number of trace events then by tid, and
> > > finally printing the list. This also keeps the rbtree in threads from
> > > being accessed outside of machine.
> >
> > Can you please provide a refresh of the output that is changed by your patch?
>
> Hmm.. looks like perf trace record has broken and doesn't produce
> output in newer perfs. It works on 6.5 and so a bisect is necessary.

Bisect result:
```
9925495d96efc14d885ba66c5696f664fe0e663c is the first bad commit
commit 9925495d96efc14d885ba66c5696f664fe0e663c
Author: Ian Rogers <irogers@google.com>
Date:   Thu Sep 14 14:19:45 2023 -0700

   perf build: Default BUILD_BPF_SKEL, warn/disable for missing deps
...
https://lore.kernel.org/r/20230914211948.814999-3-irogers@google.com
```

Now to do the bisect with BUILD_BPF_SKEL=1 on each make.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14 21:15       ` Ian Rogers
@ 2024-02-14 21:36         ` Ian Rogers
  2024-02-14 21:42           ` Ian Rogers
  2024-02-16 14:57           ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14 21:36 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 1:15 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Feb 14, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 9:25 AM Arnaldo Carvalho de Melo
> > <acme@kernel.org> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> > > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > > threads") made the iteration of thread tids unordered. The perf trace
> > > > --summary output sorts and prints each hash bucket, rather than all
> > > > threads globally. Change this behavior by turning all threads into a
> > > > list, sorting the list by number of trace events then by tid, and
> > > > finally printing the list. This also keeps the rbtree in threads from
> > > > being accessed outside of machine.
> > >
> > > Can you please provide a refresh of the output that is changed by your patch?
> >
> > Hmm.. looks like perf trace record has broken and doesn't produce
> > output in newer perfs. It works on 6.5 and so a bisect is necessary.
>
> Bisect result:
> ```
> 9925495d96efc14d885ba66c5696f664fe0e663c is the first bad commit
> commit 9925495d96efc14d885ba66c5696f664fe0e663c
> Author: Ian Rogers <irogers@google.com>
> Date:   Thu Sep 14 14:19:45 2023 -0700
>
>    perf build: Default BUILD_BPF_SKEL, warn/disable for missing deps
> ...
> https://lore.kernel.org/r/20230914211948.814999-3-irogers@google.com
> ```
>
> Now to do the bisect with BUILD_BPF_SKEL=1 on each make.

This looks better (how could I be at fault :-) ):
```
1836480429d173c01664a633b61e525b13d41a2a is the first bad commit
commit 1836480429d173c01664a633b61e525b13d41a2a
Author: Arnaldo Carvalho de Melo <acme@redhat.com>
Date:   Wed Aug 16 13:53:26 2023 -0300

   perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter
using &= sizeof(saddr)
...
   Cc: Adrian Hunter <adrian.hunter@intel.com>
   Cc: Ian Rogers <irogers@google.com>
   Cc: Jiri Olsa <jolsa@kernel.org>
   Cc: Namhyung Kim <namhyung@kernel.org>
   Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
```
No LKML link.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14 21:36         ` Ian Rogers
@ 2024-02-14 21:42           ` Ian Rogers
  2024-02-16 14:57           ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-14 21:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 1:36 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Feb 14, 2024 at 1:15 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Wed, Feb 14, 2024 at 9:25 AM Arnaldo Carvalho de Melo
> > > <acme@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> > > > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > > > threads") made the iteration of thread tids unordered. The perf trace
> > > > > --summary output sorts and prints each hash bucket, rather than all
> > > > > threads globally. Change this behavior by turning all threads into a
> > > > > list, sorting the list by number of trace events then by tid, and
> > > > > finally printing the list. This also keeps the rbtree in threads from
> > > > > being accessed outside of machine.
> > > >
> > > > Can you please provide a refresh of the output that is changed by your patch?
> > >
> > > Hmm.. looks like perf trace record has broken and doesn't produce
> > > output in newer perfs. It works on 6.5 and so a bisect is necessary.
> >
> > Bisect result:
> > ```
> > 9925495d96efc14d885ba66c5696f664fe0e663c is the first bad commit
> > commit 9925495d96efc14d885ba66c5696f664fe0e663c
> > Author: Ian Rogers <irogers@google.com>
> > Date:   Thu Sep 14 14:19:45 2023 -0700
> >
> >    perf build: Default BUILD_BPF_SKEL, warn/disable for missing deps
> > ...
> > https://lore.kernel.org/r/20230914211948.814999-3-irogers@google.com
> > ```
> >
> > Now to do the bisect with BUILD_BPF_SKEL=1 on each make.
>
> This looks better (how could I be at fault :-) ):
> ```
> 1836480429d173c01664a633b61e525b13d41a2a is the first bad commit
> commit 1836480429d173c01664a633b61e525b13d41a2a
> Author: Arnaldo Carvalho de Melo <acme@redhat.com>
> Date:   Wed Aug 16 13:53:26 2023 -0300
>
>    perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter
> using &= sizeof(saddr)
> ...
>    Cc: Adrian Hunter <adrian.hunter@intel.com>
>    Cc: Ian Rogers <irogers@google.com>
>    Cc: Jiri Olsa <jolsa@kernel.org>
>    Cc: Namhyung Kim <namhyung@kernel.org>
>    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ```
> No LKML link.

Hmm.. basically that change fixed the BPF program so that it verifies, so
the problem has been long-standing in the BPF code. Maybe perf trace
record never worked with BPF.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14 21:36         ` Ian Rogers
  2024-02-14 21:42           ` Ian Rogers
@ 2024-02-16 14:57           ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-02-16 14:57 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 01:36:46PM -0800, Ian Rogers wrote:
> On Wed, Feb 14, 2024 at 1:15 PM Ian Rogers <irogers@google.com> wrote:
> > On Wed, Feb 14, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
> > > On Wed, Feb 14, 2024 at 9:25 AM Arnaldo Carvalho de Melo
> > > <acme@kernel.org> wrote:
> > > > On Tue, Feb 13, 2024 at 10:37:04PM -0800, Ian Rogers wrote:
> > > > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > > > threads") made the iteration of thread tids unordered. The perf trace
> > > > > --summary output sorts and prints each hash bucket, rather than all
> > > > > threads globally. Change this behavior by turning all threads into a
> > > > > list, sorting the list by number of trace events then by tid, and
> > > > > finally printing the list. This also keeps the rbtree in threads from
> > > > > being accessed outside of machine.

> > > > Can you please provide a refresh of the output that is changed by your patch?
> > >
> > > Hmm.. looks like perf trace record has broken and doesn't produce
> > > output in newer perfs. It works on 6.5 and so a bisect is necessary.
> >
> > Bisect result:
> > ```
> > 9925495d96efc14d885ba66c5696f664fe0e663c is the first bad commit
> > commit 9925495d96efc14d885ba66c5696f664fe0e663c
> > Author: Ian Rogers <irogers@google.com>
> > Date:   Thu Sep 14 14:19:45 2023 -0700
> >
> >    perf build: Default BUILD_BPF_SKEL, warn/disable for missing deps
> > ...
> > https://lore.kernel.org/r/20230914211948.814999-3-irogers@google.com
> > ```
> >
> > Now to do the bisect with BUILD_BPF_SKEL=1 on each make.
> 
> This looks better (how could I be at fault :-) ):
> ```
> 1836480429d173c01664a633b61e525b13d41a2a is the first bad commit
> commit 1836480429d173c01664a633b61e525b13d41a2a
> Author: Arnaldo Carvalho de Melo <acme@redhat.com>
> Date:   Wed Aug 16 13:53:26 2023 -0300
> 
>    perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter
> using &= sizeof(saddr)
> ...
>    Cc: Adrian Hunter <adrian.hunter@intel.com>
>    Cc: Ian Rogers <irogers@google.com>
>    Cc: Jiri Olsa <jolsa@kernel.org>
>    Cc: Namhyung Kim <namhyung@kernel.org>
>    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ```
> No LKML link.

So simple... ;-\

I've reproduced your steps and got to the same cset while testing on a
recent distro kernel (6.6.13-200.fc39.x86_64), scratching my head now
and trying to figure this out.

Wonder if trying to run on an older kernel the problem would appear.

Will try and add a perf test shell entry with a simple:

root@number:~# perf trace record sleep 0.001 && perf script | head | wc -l
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.034 MB perf.data ]
0
root@number:~#

Has to be 10 :-)

Thanks,

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-14 17:42     ` Ian Rogers
@ 2024-02-16 20:25       ` Namhyung Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Namhyung Kim @ 2024-02-16 20:25 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 14, 2024 at 9:42 AM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Feb 14, 2024 at 9:24 AM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:37:03PM -0800, Ian Rogers wrote:
> > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > threads") made the iteration of thread tids unordered. The perf report
> > > --tasks output now shows child threads in an order determined by the
> > > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > > though they have the same ppid 2:
> > >
> > > ```
> > > $ perf report --tasks
> > > %      pid      tid     ppid  comm
> > >          0        0       -1 |swapper
> > >          2        2        0 | kthreadd
> > >        256      256        2 |  kworker/12:1H-k
> > >     693761   693761        2 |  kworker/10:1-mm
> > >    1301762  1301762        2 |  kworker/1:1-mm_
> > >    1302530  1302530        2 |  kworker/u32:0-k
> > >          3        3        2 |  rcu_gp
> > > ...
> > > ```
> > >
> > > The output is easier to read if threads appear numerically
> > > increasing. To allow for this, read all threads into a list, then sort
> > > it with a comparator that orders by the child tasks of the first common
> > > parent. The list creation and deletion are added as utilities on
> > > machine. The indentation is derived by counting the number of
> > > parents a child has.
> > >
> > > With this change the output for the same data file is now like:
> > > ```
> > > $ perf report --tasks
> > > %      pid      tid     ppid  comm
> > >          0        0       -1 |swapper
> > >          1        1        0 | systemd
> > >        823      823        1 |  systemd-journal
> > >        853      853        1 |  systemd-udevd
> > >       3230     3230        1 |  systemd-timesyn
> > >       3236     3236        1 |  auditd
> > >       3239     3239     3236 |   audisp-syslog
> > >       3321     3321        1 |  accounts-daemon
> >
> >
> > Since we're adding extra code for sorting, wouldn't it be more convenient
> > to have this done in a graphically hierarchical output?
> >
> > But maybe to make this honour asking for a CSV output the above is
> > enough? Or can we have both, i.e. for people just doing --tasks, the
> > hierarchical way, for CSV, then like above, with the comma separator.
> >
> > But then perf stat has -x to ask for CSV that is used by the more
> > obscure --exclude-other option :-\
> >
> > Maybe we need a --csv that is consistent across all tools.
>
> I've no objection to a graphical/CSV output, I was in this code as I
> was restructuring it for memory savings. Fixing the output ordering
> was a side-effect, the "graphical" sorting/indentation is mentioned as
> it is carrying forward a behavior from the previous code but done in a
> somewhat different way. Let's have other output things as follow up
> work.

Agreed, maybe a good project for GSoC students..

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 0/6] Thread memory improvements and fixes
  2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
                   ` (5 preceding siblings ...)
  2024-02-14  6:37 ` [PATCH v1 6/6] perf threads: Reduce table size from 256 to 8 Ian Rogers
@ 2024-02-25 18:50 ` Ian Rogers
  6 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-25 18:50 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
>
> The next 6 patches from:
> https://lore.kernel.org/lkml/20240202061532.1939474-1-irogers@google.com/
> now the initial maps fixes have landed:
> https://lore.kernel.org/all/20240210031746.4057262-1-irogers@google.com/
>
> Separate out and reimplement threads to use a hashmap for lower memory
> consumption and faster lookup. This fixes a regression in memory usage
> introduced when reference count checking switched to using non-invasive
> tree nodes. Reduce the threads table's default size by 32 times and
> improve locking discipline. Also, fix regressions where tids had become
> unordered, to make `perf report --tasks` and `perf trace --summary`
> output easier to read.
>
> Ian Rogers (6):
>   perf report: Sort child tasks by tid
>   perf trace: Ignore thread hashing in summary
>   perf machine: Move fprintf to for_each loop and a callback
>   perf threads: Move threads to its own files
>   perf threads: Switch from rbtree to hashmap
>   perf threads: Reduce table size from 256 to 8
>
>  tools/perf/builtin-report.c           | 203 ++++++++-------
>  tools/perf/builtin-trace.c            |  41 +--
>  tools/perf/util/Build                 |   1 +
>  tools/perf/util/bpf_lock_contention.c |   8 +-
>  tools/perf/util/machine.c             | 344 +++++++-------------------
>  tools/perf/util/machine.h             |  30 +--
>  tools/perf/util/rb_resort.h           |   5 -
>  tools/perf/util/thread.c              |   2 +-
>  tools/perf/util/thread.h              |   6 -
>  tools/perf/util/threads.c             | 186 ++++++++++++++
>  tools/perf/util/threads.h             |  35 +++
>  11 files changed, 464 insertions(+), 397 deletions(-)
>  create mode 100644 tools/perf/util/threads.c
>  create mode 100644 tools/perf/util/threads.h

Arnaldo/Namhyung, anything outstanding that needs addressing in a v2?
I'm looking for reviewed-by/acked-by tags :-)

Thanks,
Ian

> --
> 2.43.0.687.g38aa6559b0-goog
>


* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
  2024-02-14 17:24   ` Arnaldo Carvalho de Melo
@ 2024-02-27  6:39   ` Namhyung Kim
  2024-02-27  7:12     ` Ian Rogers
  1 sibling, 1 reply; 35+ messages in thread
From: Namhyung Kim @ 2024-02-27  6:39 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
>
> Commit 91e467bc568f ("perf machine: Use hashtable for machine
> threads") made the iteration of thread tids unordered. The perf report
> --tasks output now shows child threads in an order determined by the
> hashing. For example, in this snippet tid 3 appears after tid 256 even
> though they have the same ppid 2:
>
> ```
> $ perf report --tasks
> %      pid      tid     ppid  comm
>          0        0       -1 |swapper
>          2        2        0 | kthreadd
>        256      256        2 |  kworker/12:1H-k
>     693761   693761        2 |  kworker/10:1-mm
>    1301762  1301762        2 |  kworker/1:1-mm_
>    1302530  1302530        2 |  kworker/u32:0-k
>          3        3        2 |  rcu_gp
> ...
> ```
>
> The output is easier to read if threads appear in numerically
> increasing tid order. To allow for this, read all threads into a list,
> then sort with a comparator that orders the child tasks of the first
> common parent. The list creation and deletion are added as utilities
> on machine. The indentation is derived by counting the number of
> parents a child has.
>
> With this change the output for the same data file is now like:
> ```
> $ perf report --tasks
> %      pid      tid     ppid  comm
>          0        0       -1 |swapper
>          1        1        0 | systemd
>        823      823        1 |  systemd-journal
>        853      853        1 |  systemd-udevd
>       3230     3230        1 |  systemd-timesyn
>       3236     3236        1 |  auditd
>       3239     3239     3236 |   audisp-syslog
>       3321     3321        1 |  accounts-daemon
> ...
> ```
>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
>  tools/perf/util/machine.c   |  30 ++++++
>  tools/perf/util/machine.h   |  10 ++
>  3 files changed, 155 insertions(+), 88 deletions(-)
>
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 8e16fa261e6f..b48f1d5309e3 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -59,6 +59,7 @@
>  #include <linux/ctype.h>
>  #include <signal.h>
>  #include <linux/bitmap.h>
> +#include <linux/list_sort.h>
>  #include <linux/string.h>
>  #include <linux/stringify.h>
>  #include <linux/time64.h>
> @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
>         rep->tool.no_warn = true;
>  }
>
> -struct task {
> -       struct thread           *thread;
> -       struct list_head         list;
> -       struct list_head         children;
> -};
> -
> -static struct task *tasks_list(struct task *task, struct machine *machine)
> -{
> -       struct thread *parent_thread, *thread = task->thread;
> -       struct task   *parent_task;
> -
> -       /* Already listed. */
> -       if (!list_empty(&task->list))
> -               return NULL;
> -
> -       /* Last one in the chain. */
> -       if (thread__ppid(thread) == -1)
> -               return task;
> -
> -       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> -       if (!parent_thread)
> -               return ERR_PTR(-ENOENT);
> -
> -       parent_task = thread__priv(parent_thread);
> -       thread__put(parent_thread);
> -       list_add_tail(&task->list, &parent_task->children);
> -       return tasks_list(parent_task, machine);
> -}
> -
>  struct maps__fprintf_task_args {
>         int indent;
>         FILE *fp;
> @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
>         return args.printed;
>  }
>
> -static void task__print_level(struct task *task, FILE *fp, int level)
> +static int thread_level(struct machine *machine, const struct thread *thread)
>  {
> -       struct thread *thread = task->thread;
> -       struct task *child;
> -       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> -                                 thread__pid(thread), thread__tid(thread),
> -                                 thread__ppid(thread), level, "");
> +       struct thread *parent_thread;
> +       int res;
>
> -       fprintf(fp, "%s\n", thread__comm_str(thread));
> +       if (thread__tid(thread) <= 0)
> +               return 0;
>
> -       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> +       if (thread__ppid(thread) <= 0)
> +               return 1;
>
> -       if (!list_empty(&task->children)) {
> -               list_for_each_entry(child, &task->children, list)
> -                       task__print_level(child, fp, level + 1);
> +       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> +       if (!parent_thread) {
> +               pr_err("Missing parent thread of %d\n", thread__tid(thread));
> +               return 0;
>         }
> +       res = 1 + thread_level(machine, parent_thread);
> +       thread__put(parent_thread);
> +       return res;
>  }
>
> -static int tasks_print(struct report *rep, FILE *fp)
> +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
>  {
> -       struct perf_session *session = rep->session;
> -       struct machine      *machine = &session->machines.host;
> -       struct task *tasks, *task;
> -       unsigned int nr = 0, itask = 0, i;
> -       struct rb_node *nd;
> -       LIST_HEAD(list);
> +       int level = thread_level(machine, thread);
> +       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> +                                 thread__pid(thread), thread__tid(thread),
> +                                 thread__ppid(thread), level, "");
>
> -       /*
> -        * No locking needed while accessing machine->threads,
> -        * because --tasks is single threaded command.
> -        */
> +       fprintf(fp, "%s\n", thread__comm_str(thread));
>
> -       /* Count all the threads. */
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++)
> -               nr += machine->threads[i].nr;
> +       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> +}
>
> -       tasks = malloc(sizeof(*tasks) * nr);
> -       if (!tasks)
> -               return -ENOMEM;
> +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)

I'm a little afraid that this comparison logic is becoming complex.
But I think it's better than keeping a tree of thread relationships.
A comment that explains why we need this would be nice.


> +{
> +       struct machine *machine = priv;
> +       struct thread_list *task_a = list_entry(la, struct thread_list, list);
> +       struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> +       struct thread *a = task_a->thread;
> +       struct thread *b = task_b->thread;
> +       int level_a, level_b, res;
> +
> +       /* Compare a and b to root. */
> +       if (thread__tid(a) == thread__tid(b))
> +               return 0;
>
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               struct threads *threads = &machine->threads[i];
> +       if (thread__tid(a) == 0)
> +               return -1;
>
> -               for (nd = rb_first_cached(&threads->entries); nd;
> -                    nd = rb_next(nd)) {
> -                       task = tasks + itask++;
> +       if (thread__tid(b) == 0)
> +               return 1;
>
> -                       task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> -                       INIT_LIST_HEAD(&task->children);
> -                       INIT_LIST_HEAD(&task->list);
> -                       thread__set_priv(task->thread, task);
> -               }
> +       /* If parents match sort by tid. */
> +       if (thread__ppid(a) == thread__ppid(b)) {
> +               return thread__tid(a) < thread__tid(b)
> +                       ? -1
> +                       : (thread__tid(a) > thread__tid(b) ? 1 : 0);

Can it be simply like this?  We know tid(a) != tid(b).

  return thread__tid(a) < thread__tid(b) ? -1 : 1;


>         }
>
>         /*
> -        * Iterate every task down to the unprocessed parent
> -        * and link all in task children list. Task with no
> -        * parent is added into 'list'.
> +        * Find a and b such that if they are a child of each other a and b's
> +        * tid's match, otherwise a and b have a common parent and distinct
> +        * tid's to sort by. First make the depths of the threads match.
>          */
> -       for (itask = 0; itask < nr; itask++) {
> -               task = tasks + itask;
> -
> -               if (!list_empty(&task->list))
> -                       continue;
> -
> -               task = tasks_list(task, machine);
> -               if (IS_ERR(task)) {
> -                       pr_err("Error: failed to process tasks\n");
> -                       free(tasks);
> -                       return PTR_ERR(task);
> +       level_a = thread_level(machine, a);
> +       level_b = thread_level(machine, b);
> +       a = thread__get(a);
> +       b = thread__get(b);
> +       for (int i = level_a; i > level_b; i--) {
> +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> +
> +               thread__put(a);
> +               if (!parent) {
> +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> +                       thread__put(b);
> +                       return -1;
>                 }
> +               a = parent;
> +       }
> +       for (int i = level_b; i > level_a; i--) {
> +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
>
> -               if (task)
> -                       list_add_tail(&task->list, &list);
> +               thread__put(b);
> +               if (!parent) {
> +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> +                       thread__put(a);
> +                       return 1;
> +               }
> +               b = parent;
> +       }
> +       /* Search up to a common parent. */
> +       while (thread__ppid(a) != thread__ppid(b)) {
> +               struct thread *parent;
> +
> +               parent = machine__find_thread(machine, -1, thread__ppid(a));
> +               thread__put(a);
> +               if (!parent)
> +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> +               a = parent;
> +               parent = machine__find_thread(machine, -1, thread__ppid(b));
> +               thread__put(b);
> +               if (!parent)
> +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> +               b = parent;
> +               if (!a || !b)
> +                       return !a && !b ? 0 : (!a ? -1 : 1);

Wouldn't it leak a refcount if either a or b is NULL (not both)?


> +       }
> +       if (thread__tid(a) == thread__tid(b)) {
> +               /* a is a child of b or vice-versa, deeper levels appear later. */
> +               res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> +       } else {
> +               /* Sort by tid now the parent is the same. */
> +               res = thread__tid(a) < thread__tid(b) ? -1 : 1;
>         }
> +       thread__put(a);
> +       thread__put(b);
> +       return res;
> +}
> +
> +static int tasks_print(struct report *rep, FILE *fp)
> +{
> +       struct machine *machine = &rep->session->machines.host;
> +       LIST_HEAD(tasks);
> +       int ret;
>
> -       fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> +       ret = machine__thread_list(machine, &tasks);
> +       if (!ret) {
> +               struct thread_list *task;

Do we really need this thread_list?  Why not use an
array of threads directly?

Thanks,
Namhyung

>
> -       list_for_each_entry(task, &list, list)
> -               task__print_level(task, fp, 0);
> +               list_sort(machine, &tasks, task_list_cmp);
>
> -       free(tasks);
> -       return 0;
> +               fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> +
> +               list_for_each_entry(task, &tasks, list)
> +                       task__print_level(machine, task->thread, fp);
> +       }
> +       thread_list__delete(&tasks);
> +       return ret;
>  }
>
>  static int __cmd_report(struct report *rep)
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 3da92f18814a..7872ce92c9fc 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
>         return rc;
>  }
>
> +
> +static int thread_list_cb(struct thread *thread, void *data)
> +{
> +       struct list_head *list = data;
> +       struct thread_list *entry = malloc(sizeof(*entry));
> +
> +       if (!entry)
> +               return -ENOMEM;
> +
> +       entry->thread = thread__get(thread);
> +       list_add_tail(&entry->list, list);
> +       return 0;
> +}
> +
> +int machine__thread_list(struct machine *machine, struct list_head *list)
> +{
> +       return machine__for_each_thread(machine, thread_list_cb, list);
> +}
> +
> +void thread_list__delete(struct list_head *list)
> +{
> +       struct thread_list *pos, *next;
> +
> +       list_for_each_entry_safe(pos, next, list, list) {
> +               thread__zput(pos->thread);
> +               list_del(&pos->list);
> +               free(pos);
> +       }
> +}
> +
>  pid_t machine__get_current_tid(struct machine *machine, int cpu)
>  {
>         if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
> diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> index 1279acda6a8a..b738ce84817b 100644
> --- a/tools/perf/util/machine.h
> +++ b/tools/perf/util/machine.h
> @@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
>                               int (*fn)(struct thread *thread, void *p),
>                               void *priv);
>
> +struct thread_list {
> +       struct list_head         list;
> +       struct thread           *thread;
> +};
> +
> +/* Make a list of struct thread_list based on threads in the machine. */
> +int machine__thread_list(struct machine *machine, struct list_head *list);
> +/* Free up the nodes within the thread_list list. */
> +void thread_list__delete(struct list_head *list);
> +
>  pid_t machine__get_current_tid(struct machine *machine, int cpu);
>  int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
>                              pid_t tid);
> --
> 2.43.0.687.g38aa6559b0-goog
>


* Re: [PATCH v1 2/6] perf trace: Ignore thread hashing in summary
  2024-02-14  6:37 ` [PATCH v1 2/6] perf trace: Ignore thread hashing in summary Ian Rogers
  2024-02-14 17:25   ` Arnaldo Carvalho de Melo
@ 2024-02-27  6:55   ` Namhyung Kim
  1 sibling, 0 replies; 35+ messages in thread
From: Namhyung Kim @ 2024-02-27  6:55 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
>
> Commit 91e467bc568f ("perf machine: Use hashtable for machine
> threads") made the iteration of thread tids unordered. The perf trace
> --summary output sorts and prints each hash bucket, rather than all
> threads globally. Change this behavior by turn all threads into a
> list, sort the list by number of trace events then by tids, finally
> print the list. This also allows the rbtree in threads to be not
> accessed outside of machine.
>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-trace.c  | 41 +++++++++++++++++++++----------------
>  tools/perf/util/rb_resort.h |  5 -----
>  2 files changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index 109b8e64fe69..90eaff8c0f6e 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -74,6 +74,7 @@
>  #include <linux/err.h>
>  #include <linux/filter.h>
>  #include <linux/kernel.h>
> +#include <linux/list_sort.h>
>  #include <linux/random.h>
>  #include <linux/stringify.h>
>  #include <linux/time64.h>
> @@ -4312,34 +4313,38 @@ static unsigned long thread__nr_events(struct thread_trace *ttrace)
>         return ttrace ? ttrace->nr_events : 0;
>  }
>
> -DEFINE_RESORT_RB(threads,
> -               (thread__nr_events(thread__priv(a->thread)) <
> -                thread__nr_events(thread__priv(b->thread))),
> -       struct thread *thread;
> -)
> +static int trace_nr_events_cmp(void *priv __maybe_unused,
> +                              const struct list_head *la,
> +                              const struct list_head *lb)
>  {
> -       entry->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> +       struct thread_list *a = list_entry(la, struct thread_list, list);
> +       struct thread_list *b = list_entry(lb, struct thread_list, list);
> +       unsigned long a_nr_events = thread__nr_events(thread__priv(a->thread));
> +       unsigned long b_nr_events = thread__nr_events(thread__priv(b->thread));
> +
> +       if (a_nr_events != b_nr_events)
> +               return a_nr_events < b_nr_events ? -1 : 1;
> +
> +       /* Identical number of threads, place smaller tids first. */
> +       return thread__tid(a->thread) < thread__tid(b->thread)
> +               ? -1
> +               : (thread__tid(a->thread) > thread__tid(b->thread) ? 1 : 0);

I'm not sure there can be a case where two different threads in the
hash table have the same tid.  If not, the last case can be simplified.


>  }
>
>  static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp)
>  {
>         size_t printed = trace__fprintf_threads_header(fp);
> -       struct rb_node *nd;
> -       int i;
> -
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               DECLARE_RESORT_RB_MACHINE_THREADS(threads, trace->host, i);
> +       LIST_HEAD(threads);
>
> -               if (threads == NULL) {
> -                       fprintf(fp, "%s", "Error sorting output by nr_events!\n");
> -                       return 0;
> -               }
> +       if (machine__thread_list(trace->host, &threads) == 0) {
> +               struct thread_list *pos;
>
> -               resort_rb__for_each_entry(nd, threads)
> -                       printed += trace__fprintf_thread(fp, threads_entry->thread, trace);
> +               list_sort(NULL, &threads, trace_nr_events_cmp);

Same concern: it'd be nice if we could use an array instead.

Thanks,
Namhyung


>
> -               resort_rb__delete(threads);
> +               list_for_each_entry(pos, &threads, list)
> +                       printed += trace__fprintf_thread(fp, pos->thread, trace);
>         }
> +       thread_list__delete(&threads);
>         return printed;
>  }
>
> diff --git a/tools/perf/util/rb_resort.h b/tools/perf/util/rb_resort.h
> index 376e86cb4c3c..d927a0d25052 100644
> --- a/tools/perf/util/rb_resort.h
> +++ b/tools/perf/util/rb_resort.h
> @@ -143,9 +143,4 @@ struct __name##_sorted *__name = __name##_sorted__new
>         DECLARE_RESORT_RB(__name)(&__ilist->rblist.entries.rb_root,             \
>                                   __ilist->rblist.nr_entries)
>
> -/* For 'struct machine->threads' */
> -#define DECLARE_RESORT_RB_MACHINE_THREADS(__name, __machine, hash_bucket)    \
> - DECLARE_RESORT_RB(__name)(&__machine->threads[hash_bucket].entries.rb_root, \
> -                          __machine->threads[hash_bucket].nr)
> -
>  #endif /* _PERF_RESORT_RB_H_ */
> --
> 2.43.0.687.g38aa6559b0-goog
>


* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-14  6:37 ` [PATCH v1 4/6] perf threads: Move threads to its own files Ian Rogers
@ 2024-02-27  7:07   ` Namhyung Kim
  2024-02-27  7:24     ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Namhyung Kim @ 2024-02-27  7:07 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
>
> Move threads out of machine and move thread_rb_node into the C
> file. This hides the implementation of threads from the rest of the
> code allowing for it to be refactored.
>
> Locking discipline is tightened up in this change.

Doesn't look like a simple code move.  Can we split the locking
change from the move to make the reviewer's life a bit easier? :)

Thanks,
Namhyung

>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/Build                 |   1 +
>  tools/perf/util/bpf_lock_contention.c |   8 +-
>  tools/perf/util/machine.c             | 287 ++++----------------------
>  tools/perf/util/machine.h             |  20 +-
>  tools/perf/util/thread.c              |   2 +-
>  tools/perf/util/thread.h              |   6 -
>  tools/perf/util/threads.c             | 244 ++++++++++++++++++++++
>  tools/perf/util/threads.h             |  35 ++++
>  8 files changed, 325 insertions(+), 278 deletions(-)
>  create mode 100644 tools/perf/util/threads.c
>  create mode 100644 tools/perf/util/threads.h
>
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 8027f450fa3e..a0e8cd68d490 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -71,6 +71,7 @@ perf-y += ordered-events.o
>  perf-y += namespaces.o
>  perf-y += comm.o
>  perf-y += thread.o
> +perf-y += threads.o
>  perf-y += thread_map.o
>  perf-y += parse-events-flex.o
>  perf-y += parse-events-bison.o
> diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> index 31ff19afc20c..3992c8a9fd96 100644
> --- a/tools/perf/util/bpf_lock_contention.c
> +++ b/tools/perf/util/bpf_lock_contention.c
> @@ -210,7 +210,7 @@ static const char *lock_contention_get_name(struct lock_contention *con,
>
>                 /* do not update idle comm which contains CPU number */
>                 if (pid) {
> -                       struct thread *t = __machine__findnew_thread(machine, /*pid=*/-1, pid);
> +                       struct thread *t = machine__findnew_thread(machine, /*pid=*/-1, pid);
>
>                         if (t == NULL)
>                                 return name;
> @@ -302,9 +302,9 @@ int lock_contention_read(struct lock_contention *con)
>                 return -1;
>
>         if (con->aggr_mode == LOCK_AGGR_TASK) {
> -               struct thread *idle = __machine__findnew_thread(machine,
> -                                                               /*pid=*/0,
> -                                                               /*tid=*/0);
> +               struct thread *idle = machine__findnew_thread(machine,
> +                                                             /*pid=*/0,
> +                                                             /*tid=*/0);
>                 thread__set_comm(idle, "swapper", /*timestamp=*/0);
>         }
>
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index e072b2115b64..e668a97255f8 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -43,9 +43,6 @@
>  #include <linux/string.h>
>  #include <linux/zalloc.h>
>
> -static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
> -                                    struct thread *th, bool lock);
> -
>  static struct dso *machine__kernel_dso(struct machine *machine)
>  {
>         return map__dso(machine->vmlinux_map);
> @@ -58,35 +55,6 @@ static void dsos__init(struct dsos *dsos)
>         init_rwsem(&dsos->lock);
>  }
>
> -static void machine__threads_init(struct machine *machine)
> -{
> -       int i;
> -
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               struct threads *threads = &machine->threads[i];
> -               threads->entries = RB_ROOT_CACHED;
> -               init_rwsem(&threads->lock);
> -               threads->nr = 0;
> -               threads->last_match = NULL;
> -       }
> -}
> -
> -static int thread_rb_node__cmp_tid(const void *key, const struct rb_node *nd)
> -{
> -       int to_find = (int) *((pid_t *)key);
> -
> -       return to_find - (int)thread__tid(rb_entry(nd, struct thread_rb_node, rb_node)->thread);
> -}
> -
> -static struct thread_rb_node *thread_rb_node__find(const struct thread *th,
> -                                                  struct rb_root *tree)
> -{
> -       pid_t to_find = thread__tid(th);
> -       struct rb_node *nd = rb_find(&to_find, tree, thread_rb_node__cmp_tid);
> -
> -       return rb_entry(nd, struct thread_rb_node, rb_node);
> -}
> -
>  static int machine__set_mmap_name(struct machine *machine)
>  {
>         if (machine__is_host(machine))
> @@ -120,7 +88,7 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
>         RB_CLEAR_NODE(&machine->rb_node);
>         dsos__init(&machine->dsos);
>
> -       machine__threads_init(machine);
> +       threads__init(&machine->threads);
>
>         machine->vdso_info = NULL;
>         machine->env = NULL;
> @@ -221,27 +189,11 @@ static void dsos__exit(struct dsos *dsos)
>
>  void machine__delete_threads(struct machine *machine)
>  {
> -       struct rb_node *nd;
> -       int i;
> -
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               struct threads *threads = &machine->threads[i];
> -               down_write(&threads->lock);
> -               nd = rb_first_cached(&threads->entries);
> -               while (nd) {
> -                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> -
> -                       nd = rb_next(nd);
> -                       __machine__remove_thread(machine, trb, trb->thread, false);
> -               }
> -               up_write(&threads->lock);
> -       }
> +       threads__remove_all_threads(&machine->threads);
>  }
>
>  void machine__exit(struct machine *machine)
>  {
> -       int i;
> -
>         if (machine == NULL)
>                 return;
>
> @@ -254,12 +206,7 @@ void machine__exit(struct machine *machine)
>         zfree(&machine->current_tid);
>         zfree(&machine->kallsyms_filename);
>
> -       machine__delete_threads(machine);
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               struct threads *threads = &machine->threads[i];
> -
> -               exit_rwsem(&threads->lock);
> -       }
> +       threads__exit(&machine->threads);
>  }
>
>  void machine__delete(struct machine *machine)
> @@ -526,7 +473,7 @@ static void machine__update_thread_pid(struct machine *machine,
>         if (thread__pid(th) == thread__tid(th))
>                 return;
>
> -       leader = __machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
> +       leader = machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
>         if (!leader)
>                 goto out_err;
>
> @@ -560,160 +507,55 @@ static void machine__update_thread_pid(struct machine *machine,
>         goto out_put;
>  }
>
> -/*
> - * Front-end cache - TID lookups come in blocks,
> - * so most of the time we dont have to look up
> - * the full rbtree:
> - */
> -static struct thread*
> -__threads__get_last_match(struct threads *threads, struct machine *machine,
> -                         int pid, int tid)
> -{
> -       struct thread *th;
> -
> -       th = threads->last_match;
> -       if (th != NULL) {
> -               if (thread__tid(th) == tid) {
> -                       machine__update_thread_pid(machine, th, pid);
> -                       return thread__get(th);
> -               }
> -               thread__put(threads->last_match);
> -               threads->last_match = NULL;
> -       }
> -
> -       return NULL;
> -}
> -
> -static struct thread*
> -threads__get_last_match(struct threads *threads, struct machine *machine,
> -                       int pid, int tid)
> -{
> -       struct thread *th = NULL;
> -
> -       if (perf_singlethreaded)
> -               th = __threads__get_last_match(threads, machine, pid, tid);
> -
> -       return th;
> -}
> -
> -static void
> -__threads__set_last_match(struct threads *threads, struct thread *th)
> -{
> -       thread__put(threads->last_match);
> -       threads->last_match = thread__get(th);
> -}
> -
> -static void
> -threads__set_last_match(struct threads *threads, struct thread *th)
> -{
> -       if (perf_singlethreaded)
> -               __threads__set_last_match(threads, th);
> -}
> -
>  /*
>   * Caller must eventually drop thread->refcnt returned with a successful
>   * lookup/new thread inserted.
>   */
> -static struct thread *____machine__findnew_thread(struct machine *machine,
> -                                                 struct threads *threads,
> -                                                 pid_t pid, pid_t tid,
> -                                                 bool create)
> +static struct thread *__machine__findnew_thread(struct machine *machine,
> +                                               pid_t pid,
> +                                               pid_t tid,
> +                                               bool create)
>  {
> -       struct rb_node **p = &threads->entries.rb_root.rb_node;
> -       struct rb_node *parent = NULL;
> -       struct thread *th;
> -       struct thread_rb_node *nd;
> -       bool leftmost = true;
> +       struct thread *th = threads__find(&machine->threads, tid);
> +       bool created;
>
> -       th = threads__get_last_match(threads, machine, pid, tid);
> -       if (th)
> +       if (th) {
> +               machine__update_thread_pid(machine, th, pid);
>                 return th;
> -
> -       while (*p != NULL) {
> -               parent = *p;
> -               th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> -
> -               if (thread__tid(th) == tid) {
> -                       threads__set_last_match(threads, th);
> -                       machine__update_thread_pid(machine, th, pid);
> -                       return thread__get(th);
> -               }
> -
> -               if (tid < thread__tid(th))
> -                       p = &(*p)->rb_left;
> -               else {
> -                       p = &(*p)->rb_right;
> -                       leftmost = false;
> -               }
>         }
> -
>         if (!create)
>                 return NULL;
>
> -       th = thread__new(pid, tid);
> -       if (th == NULL)
> -               return NULL;
> -
> -       nd = malloc(sizeof(*nd));
> -       if (nd == NULL) {
> -               thread__put(th);
> -               return NULL;
> -       }
> -       nd->thread = th;
> -
> -       rb_link_node(&nd->rb_node, parent, p);
> -       rb_insert_color_cached(&nd->rb_node, &threads->entries, leftmost);
> -       /*
> -        * We have to initialize maps separately after rb tree is updated.
> -        *
> -        * The reason is that we call machine__findnew_thread within
> -        * thread__init_maps to find the thread leader and that would screwed
> -        * the rb tree.
> -        */
> -       if (thread__init_maps(th, machine)) {
> -               pr_err("Thread init failed thread %d\n", pid);
> -               rb_erase_cached(&nd->rb_node, &threads->entries);
> -               RB_CLEAR_NODE(&nd->rb_node);
> -               free(nd);
> -               thread__put(th);
> -               return NULL;
> -       }
> -       /*
> -        * It is now in the rbtree, get a ref
> -        */
> -       threads__set_last_match(threads, th);
> -       ++threads->nr;
> -
> -       return thread__get(th);
> -}
> +       th = threads__findnew(&machine->threads, pid, tid, &created);
> +       if (created) {
> +               /*
> +                * We have to initialize maps separately after rb tree is
> +                * updated.
> +                *
> +                * The reason is that we call machine__findnew_thread within
> +                * thread__init_maps to find the thread leader and that would
> +                * corrupt the rb tree.
> +                */
> +               if (thread__init_maps(th, machine)) {
> +                       pr_err("Thread init failed thread %d\n", pid);
> +                       threads__remove(&machine->threads, th);
> +                       thread__put(th);
> +                       return NULL;
> +               }
> +       } else {
> +               machine__update_thread_pid(machine, th, pid);
> +       }
>
> -struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
> -{
> -       return ____machine__findnew_thread(machine, machine__threads(machine, tid), pid, tid, true);
> +       return th;
>  }
>
> -struct thread *machine__findnew_thread(struct machine *machine, pid_t pid,
> -                                      pid_t tid)
> +struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
>  {
> -       struct threads *threads = machine__threads(machine, tid);
> -       struct thread *th;
> -
> -       down_write(&threads->lock);
> -       th = __machine__findnew_thread(machine, pid, tid);
> -       up_write(&threads->lock);
> -       return th;
> +       return __machine__findnew_thread(machine, pid, tid, /*create=*/true);
>  }
>
> -struct thread *machine__find_thread(struct machine *machine, pid_t pid,
> -                                   pid_t tid)
> +struct thread *machine__find_thread(struct machine *machine, pid_t pid, pid_t tid)
>  {
> -       struct threads *threads = machine__threads(machine, tid);
> -       struct thread *th;
> -
> -       down_read(&threads->lock);
> -       th =  ____machine__findnew_thread(machine, threads, pid, tid, false);
> -       up_read(&threads->lock);
> -       return th;
> +       return __machine__findnew_thread(machine, pid, tid, /*create=*/false);
>  }
>
>  /*
> @@ -1127,23 +969,13 @@ static int machine_fprintf_cb(struct thread *thread, void *data)
>         return 0;
>  }
>
> -static size_t machine__threads_nr(const struct machine *machine)
> -{
> -       size_t nr = 0;
> -
> -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> -               nr += machine->threads[i].nr;
> -
> -       return nr;
> -}
> -
>  size_t machine__fprintf(struct machine *machine, FILE *fp)
>  {
>         struct machine_fprintf_cb_args args = {
>                 .fp = fp,
>                 .printed = 0,
>         };
> -       size_t ret = fprintf(fp, "Threads: %zu\n", machine__threads_nr(machine));
> +       size_t ret = fprintf(fp, "Threads: %zu\n", threads__nr(&machine->threads));
>
>         machine__for_each_thread(machine, machine_fprintf_cb, &args);
>         return ret + args.printed;
> @@ -2069,36 +1901,9 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
>         return 0;
>  }
>
> -static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
> -                                    struct thread *th, bool lock)
> -{
> -       struct threads *threads = machine__threads(machine, thread__tid(th));
> -
> -       if (!nd)
> -               nd = thread_rb_node__find(th, &threads->entries.rb_root);
> -
> -       if (threads->last_match && RC_CHK_EQUAL(threads->last_match, th))
> -               threads__set_last_match(threads, NULL);
> -
> -       if (lock)
> -               down_write(&threads->lock);
> -
> -       BUG_ON(refcount_read(thread__refcnt(th)) == 0);
> -
> -       thread__put(nd->thread);
> -       rb_erase_cached(&nd->rb_node, &threads->entries);
> -       RB_CLEAR_NODE(&nd->rb_node);
> -       --threads->nr;
> -
> -       free(nd);
> -
> -       if (lock)
> -               up_write(&threads->lock);
> -}
> -
>  void machine__remove_thread(struct machine *machine, struct thread *th)
>  {
> -       return __machine__remove_thread(machine, NULL, th, true);
> +       return threads__remove(&machine->threads, th);
>  }
>
>  int machine__process_fork_event(struct machine *machine, union perf_event *event,
> @@ -3232,23 +3037,7 @@ int machine__for_each_thread(struct machine *machine,
>                              int (*fn)(struct thread *thread, void *p),
>                              void *priv)
>  {
> -       struct threads *threads;
> -       struct rb_node *nd;
> -       int rc = 0;
> -       int i;
> -
> -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> -               threads = &machine->threads[i];
> -               for (nd = rb_first_cached(&threads->entries); nd;
> -                    nd = rb_next(nd)) {
> -                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> -
> -                       rc = fn(trb->thread, priv);
> -                       if (rc != 0)
> -                               return rc;
> -               }
> -       }
> -       return rc;
> +       return threads__for_each_thread(&machine->threads, fn, priv);
>  }
>
>  int machines__for_each_thread(struct machines *machines,
> diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> index b738ce84817b..e28c787616fe 100644
> --- a/tools/perf/util/machine.h
> +++ b/tools/perf/util/machine.h
> @@ -7,6 +7,7 @@
>  #include "maps.h"
>  #include "dsos.h"
>  #include "rwsem.h"
> +#include "threads.h"
>
>  struct addr_location;
>  struct branch_stack;
> @@ -28,16 +29,6 @@ extern const char *ref_reloc_sym_names[];
>
>  struct vdso_info;
>
> -#define THREADS__TABLE_BITS    8
> -#define THREADS__TABLE_SIZE    (1 << THREADS__TABLE_BITS)
> -
> -struct threads {
> -       struct rb_root_cached  entries;
> -       struct rw_semaphore    lock;
> -       unsigned int           nr;
> -       struct thread          *last_match;
> -};
> -
>  struct machine {
>         struct rb_node    rb_node;
>         pid_t             pid;
> @@ -48,7 +39,7 @@ struct machine {
>         char              *root_dir;
>         char              *mmap_name;
>         char              *kallsyms_filename;
> -       struct threads    threads[THREADS__TABLE_SIZE];
> +       struct threads    threads;
>         struct vdso_info  *vdso_info;
>         struct perf_env   *env;
>         struct dsos       dsos;
> @@ -69,12 +60,6 @@ struct machine {
>         bool              trampolines_mapped;
>  };
>
> -static inline struct threads *machine__threads(struct machine *machine, pid_t tid)
> -{
> -       /* Cast it to handle tid == -1 */
> -       return &machine->threads[(unsigned int)tid % THREADS__TABLE_SIZE];
> -}
> -
>  /*
>   * The main kernel (vmlinux) map
>   */
> @@ -220,7 +205,6 @@ bool machine__is(struct machine *machine, const char *arch);
>  bool machine__normalized_is(struct machine *machine, const char *arch);
>  int machine__nr_cpus_avail(struct machine *machine);
>
> -struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
>  struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
>
>  struct dso *machine__findnew_dso_id(struct machine *machine, const char *filename, struct dso_id *id);
> diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
> index c59ab4d79163..1aa8962dcf52 100644
> --- a/tools/perf/util/thread.c
> +++ b/tools/perf/util/thread.c
> @@ -26,7 +26,7 @@ int thread__init_maps(struct thread *thread, struct machine *machine)
>         if (pid == thread__tid(thread) || pid == -1) {
>                 thread__set_maps(thread, maps__new(machine));
>         } else {
> -               struct thread *leader = __machine__findnew_thread(machine, pid, pid);
> +               struct thread *leader = machine__findnew_thread(machine, pid, pid);
>
>                 if (leader) {
>                         thread__set_maps(thread, maps__get(thread__maps(leader)));
> diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
> index 0df775b5c110..4b8f3e9e513b 100644
> --- a/tools/perf/util/thread.h
> +++ b/tools/perf/util/thread.h
> @@ -3,7 +3,6 @@
>  #define __PERF_THREAD_H
>
>  #include <linux/refcount.h>
> -#include <linux/rbtree.h>
>  #include <linux/list.h>
>  #include <stdio.h>
>  #include <unistd.h>
> @@ -30,11 +29,6 @@ struct lbr_stitch {
>         struct callchain_cursor_node    *prev_lbr_cursor;
>  };
>
> -struct thread_rb_node {
> -       struct rb_node rb_node;
> -       struct thread *thread;
> -};
> -
>  DECLARE_RC_STRUCT(thread) {
>         /** @maps: mmaps associated with this thread. */
>         struct maps             *maps;
> diff --git a/tools/perf/util/threads.c b/tools/perf/util/threads.c
> new file mode 100644
> index 000000000000..d984ec939c7b
> --- /dev/null
> +++ b/tools/perf/util/threads.c
> @@ -0,0 +1,244 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "threads.h"
> +#include "machine.h"
> +#include "thread.h"
> +
> +struct thread_rb_node {
> +       struct rb_node rb_node;
> +       struct thread *thread;
> +};
> +
> +static struct threads_table_entry *threads__table(struct threads *threads, pid_t tid)
> +{
> +       /* Cast it to handle tid == -1 */
> +       return &threads->table[(unsigned int)tid % THREADS__TABLE_SIZE];
> +}
> +
> +void threads__init(struct threads *threads)
> +{
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +
> +               table->entries = RB_ROOT_CACHED;
> +               init_rwsem(&table->lock);
> +               table->nr = 0;
> +               table->last_match = NULL;
> +       }
> +}
> +
> +void threads__exit(struct threads *threads)
> +{
> +       threads__remove_all_threads(threads);
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +
> +               exit_rwsem(&table->lock);
> +       }
> +}
> +
> +size_t threads__nr(struct threads *threads)
> +{
> +       size_t nr = 0;
> +
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +
> +               down_read(&table->lock);
> +               nr += table->nr;
> +               up_read(&table->lock);
> +       }
> +       return nr;
> +}
> +
> +/*
> + * Front-end cache - TID lookups come in blocks,
> + * so most of the time we don't have to look up
> + * the full rbtree:
> + */
> +static struct thread *__threads_table_entry__get_last_match(struct threads_table_entry *table,
> +                                                           pid_t tid)
> +{
> +       struct thread *th, *res = NULL;
> +
> +       th = table->last_match;
> +       if (th != NULL) {
> +               if (thread__tid(th) == tid)
> +                       res = thread__get(th);
> +       }
> +       return res;
> +}
> +
> +static void __threads_table_entry__set_last_match(struct threads_table_entry *table,
> +                                                 struct thread *th)
> +{
> +       thread__put(table->last_match);
> +       table->last_match = thread__get(th);
> +}
> +
> +static void threads_table_entry__set_last_match(struct threads_table_entry *table,
> +                                               struct thread *th)
> +{
> +       down_write(&table->lock);
> +       __threads_table_entry__set_last_match(table, th);
> +       up_write(&table->lock);
> +}
> +
> +struct thread *threads__find(struct threads *threads, pid_t tid)
> +{
> +       struct threads_table_entry *table  = threads__table(threads, tid);
> +       struct rb_node **p;
> +       struct thread *res = NULL;
> +
> +       down_read(&table->lock);
> +       res = __threads_table_entry__get_last_match(table, tid);
> +       if (res) {
> +               up_read(&table->lock);
> +               return res;
> +       }
> +
> +       p = &table->entries.rb_root.rb_node;
> +       while (*p != NULL) {
> +               struct rb_node *parent = *p;
> +               struct thread *th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> +
> +               if (thread__tid(th) == tid) {
> +                       res = thread__get(th);
> +                       break;
> +               }
> +
> +               if (tid < thread__tid(th))
> +                       p = &(*p)->rb_left;
> +               else
> +                       p = &(*p)->rb_right;
> +       }
> +       up_read(&table->lock);
> +       if (res)
> +               threads_table_entry__set_last_match(table, res);
> +       return res;
> +}
> +
> +struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created)
> +{
> +       struct threads_table_entry *table  = threads__table(threads, tid);
> +       struct rb_node **p;
> +       struct rb_node *parent = NULL;
> +       struct thread *res = NULL;
> +       struct thread_rb_node *nd;
> +       bool leftmost = true;
> +
> +       *created = false;
> +       down_write(&table->lock);
> +       p = &table->entries.rb_root.rb_node;
> +       while (*p != NULL) {
> +               struct thread *th;
> +
> +               parent = *p;
> +               th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> +
> +               if (thread__tid(th) == tid) {
> +                       __threads_table_entry__set_last_match(table, th);
> +                       res = thread__get(th);
> +                       goto out_unlock;
> +               }
> +
> +               if (tid < thread__tid(th))
> +                       p = &(*p)->rb_left;
> +               else {
> +                       leftmost = false;
> +                       p = &(*p)->rb_right;
> +               }
> +       }
> +       nd = malloc(sizeof(*nd));
> +       if (nd == NULL)
> +               goto out_unlock;
> +       res = thread__new(pid, tid);
> +       if (!res)
> +               free(nd);
> +       else {
> +               *created = true;
> +               nd->thread = thread__get(res);
> +               rb_link_node(&nd->rb_node, parent, p);
> +               rb_insert_color_cached(&nd->rb_node, &table->entries, leftmost);
> +               ++table->nr;
> +               __threads_table_entry__set_last_match(table, res);
> +       }
> +out_unlock:
> +       up_write(&table->lock);
> +       return res;
> +}
> +
> +void threads__remove_all_threads(struct threads *threads)
> +{
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +               struct rb_node *nd;
> +
> +               down_write(&table->lock);
> +               __threads_table_entry__set_last_match(table, NULL);
> +               nd = rb_first_cached(&table->entries);
> +               while (nd) {
> +                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> +
> +                       nd = rb_next(nd);
> +                       thread__put(trb->thread);
> +                       rb_erase_cached(&trb->rb_node, &table->entries);
> +                       RB_CLEAR_NODE(&trb->rb_node);
> +                       --table->nr;
> +
> +                       free(trb);
> +               }
> +               assert(table->nr == 0);
> +               up_write(&table->lock);
> +       }
> +}
> +
> +void threads__remove(struct threads *threads, struct thread *thread)
> +{
> +       struct rb_node **p;
> +       struct threads_table_entry *table  = threads__table(threads, thread__tid(thread));
> +       pid_t tid = thread__tid(thread);
> +
> +       down_write(&table->lock);
> +       if (table->last_match && RC_CHK_EQUAL(table->last_match, thread))
> +               __threads_table_entry__set_last_match(table, NULL);
> +
> +       p = &table->entries.rb_root.rb_node;
> +       while (*p != NULL) {
> +               struct rb_node *parent = *p;
> +               struct thread_rb_node *nd = rb_entry(parent, struct thread_rb_node, rb_node);
> +               struct thread *th = nd->thread;
> +
> +               if (RC_CHK_EQUAL(th, thread)) {
> +                       thread__put(nd->thread);
> +                       rb_erase_cached(&nd->rb_node, &table->entries);
> +                       RB_CLEAR_NODE(&nd->rb_node);
> +                       --table->nr;
> +                       free(nd);
> +                       break;
> +               }
> +
> +               if (tid < thread__tid(th))
> +                       p = &(*p)->rb_left;
> +               else
> +                       p = &(*p)->rb_right;
> +       }
> +       up_write(&table->lock);
> +}
> +
> +int threads__for_each_thread(struct threads *threads,
> +                            int (*fn)(struct thread *thread, void *data),
> +                            void *data)
> +{
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +               struct rb_node *nd;
> +
> +               for (nd = rb_first_cached(&table->entries); nd; nd = rb_next(nd)) {
> +                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> +                       int rc = fn(trb->thread, data);
> +
> +                       if (rc != 0)
> +                               return rc;
> +               }
> +       }
> +       return 0;
> +}
> diff --git a/tools/perf/util/threads.h b/tools/perf/util/threads.h
> new file mode 100644
> index 000000000000..ed67de627578
> --- /dev/null
> +++ b/tools/perf/util/threads.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __PERF_THREADS_H
> +#define __PERF_THREADS_H
> +
> +#include <linux/rbtree.h>
> +#include "rwsem.h"
> +
> +struct thread;
> +
> +#define THREADS__TABLE_BITS    8
> +#define THREADS__TABLE_SIZE    (1 << THREADS__TABLE_BITS)
> +
> +struct threads_table_entry {
> +       struct rb_root_cached  entries;
> +       struct rw_semaphore    lock;
> +       unsigned int           nr;
> +       struct thread          *last_match;
> +};
> +
> +struct threads {
> +       struct threads_table_entry table[THREADS__TABLE_SIZE];
> +};
> +
> +void threads__init(struct threads *threads);
> +void threads__exit(struct threads *threads);
> +size_t threads__nr(struct threads *threads);
> +struct thread *threads__find(struct threads *threads, pid_t tid);
> +struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created);
> +void threads__remove_all_threads(struct threads *threads);
> +void threads__remove(struct threads *threads, struct thread *thread);
> +int threads__for_each_thread(struct threads *threads,
> +                            int (*fn)(struct thread *thread, void *data),
> +                            void *data);
> +
> +#endif /* __PERF_THREADS_H */
> --
> 2.43.0.687.g38aa6559b0-goog
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-27  6:39   ` Namhyung Kim
@ 2024-02-27  7:12     ` Ian Rogers
  2024-02-28  6:11       ` Namhyung Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-27  7:12 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Mon, Feb 26, 2024 at 10:39 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> >
> > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > threads") made the iteration of thread tids unordered. The perf report
> > --tasks output now shows child threads in an order determined by the
> > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > though they have the same ppid 2:
> >
> > ```
> > $ perf report --tasks
> > %      pid      tid     ppid  comm
> >          0        0       -1 |swapper
> >          2        2        0 | kthreadd
> >        256      256        2 |  kworker/12:1H-k
> >     693761   693761        2 |  kworker/10:1-mm
> >    1301762  1301762        2 |  kworker/1:1-mm_
> >    1302530  1302530        2 |  kworker/u32:0-k
> >          3        3        2 |  rcu_gp
> > ...
> > ```
> >
> > The output is easier to read if threads appear in numerically
> > increasing order. To allow for this, read all threads into a list,
> > then sort with a comparator that orders by the tids of the child
> > tasks of the first common parent. The list creation and deletion are
> > implemented as utilities on machine. The indentation is derived by
> > counting the number of parents a child has.
> >
> > With this change the output for the same data file is now like:
> > ```
> > $ perf report --tasks
> > %      pid      tid     ppid  comm
> >          0        0       -1 |swapper
> >          1        1        0 | systemd
> >        823      823        1 |  systemd-journal
> >        853      853        1 |  systemd-udevd
> >       3230     3230        1 |  systemd-timesyn
> >       3236     3236        1 |  auditd
> >       3239     3239     3236 |   audisp-syslog
> >       3321     3321        1 |  accounts-daemon
> > ...
> > ```
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
> >  tools/perf/util/machine.c   |  30 ++++++
> >  tools/perf/util/machine.h   |  10 ++
> >  3 files changed, 155 insertions(+), 88 deletions(-)
> >
> > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > index 8e16fa261e6f..b48f1d5309e3 100644
> > --- a/tools/perf/builtin-report.c
> > +++ b/tools/perf/builtin-report.c
> > @@ -59,6 +59,7 @@
> >  #include <linux/ctype.h>
> >  #include <signal.h>
> >  #include <linux/bitmap.h>
> > +#include <linux/list_sort.h>
> >  #include <linux/string.h>
> >  #include <linux/stringify.h>
> >  #include <linux/time64.h>
> > @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
> >         rep->tool.no_warn = true;
> >  }
> >
> > -struct task {
> > -       struct thread           *thread;
> > -       struct list_head         list;
> > -       struct list_head         children;
> > -};
> > -
> > -static struct task *tasks_list(struct task *task, struct machine *machine)
> > -{
> > -       struct thread *parent_thread, *thread = task->thread;
> > -       struct task   *parent_task;
> > -
> > -       /* Already listed. */
> > -       if (!list_empty(&task->list))
> > -               return NULL;
> > -
> > -       /* Last one in the chain. */
> > -       if (thread__ppid(thread) == -1)
> > -               return task;
> > -
> > -       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > -       if (!parent_thread)
> > -               return ERR_PTR(-ENOENT);
> > -
> > -       parent_task = thread__priv(parent_thread);
> > -       thread__put(parent_thread);
> > -       list_add_tail(&task->list, &parent_task->children);
> > -       return tasks_list(parent_task, machine);
> > -}
> > -
> >  struct maps__fprintf_task_args {
> >         int indent;
> >         FILE *fp;
> > @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
> >         return args.printed;
> >  }
> >
> > -static void task__print_level(struct task *task, FILE *fp, int level)
> > +static int thread_level(struct machine *machine, const struct thread *thread)
> >  {
> > -       struct thread *thread = task->thread;
> > -       struct task *child;
> > -       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > -                                 thread__pid(thread), thread__tid(thread),
> > -                                 thread__ppid(thread), level, "");
> > +       struct thread *parent_thread;
> > +       int res;
> >
> > -       fprintf(fp, "%s\n", thread__comm_str(thread));
> > +       if (thread__tid(thread) <= 0)
> > +               return 0;
> >
> > -       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > +       if (thread__ppid(thread) <= 0)
> > +               return 1;
> >
> > -       if (!list_empty(&task->children)) {
> > -               list_for_each_entry(child, &task->children, list)
> > -                       task__print_level(child, fp, level + 1);
> > +       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > +       if (!parent_thread) {
> > +               pr_err("Missing parent thread of %d\n", thread__tid(thread));
> > +               return 0;
> >         }
> > +       res = 1 + thread_level(machine, parent_thread);
> > +       thread__put(parent_thread);
> > +       return res;
> >  }
> >
> > -static int tasks_print(struct report *rep, FILE *fp)
> > +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
> >  {
> > -       struct perf_session *session = rep->session;
> > -       struct machine      *machine = &session->machines.host;
> > -       struct task *tasks, *task;
> > -       unsigned int nr = 0, itask = 0, i;
> > -       struct rb_node *nd;
> > -       LIST_HEAD(list);
> > +       int level = thread_level(machine, thread);
> > +       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > +                                 thread__pid(thread), thread__tid(thread),
> > +                                 thread__ppid(thread), level, "");
> >
> > -       /*
> > -        * No locking needed while accessing machine->threads,
> > -        * because --tasks is single threaded command.
> > -        */
> > +       fprintf(fp, "%s\n", thread__comm_str(thread));
> >
> > -       /* Count all the threads. */
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++)
> > -               nr += machine->threads[i].nr;
> > +       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > +}
> >
> > -       tasks = malloc(sizeof(*tasks) * nr);
> > -       if (!tasks)
> > -               return -ENOMEM;
> > +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
>
> I'm a little afraid that this comparison logic becomes complex.
> But I think it's better than having a tree of thread relationships.
> Just a comment that explains why we need this would be nice.

I can add something in v2.

>
> > +{
> > +       struct machine *machine = priv;
> > +       struct thread_list *task_a = list_entry(la, struct thread_list, list);
> > +       struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> > +       struct thread *a = task_a->thread;
> > +       struct thread *b = task_b->thread;
> > +       int level_a, level_b, res;
> > +
> > +       /* Compare a and b to root. */
> > +       if (thread__tid(a) == thread__tid(b))
> > +               return 0;
> >
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > -               struct threads *threads = &machine->threads[i];
> > +       if (thread__tid(a) == 0)
> > +               return -1;
> >
> > -               for (nd = rb_first_cached(&threads->entries); nd;
> > -                    nd = rb_next(nd)) {
> > -                       task = tasks + itask++;
> > +       if (thread__tid(b) == 0)
> > +               return 1;
> >
> > -                       task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> > -                       INIT_LIST_HEAD(&task->children);
> > -                       INIT_LIST_HEAD(&task->list);
> > -                       thread__set_priv(task->thread, task);
> > -               }
> > +       /* If parents match sort by tid. */
> > +       if (thread__ppid(a) == thread__ppid(b)) {
> > +               return thread__tid(a) < thread__tid(b)
> > +                       ? -1
> > +                       : (thread__tid(a) > thread__tid(b) ? 1 : 0);
>
> Can it be simply like this?  We know tid(a) != tid(b).
>
>   return thread__tid(a) < thread__tid(b) ? -1 : 1;

Yes, but the parent check is still required.
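To make that concrete, here is a minimal standalone sketch (a hypothetical `task_ids` struct, not the perf thread API) of why the two-way form only applies inside the same-parent branch — the ppid guard has to stay, because tids from different parents cannot be ordered directly and need the walk up to a common ancestor first:

```c
#include <assert.h>

/* Hypothetical stand-in for the tid/ppid pair read off a thread. */
struct task_ids {
	int tid;
	int ppid;
};

/*
 * Same-parent branch of the comparator. Once the parents match, tids
 * are unique, so a < b or a > b are the only possible outcomes and
 * the three-way ternary collapses to two.
 */
static int task_cmp_same_parent(const struct task_ids *a, const struct task_ids *b)
{
	assert(a->ppid == b->ppid);	/* the guard the caller must keep */
	assert(a->tid != b->tid);	/* tids are unique */
	return a->tid < b->tid ? -1 : 1;
}
```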

> >         }
> >
> >         /*
> > -        * Iterate every task down to the unprocessed parent
> > -        * and link all in task children list. Task with no
> > -        * parent is added into 'list'.
> > +        * Find a and b such that if they are a child of each other a and b's
> > +        * tid's match, otherwise a and b have a common parent and distinct
> > +        * tid's to sort by. First make the depths of the threads match.
> >          */
> > -       for (itask = 0; itask < nr; itask++) {
> > -               task = tasks + itask;
> > -
> > -               if (!list_empty(&task->list))
> > -                       continue;
> > -
> > -               task = tasks_list(task, machine);
> > -               if (IS_ERR(task)) {
> > -                       pr_err("Error: failed to process tasks\n");
> > -                       free(tasks);
> > -                       return PTR_ERR(task);
> > +       level_a = thread_level(machine, a);
> > +       level_b = thread_level(machine, b);
> > +       a = thread__get(a);
> > +       b = thread__get(b);
> > +       for (int i = level_a; i > level_b; i--) {
> > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> > +
> > +               thread__put(a);
> > +               if (!parent) {
> > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > +                       thread__put(b);
> > +                       return -1;
> >                 }
> > +               a = parent;
> > +       }
> > +       for (int i = level_b; i > level_a; i--) {
> > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
> >
> > -               if (task)
> > -                       list_add_tail(&task->list, &list);
> > +               thread__put(b);
> > +               if (!parent) {
> > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > +                       thread__put(a);
> > +                       return 1;
> > +               }
> > +               b = parent;
> > +       }
> > +       /* Search up to a common parent. */
> > +       while (thread__ppid(a) != thread__ppid(b)) {
> > +               struct thread *parent;
> > +
> > +               parent = machine__find_thread(machine, -1, thread__ppid(a));
> > +               thread__put(a);
> > +               if (!parent)
> > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > +               a = parent;
> > +               parent = machine__find_thread(machine, -1, thread__ppid(b));
> > +               thread__put(b);
> > +               if (!parent)
> > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > +               b = parent;
> > +               if (!a || !b)
> > +                       return !a && !b ? 0 : (!a ? -1 : 1);
>
> Wouldn't it leak a refcount if either a or b is NULL (not both)?

It would, but that path is only reached on an error condition anyway. I can add the missing puts.

>
> > +       }
> > +       if (thread__tid(a) == thread__tid(b)) {
> > +               /* a is a child of b or vice-versa, deeper levels appear later. */
> > +               res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> > +       } else {
> > +               /* Sort by tid now the parent is the same. */
> > +               res = thread__tid(a) < thread__tid(b) ? -1 : 1;
> >         }
> > +       thread__put(a);
> > +       thread__put(b);
> > +       return res;
> > +}
> > +
> > +static int tasks_print(struct report *rep, FILE *fp)
> > +{
> > +       struct machine *machine = &rep->session->machines.host;
> > +       LIST_HEAD(tasks);
> > +       int ret;
> >
> > -       fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > +       ret = machine__thread_list(machine, &tasks);
> > +       if (!ret) {
> > +               struct thread_list *task;
>
> Do we really need this thread_list?  Why not use an
> array of threads directly?

The code isn't particularly performance critical. I used a list as it
best approximated how the rbtree was being used. The code is reused in
subsequent patches, there's no initial pass to size an array, and I
think the reallocarray/qsort logic is generally more error-prone than
the list version. If we were worried about performance then arrays
could make sense as an optimization, but I think this is good enough
for now.

Thanks,
Ian

> Thanks,
> Namhyung
>
> >
> > -       list_for_each_entry(task, &list, list)
> > -               task__print_level(task, fp, 0);
> > +               list_sort(machine, &tasks, task_list_cmp);
> >
> > -       free(tasks);
> > -       return 0;
> > +               fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > +
> > +               list_for_each_entry(task, &tasks, list)
> > +                       task__print_level(machine, task->thread, fp);
> > +       }
> > +       thread_list__delete(&tasks);
> > +       return ret;
> >  }
> >
> >  static int __cmd_report(struct report *rep)
> > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > index 3da92f18814a..7872ce92c9fc 100644
> > --- a/tools/perf/util/machine.c
> > +++ b/tools/perf/util/machine.c
> > @@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
> >         return rc;
> >  }
> >
> > +
> > +static int thread_list_cb(struct thread *thread, void *data)
> > +{
> > +       struct list_head *list = data;
> > +       struct thread_list *entry = malloc(sizeof(*entry));
> > +
> > +       if (!entry)
> > +               return -ENOMEM;
> > +
> > +       entry->thread = thread__get(thread);
> > +       list_add_tail(&entry->list, list);
> > +       return 0;
> > +}
> > +
> > +int machine__thread_list(struct machine *machine, struct list_head *list)
> > +{
> > +       return machine__for_each_thread(machine, thread_list_cb, list);
> > +}
> > +
> > +void thread_list__delete(struct list_head *list)
> > +{
> > +       struct thread_list *pos, *next;
> > +
> > +       list_for_each_entry_safe(pos, next, list, list) {
> > +               thread__zput(pos->thread);
> > +               list_del(&pos->list);
> > +               free(pos);
> > +       }
> > +}
> > +
> >  pid_t machine__get_current_tid(struct machine *machine, int cpu)
> >  {
> >         if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
> > diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> > index 1279acda6a8a..b738ce84817b 100644
> > --- a/tools/perf/util/machine.h
> > +++ b/tools/perf/util/machine.h
> > @@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
> >                               int (*fn)(struct thread *thread, void *p),
> >                               void *priv);
> >
> > +struct thread_list {
> > +       struct list_head         list;
> > +       struct thread           *thread;
> > +};
> > +
> > +/* Make a list of struct thread_list based on threads in the machine. */
> > +int machine__thread_list(struct machine *machine, struct list_head *list);
> > +/* Free up the nodes within the thread_list list. */
> > +void thread_list__delete(struct list_head *list);
> > +
> >  pid_t machine__get_current_tid(struct machine *machine, int cpu);
> >  int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
> >                              pid_t tid);
> > --
> > 2.43.0.687.g38aa6559b0-goog
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27  7:07   ` Namhyung Kim
@ 2024-02-27  7:24     ` Ian Rogers
  2024-02-27 17:31       ` Namhyung Kim
  2024-02-29 21:59       ` David Laight
  0 siblings, 2 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-27  7:24 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Mon, Feb 26, 2024 at 11:07 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> >
> > Move threads out of machine and move thread_rb_node into the C
> > file. This hides the implementation of threads from the rest of the
> > code allowing for it to be refactored.
> >
> > Locking discipline is tightened up in this change.
>
> Doesn't look like a simple code move.  Can we split the locking
> change from the move to make the reviewer's life a bit easier? :)

Not sure I follow. Take threads_nr as an example.

The old code is in machine.c, so:
-static size_t machine__threads_nr(const struct machine *machine)
-{
-       size_t nr = 0;
-
-       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
-               nr += machine->threads[i].nr;
-
-       return nr;
-}

The new code is in threads.c:
+size_t threads__nr(struct threads *threads)
+{
+       size_t nr = 0;
+
+       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
+               struct threads_table_entry *table = &threads->table[i];
+
+               down_read(&table->lock);
+               nr += table->nr;
+               up_read(&table->lock);
+       }
+       return nr;
+}

So it is a copy-paste from one file to the other. The only difference
is that the old code failed to take a lock when reading "nr", so the
locking is added. I wanted to make sure all the functions in threads.c
were correct with respect to locking, semaphore creation and
destruction, etc. We could land a broken threads.c and fix it in the
next change, but given the missing locking is a bug, that could make
bisection more difficult. Ultimately I thought the locking changes
were small enough not to warrant a patch of their own, compared to the
advantage of having a sane threads abstraction.

Thanks,
Ian

> Thanks,
> Namhyung
>
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/util/Build                 |   1 +
> >  tools/perf/util/bpf_lock_contention.c |   8 +-
> >  tools/perf/util/machine.c             | 287 ++++----------------------
> >  tools/perf/util/machine.h             |  20 +-
> >  tools/perf/util/thread.c              |   2 +-
> >  tools/perf/util/thread.h              |   6 -
> >  tools/perf/util/threads.c             | 244 ++++++++++++++++++++++
> >  tools/perf/util/threads.h             |  35 ++++
> >  8 files changed, 325 insertions(+), 278 deletions(-)
> >  create mode 100644 tools/perf/util/threads.c
> >  create mode 100644 tools/perf/util/threads.h
> >
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 8027f450fa3e..a0e8cd68d490 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -71,6 +71,7 @@ perf-y += ordered-events.o
> >  perf-y += namespaces.o
> >  perf-y += comm.o
> >  perf-y += thread.o
> > +perf-y += threads.o
> >  perf-y += thread_map.o
> >  perf-y += parse-events-flex.o
> >  perf-y += parse-events-bison.o
> > diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> > index 31ff19afc20c..3992c8a9fd96 100644
> > --- a/tools/perf/util/bpf_lock_contention.c
> > +++ b/tools/perf/util/bpf_lock_contention.c
> > @@ -210,7 +210,7 @@ static const char *lock_contention_get_name(struct lock_contention *con,
> >
> >                 /* do not update idle comm which contains CPU number */
> >                 if (pid) {
> > -                       struct thread *t = __machine__findnew_thread(machine, /*pid=*/-1, pid);
> > +                       struct thread *t = machine__findnew_thread(machine, /*pid=*/-1, pid);
> >
> >                         if (t == NULL)
> >                                 return name;
> > @@ -302,9 +302,9 @@ int lock_contention_read(struct lock_contention *con)
> >                 return -1;
> >
> >         if (con->aggr_mode == LOCK_AGGR_TASK) {
> > -               struct thread *idle = __machine__findnew_thread(machine,
> > -                                                               /*pid=*/0,
> > -                                                               /*tid=*/0);
> > +               struct thread *idle = machine__findnew_thread(machine,
> > +                                                             /*pid=*/0,
> > +                                                             /*tid=*/0);
> >                 thread__set_comm(idle, "swapper", /*timestamp=*/0);
> >         }
> >
> > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > index e072b2115b64..e668a97255f8 100644
> > --- a/tools/perf/util/machine.c
> > +++ b/tools/perf/util/machine.c
> > @@ -43,9 +43,6 @@
> >  #include <linux/string.h>
> >  #include <linux/zalloc.h>
> >
> > -static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
> > -                                    struct thread *th, bool lock);
> > -
> >  static struct dso *machine__kernel_dso(struct machine *machine)
> >  {
> >         return map__dso(machine->vmlinux_map);
> > @@ -58,35 +55,6 @@ static void dsos__init(struct dsos *dsos)
> >         init_rwsem(&dsos->lock);
> >  }
> >
> > -static void machine__threads_init(struct machine *machine)
> > -{
> > -       int i;
> > -
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > -               struct threads *threads = &machine->threads[i];
> > -               threads->entries = RB_ROOT_CACHED;
> > -               init_rwsem(&threads->lock);
> > -               threads->nr = 0;
> > -               threads->last_match = NULL;
> > -       }
> > -}
> > -
> > -static int thread_rb_node__cmp_tid(const void *key, const struct rb_node *nd)
> > -{
> > -       int to_find = (int) *((pid_t *)key);
> > -
> > -       return to_find - (int)thread__tid(rb_entry(nd, struct thread_rb_node, rb_node)->thread);
> > -}
> > -
> > -static struct thread_rb_node *thread_rb_node__find(const struct thread *th,
> > -                                                  struct rb_root *tree)
> > -{
> > -       pid_t to_find = thread__tid(th);
> > -       struct rb_node *nd = rb_find(&to_find, tree, thread_rb_node__cmp_tid);
> > -
> > -       return rb_entry(nd, struct thread_rb_node, rb_node);
> > -}
> > -
> >  static int machine__set_mmap_name(struct machine *machine)
> >  {
> >         if (machine__is_host(machine))
> > @@ -120,7 +88,7 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
> >         RB_CLEAR_NODE(&machine->rb_node);
> >         dsos__init(&machine->dsos);
> >
> > -       machine__threads_init(machine);
> > +       threads__init(&machine->threads);
> >
> >         machine->vdso_info = NULL;
> >         machine->env = NULL;
> > @@ -221,27 +189,11 @@ static void dsos__exit(struct dsos *dsos)
> >
> >  void machine__delete_threads(struct machine *machine)
> >  {
> > -       struct rb_node *nd;
> > -       int i;
> > -
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > -               struct threads *threads = &machine->threads[i];
> > -               down_write(&threads->lock);
> > -               nd = rb_first_cached(&threads->entries);
> > -               while (nd) {
> > -                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> > -
> > -                       nd = rb_next(nd);
> > -                       __machine__remove_thread(machine, trb, trb->thread, false);
> > -               }
> > -               up_write(&threads->lock);
> > -       }
> > +       threads__remove_all_threads(&machine->threads);
> >  }
> >
> >  void machine__exit(struct machine *machine)
> >  {
> > -       int i;
> > -
> >         if (machine == NULL)
> >                 return;
> >
> > @@ -254,12 +206,7 @@ void machine__exit(struct machine *machine)
> >         zfree(&machine->current_tid);
> >         zfree(&machine->kallsyms_filename);
> >
> > -       machine__delete_threads(machine);
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > -               struct threads *threads = &machine->threads[i];
> > -
> > -               exit_rwsem(&threads->lock);
> > -       }
> > +       threads__exit(&machine->threads);
> >  }
> >
> >  void machine__delete(struct machine *machine)
> > @@ -526,7 +473,7 @@ static void machine__update_thread_pid(struct machine *machine,
> >         if (thread__pid(th) == thread__tid(th))
> >                 return;
> >
> > -       leader = __machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
> > +       leader = machine__findnew_thread(machine, thread__pid(th), thread__pid(th));
> >         if (!leader)
> >                 goto out_err;
> >
> > @@ -560,160 +507,55 @@ static void machine__update_thread_pid(struct machine *machine,
> >         goto out_put;
> >  }
> >
> > -/*
> > - * Front-end cache - TID lookups come in blocks,
> > - * so most of the time we dont have to look up
> > - * the full rbtree:
> > - */
> > -static struct thread*
> > -__threads__get_last_match(struct threads *threads, struct machine *machine,
> > -                         int pid, int tid)
> > -{
> > -       struct thread *th;
> > -
> > -       th = threads->last_match;
> > -       if (th != NULL) {
> > -               if (thread__tid(th) == tid) {
> > -                       machine__update_thread_pid(machine, th, pid);
> > -                       return thread__get(th);
> > -               }
> > -               thread__put(threads->last_match);
> > -               threads->last_match = NULL;
> > -       }
> > -
> > -       return NULL;
> > -}
> > -
> > -static struct thread*
> > -threads__get_last_match(struct threads *threads, struct machine *machine,
> > -                       int pid, int tid)
> > -{
> > -       struct thread *th = NULL;
> > -
> > -       if (perf_singlethreaded)
> > -               th = __threads__get_last_match(threads, machine, pid, tid);
> > -
> > -       return th;
> > -}
> > -
> > -static void
> > -__threads__set_last_match(struct threads *threads, struct thread *th)
> > -{
> > -       thread__put(threads->last_match);
> > -       threads->last_match = thread__get(th);
> > -}
> > -
> > -static void
> > -threads__set_last_match(struct threads *threads, struct thread *th)
> > -{
> > -       if (perf_singlethreaded)
> > -               __threads__set_last_match(threads, th);
> > -}
> > -
> >  /*
> >   * Caller must eventually drop thread->refcnt returned with a successful
> >   * lookup/new thread inserted.
> >   */
> > -static struct thread *____machine__findnew_thread(struct machine *machine,
> > -                                                 struct threads *threads,
> > -                                                 pid_t pid, pid_t tid,
> > -                                                 bool create)
> > +static struct thread *__machine__findnew_thread(struct machine *machine,
> > +                                               pid_t pid,
> > +                                               pid_t tid,
> > +                                               bool create)
> >  {
> > -       struct rb_node **p = &threads->entries.rb_root.rb_node;
> > -       struct rb_node *parent = NULL;
> > -       struct thread *th;
> > -       struct thread_rb_node *nd;
> > -       bool leftmost = true;
> > +       struct thread *th = threads__find(&machine->threads, tid);
> > +       bool created;
> >
> > -       th = threads__get_last_match(threads, machine, pid, tid);
> > -       if (th)
> > +       if (th) {
> > +               machine__update_thread_pid(machine, th, pid);
> >                 return th;
> > -
> > -       while (*p != NULL) {
> > -               parent = *p;
> > -               th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> > -
> > -               if (thread__tid(th) == tid) {
> > -                       threads__set_last_match(threads, th);
> > -                       machine__update_thread_pid(machine, th, pid);
> > -                       return thread__get(th);
> > -               }
> > -
> > -               if (tid < thread__tid(th))
> > -                       p = &(*p)->rb_left;
> > -               else {
> > -                       p = &(*p)->rb_right;
> > -                       leftmost = false;
> > -               }
> >         }
> > -
> >         if (!create)
> >                 return NULL;
> >
> > -       th = thread__new(pid, tid);
> > -       if (th == NULL)
> > -               return NULL;
> > -
> > -       nd = malloc(sizeof(*nd));
> > -       if (nd == NULL) {
> > -               thread__put(th);
> > -               return NULL;
> > -       }
> > -       nd->thread = th;
> > -
> > -       rb_link_node(&nd->rb_node, parent, p);
> > -       rb_insert_color_cached(&nd->rb_node, &threads->entries, leftmost);
> > -       /*
> > -        * We have to initialize maps separately after rb tree is updated.
> > -        *
> > -        * The reason is that we call machine__findnew_thread within
> > -        * thread__init_maps to find the thread leader and that would screwed
> > -        * the rb tree.
> > -        */
> > -       if (thread__init_maps(th, machine)) {
> > -               pr_err("Thread init failed thread %d\n", pid);
> > -               rb_erase_cached(&nd->rb_node, &threads->entries);
> > -               RB_CLEAR_NODE(&nd->rb_node);
> > -               free(nd);
> > -               thread__put(th);
> > -               return NULL;
> > -       }
> > -       /*
> > -        * It is now in the rbtree, get a ref
> > -        */
> > -       threads__set_last_match(threads, th);
> > -       ++threads->nr;
> > -
> > -       return thread__get(th);
> > -}
> > +       th = threads__findnew(&machine->threads, pid, tid, &created);
> > +       if (created) {
> > +               /*
> > +                * We have to initialize maps separately after rb tree is
> > +                * updated.
> > +                *
> > +                * The reason is that we call machine__findnew_thread within
> > +                * thread__init_maps to find the thread leader and that would
> > +                * screwed the rb tree.
> > +                */
> > +               if (thread__init_maps(th, machine)) {
> > +                       pr_err("Thread init failed thread %d\n", pid);
> > +                       threads__remove(&machine->threads, th);
> > +                       thread__put(th);
> > +                       return NULL;
> > +               }
> > +       } else
> > +               machine__update_thread_pid(machine, th, pid);
> >
> > -struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
> > -{
> > -       return ____machine__findnew_thread(machine, machine__threads(machine, tid), pid, tid, true);
> > +       return th;
> >  }
> >
> > -struct thread *machine__findnew_thread(struct machine *machine, pid_t pid,
> > -                                      pid_t tid)
> > +struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid)
> >  {
> > -       struct threads *threads = machine__threads(machine, tid);
> > -       struct thread *th;
> > -
> > -       down_write(&threads->lock);
> > -       th = __machine__findnew_thread(machine, pid, tid);
> > -       up_write(&threads->lock);
> > -       return th;
> > +       return __machine__findnew_thread(machine, pid, tid, /*create=*/true);
> >  }
> >
> > -struct thread *machine__find_thread(struct machine *machine, pid_t pid,
> > -                                   pid_t tid)
> > +struct thread *machine__find_thread(struct machine *machine, pid_t pid, pid_t tid)
> >  {
> > -       struct threads *threads = machine__threads(machine, tid);
> > -       struct thread *th;
> > -
> > -       down_read(&threads->lock);
> > -       th =  ____machine__findnew_thread(machine, threads, pid, tid, false);
> > -       up_read(&threads->lock);
> > -       return th;
> > +       return __machine__findnew_thread(machine, pid, tid, /*create=*/false);
> >  }
> >
> >  /*
> > @@ -1127,23 +969,13 @@ static int machine_fprintf_cb(struct thread *thread, void *data)
> >         return 0;
> >  }
> >
> > -static size_t machine__threads_nr(const struct machine *machine)
> > -{
> > -       size_t nr = 0;
> > -
> > -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> > -               nr += machine->threads[i].nr;
> > -
> > -       return nr;
> > -}
> > -
> >  size_t machine__fprintf(struct machine *machine, FILE *fp)
> >  {
> >         struct machine_fprintf_cb_args args = {
> >                 .fp = fp,
> >                 .printed = 0,
> >         };
> > -       size_t ret = fprintf(fp, "Threads: %zu\n", machine__threads_nr(machine));
> > +       size_t ret = fprintf(fp, "Threads: %zu\n", threads__nr(&machine->threads));
> >
> >         machine__for_each_thread(machine, machine_fprintf_cb, &args);
> >         return ret + args.printed;
> > @@ -2069,36 +1901,9 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
> >         return 0;
> >  }
> >
> > -static void __machine__remove_thread(struct machine *machine, struct thread_rb_node *nd,
> > -                                    struct thread *th, bool lock)
> > -{
> > -       struct threads *threads = machine__threads(machine, thread__tid(th));
> > -
> > -       if (!nd)
> > -               nd = thread_rb_node__find(th, &threads->entries.rb_root);
> > -
> > -       if (threads->last_match && RC_CHK_EQUAL(threads->last_match, th))
> > -               threads__set_last_match(threads, NULL);
> > -
> > -       if (lock)
> > -               down_write(&threads->lock);
> > -
> > -       BUG_ON(refcount_read(thread__refcnt(th)) == 0);
> > -
> > -       thread__put(nd->thread);
> > -       rb_erase_cached(&nd->rb_node, &threads->entries);
> > -       RB_CLEAR_NODE(&nd->rb_node);
> > -       --threads->nr;
> > -
> > -       free(nd);
> > -
> > -       if (lock)
> > -               up_write(&threads->lock);
> > -}
> > -
> >  void machine__remove_thread(struct machine *machine, struct thread *th)
> >  {
> > -       return __machine__remove_thread(machine, NULL, th, true);
> > +       return threads__remove(&machine->threads, th);
> >  }
> >
> >  int machine__process_fork_event(struct machine *machine, union perf_event *event,
> > @@ -3232,23 +3037,7 @@ int machine__for_each_thread(struct machine *machine,
> >                              int (*fn)(struct thread *thread, void *p),
> >                              void *priv)
> >  {
> > -       struct threads *threads;
> > -       struct rb_node *nd;
> > -       int rc = 0;
> > -       int i;
> > -
> > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > -               threads = &machine->threads[i];
> > -               for (nd = rb_first_cached(&threads->entries); nd;
> > -                    nd = rb_next(nd)) {
> > -                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> > -
> > -                       rc = fn(trb->thread, priv);
> > -                       if (rc != 0)
> > -                               return rc;
> > -               }
> > -       }
> > -       return rc;
> > +       return threads__for_each_thread(&machine->threads, fn, priv);
> >  }
> >
> >  int machines__for_each_thread(struct machines *machines,
> > diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> > index b738ce84817b..e28c787616fe 100644
> > --- a/tools/perf/util/machine.h
> > +++ b/tools/perf/util/machine.h
> > @@ -7,6 +7,7 @@
> >  #include "maps.h"
> >  #include "dsos.h"
> >  #include "rwsem.h"
> > +#include "threads.h"
> >
> >  struct addr_location;
> >  struct branch_stack;
> > @@ -28,16 +29,6 @@ extern const char *ref_reloc_sym_names[];
> >
> >  struct vdso_info;
> >
> > -#define THREADS__TABLE_BITS    8
> > -#define THREADS__TABLE_SIZE    (1 << THREADS__TABLE_BITS)
> > -
> > -struct threads {
> > -       struct rb_root_cached  entries;
> > -       struct rw_semaphore    lock;
> > -       unsigned int           nr;
> > -       struct thread          *last_match;
> > -};
> > -
> >  struct machine {
> >         struct rb_node    rb_node;
> >         pid_t             pid;
> > @@ -48,7 +39,7 @@ struct machine {
> >         char              *root_dir;
> >         char              *mmap_name;
> >         char              *kallsyms_filename;
> > -       struct threads    threads[THREADS__TABLE_SIZE];
> > +       struct threads    threads;
> >         struct vdso_info  *vdso_info;
> >         struct perf_env   *env;
> >         struct dsos       dsos;
> > @@ -69,12 +60,6 @@ struct machine {
> >         bool              trampolines_mapped;
> >  };
> >
> > -static inline struct threads *machine__threads(struct machine *machine, pid_t tid)
> > -{
> > -       /* Cast it to handle tid == -1 */
> > -       return &machine->threads[(unsigned int)tid % THREADS__TABLE_SIZE];
> > -}
> > -
> >  /*
> >   * The main kernel (vmlinux) map
> >   */
> > @@ -220,7 +205,6 @@ bool machine__is(struct machine *machine, const char *arch);
> >  bool machine__normalized_is(struct machine *machine, const char *arch);
> >  int machine__nr_cpus_avail(struct machine *machine);
> >
> > -struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
> >  struct thread *machine__findnew_thread(struct machine *machine, pid_t pid, pid_t tid);
> >
> >  struct dso *machine__findnew_dso_id(struct machine *machine, const char *filename, struct dso_id *id);
> > diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
> > index c59ab4d79163..1aa8962dcf52 100644
> > --- a/tools/perf/util/thread.c
> > +++ b/tools/perf/util/thread.c
> > @@ -26,7 +26,7 @@ int thread__init_maps(struct thread *thread, struct machine *machine)
> >         if (pid == thread__tid(thread) || pid == -1) {
> >                 thread__set_maps(thread, maps__new(machine));
> >         } else {
> > -               struct thread *leader = __machine__findnew_thread(machine, pid, pid);
> > +               struct thread *leader = machine__findnew_thread(machine, pid, pid);
> >
> >                 if (leader) {
> >                         thread__set_maps(thread, maps__get(thread__maps(leader)));
> > diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
> > index 0df775b5c110..4b8f3e9e513b 100644
> > --- a/tools/perf/util/thread.h
> > +++ b/tools/perf/util/thread.h
> > @@ -3,7 +3,6 @@
> >  #define __PERF_THREAD_H
> >
> >  #include <linux/refcount.h>
> > -#include <linux/rbtree.h>
> >  #include <linux/list.h>
> >  #include <stdio.h>
> >  #include <unistd.h>
> > @@ -30,11 +29,6 @@ struct lbr_stitch {
> >         struct callchain_cursor_node    *prev_lbr_cursor;
> >  };
> >
> > -struct thread_rb_node {
> > -       struct rb_node rb_node;
> > -       struct thread *thread;
> > -};
> > -
> >  DECLARE_RC_STRUCT(thread) {
> >         /** @maps: mmaps associated with this thread. */
> >         struct maps             *maps;
> > diff --git a/tools/perf/util/threads.c b/tools/perf/util/threads.c
> > new file mode 100644
> > index 000000000000..d984ec939c7b
> > --- /dev/null
> > +++ b/tools/perf/util/threads.c
> > @@ -0,0 +1,244 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "threads.h"
> > +#include "machine.h"
> > +#include "thread.h"
> > +
> > +struct thread_rb_node {
> > +       struct rb_node rb_node;
> > +       struct thread *thread;
> > +};
> > +
> > +static struct threads_table_entry *threads__table(struct threads *threads, pid_t tid)
> > +{
> > +       /* Cast it to handle tid == -1 */
> > +       return &threads->table[(unsigned int)tid % THREADS__TABLE_SIZE];
> > +}
> > +
> > +void threads__init(struct threads *threads)
> > +{
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +
> > +               table->entries = RB_ROOT_CACHED;
> > +               init_rwsem(&table->lock);
> > +               table->nr = 0;
> > +               table->last_match = NULL;
> > +       }
> > +}
> > +
> > +void threads__exit(struct threads *threads)
> > +{
> > +       threads__remove_all_threads(threads);
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +
> > +               exit_rwsem(&table->lock);
> > +       }
> > +}
> > +
> > +size_t threads__nr(struct threads *threads)
> > +{
> > +       size_t nr = 0;
> > +
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +
> > +               down_read(&table->lock);
> > +               nr += table->nr;
> > +               up_read(&table->lock);
> > +       }
> > +       return nr;
> > +}
> > +
> > +/*
> > + * Front-end cache - TID lookups come in blocks,
> > + * so most of the time we dont have to look up
> > + * the full rbtree:
> > + */
> > +static struct thread *__threads_table_entry__get_last_match(struct threads_table_entry *table,
> > +                                                           pid_t tid)
> > +{
> > +       struct thread *th, *res = NULL;
> > +
> > +       th = table->last_match;
> > +       if (th != NULL) {
> > +               if (thread__tid(th) == tid)
> > +                       res = thread__get(th);
> > +       }
> > +       return res;
> > +}
> > +
> > +static void __threads_table_entry__set_last_match(struct threads_table_entry *table,
> > +                                                 struct thread *th)
> > +{
> > +       thread__put(table->last_match);
> > +       table->last_match = thread__get(th);
> > +}
> > +
> > +static void threads_table_entry__set_last_match(struct threads_table_entry *table,
> > +                                               struct thread *th)
> > +{
> > +       down_write(&table->lock);
> > +       __threads_table_entry__set_last_match(table, th);
> > +       up_write(&table->lock);
> > +}
> > +
> > +struct thread *threads__find(struct threads *threads, pid_t tid)
> > +{
> > +       struct threads_table_entry *table  = threads__table(threads, tid);
> > +       struct rb_node **p;
> > +       struct thread *res = NULL;
> > +
> > +       down_read(&table->lock);
> > +       res = __threads_table_entry__get_last_match(table, tid);
> > +       if (res) {
> > +               up_read(&table->lock);
> > +               return res;
> > +       }
> > +
> > +       p = &table->entries.rb_root.rb_node;
> > +       while (*p != NULL) {
> > +               struct rb_node *parent = *p;
> > +               struct thread *th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> > +
> > +               if (thread__tid(th) == tid) {
> > +                       res = thread__get(th);
> > +                       break;
> > +               }
> > +
> > +               if (tid < thread__tid(th))
> > +                       p = &(*p)->rb_left;
> > +               else
> > +                       p = &(*p)->rb_right;
> > +       }
> > +       up_read(&table->lock);
> > +       if (res)
> > +               threads_table_entry__set_last_match(table, res);
> > +       return res;
> > +}
> > +
> > +struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created)
> > +{
> > +       struct threads_table_entry *table = threads__table(threads, tid);
> > +       struct rb_node **p;
> > +       struct rb_node *parent = NULL;
> > +       struct thread *res = NULL;
> > +       struct thread_rb_node *nd;
> > +       bool leftmost = true;
> > +
> > +       *created = false;
> > +       down_write(&table->lock);
> > +       p = &table->entries.rb_root.rb_node;
> > +       while (*p != NULL) {
> > +               struct thread *th;
> > +
> > +               parent = *p;
> > +               th = rb_entry(parent, struct thread_rb_node, rb_node)->thread;
> > +
> > +               if (thread__tid(th) == tid) {
> > +                       __threads_table_entry__set_last_match(table, th);
> > +                       res = thread__get(th);
> > +                       goto out_unlock;
> > +               }
> > +
> > +               if (tid < thread__tid(th))
> > +                       p = &(*p)->rb_left;
> > +               else {
> > +                       leftmost = false;
> > +                       p = &(*p)->rb_right;
> > +               }
> > +       }
> > +       nd = malloc(sizeof(*nd));
> > +       if (nd == NULL)
> > +               goto out_unlock;
> > +       res = thread__new(pid, tid);
> > +       if (!res)
> > +               free(nd);
> > +       else {
> > +               *created = true;
> > +               nd->thread = thread__get(res);
> > +               rb_link_node(&nd->rb_node, parent, p);
> > +               rb_insert_color_cached(&nd->rb_node, &table->entries, leftmost);
> > +               ++table->nr;
> > +               __threads_table_entry__set_last_match(table, res);
> > +       }
> > +out_unlock:
> > +       up_write(&table->lock);
> > +       return res;
> > +}
> > +
> > +void threads__remove_all_threads(struct threads *threads)
> > +{
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +               struct rb_node *nd;
> > +
> > +               down_write(&table->lock);
> > +               __threads_table_entry__set_last_match(table, NULL);
> > +               nd = rb_first_cached(&table->entries);
> > +               while (nd) {
> > +                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> > +
> > +                       nd = rb_next(nd);
> > +                       thread__put(trb->thread);
> > +                       rb_erase_cached(&trb->rb_node, &table->entries);
> > +                       RB_CLEAR_NODE(&trb->rb_node);
> > +                       --table->nr;
> > +
> > +                       free(trb);
> > +               }
> > +               assert(table->nr == 0);
> > +               up_write(&table->lock);
> > +       }
> > +}
> > +
> > +void threads__remove(struct threads *threads, struct thread *thread)
> > +{
> > +       struct rb_node **p;
> > +       struct threads_table_entry *table = threads__table(threads, thread__tid(thread));
> > +       pid_t tid = thread__tid(thread);
> > +
> > +       down_write(&table->lock);
> > +       if (table->last_match && RC_CHK_EQUAL(table->last_match, thread))
> > +               __threads_table_entry__set_last_match(table, NULL);
> > +
> > +       p = &table->entries.rb_root.rb_node;
> > +       while (*p != NULL) {
> > +               struct rb_node *parent = *p;
> > +               struct thread_rb_node *nd = rb_entry(parent, struct thread_rb_node, rb_node);
> > +               struct thread *th = nd->thread;
> > +
> > +               if (RC_CHK_EQUAL(th, thread)) {
> > +                       thread__put(nd->thread);
> > +                       rb_erase_cached(&nd->rb_node, &table->entries);
> > +                       RB_CLEAR_NODE(&nd->rb_node);
> > +                       --table->nr;
> > +                       free(nd);
> > +                       break;
> > +               }
> > +
> > +               if (tid < thread__tid(th))
> > +                       p = &(*p)->rb_left;
> > +               else
> > +                       p = &(*p)->rb_right;
> > +       }
> > +       up_write(&table->lock);
> > +}
> > +
> > +int threads__for_each_thread(struct threads *threads,
> > +                            int (*fn)(struct thread *thread, void *data),
> > +                            void *data)
> > +{
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +               struct rb_node *nd;
> > +
> > +               for (nd = rb_first_cached(&table->entries); nd; nd = rb_next(nd)) {
> > +                       struct thread_rb_node *trb = rb_entry(nd, struct thread_rb_node, rb_node);
> > +                       int rc = fn(trb->thread, data);
> > +
> > +                       if (rc != 0)
> > +                               return rc;
> > +               }
> > +       }
> > +       return 0;
> > +}
> > diff --git a/tools/perf/util/threads.h b/tools/perf/util/threads.h
> > new file mode 100644
> > index 000000000000..ed67de627578
> > --- /dev/null
> > +++ b/tools/perf/util/threads.h
> > @@ -0,0 +1,35 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef __PERF_THREADS_H
> > +#define __PERF_THREADS_H
> > +
> > +#include <linux/rbtree.h>
> > +#include "rwsem.h"
> > +
> > +struct thread;
> > +
> > +#define THREADS__TABLE_BITS    8
> > +#define THREADS__TABLE_SIZE    (1 << THREADS__TABLE_BITS)
> > +
> > +struct threads_table_entry {
> > +       struct rb_root_cached  entries;
> > +       struct rw_semaphore    lock;
> > +       unsigned int           nr;
> > +       struct thread          *last_match;
> > +};
> > +
> > +struct threads {
> > +       struct threads_table_entry table[THREADS__TABLE_SIZE];
> > +};
> > +
> > +void threads__init(struct threads *threads);
> > +void threads__exit(struct threads *threads);
> > +size_t threads__nr(struct threads *threads);
> > +struct thread *threads__find(struct threads *threads, pid_t tid);
> > +struct thread *threads__findnew(struct threads *threads, pid_t pid, pid_t tid, bool *created);
> > +void threads__remove_all_threads(struct threads *threads);
> > +void threads__remove(struct threads *threads, struct thread *thread);
> > +int threads__for_each_thread(struct threads *threads,
> > +                            int (*fn)(struct thread *thread, void *data),
> > +                            void *data);
> > +
> > +#endif /* __PERF_THREADS_H */
> > --
> > 2.43.0.687.g38aa6559b0-goog
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27  7:24     ` Ian Rogers
@ 2024-02-27 17:31       ` Namhyung Kim
  2024-02-27 19:02         ` Ian Rogers
  2024-02-27 19:17         ` Arnaldo Carvalho de Melo
  2024-02-29 21:59       ` David Laight
  1 sibling, 2 replies; 35+ messages in thread
From: Namhyung Kim @ 2024-02-27 17:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Mon, Feb 26, 2024 at 11:24 PM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, Feb 26, 2024 at 11:07 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Move threads out of machine and move thread_rb_node into the C
> > > file. This hides the implementation of threads from the rest of the
> > > code allowing for it to be refactored.
> > >
> > > Locking discipline is tightened up in this change.
> >
> > Doesn't look like a simple code move.  Can we split the locking
> > change from the move to make the reviewer's life a bit easier? :)
>
> Not sure I follow. Take threads_nr as an example.
>
> The old code is in machine.c, so:
> -static size_t machine__threads_nr(const struct machine *machine)
> -{
> -       size_t nr = 0;
> -
> -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> -               nr += machine->threads[i].nr;
> -
> -       return nr;
> -}
>
> The new code is in threads.c:
> +size_t threads__nr(struct threads *threads)
> +{
> +       size_t nr = 0;
> +
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +
> +               down_read(&table->lock);
> +               nr += table->nr;
> +               up_read(&table->lock);
> +       }
> +       return nr;
> +}
>
> So it is a copy paste from one file to the other. The only difference
> is that the old code failed to take a lock when reading "nr" so the
> locking is added. I wanted to make sure all the functions in threads.c
> were properly correct wrt locking, semaphore creation and destruction,
> etc.  We could have a broken threads.c and fix it in the next change,
> but given that's a bug it could make bisection more difficult.
> Ultimately I thought the locking changes were small enough to not
> warrant being on their own compared to the advantages of having a sane
> threads abstraction.

I can see some other differences like machine__findnew_thread()
which I think is due to the locking change.  Maybe we can fix the
problem before moving the code and keep the code move simple.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27 17:31       ` Namhyung Kim
@ 2024-02-27 19:02         ` Ian Rogers
  2024-02-27 19:17         ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-27 19:02 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 9:31 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, Feb 26, 2024 at 11:24 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Mon, Feb 26, 2024 at 11:07 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > Move threads out of machine and move thread_rb_node into the C
> > > > file. This hides the implementation of threads from the rest of the
> > > > code allowing for it to be refactored.
> > > >
> > > > Locking discipline is tightened up in this change.
> > >
> > > Doesn't look like a simple code move.  Can we split the locking
> > > change from the move to make the reviewer's life a bit easier? :)
> >
> > Not sure I follow. Take threads_nr as an example.
> >
> > The old code is in machine.c, so:
> > -static size_t machine__threads_nr(const struct machine *machine)
> > -{
> > -       size_t nr = 0;
> > -
> > -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> > -               nr += machine->threads[i].nr;
> > -
> > -       return nr;
> > -}
> >
> > The new code is in threads.c:
> > +size_t threads__nr(struct threads *threads)
> > +{
> > +       size_t nr = 0;
> > +
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +
> > +               down_read(&table->lock);
> > +               nr += table->nr;
> > +               up_read(&table->lock);
> > +       }
> > +       return nr;
> > +}
> >
> > So it is a copy paste from one file to the other. The only difference
> > is that the old code failed to take a lock when reading "nr" so the
> > locking is added. I wanted to make sure all the functions in threads.c
> > were properly correct wrt locking, semaphore creation and destruction,
> > etc.  We could have a broken threads.c and fix it in the next change,
> > but given that's a bug it could make bisection more difficult.
> > Ultimately I thought the locking changes were small enough to not
> > warrant being on their own compared to the advantages of having a sane
> > threads abstraction.
>
> I can see some other differences like machine__findnew_thread()
> which I think is due to the locking change.  Maybe we can fix the
> problem before moving the code and let the code move simple.

I'll see what I can split out in v2. I don't think findnew will change
and the nr change is trivial. In the previous code the lock is taken
before calling __machine__findnew_thread, which doesn't make sense
once we abstract things inside of threads, where the lock should be
taken/released in the threads code and not the machine code. Moving
the lock into __machine__findnew_thread doesn't really make sense
either, as the __ prefix implies the lock is already held.

Thanks,
Ian

> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27 17:31       ` Namhyung Kim
  2024-02-27 19:02         ` Ian Rogers
@ 2024-02-27 19:17         ` Arnaldo Carvalho de Melo
  2024-02-27 21:42           ` Ian Rogers
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-02-27 19:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> I can see some other differences like machine__findnew_thread()
> which I think is due to the locking change.  Maybe we can fix the
> problem before moving the code and let the code move simple.

I was going to suggest that, agreed.

We may start doing a refactoring, then find a bug; at that point we
first fix the problem, then go back to refactoring.

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27 19:17         ` Arnaldo Carvalho de Melo
@ 2024-02-27 21:42           ` Ian Rogers
  2024-02-28  6:39             ` Namhyung Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-27 21:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, Oliver Upton,
	Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 11:17 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> > I can see some other differences like machine__findnew_thread()
> > which I think is due to the locking change.  Maybe we can fix the
> > problem before moving the code and let the code move simple.
>
> I was going to suggest that, agreed.
>
> We may start doing a refactoring, then find a bug, at that point we
> first fix the problem them go back to refactoring.

Sure, I do this all the time. Your typical complaint on the v+1 patch
set is to move the bug fixes to the front of the changes. On the v+2
patch set the bug fixes get applied but not the rest of the patch
series, etc.

Here we are refactoring code for an rb-tree implementation of threads
and worrying about its correctness. There's no indication it's not
correct, it is largely copy and paste, and there is also good evidence
in the locking discipline that it is more correct. The next patch
deletes that implementation, replacing it with a hash table. Were I
not trying to break things apart I could squash those 2 patches
together, but I've tried to do the right thing. Now we're trying to
micro-correct, break apart, etc. a state that gets deleted. A reviewer
could equally criticise this being 2 changes rather than 1, and the
cognitive load of having to look at code that gets deleted. At some
point it is a judgement call, and I think this patch is actually the
right size. I think what is missing here is some motivation in the
commit message for the findnew refactoring, so I'll add that.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-27  7:12     ` Ian Rogers
@ 2024-02-28  6:11       ` Namhyung Kim
  2024-02-28  7:05         ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Namhyung Kim @ 2024-02-28  6:11 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Mon, Feb 26, 2024 at 11:12 PM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, Feb 26, 2024 at 10:39 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > threads") made the iteration of thread tids unordered. The perf report
> > > --tasks output now shows child threads in an order determined by the
> > > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > > though they have the same ppid 2:
> > >
> > > ```
> > > $ perf report --tasks
> > > %      pid      tid     ppid  comm
> > >          0        0       -1 |swapper
> > >          2        2        0 | kthreadd
> > >        256      256        2 |  kworker/12:1H-k
> > >     693761   693761        2 |  kworker/10:1-mm
> > >    1301762  1301762        2 |  kworker/1:1-mm_
> > >    1302530  1302530        2 |  kworker/u32:0-k
> > >          3        3        2 |  rcu_gp
> > > ...
> > > ```
> > >
> > > The output is easier to read if threads appear numerically
> > > increasing. To allow for this, read all threads into a list then sort
> > > with a comparator that orders by the child task's of the first common
> > > parent. The list creation and deletion are created as utilities on
> > > machine.  The indentation is possible by counting the number of
> > > parents a child has.
> > >
> > > With this change the output for the same data file is now like:
> > > ```
> > > $ perf report --tasks
> > > %      pid      tid     ppid  comm
> > >          0        0       -1 |swapper
> > >          1        1        0 | systemd
> > >        823      823        1 |  systemd-journal
> > >        853      853        1 |  systemd-udevd
> > >       3230     3230        1 |  systemd-timesyn
> > >       3236     3236        1 |  auditd
> > >       3239     3239     3236 |   audisp-syslog
> > >       3321     3321        1 |  accounts-daemon
> > > ...
> > > ```
> > >
> > > Signed-off-by: Ian Rogers <irogers@google.com>

I know you sent out v2 already, but let me continue the discussion
here.


> > > ---
> > >  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
> > >  tools/perf/util/machine.c   |  30 ++++++
> > >  tools/perf/util/machine.h   |  10 ++
> > >  3 files changed, 155 insertions(+), 88 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > index 8e16fa261e6f..b48f1d5309e3 100644
> > > --- a/tools/perf/builtin-report.c
> > > +++ b/tools/perf/builtin-report.c
> > > @@ -59,6 +59,7 @@
> > >  #include <linux/ctype.h>
> > >  #include <signal.h>
> > >  #include <linux/bitmap.h>
> > > +#include <linux/list_sort.h>
> > >  #include <linux/string.h>
> > >  #include <linux/stringify.h>
> > >  #include <linux/time64.h>
> > > @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
> > >         rep->tool.no_warn = true;
> > >  }
> > >
> > > -struct task {
> > > -       struct thread           *thread;
> > > -       struct list_head         list;
> > > -       struct list_head         children;
> > > -};
> > > -
> > > -static struct task *tasks_list(struct task *task, struct machine *machine)
> > > -{
> > > -       struct thread *parent_thread, *thread = task->thread;
> > > -       struct task   *parent_task;
> > > -
> > > -       /* Already listed. */
> > > -       if (!list_empty(&task->list))
> > > -               return NULL;
> > > -
> > > -       /* Last one in the chain. */
> > > -       if (thread__ppid(thread) == -1)
> > > -               return task;
> > > -
> > > -       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > -       if (!parent_thread)
> > > -               return ERR_PTR(-ENOENT);
> > > -
> > > -       parent_task = thread__priv(parent_thread);
> > > -       thread__put(parent_thread);
> > > -       list_add_tail(&task->list, &parent_task->children);
> > > -       return tasks_list(parent_task, machine);
> > > -}
> > > -
> > >  struct maps__fprintf_task_args {
> > >         int indent;
> > >         FILE *fp;
> > > @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
> > >         return args.printed;
> > >  }
> > >
> > > -static void task__print_level(struct task *task, FILE *fp, int level)
> > > +static int thread_level(struct machine *machine, const struct thread *thread)
> > >  {
> > > -       struct thread *thread = task->thread;
> > > -       struct task *child;
> > > -       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > -                                 thread__pid(thread), thread__tid(thread),
> > > -                                 thread__ppid(thread), level, "");
> > > +       struct thread *parent_thread;
> > > +       int res;
> > >
> > > -       fprintf(fp, "%s\n", thread__comm_str(thread));
> > > +       if (thread__tid(thread) <= 0)
> > > +               return 0;
> > >
> > > -       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > +       if (thread__ppid(thread) <= 0)
> > > +               return 1;
> > >
> > > -       if (!list_empty(&task->children)) {
> > > -               list_for_each_entry(child, &task->children, list)
> > > -                       task__print_level(child, fp, level + 1);
> > > +       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > +       if (!parent_thread) {
> > > +               pr_err("Missing parent thread of %d\n", thread__tid(thread));
> > > +               return 0;
> > >         }
> > > +       res = 1 + thread_level(machine, parent_thread);
> > > +       thread__put(parent_thread);
> > > +       return res;
> > >  }
> > >
> > > -static int tasks_print(struct report *rep, FILE *fp)
> > > +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
> > >  {
> > > -       struct perf_session *session = rep->session;
> > > -       struct machine      *machine = &session->machines.host;
> > > -       struct task *tasks, *task;
> > > -       unsigned int nr = 0, itask = 0, i;
> > > -       struct rb_node *nd;
> > > -       LIST_HEAD(list);
> > > +       int level = thread_level(machine, thread);
> > > +       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > +                                 thread__pid(thread), thread__tid(thread),
> > > +                                 thread__ppid(thread), level, "");
> > >
> > > -       /*
> > > -        * No locking needed while accessing machine->threads,
> > > -        * because --tasks is single threaded command.
> > > -        */
> > > +       fprintf(fp, "%s\n", thread__comm_str(thread));
> > >
> > > -       /* Count all the threads. */
> > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++)
> > > -               nr += machine->threads[i].nr;
> > > +       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > +}
> > >
> > > -       tasks = malloc(sizeof(*tasks) * nr);
> > > -       if (!tasks)
> > > -               return -ENOMEM;
> > > +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
> >
> > I'm a little afraid that this comparison logic becomes complex.
> > But I think it's better than having a tree of thread relationship.
> > Just a comment that explains why we need this would be nice.
>
> I can add something in v2.
>
> >
> > > +{
> > > +       struct machine *machine = priv;
> > > +       struct thread_list *task_a = list_entry(la, struct thread_list, list);
> > > +       struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> > > +       struct thread *a = task_a->thread;
> > > +       struct thread *b = task_b->thread;
> > > +       int level_a, level_b, res;
> > > +
> > > +       /* Compare a and b to root. */
> > > +       if (thread__tid(a) == thread__tid(b))
> > > +               return 0;
> > >
> > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > > -               struct threads *threads = &machine->threads[i];
> > > +       if (thread__tid(a) == 0)
> > > +               return -1;
> > >
> > > -               for (nd = rb_first_cached(&threads->entries); nd;
> > > -                    nd = rb_next(nd)) {
> > > -                       task = tasks + itask++;
> > > +       if (thread__tid(b) == 0)
> > > +               return 1;
> > >
> > > -                       task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> > > -                       INIT_LIST_HEAD(&task->children);
> > > -                       INIT_LIST_HEAD(&task->list);
> > > -                       thread__set_priv(task->thread, task);
> > > -               }
> > > +       /* If parents match sort by tid. */
> > > +       if (thread__ppid(a) == thread__ppid(b)) {
> > > +               return thread__tid(a) < thread__tid(b)
> > > +                       ? -1
> > > +                       : (thread__tid(a) > thread__tid(b) ? 1 : 0);
> >
> > Can it be simply like this?  We know tid(a) != tid(b).
> >
> >   return thread__tid(a) < thread__tid(b) ? -1 : 1;
>
> Yes, but the parent check is still required.

Sure.  I only meant the return statement.

>
> > >         }
> > >
> > >         /*
> > > -        * Iterate every task down to the unprocessed parent
> > > -        * and link all in task children list. Task with no
> > > -        * parent is added into 'list'.
> > > +        * Find a and b such that if they are a child of each other a and b's
> > > +        * tid's match, otherwise a and b have a common parent and distinct
> > > +        * tid's to sort by. First make the depths of the threads match.
> > >          */
> > > -       for (itask = 0; itask < nr; itask++) {
> > > -               task = tasks + itask;
> > > -
> > > -               if (!list_empty(&task->list))
> > > -                       continue;
> > > -
> > > -               task = tasks_list(task, machine);
> > > -               if (IS_ERR(task)) {
> > > -                       pr_err("Error: failed to process tasks\n");
> > > -                       free(tasks);
> > > -                       return PTR_ERR(task);
> > > +       level_a = thread_level(machine, a);
> > > +       level_b = thread_level(machine, b);
> > > +       a = thread__get(a);
> > > +       b = thread__get(b);
> > > +       for (int i = level_a; i > level_b; i--) {
> > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > +
> > > +               thread__put(a);
> > > +               if (!parent) {
> > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > +                       thread__put(b);
> > > +                       return -1;
> > >                 }
> > > +               a = parent;
> > > +       }
> > > +       for (int i = level_b; i > level_a; i--) {
> > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
> > >
> > > -               if (task)
> > > -                       list_add_tail(&task->list, &list);
> > > +               thread__put(b);
> > > +               if (!parent) {
> > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > +                       thread__put(a);
> > > +                       return 1;
> > > +               }
> > > +               b = parent;
> > > +       }
> > > +       /* Search up to a common parent. */
> > > +       while (thread__ppid(a) != thread__ppid(b)) {
> > > +               struct thread *parent;
> > > +
> > > +               parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > +               thread__put(a);
> > > +               if (!parent)
> > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > +               a = parent;
> > > +               parent = machine__find_thread(machine, -1, thread__ppid(b));
> > > +               thread__put(b);
> > > +               if (!parent)
> > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > +               b = parent;
> > > +               if (!a || !b)
> > > +                       return !a && !b ? 0 : (!a ? -1 : 1);
> >
> > Wouldn't it leak a refcount if either a or b is NULL (not both)?
>
> It would, but this would be an error condition anyway. I can add puts.
>
> >
> > > +       }
> > > +       if (thread__tid(a) == thread__tid(b)) {
> > > +               /* a is a child of b or vice-versa, deeper levels appear later. */
> > > +               res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> > > +       } else {
> > > +               /* Sort by tid now the parent is the same. */
> > > +               res = thread__tid(a) < thread__tid(b) ? -1 : 1;
> > >         }
> > > +       thread__put(a);
> > > +       thread__put(b);
> > > +       return res;
> > > +}
> > > +
> > > +static int tasks_print(struct report *rep, FILE *fp)
> > > +{
> > > +       struct machine *machine = &rep->session->machines.host;
> > > +       LIST_HEAD(tasks);
> > > +       int ret;
> > >
> > > -       fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > > +       ret = machine__thread_list(machine, &tasks);
> > > +       if (!ret) {
> > > +               struct thread_list *task;
> >
> > Do we really need this thread_list?  Why not use an
> > array of threads directly?
>
> The code isn't particularly performance critical. I used a list as it
> best approximated how the rbtree was being used. The code is reused in
> subsequent patches, there's no initial pass to size an array and I
> think the reallocarray/qsort logic is generally more problematic than
> the list ones. If we were worried about performance then I think
> arrays could make sense for optimization, but I think this is good
> enough for now.

Well, it's not about performance.  It made me wonder why we need
this thread_list, but I couldn't find the reason.  If you can move
machine__threads_nr() here then you won't need realloc().

Thanks,
Namhyung

> > >
> > > -       list_for_each_entry(task, &list, list)
> > > -               task__print_level(task, fp, 0);
> > > +               list_sort(machine, &tasks, task_list_cmp);
> > >
> > > -       free(tasks);
> > > -       return 0;
> > > +               fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > > +
> > > +               list_for_each_entry(task, &tasks, list)
> > > +                       task__print_level(machine, task->thread, fp);
> > > +       }
> > > +       thread_list__delete(&tasks);
> > > +       return ret;
> > >  }
> > >
> > >  static int __cmd_report(struct report *rep)
> > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > > index 3da92f18814a..7872ce92c9fc 100644
> > > --- a/tools/perf/util/machine.c
> > > +++ b/tools/perf/util/machine.c
> > > @@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
> > >         return rc;
> > >  }
> > >
> > > +
> > > +static int thread_list_cb(struct thread *thread, void *data)
> > > +{
> > > +       struct list_head *list = data;
> > > +       struct thread_list *entry = malloc(sizeof(*entry));
> > > +
> > > +       if (!entry)
> > > +               return -ENOMEM;
> > > +
> > > +       entry->thread = thread__get(thread);
> > > +       list_add_tail(&entry->list, list);
> > > +       return 0;
> > > +}
> > > +
> > > +int machine__thread_list(struct machine *machine, struct list_head *list)
> > > +{
> > > +       return machine__for_each_thread(machine, thread_list_cb, list);
> > > +}
> > > +
> > > +void thread_list__delete(struct list_head *list)
> > > +{
> > > +       struct thread_list *pos, *next;
> > > +
> > > +       list_for_each_entry_safe(pos, next, list, list) {
> > > +               thread__zput(pos->thread);
> > > +               list_del(&pos->list);
> > > +               free(pos);
> > > +       }
> > > +}
> > > +
> > >  pid_t machine__get_current_tid(struct machine *machine, int cpu)
> > >  {
> > >         if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
> > > diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> > > index 1279acda6a8a..b738ce84817b 100644
> > > --- a/tools/perf/util/machine.h
> > > +++ b/tools/perf/util/machine.h
> > > @@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
> > >                               int (*fn)(struct thread *thread, void *p),
> > >                               void *priv);
> > >
> > > +struct thread_list {
> > > +       struct list_head         list;
> > > +       struct thread           *thread;
> > > +};
> > > +
> > > +/* Make a list of struct thread_list based on threads in the machine. */
> > > +int machine__thread_list(struct machine *machine, struct list_head *list);
> > > +/* Free up the nodes within the thread_list list. */
> > > +void thread_list__delete(struct list_head *list);
> > > +
> > >  pid_t machine__get_current_tid(struct machine *machine, int cpu);
> > >  int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
> > >                              pid_t tid);
> > > --
> > > 2.43.0.687.g38aa6559b0-goog
> > >

^ permalink raw reply	[flat|nested] 35+ messages in thread
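
The machine__thread_list()/thread_list__delete() pair in the patch above
follows a common pattern: iterate a collection through a callback, take a
reference on each element into a freshly allocated list node, and drop the
references again on cleanup. A minimal standalone sketch of that pattern
(toy thread, refcount, and list types invented for illustration — not the
actual perf structures, which use kernel-style list_head and refcount
checking):

```c
#include <stdlib.h>

/* Toy stand-ins for perf's refcounted struct thread and list nodes. */
struct toy_thread {
	int tid;
	int refcnt;
};

struct toy_node {
	struct toy_thread *thread;
	struct toy_node *next;
};

static struct toy_thread *toy_thread__get(struct toy_thread *t)
{
	t->refcnt++;
	return t;
}

static void toy_thread__put(struct toy_thread *t)
{
	t->refcnt--; /* a real implementation would free at zero */
}

/* Callback in the style of thread_list_cb(): grab a ref, push a node. */
static int toy_collect(struct toy_thread *t, struct toy_node **head)
{
	struct toy_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->thread = toy_thread__get(t);
	n->next = *head;
	*head = n;
	return 0;
}

/* Cleanup in the style of thread_list__delete(): drop refs, free nodes. */
static void toy_list__delete(struct toy_node **head)
{
	struct toy_node *pos = *head, *next;

	while (pos) {
		next = pos->next;
		toy_thread__put(pos->thread);
		free(pos);
		pos = next;
	}
	*head = NULL;
}
```

The point of taking the reference in the callback is that the collected
list stays valid after the iteration's own locking is released, which is
what lets the caller sort and print at leisure.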

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27 21:42           ` Ian Rogers
@ 2024-02-28  6:39             ` Namhyung Kim
  2024-02-28  7:24               ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Namhyung Kim @ 2024-02-28  6:39 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 1:42 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Feb 27, 2024 at 11:17 AM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> > > I can see some other differences like machine__findnew_thread()
> > > which I think is due to the locking change.  Maybe we can fix the
> > > problem before moving the code and let the code move simple.
> >
> > I was going to suggest that, agreed.
> >
> > We may start doing a refactoring, then find a bug, at that point we
> > first fix the problem, then go back to refactoring.
>
> Sure, I do this all the time. Your typical complaint on the v+1 patch
> set is to move the bug fixes to the front of the changes. On the v+2
> patch set the bug fixes get applied but not the rest of the patch
> series, etc.
>
> Here we are refactoring code for an rb-tree implementation of threads
> and worrying about its correctness. There's no indication it's not
> correct, it is largely copy and paste, there is also good evidence in
> the locking discipline that it is more correct. The next patch deletes that
> implementation, replacing it with a hash table. Were I not trying to
> break things apart I could squash those 2 patches together, but I've
> tried to do the right thing. Now we're trying to micro-correct, break
> apart, etc. a state that gets deleted. A reviewer could equally
> criticise this being 2 changes rather than 1, and the cognitive load
> of having to look at code that gets deleted. At some point it is a
> judgement call, and I think this patch is actually the right size. I
> think what is missing here is some motivation in the commit message for
> the findnew refactoring, and so I'll add that.

I'm not against your approach and actually appreciate your effort
to split rb-tree refactoring and hash table introduction.  What I'm
asking is just to separate out the code moving.  I think you can do
whatever you want in the current file.  Once you have the final code
you can move it to its own file exactly the same.  When I look at this
commit, say a few years later, I won't expect a commit that says it is
moving something to a new file to have other changes.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-28  6:11       ` Namhyung Kim
@ 2024-02-28  7:05         ` Ian Rogers
  2024-02-28 22:45           ` Namhyung Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-28  7:05 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 10:11 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, Feb 26, 2024 at 11:12 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Mon, Feb 26, 2024 at 10:39 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > > threads") made the iteration of thread tids unordered. The perf report
> > > > --tasks output now shows child threads in an order determined by the
> > > > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > > > though they have the same ppid 2:
> > > >
> > > > ```
> > > > $ perf report --tasks
> > > > %      pid      tid     ppid  comm
> > > >          0        0       -1 |swapper
> > > >          2        2        0 | kthreadd
> > > >        256      256        2 |  kworker/12:1H-k
> > > >     693761   693761        2 |  kworker/10:1-mm
> > > >    1301762  1301762        2 |  kworker/1:1-mm_
> > > >    1302530  1302530        2 |  kworker/u32:0-k
> > > >          3        3        2 |  rcu_gp
> > > > ...
> > > > ```
> > > >
> > > > The output is easier to read if threads appear numerically
> > > > increasing. To allow for this, read all threads into a list then sort
> > > > with a comparator that orders by the child task's of the first common
> > > > parent. The list creation and deletion are created as utilities on
> > > > machine.  The indentation is possible by counting the number of
> > > > parents a child has.
> > > >
> > > > With this change the output for the same data file is now like:
> > > > ```
> > > > $ perf report --tasks
> > > > %      pid      tid     ppid  comm
> > > >          0        0       -1 |swapper
> > > >          1        1        0 | systemd
> > > >        823      823        1 |  systemd-journal
> > > >        853      853        1 |  systemd-udevd
> > > >       3230     3230        1 |  systemd-timesyn
> > > >       3236     3236        1 |  auditd
> > > >       3239     3239     3236 |   audisp-syslog
> > > >       3321     3321        1 |  accounts-daemon
> > > > ...
> > > > ```
> > > >
> > > > Signed-off-by: Ian Rogers <irogers@google.com>
>
> I know you sent out v2 already, but let me continue the discussion
> here.
>
>
> > > > ---
> > > >  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
> > > >  tools/perf/util/machine.c   |  30 ++++++
> > > >  tools/perf/util/machine.h   |  10 ++
> > > >  3 files changed, 155 insertions(+), 88 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > index 8e16fa261e6f..b48f1d5309e3 100644
> > > > --- a/tools/perf/builtin-report.c
> > > > +++ b/tools/perf/builtin-report.c
> > > > @@ -59,6 +59,7 @@
> > > >  #include <linux/ctype.h>
> > > >  #include <signal.h>
> > > >  #include <linux/bitmap.h>
> > > > +#include <linux/list_sort.h>
> > > >  #include <linux/string.h>
> > > >  #include <linux/stringify.h>
> > > >  #include <linux/time64.h>
> > > > @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
> > > >         rep->tool.no_warn = true;
> > > >  }
> > > >
> > > > -struct task {
> > > > -       struct thread           *thread;
> > > > -       struct list_head         list;
> > > > -       struct list_head         children;
> > > > -};
> > > > -
> > > > -static struct task *tasks_list(struct task *task, struct machine *machine)
> > > > -{
> > > > -       struct thread *parent_thread, *thread = task->thread;
> > > > -       struct task   *parent_task;
> > > > -
> > > > -       /* Already listed. */
> > > > -       if (!list_empty(&task->list))
> > > > -               return NULL;
> > > > -
> > > > -       /* Last one in the chain. */
> > > > -       if (thread__ppid(thread) == -1)
> > > > -               return task;
> > > > -
> > > > -       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > > -       if (!parent_thread)
> > > > -               return ERR_PTR(-ENOENT);
> > > > -
> > > > -       parent_task = thread__priv(parent_thread);
> > > > -       thread__put(parent_thread);
> > > > -       list_add_tail(&task->list, &parent_task->children);
> > > > -       return tasks_list(parent_task, machine);
> > > > -}
> > > > -
> > > >  struct maps__fprintf_task_args {
> > > >         int indent;
> > > >         FILE *fp;
> > > > @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
> > > >         return args.printed;
> > > >  }
> > > >
> > > > -static void task__print_level(struct task *task, FILE *fp, int level)
> > > > +static int thread_level(struct machine *machine, const struct thread *thread)
> > > >  {
> > > > -       struct thread *thread = task->thread;
> > > > -       struct task *child;
> > > > -       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > > -                                 thread__pid(thread), thread__tid(thread),
> > > > -                                 thread__ppid(thread), level, "");
> > > > +       struct thread *parent_thread;
> > > > +       int res;
> > > >
> > > > -       fprintf(fp, "%s\n", thread__comm_str(thread));
> > > > +       if (thread__tid(thread) <= 0)
> > > > +               return 0;
> > > >
> > > > -       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > > +       if (thread__ppid(thread) <= 0)
> > > > +               return 1;
> > > >
> > > > -       if (!list_empty(&task->children)) {
> > > > -               list_for_each_entry(child, &task->children, list)
> > > > -                       task__print_level(child, fp, level + 1);
> > > > +       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > > +       if (!parent_thread) {
> > > > +               pr_err("Missing parent thread of %d\n", thread__tid(thread));
> > > > +               return 0;
> > > >         }
> > > > +       res = 1 + thread_level(machine, parent_thread);
> > > > +       thread__put(parent_thread);
> > > > +       return res;
> > > >  }
> > > >
> > > > -static int tasks_print(struct report *rep, FILE *fp)
> > > > +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
> > > >  {
> > > > -       struct perf_session *session = rep->session;
> > > > -       struct machine      *machine = &session->machines.host;
> > > > -       struct task *tasks, *task;
> > > > -       unsigned int nr = 0, itask = 0, i;
> > > > -       struct rb_node *nd;
> > > > -       LIST_HEAD(list);
> > > > +       int level = thread_level(machine, thread);
> > > > +       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > > +                                 thread__pid(thread), thread__tid(thread),
> > > > +                                 thread__ppid(thread), level, "");
> > > >
> > > > -       /*
> > > > -        * No locking needed while accessing machine->threads,
> > > > -        * because --tasks is single threaded command.
> > > > -        */
> > > > +       fprintf(fp, "%s\n", thread__comm_str(thread));
> > > >
> > > > -       /* Count all the threads. */
> > > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++)
> > > > -               nr += machine->threads[i].nr;
> > > > +       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > > +}
> > > >
> > > > -       tasks = malloc(sizeof(*tasks) * nr);
> > > > -       if (!tasks)
> > > > -               return -ENOMEM;
> > > > +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
> > >
> > > I'm a little afraid that this comparison logic becomes complex.
> > > But I think it's better than having a tree of thread relationship.
> > > Just a comment that explains why we need this would be nice.
> >
> > I can add something in v2.
> >
> > >
> > > > +{
> > > > +       struct machine *machine = priv;
> > > > +       struct thread_list *task_a = list_entry(la, struct thread_list, list);
> > > > +       struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> > > > +       struct thread *a = task_a->thread;
> > > > +       struct thread *b = task_b->thread;
> > > > +       int level_a, level_b, res;
> > > > +
> > > > +       /* Compare a and b to root. */
> > > > +       if (thread__tid(a) == thread__tid(b))
> > > > +               return 0;
> > > >
> > > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > > > -               struct threads *threads = &machine->threads[i];
> > > > +       if (thread__tid(a) == 0)
> > > > +               return -1;
> > > >
> > > > -               for (nd = rb_first_cached(&threads->entries); nd;
> > > > -                    nd = rb_next(nd)) {
> > > > -                       task = tasks + itask++;
> > > > +       if (thread__tid(b) == 0)
> > > > +               return 1;
> > > >
> > > > -                       task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> > > > -                       INIT_LIST_HEAD(&task->children);
> > > > -                       INIT_LIST_HEAD(&task->list);
> > > > -                       thread__set_priv(task->thread, task);
> > > > -               }
> > > > +       /* If parents match sort by tid. */
> > > > +       if (thread__ppid(a) == thread__ppid(b)) {
> > > > +               return thread__tid(a) < thread__tid(b)
> > > > +                       ? -1
> > > > +                       : (thread__tid(a) > thread__tid(b) ? 1 : 0);
> > >
> > > Can it be simply like this?  We know tid(a) != tid(b).
> > >
> > >   return thread__tid(a) < thread__tid(b) ? -1 : 1;
> >
> > Yes, but the parent check is still required.
>
> Sure.  I only meant the return statement.
>
> >
> > > >         }
> > > >
> > > >         /*
> > > > -        * Iterate every task down to the unprocessed parent
> > > > -        * and link all in task children list. Task with no
> > > > -        * parent is added into 'list'.
> > > > +        * Find a and b such that if they are a child of each other a and b's
> > > > +        * tid's match, otherwise a and b have a common parent and distinct
> > > > +        * tid's to sort by. First make the depths of the threads match.
> > > >          */
> > > > -       for (itask = 0; itask < nr; itask++) {
> > > > -               task = tasks + itask;
> > > > -
> > > > -               if (!list_empty(&task->list))
> > > > -                       continue;
> > > > -
> > > > -               task = tasks_list(task, machine);
> > > > -               if (IS_ERR(task)) {
> > > > -                       pr_err("Error: failed to process tasks\n");
> > > > -                       free(tasks);
> > > > -                       return PTR_ERR(task);
> > > > +       level_a = thread_level(machine, a);
> > > > +       level_b = thread_level(machine, b);
> > > > +       a = thread__get(a);
> > > > +       b = thread__get(b);
> > > > +       for (int i = level_a; i > level_b; i--) {
> > > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > > +
> > > > +               thread__put(a);
> > > > +               if (!parent) {
> > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > > +                       thread__put(b);
> > > > +                       return -1;
> > > >                 }
> > > > +               a = parent;
> > > > +       }
> > > > +       for (int i = level_b; i > level_a; i--) {
> > > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
> > > >
> > > > -               if (task)
> > > > -                       list_add_tail(&task->list, &list);
> > > > +               thread__put(b);
> > > > +               if (!parent) {
> > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > > +                       thread__put(a);
> > > > +                       return 1;
> > > > +               }
> > > > +               b = parent;
> > > > +       }
> > > > +       /* Search up to a common parent. */
> > > > +       while (thread__ppid(a) != thread__ppid(b)) {
> > > > +               struct thread *parent;
> > > > +
> > > > +               parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > > +               thread__put(a);
> > > > +               if (!parent)
> > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > > +               a = parent;
> > > > +               parent = machine__find_thread(machine, -1, thread__ppid(b));
> > > > +               thread__put(b);
> > > > +               if (!parent)
> > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > > +               b = parent;
> > > > +               if (!a || !b)
> > > > +                       return !a && !b ? 0 : (!a ? -1 : 1);
> > >
> > > Wouldn't it leak a refcount if either a or b is NULL (not both)?
> >
> > It would, but this would be an error condition anyway. I can add puts.
> >
> > >
> > > > +       }
> > > > +       if (thread__tid(a) == thread__tid(b)) {
> > > > +               /* a is a child of b or vice-versa, deeper levels appear later. */
> > > > +               res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> > > > +       } else {
> > > > +               /* Sort by tid now the parent is the same. */
> > > > +               res = thread__tid(a) < thread__tid(b) ? -1 : 1;
> > > >         }
> > > > +       thread__put(a);
> > > > +       thread__put(b);
> > > > +       return res;
> > > > +}
> > > > +
> > > > +static int tasks_print(struct report *rep, FILE *fp)
> > > > +{
> > > > +       struct machine *machine = &rep->session->machines.host;
> > > > +       LIST_HEAD(tasks);
> > > > +       int ret;
> > > >
> > > > -       fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > > > +       ret = machine__thread_list(machine, &tasks);
> > > > +       if (!ret) {
> > > > +               struct thread_list *task;
> > >
> > > Do we really need this thread_list?  Why not use an
> > > array of threads directly?
> >
> > The code isn't particularly performance critical. I used a list as it
> > best approximated how the rbtree was being used. The code is reused in
> > subsequent patches, there's no initial pass to size an array and I
> > think the reallocarray/qsort logic is generally more problematic than
> the list one. If we were worried about performance then I think
> > arrays could make sense for optimization, but I think this is good
> > enough for now.
>
> Well, it's not about performance.  It made me wonder why we need
> this thread_list, but I couldn't find the reason.  If you can move
> machine__threads_nr() here, then you won't need realloc().

But then you can race between allocating an array and traversing to
fill it in. Using realloc in the iterator callback would avoid this
but with capacity tracking, etc. If this were C++ it's a call between a
std::vector and a std::list, and std::vector would win that race there
(imo). Here we're moving from code that was working on sorted tree
nodes in code that tends to more heavily use lists. I wanted the
transition from the rbtree nodes to list nodes to be as small as
possible in the changes to the APIs that did strange things to the
threads tree (resorting it). Moving to an array with indices would
require more tracking and be a larger change in general. The array
could move because of a realloc, whilst list nodes wouldn't, etc. With
the code now working on a list, it's easier to see how it can migrate to
an array, but that can be follow-on work. I'm not sure we're motivated to
do it given there's no code on a performance critical path.

Thanks,
Ian

> Thanks,
> Namhyung
>
> > > >
> > > > -       list_for_each_entry(task, &list, list)
> > > > -               task__print_level(task, fp, 0);
> > > > +               list_sort(machine, &tasks, task_list_cmp);
> > > >
> > > > -       free(tasks);
> > > > -       return 0;
> > > > +               fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > > > +
> > > > +               list_for_each_entry(task, &tasks, list)
> > > > +                       task__print_level(machine, task->thread, fp);
> > > > +       }
> > > > +       thread_list__delete(&tasks);
> > > > +       return ret;
> > > >  }
> > > >
> > > >  static int __cmd_report(struct report *rep)
> > > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > > > index 3da92f18814a..7872ce92c9fc 100644
> > > > --- a/tools/perf/util/machine.c
> > > > +++ b/tools/perf/util/machine.c
> > > > @@ -3261,6 +3261,36 @@ int machines__for_each_thread(struct machines *machines,
> > > >         return rc;
> > > >  }
> > > >
> > > > +
> > > > +static int thread_list_cb(struct thread *thread, void *data)
> > > > +{
> > > > +       struct list_head *list = data;
> > > > +       struct thread_list *entry = malloc(sizeof(*entry));
> > > > +
> > > > +       if (!entry)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       entry->thread = thread__get(thread);
> > > > +       list_add_tail(&entry->list, list);
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +int machine__thread_list(struct machine *machine, struct list_head *list)
> > > > +{
> > > > +       return machine__for_each_thread(machine, thread_list_cb, list);
> > > > +}
> > > > +
> > > > +void thread_list__delete(struct list_head *list)
> > > > +{
> > > > +       struct thread_list *pos, *next;
> > > > +
> > > > +       list_for_each_entry_safe(pos, next, list, list) {
> > > > +               thread__zput(pos->thread);
> > > > +               list_del(&pos->list);
> > > > +               free(pos);
> > > > +       }
> > > > +}
> > > > +
> > > >  pid_t machine__get_current_tid(struct machine *machine, int cpu)
> > > >  {
> > > >         if (cpu < 0 || (size_t)cpu >= machine->current_tid_sz)
> > > > diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> > > > index 1279acda6a8a..b738ce84817b 100644
> > > > --- a/tools/perf/util/machine.h
> > > > +++ b/tools/perf/util/machine.h
> > > > @@ -280,6 +280,16 @@ int machines__for_each_thread(struct machines *machines,
> > > >                               int (*fn)(struct thread *thread, void *p),
> > > >                               void *priv);
> > > >
> > > > +struct thread_list {
> > > > +       struct list_head         list;
> > > > +       struct thread           *thread;
> > > > +};
> > > > +
> > > > +/* Make a list of struct thread_list based on threads in the machine. */
> > > > +int machine__thread_list(struct machine *machine, struct list_head *list);
> > > > +/* Free up the nodes within the thread_list list. */
> > > > +void thread_list__delete(struct list_head *list);
> > > > +
> > > >  pid_t machine__get_current_tid(struct machine *machine, int cpu);
> > > >  int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
> > > >                              pid_t tid);
> > > > --
> > > > 2.43.0.687.g38aa6559b0-goog
> > > >

^ permalink raw reply	[flat|nested] 35+ messages in thread
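
The comparator debated in this thread equalizes the two threads' depths,
then walks both sides up until they share a parent, and only then compares.
The core algorithm can be sketched standalone with direct parent pointers
in place of the machine__find_thread() lookups and reference counting of
the real task_list_cmp(); the toy_* names and types here are invented for
illustration:

```c
#include <stddef.h>

struct toy_task {
	int tid;
	struct toy_task *parent; /* NULL for the root (swapper) */
};

/* Number of parents above this task, as in thread_level(). */
static int toy_level(const struct toy_task *t)
{
	int level = 0;

	for (; t->parent; t = t->parent)
		level++;
	return level;
}

/*
 * Order as in task_list_cmp(): an ancestor sorts before its
 * descendants; otherwise the children of the first common parent
 * compare by tid.
 */
static int toy_task_cmp(const struct toy_task *a, const struct toy_task *b)
{
	int level_a = toy_level(a), level_b = toy_level(b);

	if (a->tid == b->tid)
		return 0;
	/* Make the depths match by walking the deeper side up. */
	for (int i = level_a; i > level_b; i--)
		a = a->parent;
	for (int i = level_b; i > level_a; i--)
		b = b->parent;
	/* Walk both sides up until the parents match. */
	while (a->parent != b->parent) {
		a = a->parent;
		b = b->parent;
	}
	if (a->tid == b->tid) /* one was an ancestor of the other */
		return level_a < level_b ? -1 : 1;
	/* Siblings under a common parent: sort by tid. */
	return a->tid < b->tid ? -1 : 1;
}
```

With the example from the commit message, this orders rcu_gp (tid 3)
before kworker/12:1H-k (tid 256) because both reduce to children of
kthreadd (tid 2) and then compare numerically.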

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-28  6:39             ` Namhyung Kim
@ 2024-02-28  7:24               ` Ian Rogers
  2024-02-28 23:43                 ` Namhyung Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2024-02-28  7:24 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 10:40 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 1:42 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 11:17 AM Arnaldo Carvalho de Melo
> > <acme@kernel.org> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> > > > I can see some other differences like machine__findnew_thread()
> > > > which I think is due to the locking change.  Maybe we can fix the
> > > > problem before moving the code and let the code move simple.
> > >
> > > I was going to suggest that, agreed.
> > >
> > > We may start doing a refactoring, then find a bug, at that point we
> > first fix the problem, then go back to refactoring.
> >
> > Sure I do this all the time. Your typical complaint on the v+1 patch
> > set is to move the bug fixes to the front of the changes. On the v+2
> > patch set the bug fixes get applied but not the rest of the patch
> > series, etc.
> >
> > Here we are refactoring code for an rb-tree implementation of threads
> > and worrying about its correctness. There's no indication it's not
> > correct, it is largely copy and paste, there is also good evidence in
> > the locking discipline that it is more correct. The next patch deletes that
> > implementation, replacing it with a hash table. Were I not trying to
> > break things apart I could squash those 2 patches together, but I've
> > tried to do the right thing. Now we're trying to micro-correct, break
> > apart, etc. a state that gets deleted. A reviewer could equally
> > criticise this being 2 changes rather than 1, and the cognitive load
> > of having to look at code that gets deleted. At some point it is a
> > judgement call, and I think this patch is actually the right size. I
> > think what is missing here is some motivation in the commit message for
> > the findnew refactoring, and so I'll add that.
>
> I'm not against your approach and actually appreciate your effort
> to split rb-tree refactoring and hash table introduction.  What I'm
> asking is just to separate out the code moving.  I think you can do
> whatever you want in the current file.  Once you have the final code
> you can move it to its own file exactly the same.  When I look at this
commit, say a few years later, I won't expect a commit that says it is
moving something to a new file to have other changes.

The problem is that the code in machine treats the threads lock as if
it is a lock in machine. So there is __machine__findnew_thread which
implies the thread lock is held. This change is making threads its own
separate concept/collection and the lock belongs with that collection.
Most of the implementation of threads__findnew matches
__machine__findnew_thread, so we may be able to engineer a smaller
line diff by moving "__machine__findnew_thread" code into threads.c,
then renaming it to build the collection, etc. We could also build the
threads collection inside of machine and then in a separate change
move it to threads.[ch].  In the commit history this seems muddier
than just splitting out threads as a collection. Also, some of the API
design choices are motivated more by the hash table implementation of
the next patch than trying to have a good rbtree abstracted collection
of threads. Essentially it'd be engineering a collection of threads
but only with a view to delete it in the next patch. I don't think it
would be for the best and the commit history for deleted code is
unlikely to be looked at.

Thanks,
Ian

> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread
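
For context on where the series is heading, the cover letter says patch 5
replaces the rbtree with a hashmap and patch 6 shrinks the table from 256
to 8 buckets. The lookup/findnew shape of a tid-keyed chained hash table
can be sketched as below. This is an illustrative toy only — the actual
perf implementation has its own hashmap, locking, and refcounting, and
the toy_* names are assumptions:

```c
#include <stdlib.h>

#define TOY_TABLE_SIZE 8 /* matching the reduced bucket count */

struct toy_entry {
	int tid;
	struct toy_entry *next; /* collision chain within a bucket */
};

struct toy_threads {
	struct toy_entry *buckets[TOY_TABLE_SIZE];
};

static unsigned int toy_hash(int tid)
{
	return (unsigned int)tid % TOY_TABLE_SIZE;
}

/* O(1) expected lookup by tid, versus O(log n) in the rbtree. */
static struct toy_entry *toy_threads__find(struct toy_threads *t, int tid)
{
	struct toy_entry *e;

	for (e = t->buckets[toy_hash(tid)]; e; e = e->next) {
		if (e->tid == tid)
			return e;
	}
	return NULL;
}

/* findnew: return the existing entry or insert a fresh one. */
static struct toy_entry *toy_threads__findnew(struct toy_threads *t, int tid)
{
	struct toy_entry *e = toy_threads__find(t, tid);

	if (e)
		return e;
	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->tid = tid;
	e->next = t->buckets[toy_hash(tid)];
	t->buckets[toy_hash(tid)] = e;
	return e;
}
```

Note that bucket order is unrelated to tid order, which is exactly why
the sorting fixed in patch 1 has to happen in a separate list rather
than by iterating the table.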

* Re: [PATCH v1 1/6] perf report: Sort child tasks by tid
  2024-02-28  7:05         ` Ian Rogers
@ 2024-02-28 22:45           ` Namhyung Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Namhyung Kim @ 2024-02-28 22:45 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 11:06 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Feb 27, 2024 at 10:11 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Mon, Feb 26, 2024 at 11:12 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Mon, Feb 26, 2024 at 10:39 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > > > >
> > > > > Commit 91e467bc568f ("perf machine: Use hashtable for machine
> > > > > threads") made the iteration of thread tids unordered. The perf report
> > > > > --tasks output now shows child threads in an order determined by the
> > > > > hashing. For example, in this snippet tid 3 appears after tid 256 even
> > > > > though they have the same ppid 2:
> > > > >
> > > > > ```
> > > > > $ perf report --tasks
> > > > > %      pid      tid     ppid  comm
> > > > >          0        0       -1 |swapper
> > > > >          2        2        0 | kthreadd
> > > > >        256      256        2 |  kworker/12:1H-k
> > > > >     693761   693761        2 |  kworker/10:1-mm
> > > > >    1301762  1301762        2 |  kworker/1:1-mm_
> > > > >    1302530  1302530        2 |  kworker/u32:0-k
> > > > >          3        3        2 |  rcu_gp
> > > > > ...
> > > > > ```
> > > > >
> > > > > The output is easier to read if threads appear numerically
> > > > > increasing. To allow for this, read all threads into a list then sort
> > > > > with a comparator that orders by the child task's of the first common
> > > > > parent. The list creation and deletion are created as utilities on
> > > > > machine.  The indentation is possible by counting the number of
> > > > > parents a child has.
> > > > >
> > > > > With this change the output for the same data file is now like:
> > > > > ```
> > > > > $ perf report --tasks
> > > > > %      pid      tid     ppid  comm
> > > > >          0        0       -1 |swapper
> > > > >          1        1        0 | systemd
> > > > >        823      823        1 |  systemd-journal
> > > > >        853      853        1 |  systemd-udevd
> > > > >       3230     3230        1 |  systemd-timesyn
> > > > >       3236     3236        1 |  auditd
> > > > >       3239     3239     3236 |   audisp-syslog
> > > > >       3321     3321        1 |  accounts-daemon
> > > > > ...
> > > > > ```
> > > > >
> > > > > Signed-off-by: Ian Rogers <irogers@google.com>
> >
> > I know you sent out v2 already, but let me continue the discussion
> > here.
> >
> >
> > > > > ---
> > > > >  tools/perf/builtin-report.c | 203 ++++++++++++++++++++----------------
> > > > >  tools/perf/util/machine.c   |  30 ++++++
> > > > >  tools/perf/util/machine.h   |  10 ++
> > > > >  3 files changed, 155 insertions(+), 88 deletions(-)
> > > > >
> > > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > > index 8e16fa261e6f..b48f1d5309e3 100644
> > > > > --- a/tools/perf/builtin-report.c
> > > > > +++ b/tools/perf/builtin-report.c
> > > > > @@ -59,6 +59,7 @@
> > > > >  #include <linux/ctype.h>
> > > > >  #include <signal.h>
> > > > >  #include <linux/bitmap.h>
> > > > > +#include <linux/list_sort.h>
> > > > >  #include <linux/string.h>
> > > > >  #include <linux/stringify.h>
> > > > >  #include <linux/time64.h>
> > > > > @@ -828,35 +829,6 @@ static void tasks_setup(struct report *rep)
> > > > >         rep->tool.no_warn = true;
> > > > >  }
> > > > >
> > > > > -struct task {
> > > > > -       struct thread           *thread;
> > > > > -       struct list_head         list;
> > > > > -       struct list_head         children;
> > > > > -};
> > > > > -
> > > > > -static struct task *tasks_list(struct task *task, struct machine *machine)
> > > > > -{
> > > > > -       struct thread *parent_thread, *thread = task->thread;
> > > > > -       struct task   *parent_task;
> > > > > -
> > > > > -       /* Already listed. */
> > > > > -       if (!list_empty(&task->list))
> > > > > -               return NULL;
> > > > > -
> > > > > -       /* Last one in the chain. */
> > > > > -       if (thread__ppid(thread) == -1)
> > > > > -               return task;
> > > > > -
> > > > > -       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > > > -       if (!parent_thread)
> > > > > -               return ERR_PTR(-ENOENT);
> > > > > -
> > > > > -       parent_task = thread__priv(parent_thread);
> > > > > -       thread__put(parent_thread);
> > > > > -       list_add_tail(&task->list, &parent_task->children);
> > > > > -       return tasks_list(parent_task, machine);
> > > > > -}
> > > > > -
> > > > >  struct maps__fprintf_task_args {
> > > > >         int indent;
> > > > >         FILE *fp;
> > > > > @@ -900,89 +872,144 @@ static size_t maps__fprintf_task(struct maps *maps, int indent, FILE *fp)
> > > > >         return args.printed;
> > > > >  }
> > > > >
> > > > > -static void task__print_level(struct task *task, FILE *fp, int level)
> > > > > +static int thread_level(struct machine *machine, const struct thread *thread)
> > > > >  {
> > > > > -       struct thread *thread = task->thread;
> > > > > -       struct task *child;
> > > > > -       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > > > -                                 thread__pid(thread), thread__tid(thread),
> > > > > -                                 thread__ppid(thread), level, "");
> > > > > +       struct thread *parent_thread;
> > > > > +       int res;
> > > > >
> > > > > -       fprintf(fp, "%s\n", thread__comm_str(thread));
> > > > > +       if (thread__tid(thread) <= 0)
> > > > > +               return 0;
> > > > >
> > > > > -       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > > > +       if (thread__ppid(thread) <= 0)
> > > > > +               return 1;
> > > > >
> > > > > -       if (!list_empty(&task->children)) {
> > > > > -               list_for_each_entry(child, &task->children, list)
> > > > > -                       task__print_level(child, fp, level + 1);
> > > > > +       parent_thread = machine__find_thread(machine, -1, thread__ppid(thread));
> > > > > +       if (!parent_thread) {
> > > > > +               pr_err("Missing parent thread of %d\n", thread__tid(thread));
> > > > > +               return 0;
> > > > >         }
> > > > > +       res = 1 + thread_level(machine, parent_thread);
> > > > > +       thread__put(parent_thread);
> > > > > +       return res;
> > > > >  }
> > > > >
> > > > > -static int tasks_print(struct report *rep, FILE *fp)
> > > > > +static void task__print_level(struct machine *machine, struct thread *thread, FILE *fp)
> > > > >  {
> > > > > -       struct perf_session *session = rep->session;
> > > > > -       struct machine      *machine = &session->machines.host;
> > > > > -       struct task *tasks, *task;
> > > > > -       unsigned int nr = 0, itask = 0, i;
> > > > > -       struct rb_node *nd;
> > > > > -       LIST_HEAD(list);
> > > > > +       int level = thread_level(machine, thread);
> > > > > +       int comm_indent = fprintf(fp, "  %8d %8d %8d |%*s",
> > > > > +                                 thread__pid(thread), thread__tid(thread),
> > > > > +                                 thread__ppid(thread), level, "");
> > > > >
> > > > > -       /*
> > > > > -        * No locking needed while accessing machine->threads,
> > > > > -        * because --tasks is single threaded command.
> > > > > -        */
> > > > > +       fprintf(fp, "%s\n", thread__comm_str(thread));
> > > > >
> > > > > -       /* Count all the threads. */
> > > > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++)
> > > > > -               nr += machine->threads[i].nr;
> > > > > +       maps__fprintf_task(thread__maps(thread), comm_indent, fp);
> > > > > +}
> > > > >
> > > > > -       tasks = malloc(sizeof(*tasks) * nr);
> > > > > -       if (!tasks)
> > > > > -               return -ENOMEM;
> > > > > +static int task_list_cmp(void *priv, const struct list_head *la, const struct list_head *lb)
> > > >
> > > > I'm a little afraid that this comparison logic becomes complex.
> > > > But I think it's better than having a tree of thread relationship.
> > > > Just a comment that explains why we need this would be nice.
> > >
> > > I can add something in v2.
> > >
> > > >
> > > > > +{
> > > > > +       struct machine *machine = priv;
> > > > > +       struct thread_list *task_a = list_entry(la, struct thread_list, list);
> > > > > +       struct thread_list *task_b = list_entry(lb, struct thread_list, list);
> > > > > +       struct thread *a = task_a->thread;
> > > > > +       struct thread *b = task_b->thread;
> > > > > +       int level_a, level_b, res;
> > > > > +
> > > > > +       /* Compare a and b to root. */
> > > > > +       if (thread__tid(a) == thread__tid(b))
> > > > > +               return 0;
> > > > >
> > > > > -       for (i = 0; i < THREADS__TABLE_SIZE; i++) {
> > > > > -               struct threads *threads = &machine->threads[i];
> > > > > +       if (thread__tid(a) == 0)
> > > > > +               return -1;
> > > > >
> > > > > -               for (nd = rb_first_cached(&threads->entries); nd;
> > > > > -                    nd = rb_next(nd)) {
> > > > > -                       task = tasks + itask++;
> > > > > +       if (thread__tid(b) == 0)
> > > > > +               return 1;
> > > > >
> > > > > -                       task->thread = rb_entry(nd, struct thread_rb_node, rb_node)->thread;
> > > > > -                       INIT_LIST_HEAD(&task->children);
> > > > > -                       INIT_LIST_HEAD(&task->list);
> > > > > -                       thread__set_priv(task->thread, task);
> > > > > -               }
> > > > > +       /* If parents match sort by tid. */
> > > > > +       if (thread__ppid(a) == thread__ppid(b)) {
> > > > > +               return thread__tid(a) < thread__tid(b)
> > > > > +                       ? -1
> > > > > +                       : (thread__tid(a) > thread__tid(b) ? 1 : 0);
> > > >
> > > > Can it be simply like this?  We know tid(a) != tid(b).
> > > >
> > > >   return thread__tid(a) < thread__tid(b) ? -1 : 1;
> > >
> > > Yes, but the parent check is still required.
> >
> > Sure.  I only meant the return statement.
> >
> > >
> > > > >         }
> > > > >
> > > > >         /*
> > > > > -        * Iterate every task down to the unprocessed parent
> > > > > -        * and link all in task children list. Task with no
> > > > > -        * parent is added into 'list'.
> > > > > +        * Find a and b such that if they are a child of each other a and b's
> > > > > +        * tid's match, otherwise a and b have a common parent and distinct
> > > > > +        * tid's to sort by. First make the depths of the threads match.
> > > > >          */
> > > > > -       for (itask = 0; itask < nr; itask++) {
> > > > > -               task = tasks + itask;
> > > > > -
> > > > > -               if (!list_empty(&task->list))
> > > > > -                       continue;
> > > > > -
> > > > > -               task = tasks_list(task, machine);
> > > > > -               if (IS_ERR(task)) {
> > > > > -                       pr_err("Error: failed to process tasks\n");
> > > > > -                       free(tasks);
> > > > > -                       return PTR_ERR(task);
> > > > > +       level_a = thread_level(machine, a);
> > > > > +       level_b = thread_level(machine, b);
> > > > > +       a = thread__get(a);
> > > > > +       b = thread__get(b);
> > > > > +       for (int i = level_a; i > level_b; i--) {
> > > > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > > > +
> > > > > +               thread__put(a);
> > > > > +               if (!parent) {
> > > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > > > +                       thread__put(b);
> > > > > +                       return -1;
> > > > >                 }
> > > > > +               a = parent;
> > > > > +       }
> > > > > +       for (int i = level_b; i > level_a; i--) {
> > > > > +               struct thread *parent = machine__find_thread(machine, -1, thread__ppid(b));
> > > > >
> > > > > -               if (task)
> > > > > -                       list_add_tail(&task->list, &list);
> > > > > +               thread__put(b);
> > > > > +               if (!parent) {
> > > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > > > +                       thread__put(a);
> > > > > +                       return 1;
> > > > > +               }
> > > > > +               b = parent;
> > > > > +       }
> > > > > +       /* Search up to a common parent. */
> > > > > +       while (thread__ppid(a) != thread__ppid(b)) {
> > > > > +               struct thread *parent;
> > > > > +
> > > > > +               parent = machine__find_thread(machine, -1, thread__ppid(a));
> > > > > +               thread__put(a);
> > > > > +               if (!parent)
> > > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(a));
> > > > > +               a = parent;
> > > > > +               parent = machine__find_thread(machine, -1, thread__ppid(b));
> > > > > +               thread__put(b);
> > > > > +               if (!parent)
> > > > > +                       pr_err("Missing parent thread of %d\n", thread__tid(b));
> > > > > +               b = parent;
> > > > > +               if (!a || !b)
> > > > > +                       return !a && !b ? 0 : (!a ? -1 : 1);
> > > >
> > > > Wouldn't it leak a refcount if either a or b is NULL (not both)?
> > >
> > > It would, but this would be an error condition anyway. I can add puts.
> > >
> > > >
> > > > > +       }
> > > > > +       if (thread__tid(a) == thread__tid(b)) {
> > > > > +               /* a is a child of b or vice-versa, deeper levels appear later. */
> > > > > +               res = level_a < level_b ? -1 : (level_a > level_b ? 1 : 0);
> > > > > +       } else {
> > > > > +               /* Sort by tid now the parent is the same. */
> > > > > +               res = thread__tid(a) < thread__tid(b) ? -1 : 1;
> > > > >         }
> > > > > +       thread__put(a);
> > > > > +       thread__put(b);
> > > > > +       return res;
> > > > > +}
> > > > > +
> > > > > +static int tasks_print(struct report *rep, FILE *fp)
> > > > > +{
> > > > > +       struct machine *machine = &rep->session->machines.host;
> > > > > +       LIST_HEAD(tasks);
> > > > > +       int ret;
> > > > >
> > > > > -       fprintf(fp, "# %8s %8s %8s  %s\n", "pid", "tid", "ppid", "comm");
> > > > > +       ret = machine__thread_list(machine, &tasks);
> > > > > +       if (!ret) {
> > > > > +               struct thread_list *task;
> > > >
> > > > Do we really need this thread_list?  Why not use an
> > > > array of threads directly?
> > >
> > > The code isn't particularly performance critical. I used a list as it
> > > best approximated how the rbtree was being used. The code is reused in
> > > subsequent patches, there's no initial pass to size an array and I
> > > think the reallocarray/qsort logic is generally more problematic than
> > > the list ones. If we were worried about performance then I think
> > > arrays could make sense for optimization, but I think this is good
> > > enough for now.
> >
> > Well, it's not about performance.  It made me think why we need
> > this thread_list but I couldn't find the reason.  If you can move
> > machine__threads_nr() here then you won't need realloc().
>
> But then you can race between allocating an array and traversing to
> fill it in. Using realloc in the iterator callback would avoid this
> but with capacity tracking, etc. If this were C++ it's a choice between a
> std::vector and a std::list, and std::vector would win there (imo).
> Here we're moving from code that was working on sorted tree
> nodes into code that tends to more heavily use lists. I wanted the
> transition from the rbtree nodes to list nodes to be as small as
> possible in the changes to the APIs that did strange things to the
> threads tree (resorting it). Moving to an array with indices would
> require more tracking and be a larger change in general. The array
> could move because of a realloc, whilst nodes wouldn't, etc. Having
> the code now work on a list, it's easier to see how it can migrate to an
> array, but that can be follow-on work. I'm not sure we're motivated to
> do it given there's no code on a performance critical path.

Ok, as you said it's not a critical path.  I'm ok with this change.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-28  7:24               ` Ian Rogers
@ 2024-02-28 23:43                 ` Namhyung Kim
  2024-02-29  0:31                   ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Namhyung Kim @ 2024-02-28 23:43 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Tue, Feb 27, 2024 at 11:24 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Feb 27, 2024 at 10:40 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 1:42 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 11:17 AM Arnaldo Carvalho de Melo
> > > <acme@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> > > > > I can see some other differences like machine__findnew_thread()
> > > > > which I think is due to the locking change.  Maybe we can fix the
> > > > > problem before moving the code and let the code move simple.
> > > >
> > > > I was going to suggest that, agreed.
> > > >
> > > > We may start doing a refactoring, then find a bug, at that point we
> > > > first fix the problem then go back to refactoring.
> > >
> > > Sure I do this all the time. Your typical complaint on the v+1 patch
> > > set is to move the bug fixes to the front of the changes. On the v+2
> > > patch set the bug fixes get applied but not the rest of the patch
> > > series, etc.
> > >
> > > Here we are refactoring code for an rb-tree implementation of threads
> > > and worrying about its correctness. There's no indication it's not
> > > correct, it is largely copy and paste, and there is good evidence in
> > > the locking discipline that it is more correct. The next patch deletes that
> > > implementation, replacing it with a hash table. Were I not trying to
> > > break things apart I could squash those 2 patches together, but I've
> > > tried to do the right thing. Now we're trying to micro-correct, break
> > > apart, etc. a state that gets deleted. A reviewer could equally
> > > criticise this being 2 changes rather than 1, and the cognitive load
> > > of having to look at code that gets deleted. At some point it is a
> > > judgement call, and I think this patch is actually the right size. I
> > > think what is missing here is some motivation in the commit message to
> > > the findnew refactoring and so I'll add that.
> >
> > I'm not against your approach and actually appreciate your effort
> > to split rb-tree refactoring and hash table introduction.  What I'm
> > asking is just to separate out the code moving.  I think you can do
> > whatever you want in the current file.  Once you have the final code
> > you can move it to its own file exactly the same.  When I look at this
> > commit, say a few years later, I won't expect a commit that says
> > moving something to a new file has other changes.
>
> The problem is that the code in machine treats the threads lock as if
> it is a lock in machine. So there is __machine__findnew_thread which
> implies the thread lock is held. This change is making threads its own
> separate concept/collection and the lock belongs with that collection.
> Most of the implementation of threads__findnew matches
> __machine__findnew_thread, so we may be able to engineer a smaller
> line diff by moving "__machine__findnew_thread" code into threads.c,
> then renaming it to build the collection, etc. We could also build the
> threads collection inside of machine and then in a separate change
> move it to threads.[ch].  In the commit history this seems muddier
> than just splitting out threads as a collection. Also, some of the API
> design choices are motivated more by the hash table implementation of
> the next patch than trying to have a good rbtree abstracted collection
> of threads. Essentially it'd be engineering a collection of threads
> but only with a view to delete it in the next patch. I don't think it
> would be for the best and the commit history for deleted code is
> unlikely to be looked upon.

I think the conversation is repeating. :)  Why not do this?

1. refactor threads code in machine.c and fix the locking
2. move threads code to its own file
3. use hash table in threads

Thanks,
Namhyung


* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-28 23:43                 ` Namhyung Kim
@ 2024-02-29  0:31                   ` Ian Rogers
  0 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-02-29  0:31 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

On Wed, Feb 28, 2024 at 3:44 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 11:24 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 10:40 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 1:42 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 11:17 AM Arnaldo Carvalho de Melo
> > > > <acme@kernel.org> wrote:
> > > > >
> > > > > On Tue, Feb 27, 2024 at 09:31:33AM -0800, Namhyung Kim wrote:
> > > > > > I can see some other differences like machine__findnew_thread()
> > > > > > which I think is due to the locking change.  Maybe we can fix the
> > > > > > problem before moving the code and let the code move simple.
> > > > >
> > > > > I was going to suggest that, agreed.
> > > > >
> > > > > We may start doing a refactoring, then find a bug, at that point we
> > > > > first fix the problem then go back to refactoring.
> > > >
> > > > Sure I do this all the time. Your typical complaint on the v+1 patch
> > > > set is to move the bug fixes to the front of the changes. On the v+2
> > > > patch set the bug fixes get applied but not the rest of the patch
> > > > series, etc.
> > > >
> > > > Here we are refactoring code for an rb-tree implementation of threads
> > > > and worrying about its correctness. There's no indication it's not
> > > > correct, it is largely copy and paste, and there is good evidence in
> > > > the locking discipline that it is more correct. The next patch deletes that
> > > > implementation, replacing it with a hash table. Were I not trying to
> > > > break things apart I could squash those 2 patches together, but I've
> > > > tried to do the right thing. Now we're trying to micro-correct, break
> > > > apart, etc. a state that gets deleted. A reviewer could equally
> > > > criticise this being 2 changes rather than 1, and the cognitive load
> > > > of having to look at code that gets deleted. At some point it is a
> > > > judgement call, and I think this patch is actually the right size. I
> > > > think what is missing here is some motivation in the commit message to
> > > > the findnew refactoring and so I'll add that.
> > >
> > > I'm not against your approach and actually appreciate your effort
> > > to split rb-tree refactoring and hash table introduction.  What I'm
> > > asking is just to separate out the code moving.  I think you can do
> > > whatever you want in the current file.  Once you have the final code
> > > you can move it to its own file exactly the same.  When I look at this
> > > commit, say a few years later, I won't expect a commit that says
> > > moving something to a new file has other changes.
> >
> > The problem is that the code in machine treats the threads lock as if
> > it is a lock in machine. So there is __machine__findnew_thread which
> > implies the thread lock is held. This change is making threads its own
> > separate concept/collection and the lock belongs with that collection.
> > Most of the implementation of threads__findnew matches
> > __machine__findnew_thread, so we may be able to engineer a smaller
> > line diff by moving "__machine__findnew_thread" code into threads.c,
> > then renaming it to build the collection, etc. We could also build the
> > threads collection inside of machine and then in a separate change
> > move it to threads.[ch].  In the commit history this seems muddier
> > than just splitting out threads as a collection. Also, some of the API
> > design choices are motivated more by the hash table implementation of
> > the next patch than trying to have a good rbtree abstracted collection
> > of threads. Essentially it'd be engineering a collection of threads
> > but only with a view to delete it in the next patch. I don't think it
> > would be for the best and the commit history for deleted code is
> > unlikely to be looked upon.
>
> I think the conversation is repeating. :)  Why not do this?
>
> 1. refactor threads code in machine.c and fix the locking
> 2. move threads code to its own file
> 3. use hash table in threads

Step 3 is patch 5 in this series. I can split this patch into steps 1
and 2 as you say in v3. I'm not sure why step 2 is going to be of any
use as the code will match that of step 1, code that the next patch
then deletes. Anyway, it's a plan for v3.

Thanks,
Ian

> Thanks,
> Namhyung


* RE: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-27  7:24     ` Ian Rogers
  2024-02-27 17:31       ` Namhyung Kim
@ 2024-02-29 21:59       ` David Laight
  2024-03-01  0:19         ` Ian Rogers
  1 sibling, 1 reply; 35+ messages in thread
From: David Laight @ 2024-02-29 21:59 UTC (permalink / raw)
  To: 'Ian Rogers', Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Oliver Upton, Yang Jihong, linux-kernel, linux-perf-users, bpf

From: Ian Rogers
> Sent: 27 February 2024 07:24
> 
> On Mon, Feb 26, 2024 at 11:07 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Move threads out of machine and move thread_rb_node into the C
> > > file. This hides the implementation of threads from the rest of the
> > > code allowing for it to be refactored.
> > >
> > > Locking discipline is tightened up in this change.
> >
> > Doesn't look like a simple code move.  Can we split the locking
> > change from the move to make the reviewer's life a bit easier? :)
> 
> Not sure I follow. Take threads_nr as an example.
> 
> The old code is in machine.c, so:
> -static size_t machine__threads_nr(const struct machine *machine)
> -{
> -       size_t nr = 0;
> -
> -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> -               nr += machine->threads[i].nr;
> -
> -       return nr;
> -}
> 
> The new code is in threads.c:
> +size_t threads__nr(struct threads *threads)
> +{
> +       size_t nr = 0;
> +
> +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> +               struct threads_table_entry *table = &threads->table[i];
> +
> +               down_read(&table->lock);
> +               nr += table->nr;
> +               up_read(&table->lock);
> +       }
> +       return nr;
> +}
> 
> So it is a copy paste from one file to the other. The only difference
> is that the old code failed to take a lock when reading "nr" so the
> locking is added. I wanted to make sure all the functions in threads.c
> were properly correct wrt locking, semaphore creation and destruction,
> etc.  We could have a broken threads.c and fix it in the next change,
> but given that's a bug it could make bisection more difficult.
> Ultimately I thought the locking changes were small enough to not
> warrant being on their own compared to the advantages of having a sane
> threads abstraction.

The lock is pretty much entirely pointless.
All it really does is slow the code down.
The most you could want is:
	nr += READ_ONCE(table->nr);
to avoid any hypothetical data tearing.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v1 4/6] perf threads: Move threads to its own files
  2024-02-29 21:59       ` David Laight
@ 2024-03-01  0:19         ` Ian Rogers
  0 siblings, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2024-03-01  0:19 UTC (permalink / raw)
  To: David Laight
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Oliver Upton, Yang Jihong,
	linux-kernel, linux-perf-users, bpf

On Thu, Feb 29, 2024 at 1:59 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Ian Rogers
> > Sent: 27 February 2024 07:24
> >
> > On Mon, Feb 26, 2024 at 11:07 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:37 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > Move threads out of machine and move thread_rb_node into the C
> > > > file. This hides the implementation of threads from the rest of the
> > > > code allowing for it to be refactored.
> > > >
> > > > Locking discipline is tightened up in this change.
> > >
> > > Doesn't look like a simple code move.  Can we split the locking
> > > change from the move to make the reviewer's life a bit easier? :)
> >
> > Not sure I follow. Take threads_nr as an example.
> >
> > The old code is in machine.c, so:
> > -static size_t machine__threads_nr(const struct machine *machine)
> > -{
> > -       size_t nr = 0;
> > -
> > -       for (int i = 0; i < THREADS__TABLE_SIZE; i++)
> > -               nr += machine->threads[i].nr;
> > -
> > -       return nr;
> > -}
> >
> > The new code is in threads.c:
> > +size_t threads__nr(struct threads *threads)
> > +{
> > +       size_t nr = 0;
> > +
> > +       for (int i = 0; i < THREADS__TABLE_SIZE; i++) {
> > +               struct threads_table_entry *table = &threads->table[i];
> > +
> > +               down_read(&table->lock);
> > +               nr += table->nr;
> > +               up_read(&table->lock);
> > +       }
> > +       return nr;
> > +}
> >
> > So it is a copy paste from one file to the other. The only difference
> > is that the old code failed to take a lock when reading "nr" so the
> > locking is added. I wanted to make sure all the functions in threads.c
> > were properly correct wrt locking, semaphore creation and destruction,
> > etc.  We could have a broken threads.c and fix it in the next change,
> > but given that's a bug it could make bisection more difficult.
> > Ultimately I thought the locking changes were small enough to not
> > warrant being on their own compared to the advantages of having a sane
> > threads abstraction.
>
> The lock is pretty much entirely pointless.
> All it really does is slow the code down.
> The most you could want is:
>         nr += READ_ONCE(table->nr);
> to avoid any hypothetical data tearing.

Completely agreed, but as this is user space code I'm unclear on
thread sanitizer support for READ_ONCE. The code is also only called
in debug statements. The migration to a hashmap in the later patches
means the complexity of taking the lock is more justified, although
we're using libbpf's hashmap that has a size variable.

Thanks,
Ian

>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


end of thread, other threads:[~2024-03-01  0:19 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-14  6:37 [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
2024-02-14  6:37 ` [PATCH v1 1/6] perf report: Sort child tasks by tid Ian Rogers
2024-02-14 17:24   ` Arnaldo Carvalho de Melo
2024-02-14 17:42     ` Ian Rogers
2024-02-16 20:25       ` Namhyung Kim
2024-02-27  6:39   ` Namhyung Kim
2024-02-27  7:12     ` Ian Rogers
2024-02-28  6:11       ` Namhyung Kim
2024-02-28  7:05         ` Ian Rogers
2024-02-28 22:45           ` Namhyung Kim
2024-02-14  6:37 ` [PATCH v1 2/6] perf trace: Ignore thread hashing in summary Ian Rogers
2024-02-14 17:25   ` Arnaldo Carvalho de Melo
2024-02-14 18:27     ` Ian Rogers
2024-02-14 21:15       ` Ian Rogers
2024-02-14 21:36         ` Ian Rogers
2024-02-14 21:42           ` Ian Rogers
2024-02-16 14:57           ` Arnaldo Carvalho de Melo
2024-02-27  6:55   ` Namhyung Kim
2024-02-14  6:37 ` [PATCH v1 3/6] perf machine: Move fprintf to for_each loop and a callback Ian Rogers
2024-02-14  6:37 ` [PATCH v1 4/6] perf threads: Move threads to its own files Ian Rogers
2024-02-27  7:07   ` Namhyung Kim
2024-02-27  7:24     ` Ian Rogers
2024-02-27 17:31       ` Namhyung Kim
2024-02-27 19:02         ` Ian Rogers
2024-02-27 19:17         ` Arnaldo Carvalho de Melo
2024-02-27 21:42           ` Ian Rogers
2024-02-28  6:39             ` Namhyung Kim
2024-02-28  7:24               ` Ian Rogers
2024-02-28 23:43                 ` Namhyung Kim
2024-02-29  0:31                   ` Ian Rogers
2024-02-29 21:59       ` David Laight
2024-03-01  0:19         ` Ian Rogers
2024-02-14  6:37 ` [PATCH v1 5/6] perf threads: Switch from rbtree to hashmap Ian Rogers
2024-02-14  6:37 ` [PATCH v1 6/6] perf threads: Reduce table size from 256 to 8 Ian Rogers
2024-02-25 18:50 ` [PATCH v1 0/6] Thread memory improvements and fixes Ian Rogers
