* [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
@ 2021-09-24  3:35 Steven Rostedt
  2021-09-24  3:35 ` [PATCH 1/2] tracing: Place trace_pid_list logic into abstract functions Steven Rostedt
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-09-24  3:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-devel

When the trace_pid_list was created, the default pid_max was 32768.
Creating a bitmask that holds one bit for each of those 32768 pids took up
4096 bytes (one page). Having a one-page bitmask was not much of a
problem, and that was used for mapping pids. But today, systems are bigger
and can run more tasks, and the default pid_max is now usually set to
4194304, which means handling that many pids requires 524288 bytes. Worse
yet, pid_max can be set to 2^30 (1073741824 or 1G), which would take
134217728 bytes (128M) of memory to store this array.

Since the pid_list array is very sparsely populated, it is a huge waste of
memory to store all possible bits for each pid when most will not be set.

Instead, use a page table scheme to store the array, and allow this to
handle up to 32 bit pids.

The pid_list will start out with a 1024-entry array for the 10 MSBs of
the pid. This costs 4K on 32-bit architectures and 8K on 64-bit. Each of
these entries points to another 1024-entry array covering the next 10 bits
of the pid (another 4 or 8K each). Those entries in turn point to a
512-byte bitmask covering the 12 LSBs (4096 bits).
 
When the trace_pid_list is allocated, the 4/8K upper array is allocated,
and then a cache of upper chunks and lower chunks is allocated as well
(default 6 of each). Then, when a bit is "set", these chunks are pulled
from the free list and added to the array. If the free list drops below a
given level (default 2), an irqwork is triggered to refill the cache.

On clearing a bit, if the clear causes the bitmask to become zero, that
chunk is placed back into the free cache for later use, keeping the need
for further allocations to a minimum.
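
To make the layout concrete, here is a minimal lock-free sketch of the
lookup path described above (assuming the 10/10/12 split; it reuses the
chunk types that patch 2 introduces, and the real trace_pid_list_is_set()
additionally takes pid_list->lock):

static bool sketch_pid_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
	unsigned int upper1 = (pid >> 22) & 0x3ff;	/* top 10 bits */
	unsigned int upper2 = (pid >> 12) & 0x3ff;	/* next 10 bits */
	unsigned int lower  = pid & 0xfff;		/* low 12 bits */
	union upper_chunk *upper_chunk = pid_list->upper[upper1];
	union lower_chunk *lower_chunk;

	if (!upper_chunk)
		return false;	/* nothing allocated for this pid range */
	lower_chunk = upper_chunk->data[upper2];
	if (!lower_chunk)
		return false;
	return test_bit(lower, lower_chunk->data);
}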


Steven Rostedt (VMware) (2):
      tracing: Place trace_pid_list logic into abstract functions
      tracing: Create a sparse bitmask for pid filtering

----
 kernel/trace/Makefile       |   1 +
 kernel/trace/ftrace.c       |   6 +-
 kernel/trace/pid_list.c     | 551 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c        |  78 +++----
 kernel/trace/trace.h        |  14 +-
 kernel/trace/trace_events.c |   6 +-
 6 files changed, 595 insertions(+), 61 deletions(-)
 create mode 100644 kernel/trace/pid_list.c


* [PATCH 1/2] tracing: Place trace_pid_list logic into abstract functions
  2021-09-24  3:35 [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
@ 2021-09-24  3:35 ` Steven Rostedt
  2021-09-24  3:35 ` [PATCH 2/2] tracing: Create a sparse bitmask for pid filtering Steven Rostedt
  2021-09-24  4:07 ` [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
  2 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-09-24  3:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-devel

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Instead of having the logic that implements trace_pid_list open coded,
wrap it in abstract functions. This will allow the trace_pid_list
implementation to be rewritten without affecting its users.

Note, this causes a change in behavior. Every time a pid is written into
a set_*_pid file, a new list is created and RCU is used to update it. If
pid_max is lowered while pids higher than the new pid_max are currently
in the list, those pids will now be removed when the list is updated. The
old behavior kept that from happening.

The rewrite of the pid_list logic will no longer depend on pid_max,
and will restore the old behavior.
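
As an illustration (not part of this patch), a hypothetical caller that
walks every pid in a list only needs the first/next helpers and never
touches pid_max or the bitmask directly:

	unsigned int pid;

	if (trace_pid_list_first(pid_list, &pid) == 0) {
		do {
			pr_info("tracing pid %u\n", pid);
			/* pid + 1: continue the search after the pid just found */
		} while (trace_pid_list_next(pid_list, pid + 1, &pid) == 0);
	}

This is the same pattern that trace_pid_write() below uses to copy the
current list into the new one.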

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/Makefile       |   1 +
 kernel/trace/ftrace.c       |   6 +-
 kernel/trace/pid_list.c     | 165 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c        |  78 ++++++-----------
 kernel/trace/trace.h        |  14 ++-
 kernel/trace/trace_events.c |   6 +-
 6 files changed, 209 insertions(+), 61 deletions(-)
 create mode 100644 kernel/trace/pid_list.c

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 6de5d4d63165..bedc5caceec7 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_TRACING) += trace_output.o
 obj-$(CONFIG_TRACING) += trace_seq.o
 obj-$(CONFIG_TRACING) += trace_stat.o
 obj-$(CONFIG_TRACING) += trace_printk.o
+obj-$(CONFIG_TRACING) += pid_list.o
 obj-$(CONFIG_TRACING_MAP) += tracing_map.o
 obj-$(CONFIG_PREEMPTIRQ_DELAY_TEST) += preemptirq_delay_test.o
 obj-$(CONFIG_SYNTH_EVENT_GEN_TEST) += synth_event_gen_test.o
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 7efbc8aaf7f6..3eec6792f115 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7184,10 +7184,10 @@ static void clear_ftrace_pids(struct trace_array *tr, int type)
 	synchronize_rcu();
 
 	if ((type & TRACE_PIDS) && pid_list)
-		trace_free_pid_list(pid_list);
+		trace_pid_list_free(pid_list);
 
 	if ((type & TRACE_NO_PIDS) && no_pid_list)
-		trace_free_pid_list(no_pid_list);
+		trace_pid_list_free(no_pid_list);
 }
 
 void ftrace_clear_pids(struct trace_array *tr)
@@ -7428,7 +7428,7 @@ pid_write(struct file *filp, const char __user *ubuf,
 
 	if (filtered_pids) {
 		synchronize_rcu();
-		trace_free_pid_list(filtered_pids);
+		trace_pid_list_free(filtered_pids);
 	} else if (pid_list && !other_pids) {
 		/* Register a probe to set whether to ignore the tracing of a task */
 		register_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr);
diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
new file mode 100644
index 000000000000..b78c80d7f396
--- /dev/null
+++ b/kernel/trace/pid_list.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 VMware Inc, Steven Rostedt <rostedt@goodmis.org>
+ */
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include "trace.h"
+
+struct trace_pid_list {
+	int			pid_max;
+	unsigned long		*pids;
+};
+
+/**
+ * trace_pid_list_is_set - test if the pid is set in the list
+ * @pid_list: The pid list to test
+ * @pid: The pid to see if set in the list.
+ *
+ * Tests if @pid is set in the @pid_list. This is usually called
+ * from the scheduler when a task is scheduled. Its pid is checked
+ * if it should be traced or not.
+ *
+ * Return true if the pid is in the list, false otherwise.
+ */
+bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
+{
+	/*
+	 * If pid_max changed after filtered_pids was created, we
+	 * by default ignore all pids greater than the previous pid_max.
+	 */
+	if (pid >= pid_list->pid_max)
+		return false;
+
+	return test_bit(pid, pid_list->pids);
+}
+
+/**
+ * trace_pid_list_set - add a pid to the list
+ * @pid_list: The pid list to add the @pid to.
+ * @pid: The pid to add.
+ *
+ * Adds @pid to @pid_list. This is usually done explicitly by a user
+ * adding a task to be traced, or indirectly by the fork function
+ * when children should be traced and a task's pid is in the list.
+ *
+ * Return 0 on success, negative otherwise.
+ */
+int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid)
+{
+	/* Sorry, but we don't support pid_max changing after setting */
+	if (pid >= pid_list->pid_max)
+		return -EINVAL;
+
+	set_bit(pid, pid_list->pids);
+
+	return 0;
+}
+
+/**
+ * trace_pid_list_clear - remove a pid from the list
+ * @pid_list: The pid list to remove the @pid from.
+ * @pid: The pid to remove.
+ *
+ * Removes @pid from @pid_list. This is usually done explicitly by a user
+ * removing tasks from tracing, or indirectly by the exit function
+ * when a task that is set to be traced exits.
+ *
+ * Return 0 on success, negative otherwise.
+ */
+int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid)
+{
+	/* Sorry, but we don't support pid_max changing after setting */
+	if (pid >= pid_list->pid_max)
+		return -EINVAL;
+
+	clear_bit(pid, pid_list->pids);
+
+	return 0;
+}
+
+/**
+ * trace_pid_list_next - return the next pid in the list
+ * @pid_list: The pid list to examine.
+ * @pid: The pid to start from
+ * @next: The pointer to place the pid that is set starting from @pid.
+ *
+ * Looks for the next consecutive pid that is in @pid_list starting
+ * at the pid specified by @pid. If one is set (including @pid), then
+ * that pid is placed into @next.
+ *
+ * Return 0 when a pid is found, -1 if there are no more pids included.
+ */
+int trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid,
+			unsigned int *next)
+{
+	pid = find_next_bit(pid_list->pids, pid_list->pid_max, pid);
+
+	if (pid < pid_list->pid_max) {
+		*next = pid;
+		return 0;
+	}
+	return -1;
+}
+
+/**
+ * trace_pid_list_first - return the first pid in the list
+ * @pid_list: The pid list to examine.
+ * @pid: The pointer to place the first found pid that is set.
+ *
+ * Looks for the first pid that is set in @pid_list, and places it
+ * into @pid if found.
+ *
+ * Return 0 when a pid is found, -1 if there are no pids set.
+ */
+int trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid)
+{
+	unsigned int first;
+
+	first = find_first_bit(pid_list->pids, pid_list->pid_max);
+
+	if (first < pid_list->pid_max) {
+		*pid = first;
+		return 0;
+	}
+	return -1;
+}
+
+/**
+ * trace_pid_list_alloc - create a new pid_list
+ *
+ * Allocates a new pid_list to store pids into.
+ *
+ * Returns the pid_list on success, NULL otherwise.
+ */
+struct trace_pid_list *trace_pid_list_alloc(void)
+{
+	struct trace_pid_list *pid_list;
+
+	pid_list = kmalloc(sizeof(*pid_list), GFP_KERNEL);
+	if (!pid_list)
+		return NULL;
+
+	pid_list->pid_max = READ_ONCE(pid_max);
+
+	pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
+	if (!pid_list->pids) {
+		kfree(pid_list);
+		return NULL;
+	}
+	return pid_list;
+}
+
+/**
+ * trace_pid_list_free - Frees an allocated pid_list.
+ *
+ * Frees the memory for a pid_list that was allocated.
+ */
+void trace_pid_list_free(struct trace_pid_list *pid_list)
+{
+	if (!pid_list)
+		return;
+
+	vfree(pid_list->pids);
+	kfree(pid_list);
+}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7896d30d90f7..dcced07a45e6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -512,12 +512,6 @@ int call_filter_check_discard(struct trace_event_call *call, void *rec,
 	return 0;
 }
 
-void trace_free_pid_list(struct trace_pid_list *pid_list)
-{
-	vfree(pid_list->pids);
-	kfree(pid_list);
-}
-
 /**
  * trace_find_filtered_pid - check if a pid exists in a filtered_pid list
  * @filtered_pids: The list of pids to check
@@ -528,14 +522,7 @@ void trace_free_pid_list(struct trace_pid_list *pid_list)
 bool
 trace_find_filtered_pid(struct trace_pid_list *filtered_pids, pid_t search_pid)
 {
-	/*
-	 * If pid_max changed after filtered_pids was created, we
-	 * by default ignore all pids greater than the previous pid_max.
-	 */
-	if (search_pid >= filtered_pids->pid_max)
-		return false;
-
-	return test_bit(search_pid, filtered_pids->pids);
+	return trace_pid_list_is_set(filtered_pids, search_pid);
 }
 
 /**
@@ -592,15 +579,11 @@ void trace_filter_add_remove_task(struct trace_pid_list *pid_list,
 			return;
 	}
 
-	/* Sorry, but we don't support pid_max changing after setting */
-	if (task->pid >= pid_list->pid_max)
-		return;
-
 	/* "self" is set for forks, and NULL for exits */
 	if (self)
-		set_bit(task->pid, pid_list->pids);
+		trace_pid_list_set(pid_list, task->pid);
 	else
-		clear_bit(task->pid, pid_list->pids);
+		trace_pid_list_clear(pid_list, task->pid);
 }
 
 /**
@@ -617,18 +600,19 @@ void trace_filter_add_remove_task(struct trace_pid_list *pid_list,
  */
 void *trace_pid_next(struct trace_pid_list *pid_list, void *v, loff_t *pos)
 {
-	unsigned long pid = (unsigned long)v;
+	long pid = (unsigned long)v;
+	unsigned int next;
 
 	(*pos)++;
 
 	/* pid already is +1 of the actual previous bit */
-	pid = find_next_bit(pid_list->pids, pid_list->pid_max, pid);
+	if (trace_pid_list_next(pid_list, pid, &next) < 0)
+		return NULL;
 
-	/* Return pid + 1 to allow zero to be represented */
-	if (pid < pid_list->pid_max)
-		return (void *)(pid + 1);
+	pid = next;
 
-	return NULL;
+	/* Return pid + 1 to allow zero to be represented */
+	return (void *)(pid + 1);
 }
 
 /**
@@ -645,12 +629,14 @@ void *trace_pid_next(struct trace_pid_list *pid_list, void *v, loff_t *pos)
 void *trace_pid_start(struct trace_pid_list *pid_list, loff_t *pos)
 {
 	unsigned long pid;
+	unsigned int first;
 	loff_t l = 0;
 
-	pid = find_first_bit(pid_list->pids, pid_list->pid_max);
-	if (pid >= pid_list->pid_max)
+	if (trace_pid_list_first(pid_list, &first) < 0)
 		return NULL;
 
+	pid = first;
+
 	/* Return pid + 1 so that zero can be the exit value */
 	for (pid++; pid && l < *pos;
 	     pid = (unsigned long)trace_pid_next(pid_list, (void *)pid, &l))
@@ -686,7 +672,7 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 	unsigned long val;
 	int nr_pids = 0;
 	ssize_t read = 0;
-	ssize_t ret = 0;
+	ssize_t ret;
 	loff_t pos;
 	pid_t pid;
 
@@ -699,34 +685,23 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 	 * the user. If the operation fails, then the current list is
 	 * not modified.
 	 */
-	pid_list = kmalloc(sizeof(*pid_list), GFP_KERNEL);
+	pid_list = trace_pid_list_alloc();
 	if (!pid_list) {
 		trace_parser_put(&parser);
 		return -ENOMEM;
 	}
 
-	pid_list->pid_max = READ_ONCE(pid_max);
-
-	/* Only truncating will shrink pid_max */
-	if (filtered_pids && filtered_pids->pid_max > pid_list->pid_max)
-		pid_list->pid_max = filtered_pids->pid_max;
-
-	pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
-	if (!pid_list->pids) {
-		trace_parser_put(&parser);
-		kfree(pid_list);
-		return -ENOMEM;
-	}
-
 	if (filtered_pids) {
 		/* copy the current bits to the new max */
-		for_each_set_bit(pid, filtered_pids->pids,
-				 filtered_pids->pid_max) {
-			set_bit(pid, pid_list->pids);
+		ret = trace_pid_list_first(filtered_pids, &pid);
+		while (!ret) {
+			trace_pid_list_set(pid_list, pid);
+			ret = trace_pid_list_next(filtered_pids, pid + 1, &pid);
 			nr_pids++;
 		}
 	}
 
+	ret = 0;
 	while (cnt > 0) {
 
 		pos = 0;
@@ -742,12 +717,13 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 		ret = -EINVAL;
 		if (kstrtoul(parser.buffer, 0, &val))
 			break;
-		if (val >= pid_list->pid_max)
-			break;
 
 		pid = (pid_t)val;
 
-		set_bit(pid, pid_list->pids);
+		if (trace_pid_list_set(pid_list, pid) < 0) {
+			ret = -1;
+			break;
+		}
 		nr_pids++;
 
 		trace_parser_clear(&parser);
@@ -756,13 +732,13 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 	trace_parser_put(&parser);
 
 	if (ret < 0) {
-		trace_free_pid_list(pid_list);
+		trace_pid_list_free(pid_list);
 		return ret;
 	}
 
 	if (!nr_pids) {
 		/* Cleared the list of pids */
-		trace_free_pid_list(pid_list);
+		trace_pid_list_free(pid_list);
 		read = ret;
 		pid_list = NULL;
 	}
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b7c0f8e160fb..fe13a0542486 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -188,10 +188,16 @@ struct trace_options {
 	struct trace_option_dentry	*topts;
 };
 
-struct trace_pid_list {
-	int				pid_max;
-	unsigned long			*pids;
-};
+struct trace_pid_list;
+
+struct trace_pid_list *trace_pid_list_alloc(void);
+void trace_pid_list_free(struct trace_pid_list *pid_list);
+bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid);
+int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid);
+int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid);
+int trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid);
+int trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid,
+			unsigned int *next);
 
 enum {
 	TRACE_PIDS		= BIT(0),
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 830b3b9940f4..bf54d2803261 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -885,10 +885,10 @@ static void __ftrace_clear_event_pids(struct trace_array *tr, int type)
 	tracepoint_synchronize_unregister();
 
 	if ((type & TRACE_PIDS) && pid_list)
-		trace_free_pid_list(pid_list);
+		trace_pid_list_free(pid_list);
 
 	if ((type & TRACE_NO_PIDS) && no_pid_list)
-		trace_free_pid_list(no_pid_list);
+		trace_pid_list_free(no_pid_list);
 }
 
 static void ftrace_clear_event_pids(struct trace_array *tr, int type)
@@ -1967,7 +1967,7 @@ event_pid_write(struct file *filp, const char __user *ubuf,
 
 	if (filtered_pids) {
 		tracepoint_synchronize_unregister();
-		trace_free_pid_list(filtered_pids);
+		trace_pid_list_free(filtered_pids);
 	} else if (pid_list && !other_pids) {
 		register_pid_events(tr);
 	}
-- 
2.32.0


* [PATCH 2/2] tracing: Create a sparse bitmask for pid filtering
  2021-09-24  3:35 [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
  2021-09-24  3:35 ` [PATCH 1/2] tracing: Place trace_pid_list logic into abstract functions Steven Rostedt
@ 2021-09-24  3:35 ` Steven Rostedt
  2021-09-24  4:07 ` [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
  2 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-09-24  3:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-devel

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

When the trace_pid_list was created, the default pid_max was 32768.
Creating a bitmask that holds one bit for each of those 32768 pids took up
4096 bytes (one page). Having a one-page bitmask was not much of a
problem, and that was used for mapping pids. But today, systems are bigger
and can run more tasks, and the default pid_max is now usually set to
4194304, which means handling that many pids requires 524288 bytes. Worse
yet, pid_max can be set to 2^30 (1073741824 or 1G), which would take
134217728 bytes (128M) of memory to store this array.

Since the pid_list array is very sparsely populated, it is a huge waste of
memory to store all possible bits for each pid when most will not be set.

Instead, use a page table scheme to store the array, and allow this to
handle up to 32 bit pids.

The pid_list will start out with a 1024-entry array for the 10 MSBs of
the pid. This costs 4K on 32-bit architectures and 8K on 64-bit. Each of
these entries points to another 1024-entry array covering the next 10 bits
of the pid (another 4 or 8K each). Those entries in turn point to a
512-byte bitmask covering the 12 LSBs (4096 bits).

When the trace_pid_list is allocated, the 4/8K upper array is allocated,
and then a cache of upper chunks and lower chunks is allocated as well
(default 6 of each). Then, when a bit is "set", these chunks are pulled
from the free list and added to the array. If the free list drops below a
given level (default 2), an irqwork is triggered to refill the cache.

On clearing a bit, if the clear causes the bitmask to become zero, that
chunk is placed back into the free cache for later use, keeping the need
for further allocations to a minimum.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/pid_list.c | 460 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 423 insertions(+), 37 deletions(-)

diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
index b78c80d7f396..f307de267a9e 100644
--- a/kernel/trace/pid_list.c
+++ b/kernel/trace/pid_list.c
@@ -2,15 +2,182 @@
 /*
  * Copyright (C) 2021 VMware Inc, Steven Rostedt <rostedt@goodmis.org>
  */
-#include <linux/vmalloc.h>
+#include <linux/spinlock.h>
+#include <linux/irq_work.h>
 #include <linux/slab.h>
-#include "trace.h"
+
+/*
+ * In order to keep track of what pids to trace, a tree is created much
+ * like page tables are used. This creates a sparse bit map, where
+ * the tree is filled in when needed. A PID is 32 bits, and is broken
+ * up into 3 sections. The 10 MSBs are the "upper1" section, the next
+ * 10 bits are the "upper2" section, and the 12 LSBs are the "lower"
+ * section.
+ *
+ * A trace_pid_list structure holds the "upper1" section, in an
+ * array of 1024 pointers to "upper_chunk" unions, where each has an
+ * array of 1024 pointers to the "lower_chunk" structures, where each
+ * has an array of size 512 bytes representing a bitmask of the 12 LSB of
+ * the PID (512 * 8 = 4096)
+ *
+ * When a trace_pid_list is allocated, it includes the 1024 array
+ * of the upper1 unions. Then a "cache" of upper and lower is allocated
+ * where these will be assigned as needed.
+ *
+ * When a bit is set in the pid_list bitmask, the pid to use has
+ * the 10 MSB masked, and this is used to index the array in the
+ * pid_list to find the next upper union. If the element is NULL,
+ * then one is retrieved from the upper_list cache. If none is
+ * available, then -ENOMEM is returned.
+ *
+ * The next 10 MSB is used to index into the "upper2" section. If this
+ * element is NULL, then it is retrieved from the lower_list cache.
+ * Again, if one is not available -ENOMEM is returned.
+ *
+ * Finally the 12 LSB of the PID is used to set the bit in the 4096
+ * bitmask (made up of 512 bytes).
+ *
+ * The upper mask is never freed until the trace_pid_list is freed,
+ * but the lower sections are moved back to the cache list when
+ * all the bits are cleared and can be reused.
+ */
+
+#define UPPER1_SIZE	1024
+#define UPPER2_SIZE	1024
+
+#define LOWER_BITS	4096
+#define LOWER_SIZE	(LOWER_BITS / BITS_PER_LONG)
+
+#define UPPER1_SHIFT	24
+#define UPPER2_SHIFT	12
+#define LOWER_MASK	((1 << UPPER2_SHIFT) - 1)
+
+/* UPPER_MASK is 10 bits, done after the shift */
+#define UPPER_MASK	((1 << 10) - 1)
+
+/* Just keep 6 chunks of both upper and lower in the cache on alloc */
+#define CHUNK_ALLOC 6
+
+/* Have 2 chunks free, trigger a refill of the cache */
+#define CHUNK_REALLOC 2
+
+union lower_chunk {
+	union lower_chunk		*next;
+	unsigned long			data[LOWER_SIZE]; // 512B in size
+};
+
+union upper_chunk {
+	union upper_chunk		*next;
+	union lower_chunk		*data[UPPER2_SIZE]; // 4 or 8K in size
+};
 
 struct trace_pid_list {
-	int			pid_max;
-	unsigned long		*pids;
+	raw_spinlock_t			lock;
+	struct irq_work			refill_irqwork;
+	union upper_chunk		*upper[UPPER1_SIZE]; // 4 or 8K in size
+	union upper_chunk		*upper_list;
+	union lower_chunk		*lower_list;
+	int				free_upper_chunks;
+	int				free_lower_chunks;
 };
 
+static inline union lower_chunk *get_lower_chunk(struct trace_pid_list *pid_list)
+{
+	union lower_chunk *chunk;
+
+	lockdep_assert_held(&pid_list->lock);
+
+	if (!pid_list->lower_list)
+		return NULL;
+
+	chunk = pid_list->lower_list;
+	pid_list->lower_list = chunk->next;
+	pid_list->free_lower_chunks--;
+	WARN_ON_ONCE(pid_list->free_lower_chunks < 0);
+	chunk->next = NULL;
+	/*
+	 * If a refill needs to happen, it can not happen here
+	 * as the scheduler run queue locks are held.
+	 */
+	if (pid_list->free_lower_chunks <= CHUNK_REALLOC)
+		irq_work_queue(&pid_list->refill_irqwork);
+
+	return chunk;
+}
+
+static inline union upper_chunk *get_upper_chunk(struct trace_pid_list *pid_list)
+{
+	union upper_chunk *chunk;
+
+	lockdep_assert_held(&pid_list->lock);
+
+	if (!pid_list->upper_list)
+		return NULL;
+
+	chunk = pid_list->upper_list;
+	pid_list->upper_list = chunk->next;
+	pid_list->free_upper_chunks--;
+	WARN_ON_ONCE(pid_list->free_upper_chunks < 0);
+	chunk->next = NULL;
+	/*
+	 * If a refill needs to happen, it can not happen here
+	 * as the scheduler run queue locks are held.
+	 */
+	if (pid_list->free_upper_chunks <= CHUNK_REALLOC)
+		irq_work_queue(&pid_list->refill_irqwork);
+
+	return chunk;
+}
+
+static inline void put_lower_chunk(struct trace_pid_list *pid_list,
+				   union lower_chunk *chunk)
+{
+	lockdep_assert_held(&pid_list->lock);
+
+	chunk->next = pid_list->lower_list;
+	pid_list->lower_list = chunk;
+	pid_list->free_lower_chunks++;
+}
+
+static inline void put_upper_chunk(struct trace_pid_list *pid_list,
+				   union upper_chunk *chunk)
+{
+	lockdep_assert_held(&pid_list->lock);
+
+	chunk->next = pid_list->upper_list;
+	pid_list->upper_list = chunk;
+	pid_list->free_upper_chunks++;
+}
+
+static inline bool upper_empty(union upper_chunk *chunk)
+{
+	/*
+	 * If chunk->data has no lower chunks, it will be the same
+	 * as a zeroed bitmask. Use find_first_bit() to test it
+	 * and if it doesn't find any bits set, then the array
+	 * is empty.
+	 */
+	int bit = find_first_bit((unsigned long *)chunk->data,
+				 sizeof(chunk->data) * 8);
+	return bit >= sizeof(chunk->data) * 8;
+}
+
+static inline void pid_split(unsigned int pid, unsigned int *upper1,
+			     unsigned int *upper2, unsigned int *lower)
+{
+	*upper1 = (pid >> UPPER1_SHIFT) & UPPER_MASK;
+	*upper2 = (pid >> UPPER2_SHIFT) & UPPER_MASK;
+	*lower = pid & LOWER_MASK;
+}
+
+static inline unsigned int pid_join(unsigned int upper1,
+				    unsigned int upper2, unsigned int lower)
+{
+	return ((upper1 & UPPER_MASK) << UPPER1_SHIFT) |
+		((upper2 & UPPER_MASK) << UPPER2_SHIFT) |
+		(lower & LOWER_MASK);
+}
+
 /**
  * trace_pid_list_is_set - test if the pid is set in the list
  * @pid_list: The pid list to test
@@ -24,14 +191,29 @@ struct trace_pid_list {
  */
 bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
 {
-	/*
-	 * If pid_max changed after filtered_pids was created, we
-	 * by default ignore all pids greater than the previous pid_max.
-	 */
-	if (pid >= pid_list->pid_max)
+	union upper_chunk *upper_chunk;
+	union lower_chunk *lower_chunk;
+	unsigned long flags;
+	unsigned int upper1;
+	unsigned int upper2;
+	unsigned int lower;
+	bool ret = false;
+
+	if (!pid_list)
 		return false;
 
-	return test_bit(pid, pid_list->pids);
+	pid_split(pid, &upper1, &upper2, &lower);
+
+	raw_spin_lock_irqsave(&pid_list->lock, flags);
+	upper_chunk = pid_list->upper[upper1];
+	if (upper_chunk) {
+		lower_chunk = upper_chunk->data[upper2];
+		if (lower_chunk)
+			ret = test_bit(lower, lower_chunk->data);
+	}
+	raw_spin_unlock_irqrestore(&pid_list->lock, flags);
+
+	return ret;
 }
 
 /**
@@ -47,13 +229,43 @@ bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
  */
 int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid)
 {
-	/* Sorry, but we don't support pid_max changing after setting */
-	if (pid >= pid_list->pid_max)
-		return -EINVAL;
+	union upper_chunk *upper_chunk;
+	union lower_chunk *lower_chunk;
+	unsigned long flags;
+	unsigned int upper1;
+	unsigned int upper2;
+	unsigned int lower;
+	int ret;
+
+	if (!pid_list)
+		return -ENODEV;
 
-	set_bit(pid, pid_list->pids);
+	pid_split(pid, &upper1, &upper2, &lower);
 
-	return 0;
+	raw_spin_lock_irqsave(&pid_list->lock, flags);
+	upper_chunk = pid_list->upper[upper1];
+	if (!upper_chunk) {
+		upper_chunk = get_upper_chunk(pid_list);
+		if (!upper_chunk) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		pid_list->upper[upper1] = upper_chunk;
+	}
+	lower_chunk = upper_chunk->data[upper2];
+	if (!lower_chunk) {
+		lower_chunk = get_lower_chunk(pid_list);
+		if (!lower_chunk) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		upper_chunk->data[upper2] = lower_chunk;
+	}
+	set_bit(lower, lower_chunk->data);
+	ret = 0;
+ out:
+	raw_spin_unlock_irqrestore(&pid_list->lock, flags);
+	return ret;
 }
 
 /**
@@ -69,12 +281,40 @@ int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid)
  */
 int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid)
 {
-	/* Sorry, but we don't support pid_max changing after setting */
-	if (pid >= pid_list->pid_max)
-		return -EINVAL;
+	union upper_chunk *upper_chunk;
+	union lower_chunk *lower_chunk;
+	unsigned long flags;
+	unsigned int upper1;
+	unsigned int upper2;
+	unsigned int lower;
+
+	if (!pid_list)
+		return -ENODEV;
+
+	pid_split(pid, &upper1, &upper2, &lower);
+
+	raw_spin_lock_irqsave(&pid_list->lock, flags);
+	upper_chunk = pid_list->upper[upper1];
+	if (!upper_chunk)
+		goto out;
+
+	lower_chunk = upper_chunk->data[upper2];
+	if (!lower_chunk)
+		goto out;
 
-	clear_bit(pid, pid_list->pids);
+	clear_bit(lower, lower_chunk->data);
 
+	/* if there's no more bits set, add it to the free list */
+	if (find_first_bit(lower_chunk->data, LOWER_BITS) >= LOWER_BITS) {
+		put_lower_chunk(pid_list, lower_chunk);
+		upper_chunk->data[upper2] = NULL;
+		if (upper_empty(upper_chunk)) {
+			put_upper_chunk(pid_list, upper_chunk);
+			pid_list->upper[upper1] = NULL;
+		}
+	}
+ out:
+	raw_spin_unlock_irqrestore(&pid_list->lock, flags);
 	return 0;
 }
 
@@ -93,13 +333,44 @@ int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid)
 int trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid,
 			unsigned int *next)
 {
-	pid = find_next_bit(pid_list->pids, pid_list->pid_max, pid);
+	union upper_chunk *upper_chunk;
+	union lower_chunk *lower_chunk;
+	unsigned long flags;
+	unsigned int upper1;
+	unsigned int upper2;
+	unsigned int lower;
+
+	if (!pid_list)
+		return -ENODEV;
+
+	pid_split(pid, &upper1, &upper2, &lower);
 
-	if (pid < pid_list->pid_max) {
-		*next = pid;
-		return 0;
+	raw_spin_lock_irqsave(&pid_list->lock, flags);
+	for (; upper1 <= UPPER_MASK; upper1++, upper2 = 0) {
+		upper_chunk = pid_list->upper[upper1];
+
+		if (!upper_chunk)
+			continue;
+
+		for (; upper2 <= UPPER_MASK; upper2++, lower = 0) {
+			lower_chunk = upper_chunk->data[upper2];
+			if (!lower_chunk)
+				continue;
+
+			lower = find_next_bit(lower_chunk->data, LOWER_BITS,
+					    lower);
+			if (lower < LOWER_BITS)
+				goto found;
+		}
 	}
-	return -1;
+
+ found:
+	raw_spin_unlock_irqrestore(&pid_list->lock, flags);
+	if (upper1 > UPPER_MASK)
+		return -1;
+
+	*next = pid_join(upper1, upper2, lower);
+	return 0;
 }
 
 /**
@@ -114,15 +385,79 @@ int trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid,
  */
 int trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid)
 {
-	unsigned int first;
+	return trace_pid_list_next(pid_list, 0, pid);
+}
 
-	first = find_first_bit(pid_list->pids, pid_list->pid_max);
+static void pid_list_refill_irq(struct irq_work *iwork)
+{
+	struct trace_pid_list *pid_list = container_of(iwork, struct trace_pid_list,
+						       refill_irqwork);
+	union upper_chunk *upper;
+	union lower_chunk *lower;
+	union upper_chunk **upper_next = &upper;
+	union lower_chunk **lower_next = &lower;
+	int upper_count;
+	int lower_count;
+	int ucnt = 0;
+	int lcnt = 0;
+
+ again:
+	raw_spin_lock(&pid_list->lock);
+	upper_count = CHUNK_ALLOC - pid_list->free_upper_chunks;
+	lower_count = CHUNK_ALLOC - pid_list->free_lower_chunks;
+	raw_spin_unlock(&pid_list->lock);
+
+	if (upper_count <= 0 && lower_count <= 0)
+		return;
 
-	if (first < pid_list->pid_max) {
-		*pid = first;
-		return 0;
+	while (upper_count-- > 0) {
+		union upper_chunk *chunk;
+
+		chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
+		if (!chunk)
+			break;
+		*upper_next = chunk;
+		upper_next = &chunk->next;
+		ucnt++;
+	}
+
+	while (lower_count-- > 0) {
+		union lower_chunk *chunk;
+
+		chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
+		if (!chunk)
+			break;
+		*lower_next = chunk;
+		lower_next = &chunk->next;
+		lcnt++;
 	}
-	return -1;
+
+	raw_spin_lock(&pid_list->lock);
+	if (upper) {
+		*upper_next = pid_list->upper_list;
+		pid_list->upper_list = upper;
+		pid_list->free_upper_chunks += ucnt;
+	}
+	if (lower) {
+		*lower_next = pid_list->lower_list;
+		pid_list->lower_list = lower;
+		pid_list->free_lower_chunks += lcnt;
+	}
+	raw_spin_unlock(&pid_list->lock);
+
+	/*
+	 * On success of allocating all the chunks, both counters
+	 * will be less than zero. If they are not, then an allocation
+	 * failed, and we should not try again.
+	 */
+	if (upper_count >= 0 || lower_count >= 0)
+		return;
+	/*
+	 * When the locks were released, free chunks could have
+	 * been used and allocation needs to be done again. Might as
+	 * well allocate it now.
+	 */
+	goto again;
 }
 
 /**
@@ -135,18 +470,38 @@ int trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid)
 struct trace_pid_list *trace_pid_list_alloc(void)
 {
 	struct trace_pid_list *pid_list;
+	int i;
 
-	pid_list = kmalloc(sizeof(*pid_list), GFP_KERNEL);
+	pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
 	if (!pid_list)
 		return NULL;
 
-	pid_list->pid_max = READ_ONCE(pid_max);
+	init_irq_work(&pid_list->refill_irqwork, pid_list_refill_irq);
 
-	pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
-	if (!pid_list->pids) {
-		kfree(pid_list);
-		return NULL;
+	raw_spin_lock_init(&pid_list->lock);
+
+	for (i = 0; i < CHUNK_ALLOC; i++) {
+		union upper_chunk *chunk;
+
+		chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
+		if (!chunk)
+			break;
+		chunk->next = pid_list->upper_list;
+		pid_list->upper_list = chunk;
+		pid_list->free_upper_chunks++;
+	}
+
+	for (i = 0; i < CHUNK_ALLOC; i++) {
+		union lower_chunk *chunk;
+
+		chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
+		if (!chunk)
+			break;
+		chunk->next = pid_list->lower_list;
+		pid_list->lower_list = chunk;
+		pid_list->free_lower_chunks++;
 	}
+
 	return pid_list;
 }
 
@@ -157,9 +512,40 @@ struct trace_pid_list *trace_pid_list_alloc(void)
  */
 void trace_pid_list_free(struct trace_pid_list *pid_list)
 {
+	union upper_chunk *upper;
+	union lower_chunk *lower;
+	int i, j;
+
 	if (!pid_list)
 		return;
 
-	vfree(pid_list->pids);
+	irq_work_sync(&pid_list->refill_irqwork);
+
+	while (pid_list->lower_list) {
+		union lower_chunk *chunk;
+
+		chunk = pid_list->lower_list;
+		pid_list->lower_list = pid_list->lower_list->next;
+		kfree(chunk);
+	}
+
+	while (pid_list->upper_list) {
+		union upper_chunk *chunk;
+
+		chunk = pid_list->upper_list;
+		pid_list->upper_list = pid_list->upper_list->next;
+		kfree(chunk);
+	}
+
+	for (i = 0; i < UPPER1_SIZE; i++) {
+		upper = pid_list->upper[i];
+		if (upper) {
+			for (j = 0; j < UPPER2_SIZE; j++) {
+				lower = upper->data[j];
+				kfree(lower);
+			}
+			kfree(upper);
+		}
+	}
 	kfree(pid_list);
 }
-- 
2.32.0


* Re: [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
  2021-09-24  3:35 [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
  2021-09-24  3:35 ` [PATCH 1/2] tracing: Place trace_pid_list logic into abstract functions Steven Rostedt
  2021-09-24  3:35 ` [PATCH 2/2] tracing: Create a sparse bitmask for pid filtering Steven Rostedt
@ 2021-09-24  4:07 ` Steven Rostedt
  2021-09-24 10:51   ` Eugene Syromyatnikov
  2 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2021-09-24  4:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-devel

On Thu, 23 Sep 2021 23:35:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The pid_mask will start out with 1024 entries for the first 10 MSB bits.
> This will cost 4K for 32 bit architectures and 8K for 64 bit. Each of
> these will have a 1024 array to store the next 10 bits of the pid (another
> 4 or 8K). These will hold an 512 byte bitmask (which will cover the LSB 12
>   bits or 4096 bits).

Thinking about this more, I should adjust this split.

Currently, this is a 10,10,12 split, but since the upper chunks are
arrays of pointers, the lower chunk is a bitmask, and pids tend to be
close together, I should make the lower split bigger.

As a 4K page is 32768 bits (2^15), the lower split should be 15 bits,
not 12. This will probably allocate better.

Of course, 32 - 15 is 17. So maybe, to keep it simple and keep the two
upper chunks the same size, I could have it be 14 bits for the lower
(2048 bytes). And since pid_max can only go up to 2^30, we don't even
need to cover the full 32 bits.

30 - 14 = 16 = 8 * 2.

Then I can make the upper chunks cover 8 bits (arrays of 256 pointers)
and the lower chunks cover 14 bits. This would coincidentally make all
chunks 2K in size (on 64-bit machines).

Hmm, in that case, I can make the lower and upper chunks use the same
memory and not separate them. They are unions after all. But that may
be unfair for 32 bit machines, as the upper chunks are only going to be
1K in size for them (256 * 4).
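
For concreteness, a hypothetical 8,8,14 variant of the defines from patch
2 would look something like this (just the arithmetic spelled out, not
code from this series):

#define UPPER1_SIZE	(1 << 8)	/* 256 pointers: 2K on 64-bit, 1K on 32-bit */
#define UPPER2_SIZE	(1 << 8)	/* 256 pointers: 2K on 64-bit, 1K on 32-bit */
#define LOWER_BITS	(1 << 14)	/* 16384 bits = 2K of bitmask */

#define UPPER1_SHIFT	(8 + 14)	/* pid bits 22-29 */
#define UPPER2_SHIFT	14		/* pid bits 14-21 */
#define UPPER_MASK	((1 << 8) - 1)
#define LOWER_MASK	((1 << 14) - 1)	/* pid bits 0-13 */

8 + 8 + 14 = 30 bits, which is exactly the pid_max limit of 2^30.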

-- Steve



* Re: [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
  2021-09-24  4:07 ` [PATCH 0/2] tracing: Have trace_pid_list be a sparse array Steven Rostedt
@ 2021-09-24 10:51   ` Eugene Syromyatnikov
  2021-09-24 13:16     ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Eugene Syromyatnikov @ 2021-09-24 10:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: lkml, Ingo Molnar, Andrew Morton, Masami Hiramatsu,
	Mathieu Desnoyers, linux-trace-devel

On Fri, Sep 24, 2021 at 6:07 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 23 Sep 2021 23:35:47 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > The pid_mask will start out with 1024 entries for the first 10 MSB bits.
> > This will cost 4K for 32 bit architectures and 8K for 64 bit. Each of
> > these will have a 1024 array to store the next 10 bits of the pid (another
> > 4 or 8K). These will hold an 512 byte bitmask (which will cover the LSB 12
> >   bits or 4096 bits).
>
> Thinking about this more, I should adjust this split.
>
> Currently, this is a 10,10,12 split, but since the upper chunks are
> arrays of pointers, and the lower chunk is a bitmask, and that pids
> tend to be close together, I should make the lower split bigger.
>
> As a 4K page is 32768 bits (2^15), the lower split should be 15 bits,
> not 12. This will probably allocate better.
>
> Of course 32 - 15 is 17. So maybe to keep it simple, by having the two
> upper chunks still the same in size, I could have it be 14 bits for the
> lower (2048 bytes). And since pid_max can only be 2^30, we don't even
> need to cover the full 32 bits.
>
> 30 - 14 = 16 = 8 * 2.
>
> Then I can make the upper chunks cover 8 bits (arrays of 256 pointers)
> and the lower chunks cove 14 bits. This would coincidentally make all
> chunks 2K in size (on 64 bit machines).
>
> Hmm, in that case, I can make the lower and upper chunks use the same
> memory and not separate them. They are unions after all. But that may
> be unfair for 32 bit machines, as the upper chunks are only going to be
> 1K in size for them (256 * 4).

Note that there is only one top-level chunk (so its size doesn't
matter much), (usually) about one middle-tier chunk (except for some
heavy cases, since pids are allocated linearly), and quite a few
lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
the expense of middle-tier (and especially top-tier) chunk size
(especially considering that the kernel uses a buddy allocator), like
a 12-8-12 split or something like that, but I have no factual basis
for arguing for a specific split.  Also, I cannot resist noticing
that this reminds me an awful lot of XArray and [1].  Maybe some
wrapper around XArray would do?

[1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h
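
A rough sketch of that idea (untested and purely illustrative; the
struct/function names are made up, and locking and teardown are
omitted), keying an XArray by the upper pid bits and keeping only the
bitmask leaves:

#include <linux/xarray.h>
#include <linux/bitmap.h>

#define LEAF_BITS	12	/* one 4096-bit leaf per 2^12 pids */

struct pid_set {
	struct xarray	leaves;	/* index: pid >> LEAF_BITS; xa_init() at alloc time */
};

static int pid_set_set(struct pid_set *s, unsigned int pid)
{
	unsigned long *leaf = xa_load(&s->leaves, pid >> LEAF_BITS);

	if (!leaf) {
		leaf = bitmap_zalloc(1 << LEAF_BITS, GFP_KERNEL);
		if (!leaf)
			return -ENOMEM;
		/* the XArray manages its own intermediate radix-tree nodes */
		if (xa_err(xa_store(&s->leaves, pid >> LEAF_BITS, leaf,
				    GFP_KERNEL))) {
			bitmap_free(leaf);
			return -ENOMEM;
		}
	}
	set_bit(pid & ((1 << LEAF_BITS) - 1), leaf);
	return 0;
}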

>
> -- Steve
>



--
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}


* Re: [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
  2021-09-24 10:51   ` Eugene Syromyatnikov
@ 2021-09-24 13:16     ` Steven Rostedt
  2021-09-24 13:35       ` Masami Hiramatsu
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2021-09-24 13:16 UTC (permalink / raw)
  To: Eugene Syromyatnikov
  Cc: lkml, Ingo Molnar, Andrew Morton, Masami Hiramatsu,
	Mathieu Desnoyers, linux-trace-devel

On Fri, 24 Sep 2021 12:51:07 +0200
Eugene Syromyatnikov <evgsyr@gmail.com> wrote:

Hi Eugene,

> Note that there is only one top-level chunk (so its size doesn't
> matter much), (usually) about one middle-tier chunk (except for some
> heavy cases, since pids are allocated linearly), and quite some
> lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
> the expense of middle-tier (and especially top-tier) chunk size
> (especially considering the fact that in the kernel, buddy allocator
> is used), like 12-8-12 or something like that, but I have no factual

What I really like about my 8,8,14 split is that it makes everything 2K in
size on 64-bit machines (1K, 1K, 2K for 32-bit, but who cares ;-)

 (1 << 8) * 8 bytes = 2K   // top tiers are pointers to lower tiers
 (1 << 14) bits     = 2K   // lower tier only cares about bits

This means they will likely all be allocated in the same slab.

I'm optimizing the top tiers for size, because they are likely to be empty.
Why add memory for something that will never be used and can't be removed?
Note, the middle and lower tiers can be reused when they go empty, which is
a likely use case (at least when I test this using hackbench).

> basis for arguing about specific split.  Also, I cannot resist from
> noticing that this reminds me an awful lot of XArray and [1].  Maybe,
> some wrapper around XArray would do?
> 
> [1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h
> 

I looked into xarray and it appears to be optimized for storing
something, whereas I'm just interested in a sparse bitmask.

Thanks for the review.

Note, I'll be posting a v3 soon because I found that if I echo 1<<30 into
set_event_pid, it adds pid 0 (because the code only looks at the bottom 30
bits). It should really return -EINVAL.

-- Steve


* Re: [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
  2021-09-24 13:16     ` Steven Rostedt
@ 2021-09-24 13:35       ` Masami Hiramatsu
  2021-09-24 14:18         ` Eugene Syromyatnikov
  0 siblings, 1 reply; 8+ messages in thread
From: Masami Hiramatsu @ 2021-09-24 13:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Eugene Syromyatnikov, lkml, Ingo Molnar, Andrew Morton,
	Masami Hiramatsu, Mathieu Desnoyers, linux-trace-devel

On Fri, 24 Sep 2021 09:16:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Sep 2021 12:51:07 +0200
> Eugene Syromyatnikov <evgsyr@gmail.com> wrote:
> 
> Hi Eugene,
> 
> > Note that there is only one top-level chunk (so its size doesn't
> > matter much), (usually) about one middle-tier chunk (except for some
> > heavy cases, since pids are allocated linearly), and quite some
> > lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
> > the expense of middle-tier (and especially top-tier) chunk size
> > (especially considering the fact that in the kernel, buddy allocator
> > is used), like 12-8-12 or something like that, but I have no factual
> 
> What I really like about my 8 8 14 split I have, it makes everything 2K in
> size on 64 bit machines (1K 1K 2K for 32 bit, but who cares ;-)
> 
>  1 << 8 * 8 bytes = 2K   // top tiers are pointers to lower tiers
>  1 << 14 bits = 2K       // lower tier only cares about bits
> 
> This means they will likely all be allocated in the same slab.
> 
> I'm optimizing the top tiers for size, because they are likely to be empty.
> Why add memory for something that will never be used, and can't be removed.
> Note, the middle and lower tiers can be reused when they go empty, which is
> a likely use case (at least when I test this using hackbench).
> 
> > basis for arguing about specific split.  Also, I cannot resist from
> > noticing that this reminds me an awful lot of XArray and [1].  Maybe,
> > some wrapper around XArray would do?
> > 
> > [1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h
> > 
> 
> I looked into xarray and it appears to be optimized for storing something,
> where as I'm just interested in a sparse bitmask.

I guess he suggested storing the bitmask in an xarray. Anyway, both are
OK to me. This is needed to reduce the memory usage.

Thank you,

> 
> Thanks for the review.
> 
> Note, I'll be posting a v3 soon because I found if I echo 1<<30 into
> set_event_pid, it adds 0 (because it only looks at the bottom 30 bits).
> It should really return -EINVAL.
> 
> -- Steve


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [PATCH 0/2] tracing: Have trace_pid_list be a sparse array
  2021-09-24 13:35       ` Masami Hiramatsu
@ 2021-09-24 14:18         ` Eugene Syromyatnikov
  0 siblings, 0 replies; 8+ messages in thread
From: Eugene Syromyatnikov @ 2021-09-24 14:18 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, lkml, Ingo Molnar, Andrew Morton,
	Mathieu Desnoyers, linux-trace-devel

On Fri, Sep 24, 2021 at 3:35 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 24 Sep 2021 09:16:27 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> > I'm optimizing the top tiers for size, because they are likely to be empty.
> > Why add memory for something that will never be used, and can't be removed.
> > Note, the middle and lower tiers can be reused when they go empty, which is
> > a likely use case (at least when I test this using hackbench).

Makes sense; I was thinking about worst-case scenarios (tracing
thousands of processes or more), but those are probably not as common or
important.

> > I looked into xarray and it appears to be optimized for storing something,
> > where as I'm just interested in a sparse bitmask.
>
> I guess he suggested that store the bitmask in xarray. Anyway, both are
> OK to me. This is needed for reducing the memory.

Yes, the idea was to store pointers to bitset leaves in an XArray and
leverage its radix tree implementation, at the cost of somewhat lower
efficiency (since XArray indices are longs and it thus employs more
intermediate levels on 64-bit architectures).

-- 
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}

