* [RFC bpf-next 00/16] bpf: Speed up trampoline attach
@ 2020-10-22  8:21 Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 01/16] ftrace: Add check_direct_entry function Jiri Olsa
                   ` (16 more replies)
  0 siblings, 17 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

hi,
this patchset tries to speed up the attach time for trampolines
and make bpftrace faster for wildcard use cases like:

  # bpftrace -ve 'kfunc:__x64_sys_s* { printf("test\n"); }'

Profiles show that most of the time is spent in the ftrace backend,
because we add trampoline functions one by one and ftrace direct
function registration is quite expensive. Thus the main change in
this patchset is to allow batch attach and use just a single ftrace
call to attach or detach multiple ips/trampolines.
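
In kernel terms the idea is roughly the following (just a sketch built
on the helpers added later in this series, not actual patch code):

  #include <linux/ftrace.h>

  /* current: one expensive ftrace update per attached program */
  static int attach_one_by_one(unsigned long *ips, unsigned long *addrs, int count)
  {
          int i, err;

          for (i = 0; i < count; i++) {
                  err = register_ftrace_direct(ips[i], addrs[i]);
                  if (err)
                          return err;
          }
          return 0;
  }

  /* with this series: a single batched ftrace update for all of them */
  static int attach_batched(unsigned long *ips, unsigned long *addrs, int count)
  {
          return register_ftrace_direct_ips(ips, addrs, count);
  }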

This patchset also contains other speedup changes that showed
up in profiles:

  - delayed link free
    to bypass detach cycles completely

  - kallsyms rbtree search
    change linear search to rb tree search

For a clean attach workload I also added a new attach selftest,
which is not meant to be merged but is used to show the profile
results.

The following numbers show the speedup after applying each change
on top of (and including) the previous changes.

profiled with: 'perf stat -r 5 -e cycles:k,cycles:u ...'

For bpftrace:

  # bpftrace -ve 'kfunc:__x64_sys_s* { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}'

  - base

      3,290,457,628      cycles:k         ( +-  0.27% )
        933,581,973      cycles:u         ( +-  0.20% )

      50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )

  + delayed link free

      2,535,458,767      cycles:k         ( +-  0.55% )
        940,046,382      cycles:u         ( +-  0.27% )

      33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )

  + kallsym rbtree search

      2,199,433,771      cycles:k         ( +-  0.55% )
        936,105,469      cycles:u         ( +-  0.37% )

      26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

  + batch support

      1,456,854,867      cycles:k         ( +-  0.57% )
        937,737,431      cycles:u         ( +-  0.13% )

      12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )

  + rcu fix

      1,427,959,119      cycles:k         ( +-  0.87% )
        930,833,507      cycles:u         ( +-  0.23% )

      14.53 +- 3.51 seconds time elapsed  ( +- 24.14% )


For attach_test the numbers do not show a direct time speedup when
using the batch support, but they show a big decrease in kernel
cycles. It seems the time is spent waiting on rcu, which I tried
to address in a most likely wrong rcu fix:

  # ./test_progs -t attach_test

  - base

      1,350,136,760      cycles:k         ( +-  0.07% )
         70,591,712      cycles:u         ( +-  0.26% )

      24.26 +- 2.82 seconds time elapsed  ( +- 11.62% )

  + delayed link free

        996,152,309      cycles:k         ( +-  0.37% )
         69,263,150      cycles:u         ( +-  0.50% )

      15.63 +- 1.80 seconds time elapsed  ( +- 11.51% )

  + kallsym rbtree search

        390,217,706      cycles:k         ( +-  0.66% )
         68,999,019      cycles:u         ( +-  0.46% )

      14.11 +- 2.11 seconds time elapsed  ( +- 14.98% )

  + batch support

         37,410,887      cycles:k         ( +-  0.98% )
         70,062,158      cycles:u         ( +-  0.39% )

      26.80 +- 4.10 seconds time elapsed  ( +- 15.31% )

  + rcu fix

         36,812,432      cycles:k         ( +-  2.52% )
         69,907,191      cycles:u         ( +-  0.38% )

      15.04 +- 2.94 seconds time elapsed  ( +- 19.54% )


I still need to go through the changes and double check them. Also
the ftrace changes are most likely wrong and I most likely broke a
few tests (hence it's an RFC), but I wonder whether you guys would
like this batch solution and if there are any thoughts on that.

Also available in
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/batch

thanks,
jirka


---
Jiri Olsa (16):
      ftrace: Add check_direct_entry function
      ftrace: Add adjust_direct_size function
      ftrace: Add get/put_direct_func function
      ftrace: Add ftrace_set_filter_ips function
      ftrace: Add register_ftrace_direct_ips function
      ftrace: Add unregister_ftrace_direct_ips function
      kallsyms: Use rb tree for kallsyms name search
      bpf: Use delayed link free in bpf_link_put
      bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
      bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support
      bpf: Sync uapi bpf.h to tools
      bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED)
      libbpf: Add trampoline batch attach support
      libbpf: Add trampoline batch detach support
      selftests/bpf: Add trampoline batch test
      selftests/bpf: Add attach batch test (NOT TO BE MERGED)

 include/linux/bpf.h                                       |  18 +++++-
 include/linux/ftrace.h                                    |   7 +++
 include/uapi/linux/bpf.h                                  |   8 +++
 kernel/bpf/syscall.c                                      | 125 ++++++++++++++++++++++++++++++++++----
 kernel/bpf/trampoline.c                                   |  95 +++++++++++++++++++++++------
 kernel/kallsyms.c                                         |  95 ++++++++++++++++++++++++++---
 kernel/trace/ftrace.c                                     | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 net/bpf/test_run.c                                        |  55 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                            |   8 +++
 tools/lib/bpf/bpf.c                                       |  24 ++++++++
 tools/lib/bpf/bpf.h                                       |   2 +
 tools/lib/bpf/libbpf.c                                    | 126 ++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h                                    |   5 +-
 tools/lib/bpf/libbpf.map                                  |   2 +
 tools/testing/selftests/bpf/prog_tests/attach_test.c      |  27 +++++++++
 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c |  45 ++++++++++++++
 tools/testing/selftests/bpf/progs/attach_test.c           |  62 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/trampoline_batch_test.c |  75 +++++++++++++++++++++++
 18 files changed, 995 insertions(+), 88 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
 create mode 100644 tools/testing/selftests/bpf/progs/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/trampoline_batch_test.c



* [RFC bpf-next 01/16] ftrace: Add check_direct_entry function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 02/16] ftrace: Add adjust_direct_size function Jiri Olsa
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Move the code that checks for a valid direct ip into a separate
check_direct_ip function. It will be used in following patches;
there's no functional change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/ftrace.c | 55 +++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8185f7240095..27e9210073d3 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5025,6 +5025,36 @@ struct ftrace_direct_func *ftrace_find_direct_func(unsigned long addr)
 	return NULL;
 }
 
+static int check_direct_ip(unsigned long ip)
+{
+	struct dyn_ftrace *rec;
+
+	/* See if there's a direct function at @ip already */
+	if (ftrace_find_rec_direct(ip))
+		return -EBUSY;
+
+	rec = lookup_rec(ip, ip);
+	if (!rec)
+		return -ENODEV;
+
+	/*
+	 * Check if the rec says it has a direct call but we didn't
+	 * find one earlier?
+	 */
+	if (WARN_ON(rec->flags & FTRACE_FL_DIRECT))
+		return -ENODEV;
+
+	/* Make sure the ip points to the exact record */
+	if (ip != rec->ip) {
+		ip = rec->ip;
+		/* Need to check this ip for a direct. */
+		if (ftrace_find_rec_direct(ip))
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * @ip: The address of the nop at the beginning of a function
@@ -5047,35 +5077,14 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 	struct ftrace_direct_func *direct;
 	struct ftrace_func_entry *entry;
 	struct ftrace_hash *free_hash = NULL;
-	struct dyn_ftrace *rec;
 	int ret = -EBUSY;
 
 	mutex_lock(&direct_mutex);
 
-	/* See if there's a direct function at @ip already */
-	if (ftrace_find_rec_direct(ip))
-		goto out_unlock;
-
-	ret = -ENODEV;
-	rec = lookup_rec(ip, ip);
-	if (!rec)
-		goto out_unlock;
-
-	/*
-	 * Check if the rec says it has a direct call but we didn't
-	 * find one earlier?
-	 */
-	if (WARN_ON(rec->flags & FTRACE_FL_DIRECT))
+	ret = check_direct_ip(ip);
+	if (ret)
 		goto out_unlock;
 
-	/* Make sure the ip points to the exact record */
-	if (ip != rec->ip) {
-		ip = rec->ip;
-		/* Need to check this ip for a direct. */
-		if (ftrace_find_rec_direct(ip))
-			goto out_unlock;
-	}
-
 	ret = -ENOMEM;
 	if (ftrace_hash_empty(direct_functions) ||
 	    direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
-- 
2.26.2



* [RFC bpf-next 02/16] ftrace: Add adjust_direct_size function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 01/16] ftrace: Add check_direct_entry function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 03/16] ftrace: Add get/put_direct_func function Jiri Olsa
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Move the code for adjusting the size of the direct hash into a
separate adjust_direct_size function. It will be used in following
patches; there's no functional change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/ftrace.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 27e9210073d3..cb8b7a66c6af 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5055,6 +5055,27 @@ static int check_direct_ip(unsigned long ip)
 	return 0;
 }
 
+static int adjust_direct_size(int new_size, struct ftrace_hash **free_hash)
+{
+	if (ftrace_hash_empty(direct_functions) ||
+	    direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
+		struct ftrace_hash *new_hash;
+		int size = ftrace_hash_empty(direct_functions) ? 0 : new_size;
+
+		if (size < 32)
+			size = 32;
+
+		new_hash = dup_hash(direct_functions, size);
+		if (!new_hash)
+			return -ENOMEM;
+
+		*free_hash = direct_functions;
+		direct_functions = new_hash;
+	}
+
+	return 0;
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * @ip: The address of the nop at the beginning of a function
@@ -5086,22 +5107,8 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 		goto out_unlock;
 
 	ret = -ENOMEM;
-	if (ftrace_hash_empty(direct_functions) ||
-	    direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
-		struct ftrace_hash *new_hash;
-		int size = ftrace_hash_empty(direct_functions) ? 0 :
-			direct_functions->count + 1;
-
-		if (size < 32)
-			size = 32;
-
-		new_hash = dup_hash(direct_functions, size);
-		if (!new_hash)
-			goto out_unlock;
-
-		free_hash = direct_functions;
-		direct_functions = new_hash;
-	}
+	if (adjust_direct_size(direct_functions->count + 1, &free_hash))
+		goto out_unlock;
 
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
-- 
2.26.2



* [RFC bpf-next 03/16] ftrace: Add get/put_direct_func function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 01/16] ftrace: Add check_direct_entry function Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 02/16] ftrace: Add adjust_direct_size function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 04/16] ftrace: Add ftrace_set_filter_ips function Jiri Olsa
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Move the code for managing ftrace_direct_funcs entries into
get_direct_func and put_direct_func functions. It will be
used in following patches; there's no functional change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/ftrace.c | 44 +++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index cb8b7a66c6af..95ef7e2a6a57 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5076,6 +5076,32 @@ static int adjust_direct_size(int new_size, struct ftrace_hash **free_hash)
 	return 0;
 }
 
+static struct ftrace_direct_func *get_direct_func(unsigned long addr)
+{
+	struct ftrace_direct_func *direct;
+
+	direct = ftrace_find_direct_func(addr);
+	if (!direct) {
+		direct = kmalloc(sizeof(*direct), GFP_KERNEL);
+		if (!direct)
+			return NULL;
+		direct->addr = addr;
+		direct->count = 0;
+		list_add_rcu(&direct->next, &ftrace_direct_funcs);
+		ftrace_direct_func_count++;
+	}
+
+	return direct;
+}
+
+static void put_direct_func(struct ftrace_direct_func *direct)
+{
+	list_del_rcu(&direct->next);
+	synchronize_rcu_tasks();
+	kfree(direct);
+	ftrace_direct_func_count--;
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * @ip: The address of the nop at the beginning of a function
@@ -5114,17 +5140,10 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 	if (!entry)
 		goto out_unlock;
 
-	direct = ftrace_find_direct_func(addr);
+	direct = get_direct_func(addr);
 	if (!direct) {
-		direct = kmalloc(sizeof(*direct), GFP_KERNEL);
-		if (!direct) {
-			kfree(entry);
-			goto out_unlock;
-		}
-		direct->addr = addr;
-		direct->count = 0;
-		list_add_rcu(&direct->next, &ftrace_direct_funcs);
-		ftrace_direct_func_count++;
+		kfree(entry);
+		goto out_unlock;
 	}
 
 	entry->ip = ip;
@@ -5144,13 +5163,10 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 	if (ret) {
 		kfree(entry);
 		if (!direct->count) {
-			list_del_rcu(&direct->next);
-			synchronize_rcu_tasks();
-			kfree(direct);
+			put_direct_func(direct);
 			if (free_hash)
 				free_ftrace_hash(free_hash);
 			free_hash = NULL;
-			ftrace_direct_func_count--;
 		}
 	} else {
 		direct->count++;
-- 
2.26.2



* [RFC bpf-next 04/16] ftrace: Add ftrace_set_filter_ips function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (2 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 03/16] ftrace: Add get/put_direct_func function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 05/16] ftrace: Add register_ftrace_direct_ips function Jiri Olsa
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Add a ftrace_set_filter_ips function that allows setting a filter
on multiple ip addresses. These are provided as an array of
unsigned longs together with the array count:

  int ftrace_set_filter_ips(struct ftrace_ops *ops,
                            unsigned long *ips,
                            int count, int remove);

The function copies the logic of ftrace_set_filter_ip, but over
multiple ip addresses.

It will be used in following patches for faster direct ip/addr
trampoline updates.
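
A minimal caller sketch (hypothetical ftrace_ops user, not part of
this patch), assuming the prototype above:

  #include <linux/ftrace.h>

  /* add all entries in 'ips' to the ops filter hash with one update */
  static int my_set_filters(struct ftrace_ops *ops, unsigned long *ips, int count)
  {
          return ftrace_set_filter_ips(ops, ips, count, 0 /* remove */);
  }

  /* later, drop the same entries again in one call */
  static int my_clear_filters(struct ftrace_ops *ops, unsigned long *ips, int count)
  {
          return ftrace_set_filter_ips(ops, ips, count, 1 /* remove */);
  }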

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/ftrace.h |  3 +++
 kernel/trace/ftrace.c  | 56 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1bd3a0356ae4..d71d88d10517 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -463,6 +463,8 @@ struct dyn_ftrace {
 int ftrace_force_update(void);
 int ftrace_set_filter_ip(struct ftrace_ops *ops, unsigned long ip,
 			 int remove, int reset);
+int ftrace_set_filter_ips(struct ftrace_ops *ops, unsigned long *ips,
+			 int count, int remove);
 int ftrace_set_filter(struct ftrace_ops *ops, unsigned char *buf,
 		       int len, int reset);
 int ftrace_set_notrace(struct ftrace_ops *ops, unsigned char *buf,
@@ -738,6 +740,7 @@ static inline unsigned long ftrace_location(unsigned long ip)
 #define ftrace_regex_open(ops, flag, inod, file) ({ -ENODEV; })
 #define ftrace_set_early_filter(ops, buf, enable) do { } while (0)
 #define ftrace_set_filter_ip(ops, ip, remove, reset) ({ -ENODEV; })
+#define ftrace_set_filter_ips(ops, ips, count, remove) ({ -ENODEV; })
 #define ftrace_set_filter(ops, buf, len, reset) ({ -ENODEV; })
 #define ftrace_set_notrace(ops, buf, len, reset) ({ -ENODEV; })
 #define ftrace_free_filter(ops) do { } while (0)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 95ef7e2a6a57..44c2d21b8c19 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -4977,6 +4977,47 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
 	return ret;
 }
 
+static int
+ftrace_set_hash_ips(struct ftrace_ops *ops, unsigned long *ips,
+		    int count, int remove, int enable)
+{
+	struct ftrace_hash **orig_hash;
+	struct ftrace_hash *hash;
+	int ret, i;
+
+	if (unlikely(ftrace_disabled))
+		return -ENODEV;
+
+	mutex_lock(&ops->func_hash->regex_lock);
+
+	if (enable)
+		orig_hash = &ops->func_hash->filter_hash;
+	else
+		orig_hash = &ops->func_hash->notrace_hash;
+
+	hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+	if (!hash) {
+		ret = -ENOMEM;
+		goto out_regex_unlock;
+	}
+
+	for (i = 0; i < count; i++) {
+		ret = ftrace_match_addr(hash, ips[i], remove);
+		if (ret < 0)
+			goto out_regex_unlock;
+	}
+
+	mutex_lock(&ftrace_lock);
+	ret = ftrace_hash_move_and_update_ops(ops, orig_hash, hash, enable);
+	mutex_unlock(&ftrace_lock);
+
+ out_regex_unlock:
+	mutex_unlock(&ops->func_hash->regex_lock);
+
+	free_ftrace_hash(hash);
+	return ret;
+}
+
 static int
 ftrace_set_addr(struct ftrace_ops *ops, unsigned long ip, int remove,
 		int reset, int enable)
@@ -4984,6 +5025,13 @@ ftrace_set_addr(struct ftrace_ops *ops, unsigned long ip, int remove,
 	return ftrace_set_hash(ops, NULL, 0, ip, remove, reset, enable);
 }
 
+static int
+ftrace_set_addrs(struct ftrace_ops *ops, unsigned long *ips,
+		 int count, int remove, int enable)
+{
+	return ftrace_set_hash_ips(ops, ips, count, remove, enable);
+}
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 
 struct ftrace_direct_func {
@@ -5395,6 +5443,14 @@ int ftrace_set_filter_ip(struct ftrace_ops *ops, unsigned long ip,
 }
 EXPORT_SYMBOL_GPL(ftrace_set_filter_ip);
 
+int ftrace_set_filter_ips(struct ftrace_ops *ops, unsigned long *ips,
+			 int count, int remove)
+{
+	ftrace_ops_init(ops);
+	return ftrace_set_addrs(ops, ips, count, remove, 1);
+}
+EXPORT_SYMBOL_GPL(ftrace_set_filter_ips);
+
 /**
  * ftrace_ops_set_global_filter - setup ops to use global filters
  * @ops - the ops which will use the global filters
-- 
2.26.2



* [RFC bpf-next 05/16] ftrace: Add register_ftrace_direct_ips function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (3 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 04/16] ftrace: Add ftrace_set_filter_ips function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 06/16] ftrace: Add unregister_ftrace_direct_ips function Jiri Olsa
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Add a register_ftrace_direct_ips function that allows registering
an array of ip addresses and trampolines for the direct filter.
The interface is:

  int register_ftrace_direct_ips(unsigned long *ips,
                                 unsigned long *addrs,
                                 int count);

It will be used in following patches to register bpf
trampolines in batch mode.
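
For illustration, a hypothetical caller (not part of this patch),
where ips[i] is a function's ftrace location and addrs[i] the
trampoline it should call into:

  #include <linux/ftrace.h>
  #include <linux/printk.h>

  static int direct_attach_batch(unsigned long *ips, unsigned long *addrs, int count)
  {
          int err;

          /* one ftrace filter/hash update covers all ips */
          err = register_ftrace_direct_ips(ips, addrs, count);
          if (err)
                  pr_warn("batch direct attach failed: %d\n", err);
          return err;
  }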

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 75 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index d71d88d10517..9ed52755667a 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -291,6 +291,8 @@ int ftrace_modify_direct_caller(struct ftrace_func_entry *entry,
 				unsigned long old_addr,
 				unsigned long new_addr);
 unsigned long ftrace_find_rec_direct(unsigned long ip);
+int register_ftrace_direct_ips(unsigned long *ips, unsigned long *addrs,
+			       int count);
 #else
 # define ftrace_direct_func_count 0
 static inline int register_ftrace_direct(unsigned long ip, unsigned long addr)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 44c2d21b8c19..770bcd1a245a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5231,6 +5231,81 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(register_ftrace_direct);
 
+int register_ftrace_direct_ips(unsigned long *ips, unsigned long *addrs,
+			       int count)
+{
+	struct ftrace_hash *free_hash = NULL;
+	struct ftrace_direct_func *direct;
+	struct ftrace_func_entry *entry;
+	int i, j;
+	int ret;
+
+	mutex_lock(&direct_mutex);
+
+	/* Check all the ips */
+	for (i = 0; i < count; i++) {
+		ret = check_direct_ip(ips[i]);
+		if (ret)
+			goto out_unlock;
+	}
+
+	ret = -ENOMEM;
+	if (adjust_direct_size(direct_functions->count + count, &free_hash))
+		goto out_unlock;
+
+	for (i = 0; i < count; i++) {
+		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+		if (!entry)
+			goto out_clean;
+
+		direct = get_direct_func(addrs[i]);
+		if (!direct) {
+			kfree(entry);
+			goto out_clean;
+		}
+
+		direct->count++;
+		entry->ip = ips[i];
+		entry->direct = addrs[i];
+		__add_hash_entry(direct_functions, entry);
+	}
+
+	ret = ftrace_set_filter_ips(&direct_ops, ips, count, 0);
+
+	if (!ret && !(direct_ops.flags & FTRACE_OPS_FL_ENABLED)) {
+		ret = register_ftrace_function(&direct_ops);
+		if (ret)
+			ftrace_set_filter_ips(&direct_ops, ips, count, 1);
+	}
+
+ out_clean:
+	if (ret) {
+		for (j = 0; j < i; j++) {
+			direct = get_direct_func(addrs[j]);
+			if (!direct)
+				continue;
+
+			if (!direct->count)
+				put_direct_func(direct);
+
+			entry = ftrace_lookup_ip(direct_functions, ips[j]);
+			if (WARN_ON_ONCE(!entry))
+				continue;
+			free_hash_entry(direct_functions, entry);
+		}
+	}
+ out_unlock:
+	mutex_unlock(&direct_mutex);
+
+	if (free_hash) {
+		synchronize_rcu_tasks();
+		free_ftrace_hash(free_hash);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_ftrace_direct_ips);
+
 static struct ftrace_func_entry *find_direct_entry(unsigned long *ip,
 						   struct dyn_ftrace **recp)
 {
-- 
2.26.2



* [RFC bpf-next 06/16] ftrace: Add unregister_ftrace_direct_ips function
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (4 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 05/16] ftrace: Add register_ftrace_direct_ips function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search Jiri Olsa
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Add an unregister_ftrace_direct_ips function that allows
unregistering an array of ip addresses and trampolines for the
direct filter. The interface is:

  int unregister_ftrace_direct_ips(unsigned long *ips,
                                   unsigned long *addrs,
                                   int count);

It will be used in following patches to unregister bpf
trampolines in batch mode.
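
For illustration, a hypothetical cleanup path (not part of this
patch); the same ips/addrs arrays used for registration are passed
back in:

  #include <linux/ftrace.h>
  #include <linux/printk.h>

  static int direct_detach_batch(unsigned long *ips, unsigned long *addrs, int count)
  {
          int err;

          /* removes all ips from the direct filter in one ftrace update */
          err = unregister_ftrace_direct_ips(ips, addrs, count);
          if (err)
                  pr_warn("batch direct detach failed: %d\n", err);
          return err;
  }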

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9ed52755667a..24525473043e 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -293,6 +293,8 @@ int ftrace_modify_direct_caller(struct ftrace_func_entry *entry,
 unsigned long ftrace_find_rec_direct(unsigned long ip);
 int register_ftrace_direct_ips(unsigned long *ips, unsigned long *addrs,
 			       int count);
+int unregister_ftrace_direct_ips(unsigned long *ips, unsigned long *addrs,
+				 int count);
 #else
 # define ftrace_direct_func_count 0
 static inline int register_ftrace_direct(unsigned long ip, unsigned long addr)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 770bcd1a245a..15a13e6c1f31 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5374,6 +5374,57 @@ int unregister_ftrace_direct(unsigned long ip, unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(unregister_ftrace_direct);
 
+int unregister_ftrace_direct_ips(unsigned long *ips, unsigned long *addrs,
+				 int count)
+{
+	struct ftrace_direct_func *direct;
+	struct ftrace_func_entry *entry;
+	int i, del = 0, ret = -ENODEV;
+
+	mutex_lock(&direct_mutex);
+
+	for (i = 0; i < count; i++) {
+		entry = find_direct_entry(&ips[i], NULL);
+		if (!entry)
+			goto out_unlock;
+		del++;
+	}
+
+	if (direct_functions->count - del == 0)
+		unregister_ftrace_function(&direct_ops);
+
+	ret = ftrace_set_filter_ips(&direct_ops, ips, count, 1);
+
+	WARN_ON(ret);
+
+	for (i = 0; i < count; i++) {
+		entry = __ftrace_lookup_ip(direct_functions, ips[i]);
+		if (WARN_ON(!entry))
+			continue;
+
+		remove_hash_entry(direct_functions, entry);
+
+		direct = ftrace_find_direct_func(addrs[i]);
+		if (!WARN_ON(!direct)) {
+			/* This is the good path (see the ! before WARN) */
+			direct->count--;
+			WARN_ON(direct->count < 0);
+			if (!direct->count) {
+				list_del_rcu(&direct->next);
+				synchronize_rcu_tasks();
+				kfree(direct);
+				kfree(entry);
+				ftrace_direct_func_count--;
+			}
+		}
+	}
+ out_unlock:
+	mutex_unlock(&direct_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(unregister_ftrace_direct_ips);
+
 static struct ftrace_ops stub_ops = {
 	.func		= ftrace_stub,
 };
-- 
2.26.2



* [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (5 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 06/16] ftrace: Add unregister_ftrace_direct_ips function Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-28 18:25   ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put Jiri Olsa
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

The kallsyms_expand_symbol function showed up in several bpf
related profiles, because the name lookup is doing a linear search.

Before:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
   { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     2,535,458,767      cycles:k                         ( +-  0.55% )
       940,046,382      cycles:u                         ( +-  0.27% )

             33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )

Load all the vmlinux symbols into an rb tree and switch to an rb
tree search in the kallsyms_lookup_name function to save a few
cycles and some time.

After:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
   { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     2,199,433,771      cycles:k                         ( +-  0.55% )
       936,105,469      cycles:u                         ( +-  0.37% )

             26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

Each symbol takes 160 bytes, so for my .config I've got about 18 MB
used for 115285 symbols.
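
The 160 bytes come from the new struct symbol; a rough breakdown
(assuming KSYM_NAME_LEN of 128 and 64-bit pointers, so this is an
estimate for my config only):

  /*
   * char           name[KSYM_NAME_LEN];   128 bytes
   * unsigned long  addr;                     8 bytes
   * struct rb_node rb_node;                 24 bytes (3 pointer-sized fields)
   *                                        ---------
   *                                         160 bytes
   *
   * 160 bytes * 115285 symbols ~= 17.6 MB, i.e. the ~18 MB above.
   */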

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/kallsyms.c | 95 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 86 insertions(+), 9 deletions(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 4fb15fa96734..107c8284170e 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -50,6 +50,36 @@ extern const u16 kallsyms_token_index[] __weak;
 
 extern const unsigned int kallsyms_markers[] __weak;
 
+static struct kmem_cache *symbol_cachep;
+
+struct symbol {
+	char		name[KSYM_NAME_LEN];
+	unsigned long	addr;
+	struct rb_node	rb_node;
+};
+
+static struct rb_root symbols_root = RB_ROOT;
+
+static struct symbol *find_symbol(const char *name)
+{
+	struct symbol *sym;
+	struct rb_node *n;
+	int err;
+
+	n = symbols_root.rb_node;
+	while (n) {
+		sym = rb_entry(n, struct symbol, rb_node);
+		err = strcmp(name, sym->name);
+		if (err < 0)
+			n = n->rb_left;
+		else if (err > 0)
+			n = n->rb_right;
+		else
+			return sym;
+	}
+	return NULL;
+}
+
 /*
  * Expand a compressed symbol data into the resulting uncompressed string,
  * if uncompressed string is too long (>= maxlen), it will be truncated,
@@ -164,16 +194,12 @@ static unsigned long kallsyms_sym_address(int idx)
 /* Lookup the address for this symbol. Returns 0 if not found. */
 unsigned long kallsyms_lookup_name(const char *name)
 {
-	char namebuf[KSYM_NAME_LEN];
-	unsigned long i;
-	unsigned int off;
+	struct symbol *sym;
 
-	for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
-		off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
+	sym = find_symbol(name);
+	if (sym)
+		return sym->addr;
 
-		if (strcmp(namebuf, name) == 0)
-			return kallsyms_sym_address(i);
-	}
 	return module_kallsyms_lookup_name(name);
 }
 
@@ -743,9 +769,60 @@ static const struct proc_ops kallsyms_proc_ops = {
 	.proc_release	= seq_release_private,
 };
 
+static bool __init add_symbol(struct symbol *new)
+{
+	struct rb_node *parent = NULL;
+	struct rb_node **p;
+	struct symbol *sym;
+	int err;
+
+	p = &symbols_root.rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		sym = rb_entry(parent, struct symbol, rb_node);
+		err = strcmp(new->name, sym->name);
+		if (err < 0)
+			p = &(*p)->rb_left;
+		else if (err > 0)
+			p = &(*p)->rb_right;
+		else
+			return false;
+	}
+
+	rb_link_node(&new->rb_node, parent, p);
+	rb_insert_color(&new->rb_node, &symbols_root);
+	return true;
+}
+
+static int __init kallsyms_name_search_init(void)
+{
+	bool sym_added = true;
+	struct symbol *sym;
+	unsigned int off;
+	unsigned long i;
+
+	symbol_cachep = KMEM_CACHE(symbol, SLAB_PANIC|SLAB_ACCOUNT);
+
+	for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
+		if (sym_added) {
+			sym = kmem_cache_alloc(symbol_cachep, GFP_KERNEL);
+			if (!sym)
+				return -ENOMEM;
+		}
+		off = kallsyms_expand_symbol(off, sym->name, ARRAY_SIZE(sym->name));
+		sym->addr = kallsyms_sym_address(i);
+		sym_added = add_symbol(sym);
+	}
+
+	if (!sym_added)
+		kmem_cache_free(symbol_cachep, sym);
+	return 0;
+}
+
 static int __init kallsyms_init(void)
 {
 	proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
-	return 0;
+	return kallsyms_name_search_init();
 }
 device_initcall(kallsyms_init);
-- 
2.26.2



* [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (6 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-23 19:46   ` Andrii Nakryiko
  2020-10-22  8:21 ` [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support Jiri Olsa
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Move the bpf_link_free call into delayed processing so we don't
need to wait for it when releasing the link.

For example, bpf_tracing_link_release could take a considerable
amount of time in the bpf_trampoline_put function due to the
synchronize_rcu_tasks call.

It speeds up the bpftrace release time in the following example:

Before:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     3,290,457,628      cycles:k                                 ( +-  0.27% )
       933,581,973      cycles:u                                 ( +-  0.20% )

             50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )

After:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     2,535,458,767      cycles:k                                 ( +-  0.55% )
       940,046,382      cycles:u                                 ( +-  0.27% )

             33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/bpf/syscall.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1110ecd7d1f3..61ef29f9177d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2346,12 +2346,8 @@ void bpf_link_put(struct bpf_link *link)
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
 }
 
 static int bpf_link_release(struct inode *inode, struct file *filp)
-- 
2.26.2



* [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (7 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-23 20:03   ` Andrii Nakryiko
  2020-10-22  8:21 ` [RFC bpf-next 10/16] bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support Jiri Olsa
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Add BPF_TRAMPOLINE_BATCH_ATTACH support, which allows attaching
multiple tracing fentry/fexit programs to trampolines within one
syscall.

Currently each tracing program is attached in a separate bpf
syscall and, more importantly, by a separate register_ftrace_direct
call, which registers the trampoline in the ftrace subsystem. We
can save some cycles by simply using its batch variant
register_ftrace_direct_ips.

Before:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     2,199,433,771      cycles:k               ( +-  0.55% )
       936,105,469      cycles:u               ( +-  0.37% )

             26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

After:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     1,456,854,867      cycles:k               ( +-  0.57% )
       937,737,431      cycles:u               ( +-  0.13% )

             12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )

The new BPF_TRAMPOLINE_BATCH_ATTACH syscall command expects the
following data in union bpf_attr:

  struct {
          __aligned_u64   in;
          __aligned_u64   out;
          __u32           count;
  } trampoline_batch;

  in    - pointer to a user space array with file descriptors of loaded
          bpf programs to attach
  out   - pointer to a user space array for the resulting link file
          descriptors
  count - number of 'in/out' file descriptors

Basically the new code gets the programs from the 'in' file
descriptors and attaches them the same way the current code does,
apart from the last step that registers the probe ip with the
trampoline. This is done at the end with the new
register_ftrace_direct_ips function.

The resulting link descriptors are written to the 'out' array and
match the order of the 'in' array file descriptors.
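
For illustration, a rough user space sketch of the intended call
(hypothetical helper, error handling omitted, assumes a uapi header
with the new command and trampoline_batch struct):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* prog_fds: fds of loaded BPF_PROG_TYPE_TRACING programs
   * link_fds: filled with the resulting link fds on success
   */
  static int trampoline_batch_attach(__u32 *prog_fds, __u32 *link_fds, __u32 count)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.trampoline_batch.in    = (__u64)(unsigned long)prog_fds;
          attr.trampoline_batch.out   = (__u64)(unsigned long)link_fds;
          attr.trampoline_batch.count = count;

          return syscall(__NR_bpf, BPF_TRAMPOLINE_BATCH_ATTACH, &attr, sizeof(attr));
  }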

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/bpf.h      | 15 ++++++-
 include/uapi/linux/bpf.h |  7 ++++
 kernel/bpf/syscall.c     | 88 ++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/trampoline.c  | 69 +++++++++++++++++++++++++++----
 4 files changed, 164 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2b16bf48aab6..d28c7ac3af3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -583,6 +583,13 @@ enum bpf_tramp_prog_type {
 	BPF_TRAMP_REPLACE, /* more than MAX */
 };
 
+struct bpf_trampoline_batch {
+	int count;
+	int idx;
+	unsigned long *ips;
+	unsigned long *addrs;
+};
+
 struct bpf_trampoline {
 	/* hlist for trampoline_table */
 	struct hlist_node hlist;
@@ -644,11 +651,14 @@ static __always_inline unsigned int bpf_dispatcher_nop_func(
 	return bpf_func(ctx, insnsi);
 }
 #ifdef CONFIG_BPF_JIT
-int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
+int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			     struct bpf_trampoline_batch *batch);
 int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
 struct bpf_trampoline *bpf_trampoline_get(u64 key,
 					  struct bpf_attach_target_info *tgt_info);
 void bpf_trampoline_put(struct bpf_trampoline *tr);
+struct bpf_trampoline_batch *bpf_trampoline_batch_alloc(int count);
+void bpf_trampoline_batch_free(struct bpf_trampoline_batch *batch);
 #define BPF_DISPATCHER_INIT(_name) {				\
 	.mutex = __MUTEX_INITIALIZER(_name.mutex),		\
 	.func = &_name##_func,					\
@@ -693,7 +703,8 @@ void bpf_ksym_add(struct bpf_ksym *ksym);
 void bpf_ksym_del(struct bpf_ksym *ksym);
 #else
 static inline int bpf_trampoline_link_prog(struct bpf_prog *prog,
-					   struct bpf_trampoline *tr)
+					   struct bpf_trampoline *tr,
+					   struct bpf_trampoline_batch *batch)
 {
 	return -ENOTSUPP;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bf5a99d803e4..04df4d576fd4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -125,6 +125,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TRAMPOLINE_BATCH_ATTACH,
 };
 
 enum bpf_map_type {
@@ -631,6 +632,12 @@ union bpf_attr {
 		__u32 prog_fd;
 	} raw_tracepoint;
 
+	struct { /* anonymous struct used by BPF_TRAMPOLINE_BATCH_ATTACH */
+		__aligned_u64	in;
+		__aligned_u64	out;
+		__u32		count;
+	} trampoline_batch;
+
 	struct { /* anonymous struct for BPF_BTF_LOAD */
 		__aligned_u64	btf;
 		__aligned_u64	btf_log_buf;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 61ef29f9177d..e370b37e3e8e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2553,7 +2553,8 @@ static const struct bpf_link_ops bpf_tracing_link_lops = {
 
 static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 				   int tgt_prog_fd,
-				   u32 btf_id)
+				   u32 btf_id,
+				   struct bpf_trampoline_batch *batch)
 {
 	struct bpf_link_primer link_primer;
 	struct bpf_prog *tgt_prog = NULL;
@@ -2678,7 +2679,7 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 	if (err)
 		goto out_unlock;
 
-	err = bpf_trampoline_link_prog(prog, tr);
+	err = bpf_trampoline_link_prog(prog, tr, batch);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
 		link = NULL;
@@ -2826,7 +2827,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 			tp_name = prog->aux->attach_func_name;
 			break;
 		}
-		return bpf_tracing_prog_attach(prog, 0, 0);
+		return bpf_tracing_prog_attach(prog, 0, 0, NULL);
 	case BPF_PROG_TYPE_RAW_TRACEPOINT:
 	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
 		if (strncpy_from_user(buf,
@@ -2879,6 +2880,81 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	return err;
 }
 
+#define BPF_RAW_TRACEPOINT_OPEN_BATCH_LAST_FIELD trampoline_batch.count
+
+static int bpf_trampoline_batch(const union bpf_attr *attr, int cmd)
+{
+	void __user *uout = u64_to_user_ptr(attr->trampoline_batch.out);
+	void __user *uin = u64_to_user_ptr(attr->trampoline_batch.in);
+	struct bpf_trampoline_batch *batch = NULL;
+	struct bpf_prog *prog;
+	int count, ret, i, fd;
+	u32 *in, *out;
+
+	if (CHECK_ATTR(BPF_RAW_TRACEPOINT_OPEN_BATCH))
+		return -EINVAL;
+
+	if (!uin || !uout)
+		return -EINVAL;
+
+	count = attr->trampoline_batch.count;
+
+	in = kcalloc(count, sizeof(u32), GFP_KERNEL);
+	out = kcalloc(count, sizeof(u32), GFP_KERNEL);
+	if (!in || !out) {
+		kfree(in);
+		kfree(out);
+		return -ENOMEM;
+	}
+
+	ret = copy_from_user(in, uin, count * sizeof(u32));
+	if (ret)
+		goto out_clean;
+
+	/* test read out array */
+	ret = copy_to_user(uout, out, count * sizeof(u32));
+	if (ret)
+		goto out_clean;
+
+	batch = bpf_trampoline_batch_alloc(count);
+	if (!batch)
+		goto out_clean;
+
+	for (i = 0; i < count; i++) {
+		if (cmd == BPF_TRAMPOLINE_BATCH_ATTACH) {
+			prog = bpf_prog_get(in[i]);
+			if (IS_ERR(prog)) {
+				ret = PTR_ERR(prog);
+				goto out_clean;
+			}
+
+			ret = -EINVAL;
+			if (prog->type != BPF_PROG_TYPE_TRACING)
+				goto out_clean;
+			if (prog->type == BPF_PROG_TYPE_TRACING &&
+			    prog->expected_attach_type == BPF_TRACE_RAW_TP)
+				goto out_clean;
+
+			fd = bpf_tracing_prog_attach(prog, 0, 0, batch);
+			if (fd < 0)
+				goto out_clean;
+
+			out[i] = fd;
+		}
+	}
+
+	ret = register_ftrace_direct_ips(batch->ips, batch->addrs, batch->idx);
+	if (!ret)
+		WARN_ON_ONCE(copy_to_user(uout, out, count * sizeof(u32)));
+
+out_clean:
+	/* XXX cleanup partially attached array */
+	bpf_trampoline_batch_free(batch);
+	kfree(in);
+	kfree(out);
+	return ret;
+}
+
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 					     enum bpf_attach_type attach_type)
 {
@@ -4018,7 +4094,8 @@ static int tracing_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *
 	else if (prog->type == BPF_PROG_TYPE_EXT)
 		return bpf_tracing_prog_attach(prog,
 					       attr->link_create.target_fd,
-					       attr->link_create.target_btf_id);
+					       attr->link_create.target_btf_id,
+					       NULL);
 	return -EINVAL;
 }
 
@@ -4437,6 +4514,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_RAW_TRACEPOINT_OPEN:
 		err = bpf_raw_tracepoint_open(&attr);
 		break;
+	case BPF_TRAMPOLINE_BATCH_ATTACH:
+		err = bpf_trampoline_batch(&attr, cmd);
+		break;
 	case BPF_BTF_LOAD:
 		err = bpf_btf_load(&attr);
 		break;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 35c5887d82ff..3383644eccc8 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -107,6 +107,51 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 	return tr;
 }
 
+static int bpf_trampoline_batch_add(struct bpf_trampoline_batch *batch,
+				    unsigned long ip, unsigned long addr)
+{
+	int idx = batch->idx;
+
+	if (idx >= batch->count)
+		return -EINVAL;
+
+	batch->ips[idx] = ip;
+	batch->addrs[idx] = addr;
+	batch->idx++;
+	return 0;
+}
+
+struct bpf_trampoline_batch *bpf_trampoline_batch_alloc(int count)
+{
+	struct bpf_trampoline_batch *batch;
+
+	batch = kmalloc(sizeof(*batch), GFP_KERNEL);
+	if (!batch)
+		return NULL;
+
+	batch->ips = kcalloc(count, sizeof(batch->ips[0]), GFP_KERNEL);
+	batch->addrs = kcalloc(count, sizeof(batch->addrs[0]), GFP_KERNEL);
+	if (!batch->ips || !batch->addrs) {
+		kfree(batch->ips);
+		kfree(batch->addrs);
+		kfree(batch);
+		return NULL;
+	}
+
+	batch->count = count;
+	batch->idx = 0;
+	return batch;
+}
+
+void bpf_trampoline_batch_free(struct bpf_trampoline_batch *batch)
+{
+	if (!batch)
+		return;
+	kfree(batch->ips);
+	kfree(batch->addrs);
+	kfree(batch);
+}
+
 static int is_ftrace_location(void *ip)
 {
 	long addr;
@@ -144,7 +189,8 @@ static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_ad
 }
 
 /* first time registering */
-static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
+static int register_fentry(struct bpf_trampoline *tr, void *new_addr,
+			   struct bpf_trampoline_batch *batch)
 {
 	void *ip = tr->func.addr;
 	int ret;
@@ -154,9 +200,12 @@ static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
 		return ret;
 	tr->func.ftrace_managed = ret;
 
-	if (tr->func.ftrace_managed)
-		ret = register_ftrace_direct((long)ip, (long)new_addr);
-	else
+	if (tr->func.ftrace_managed) {
+		if (batch)
+			ret = bpf_trampoline_batch_add(batch, (long)ip, (long)new_addr);
+		else
+			ret = register_ftrace_direct((long)ip, (long)new_addr);
+	} else
 		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
 	return ret;
 }
@@ -185,7 +234,8 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total)
 	return tprogs;
 }
 
-static int bpf_trampoline_update(struct bpf_trampoline *tr)
+static int bpf_trampoline_update(struct bpf_trampoline *tr,
+				 struct bpf_trampoline_batch *batch)
 {
 	void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
 	void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
@@ -230,7 +280,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
 		err = modify_fentry(tr, old_image, new_image);
 	else
 		/* first time registering */
-		err = register_fentry(tr, new_image);
+		err = register_fentry(tr, new_image, batch);
 	if (err)
 		goto out;
 	tr->selector++;
@@ -261,7 +311,8 @@ static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(struct bpf_prog *prog)
 	}
 }
 
-int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
+int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			     struct bpf_trampoline_batch *batch)
 {
 	enum bpf_tramp_prog_type kind;
 	int err = 0;
@@ -299,7 +350,7 @@ int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
 	}
 	hlist_add_head(&prog->aux->tramp_hlist, &tr->progs_hlist[kind]);
 	tr->progs_cnt[kind]++;
-	err = bpf_trampoline_update(tr);
+	err = bpf_trampoline_update(tr, batch);
 	if (err) {
 		hlist_del(&prog->aux->tramp_hlist);
 		tr->progs_cnt[kind]--;
@@ -326,7 +377,7 @@ int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
 	}
 	hlist_del(&prog->aux->tramp_hlist);
 	tr->progs_cnt[kind]--;
-	err = bpf_trampoline_update(tr);
+	err = bpf_trampoline_update(tr, NULL);
 out:
 	mutex_unlock(&tr->mutex);
 	return err;
-- 
2.26.2



* [RFC bpf-next 10/16] bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (8 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 11/16] bpf: Sync uapi bpf.h to tools Jiri Olsa
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Add BPF_TRAMPOLINE_BATCH_DETACH support, which allows detaching
multiple tracing fentry/fexit programs from trampolines within one
syscall.

The new BPF_TRAMPOLINE_BATCH_DETACH syscall command expects the
following data in union bpf_attr:

  struct {
          __aligned_u64   in;
          __aligned_u64   out;
          __u32           count;
  } trampoline_batch;

  in    - pointer to a user space array with link descriptors of
          attached bpf programs to detach
  out   - pointer to a user space array for the resulting error codes
  count - number of 'in/out' file descriptors

Basically the new code gets the programs from the 'in' link
descriptors and detaches them the same way the current code does,
apart from the last step that unregisters the probe ip from the
trampoline. This is done at the end with the new
unregister_ftrace_direct_ips function.

The resulting error codes are written to the 'out' array and match
the order of the 'in' array link descriptors.
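
For illustration, a rough user space sketch (hypothetical helper,
mirrors the attach case, assumes a uapi header with the new command):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* link_fds: fds of the tracing links to detach
   * results:  per-link results written back by the kernel
   */
  static int trampoline_batch_detach(__u32 *link_fds, __u32 *results, __u32 count)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.trampoline_batch.in    = (__u64)(unsigned long)link_fds;
          attr.trampoline_batch.out   = (__u64)(unsigned long)results;
          attr.trampoline_batch.count = count;

          return syscall(__NR_bpf, BPF_TRAMPOLINE_BATCH_DETACH, &attr, sizeof(attr));
  }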

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/bpf.h      |  3 ++-
 include/uapi/linux/bpf.h |  3 ++-
 kernel/bpf/syscall.c     | 28 ++++++++++++++++++++++++++--
 kernel/bpf/trampoline.c  | 25 ++++++++++++++++---------
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d28c7ac3af3f..828a4e88224f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -653,7 +653,8 @@ static __always_inline unsigned int bpf_dispatcher_nop_func(
 #ifdef CONFIG_BPF_JIT
 int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
 			     struct bpf_trampoline_batch *batch);
-int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
+int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			       struct bpf_trampoline_batch *batch);
 struct bpf_trampoline *bpf_trampoline_get(u64 key,
 					  struct bpf_attach_target_info *tgt_info);
 void bpf_trampoline_put(struct bpf_trampoline *tr);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 04df4d576fd4..b6a08aa49aa4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -126,6 +126,7 @@ enum bpf_cmd {
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
 	BPF_TRAMPOLINE_BATCH_ATTACH,
+	BPF_TRAMPOLINE_BATCH_DETACH,
 };
 
 enum bpf_map_type {
@@ -632,7 +633,7 @@ union bpf_attr {
 		__u32 prog_fd;
 	} raw_tracepoint;
 
-	struct { /* anonymous struct used by BPF_TRAMPOLINE_BATCH_ATTACH */
+	struct { /* anonymous struct used by BPF_TRAMPOLINE_BATCH_* */
 		__aligned_u64	in;
 		__aligned_u64	out;
 		__u32		count;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e370b37e3e8e..19fb608546c0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2505,7 +2505,7 @@ static void bpf_tracing_link_release(struct bpf_link *link)
 		container_of(link, struct bpf_tracing_link, link);
 
 	WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog,
-						tr_link->trampoline));
+						tr_link->trampoline, NULL));
 
 	bpf_trampoline_put(tr_link->trampoline);
 
@@ -2940,10 +2940,33 @@ static int bpf_trampoline_batch(const union bpf_attr *attr, int cmd)
 				goto out_clean;
 
 			out[i] = fd;
+		} else {
+			struct bpf_tracing_link *tr_link;
+			struct bpf_link *link;
+
+			link = bpf_link_get_from_fd(in[i]);
+			if (IS_ERR(link)) {
+				ret = PTR_ERR(link);
+				goto out_clean;
+			}
+
+			if (link->type != BPF_LINK_TYPE_TRACING) {
+				ret = -EINVAL;
+				bpf_link_put(link);
+				goto out_clean;
+			}
+
+			tr_link = container_of(link, struct bpf_tracing_link, link);
+			bpf_trampoline_unlink_prog(link->prog, tr_link->trampoline, batch);
+			bpf_link_put(link);
 		}
 	}
 
-	ret = register_ftrace_direct_ips(batch->ips, batch->addrs, batch->idx);
+	if (cmd == BPF_TRAMPOLINE_BATCH_ATTACH)
+		ret = register_ftrace_direct_ips(batch->ips, batch->addrs, batch->idx);
+	else
+		ret = unregister_ftrace_direct_ips(batch->ips, batch->addrs, batch->idx);
+
 	if (!ret)
 		WARN_ON_ONCE(copy_to_user(uout, out, count * sizeof(u32)));
 
@@ -4515,6 +4538,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = bpf_raw_tracepoint_open(&attr);
 		break;
 	case BPF_TRAMPOLINE_BATCH_ATTACH:
+	case BPF_TRAMPOLINE_BATCH_DETACH:
 		err = bpf_trampoline_batch(&attr, cmd);
 		break;
 	case BPF_BTF_LOAD:
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 3383644eccc8..cdad87461e5d 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -164,14 +164,18 @@ static int is_ftrace_location(void *ip)
 	return 1;
 }
 
-static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
+static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr,
+			     struct bpf_trampoline_batch *batch)
 {
 	void *ip = tr->func.addr;
 	int ret;
 
-	if (tr->func.ftrace_managed)
-		ret = unregister_ftrace_direct((long)ip, (long)old_addr);
-	else
+	if (tr->func.ftrace_managed) {
+		if (batch)
+			ret = bpf_trampoline_batch_add(batch, (long)ip, (long)old_addr);
+		else
+			ret = unregister_ftrace_direct((long)ip, (long)old_addr);
+	} else
 		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, NULL);
 	return ret;
 }
@@ -248,7 +252,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr,
 		return PTR_ERR(tprogs);
 
 	if (total == 0) {
-		err = unregister_fentry(tr, old_image);
+		err = unregister_fentry(tr, old_image, batch);
 		tr->selector = 0;
 		goto out;
 	}
@@ -361,13 +365,16 @@ int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
 }
 
 /* bpf_trampoline_unlink_prog() should never fail. */
-int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
+int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			       struct bpf_trampoline_batch *batch)
 {
 	enum bpf_tramp_prog_type kind;
-	int err;
+	int err = 0;
 
 	kind = bpf_attach_type_to_tramp(prog);
 	mutex_lock(&tr->mutex);
+	if (hlist_unhashed(&prog->aux->tramp_hlist))
+		goto out;
 	if (kind == BPF_TRAMP_REPLACE) {
 		WARN_ON_ONCE(!tr->extension_prog);
 		err = bpf_arch_text_poke(tr->func.addr, BPF_MOD_JUMP,
@@ -375,9 +382,9 @@ int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
 		tr->extension_prog = NULL;
 		goto out;
 	}
-	hlist_del(&prog->aux->tramp_hlist);
+	hlist_del_init(&prog->aux->tramp_hlist);
 	tr->progs_cnt[kind]--;
-	err = bpf_trampoline_update(tr, NULL);
+	err = bpf_trampoline_update(tr, batch);
 out:
 	mutex_unlock(&tr->mutex);
 	return err;
-- 
2.26.2



* [RFC bpf-next 11/16] bpf: Sync uapi bpf.h to tools
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (9 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 10/16] bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 12/16] bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED) Jiri Olsa
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Sync uapi bpf.h with trampoline batch attach changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bf5a99d803e4..b6a08aa49aa4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -125,6 +125,8 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TRAMPOLINE_BATCH_ATTACH,
+	BPF_TRAMPOLINE_BATCH_DETACH,
 };
 
 enum bpf_map_type {
@@ -631,6 +633,12 @@ union bpf_attr {
 		__u32 prog_fd;
 	} raw_tracepoint;
 
+	struct { /* anonymous struct used by BPF_TRAMPOLINE_BATCH_* */
+		__aligned_u64	in;
+		__aligned_u64	out;
+		__u32		count;
+	} trampoline_batch;
+
 	struct { /* anonymous struct for BPF_BTF_LOAD */
 		__aligned_u64	btf;
 		__aligned_u64	btf_log_buf;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC bpf-next 12/16] bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED)
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (10 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 11/16] bpf: Sync uapi bpf.h to tools Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support Jiri Olsa
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

I noticed some of the profiled workloads did not spend more cycles,
but took more time to finish than the current code. I tracked it down
to the rcu synchronize_rcu_mult call in bpf_trampoline_update; when I
called it just once for batch mode, it got faster.

The current processing when attaching the program is:

  for each program:
    bpf(BPF_RAW_TRACEPOINT_OPEN
      bpf_tracing_prog_attach
        bpf_trampoline_link_prog
          bpf_trampoline_update
            synchronize_rcu_mult
            register_ftrace_direct

With the change the synchronize_rcu_mult is called just once:

  bpf(BPF_TRAMPOLINE_BATCH_ATTACH
    for each program:
      bpf_tracing_prog_attach
        bpf_trampoline_link_prog
          bpf_trampoline_update

    synchronize_rcu_mult
    register_ftrace_direct_ips

I'm not sure this does not break stuff, because I don't follow the rcu
code that much ;-) However, the stats are nicer now:

Before:

 Performance counter stats for './test_progs -t attach_test' (5 runs):

        37,410,887      cycles:k             ( +-  0.98% )
        70,062,158      cycles:u             ( +-  0.39% )

             26.80 +- 4.10 seconds time elapsed  ( +- 15.31% )

After:

 Performance counter stats for './test_progs -t attach_test' (5 runs):

        36,812,432      cycles:k             ( +-  2.52% )
        69,907,191      cycles:u             ( +-  0.38% )

             15.04 +- 2.94 seconds time elapsed  ( +- 19.54% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/bpf/syscall.c    | 3 +++
 kernel/bpf/trampoline.c | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19fb608546c0..b315803c34d3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,7 @@
 #include <linux/poll.h>
 #include <linux/bpf-netns.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/rcupdate_wait.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -2920,6 +2921,8 @@ static int bpf_trampoline_batch(const union bpf_attr *attr, int cmd)
 	if (!batch)
 		goto out_clean;
 
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+
 	for (i = 0; i < count; i++) {
 		if (cmd == BPF_TRAMPOLINE_BATCH_ATTACH) {
 			prog = bpf_prog_get(in[i]);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index cdad87461e5d..0d5e4c5860a9 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -271,7 +271,8 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr,
 	 * programs finish executing.
 	 * Wait for these two grace periods together.
 	 */
-	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+	if (!batch)
+		synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
 
 	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
 					  &tr->func.model, flags, tprogs,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (11 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 12/16] bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED) Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-23 20:09   ` Andrii Nakryiko
  2020-10-22  8:21 ` [RFC bpf-next 14/16] libbpf: Add trampoline batch detach support Jiri Olsa
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Adding trampoline batch attach support so it's possible to use
batch mode to attach tracing programs.

Adding a trampoline_attach_batch bool to struct bpf_object_open_opts.
When set to true, bpf_object__attach_skeleton will try to attach
all tracing programs via batch mode.
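
For illustration, a minimal usage sketch; "test" is a made-up skeleton
name, the rest is the existing libbpf skeleton API plus the new option:

  DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts);
  struct test *skel;
  int err;

  opts.trampoline_attach_batch = true;

  skel = test__open_opts(&opts);
  if (!skel)
          return -1;

  err = test__load(skel);
  if (!err)
          /* fentry/fexit programs get attached with one batch syscall */
          err = test__attach(skel);

  ...
  test__destroy(skel);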

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/bpf/bpf.c      | 12 +++++++
 tools/lib/bpf/bpf.h      |  1 +
 tools/lib/bpf/libbpf.c   | 76 +++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h   |  5 ++-
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index d27e34133973..21fffff5e237 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -858,6 +858,18 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
 	return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
 }
 
+int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.trampoline_batch.in = ptr_to_u64(ifds);
+	attr.trampoline_batch.out = ptr_to_u64(ofds);
+	attr.trampoline_batch.count = count;
+
+	return sys_bpf(BPF_TRAMPOLINE_BATCH_ATTACH, &attr, sizeof(attr));
+}
+
 int bpf_load_btf(const void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
 		 bool do_log)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 875dde20d56e..ba3b0b6e3cb0 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -235,6 +235,7 @@ LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
 			      __u32 query_flags, __u32 *attach_flags,
 			      __u32 *prog_ids, __u32 *prog_cnt);
 LIBBPF_API int bpf_raw_tracepoint_open(const char *name, int prog_fd);
+LIBBPF_API int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count);
 LIBBPF_API int bpf_load_btf(const void *btf, __u32 btf_size, char *log_buf,
 			    __u32 log_buf_size, bool do_log);
 LIBBPF_API int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 313034117070..584da3b401ac 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -421,6 +421,7 @@ struct bpf_object {
 
 	bool loaded;
 	bool has_subcalls;
+	bool trampoline_attach_batch;
 
 	/*
 	 * Information when doing elf related work. Only valid if fd
@@ -6907,6 +6908,9 @@ __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 			return ERR_PTR(-ENOMEM);
 	}
 
+	obj->trampoline_attach_batch = OPTS_GET(opts, trampoline_attach_batch,
+						 false);
+
 	err = bpf_object__elf_init(obj);
 	err = err ? : bpf_object__check_endianness(obj);
 	err = err ? : bpf_object__elf_collect(obj);
@@ -10811,9 +10815,75 @@ int bpf_object__load_skeleton(struct bpf_object_skeleton *s)
 	return 0;
 }
 
+static bool is_trampoline(const struct bpf_program *prog)
+{
+	return prog->type == BPF_PROG_TYPE_TRACING &&
+	      (prog->expected_attach_type == BPF_TRACE_FENTRY ||
+	       prog->expected_attach_type == BPF_TRACE_FEXIT);
+}
+
+static int attach_trace_batch(struct bpf_object_skeleton *s)
+{
+	int i, prog_fd, ret = -ENOMEM;
+	int *in_fds, *out_fds, cnt;
+
+	in_fds = calloc(s->prog_cnt, sizeof(in_fds[0]));
+	out_fds = calloc(s->prog_cnt, sizeof(out_fds[0]));
+	if (!in_fds || !out_fds)
+		goto out_clean;
+
+	ret = -EINVAL;
+	for (cnt = 0, i = 0; i < s->prog_cnt; i++) {
+		struct bpf_program *prog = *s->progs[i].prog;
+
+		if (!is_trampoline(prog))
+			continue;
+
+		prog_fd = bpf_program__fd(prog);
+		if (prog_fd < 0) {
+			pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+			goto out_clean;
+		}
+		in_fds[cnt++] = prog_fd;
+	}
+
+	ret = bpf_trampoline_batch_attach(in_fds, out_fds, cnt);
+	if (ret)
+		goto out_clean;
+
+	for (cnt = 0, i = 0; i < s->prog_cnt; i++) {
+		struct bpf_program *prog = *s->progs[i].prog;
+		struct bpf_link **linkp = s->progs[i].link;
+		struct bpf_link *link;
+
+		if (!is_trampoline(prog))
+			continue;
+
+		link = calloc(1, sizeof(*link));
+		if (!link)
+			goto out_clean;
+
+		link->detach = &bpf_link__detach_fd;
+		link->fd = out_fds[cnt++];
+		*linkp = link;
+	}
+
+out_clean:
+	free(in_fds);
+	free(out_fds);
+	return ret;
+}
+
 int bpf_object__attach_skeleton(struct bpf_object_skeleton *s)
 {
-	int i;
+	struct bpf_object *obj = *s->obj;
+	int i, err;
+
+	if (obj->trampoline_attach_batch) {
+		err = attach_trace_batch(s);
+		if (err)
+			return err;
+	}
 
 	for (i = 0; i < s->prog_cnt; i++) {
 		struct bpf_program *prog = *s->progs[i].prog;
@@ -10823,6 +10893,10 @@ int bpf_object__attach_skeleton(struct bpf_object_skeleton *s)
 		if (!prog->load)
 			continue;
 
+		/* Program was attached via batch mode. */
+		if (obj->trampoline_attach_batch && is_trampoline(prog))
+			continue;
+
 		sec_def = find_sec_def(prog->sec_name);
 		if (!sec_def || !sec_def->attach_fn)
 			continue;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 6909ee81113a..66f8e78aa9f8 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -93,8 +93,11 @@ struct bpf_object_open_opts {
 	 * system Kconfig for CONFIG_xxx externs.
 	 */
 	const char *kconfig;
+	/* Attach trampolines via batch mode.
+	 */
+	bool trampoline_attach_batch;
 };
-#define bpf_object_open_opts__last_field kconfig
+#define bpf_object_open_opts__last_field trampoline_attach_batch
 
 LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
 LIBBPF_API struct bpf_object *
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 4ebfadf45b47..5a5ce921956d 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -336,4 +336,5 @@ LIBBPF_0.2.0 {
 		perf_buffer__epoll_fd;
 		perf_buffer__consume_buffer;
 		xsk_socket__create_shared;
+		bpf_trampoline_batch_attach;
 } LIBBPF_0.1.0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC bpf-next 14/16] libbpf: Add trampoline batch detach support
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (12 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 15/16] selftests/bpf: Add trampoline batch test Jiri Olsa
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Adding trampoline batch detach support so it's possible to use
batch mode to detach tracing programs.

Reusing the trampoline_attach_batch bool added to struct
bpf_object_open_opts in the previous patch. When set to true,
bpf_object__detach_skeleton will try to detach all tracing programs
via batch mode.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/bpf/bpf.c      | 16 +++++++++++--
 tools/lib/bpf/bpf.h      |  1 +
 tools/lib/bpf/libbpf.c   | 50 ++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 4 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 21fffff5e237..9af13e511851 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -858,7 +858,7 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
 	return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
 }
 
-int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count)
+static int bpf_trampoline_batch(int cmd, int *ifds, int *ofds, int count)
 {
 	union bpf_attr attr;
 
@@ -867,7 +867,19 @@ int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count)
 	attr.trampoline_batch.out = ptr_to_u64(ofds);
 	attr.trampoline_batch.count = count;
 
-	return sys_bpf(BPF_TRAMPOLINE_BATCH_ATTACH, &attr, sizeof(attr));
+	return sys_bpf(cmd, &attr, sizeof(attr));
+}
+
+int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count)
+{
+	return bpf_trampoline_batch(BPF_TRAMPOLINE_BATCH_ATTACH,
+				    ifds, ofds, count);
+}
+
+int bpf_trampoline_batch_detach(int *ifds, int *ofds, int count)
+{
+	return bpf_trampoline_batch(BPF_TRAMPOLINE_BATCH_DETACH,
+				    ifds, ofds, count);
 }
 
 int bpf_load_btf(const void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ba3b0b6e3cb0..c6fb5977de79 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -236,6 +236,7 @@ LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
 			      __u32 *prog_ids, __u32 *prog_cnt);
 LIBBPF_API int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 LIBBPF_API int bpf_trampoline_batch_attach(int *ifds, int *ofds, int count);
+LIBBPF_API int bpf_trampoline_batch_detach(int *ifds, int *ofds, int count);
 LIBBPF_API int bpf_load_btf(const void *btf, __u32 btf_size, char *log_buf,
 			    __u32 log_buf_size, bool do_log);
 LIBBPF_API int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 584da3b401ac..02e9e8279aa7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -10874,6 +10874,47 @@ static int attach_trace_batch(struct bpf_object_skeleton *s)
 	return ret;
 }
 
+static int detach_trace_batch(struct bpf_object_skeleton *s)
+{
+	int *in_fds, *out_fds, cnt;
+	int i, ret = -ENOMEM;
+
+	in_fds = calloc(s->prog_cnt, sizeof(in_fds[0]));
+	out_fds = calloc(s->prog_cnt, sizeof(out_fds[0]));
+	if (!in_fds || !out_fds)
+		goto out_clean;
+
+	for (cnt = 0, i = 0; i < s->prog_cnt; i++) {
+		struct bpf_program *prog = *s->progs[i].prog;
+		struct bpf_link **link = s->progs[i].link;
+
+		if (!is_trampoline(prog))
+			continue;
+		in_fds[cnt++] = (*link)->fd;
+	}
+
+	ret = bpf_trampoline_batch_detach(in_fds, out_fds, cnt);
+	if (ret)
+		goto out_clean;
+
+	for (i = 0; i < s->prog_cnt; i++) {
+		struct bpf_program *prog = *s->progs[i].prog;
+		struct bpf_link **link = s->progs[i].link;
+
+		if (!is_trampoline(prog))
+			continue;
+
+		bpf_link__disconnect(*link);
+		bpf_link__destroy(*link);
+		*link = NULL;
+	}
+
+out_clean:
+	free(in_fds);
+	free(out_fds);
+	return ret;
+}
+
 int bpf_object__attach_skeleton(struct bpf_object_skeleton *s)
 {
 	struct bpf_object *obj = *s->obj;
@@ -10914,11 +10955,20 @@ int bpf_object__attach_skeleton(struct bpf_object_skeleton *s)
 
 void bpf_object__detach_skeleton(struct bpf_object_skeleton *s)
 {
+	struct bpf_object *obj = *s->obj;
 	int i;
 
+	if (obj->trampoline_attach_batch)
+		detach_trace_batch(s);
+
 	for (i = 0; i < s->prog_cnt; i++) {
+		struct bpf_program *prog = *s->progs[i].prog;
 		struct bpf_link **link = s->progs[i].link;
 
+		/* Program was attached via batch mode. */
+		if (obj->trampoline_attach_batch && is_trampoline(prog))
+			continue;
+
 		bpf_link__destroy(*link);
 		*link = NULL;
 	}
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 5a5ce921956d..cfe0b3d52172 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -337,4 +337,5 @@ LIBBPF_0.2.0 {
 		perf_buffer__consume_buffer;
 		xsk_socket__create_shared;
 		bpf_trampoline_batch_attach;
+		bpf_trampoline_batch_detach;
 } LIBBPF_0.1.0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC bpf-next 15/16] selftests/bpf: Add trampoline batch test
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (13 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 14/16] libbpf: Add trampoline batch detach support Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22  8:21 ` [RFC bpf-next 16/16] selftests/bpf: Add attach batch test (NOT TO BE MERGED) Jiri Olsa
  2020-10-22 13:35 ` [RFC bpf-next 00/16] bpf: Speed up trampoline attach Steven Rostedt
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Adding a simple test that attaches fentry tracing programs to the
bpf_fentry_test* functions and uses the trampoline_attach_batch
bool in struct bpf_object_open_opts to attach them in batch
mode.
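
Assuming the usual prog_tests naming, the new test can be run on its
own with something like:

  # ./test_progs -t trampoline_batch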

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../bpf/prog_tests/trampoline_batch.c         | 45 +++++++++++
 .../bpf/progs/trampoline_batch_test.c         | 75 +++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
 create mode 100644 tools/testing/selftests/bpf/progs/trampoline_batch_test.c

diff --git a/tools/testing/selftests/bpf/prog_tests/trampoline_batch.c b/tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
new file mode 100644
index 000000000000..98929ac0bef6
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <test_progs.h>
+#include "trampoline_batch_test.skel.h"
+
+void test_trampoline_batch(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts);
+	struct trampoline_batch_test *skel = NULL;
+	int err, prog_fd, i;
+	__u32 duration = 0, retval;
+	__u64 *result;
+
+	opts.trampoline_attach_batch = true;
+
+	skel = trampoline_batch_test__open_opts(&opts);
+	if (CHECK(!skel, "skel_open", "open failed\n"))
+		goto cleanup;
+
+	err = trampoline_batch_test__load(skel);
+	if (CHECK(err, "skel_load", "load failed: %d\n", err))
+		goto cleanup;
+
+	err = trampoline_batch_test__attach(skel);
+	if (CHECK(err, "skel_attach", "attach failed: %d\n", err))
+		goto cleanup;
+
+	prog_fd = bpf_program__fd(skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	CHECK(err || retval, "test_run",
+	      "err %d errno %d retval %d duration %d\n",
+	      err, errno, retval, duration);
+
+	result = (__u64 *)skel->bss;
+	for (i = 0; i < 6; i++) {
+		if (CHECK(result[i] != 1, "result",
+			  "trampoline_batch_test fentry_test%d failed err %lld\n",
+			  i + 1, result[i]))
+			goto cleanup;
+	}
+
+cleanup:
+	trampoline_batch_test__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/trampoline_batch_test.c b/tools/testing/selftests/bpf/progs/trampoline_batch_test.c
new file mode 100644
index 000000000000..ff93799037f0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/trampoline_batch_test.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+__u64 test1_result = 0;
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	test1_result = 1;
+	return 0;
+}
+
+__u64 test2_result = 0;
+SEC("fentry/bpf_fentry_test2")
+int BPF_PROG(test2, int a, __u64 b)
+{
+	test2_result = 1;
+	return 0;
+}
+
+__u64 test3_result = 0;
+SEC("fentry/bpf_fentry_test3")
+int BPF_PROG(test3, char a, int b, __u64 c)
+{
+	test3_result = 1;
+	return 0;
+}
+
+__u64 test4_result = 0;
+SEC("fentry/bpf_fentry_test4")
+int BPF_PROG(test4, void *a, char b, int c, __u64 d)
+{
+	test4_result = 1;
+	return 0;
+}
+
+__u64 test5_result = 0;
+SEC("fentry/bpf_fentry_test5")
+int BPF_PROG(test5, __u64 a, void *b, short c, int d, __u64 e)
+{
+	test5_result = 1;
+	return 0;
+}
+
+__u64 test6_result = 0;
+SEC("fentry/bpf_fentry_test6")
+int BPF_PROG(test6, __u64 a, void *b, short c, int d, void * e, __u64 f)
+{
+	test6_result = 1;
+	return 0;
+}
+
+struct bpf_fentry_test_t {
+	struct bpf_fentry_test_t *a;
+};
+
+__u64 test7_result = 0;
+SEC("fentry/bpf_fentry_test7")
+int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
+{
+	test7_result = 1;
+	return 0;
+}
+
+__u64 test8_result = 0;
+SEC("fentry/bpf_fentry_test8")
+int BPF_PROG(test8, struct bpf_fentry_test_t *arg)
+{
+	test8_result = 1;
+	return 0;
+}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC bpf-next 16/16] selftests/bpf: Add attach batch test (NOT TO BE MERGED)
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (14 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 15/16] selftests/bpf: Add trampoline batch test Jiri Olsa
@ 2020-10-22  8:21 ` Jiri Olsa
  2020-10-22 13:35 ` [RFC bpf-next 00/16] bpf: Speed up trampoline attach Steven Rostedt
  16 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

Adding a test that attaches to 50 known functions,
which are also added to the kernel.

This test is meant only as a fast check on attach times.
It could probably be done in a different, mergeable way,
but at the moment it fits the need.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 net/bpf/test_run.c                            | 55 ++++++++++++++++
 .../selftests/bpf/prog_tests/attach_test.c    | 27 ++++++++
 .../testing/selftests/bpf/progs/attach_test.c | 62 +++++++++++++++++++
 3 files changed, 144 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/attach_test.c

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c1c30a9f76f3..8fc6d27fc07f 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -167,6 +167,61 @@ int noinline bpf_modify_return_test(int a, int *b)
 	*b += 1;
 	return a + *b;
 }
+
+#define ATTACH_TEST(__n) \
+	int noinline __PASTE(bpf_attach_test, __n)(void) { return 0; }
+
+ATTACH_TEST(0)
+ATTACH_TEST(1)
+ATTACH_TEST(2)
+ATTACH_TEST(3)
+ATTACH_TEST(4)
+ATTACH_TEST(5)
+ATTACH_TEST(6)
+ATTACH_TEST(7)
+ATTACH_TEST(8)
+ATTACH_TEST(9)
+ATTACH_TEST(10)
+ATTACH_TEST(11)
+ATTACH_TEST(12)
+ATTACH_TEST(13)
+ATTACH_TEST(14)
+ATTACH_TEST(15)
+ATTACH_TEST(16)
+ATTACH_TEST(17)
+ATTACH_TEST(18)
+ATTACH_TEST(19)
+ATTACH_TEST(20)
+ATTACH_TEST(21)
+ATTACH_TEST(22)
+ATTACH_TEST(23)
+ATTACH_TEST(24)
+ATTACH_TEST(25)
+ATTACH_TEST(26)
+ATTACH_TEST(27)
+ATTACH_TEST(28)
+ATTACH_TEST(29)
+ATTACH_TEST(30)
+ATTACH_TEST(31)
+ATTACH_TEST(32)
+ATTACH_TEST(33)
+ATTACH_TEST(34)
+ATTACH_TEST(35)
+ATTACH_TEST(36)
+ATTACH_TEST(37)
+ATTACH_TEST(38)
+ATTACH_TEST(39)
+ATTACH_TEST(40)
+ATTACH_TEST(41)
+ATTACH_TEST(42)
+ATTACH_TEST(43)
+ATTACH_TEST(44)
+ATTACH_TEST(45)
+ATTACH_TEST(46)
+ATTACH_TEST(47)
+ATTACH_TEST(48)
+ATTACH_TEST(49)
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
diff --git a/tools/testing/selftests/bpf/prog_tests/attach_test.c b/tools/testing/selftests/bpf/prog_tests/attach_test.c
new file mode 100644
index 000000000000..c5c6534c49c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/attach_test.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "attach_test.skel.h"
+
+void test_attach_test(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts);
+	struct attach_test *attach_skel = NULL;
+	__u32 duration = 0;
+	int err;
+
+	opts.trampoline_attach_batch = true;
+	attach_skel = attach_test__open_opts(&opts);
+	if (CHECK(!attach_skel, "attach_test__open_opts", "open skeleton failed\n"))
+		goto cleanup;
+
+	err = attach_test__load(attach_skel);
+	if (CHECK(err, "attach_skel_load", "attach skeleton failed\n"))
+		goto cleanup;
+
+	err = attach_test__attach(attach_skel);
+	if (CHECK(err, "attach", "attach failed: %d\n", err))
+		goto cleanup;
+
+cleanup:
+	attach_test__destroy(attach_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/attach_test.c b/tools/testing/selftests/bpf/progs/attach_test.c
new file mode 100644
index 000000000000..51b18f83c109
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/attach_test.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define ATTACH_PROG(__n)			\
+SEC("fentry/bpf_attach_test" #__n)		\
+int BPF_PROG(prog ## __n) { return 0; }
+
+ATTACH_PROG(0)
+ATTACH_PROG(1)
+ATTACH_PROG(2)
+ATTACH_PROG(3)
+ATTACH_PROG(4)
+ATTACH_PROG(5)
+ATTACH_PROG(6)
+ATTACH_PROG(7)
+ATTACH_PROG(8)
+ATTACH_PROG(9)
+ATTACH_PROG(10)
+ATTACH_PROG(11)
+ATTACH_PROG(12)
+ATTACH_PROG(13)
+ATTACH_PROG(14)
+ATTACH_PROG(15)
+ATTACH_PROG(16)
+ATTACH_PROG(17)
+ATTACH_PROG(18)
+ATTACH_PROG(19)
+ATTACH_PROG(20)
+ATTACH_PROG(21)
+ATTACH_PROG(22)
+ATTACH_PROG(23)
+ATTACH_PROG(24)
+ATTACH_PROG(25)
+ATTACH_PROG(26)
+ATTACH_PROG(27)
+ATTACH_PROG(28)
+ATTACH_PROG(29)
+ATTACH_PROG(30)
+ATTACH_PROG(31)
+ATTACH_PROG(32)
+ATTACH_PROG(33)
+ATTACH_PROG(34)
+ATTACH_PROG(35)
+ATTACH_PROG(36)
+ATTACH_PROG(37)
+ATTACH_PROG(38)
+ATTACH_PROG(39)
+ATTACH_PROG(40)
+ATTACH_PROG(41)
+ATTACH_PROG(42)
+ATTACH_PROG(43)
+ATTACH_PROG(44)
+ATTACH_PROG(45)
+ATTACH_PROG(46)
+ATTACH_PROG(47)
+ATTACH_PROG(48)
+ATTACH_PROG(49)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
                   ` (15 preceding siblings ...)
  2020-10-22  8:21 ` [RFC bpf-next 16/16] selftests/bpf: Add attach batch test (NOT TO BE MERGED) Jiri Olsa
@ 2020-10-22 13:35 ` Steven Rostedt
  2020-10-22 14:11   ` Jiri Olsa
  16 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2020-10-22 13:35 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, netdev,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, 22 Oct 2020 10:21:22 +0200
Jiri Olsa <jolsa@kernel.org> wrote:

> hi,
> this patchset tries to speed up the attach time for trampolines
> and make bpftrace faster for wildcard use cases like:
> 
>   # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); }"
> 
> Profiles show mostly ftrace backend, because we add trampoline
> functions one by one and ftrace direct function registering is
> quite expensive. Thus main change in this patchset is to allow
> batch attach and use just single ftrace call to attach or detach
> multiple ips/trampolines.

The issue I have with this change is that the purpose of the direct
trampoline was to give bpf access to the parameters of a function as if it
was called directly. That is, it could see the parameters of a function
quickly. I even made the direct function work if it wanted to also trace
the return code.

What the direct calls are NOT is a generic function tracer. If that
is required, then bpftrace should be registering itself with ftrace.
If you are attaching to a set of functions where it becomes obvious that
it's not being used to access specific parameters, then that's an abuse of
the direct calls.

We already have one generic function tracer, we don't need another.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 13:35 ` [RFC bpf-next 00/16] bpf: Speed up trampoline attach Steven Rostedt
@ 2020-10-22 14:11   ` Jiri Olsa
  2020-10-22 14:42     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-22 14:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 09:35:10AM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 10:21:22 +0200
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > hi,
> > this patchset tries to speed up the attach time for trampolines
> > and make bpftrace faster for wildcard use cases like:
> > 
> >   # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); }"
> > 
> > Profiles show mostly ftrace backend, because we add trampoline
> > functions one by one and ftrace direct function registering is
> > quite expensive. Thus main change in this patchset is to allow
> > batch attach and use just single ftrace call to attach or detach
> > multiple ips/trampolines.
> 
> The issue I have with this change is that the purpose of the direct
> trampoline was to give bpf access to the parameters of a function as if it
> was called directly. That is, it could see the parameters of a function
> quickly. I even made the direct function work if it wanted to also trace
> the return code.
> 
> What the direct calls is NOT, is a generic tracing function tracer. If that
> is required, then bpftrace should be registering itself with ftrace.
> If you are attaching to a set of functions, where it becomes obvious that
> its not being used to access specific parameters, then that's an abuse of
> the direct calls.
> 
> We already have one generic function tracer, we don't need another.

I understand direct calls as a way for bpf trampolines and ftrace to
co-exist - ebpf trampolines need that functionality of accessing
parameters of a function as if it was called directly, and at the same
time we need to be able to attach to any function, and to as many
functions as we want, in a fast way

the bpftrace example above did not use arguments for simplicity, but they
could have been there ... I think we could detect the presence of arguments
in ebpf programs and use ftrace_ops directly in case they are not needed
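
for illustration, a rough (untested) sketch of that idea - the
bpf_prog_needs_args() check is made up, the rest is the existing
ftrace_ops API:

	static void bpf_noargs_callback(unsigned long ip, unsigned long parent_ip,
					struct ftrace_ops *op, struct pt_regs *regs)
	{
		/* run the attached bpf program(s); no function arguments needed */
	}

	static struct ftrace_ops bpf_noargs_ops = {
		.func = bpf_noargs_callback,
	};

	/* attach path, error handling omitted; the ops would be registered
	 * just once, additional ips only get added to its filter
	 */
	if (!bpf_prog_needs_args(prog)) {
		ftrace_set_filter_ip(&bpf_noargs_ops, ip, 0, 0);
		register_ftrace_function(&bpf_noargs_ops);
	}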

jirka


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 14:11   ` Jiri Olsa
@ 2020-10-22 14:42     ` Steven Rostedt
  2020-10-22 16:21       ` Steven Rostedt
  2020-10-27  4:30       ` Alexei Starovoitov
  0 siblings, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2020-10-22 14:42 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, 22 Oct 2020 16:11:54 +0200
Jiri Olsa <jolsa@redhat.com> wrote:

> I understand direct calls as a way that bpf trampolines and ftrace can
> co-exist together - ebpf trampolines need that functionality of accessing
> parameters of a function as if it was called directly and at the same
> point we need to be able attach to any function and to as many functions
> as we want in a fast way

I was sold that bpf needed a quick and fast way to get the arguments of a
function, as the only way to do that with ftrace is to save all registers,
which, I was told, was too much overhead; if you only care about
arguments, there's much less that needs to be saved.

Direct calls weren't added so that bpf and ftrace could co-exist; they
were added because, for certain cases, bpf wanted a faster way to access
arguments - it still worked with ftrace, but the saving of regs was too
strenuous.

> 
> the bpftrace example above did not use arguments for simplicity, but they
> could have been there ... I think we could detect arguments presence in
> ebpf programs and use ftrace_ops directly in case they are not needed

What I don't see is why one would need to access arguments for a lot of
calls directly. The direct trampoline was for "one-offs", because every
function that has a direct trampoline prevents kretprobes and the function
graph tracer from accessing it. Before I allow a "batch" direct caller, I
need it to not break function graph tracing.

If we are going to have a way to get parameters for multiple functions, I
would then like to have that be directly part of the ftrace infrastructure.
That is, allow more than just bpf to have access to this. If it's going to
be generic, then let's have it work for all function trace users and not
just bpf.

I'd like to see how batch functions will work. I guess I need to start
looking at the bpf trampoline, to see if we can modify the ftrace
trampoline to have quick access to parameters. It would be much more
beneficial to update the existing generic function tracer to have access to
function parameters that all users could benefit from, than to tweak a
single use case into giving this feature to a single user.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 14:42     ` Steven Rostedt
@ 2020-10-22 16:21       ` Steven Rostedt
  2020-10-22 20:52         ` Steven Rostedt
  2020-10-27  4:30       ` Alexei Starovoitov
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2020-10-22 16:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, 22 Oct 2020 10:42:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'd like to see how batch functions will work. I guess I need to start
> looking at the bpf trampoline, to see if we can modify the ftrace
> trampoline to have a quick access to parameters. It would be much more
> beneficial to update the existing generic function tracer to have access to
> function parameters that all users could benefit from, than to tweak a
> single use case into giving this feature to a single user.

Looking at the creation of the bpf trampoline, I think I can modify ftrace
to have a more flexible callback - something that passes the callback the
following (roughly sketched below):

 the function being traced.
 a pointer to the parent caller (that could be modified)
 a pointer to the original stack frame (what the stack was when the
      function is entered)
 An array of the arguments of the function (which could also be modified)
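
As a rough sketch, such a callback type could look like this (it ends
up spelled ftrace_args_func_t in the proof-of-concept patch later in
this thread):

typedef void (*ftrace_args_func_t)(unsigned long ip, unsigned long *parent_ip,
				   struct ftrace_ops *op, unsigned long *stack,
				   unsigned long *args);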

This is a change I've been wanting to make for some time, because it would
allow function graph to be a user of function tracer, and would give
everything access to the arguments.

We would still need a helper function to store all regs to keep kprobes
working unmodified, but this would still only be done if asked.

The above change shouldn't hurt performance for either ftrace or bpf,
because it appears they both do the same thing. If BPF wants to have batch
processing of functions, then I think we should modify ftrace to take this
new approach instead of creating another set of function trampolines.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 16:21       ` Steven Rostedt
@ 2020-10-22 20:52         ` Steven Rostedt
  2020-10-23  6:09           ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2020-10-22 20:52 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, 22 Oct 2020 12:21:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 22 Oct 2020 10:42:05 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > I'd like to see how batch functions will work. I guess I need to start
> > looking at the bpf trampoline, to see if we can modify the ftrace
> > trampoline to have a quick access to parameters. It would be much more
> > beneficial to update the existing generic function tracer to have access to
> > function parameters that all users could benefit from, than to tweak a
> > single use case into giving this feature to a single user.  
> 
> Looking at the creation of the bpf trampoline, I think I can modify ftrace
> to have a more flexible callback. Something that passes the callback the
> following:
> 
>  the function being traced.
>  a pointer to the parent caller (that could be modified)
>  a pointer to the original stack frame (what the stack was when the
>       function is entered)
>  An array of the arguments of the function (which could also be modified)
> 
> This is a change I've been wanting to make for some time, because it would
> allow function graph to be a user of function tracer, and would give
> everything access to the arguments.
> 
> We would still need a helper function to store all regs to keep kprobes
> working unmodified, but this would still only be done if asked.
> 
> The above change shouldn't hurt performance for either ftrace or bpf
> because it appears they both do the same. If BPF wants to have a batch
> processing of functions, then I think we should modify ftrace to do this
> new approach instead of creating another set of function trampolines.

The below is a quick proof of concept patch I whipped up. It will always
save 6 arguments, so if BPF is really interested in just saving the bare
minimum of arguments before calling, it can still use direct. But if you
are going to have a generic callback, you'll need to save all parameters,
otherwise you can corrupt the function's parameters if the traced function
uses more than you save.
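
To illustrate (made-up function; on x86_64 the first six integer
arguments arrive in %rdi, %rsi, %rdx, %rcx, %r8 and %r9):

int noinline traced_func(long a, long b, long c, long d, long e, long f)
{
	return (int)f;	/* 'f' arrives in %r9 */
}

/*
 * A trampoline that saves and restores only five argument registers
 * before calling into a bpf program leaves %r9 unprotected; any call
 * the program makes can clobber it, so 'f' may already be corrupted
 * when the original body of traced_func() runs.
 */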

Also, looking at the bpf trampoline code, I noticed that if the verifier
can't find the btf func, it falls back to saving 5 parameters, which can be
a bug on x86 if the function itself uses 6 or more. If you only save 5
parameters and then call a bpf program that calls a helper function that
uses more than 5 parameters, it will likely corrupt the 6th parameter of
the function being traced.

The code in question is this:

int btf_distill_func_proto(struct bpf_verifier_log *log,
			   struct btf *btf,
			   const struct btf_type *func,
			   const char *tname,
			   struct btf_func_model *m)
{
	const struct btf_param *args;
	const struct btf_type *t;
	u32 i, nargs;
	int ret;

	if (!func) {
		/* BTF function prototype doesn't match the verifier types.
		 * Fall back to 5 u64 args.
		 */
		for (i = 0; i < 5; i++)
			m->arg_size[i] = 8;
		m->ret_size = 8;
		m->nr_args = 5;
		return 0;
	}

Shouldn't it be falling back to 6, not 5?

-- Steve

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 7edbd5ee5ed4..b65d73f430ed 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -287,6 +287,10 @@ extern void ftrace_caller_end(void);
 extern void ftrace_caller_op_ptr(void);
 extern void ftrace_regs_caller_op_ptr(void);
 extern void ftrace_regs_caller_jmp(void);
+extern void ftrace_args_caller(void);
+extern void ftrace_args_call(void);
+extern void ftrace_args_caller_end(void);
+extern void ftrace_args_caller_op_ptr(void);
 
 /* movq function_trace_op(%rip), %rdx */
 /* 0x48 0x8b 0x15 <offset-to-ftrace_trace_op (4 bytes)> */
@@ -317,7 +321,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned long end_offset;
 	unsigned long op_offset;
 	unsigned long call_offset;
-	unsigned long jmp_offset;
+	unsigned long jmp_offset = 0;
 	unsigned long offset;
 	unsigned long npages;
 	unsigned long size;
@@ -336,12 +340,16 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		op_offset = (unsigned long)ftrace_regs_caller_op_ptr;
 		call_offset = (unsigned long)ftrace_regs_call;
 		jmp_offset = (unsigned long)ftrace_regs_caller_jmp;
+	} else if (ops->flags & FTRACE_OPS_FL_ARGS) {
+		start_offset = (unsigned long)ftrace_args_caller;
+		end_offset = (unsigned long)ftrace_args_caller_end;
+		op_offset = (unsigned long)ftrace_args_caller_op_ptr;
+		call_offset = (unsigned long)ftrace_args_call;
 	} else {
 		start_offset = (unsigned long)ftrace_caller;
 		end_offset = (unsigned long)ftrace_caller_end;
 		op_offset = (unsigned long)ftrace_caller_op_ptr;
 		call_offset = (unsigned long)ftrace_call;
-		jmp_offset = 0;
 	}
 
 	size = end_offset - start_offset;
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index ac3d5f22fe64..65ca634d0b37 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -176,6 +176,58 @@ SYM_INNER_LABEL_ALIGN(ftrace_stub, SYM_L_WEAK)
 	retq
 SYM_FUNC_END(ftrace_epilogue)
 
+SYM_FUNC_START(ftrace_args_caller)
+#ifdef CONFIG_FRAME_POINTER
+	pushq %rbp
+	movq %rsp, %rbp
+# define CALLED_OFFSET (7 * 8)
+# define PARENT_OFFSET (8 * 8)
+#else
+# define CALLED_OFFSET (6 * 8)
+# define PARENT_OFFSET (7 * 8)
+#endif
+	/* save args */
+	pushq %r9
+	pushq %r8
+	pushq %rcx
+	pushq %rdx
+	pushq %rsi
+	pushq %rdi
+
+	/*
+	 * Parameters:
+	 *   Called site (function called)
+	 *   Address of parent location
+	 *   pointer to ftrace_ops
+	 *   Location of stack when function was called
+	 *   Array of arguments.
+	 */
+	movq CALLED_OFFSET(%rsp), %rdi
+	leaq PARENT_OFFSET(%rsp), %rsi
+SYM_INNER_LABEL(ftrace_args_caller_op_ptr, SYM_L_GLOBAL)
+	/* Load the ftrace_ops into the 3rd parameter */
+	movq function_trace_op(%rip), %rdx
+	movq %rsi, %rcx
+	leaq 0(%rsp), %r8
+
+SYM_INNER_LABEL(ftrace_args_call, SYM_L_GLOBAL)
+	callq ftrace_stub
+
+	popq %rdi
+	popq %rsi
+	popq %rdx
+	popq %rcx
+	popq %r8
+	popq %r9
+
+#ifdef CONFIG_FRAME_POINTER
+	popq %rbp
+#endif
+
+SYM_INNER_LABEL(ftrace_args_caller_end, SYM_L_GLOBAL)
+	jmp ftrace_epilogue
+SYM_FUNC_END(ftrace_args_caller)
+
 SYM_FUNC_START(ftrace_regs_caller)
 	/* Save the current flags before any operations that can change them */
 	pushfq
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1bd3a0356ae4..0d077e8d7bb4 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -92,6 +92,17 @@ struct ftrace_ops;
 typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip,
 			      struct ftrace_ops *op, struct pt_regs *regs);
 
+typedef void (*ftrace_args_func_t)(unsigned long ip, unsigned long *parent_ip,
+				   struct ftrace_ops *op, unsigned long *stack,
+				   unsigned long *args);
+
+union ftrace_callback {
+	ftrace_func_t		func;
+	ftrace_args_func_t	args_func;
+};
+
+typedef union ftrace_callback ftrace_callback_t;
+
 ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
 
 /*
@@ -169,6 +180,7 @@ enum {
 	FTRACE_OPS_FL_TRACE_ARRAY		= BIT(15),
 	FTRACE_OPS_FL_PERMANENT                 = BIT(16),
 	FTRACE_OPS_FL_DIRECT			= BIT(17),
+	FTRACE_OPS_FL_ARGS			= BIT(18),
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE
@@ -447,9 +459,11 @@ enum {
 	FTRACE_FL_DISABLED	= (1UL << 25),
 	FTRACE_FL_DIRECT	= (1UL << 24),
 	FTRACE_FL_DIRECT_EN	= (1UL << 23),
+	FTRACE_FL_ARGS		= (1UL << 22),
+	FTRACE_FL_ARGS_EN	= (1UL << 21),
 };
 
-#define FTRACE_REF_MAX_SHIFT	23
+#define FTRACE_REF_MAX_SHIFT	21
 #define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)
 
 #define ftrace_rec_count(rec)	((rec)->flags & FTRACE_REF_MAX)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4833b6a82ce7..5632b0809dc0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1721,6 +1721,9 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
 			if (ops->flags & FTRACE_OPS_FL_DIRECT)
 				rec->flags |= FTRACE_FL_DIRECT;
 
+			else if (ops->flags & FTRACE_OPS_FL_ARGS)
+				rec->flags |= FTRACE_FL_ARGS;
+
 			/*
 			 * If there's only a single callback registered to a
 			 * function, and the ops has a trampoline registered
@@ -1757,6 +1760,10 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
 			if (ops->flags & FTRACE_OPS_FL_DIRECT)
 				rec->flags &= ~FTRACE_FL_DIRECT;
 
+			/* POC: but we will have more than one */
+			if (ops->flags & FTRACE_OPS_FL_ARGS)
+				rec->flags &= ~FTRACE_FL_ARGS;
+
 			/*
 			 * If the rec had REGS enabled and the ops that is
 			 * being removed had REGS set, then see if there is
@@ -2103,6 +2110,13 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 		    !(rec->flags & FTRACE_FL_TRAMP_EN))
 			flag |= FTRACE_FL_TRAMP;
 
+		/* Proof of concept */
+		if (ftrace_rec_count(rec) == 1) {
+			if (!(rec->flags & FTRACE_FL_ARGS) !=
+			    !(rec->flags & FTRACE_FL_ARGS_EN))
+				flag |= FTRACE_FL_ARGS;
+		}
+
 		/*
 		 * Direct calls are special, as count matters.
 		 * We must test the record for direct, if the
@@ -2144,6 +2158,17 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 				else
 					rec->flags &= ~FTRACE_FL_TRAMP_EN;
 			}
+			if (flag & FTRACE_FL_ARGS) {
+				if (ftrace_rec_count(rec) == 1) {
+					if (rec->flags & FTRACE_FL_ARGS)
+						rec->flags |= FTRACE_FL_ARGS_EN;
+					else
+						rec->flags &= ~FTRACE_FL_ARGS_EN;
+				} else {
+					rec->flags &= ~FTRACE_FL_ARGS_EN;
+				}
+			}
+
 			if (flag & FTRACE_FL_DIRECT) {
 				/*
 				 * If there's only one user (direct_ops helper)
@@ -2192,7 +2217,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 			 * and REGS states. The _EN flags must be disabled though.
 			 */
 			rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
-					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
+					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
+					FTRACE_FL_ARGS_EN);
 	}
 
 	ftrace_bug_type = FTRACE_BUG_NOP;
@@ -3630,7 +3656,8 @@ static int t_show(struct seq_file *m, void *v)
 			   ftrace_rec_count(rec),
 			   rec->flags & FTRACE_FL_REGS ? " R" : "  ",
 			   rec->flags & FTRACE_FL_IPMODIFY ? " I" : "  ",
-			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ");
+			   rec->flags & FTRACE_FL_DIRECT ? " D" :
+			   rec->flags & FTRACE_FL_ARGS ? " A" : "  ");
 		if (rec->flags & FTRACE_FL_TRAMP_EN) {
 			ops = ftrace_find_tramp_ops_any(rec);
 			if (ops) {
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 2c2126e1871d..a3da84b0e599 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -86,9 +86,48 @@ void ftrace_destroy_function_files(struct trace_array *tr)
 	ftrace_free_ftrace_ops(tr);
 }
 
+static void function_args_trace_call(unsigned long ip,
+				     unsigned long *parent_ip,
+				     struct ftrace_ops *op,
+				     unsigned long *stack,
+				     unsigned long *args)
+{
+	struct trace_array *tr = op->private;
+	struct trace_array_cpu *data;
+	unsigned long flags;
+	int bit;
+	int cpu;
+	int pc;
+
+	if (unlikely(!tr->function_enabled))
+		return;
+
+	pc = preempt_count();
+	preempt_disable_notrace();
+
+	bit = trace_test_and_set_recursion(TRACE_FTRACE_START, TRACE_FTRACE_MAX);
+	if (bit < 0)
+		goto out;
+
+	cpu = smp_processor_id();
+	data = per_cpu_ptr(tr->array_buffer.data, cpu);
+	if (!atomic_read(&data->disabled)) {
+		local_save_flags(flags);
+		trace_function(tr, ip, *parent_ip, flags, pc);
+		trace_printk("%pS %lx %lx %lx %lx %lx %lx\n",
+			     (void *)ip, args[0], args[1], args[2], args[3],
+			     args[4], args[5]);
+	}
+	trace_clear_recursion(bit);
+
+ out:
+	preempt_enable_notrace();
+
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
-	ftrace_func_t func;
+	ftrace_callback_t callback;
 
 	/*
 	 * Instance trace_arrays get their ops allocated
@@ -101,11 +140,14 @@ static int function_trace_init(struct trace_array *tr)
 	/* Currently only the global instance can do stack tracing */
 	if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
 	    func_flags.val & TRACE_FUNC_OPT_STACK)
-		func = function_stack_trace_call;
-	else
-		func = function_trace_call;
+		callback.func = function_stack_trace_call;
+	else {
+		tr->ops->flags |= FTRACE_OPS_FL_ARGS;
+		callback.args_func = function_args_trace_call;
+	}
+//		func = function_trace_call;
 
-	ftrace_init_array_ops(tr, func);
+	ftrace_init_array_ops(tr, callback.func);
 
 	tr->array_buffer.cpu = get_cpu();
 	put_cpu();

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 20:52         ` Steven Rostedt
@ 2020-10-23  6:09           ` Jiri Olsa
  2020-10-23 13:50             ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-23  6:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 04:52:29PM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 12:21:50 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Thu, 22 Oct 2020 10:42:05 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > I'd like to see how batch functions will work. I guess I need to start
> > > looking at the bpf trampoline, to see if we can modify the ftrace
> > > trampoline to have a quick access to parameters. It would be much more
> > > beneficial to update the existing generic function tracer to have access to
> > > function parameters that all users could benefit from, than to tweak a
> > > single use case into giving this feature to a single user.  
> > 
> > Looking at the creation of the bpf trampoline, I think I can modify ftrace
> > to have a more flexible callback. Something that passes the callback the
> > following:
> > 
> >  the function being traced.
> >  a pointer to the parent caller (that could be modified)
> >  a pointer to the original stack frame (what the stack was when the
> >       function is entered)
> >  An array of the arguments of the function (which could also be modified)
> > 
> > This is a change I've been wanting to make for some time, because it would
> > allow function graph to be a user of function tracer, and would give
> > everything access to the arguments.
> > 
> > We would still need a helper function to store all regs to keep kprobes
> > working unmodified, but this would still only be done if asked.
> > 
> > The above change shouldn't hurt performance for either ftrace or bpf
> > because it appears they both do the same. If BPF wants to have a batch
> > processing of functions, then I think we should modify ftrace to do this
> > new approach instead of creating another set of function trampolines.
> 
> The below is a quick proof of concept patch I whipped up. It will always
> save 6 arguments, so if BPF is really interested in just saving the bare
> minimum of arguments before calling, it can still use direct. But if you
> are going to have a generic callback, you'll need to save all parameters
> otherwise you can corrupt the function's parameter if traced function uses
> more than you save.

nice, I'll take a look, thanks for the quick code ;-)

> 
> Which looking at the bpf trampoline code, I noticed that if the verifier
> can't find the btf func, it falls back to saving 5 parameters. Which can be
> a bug on x86 if the function itself uses 6 or more. If you only save 5
> parameters, then call a bpf program that calls a helper function that uses
> more than 5 parameters, it will likely corrupt the 6th parameter of the
> function being traced.

the number of args passed from an eBPF program to an in-kernel function
is restricted to 5, so we should be fine

> 
> The code in question is this:
> 
> int btf_distill_func_proto(struct bpf_verifier_log *log,
> 			   struct btf *btf,
> 			   const struct btf_type *func,
> 			   const char *tname,
> 			   struct btf_func_model *m)
> {
> 	const struct btf_param *args;
> 	const struct btf_type *t;
> 	u32 i, nargs;
> 	int ret;
> 
> 	if (!func) {
> 		/* BTF function prototype doesn't match the verifier types.
> 		 * Fall back to 5 u64 args.
> 		 */
> 		for (i = 0; i < 5; i++)
> 			m->arg_size[i] = 8;
> 		m->ret_size = 8;
> 		m->nr_args = 5;
> 		return 0;
> 	}
> 
> Shouldn't it be falling back to 6, not 5?

but it looks like this actually could fall back to 6, the jit would
allow that, but I'm not sure if there's another restriction
thanks,
jirka

> 
> -- Steve
> 
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 7edbd5ee5ed4..b65d73f430ed 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -287,6 +287,10 @@ extern void ftrace_caller_end(void);
>  extern void ftrace_caller_op_ptr(void);
>  extern void ftrace_regs_caller_op_ptr(void);
>  extern void ftrace_regs_caller_jmp(void);
> +extern void ftrace_args_caller(void);
> +extern void ftrace_args_call(void);
> +extern void ftrace_args_caller_end(void);
> +extern void ftrace_args_caller_op_ptr(void);
>  
>  /* movq function_trace_op(%rip), %rdx */
>  /* 0x48 0x8b 0x15 <offset-to-ftrace_trace_op (4 bytes)> */
> @@ -317,7 +321,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
>  	unsigned long end_offset;
>  	unsigned long op_offset;
>  	unsigned long call_offset;
> -	unsigned long jmp_offset;
> +	unsigned long jmp_offset = 0;
>  	unsigned long offset;
>  	unsigned long npages;
>  	unsigned long size;
> @@ -336,12 +340,16 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
>  		op_offset = (unsigned long)ftrace_regs_caller_op_ptr;
>  		call_offset = (unsigned long)ftrace_regs_call;
>  		jmp_offset = (unsigned long)ftrace_regs_caller_jmp;
> +	} else if (ops->flags & FTRACE_OPS_FL_ARGS) {
> +		start_offset = (unsigned long)ftrace_args_caller;
> +		end_offset = (unsigned long)ftrace_args_caller_end;
> +		op_offset = (unsigned long)ftrace_args_caller_op_ptr;
> +		call_offset = (unsigned long)ftrace_args_call;
>  	} else {
>  		start_offset = (unsigned long)ftrace_caller;
>  		end_offset = (unsigned long)ftrace_caller_end;
>  		op_offset = (unsigned long)ftrace_caller_op_ptr;
>  		call_offset = (unsigned long)ftrace_call;
> -		jmp_offset = 0;
>  	}
>  
>  	size = end_offset - start_offset;
> diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
> index ac3d5f22fe64..65ca634d0b37 100644
> --- a/arch/x86/kernel/ftrace_64.S
> +++ b/arch/x86/kernel/ftrace_64.S
> @@ -176,6 +176,58 @@ SYM_INNER_LABEL_ALIGN(ftrace_stub, SYM_L_WEAK)
>  	retq
>  SYM_FUNC_END(ftrace_epilogue)
>  
> +SYM_FUNC_START(ftrace_args_caller)
> +#ifdef CONFIG_FRAME_POINTER
> +	push %rdp
> +	movq %rsp %rdp
> +# define CALLED_OFFEST (7 * 8)
> +# define PARENT_OFFSET (8 * 8)
> +#else
> +# define CALLED_OFFSET (6 * 8)
> +# define PARENT_OFFSET (7 * 8)
> +#endif
> +	/* save args */
> +	pushq %r9
> +	pushq %r8
> +	pushq %rcx
> +	pushq %rdx
> +	pushq %rsi
> +	pushq %rdi
> +
> +	/*
> +	 * Parameters:
> +	 *   Called site (function called)
> +	 *   Address of parent location
> +	 *   pointer to ftrace_ops
> +	 *   Location of stack when function was called
> +	 *   Array of arguments.
> +	 */
> +	movq CALLED_OFFSET(%rsp), %rdi
> +	leaq PARENT_OFFSET(%rsp), %rsi
> +SYM_INNER_LABEL(ftrace_args_caller_op_ptr, SYM_L_GLOBAL)
> +	/* Load the ftrace_ops into the 3rd parameter */
> +	movq function_trace_op(%rip), %rdx
> +	movq %rsi, %rcx
> +	leaq 0(%rsp), %r8
> +
> +SYM_INNER_LABEL(ftrace_args_call, SYM_L_GLOBAL)
> +	callq ftrace_stub
> +
> +	popq %rdi
> +	popq %rsi
> +	popq %rdx
> +	popq %rcx
> +	popq %r8
> +	popq %r9
> +
> +#ifdef CONFIG_FRAME_POINTER
> +	popq %rbp
> +#endif
> +
> +SYM_INNER_LABEL(ftrace_args_caller_end, SYM_L_GLOBAL)
> +	jmp ftrace_epilogue
> +SYM_FUNC_END(ftrace_args_caller)
> +
>  SYM_FUNC_START(ftrace_regs_caller)
>  	/* Save the current flags before any operations that can change them */
>  	pushfq
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 1bd3a0356ae4..0d077e8d7bb4 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -92,6 +92,17 @@ struct ftrace_ops;
>  typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip,
>  			      struct ftrace_ops *op, struct pt_regs *regs);
>  
> +typedef void (*ftrace_args_func_t)(unsigned long ip, unsigned long *parent_ip,
> +				   struct ftrace_ops *op, unsigned long *stack,
> +				   unsigned long *args);
> +
> +union ftrace_callback {
> +	ftrace_func_t		func;
> +	ftrace_args_func_t	args_func;
> +};
> +
> +typedef union ftrace_callback ftrace_callback_t;
> +
>  ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
>  
>  /*
> @@ -169,6 +180,7 @@ enum {
>  	FTRACE_OPS_FL_TRACE_ARRAY		= BIT(15),
>  	FTRACE_OPS_FL_PERMANENT                 = BIT(16),
>  	FTRACE_OPS_FL_DIRECT			= BIT(17),
> +	FTRACE_OPS_FL_ARGS			= BIT(18),
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE
> @@ -447,9 +459,11 @@ enum {
>  	FTRACE_FL_DISABLED	= (1UL << 25),
>  	FTRACE_FL_DIRECT	= (1UL << 24),
>  	FTRACE_FL_DIRECT_EN	= (1UL << 23),
> +	FTRACE_FL_ARGS		= (1UL << 22),
> +	FTRACE_FL_ARGS_EN	= (1UL << 21),
>  };
>  
> -#define FTRACE_REF_MAX_SHIFT	23
> +#define FTRACE_REF_MAX_SHIFT	21
>  #define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)
>  
>  #define ftrace_rec_count(rec)	((rec)->flags & FTRACE_REF_MAX)
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 4833b6a82ce7..5632b0809dc0 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1721,6 +1721,9 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
>  			if (ops->flags & FTRACE_OPS_FL_DIRECT)
>  				rec->flags |= FTRACE_FL_DIRECT;
>  
> +			else if (ops->flags & FTRACE_OPS_FL_ARGS)
> +				rec->flags |= FTRACE_FL_ARGS;
> +
>  			/*
>  			 * If there's only a single callback registered to a
>  			 * function, and the ops has a trampoline registered
> @@ -1757,6 +1760,10 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
>  			if (ops->flags & FTRACE_OPS_FL_DIRECT)
>  				rec->flags &= ~FTRACE_FL_DIRECT;
>  
> +			/* POC: but we will have more than one */
> +			if (ops->flags & FTRACE_OPS_FL_ARGS)
> +				rec->flags &= ~FTRACE_FL_ARGS;
> +
>  			/*
>  			 * If the rec had REGS enabled and the ops that is
>  			 * being removed had REGS set, then see if there is
> @@ -2103,6 +2110,13 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  		    !(rec->flags & FTRACE_FL_TRAMP_EN))
>  			flag |= FTRACE_FL_TRAMP;
>  
> +		/* Proof of concept */
> +		if (ftrace_rec_count(rec) == 1) {
> +			if (!(rec->flags & FTRACE_FL_ARGS) !=
> +			    !(rec->flags & FTRACE_FL_ARGS_EN))
> +				flag |= FTRACE_FL_ARGS;
> +		}
> +
>  		/*
>  		 * Direct calls are special, as count matters.
>  		 * We must test the record for direct, if the
> @@ -2144,6 +2158,17 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  				else
>  					rec->flags &= ~FTRACE_FL_TRAMP_EN;
>  			}
> +			if (flag & FTRACE_FL_ARGS) {
> +				if (ftrace_rec_count(rec) == 1) {
> +					if (rec->flags & FTRACE_FL_ARGS)
> +						rec->flags |= FTRACE_FL_ARGS_EN;
> +					else
> +						rec->flags &= ~FTRACE_FL_ARGS_EN;
> +				} else {
> +					rec->flags &= ~FTRACE_FL_ARGS_EN;
> +				}
> +			}
> +
>  			if (flag & FTRACE_FL_DIRECT) {
>  				/*
>  				 * If there's only one user (direct_ops helper)
> @@ -2192,7 +2217,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  			 * and REGS states. The _EN flags must be disabled though.
>  			 */
>  			rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
> -					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
> +					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
> +					FTRACE_FL_ARGS_EN);
>  	}
>  
>  	ftrace_bug_type = FTRACE_BUG_NOP;
> @@ -3630,7 +3656,8 @@ static int t_show(struct seq_file *m, void *v)
>  			   ftrace_rec_count(rec),
>  			   rec->flags & FTRACE_FL_REGS ? " R" : "  ",
>  			   rec->flags & FTRACE_FL_IPMODIFY ? " I" : "  ",
> -			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ");
> +			   rec->flags & FTRACE_FL_DIRECT ? " D" :
> +			   rec->flags & FTRACE_FL_ARGS ? " A" : "  ");
>  		if (rec->flags & FTRACE_FL_TRAMP_EN) {
>  			ops = ftrace_find_tramp_ops_any(rec);
>  			if (ops) {
> diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
> index 2c2126e1871d..a3da84b0e599 100644
> --- a/kernel/trace/trace_functions.c
> +++ b/kernel/trace/trace_functions.c
> @@ -86,9 +86,48 @@ void ftrace_destroy_function_files(struct trace_array *tr)
>  	ftrace_free_ftrace_ops(tr);
>  }
>  
> +static void function_args_trace_call(unsigned long ip,
> +				     unsigned long *parent_ip,
> +				     struct ftrace_ops *op,
> +				     unsigned long *stack,
> +				     unsigned long *args)
> +{
> +	struct trace_array *tr = op->private;
> +	struct trace_array_cpu *data;
> +	unsigned long flags;
> +	int bit;
> +	int cpu;
> +	int pc;
> +
> +	if (unlikely(!tr->function_enabled))
> +		return;
> +
> +	pc = preempt_count();
> +	preempt_disable_notrace();
> +
> +	bit = trace_test_and_set_recursion(TRACE_FTRACE_START, TRACE_FTRACE_MAX);
> +	if (bit < 0)
> +		goto out;
> +
> +	cpu = smp_processor_id();
> +	data = per_cpu_ptr(tr->array_buffer.data, cpu);
> +	if (!atomic_read(&data->disabled)) {
> +		local_save_flags(flags);
> +		trace_function(tr, ip, *parent_ip, flags, pc);
> +		trace_printk("%pS %lx %lx %lx %lx %lx %lx\n",
> +			     (void *)ip, args[0], args[1], args[2], args[3],
> +			     args[4], args[5]);
> +	}
> +	trace_clear_recursion(bit);
> +
> + out:
> +	preempt_enable_notrace();
> +
> +}
> +
>  static int function_trace_init(struct trace_array *tr)
>  {
> -	ftrace_func_t func;
> +	ftrace_callback_t callback;
>  
>  	/*
>  	 * Instance trace_arrays get their ops allocated
> @@ -101,11 +140,14 @@ static int function_trace_init(struct trace_array *tr)
>  	/* Currently only the global instance can do stack tracing */
>  	if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
>  	    func_flags.val & TRACE_FUNC_OPT_STACK)
> -		func = function_stack_trace_call;
> -	else
> -		func = function_trace_call;
> +		callback.func = function_stack_trace_call;
> +	else {
> +		tr->ops->flags |= FTRACE_OPS_FL_ARGS;
> +		callback.args_func = function_args_trace_call;
> +	}
> +//		func = function_trace_call;
>  
> -	ftrace_init_array_ops(tr, func);
> +	ftrace_init_array_ops(tr, callback.func);
>  
>  	tr->array_buffer.cpu = get_cpu();
>  	put_cpu();
> 



* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-23  6:09           ` Jiri Olsa
@ 2020-10-23 13:50             ` Steven Rostedt
  2020-10-25 19:01               ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2020-10-23 13:50 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Fri, 23 Oct 2020 08:09:32 +0200
Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > The below is a quick proof of concept patch I whipped up. It will always
> > save 6 arguments, so if BPF is really interested in just saving the bare
> > minimum of arguments before calling, it can still use direct. But if you
> > are going to have a generic callback, you'll need to save all parameters
> > otherwise you can corrupt the function's parameter if traced function uses
> > more than you save.  
> 
> nice, I'll take a look, thanks for quick code ;-)

Playing with it more, I have it where I don't add a new ARGS flag, but just
make that the default (if arch supports it). And with this change, I even
have function graph working directly with the function tracer, and I can
now get rid of the function graph trampoline! Of course, this will be a
slow process because it has to be changed across architectures, but with a
HAVE_FTRACE_ARGS flag, I can do it one by one.

> 
> > 
> > Which looking at the bpf trampoline code, I noticed that if the verifier
> > can't find the btf func, it falls back to saving 5 parameters. Which can be
> > a bug on x86 if the function itself uses 6 or more. If you only save 5
> > parameters, then call a bpf program that calls a helper function that uses
> > more than 5 parameters, it will likely corrupt the 6th parameter of the
> > function being traced.  
> 
> number of args from eBPF program to in-kernel function is
> restricted to 5, so we should be fine

Is there something to keep an eBPF program from tracing a function with 6
args? If the program saves only 5 args, but traces a function that has 6
args, then the tracing program may end up using the register that the 6th
argument is in, and corrupting it.

I'm looking at bpf/trampoline.c, that has:

	arch_prepare_bpf_trampoline(new_image, ...)

and that new_image is passed into:

	register_ftrace_direct(ip, new_addr);

where new_addr == new_image.

And I don't see anywhere in the creation of that new_image that saves the
6th parameter.

The bpf program calls some helper functions which are allowed to clobber
%r9 (where the 6th parameter is stored on x86_64). That means, when it
returns to the function it traced, the 6th parameter is no longer correct.

At a minimum, direct callers must save all the parameters used by the
function, not just what the eBPF code may use.
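
(Illustrative example, not from the patchset: a made-up kernel-style
function with six arguments. On x86_64 the sixth argument arrives in %r9,
which the calling convention treats as caller-saved, so anything the
trampoline calls may clobber it unless the trampoline saves and restores
it around those calls.)

  /* hypothetical traced function; 'f' is passed in %r9 on x86_64 */
  __attribute__((noinline)) long traced_six(long a, long b, long c,
                                            long d, long e, long f)
  {
          /* if a trampoline attached at the entry of this function calls
           * helpers that clobber %r9 without saving/restoring it, the
           * value of 'f' seen here is garbage */
          return a + b + c + d + e + f;
  }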

> 
> > 
> > The code in question is this:
> > 
> > int btf_distill_func_proto(struct bpf_verifier_log *log,
> > 			   struct btf *btf,
> > 			   const struct btf_type *func,
> > 			   const char *tname,
> > 			   struct btf_func_model *m)
> > {
> > 	const struct btf_param *args;
> > 	const struct btf_type *t;
> > 	u32 i, nargs;
> > 	int ret;
> > 
> > 	if (!func) {
> > 		/* BTF function prototype doesn't match the verifier types.
> > 		 * Fall back to 5 u64 args.
> > 		 */
> > 		for (i = 0; i < 5; i++)
> > 			m->arg_size[i] = 8;
> > 		m->ret_size = 8;
> > 		m->nr_args = 5;
> > 		return 0;
> > 	}
> > 
> > Shouldn't it be falling back to 6, not 5?  
> 
> but it looks like this actually could fall back to 6, the jit would
> allow that, but I'm not sure if there's another restriction


Either way, the direct trampoline must save all registers used by
parameters of the function, and if it doesn't know how many parameters are
used, it must save all possible ones (like ftrace does).

-- Steve


* Re: [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put
  2020-10-22  8:21 ` [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put Jiri Olsa
@ 2020-10-23 19:46   ` Andrii Nakryiko
  2020-10-25 19:02     ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-23 19:46 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Steven Rostedt, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 8:01 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Moving bpf_link_free call into delayed processing so we don't
> need to wait for it when releasing the link.
>
> For example, bpf_tracing_link_release could take a considerable
> amount of time in the bpf_trampoline_put function due to the
> synchronize_rcu_tasks call.
>
> It speeds up bpftrace release time in the following example:
>
> Before:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      3,290,457,628      cycles:k                                 ( +-  0.27% )
>        933,581,973      cycles:u                                 ( +-  0.20% )
>
>              50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )
>
> After:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      2,535,458,767      cycles:k                                 ( +-  0.55% )
>        940,046,382      cycles:u                                 ( +-  0.27% )
>
>              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  kernel/bpf/syscall.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 1110ecd7d1f3..61ef29f9177d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2346,12 +2346,8 @@ void bpf_link_put(struct bpf_link *link)
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);

We just recently reverted this exact change. Doing this makes it
non-deterministic from user-space POV when the BPF program is
**actually** detached. This makes user-space programming much more
complicated and unpredictable. So please don't do this. Let's find
some other way to speed this up.

>  }
>
>  static int bpf_link_release(struct inode *inode, struct file *filp)
> --
> 2.26.2
>


* Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-22  8:21 ` [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support Jiri Olsa
@ 2020-10-23 20:03   ` Andrii Nakryiko
  2020-10-23 20:31     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-23 20:03 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Steven Rostedt, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 8:01 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding BPF_TRAMPOLINE_BATCH_ATTACH support, which allows attaching
> multiple tracing fentry/fexit programs to trampolines within one
> syscall.
>
> Currently each tracing program is attached in a separate bpf syscall
> and, more importantly, by a separate register_ftrace_direct call, which
> registers the trampoline in the ftrace subsystem. We can save some cycles
> by simply using its batch variant, register_ftrace_direct_ips.
>
> Before:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      2,199,433,771      cycles:k               ( +-  0.55% )
>        936,105,469      cycles:u               ( +-  0.37% )
>
>              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
>
> After:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      1,456,854,867      cycles:k               ( +-  0.57% )
>        937,737,431      cycles:u               ( +-  0.13% )
>
>              12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )
>
> The new BPF_TRAMPOLINE_BATCH_ATTACH syscall command expects
> the following data in union bpf_attr:
>
>   struct {
>           __aligned_u64   in;
>           __aligned_u64   out;
>           __u32           count;
>   } trampoline_batch;
>
>   in    - pointer to user space array with file descriptors of loaded bpf
>           programs to attach
>   out   - pointer to user space array for the resulting link descriptors
>   count - number of 'in/out' file descriptors
>
> Basically the new code gets programs from the 'in' file descriptors and
> attaches them the same way the current code does, apart from the last
> step that registers the probe ip with the trampoline. This is done at the
> end with the new register_ftrace_direct_ips function.
>
> The resulting link descriptors are written in 'out' array and match
> 'in' array file descriptors order.
>

I think this is a pretty hard API to use correctly from user-space.
Think about all those partially attached and/or partially detached BPF
programs. And subsequent clean up for them. Also there is nothing even
close to atomicity, so you might get a spurious invocation a few times
before batch-attach fails mid-way and the kernel (hopefully) will
detach those already attached programs in an attempt to clean
everything up. Debugging and handling that is a big pain for users,
IMO.

Here's a raw idea, let's think if it would be possible to implement
something like this. It seems like what you need is to create a set of
logically-grouped placeholders for multiple functions you are about to
attach to. Until the BPF program is attached, those placeholders are
just no-ops (e.g., they might jump to an "inactive" single trampoline,
which just immediately returns). Then you attach the BPF program
atomically into a single place, and all those no-op jumps to a
trampoline start to call the BPF program at the same time. It's not
strictly atomic, but is much closer in time with each other. Also,
because it's still a single trampoline, you get a nice mapping to a
single bpf_link, so detaching is not an issue.

Basically, maybe ftrace subsystem could provide a set of APIs to
prepare a set of functions to attach to. Then BPF subsystem would just
do what it does today, except instead of attaching to a specific
kernel function, it would attach to ftrace's placeholder. I don't know
anything about ftrace implementation, so this might be far off. But I
thought that looking at this problem from a bit of a different angle
would benefit the discussion. Thoughts?


> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/bpf.h      | 15 ++++++-
>  include/uapi/linux/bpf.h |  7 ++++
>  kernel/bpf/syscall.c     | 88 ++++++++++++++++++++++++++++++++++++++--
>  kernel/bpf/trampoline.c  | 69 +++++++++++++++++++++++++++----
>  4 files changed, 164 insertions(+), 15 deletions(-)
>

[...]
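
(For illustration, a hedged user-space sketch of how the batch command
described in the quoted commit message might be driven. The
BPF_TRAMPOLINE_BATCH_ATTACH command and the trampoline_batch fields exist
only in this RFC, so building this would need the RFC's uapi headers; the
wrapper below is a sketch with error handling omitted.)

  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* prog_fds: 'count' loaded tracing programs; link_fds: output array
   * that receives the resulting link fds, in the same order */
  static int trampoline_batch_attach(int *prog_fds, int *link_fds,
                                     uint32_t count)
  {
          union bpf_attr attr = {};

          attr.trampoline_batch.in    = (uint64_t)(uintptr_t)prog_fds;
          attr.trampoline_batch.out   = (uint64_t)(uintptr_t)link_fds;
          attr.trampoline_batch.count = count;

          return syscall(__NR_bpf, BPF_TRAMPOLINE_BATCH_ATTACH, &attr,
                         sizeof(attr));
  }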


* Re: [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support
  2020-10-22  8:21 ` [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support Jiri Olsa
@ 2020-10-23 20:09   ` Andrii Nakryiko
  2020-10-25 19:11     ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-23 20:09 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Steven Rostedt, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 2:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding trampoline batch attach support so it's possible to use
> batch mode to load tracing programs.
>
> Adding trampoline_attach_batch bool to struct bpf_object_open_opts.
> When set to true, bpf_object__attach_skeleton will try to attach
> all tracing programs via batch mode.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---

Assuming we go with the current kernel API for batch-attach, why can't
libbpf just detect kernel support for it and just use it always,
without requiring users to opt into anything?

But I'm also confused a bit how this is supposed to be used with BPF
skeleton. You use case described in a cover letter (bpftrace glob
attach, right?) would have a single BPF program attached to many
different functions. While here you are trying to collect different
programs and attach each one to its respective kernel function. Do you
expect users to have hundreds of BPF programs in their skeletons? If
not, I don't really see why adding this complexity. What am I missing?

Now it also seems weird to me for the kernel API to allow attaching
many-to-many BPF programs-to-attach points. One BPF program-to-many
attach points seems like a more sane and common requirement, no?


>  tools/lib/bpf/bpf.c      | 12 +++++++
>  tools/lib/bpf/bpf.h      |  1 +
>  tools/lib/bpf/libbpf.c   | 76 +++++++++++++++++++++++++++++++++++++++-
>  tools/lib/bpf/libbpf.h   |  5 ++-
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 93 insertions(+), 2 deletions(-)
>

[...]


* Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-23 20:03   ` Andrii Nakryiko
@ 2020-10-23 20:31     ` Steven Rostedt
  2020-10-23 22:23       ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2020-10-23 20:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Fri, 23 Oct 2020 13:03:22 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> Basically, maybe ftrace subsystem could provide a set of APIs to
> prepare a set of functions to attach to. Then BPF subsystem would just
> do what it does today, except instead of attaching to a specific
> kernel function, it would attach to ftrace's placeholder. I don't know
> anything about ftrace implementation, so this might be far off. But I
> thought that looking at this problem from a bit of a different angle
> would benefit the discussion. Thoughts?

I probably understand bpf internals as much as you understand ftrace
internals ;-)

Anyway, what I'm currently working on is a fast way to get to the
arguments of a function. For now, I'm just focused on x86_64, and only add
6 arguments.

The main issue that Alexei had with using the ftrace trampoline was that
the only way to get to the arguments was to set the "REGS" flag, which
would give a regs parameter that contained a full pt_regs. The problem with
this approach is that it required saving *all* regs for every function
traced. Alexei felt that this was too much overhead.

Looking at Jiri's patch, I took a look at the creation of the bpf
trampoline, and noticed that it's copying the regs onto the stack (at least
the ones that are used, which I think could be an issue).

For tracing a function, one must store all argument registers used, and
restore them, as that's how they are passed from caller to callee. And
since they are stored anyway, I figure they should also be sent to the
function callbacks, so that they have access to them too.

I'm working on a set of patches to make this a reality.

-- Steve


* Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-23 20:31     ` Steven Rostedt
@ 2020-10-23 22:23       ` Andrii Nakryiko
  2020-10-25 19:41         ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-23 22:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 23 Oct 2020 13:03:22 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > Basically, maybe ftrace subsystem could provide a set of APIs to
> > prepare a set of functions to attach to. Then BPF subsystem would just
> > do what it does today, except instead of attaching to a specific
> > kernel function, it would attach to ftrace's placeholder. I don't know
> > anything about ftrace implementation, so this might be far off. But I
> > thought that looking at this problem from a bit of a different angle
> > would benefit the discussion. Thoughts?
>
> I probably understand bpf internals as much as you understand ftrace
> internals ;-)
>

Heh :) But while we are here, what do you think about this idea of
preparing a no-op trampoline that a bunch (thousands, potentially) of
function entries will jump to? And once all that is ready and patched
through the kernel functions' entry points, then allow attaching a BPF
program or ftrace callback (if I get the terminology right) in one
fast and simple operation? For users that would mean that they will
either get calls for all or none of the attached kfuncs, with simple and
reliable semantics.

Something like this, where bpf_prog attachment (which replaces nop)
happens as step 2:

+------------+  +----------+  +----------+
|  kfunc1    |  |  kfunc2  |  |  kfunc3  |
+------+-----+  +----+-----+  +----+-----+
       |             |             |
       |             |             |
       +---------------------------+
                     |
                     v
                 +---+---+           +-----------+
                 |  nop  +----------->  bpf_prog |
                 +-------+           +-----------+


> Anyway, what I'm currently working on, is a fast way to get to the
> arguments of a function. For now, I'm just focused on x86_64, and only add
> 6 arguments.
>
> The main issue that Alexei had with using the ftrace trampoline, was that
> the only way to get to the arguments was to set the "REGS" flag, which
> would give a regs parameter that contained a full pt_regs. The problem with
> this approach is that it required saving *all* regs for every function
> traced. Alexei felt that this was too much overhead.
>
> Looking at Jiri's patch, I took a look at the creation of the bpf
> trampoline, and noticed that it's copying the regs on a stack (at least
> what is used, which I think could be an issue).

Right. And BPF doesn't get access to the entire pt_regs struct, so it
doesn't have to pay the price of saving it.

But just FYI. Alexei is out till next week, so don't expect him to
reply in the next few days. But he's probably best to discuss these
nitty-gritty details with :)

>
> For tracing a function, one must store all argument registers used, and
> restore them, as that's how they are passed from caller to callee. And
> since they are stored anyway, I figure, that should also be sent to the
> function callbacks, so that they have access to them too.
>
> I'm working on a set of patches to make this a reality.
>
> -- Steve


* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-23 13:50             ` Steven Rostedt
@ 2020-10-25 19:01               ` Jiri Olsa
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-25 19:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Fri, Oct 23, 2020 at 09:50:20AM -0400, Steven Rostedt wrote:

SNIP

> Is there something to keep an eBPF program from tracing a function with 6
> args? If the program saves only 5 args, but traces a function that has 6
> args, then the tracing program may end up using the register that the 6th
> argument is in, and corrupting it.
> 
> I'm looking at bpf/trampoline.c, that has:
> 
> 	arch_prepare_bpf_trampoline(new_image, ...)
> 
> and that new_image is passed into:
> 
> 	register_ftrace_direct(ip, new_addr);
> 
> where new_addr == new_image.
> 
> And I don't see anywhere in the creation of that new_image that saves the
> 6th parameter.

  arch_prepare_bpf_trampoline
    ...
    save_regs(m, &prog, nr_args, stack_size);

> 
> The bpf program calls some helper functions which are allowed to clobber
> %r9 (where the 6th parameter is stored on x86_64). That means, when it
> returns to the function it traced, the 6th parameter is no longer correct.
> 
> At a minimum, direct callers must save all the parameters used by the
> function, not just what the eBPF code may use.
> 
> > 
> > > 
> > > The code in question is this:
> > > 
> > > int btf_distill_func_proto(struct bpf_verifier_log *log,
> > > 			   struct btf *btf,
> > > 			   const struct btf_type *func,
> > > 			   const char *tname,
> > > 			   struct btf_func_model *m)
> > > {
> > > 	const struct btf_param *args;
> > > 	const struct btf_type *t;
> > > 	u32 i, nargs;
> > > 	int ret;
> > > 
> > > 	if (!func) {
> > > 		/* BTF function prototype doesn't match the verifier types.
> > > 		 * Fall back to 5 u64 args.
> > > 		 */
> > > 		for (i = 0; i < 5; i++)
> > > 			m->arg_size[i] = 8;
> > > 		m->ret_size = 8;
> > > 		m->nr_args = 5;
> > > 		return 0;
> > > 	}

the fallback code in btf_distill_func_proto you're referring to
is for the case of tracing another ebpf program; when hooking to a
kernel function, all args are used with no fallback to 5 args

I'm not sure what the rules are wrt args count when tracing
another ebpf program

jirka



* Re: [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put
  2020-10-23 19:46   ` Andrii Nakryiko
@ 2020-10-25 19:02     ` Jiri Olsa
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-25 19:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Fri, Oct 23, 2020 at 12:46:15PM -0700, Andrii Nakryiko wrote:
> On Thu, Oct 22, 2020 at 8:01 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Moving bpf_link_free call into delayed processing so we don't
> > need to wait for it when releasing the link.
> >
> > For example bpf_tracing_link_release could take considerable
> > amount of time in bpf_trampoline_put function due to
> > synchronize_rcu_tasks call.
> >
> > It speeds up bpftrace release time in following example:
> >
> > Before:
> >
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
> >     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> >
> >      3,290,457,628      cycles:k                                 ( +-  0.27% )
> >        933,581,973      cycles:u                                 ( +-  0.20% )
> >
> >              50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )
> >
> > After:
> >
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
> >     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> >
> >      2,535,458,767      cycles:k                                 ( +-  0.55% )
> >        940,046,382      cycles:u                                 ( +-  0.27% )
> >
> >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  kernel/bpf/syscall.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 1110ecd7d1f3..61ef29f9177d 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2346,12 +2346,8 @@ void bpf_link_put(struct bpf_link *link)
> >         if (!atomic64_dec_and_test(&link->refcnt))
> >                 return;
> >
> > -       if (in_atomic()) {
> > -               INIT_WORK(&link->work, bpf_link_put_deferred);
> > -               schedule_work(&link->work);
> > -       } else {
> > -               bpf_link_free(link);
> > -       }
> > +       INIT_WORK(&link->work, bpf_link_put_deferred);
> > +       schedule_work(&link->work);
> 
> We just recently reverted this exact change. Doing this makes it
> non-deterministic from user-space POV when the BPF program is
> **actually** detached. This makes user-space programming much more
> complicated and unpredictable. So please don't do this. Let's find
> some other way to speed this up.

ok, makes sense

jirka

> 
> >  }
> >
> >  static int bpf_link_release(struct inode *inode, struct file *filp)
> > --
> > 2.26.2
> >
> 



* Re: [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support
  2020-10-23 20:09   ` Andrii Nakryiko
@ 2020-10-25 19:11     ` Jiri Olsa
  2020-10-26 23:15       ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-25 19:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Fri, Oct 23, 2020 at 01:09:26PM -0700, Andrii Nakryiko wrote:
> On Thu, Oct 22, 2020 at 2:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding trampoline batch attach support so it's possible to use
> > batch mode to load tracing programs.
> >
> > Adding trampoline_attach_batch bool to struct bpf_object_open_opts.
> > When set to true the bpf_object__attach_skeleton will try to load
> > all tracing programs via batch mode.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> 
> Assuming we go with the current kernel API for batch-attach, why can't
> libbpf just detect kernel support for it and just use it always,
> without requiring users to opt into anything?

yea, it's an rfc ;-) I wanted some simple usage of the
interface so it's obvious how it works

if we end up with some batch interface I agree
we should use it as you suggested

> 
> But I'm also confused a bit how this is supposed to be used with BPF
> skeleton. You use case described in a cover letter (bpftrace glob
> attach, right?) would have a single BPF program attached to many
> different functions. While here you are trying to collect different
> programs and attach each one to its respective kernel function. Do you
> expect users to have hundreds of BPF programs in their skeletons? If
> not, I don't really see why adding this complexity. What am I missing?

AFAIU when you use a trampoline program you declare the attach point
at load time, so you actually can't use the same program for different
kernel functions - which would be a great speedup actually, because
that's where the rest of the cycles in bpftrace are spent (in that cover
letter example) - the load/verifier check of all those programs
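
(As a concrete illustration of the point above: with fentry/fexit the
target function is baked into the program at load time via its section
name/BTF id, so the loaded program cannot later be pointed at a different
function. A minimal libbpf-style sketch; do_unlinkat is just an example
target, not something from this patchset.)

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  /* the attach target is part of the section name and is resolved to a
   * BTF id when the program is loaded, not when it is attached */
  SEC("fentry/do_unlinkat")
  int BPF_PROG(handle_unlinkat, int dfd, struct filename *name)
  {
          return 0;
  }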

it's different for kprobes, where you hook a single kprobe program via
multiple kprobe perf events to different kernel functions

> 
> Now it also seems weird to me for the kernel API to allow attaching
> many-to-many BPF programs-to-attach points. One BPF program-to-many
> attach points seems like a more sane and common requirement, no?

right, but that's the consequence of what I wrote above

jirka

> 
> 
> >  tools/lib/bpf/bpf.c      | 12 +++++++
> >  tools/lib/bpf/bpf.h      |  1 +
> >  tools/lib/bpf/libbpf.c   | 76 +++++++++++++++++++++++++++++++++++++++-
> >  tools/lib/bpf/libbpf.h   |  5 ++-
> >  tools/lib/bpf/libbpf.map |  1 +
> >  5 files changed, 93 insertions(+), 2 deletions(-)
> >
> 
> [...]
> 



* Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-23 22:23       ` Andrii Nakryiko
@ 2020-10-25 19:41         ` Jiri Olsa
  2020-10-26 23:19           ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-25 19:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Steven Rostedt, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Networking, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Fri, Oct 23, 2020 at 03:23:10PM -0700, Andrii Nakryiko wrote:
> On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Fri, 23 Oct 2020 13:03:22 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > Basically, maybe ftrace subsystem could provide a set of APIs to
> > > prepare a set of functions to attach to. Then BPF subsystem would just
> > > do what it does today, except instead of attaching to a specific
> > > kernel function, it would attach to ftrace's placeholder. I don't know
> > > anything about ftrace implementation, so this might be far off. But I
> > > thought that looking at this problem from a bit of a different angle
> > > would benefit the discussion. Thoughts?
> >
> > I probably understand bpf internals as much as you understand ftrace
> > internals ;-)
> >
> 
> Heh :) But while we are here, what do you think about this idea of
> preparing a no-op trampoline, that a bunch (thousands, potentially) of
> function entries will jump to. And once all that is ready and patched
> through kernel functions entry points, then allow to attach BPF
> program or ftrace callback (if I get the terminology right) in a one
> fast and simple operation? For users that would mean that they will
> either get calls for all or none of attached kfuncs, with a simple and
> reliable semantics.

so the main pain point the batch interface is addressing is that
every attach (BPF_RAW_TRACEPOINT_OPEN command) calls register_ftrace_direct,
and you'll need to do the same for the nop trampoline, no?

I wonder if we could create some 'transaction object' represented
by an fd and add it to bpf_attr::raw_tracepoint

then attach (the BPF_RAW_TRACEPOINT_OPEN command) would add the program to
this new 'transaction object' instead of updating ftrace directly

and when the collection is done (all BPF_RAW_TRACEPOINT_OPEN commands
are executed), we'd call a new bpf syscall command on that transaction
and it would call the ftrace interface

something like:

  bpf(TRANSACTION_NEW) = fd
  bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_1, fd
  bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_2, fd
  ...
  bpf(TRANSACTION_DONE) for fd

jirka
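
(Purely hypothetical sketch of the flow above, to make the proposal
concrete: TRANSACTION_NEW, TRANSACTION_DONE and the tx_fd field are
invented names that do not exist in any uapi, and sys_bpf() stands for a
plain bpf(2) syscall wrapper.)

  union bpf_attr attr = {};
  int tx_fd, link_fd[2];
  int prog_fd[2] = { prog_fd_1, prog_fd_2 };   /* already loaded programs */

  tx_fd = sys_bpf(TRANSACTION_NEW, &attr, sizeof(attr));   /* invented command */

  for (int i = 0; i < 2; i++) {
          attr.raw_tracepoint.prog_fd = prog_fd[i];
          attr.raw_tracepoint.tx_fd   = tx_fd;             /* invented field */
          /* collected in the transaction, ftrace is not touched yet */
          link_fd[i] = sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
  }

  /* one ftrace update (e.g. register_ftrace_direct_ips) for all of them */
  sys_bpf(TRANSACTION_DONE, &attr, sizeof(attr));          /* invented command */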

> 
> Something like this, where bpf_prog attachment (which replaces nop)
> happens as step 2:
> 
> +------------+  +----------+  +----------+
> |  kfunc1    |  |  kfunc2  |  |  kfunc3  |
> +------+-----+  +----+-----+  +----+-----+
>        |             |             |
>        |             |             |
>        +---------------------------+
>                      |
>                      v
>                  +---+---+           +-----------+
>                  |  nop  +----------->  bpf_prog |
>                  +-------+           +-----------+
> 
> 
> > Anyway, what I'm currently working on, is a fast way to get to the
> > arguments of a function. For now, I'm just focused on x86_64, and only add
> > 6 arguments.
> >
> > The main issue that Alexei had with using the ftrace trampoline, was that
> > the only way to get to the arguments was to set the "REGS" flag, which
> > would give a regs parameter that contained a full pt_regs. The problem with
> > this approach is that it required saving *all* regs for every function
> > traced. Alexei felt that this was too much overhead.
> >
> > Looking at Jiri's patch, I took a look at the creation of the bpf
> > trampoline, and noticed that it's copying the regs on a stack (at least
> > what is used, which I think could be an issue).
> 
> Right. And BPF doesn't get access to the entire pt_regs struct, so it
> doesn't have to pay the prices of saving it.
> 
> But just FYI. Alexei is out till next week, so don't expect him to
> reply in the next few days. But he's probably best to discuss these
> nitty-gritty details with :)
> 
> >
> > For tracing a function, one must store all argument registers used, and
> > restore them, as that's how they are passed from caller to callee. And
> > since they are stored anyway, I figure, that should also be sent to the
> > function callbacks, so that they have access to them too.
> >
> > I'm working on a set of patches to make this a reality.
> >
> > -- Steve
> 



* Re: [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support
  2020-10-25 19:11     ` Jiri Olsa
@ 2020-10-26 23:15       ` Andrii Nakryiko
  2020-10-27 19:03         ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-26 23:15 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Sun, Oct 25, 2020 at 12:12 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Oct 23, 2020 at 01:09:26PM -0700, Andrii Nakryiko wrote:
> > On Thu, Oct 22, 2020 at 2:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding trampoline batch attach support so it's possible to use
> > > batch mode to load tracing programs.
> > >
> > > Adding trampoline_attach_batch bool to struct bpf_object_open_opts.
> > > When set to true the bpf_object__attach_skeleton will try to load
> > > all tracing programs via batch mode.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> >
> > Assuming we go with the current kernel API for batch-attach, why can't
> > libbpf just detect kernel support for it and just use it always,
> > without requiring users to opt into anything?
>
> yea, it's rfc ;-) I wanted some simple usage of the
> interface so it's obvious how it works
>
> if we'll end up with some batch interface I agree
> we should use it as you suggested
>
> >
> > But I'm also confused a bit how this is supposed to be used with BPF
> > skeleton. You use case described in a cover letter (bpftrace glob
> > attach, right?) would have a single BPF program attached to many
> > different functions. While here you are trying to collect different
> > programs and attach each one to its respective kernel function. Do you
> > expect users to have hundreds of BPF programs in their skeletons? If
> > not, I don't really see why adding this complexity. What am I missing?
>
> AFAIU when you use trampoline program you declare the attach point
> at the load time, so you actually can't use same program for different
> kernel functions - which would be great speed up actually, because
> that's where the rest of the cycles in bpftrace is spent (in that cover
> letter example) - load/verifier check of all those programs

Ah, I see, you are right. And yes, I agree, it would be nice to not
have to clone the BPF program many times to attach to fentry/fexit, if
the program itself doesn't really change.

>
> it's different for kprobe where you hook single kprobe via multiple
> kprobe perf events to different kernel function
>
> >
> > Now it also seems weird to me for the kernel API to allow attaching
> > many-to-many BPF programs-to-attach points. One BPF program-to-many
> > attach points seems like a more sane and common requirement, no?
>
> right, but that's the consequence of what I wrote above

Well, maybe we should get rid of that limitation first ;)

>
> jirka
>
> >
> >
> > >  tools/lib/bpf/bpf.c      | 12 +++++++
> > >  tools/lib/bpf/bpf.h      |  1 +
> > >  tools/lib/bpf/libbpf.c   | 76 +++++++++++++++++++++++++++++++++++++++-
> > >  tools/lib/bpf/libbpf.h   |  5 ++-
> > >  tools/lib/bpf/libbpf.map |  1 +
> > >  5 files changed, 93 insertions(+), 2 deletions(-)
> > >
> >
> > [...]
> >
>


* Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
  2020-10-25 19:41         ` Jiri Olsa
@ 2020-10-26 23:19           ` Andrii Nakryiko
  0 siblings, 0 replies; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-26 23:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Networking, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Sun, Oct 25, 2020 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Oct 23, 2020 at 03:23:10PM -0700, Andrii Nakryiko wrote:
> > On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Fri, 23 Oct 2020 13:03:22 -0700
> > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > > Basically, maybe ftrace subsystem could provide a set of APIs to
> > > > prepare a set of functions to attach to. Then BPF subsystem would just
> > > > do what it does today, except instead of attaching to a specific
> > > > kernel function, it would attach to ftrace's placeholder. I don't know
> > > > anything about ftrace implementation, so this might be far off. But I
> > > > thought that looking at this problem from a bit of a different angle
> > > > would benefit the discussion. Thoughts?
> > >
> > > I probably understand bpf internals as much as you understand ftrace
> > > internals ;-)
> > >
> >
> > Heh :) But while we are here, what do you think about this idea of
> > preparing a no-op trampoline, that a bunch (thousands, potentially) of
> > function entries will jump to. And once all that is ready and patched
> > through kernel functions entry points, then allow to attach BPF
> > program or ftrace callback (if I get the terminology right) in a one
> > fast and simple operation? For users that would mean that they will
> > either get calls for all or none of attached kfuncs, with a simple and
> > reliable semantics.
>
> so the main pain point the batch interface is addressing, is that
> every attach (BPF_RAW_TRACEPOINT_OPEN command) calls register_ftrace_direct,
> and you'll need to do the same for nop trampoline, no?

I guess I had a hope that if we know it's a nop that we are
installing, then we can do it without extra waiting, which should
speed it up quite a bit.

>
> I wonder if we could create some 'transaction object' represented
> by fd and add it to bpf_attr::raw_tracepoint
>
> then attach (BPF_RAW_TRACEPOINT_OPEN command) would add program to this
> new 'transaction object' instead of updating ftrace directly
>
> and when the collection is done (all BPF_RAW_TRACEPOINT_OPEN command
> are executed), we'd call new bpf syscall command on that transaction
> and it would call ftrace interface
>

This is conceptually something like what I had in mind, but I had a
single BPF program attached to many kernel functions in mind.
Something that's impossible today, as you mentioned in another thread.

> something like:
>
>   bpf(TRANSACTION_NEW) = fd
>   bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_1, fd
>   bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_2, fd
>   ...
>   bpf(TRANSACTION_DONE) for fd
>
> jirka
>
> >
> > Something like this, where bpf_prog attachment (which replaces nop)
> > happens as step 2:
> >
> > +------------+  +----------+  +----------+
> > |  kfunc1    |  |  kfunc2  |  |  kfunc3  |
> > +------+-----+  +----+-----+  +----+-----+
> >        |             |             |
> >        |             |             |
> >        +---------------------------+
> >                      |
> >                      v
> >                  +---+---+           +-----------+
> >                  |  nop  +----------->  bpf_prog |
> >                  +-------+           +-----------+
> >
> >
> > > Anyway, what I'm currently working on, is a fast way to get to the
> > > arguments of a function. For now, I'm just focused on x86_64, and only add
> > > 6 arguments.
> > >
> > > The main issue that Alexei had with using the ftrace trampoline, was that
> > > the only way to get to the arguments was to set the "REGS" flag, which
> > > would give a regs parameter that contained a full pt_regs. The problem with
> > > this approach is that it required saving *all* regs for every function
> > > traced. Alexei felt that this was too much overhead.
> > >
> > > Looking at Jiri's patch, I took a look at the creation of the bpf
> > > trampoline, and noticed that it's copying the regs on a stack (at least
> > > what is used, which I think could be an issue).
> >
> > Right. And BPF doesn't get access to the entire pt_regs struct, so it
> > doesn't have to pay the prices of saving it.
> >
> > But just FYI. Alexei is out till next week, so don't expect him to
> > reply in the next few days. But he's probably best to discuss these
> > nitty-gritty details with :)
> >
> > >
> > > For tracing a function, one must store all argument registers used, and
> > > restore them, as that's how they are passed from caller to callee. And
> > > since they are stored anyway, I figure, that should also be sent to the
> > > function callbacks, so that they have access to them too.
> > >
> > > I'm working on a set of patches to make this a reality.
> > >
> > > -- Steve
> >
>


* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-22 14:42     ` Steven Rostedt
  2020-10-22 16:21       ` Steven Rostedt
@ 2020-10-27  4:30       ` Alexei Starovoitov
  2020-10-27 13:14         ` Steven Rostedt
  2020-10-27 14:28         ` Jiri Olsa
  1 sibling, 2 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2020-10-27  4:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, netdev, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 16:11:54 +0200
> Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > I understand direct calls as a way that bpf trampolines and ftrace can
> > co-exist together - ebpf trampolines need that functionality of accessing
> > parameters of a function as if it was called directly and at the same
> > point we need to be able to attach to any function and to as many functions
> > as we want in a fast way
> 
> I was sold that bpf needed a quick and fast way to get the arguments of a
> function, as the only way to do that with ftrace is to save all registers,
> which, I was told was too much overhead, as if you only care about
> arguments, there's much less that is needed to save.
> 
> Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> that for certain cases, bpf wanted a faster way to access arguments,
> because it still worked with ftrace, but the saving of regs was too
> strenuous.

Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
There is no other use for it.

Jiri,
could you please redo your benchmarking hardcoding ftrace_managed=false ?
If going through register_ftrace_direct() is indeed so much slower
than arch_text_poke() then something gotta give.
Either register_ftrace_direct() has to become faster or users
have to give up on co-existing of bpf and ftrace.
So far not a single user cared about using trampoline and ftrace together.
So the latter is certainly an option.
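
(For context, the switch in question lives in register_fentry() in
kernel/bpf/trampoline.c. The snippet below is paraphrased from memory of
the 2020-era code rather than copied, with the suggested benchmarking hack
marked; forcing ftrace_managed to false makes every attach go through
bpf_arch_text_poke() instead of the ftrace direct API.)

  static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
  {
          void *ip = tr->func.addr;
          int ret;

          ret = is_ftrace_location(ip);
          if (ret < 0)
                  return ret;
          tr->func.ftrace_managed = ret;
          /* tr->func.ftrace_managed = false;  <- benchmarking hack, not for merging */

          if (tr->func.ftrace_managed)
                  ret = register_ftrace_direct((long)ip, (long)new_addr);
          else
                  ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
          return ret;
  }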

Regardless, the patch 7 (rbtree of kallsyms) is probably good on its own.
Can you benchmark it independently and maybe resubmit if it's useful
without other patches?


* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-27  4:30       ` Alexei Starovoitov
@ 2020-10-27 13:14         ` Steven Rostedt
  2020-10-27 14:28         ` Jiri Olsa
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2020-10-27 13:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, netdev, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Mon, 26 Oct 2020 21:30:14 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > that for certain cases, bpf wanted a faster way to access arguments,
> > because it still worked with ftrace, but the saving of regs was too
> > strenuous.  
> 
> Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> There is no other use for it.

What does that even mean? And I'm guessing when you say "trampoline"
you mean a "bpf trampoline" because "trampoline" is used for a lot more
than bpf, and bpf does not own that term.

Do you mean, "direct calls in ftrace were done so that bpf trampolines
could work"? Remember, ftrace has a lot of users, and it must remain
backward compatible.

-- Steve


* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-27  4:30       ` Alexei Starovoitov
  2020-10-27 13:14         ` Steven Rostedt
@ 2020-10-27 14:28         ` Jiri Olsa
  2020-10-28 21:13           ` Alexei Starovoitov
  1 sibling, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-27 14:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, netdev, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > On Thu, 22 Oct 2020 16:11:54 +0200
> > Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > co-exist together - ebpf trampolines need that functionality of accessing
> > > parameters of a function as if it was called directly and at the same
> > > point we need to be able attach to any function and to as many functions
> > > as we want in a fast way
> > 
> > I was sold that bpf needed a quick and fast way to get the arguments of a
> > function, as the only way to do that with ftrace is to save all registers,
> > which, I was told was too much overhead, as if you only care about
> > arguments, there's much less that is needed to save.
> > 
> > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > that for certain cases, bpf wanted a faster way to access arguments,
> > because it still worked with ftrace, but the saving of regs was too
> > strenuous.
> 
> Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> There is no other use for it.
> 
> Jiri,
> could you please redo your benchmarking hardcoding ftrace_managed=false ?
> If going through register_ftrace_direct() is indeed so much slower
> than arch_text_poke() then something gotta give.
> Either register_ftrace_direct() has to become faster or users
> have to give up on co-existing of bpf and ftrace.
> So far not a single user cared about using trampoline and ftrace together.
> So the latter is certainly an option.

I tried that, and IIRC it was not much faster, but I don't have details
on that... but it should be a quick check, I'll do it

anyway, later I realized that for us we need ftrace to stay, so I abandoned
this idea ;-) and started to look at how to keep them both together and
just make it faster

also, currently bpf trampolines will not work without ftrace being
enabled, because ftrace does the preparation work during compile,
replacing all the fentry calls with nop instructions, and the
replace code depends on those nops... so if we go this way, we would
need to make this preparation code generic

> 
> Regardless, the patch 7 (rbtree of kallsyms) is probably good on its own.
> Can you benchmark it independently and maybe resubmit if it's useful
> without other patches?

yes, I'll submit that in a separate patch

thanks,
jirka



* Re: [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support
  2020-10-26 23:15       ` Andrii Nakryiko
@ 2020-10-27 19:03         ` Jiri Olsa
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-27 19:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Mon, Oct 26, 2020 at 04:15:48PM -0700, Andrii Nakryiko wrote:
> On Sun, Oct 25, 2020 at 12:12 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Fri, Oct 23, 2020 at 01:09:26PM -0700, Andrii Nakryiko wrote:
> > > On Thu, Oct 22, 2020 at 2:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Adding trampoline batch attach support so it's possible to use
> > > > batch mode to load tracing programs.
> > > >
> > > > Adding trampoline_attach_batch bool to struct bpf_object_open_opts.
> > > > When set to true the bpf_object__attach_skeleton will try to load
> > > > all tracing programs via batch mode.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > >
> > > Assuming we go with the current kernel API for batch-attach, why can't
> > > libbpf just detect kernel support for it and just use it always,
> > > without requiring users to opt into anything?
> >
> > yea, it's rfc ;-) I wanted some simple usage of the
> > interface so it's obvious how it works
> >
> > if we'll end up with some batch interface I agree
> > we should use it as you suggested
> >
> > >
> > > But I'm also confused a bit how this is supposed to be used with BPF
> > > skeleton. You use case described in a cover letter (bpftrace glob
> > > attach, right?) would have a single BPF program attached to many
> > > different functions. While here you are trying to collect different
> > > programs and attach each one to its respective kernel function. Do you
> > > expect users to have hundreds of BPF programs in their skeletons? If
> > > not, I don't really see why adding this complexity. What am I missing?
> >
> > AFAIU when you use trampoline program you declare the attach point
> > at the load time, so you actually can't use same program for different
> > kernel functions - which would be great speed up actually, because
> > that's where the rest of the cycles in bpftrace is spent (in that cover
> > letter example) - load/verifier check of all those programs
> 
> Ah, I see, you are right. And yes, I agree, it would be nice to not
> have to clone the BPF program many times to attach to fentry/fexit, if
> the program itself doesn't really change.
> 
> >
> > it's different for kprobe where you hook single kprobe via multiple
> > kprobe perf events to different kernel function
> >
> > >
> > > Now it also seems weird to me for the kernel API to allow attaching
> > > many-to-many BPF programs-to-attach points. One BPF program-to-many
> > > attach points seems like a more sane and common requirement, no?
> >
> > right, but that's the consequence of what I wrote above
> 
> Well, maybe we should get rid of that limitation first ;)
>

I see this as 2 different things.. even when this would be
possible - attach single program to multiple trampolines,
you still need to install those trampolines via ftrace and
you'll end up in this discussion ;-)
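
to make the cloning cost discussed above concrete, a hedged userspace sketch
of what a bpftrace-style tool ends up doing today: the fentry target is fixed
before load, so attaching the "same" program to N functions means N
open/load/verify cycles, whereas a kprobe program is loaded once and attached
N times (program name and target list below are made up):

#include <bpf/libbpf.h>

static const char *targets[] = { "__x64_sys_sync", "__x64_sys_sendmsg" };

static int attach_fentry_per_target(const char *obj_path)
{
	unsigned int i;

	for (i = 0; i < sizeof(targets) / sizeof(targets[0]); i++) {
		struct bpf_object *obj = bpf_object__open_file(obj_path, NULL);
		struct bpf_program *prog;
		struct bpf_link *link;

		if (libbpf_get_error(obj))
			return -1;
		prog = bpf_object__find_program_by_name(obj, "fentry_prog");
		if (!prog)
			return -1;
		/* the target is baked into the program before load */
		if (bpf_program__set_attach_target(prog, 0, targets[i]))
			return -1;
		if (bpf_object__load(obj))		/* full verifier pass per target */
			return -1;
		link = bpf_program__attach_trace(prog);	/* per-ip trampoline attach */
		if (libbpf_get_error(link))
			return -1;
	}
	return 0;
}

/* a kprobe program, by contrast, is loaded once and then attached in a loop
 * with bpf_program__attach_kprobe(prog, false, targets[i]); */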

jirka


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-22  8:21 ` [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search Jiri Olsa
@ 2020-10-28 18:25   ` Jiri Olsa
  2020-10-28 21:15     ` Alexei Starovoitov
  2020-10-28 22:40     ` Andrii Nakryiko
  0 siblings, 2 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-28 18:25 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, netdev,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Steven Rostedt, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> The kallsyms_expand_symbol function showed in several bpf related
> profiles, because it's doing linear search.
> 
> Before:
> 
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
>    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> 
>      2,535,458,767      cycles:k                         ( +-  0.55% )
>        940,046,382      cycles:u                         ( +-  0.27% )
> 
>              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> 
> Load all the vmlinux symbols into an rbtree and switch to rbtree search
> in the kallsyms_lookup_name function to save a few cycles and time.
> 
> After:
> 
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
>    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> 
>      2,199,433,771      cycles:k                         ( +-  0.55% )
>        936,105,469      cycles:u                         ( +-  0.37% )
> 
>              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> 
> Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> used for 115285 symbols.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>

FYI there's init_kprobes dependency on kallsyms_lookup_name in early
init call, so this won't work as it is :-\ will address this in v2

also I'll switch to sorted array and bsearch, because kallsyms is not
dynamically updated
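
a minimal sketch of that sorted-array variant, assuming a symbol struct like
the one in the patch below minus the rb_node, and the kernel's sort()/bsearch()
helpers (names here are illustrative, not the v2 code):

#include <linux/sort.h>
#include <linux/bsearch.h>
#include <linux/string.h>
#include <linux/kallsyms.h>

struct sym_entry {
	char		name[KSYM_NAME_LEN];
	unsigned long	addr;
};

static struct sym_entry *sym_array;	/* filled from kallsyms once at init */
static unsigned int sym_count;

static int sym_cmp(const void *a, const void *b)
{
	return strcmp(((const struct sym_entry *)a)->name,
		      ((const struct sym_entry *)b)->name);
}

static void __init sym_array_sort(void)
{
	/* one-time sort; the builtin symbol set never changes afterwards */
	sort(sym_array, sym_count, sizeof(*sym_array), sym_cmp, NULL);
}

static unsigned long sym_array_lookup(const char *name)
{
	struct sym_entry key, *found;

	strscpy(key.name, name, sizeof(key.name));
	found = bsearch(&key, sym_array, sym_count, sizeof(*sym_array), sym_cmp);
	return found ? found->addr : 0;
}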

jirka

> ---
>  kernel/kallsyms.c | 95 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 86 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index 4fb15fa96734..107c8284170e 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -50,6 +50,36 @@ extern const u16 kallsyms_token_index[] __weak;
>  
>  extern const unsigned int kallsyms_markers[] __weak;
>  
> +static struct kmem_cache *symbol_cachep;
> +
> +struct symbol {
> +	char		name[KSYM_NAME_LEN];
> +	unsigned long	addr;
> +	struct rb_node	rb_node;
> +};
> +
> +static struct rb_root symbols_root = RB_ROOT;
> +
> +static struct symbol *find_symbol(const char *name)
> +{
> +	struct symbol *sym;
> +	struct rb_node *n;
> +	int err;
> +
> +	n = symbols_root.rb_node;
> +	while (n) {
> +		sym = rb_entry(n, struct symbol, rb_node);
> +		err = strcmp(name, sym->name);
> +		if (err < 0)
> +			n = n->rb_left;
> +		else if (err > 0)
> +			n = n->rb_right;
> +		else
> +			return sym;
> +	}
> +	return NULL;
> +}
> +
>  /*
>   * Expand a compressed symbol data into the resulting uncompressed string,
>   * if uncompressed string is too long (>= maxlen), it will be truncated,
> @@ -164,16 +194,12 @@ static unsigned long kallsyms_sym_address(int idx)
>  /* Lookup the address for this symbol. Returns 0 if not found. */
>  unsigned long kallsyms_lookup_name(const char *name)
>  {
> -	char namebuf[KSYM_NAME_LEN];
> -	unsigned long i;
> -	unsigned int off;
> +	struct symbol *sym;
>  
> -	for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
> -		off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
> +	sym = find_symbol(name);
> +	if (sym)
> +		return sym->addr;
>  
> -		if (strcmp(namebuf, name) == 0)
> -			return kallsyms_sym_address(i);
> -	}
>  	return module_kallsyms_lookup_name(name);
>  }
>  
> @@ -743,9 +769,60 @@ static const struct proc_ops kallsyms_proc_ops = {
>  	.proc_release	= seq_release_private,
>  };
>  
> +static bool __init add_symbol(struct symbol *new)
> +{
> +	struct rb_node *parent = NULL;
> +	struct rb_node **p;
> +	struct symbol *sym;
> +	int err;
> +
> +	p = &symbols_root.rb_node;
> +
> +	while (*p != NULL) {
> +		parent = *p;
> +		sym = rb_entry(parent, struct symbol, rb_node);
> +		err = strcmp(new->name, sym->name);
> +		if (err < 0)
> +			p = &(*p)->rb_left;
> +		else if (err > 0)
> +			p = &(*p)->rb_right;
> +		else
> +			return false;
> +	}
> +
> +	rb_link_node(&new->rb_node, parent, p);
> +	rb_insert_color(&new->rb_node, &symbols_root);
> +	return true;
> +}
> +
> +static int __init kallsyms_name_search_init(void)
> +{
> +	bool sym_added = true;
> +	struct symbol *sym;
> +	unsigned int off;
> +	unsigned long i;
> +
> +	symbol_cachep = KMEM_CACHE(symbol, SLAB_PANIC|SLAB_ACCOUNT);
> +
> +	for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
> +		if (sym_added) {
> +			sym = kmem_cache_alloc(symbol_cachep, GFP_KERNEL);
> +			if (!sym)
> +				return -ENOMEM;
> +		}
> +		off = kallsyms_expand_symbol(off, sym->name, ARRAY_SIZE(sym->name));
> +		sym->addr = kallsyms_sym_address(i);
> +		sym_added = add_symbol(sym);
> +	}
> +
> +	if (!sym_added)
> +		kmem_cache_free(symbol_cachep, sym);
> +	return 0;
> +}
> +
>  static int __init kallsyms_init(void)
>  {
>  	proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
> -	return 0;
> +	return kallsyms_name_search_init();
>  }
>  device_initcall(kallsyms_init);
> -- 
> 2.26.2
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-27 14:28         ` Jiri Olsa
@ 2020-10-28 21:13           ` Alexei Starovoitov
  2020-10-29 11:09             ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2020-10-28 21:13 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, netdev, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Tue, Oct 27, 2020 at 03:28:03PM +0100, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > > On Thu, 22 Oct 2020 16:11:54 +0200
> > > Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > > co-exist together - ebpf trampolines need that functionality of accessing
> > > > parameters of a function as if it was called directly and at the same
> > > > point we need to be able attach to any function and to as many functions
> > > > as we want in a fast way
> > > 
> > > I was sold that bpf needed a quick and fast way to get the arguments of a
> > > function, as the only way to do that with ftrace is to save all registers,
> > > which, I was told was too much overhead, as if you only care about
> > > arguments, there's much less that is needed to save.
> > > 
> > > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > > that for certain cases, bpf wanted a faster way to access arguments,
> > > because it still worked with ftrace, but the saving of regs was too
> > > strenuous.
> > 
> > Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> > There is no other use for it.
> > 
> > Jiri,
> > could you please redo your benchmarking hardcoding ftrace_managed=false ?
> > If going through register_ftrace_direct() is indeed so much slower
> > than arch_text_poke() then something gotta give.
> > Either register_ftrace_direct() has to become faster or users
> > have to give up on co-existing of bpf and ftrace.
> > So far not a single user cared about using trampoline and ftrace together.
> > So the latter is certainly an option.
> 
> I tried that, and IIRC it was not much faster, but I don't have details
> on that.. but it should be quick check, I'll do it
> 
> anyway later I realized that for us we need ftrace to stay, so I abandoned
> this idea ;-) and started to check on how to keep them both together and
> just make it faster
> 
> also currently bpf trampolines will not work without ftrace being
> enabled, because ftrace is doing the preparation work during compile,
> and replaces all the fentry calls with nop instructions and the
> replace code depends on those nops...  so if we go this way, we would
> need to make this preparation code generic

I didn't mean that part.
I was talking about register_ftrace_direct() only.
Could you please still do ftrace_managed=false experiment?
Sounds like the time to attach/detach will stay the same?
If so, then don't touch ftrace internals then. What's the point?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-28 18:25   ` Jiri Olsa
@ 2020-10-28 21:15     ` Alexei Starovoitov
  2020-10-29  9:29       ` Jiri Olsa
  2020-10-28 22:40     ` Andrii Nakryiko
  1 sibling, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2020-10-28 21:15 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Wed, Oct 28, 2020 at 07:25:34PM +0100, Jiri Olsa wrote:
> On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> > The kallsyms_expand_symbol function showed in several bpf related
> > profiles, because it's doing linear search.
> > 
> > Before:
> > 
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > 
> >      2,535,458,767      cycles:k                         ( +-  0.55% )
> >        940,046,382      cycles:u                         ( +-  0.27% )
> > 
> >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> > 
> > Load all the vmlinux symbols into an rbtree and switch to rbtree search
> > in the kallsyms_lookup_name function to save a few cycles and time.
> > 
> > After:
> > 
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > 
> >      2,199,433,771      cycles:k                         ( +-  0.55% )
> >        936,105,469      cycles:u                         ( +-  0.37% )
> > 
> >              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> > 
> > Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> > used for 115285 symbols.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> 
> FYI there's init_kprobes dependency on kallsyms_lookup_name in early
> init call, so this won't work as it is :-\ will address this in v2
> 
> also I'll switch to sorted array and bsearch, because kallsyms is not
> dynamically updated

wait wat? kallsyms are dynamically updated. bpf adds and removes from it.
You even worked on some of those patches :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-28 18:25   ` Jiri Olsa
  2020-10-28 21:15     ` Alexei Starovoitov
@ 2020-10-28 22:40     ` Andrii Nakryiko
  2020-10-29  9:33       ` Jiri Olsa
  1 sibling, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-28 22:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Wed, Oct 28, 2020 at 3:29 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> > The kallsyms_expand_symbol function showed in several bpf related
> > profiles, because it's doing linear search.
> >
> > Before:
> >
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> >
> >      2,535,458,767      cycles:k                         ( +-  0.55% )
> >        940,046,382      cycles:u                         ( +-  0.27% )
> >
> >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> >
> > Load all the vmlinux symbols into an rbtree and switch to rbtree search
> > in the kallsyms_lookup_name function to save a few cycles and time.
> >
> > After:
> >
> >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> >
> >      2,199,433,771      cycles:k                         ( +-  0.55% )
> >        936,105,469      cycles:u                         ( +-  0.37% )
> >
> >              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> >
> > Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> > used for 115285 symbols.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
>
> FYI there's init_kprobes dependency on kallsyms_lookup_name in early
> init call, so this won't work as it is :-\ will address this in v2
>
> also I'll switch to sorted array and bsearch, because kallsyms is not
> dynamically updated

what about kernel modules then?

>
> jirka
>
> > ---
> >  kernel/kallsyms.c | 95 ++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 86 insertions(+), 9 deletions(-)
> >

[...]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-28 21:15     ` Alexei Starovoitov
@ 2020-10-29  9:29       ` Jiri Olsa
  2020-10-29 22:45         ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2020-10-29  9:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	netdev, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Wed, Oct 28, 2020 at 02:15:02PM -0700, Alexei Starovoitov wrote:
> On Wed, Oct 28, 2020 at 07:25:34PM +0100, Jiri Olsa wrote:
> > On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> > > The kallsyms_expand_symbol function showed in several bpf related
> > > profiles, because it's doing linear search.
> > > 
> > > Before:
> > > 
> > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > > 
> > >      2,535,458,767      cycles:k                         ( +-  0.55% )
> > >        940,046,382      cycles:u                         ( +-  0.27% )
> > > 
> > >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> > > 
> > > Load all the vmlinux symbols into an rbtree and switch to rbtree search
> > > in the kallsyms_lookup_name function to save a few cycles and time.
> > > 
> > > After:
> > > 
> > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > > 
> > >      2,199,433,771      cycles:k                         ( +-  0.55% )
> > >        936,105,469      cycles:u                         ( +-  0.37% )
> > > 
> > >              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> > > 
> > > Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> > > used for 115285 symbols.
> > > 
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > 
> > FYI there's init_kprobes dependency on kallsyms_lookup_name in early
> > init call, so this won't work as it is :-\ will address this in v2
> > 
> > also I'll switch to sorted array and bsearch, because kallsyms is not
> > dynamically updated
> 
> wait wat? kallsyms are dynamically updated. bpf adds and removes from it.
> You even worked on some of those patches :)

yes, it's tricky ;-) kallsyms_lookup_name function goes through builtin
(compiled in) symbols and "standard modules" symbols

we add bpf symbols as "pseudo module" symbol, which is not covered by
this function search, it is covered when displaying /proc/kallsyms
(check get_ksymbol_bpf function), same for ftrace and kprobe symbols

AFAICS we use kallsyms_lookup_name only to search builtin kernel symbols,
so we don't care it does not cover "pseudo modules"

now.. what's even more funny, is that if I switch to sort/bsearch,
performance is back on the same numbers as the current code :-\

jirka


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-28 22:40     ` Andrii Nakryiko
@ 2020-10-29  9:33       ` Jiri Olsa
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-29  9:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Daniel Xu, Steven Rostedt,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Wed, Oct 28, 2020 at 03:40:46PM -0700, Andrii Nakryiko wrote:
> On Wed, Oct 28, 2020 at 3:29 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> > > The kallsyms_expand_symbol function showed in several bpf related
> > > profiles, because it's doing linear search.
> > >
> > > Before:
> > >
> > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > >
> > >      2,535,458,767      cycles:k                         ( +-  0.55% )
> > >        940,046,382      cycles:u                         ( +-  0.27% )
> > >
> > >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> > >
> > > Load all the vmlinux symbols into an rbtree and switch to rbtree search
> > > in the kallsyms_lookup_name function to save a few cycles and time.
> > >
> > > After:
> > >
> > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > >
> > >      2,199,433,771      cycles:k                         ( +-  0.55% )
> > >        936,105,469      cycles:u                         ( +-  0.37% )
> > >
> > >              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> > >
> > > Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> > > used for 115285 symbols.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> >
> > FYI there's init_kprobes dependency on kallsyms_lookup_name in early
> > init call, so this won't work as it is :-\ will address this in v2
> >
> > also I'll switch to sorted array and bsearch, because kallsyms is not
> > dynamically updated
> 
> what about kernel modules then?

please check my answer to Alexei, I just answered it there

thanks,
jirka

> 
> >
> > jirka
> >
> > > ---
> > >  kernel/kallsyms.c | 95 ++++++++++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 86 insertions(+), 9 deletions(-)
> > >
> 
> [...]
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 00/16] bpf: Speed up trampoline attach
  2020-10-28 21:13           ` Alexei Starovoitov
@ 2020-10-29 11:09             ` Jiri Olsa
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2020-10-29 11:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, netdev, bpf, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Daniel Xu,
	Jesper Brouer, Toke Høiland-Jørgensen, Viktor Malik

On Wed, Oct 28, 2020 at 02:13:25PM -0700, Alexei Starovoitov wrote:
> On Tue, Oct 27, 2020 at 03:28:03PM +0100, Jiri Olsa wrote:
> > On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > > > On Thu, 22 Oct 2020 16:11:54 +0200
> > > > Jiri Olsa <jolsa@redhat.com> wrote:
> > > > 
> > > > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > > > co-exist together - ebpf trampolines need that functionality of accessing
> > > > > parameters of a function as if it was called directly and at the same
> > > > > point we need to be able attach to any function and to as many functions
> > > > > as we want in a fast way
> > > > 
> > > > I was sold that bpf needed a quick and fast way to get the arguments of a
> > > > function, as the only way to do that with ftrace is to save all registers,
> > > > which, I was told was too much overhead, as if you only care about
> > > > arguments, there's much less that is needed to save.
> > > > 
> > > > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > > > that for certain cases, bpf wanted a faster way to access arguments,
> > > > because it still worked with ftrace, but the saving of regs was too
> > > > strenuous.
> > > 
> > > Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> > > There is no other use for it.
> > > 
> > > Jiri,
> > > could you please redo your benchmarking hardcoding ftrace_managed=false ?
> > > If going through register_ftrace_direct() is indeed so much slower
> > > than arch_text_poke() then something gotta give.
> > > Either register_ftrace_direct() has to become faster or users
> > > have to give up on co-existing of bpf and ftrace.
> > > So far not a single user cared about using trampoline and ftrace together.
> > > So the latter is certainly an option.
> > 
> > I tried that, and IIRC it was not much faster, but I don't have details
> > on that.. but it should be quick check, I'll do it
> > 
> > anyway later I realized that for us we need ftrace to stay, so I abandoned
> > this idea ;-) and started to check on how to keep them both together and
> > just make it faster
> > 
> > also currently bpf trampolines will not work without ftrace being
> > enabled, because ftrace is doing the preparation work during compile,
> > and replaces all the fentry calls with nop instructions and the
> > replace code depends on those nops...  so if we go this way, we would
> > need to make this preparation code generic
> 
> I didn't mean that part.
> I was talking about register_ftrace_direct() only.
> Could you please still do ftrace_managed=false experiment?
> Sounds like the time to attach/detach will stay the same?
> If so, then don't touch ftrace internals then. What's the point?

actually, there's some speedup.. by running:

  # perf stat --table -e cycles:k,cycles:u -r 10 ./src/bpftrace -ve 'kfunc:__x64_sys_s* { } i:ms:10 { print("exit\n"); exit();}'

I've got following numbers on base:

     3,463,157,566      cycles:k                                                      ( +-  0.14% )
     1,164,026,270      cycles:u                                                      ( +-  0.29% )

             # Table of individual measurements:
             37.61 (-12.20) #######
             49.35 (-0.46) #
             54.03 (+4.22) ##
             50.82 (+1.01) #
             46.87 (-2.94) ##
             53.10 (+3.29) ##
             58.27 (+8.46) ###
             64.85 (+15.04) #####
             47.37 (-2.44) ##
             35.83 (-13.98) ########

             # Final result:
             49.81 +- 2.76 seconds time elapsed  ( +-  5.54% )


and following numbers with the patch below:

     2,037,364,413      cycles:k        ( +-  0.52% )
     1,164,769,939      cycles:u        ( +-  0.19% )

             # Table of individual measurements:
             30.52 (-8.54) ######
             43.43 (+4.37) ###
             43.72 (+4.66) ###
             35.70 (-3.36) ##
             40.70 (+1.63) #
             43.51 (+4.44) ###
             26.44 (-12.62) ##########
             40.21 (+1.14) #
             43.32 (+4.25) ##
             43.09 (+4.03) ##

             # Final result:
             39.06 +- 1.95 seconds time elapsed  ( +-  4.99% )


it looks like even ftrace_managed=false could be faster
with batch update, which is not used, but there's support
for it via text_poke_bp_batch function

jirka


---
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 35c5887d82ff..0a241e6eac7d 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -111,6 +111,8 @@ static int is_ftrace_location(void *ip)
 {
 	long addr;
 
+	return 0;
+
 	addr = ftrace_location((long)ip);
 	if (!addr)
 		return 0;
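
for reference, a hedged sketch of what a batched non-ftrace path could look
like on x86, queueing all pokes and letting text_poke_finish() flush them
through text_poke_bp_batch() in one go (helper name and call-site handling
below are assumptions, not code from this series):

#include <linux/memory.h>
#include <linux/mutex.h>
#include <linux/types.h>
#include <asm/text-patching.h>

static void poke_all_sites(void **ips, const u8 *insns, int cnt)
{
	int i;

	mutex_lock(&text_mutex);
	for (i = 0; i < cnt; i++)
		/* queue a 5-byte rewrite (e.g. nop -> call trampoline) per site */
		text_poke_queue(ips[i], insns + i * 5, 5, NULL);
	/* one batched int3 patching cycle instead of cnt text_poke_bp() calls */
	text_poke_finish();
	mutex_unlock(&text_mutex);
}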


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search
  2020-10-29  9:29       ` Jiri Olsa
@ 2020-10-29 22:45         ` Andrii Nakryiko
  0 siblings, 0 replies; 47+ messages in thread
From: Andrii Nakryiko @ 2020-10-29 22:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Networking, bpf,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Daniel Xu, Steven Rostedt, Jesper Brouer,
	Toke Høiland-Jørgensen, Viktor Malik

On Thu, Oct 29, 2020 at 2:31 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Wed, Oct 28, 2020 at 02:15:02PM -0700, Alexei Starovoitov wrote:
> > On Wed, Oct 28, 2020 at 07:25:34PM +0100, Jiri Olsa wrote:
> > > On Thu, Oct 22, 2020 at 10:21:29AM +0200, Jiri Olsa wrote:
> > > > The kallsyms_expand_symbol function showed in several bpf related
> > > > profiles, because it's doing linear search.
> > > >
> > > > Before:
> > > >
> > > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > > >
> > > >      2,535,458,767      cycles:k                         ( +-  0.55% )
> > > >        940,046,382      cycles:u                         ( +-  0.27% )
> > > >
> > > >              33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )
> > > >
> > > > Load all the vmlinux symbols into an rbtree and switch to rbtree search
> > > > in the kallsyms_lookup_name function to save a few cycles and time.
> > > >
> > > > After:
> > > >
> > > >  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s* \
> > > >    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
> > > >
> > > >      2,199,433,771      cycles:k                         ( +-  0.55% )
> > > >        936,105,469      cycles:u                         ( +-  0.37% )
> > > >
> > > >              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
> > > >
> > > > Each symbol takes 160 bytes, so for my .config I've got about 18 MBs
> > > > used for 115285 symbols.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > >
> > > FYI there's init_kprobes dependency on kallsyms_lookup_name in early
> > > init call, so this won't work as it is :-\ will address this in v2
> > >
> > > also I'll switch to sorted array and bsearch, because kallsyms is not
> > > dynamically updated
> >
> > wait wat? kallsyms are dynamically updated. bpf adds and removes from it.
> > You even worked on some of those patches :)
>
> yes, it's tricky ;-) kallsyms_lookup_name function goes through builtin
> (compiled in) symbols and "standard modules" symbols
>
> we add bpf symbols as "pseudo module" symbol, which is not covered by
> this function search, it is covered when displaying /proc/kallsyms
> (check get_ksymbol_bpf function), same for ftrace and kprobe symbols
>
> AFAICS we use kallsyms_lookup_name only to search builtin kernel symbols,
> so we don't care it does not cover "pseudo modules"
>
> now.. what's even more funny, is that if I switch to sort/bsearch,
> performance is back on the same numbers as the current code :-\

If you do hashmap instead of RB tree or sort+bsearch, it will beat
both (assuming you have an adequate number of hash buckets, of
course).
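
A hedged sketch of that, using the plain kernel hashtable helpers; the bucket
count and the jhash-based key below are illustrative guesses, not measured
choices:

#include <linux/hashtable.h>
#include <linux/jhash.h>
#include <linux/kallsyms.h>
#include <linux/string.h>

struct ksym {
	char		name[KSYM_NAME_LEN];
	unsigned long	addr;
	struct hlist_node hnode;
};

static DEFINE_HASHTABLE(ksym_hash, 14);	/* 16k buckets for ~115k symbols */

static u32 ksym_key(const char *name)
{
	return jhash(name, strlen(name), 0);
}

static void ksym_add(struct ksym *sym)
{
	hash_add(ksym_hash, &sym->hnode, ksym_key(sym->name));
}

static unsigned long ksym_lookup(const char *name)
{
	struct ksym *sym;

	hash_for_each_possible(ksym_hash, sym, hnode, ksym_key(name)) {
		if (!strcmp(sym->name, name))
			return sym->addr;
	}
	return 0;
}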

>
> jirka
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2020-10-29 22:46 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-22  8:21 [RFC bpf-next 00/16] bpf: Speed up trampoline attach Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 01/16] ftrace: Add check_direct_entry function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 02/16] ftrace: Add adjust_direct_size function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 03/16] ftrace: Add get/put_direct_func function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 04/16] ftrace: Add ftrace_set_filter_ips function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 05/16] ftrace: Add register_ftrace_direct_ips function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 06/16] ftrace: Add unregister_ftrace_direct_ips function Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 07/16] kallsyms: Use rb tree for kallsyms name search Jiri Olsa
2020-10-28 18:25   ` Jiri Olsa
2020-10-28 21:15     ` Alexei Starovoitov
2020-10-29  9:29       ` Jiri Olsa
2020-10-29 22:45         ` Andrii Nakryiko
2020-10-28 22:40     ` Andrii Nakryiko
2020-10-29  9:33       ` Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 08/16] bpf: Use delayed link free in bpf_link_put Jiri Olsa
2020-10-23 19:46   ` Andrii Nakryiko
2020-10-25 19:02     ` Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support Jiri Olsa
2020-10-23 20:03   ` Andrii Nakryiko
2020-10-23 20:31     ` Steven Rostedt
2020-10-23 22:23       ` Andrii Nakryiko
2020-10-25 19:41         ` Jiri Olsa
2020-10-26 23:19           ` Andrii Nakryiko
2020-10-22  8:21 ` [RFC bpf-next 10/16] bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 11/16] bpf: Sync uapi bpf.h to tools Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 12/16] bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED) Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 13/16] libbpf: Add trampoline batch attach support Jiri Olsa
2020-10-23 20:09   ` Andrii Nakryiko
2020-10-25 19:11     ` Jiri Olsa
2020-10-26 23:15       ` Andrii Nakryiko
2020-10-27 19:03         ` Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 14/16] libbpf: Add trampoline batch detach support Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 15/16] selftests/bpf: Add trampoline batch test Jiri Olsa
2020-10-22  8:21 ` [RFC bpf-next 16/16] selftests/bpf: Add attach batch test (NOT TO BE MERGED) Jiri Olsa
2020-10-22 13:35 ` [RFC bpf-next 00/16] bpf: Speed up trampoline attach Steven Rostedt
2020-10-22 14:11   ` Jiri Olsa
2020-10-22 14:42     ` Steven Rostedt
2020-10-22 16:21       ` Steven Rostedt
2020-10-22 20:52         ` Steven Rostedt
2020-10-23  6:09           ` Jiri Olsa
2020-10-23 13:50             ` Steven Rostedt
2020-10-25 19:01               ` Jiri Olsa
2020-10-27  4:30       ` Alexei Starovoitov
2020-10-27 13:14         ` Steven Rostedt
2020-10-27 14:28         ` Jiri Olsa
2020-10-28 21:13           ` Alexei Starovoitov
2020-10-29 11:09             ` Jiri Olsa
