dwarves.vger.kernel.org archive mirror
* [PATCH dwarves 0/3] Parallelize BTF type info generating of pahole
@ 2022-01-24 19:18 Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 1/3] dwarf_loader: Receive per-thread data on worker threads Kui-Feng Lee
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Kui-Feng Lee @ 2022-01-24 19:18 UTC (permalink / raw)
  To: dwarves, arnaldo.melo; +Cc: ast, daniel, andrii, Kui-Feng Lee

Creating a BTF instance for each worker thread allows the steal
function provided by pahole to add type info on multiple threads
without a lock.  The main thread then merges the results of the worker
threads into the primary instance.
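The scheme can be sketched as a toy model (all names and types below are invented for illustration; this is not the series' code): each worker appends into a thread-private buffer, and the main thread merges after all workers have joined, so neither path needs a mutex.

```c
#include <assert.h>
#include <stddef.h>

/* Each worker appends into its own private buffer, so the hot path
 * needs no mutex; the main thread merges the buffers only after
 * pthread_join(), when no worker is running. */
struct worker_buf {
	int types[64];
	size_t nr_types;
};

static void worker_add(struct worker_buf *w, int type_id)
{
	w->types[w->nr_types++] = type_id;	/* buffer is private to one thread */
}

static size_t merge_into(struct worker_buf *primary,
			 struct worker_buf *workers, size_t nr_workers)
{
	for (size_t i = 0; i < nr_workers; i++)
		for (size_t j = 0; j < workers[i].nr_types; j++)
			worker_add(primary, workers[i].types[j]);
	return primary->nr_types;
}
```

The merge step is the sequential cost this cover letter mentions; the bpf-next change referenced below is what makes the real equivalent (btf__add_btf()) cheap.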

Copying data from per-thread BTF instances to the primary instance is
currently expensive.  However, a patch addressing this has landed in
the bpf-next repository. [1]  With that bpf-next patch plus this
series, the total runtime to generate BTF for Linux drops from 6.0s to
5.4s with "-j4" on my device.

V2 fixes typos and syntax according to the comments received on v1.
It also splits part 1 of v1 into parts 1 & 2.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d81283d27266
[v1] https://lore.kernel.org/dwarves/20220120010817.2803482-1-kuifeng@fb.com/

Kui-Feng Lee (3):
  dwarf_loader: Receive per-thread data on worker threads.
  dwarf_loader: Prepare and pass per-thread data to worker threads.
  pahole: Use per-thread btf instances to avoid mutex locking.

 btf_encoder.c  |   5 +++
 btf_encoder.h  |   2 +
 btf_loader.c   |   2 +-
 ctf_loader.c   |   2 +-
 dwarf_loader.c |  58 ++++++++++++++++++------
 dwarves.h      |   9 +++-
 pahole.c       | 117 ++++++++++++++++++++++++++++++++++++++++++++++---
 pdwtags.c      |   3 +-
 pfunct.c       |   4 +-
 9 files changed, 177 insertions(+), 25 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH dwarves v2 1/3] dwarf_loader: Receive per-thread data on worker threads.
  2022-01-24 19:18 [PATCH dwarves 0/3] Parallelize BTF type info generating of pahole Kui-Feng Lee
@ 2022-01-24 19:18 ` Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 2/3] dwarf_loader: Prepare and pass per-thread data to " Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking Kui-Feng Lee
  2 siblings, 0 replies; 7+ messages in thread
From: Kui-Feng Lee @ 2022-01-24 19:18 UTC (permalink / raw)
  To: dwarves, arnaldo.melo; +Cc: ast, daniel, andrii, Kui-Feng Lee

Add arguments to the steal and thread_exit callbacks of conf_load so
that they can receive per-thread data.
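The shape of the change is the classic C pattern of threading an opaque per-thread pointer through a callback; a toy sketch (all names here are invented, the real types live in dwarves.h):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of the extended callback shape: the hook now takes an
 * opaque per-thread pointer in addition to the shared config. */
struct my_conf { int verbose; };

typedef int (*steal_fn)(struct my_conf *conf, void *thr_data);

static int counting_steal(struct my_conf *conf, void *thr_data)
{
	int *nr_stolen = thr_data;	/* per-thread state, no locking needed */

	(void)conf;
	if (nr_stolen)
		(*nr_stolen)++;
	return 0;
}

static int invoke_steal(steal_fn steal, struct my_conf *conf, void *thr_data)
{
	/* Single-threaded callers simply pass NULL, as the btf and ctf
	 * loaders do in the hunks below. */
	return steal ? steal(conf, thr_data) : 0;
}
```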

Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
---
 btf_loader.c   | 2 +-
 ctf_loader.c   | 2 +-
 dwarf_loader.c | 4 ++--
 dwarves.h      | 5 +++--
 pahole.c       | 3 ++-
 pdwtags.c      | 3 ++-
 pfunct.c       | 4 +++-
 7 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/btf_loader.c b/btf_loader.c
index 7a5b16ff393e..b61cadd55127 100644
--- a/btf_loader.c
+++ b/btf_loader.c
@@ -624,7 +624,7 @@ static int cus__load_btf(struct cus *cus, struct conf_load *conf, const char *fi
 	 * The app stole this cu, possibly deleting it,
 	 * so forget about it
 	 */
-	if (conf && conf->steal && conf->steal(cu, conf))
+	if (conf && conf->steal && conf->steal(cu, conf, NULL))
 		return 0;
 
 	cus__add(cus, cu);
diff --git a/ctf_loader.c b/ctf_loader.c
index 7c34739afdce..de6d4dbfce97 100644
--- a/ctf_loader.c
+++ b/ctf_loader.c
@@ -722,7 +722,7 @@ int ctf__load_file(struct cus *cus, struct conf_load *conf,
 	 * The app stole this cu, possibly deleting it,
 	 * so forget about it
 	 */
-	if (conf && conf->steal && conf->steal(cu, conf))
+	if (conf && conf->steal && conf->steal(cu, conf, NULL))
 		return 0;
 
 	cus__add(cus, cu);
diff --git a/dwarf_loader.c b/dwarf_loader.c
index e30b03c1c541..bf9ea3765419 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -2686,7 +2686,7 @@ static int cu__finalize(struct cu *cu, struct conf_load *conf)
 {
 	cu__for_all_tags(cu, class_member__cache_byte_size, conf);
 	if (conf && conf->steal) {
-		return conf->steal(cu, conf);
+		return conf->steal(cu, conf, NULL);
 	}
 	return LSK__KEEPIT;
 }
@@ -2930,7 +2930,7 @@ static void *dwarf_cus__process_cu_thread(void *arg)
 			goto out_abort;
 	}
 
-	if (dcus->conf->thread_exit && dcus->conf->thread_exit() != 0)
+	if (dcus->conf->thread_exit && dcus->conf->thread_exit(dcus->conf, NULL) != 0)
 		goto out_abort;
 
 	return (void *)DWARF_CB_OK;
diff --git a/dwarves.h b/dwarves.h
index 52d162d67456..9a8e4de8843a 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -48,8 +48,9 @@ struct conf_fprintf;
  */
 struct conf_load {
 	enum load_steal_kind	(*steal)(struct cu *cu,
-					 struct conf_load *conf);
-	int			(*thread_exit)(void);
+					 struct conf_load *conf,
+					 void *thr_data);
+	int			(*thread_exit)(struct conf_load *conf, void *thr_data);
 	void			*cookie;
 	char			*format_path;
 	int			nr_jobs;
diff --git a/pahole.c b/pahole.c
index f3a51cb2fe74..f3eeaaca4cdf 100644
--- a/pahole.c
+++ b/pahole.c
@@ -2799,7 +2799,8 @@ out:
 static struct type_instance *header;
 
 static enum load_steal_kind pahole_stealer(struct cu *cu,
-					   struct conf_load *conf_load)
+					   struct conf_load *conf_load,
+					   void *thr_data)
 {
 	int ret = LSK__DELETE;
 
diff --git a/pdwtags.c b/pdwtags.c
index 2b5ba1bf6745..8b1d6f1c96cb 100644
--- a/pdwtags.c
+++ b/pdwtags.c
@@ -72,7 +72,8 @@ static int cu__emit_tags(struct cu *cu)
 }
 
 static enum load_steal_kind pdwtags_stealer(struct cu *cu,
-					    struct conf_load *conf_load __maybe_unused)
+					    struct conf_load *conf_load __maybe_unused,
+					    void *thr_data __maybe_unused)
 {
 	cu__emit_tags(cu);
 	return LSK__DELETE;
diff --git a/pfunct.c b/pfunct.c
index 5485622e639b..314915b774f4 100644
--- a/pfunct.c
+++ b/pfunct.c
@@ -489,7 +489,9 @@ int elf_symtabs__show(char *filenames[])
 	return EXIT_SUCCESS;
 }
 
-static enum load_steal_kind pfunct_stealer(struct cu *cu, struct conf_load *conf_load __maybe_unused)
+static enum load_steal_kind pfunct_stealer(struct cu *cu,
+					   struct conf_load *conf_load __maybe_unused,
+					   void *thr_data __maybe_unused)
 {
 
 	if (function_name) {
-- 
2.30.2



* [PATCH dwarves v2 2/3] dwarf_loader: Prepare and pass per-thread data to worker threads.
  2022-01-24 19:18 [PATCH dwarves 0/3] Parallelize BTF type info generating of pahole Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 1/3] dwarf_loader: Receive per-thread data on worker threads Kui-Feng Lee
@ 2022-01-24 19:18 ` Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking Kui-Feng Lee
  2 siblings, 0 replies; 7+ messages in thread
From: Kui-Feng Lee @ 2022-01-24 19:18 UTC (permalink / raw)
  To: dwarves, arnaldo.melo; +Cc: ast, daniel, andrii, Kui-Feng Lee

Add interfaces that allow users of dwarf_loader to prepare and pass
per-thread data to steal functions running on worker threads.
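A hypothetical consumer of the two new hooks might look like this (simplified signatures, omitting the struct conf_load * argument the real hooks take; struct my_state and both function bodies are invented): prepare allocates one state per worker up front, collect aggregates and frees them after the workers are joined.

```c
#include <assert.h>
#include <stdlib.h>

struct my_state { int nr_items; };

static int my_threads_prepare(int nr_threads, void **thr_data)
{
	struct my_state *states = calloc(nr_threads, sizeof(*states));

	if (!states)
		return -1;
	for (int i = 0; i < nr_threads; i++)
		thr_data[i] = &states[i];	/* handed to each worker's steal calls */
	return 0;
}

static int my_threads_collect(int nr_threads, void **thr_data, int error)
{
	int total = 0;

	for (int i = 0; i < nr_threads; i++)
		total += ((struct my_state *)thr_data[i])->nr_items;
	free(thr_data[0]);	/* the states were a single calloc'd array */
	return error ? error : total;	/* toy: return the aggregate so it is observable */
}
```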

Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
---
 dwarf_loader.c | 58 +++++++++++++++++++++++++++++++++++++++-----------
 dwarves.h      |  4 ++++
 2 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index bf9ea3765419..ed415d2db11f 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -2682,18 +2682,18 @@ static int class_member__cache_byte_size(struct tag *tag, struct cu *cu,
 	return 0;
 }
 
-static int cu__finalize(struct cu *cu, struct conf_load *conf)
+static int cu__finalize(struct cu *cu, struct conf_load *conf, void *thr_data)
 {
 	cu__for_all_tags(cu, class_member__cache_byte_size, conf);
 	if (conf && conf->steal) {
-		return conf->steal(cu, conf, NULL);
+		return conf->steal(cu, conf, thr_data);
 	}
 	return LSK__KEEPIT;
 }
 
-static int cus__finalize(struct cus *cus, struct cu *cu, struct conf_load *conf)
+static int cus__finalize(struct cus *cus, struct cu *cu, struct conf_load *conf, void *thr_data)
 {
-	int lsk = cu__finalize(cu, conf);
+	int lsk = cu__finalize(cu, conf, thr_data);
 	switch (lsk) {
 	case LSK__DELETE:
 		cu__delete(cu);
@@ -2862,7 +2862,13 @@ struct dwarf_cus {
 	struct dwarf_cu	    *type_dcu;
 };
 
-static int dwarf_cus__create_and_process_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die, uint8_t pointer_size)
+struct dwarf_thread {
+	struct dwarf_cus	*dcus;
+	void			*data;
+};
+
+static int dwarf_cus__create_and_process_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die,
+					    uint8_t pointer_size, void *thr_data)
 {
 	/*
 	 * DW_AT_name in DW_TAG_compile_unit can be NULL, first seen in:
@@ -2884,7 +2890,7 @@ static int dwarf_cus__create_and_process_cu(struct dwarf_cus *dcus, Dwarf_Die *c
 	cu->dfops = &dwarf__ops;
 
 	if (die__process_and_recode(cu_die, cu, dcus->conf) != 0 ||
-	    cus__finalize(dcus->cus, cu, dcus->conf) == LSK__STOP_LOADING)
+	    cus__finalize(dcus->cus, cu, dcus->conf, thr_data) == LSK__STOP_LOADING)
 		return DWARF_CB_ABORT;
 
        return DWARF_CB_OK;
@@ -2918,7 +2924,8 @@ out_unlock:
 
 static void *dwarf_cus__process_cu_thread(void *arg)
 {
-	struct dwarf_cus *dcus = arg;
+	struct dwarf_thread *dthr = arg;
+	struct dwarf_cus *dcus = dthr->dcus;
 	uint8_t pointer_size, offset_size;
 	Dwarf_Die die_mem, *cu_die;
 
@@ -2926,11 +2933,13 @@ static void *dwarf_cus__process_cu_thread(void *arg)
 		if (cu_die == NULL)
 			break;
 
-		if (dwarf_cus__create_and_process_cu(dcus, cu_die, pointer_size) == DWARF_CB_ABORT)
+		if (dwarf_cus__create_and_process_cu(dcus, cu_die,
+						     pointer_size, dthr->data) == DWARF_CB_ABORT)
 			goto out_abort;
 	}
 
-	if (dcus->conf->thread_exit && dcus->conf->thread_exit(dcus->conf, NULL) != 0)
+	if (dcus->conf->thread_exit &&
+	    dcus->conf->thread_exit(dcus->conf, dthr->data) != 0)
 		goto out_abort;
 
 	return (void *)DWARF_CB_OK;
@@ -2941,10 +2950,25 @@ out_abort:
 static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
 {
 	pthread_t threads[dcus->conf->nr_jobs];
+	struct dwarf_thread dthr[dcus->conf->nr_jobs];
+	void *thread_data[dcus->conf->nr_jobs];
+	int res;
 	int i;
 
+	if (dcus->conf->threads_prepare) {
+		res = dcus->conf->threads_prepare(dcus->conf, dcus->conf->nr_jobs, thread_data);
+		if (res != 0)
+			return res;
+	} else
+		memset(thread_data, 0, sizeof(void *) * dcus->conf->nr_jobs);
+
 	for (i = 0; i < dcus->conf->nr_jobs; ++i) {
-		dcus->error = pthread_create(&threads[i], NULL, dwarf_cus__process_cu_thread, dcus);
+		dthr[i].dcus = dcus;
+		dthr[i].data = thread_data[i];
+
+		dcus->error = pthread_create(&threads[i], NULL,
+					     dwarf_cus__process_cu_thread,
+					     &dthr[i]);
 		if (dcus->error)
 			goto out_join;
 	}
@@ -2960,6 +2984,13 @@ out_join:
 			dcus->error = (long)res;
 	}
 
+	if (dcus->conf->threads_collect) {
+		res = dcus->conf->threads_collect(dcus->conf, dcus->conf->nr_jobs,
+						  thread_data, dcus->error);
+		if (dcus->error == 0)
+			dcus->error = res;
+	}
+
 	return dcus->error;
 }
 
@@ -2976,7 +3007,8 @@ static int __dwarf_cus__process_cus(struct dwarf_cus *dcus)
 		if (cu_die == NULL)
 			break;
 
-		if (dwarf_cus__create_and_process_cu(dcus, cu_die, pointer_size) == DWARF_CB_ABORT)
+		if (dwarf_cus__create_and_process_cu(dcus, cu_die,
+						     pointer_size, NULL) == DWARF_CB_ABORT)
 			return DWARF_CB_ABORT;
 
 		dcus->off = noff;
@@ -3070,7 +3102,7 @@ static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
 	if (cu__resolve_func_ret_types(cu) != LSK__KEEPIT)
 		goto out_abort;
 
-	if (cus__finalize(cus, cu, conf) == LSK__STOP_LOADING)
+	if (cus__finalize(cus, cu, conf, NULL) == LSK__STOP_LOADING)
 		goto out_abort;
 
 	return 0;
@@ -3102,7 +3134,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 	}
 
 	if (type_cu != NULL) {
-		type_lsk = cu__finalize(type_cu, conf);
+		type_lsk = cu__finalize(type_cu, conf, NULL);
 		if (type_lsk == LSK__KEEPIT) {
 			cus__add(cus, type_cu);
 		}
diff --git a/dwarves.h b/dwarves.h
index 9a8e4de8843a..de152f9b64cf 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -71,6 +71,10 @@ struct conf_load {
 	const char		*kabi_prefix;
 	struct btf		*base_btf;
 	struct conf_fprintf	*conf_fprintf;
+	int			(*threads_prepare)(struct conf_load *conf, int nr_threads,
+						   void **thr_data);
+	int			(*threads_collect)(struct conf_load *conf, int nr_threads,
+						   void **thr_data, int error);
 };
 
 /** struct conf_fprintf - hints to the __fprintf routines
-- 
2.30.2



* [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking.
  2022-01-24 19:18 [PATCH dwarves 0/3] Parallelize BTF type info generating of pahole Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 1/3] dwarf_loader: Receive per-thread data on worker threads Kui-Feng Lee
  2022-01-24 19:18 ` [PATCH dwarves v2 2/3] dwarf_loader: Prepare and pass per-thread data to " Kui-Feng Lee
@ 2022-01-24 19:18 ` Kui-Feng Lee
  2022-01-24 20:13   ` Andrii Nakryiko
  2 siblings, 1 reply; 7+ messages in thread
From: Kui-Feng Lee @ 2022-01-24 19:18 UTC (permalink / raw)
  To: dwarves, arnaldo.melo; +Cc: ast, daniel, andrii, Kui-Feng Lee

Create a BTF instance for each worker thread, and add type info to the
local BTF instance in pahole's steal function without acquiring a
mutex.  Once all worker threads have finished, merge all per-thread
BTF instances into the primary BTF instance.

Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
---
 btf_encoder.c |   5 +++
 btf_encoder.h |   2 +
 pahole.c      | 114 +++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index 9d015f304e92..56a76f5d7275 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -1529,3 +1529,8 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu)
 out:
 	return err;
 }
+
+struct btf *btf_encoder__btf(struct btf_encoder *encoder)
+{
+	return encoder->btf;
+}
diff --git a/btf_encoder.h b/btf_encoder.h
index f133b0d7674d..0f0eee84df74 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -29,4 +29,6 @@ struct btf_encoder *btf_encoders__first(struct list_head *encoders);
 
 struct btf_encoder *btf_encoders__next(struct btf_encoder *encoder);
 
+struct btf *btf_encoder__btf(struct btf_encoder *encoder);
+
 #endif /* _BTF_ENCODER_H_ */
diff --git a/pahole.c b/pahole.c
index f3eeaaca4cdf..93b844f97c38 100644
--- a/pahole.c
+++ b/pahole.c
@@ -2798,6 +2798,73 @@ out:
 
 static struct type_instance *header;
 
+struct thread_data {
+	struct btf *btf;
+	struct btf_encoder *encoder;
+};
+
+static int pahole_threads_prepare(struct conf_load *conf, int nr_threads, void **thr_data)
+{
+	int i;
+	struct thread_data *threads = calloc(sizeof(struct thread_data), nr_threads);
+
+	for (i = 0; i < nr_threads; i++)
+		thr_data[i] = threads + i;
+
+	return 0;
+}
+
+static int pahole_thread_exit(struct conf_load *conf, void *thr_data)
+{
+	struct thread_data *thread = thr_data;
+
+	if (thread == NULL)
+		return 0;
+
+	/*
+	 * Here we will call btf__dedup() once we extend
+	 * btf__dedup().
+	 */
+
+	return 0;
+}
+
+static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void **thr_data,
+				  int error)
+{
+	struct thread_data **threads = (struct thread_data **)thr_data;
+	int i;
+	int err = 0;
+
+	if (error)
+		goto out;
+
+	for (i = 0; i < nr_threads; i++) {
+		/*
+		 * Merge content of the btf instances of worker
+		 * threads to the btf instance of the primary
+		 * btf_encoder.
+		 */
+		if (!threads[i]->btf || threads[i]->encoder == btf_encoder)
+			continue; /* The primary btf_encoder */
+		err = btf__add_btf(btf_encoder__btf(btf_encoder), threads[i]->btf);
+		if (err < 0)
+			goto out;
+		btf_encoder__delete(threads[i]->encoder);
+		threads[i]->encoder = NULL;
+	}
+	err = 0;
+
+out:
+	for (i = 0; i < nr_threads; i++) {
+		if (threads[i]->encoder && threads[i]->encoder != btf_encoder)
+			btf_encoder__delete(threads[i]->encoder);
+	}
+	free(threads[0]);
+
+	return err;
+}
+
 static enum load_steal_kind pahole_stealer(struct cu *cu,
 					   struct conf_load *conf_load,
 					   void *thr_data)
@@ -2819,8 +2886,8 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
 
 	if (btf_encode) {
 		static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;
+		struct btf_encoder *encoder;
 
-		pthread_mutex_lock(&btf_lock);
 		/*
 		 * FIXME:
 		 *
@@ -2828,21 +2895,55 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
 		 * point we'll have cu->elf setup...
 		 */
 		if (!btf_encoder) {
-			btf_encoder = btf_encoder__new(cu, detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
-						       btf_encode_force, btf_gen_floats, global_verbose);
+			pthread_mutex_lock(&btf_lock);
+			if (!btf_encoder) {
+				/*
+				 * btf_encoder is the primary encoder.
+				 * And, it is used by the thread
+				 * It is used by the thread that
+				 * creates it.
+				btf_encoder = btf_encoder__new(cu, detached_btf_filename,
+							       conf_load->base_btf,
+							       skip_encoding_btf_vars,
+							       btf_encode_force,
+							       btf_gen_floats, global_verbose);
+				if (btf_encoder && thr_data) {
+					struct thread_data *thread = (struct thread_data *)thr_data;
+
+					thread->encoder = btf_encoder;
+					thread->btf = btf_encoder__btf(btf_encoder);
+				}
+			}
+			pthread_mutex_unlock(&btf_lock);
 			if (btf_encoder == NULL) {
 				ret = LSK__STOP_LOADING;
 				goto out_btf;
 			}
 		}
 
-		if (btf_encoder__encode_cu(btf_encoder, cu)) {
+		if (thr_data) {
+			struct thread_data *thread = (struct thread_data *)thr_data;
+
+			if (thread->encoder == NULL) {
+				thread->encoder =
+					btf_encoder__new(cu, detached_btf_filename,
+							 NULL,
+							 skip_encoding_btf_vars,
+							 btf_encode_force,
+							 btf_gen_floats,
+							 global_verbose);
+				thread->btf = btf_encoder__btf(thread->encoder);
+			}
+			encoder = thread->encoder;
+		} else
+			encoder = btf_encoder;
+
+		if (btf_encoder__encode_cu(encoder, cu)) {
 			fprintf(stderr, "Encountered error while encoding BTF.\n");
 			exit(1);
 		}
 		ret = LSK__DELETE;
 out_btf:
-		pthread_mutex_unlock(&btf_lock);
 		return ret;
 	}
 #if 0
@@ -3207,6 +3308,9 @@ int main(int argc, char *argv[])
 	memset(tab, ' ', sizeof(tab) - 1);
 
 	conf_load.steal = pahole_stealer;
+	conf_load.thread_exit = pahole_thread_exit;
+	conf_load.threads_prepare = pahole_threads_prepare;
+	conf_load.threads_collect = pahole_threads_collect;
 
 	// Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file'
 	if (conf.header_type && !class_name && prettify_input) {
-- 
2.30.2



* Re: [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking.
  2022-01-24 19:18 ` [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking Kui-Feng Lee
@ 2022-01-24 20:13   ` Andrii Nakryiko
  2022-01-25 19:38     ` Kui-Feng Lee
  0 siblings, 1 reply; 7+ messages in thread
From: Andrii Nakryiko @ 2022-01-24 20:13 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: dwarves, Arnaldo Carvalho de Melo, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On Mon, Jan 24, 2022 at 11:19 AM Kui-Feng Lee <kuifeng@fb.com> wrote:
>
> Create an instance of btf for each worker thread, and add type info to
> the local btf instance in the steal-function of pahole without mutex
> acquiring.  Once finished with all worker threads, merge all
> per-thread btf instances to the primary btf instance.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
> ---

BTW, I've already synced your btf__add_btf() optimization to Github,
so you can bump pahole libbpf submodule reference in the next
revision. You'll get a bunch of deprecation warnings, so it would be
great to fix those at the same time.

>  btf_encoder.c |   5 +++
>  btf_encoder.h |   2 +
>  pahole.c      | 114 +++++++++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/btf_encoder.c b/btf_encoder.c
> index 9d015f304e92..56a76f5d7275 100644
> --- a/btf_encoder.c
> +++ b/btf_encoder.c
> @@ -1529,3 +1529,8 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu)
>  out:
>         return err;
>  }
> +
> +struct btf *btf_encoder__btf(struct btf_encoder *encoder)
> +{
> +       return encoder->btf;
> +}
> diff --git a/btf_encoder.h b/btf_encoder.h
> index f133b0d7674d..0f0eee84df74 100644
> --- a/btf_encoder.h
> +++ b/btf_encoder.h
> @@ -29,4 +29,6 @@ struct btf_encoder *btf_encoders__first(struct list_head *encoders);
>
>  struct btf_encoder *btf_encoders__next(struct btf_encoder *encoder);
>
> +struct btf *btf_encoder__btf(struct btf_encoder *encoder);
> +
>  #endif /* _BTF_ENCODER_H_ */
> diff --git a/pahole.c b/pahole.c
> index f3eeaaca4cdf..93b844f97c38 100644
> --- a/pahole.c
> +++ b/pahole.c
> @@ -2798,6 +2798,73 @@ out:
>
>  static struct type_instance *header;
>
> +struct thread_data {
> +       struct btf *btf;
> +       struct btf_encoder *encoder;
> +};
> +
> +static int pahole_threads_prepare(struct conf_load *conf, int nr_threads, void **thr_data)
> +{
> +       int i;
> +       struct thread_data *threads = calloc(sizeof(struct thread_data), nr_threads);
> +
> +       for (i = 0; i < nr_threads; i++)
> +               thr_data[i] = threads + i;
> +
> +       return 0;
> +}
> +
> +static int pahole_thread_exit(struct conf_load *conf, void *thr_data)
> +{
> +       struct thread_data *thread = thr_data;
> +
> +       if (thread == NULL)
> +               return 0;
> +
> +       /*
> +        * Here we will call btf__dedup() once we extend
> +        * btf__dedup().
> +        */
> +
> +       return 0;
> +}
> +
> +static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void **thr_data,
> +                                 int error)
> +{
> +       struct thread_data **threads = (struct thread_data **)thr_data;
> +       int i;
> +       int err = 0;
> +
> +       if (error)
> +               goto out;
> +
> +       for (i = 0; i < nr_threads; i++) {
> +               /*
> +                * Merge content of the btf instances of worker
> +                * threads to the btf instance of the primary
> +                * btf_encoder.
> +                */
> +               if (!threads[i]->btf || threads[i]->encoder == btf_encoder)
> +                       continue; /* The primary btf_encoder */
> +               err = btf__add_btf(btf_encoder__btf(btf_encoder), threads[i]->btf);

You've ignored my question and suggestion to move btf__add_btf() into
pahole_thread_exit. Can you please comment on why it's not a good
idea, if you insist on not doing that?

> +               if (err < 0)
> +                       goto out;
> +               btf_encoder__delete(threads[i]->encoder);
> +               threads[i]->encoder = NULL;
> +       }
> +       err = 0;
> +
> +out:
> +       for (i = 0; i < nr_threads; i++) {
> +               if (threads[i]->encoder && threads[i]->encoder != btf_encoder)
> +                       btf_encoder__delete(threads[i]->encoder);
> +       }
> +       free(threads[0]);
> +
> +       return err;
> +}
> +
>  static enum load_steal_kind pahole_stealer(struct cu *cu,
>                                            struct conf_load *conf_load,
>                                            void *thr_data)
> @@ -2819,8 +2886,8 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
>
>         if (btf_encode) {
>                 static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;
> +               struct btf_encoder *encoder;
>
> -               pthread_mutex_lock(&btf_lock);
>                 /*
>                  * FIXME:
>                  *
> @@ -2828,21 +2895,55 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
>                  * point we'll have cu->elf setup...
>                  */
>                 if (!btf_encoder) {
> -                       btf_encoder = btf_encoder__new(cu, detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
> -                                                      btf_encode_force, btf_gen_floats, global_verbose);
> +                       pthread_mutex_lock(&btf_lock);
> +                       if (!btf_encoder) {

are you trying to minimize lock contention here with !btf_encoder
check outside and inside locked area? Why? It just adds more code
nesting and this locking can't be a real bottleneck. We are talking
about a very few threads being initialized. Please keep it simple,
lock, check !btf_encoder and goto unlock, if it's already set.
Otherwise proceed to initialization.
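The simpler shape being suggested could be sketched as follows (a toy model with invented names, not pahole's code): always take the lock, then test, so there is only one nesting level and one code path.

```c
#include <assert.h>
#include <pthread.h>

/* Always take the lock, then check.  With only a handful of worker
 * threads this lock is cold, so the double-checked pattern above
 * buys nothing. */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int encoder_ready;
static int nr_inits;

static int ensure_encoder(void)
{
	pthread_mutex_lock(&init_lock);
	if (!encoder_ready) {
		nr_inits++;		/* stands in for the expensive one-time setup */
		encoder_ready = 1;
	}
	pthread_mutex_unlock(&init_lock);
	return encoder_ready;
}
```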

> +                               /*
> +                                * btf_encoder is the primary encoder.
> +                                * It is used by the thread that
> +                                * creates it.
> +                                */
> +                               btf_encoder = btf_encoder__new(cu, detached_btf_filename,
> +                                                              conf_load->base_btf,
> +                                                              skip_encoding_btf_vars,
> +                                                              btf_encode_force,
> +                                                              btf_gen_floats, global_verbose);
> +                               if (btf_encoder && thr_data) {
> +                                       struct thread_data *thread = (struct thread_data *)thr_data;
> +
> +                                       thread->encoder = btf_encoder;
> +                                       thread->btf = btf_encoder__btf(btf_encoder);

Can you summarize the relationship between thr_data, btf_encode and
thread->encoder? I'm re-reading this code (including `if (thr_data)`
piece below) over and over and can't figure out why the initialization
pattern is so complicated.

> +                               }
> +                       }
> +                       pthread_mutex_unlock(&btf_lock);
>                         if (btf_encoder == NULL) {
>                                 ret = LSK__STOP_LOADING;
>                                 goto out_btf;
>                         }
>                 }
>
> -               if (btf_encoder__encode_cu(btf_encoder, cu)) {
> +               if (thr_data) {
> +                       struct thread_data *thread = (struct thread_data *)thr_data;
> +
> +                       if (thread->encoder == NULL) {
> +                               thread->encoder =
> +                                       btf_encoder__new(cu, detached_btf_filename,
> +                                                        NULL,
> +                                                        skip_encoding_btf_vars,
> +                                                        btf_encode_force,
> +                                                        btf_gen_floats,
> +                                                        global_verbose);
> +                               thread->btf = btf_encoder__btf(thread->encoder);
> +                       }
> +                       encoder = thread->encoder;
> +               } else
> +                       encoder = btf_encoder;
> +
> +               if (btf_encoder__encode_cu(encoder, cu)) {
>                         fprintf(stderr, "Encountered error while encoding BTF.\n");
>                         exit(1);
>                 }
>                 ret = LSK__DELETE;
>  out_btf:
> -               pthread_mutex_unlock(&btf_lock);
>                 return ret;
>         }
>  #if 0
> @@ -3207,6 +3308,9 @@ int main(int argc, char *argv[])
>         memset(tab, ' ', sizeof(tab) - 1);
>
>         conf_load.steal = pahole_stealer;
> +       conf_load.thread_exit = pahole_thread_exit;
> +       conf_load.threads_prepare = pahole_threads_prepare;
> +       conf_load.threads_collect = pahole_threads_collect;
>
>         // Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file'
>         if (conf.header_type && !class_name && prettify_input) {
> --
> 2.30.2
>


* Re: [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking.
  2022-01-24 20:13   ` Andrii Nakryiko
@ 2022-01-25 19:38     ` Kui-Feng Lee
  2022-01-25 21:49       ` Andrii Nakryiko
  0 siblings, 1 reply; 7+ messages in thread
From: Kui-Feng Lee @ 2022-01-25 19:38 UTC (permalink / raw)
  To: andrii.nakryiko; +Cc: daniel, ast, arnaldo.melo, andrii, dwarves

On Mon, 2022-01-24 at 12:13 -0800, Andrii Nakryiko wrote:
> On Mon, Jan 24, 2022 at 11:19 AM Kui-Feng Lee <kuifeng@fb.com> wrote:
> > 
> > Create an instance of btf for each worker thread, and add type info
> > to
> > the local btf instance in the steal-function of pahole without mutex
> > acquiring.  Once finished with all worker threads, merge all
> > per-thread btf instances to the primary btf instance.
> > 
> > Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
> > ---
> 
> BTW, I've already synced your btf__add_btf() optimization to Github,
> so you can bump pahole libbpf submodule reference in the next
> revision. You'll get a bunch of deprecation warnings, so it would be
> great to fix those at the same time.

Got it!

...... cut .....
> > 
> > +static int pahole_thread_exit(struct conf_load *conf, void
> > *thr_data)
> > +{
> > +       struct thread_data *thread = thr_data;
> > +
> > +       if (thread == NULL)
> > +               return 0;
> > +
> > +       /*
> > +        * Here we will call btf__dedup() once we extend
> > +        * btf__dedup().
> > +        */
> > +
> > +       return 0;
> > +}
> > +
> > 
> You've ignored my question and suggestion to move btf__add_btf() into
> pahole_thread_exit. Can you please comment on why it's not a good
> idea, if you insist on not doing that?

Sorry for not replying to this part.
I tried it yesterday and didn't see an obvious improvement.  I guess
the improvement is not big enough to be observed.

> > @@ -2819,8 +2886,8 @@ static enum load_steal_kind
> > pahole_stealer(struct cu *cu,
> > 
> >         if (btf_encode) {
> >                 static pthread_mutex_t btf_lock =
> > PTHREAD_MUTEX_INITIALIZER;
> > +               struct btf_encoder *encoder;
> > 
> > -               pthread_mutex_lock(&btf_lock);
> >                 /*
> >                  * FIXME:
> >                  *
> > @@ -2828,21 +2895,55 @@ static enum load_steal_kind
> > pahole_stealer(struct cu *cu,
> >                  * point we'll have cu->elf setup...
> >                  */
> >                 if (!btf_encoder) {
> > -                       btf_encoder = btf_encoder__new(cu,
> > detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
> > -                                                     
> > btf_encode_force, btf_gen_floats, global_verbose);
> > +                       pthread_mutex_lock(&btf_lock);
> > +                       if (!btf_encoder) {
> 
> are you trying to minimize lock contention here with !btf_encoder
> check outside and inside locked area? Why? It just adds more code
> nesting and this locking can't be a real bottleneck. We are talking
> about a very few threads being initialized. Please keep it simple,
> lock, check !btf_encoder and goto unlock, if it's already set.
> Otherwise proceed to initialization.

Got it!

.......................
> > +                               if (btf_encoder && thr_data) {
> > +                                       struct thread_data *thread =
> > (struct thread_data *)thr_data;
> > +
> > +                                       thread->encoder =
> > btf_encoder;
> > +                                       thread->btf =
> > btf_encoder__btf(btf_encoder);
> 
> Can you summarize the relationship between thr_data, btf_encode and
> thread->encoder? I'm re-reading this code (including `if (thr_data)`
> piece below) over and over and can't figure out why the initialization
> pattern is so complicated.

thr_data keeps per-thread information for the worker threads, so every
worker thread has its own instance to keep an encoder. The main thread
uses btf_encoder to encode BTF; thread->encoder is what a worker
thread uses to add type info. Once all worker threads have finished,
the main thread adds the btf instances of the thread->encoders to the
btf instance of btf_encoder. However, I reuse btf_encoder for the
first worker thread, the one that reaches this function and creates
btf_encoder. That avoids copying data for the first worker thread.

Does it make sense? If so, I will put the explanation in the code.
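[Editor's note: the ownership relationship described above could be
sketched roughly as below.  All types here are simplified stand-ins
(the real struct thread_data and btf_encoder live in pahole's
dwarf_loader.c and btf_encoder.c), and locking is elided, so treat this
as an illustration of the pattern rather than the actual code.]

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for pahole's types, for illustration only. */
struct btf { int n_types; };
struct btf_encoder { struct btf btf; };

struct thread_data {
	struct btf *btf;		/* where this worker adds type info */
	struct btf_encoder *encoder;	/* encoder wrapping that btf */
};

static struct btf_encoder *btf_encoder;	/* primary encoder, created once */

/* The first worker to get here creates the shared encoder and reuses it
 * directly, so its types need no copying later; every other worker gets
 * a private encoder whose btf is merged into the primary one at the end. */
static void setup_thread_encoder(struct thread_data *thread)
{
	if (!btf_encoder) {
		btf_encoder = calloc(1, sizeof(*btf_encoder));
		thread->encoder = btf_encoder;	/* reuse the primary */
	} else {
		thread->encoder = calloc(1, sizeof(*thread->encoder));
	}
	thread->btf = &thread->encoder->btf;
}
```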


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking.
  2022-01-25 19:38     ` Kui-Feng Lee
@ 2022-01-25 21:49       ` Andrii Nakryiko
  0 siblings, 0 replies; 7+ messages in thread
From: Andrii Nakryiko @ 2022-01-25 21:49 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: daniel, ast, arnaldo.melo, andrii, dwarves

On Tue, Jan 25, 2022 at 11:38 AM Kui-Feng Lee <kuifeng@fb.com> wrote:
>
> On Mon, 2022-01-24 at 12:13 -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 24, 2022 at 11:19 AM Kui-Feng Lee <kuifeng@fb.com> wrote:
> > >
> > > Create an instance of btf for each worker thread, and add type
> > > info to the local btf instance in pahole's steal function without
> > > acquiring a mutex.  Once all worker threads have finished, merge
> > > all per-thread btf instances into the primary btf instance.
> > >
> > > Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
> > > ---
> >
> > BTW, I've already synced your btf__add_btf() optimization to Github,
> > so you can bump pahole libbpf submodule reference in the next
> > revision. You'll get a bunch of deprecation warnings, so it would be
> > great to fix those at the same time.
>
> Got it!
>
> ...... cut .....
> > >
> > > +static int pahole_thread_exit(struct conf_load *conf, void
> > > *thr_data)
> > > +{
> > > +       struct thread_data *thread = thr_data;
> > > +
> > > +       if (thread == NULL)
> > > +               return 0;
> > > +
> > > +       /*
> > > +        * Here we will call btf__dedup() here once we extend
> > > +        * btf__dedup().
> > > +        */
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >
> > You've ignored my question and suggestion to move btf__add_btf() into
> > pahole_thread_exit. Can you please comment on why it's not a good
> > idea, if you insist on not doing that?
>
> Sorry for not replying for this part.
> I tried it yesterday, and I didn't see an obvious improvement.  I guess
> the improvement is not big enough to be observed.

I see. It still seems like a good move anyway, because those per-thread
BTF instances are owned by each thread, so when a thread is done it's
cleaner to merge its instance into the common BTF and free up the local
BTF, instead of keeping it around for longer than necessary.
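[Editor's note: a sketch of that suggestion, merging each worker's btf
into the primary instance in the thread-exit hook and freeing it right
away.  The types and the merge helper below are simplified stand-ins;
the real code would call libbpf's btf__add_btf() on pahole's own
structures, and the real hook would also need btf_lock around the
merge.]

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in; real code would use struct btf from libbpf. */
struct btf { int n_types; };

static struct btf *primary_btf;		/* owned by the main thread */

/* Stands in for libbpf's btf__add_btf(dst, src), which appends all of
 * src's types to dst. */
static int fake_add_btf(struct btf *dst, const struct btf *src)
{
	dst->n_types += src->n_types;
	return 0;
}

/* Merge the worker's btf into the primary one as soon as the worker is
 * done, then free it, instead of keeping it alive until all threads end.
 * The first worker shares the primary btf, so it is skipped. */
static int thread_exit_merge(struct btf *thread_btf)
{
	int err = 0;

	if (thread_btf && thread_btf != primary_btf) {
		err = fake_add_btf(primary_btf, thread_btf);
		free(thread_btf);
	}
	return err;
}
```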

>
> > > @@ -2819,8 +2886,8 @@ static enum load_steal_kind
> > > pahole_stealer(struct cu *cu,
> > >
> > >         if (btf_encode) {
> > >                 static pthread_mutex_t btf_lock =
> > > PTHREAD_MUTEX_INITIALIZER;
> > > +               struct btf_encoder *encoder;
> > >
> > > -               pthread_mutex_lock(&btf_lock);
> > >                 /*
> > >                  * FIXME:
> > >                  *
> > > @@ -2828,21 +2895,55 @@ static enum load_steal_kind
> > > pahole_stealer(struct cu *cu,
> > >                  * point we'll have cu->elf setup...
> > >                  */
> > >                 if (!btf_encoder) {
> > > -                       btf_encoder = btf_encoder__new(cu,
> > > detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
> > > -
> > > btf_encode_force, btf_gen_floats, global_verbose);
> > > +                       pthread_mutex_lock(&btf_lock);
> > > +                       if (!btf_encoder) {
> >
> > are you trying to minimize lock contention here with !btf_encoder
> > check outside and inside locked area? Why? It just adds more code
> > nesting and this locking can't be a real bottleneck. We are talking
> > about a very few threads being initialized. Please keep it simple,
> > lock, check !btf_encoder and goto unlock if it's already set.
> > Otherwise proceed to initialization.
>
> Got it!
>
> .......................
> > > +                               if (btf_encoder && thr_data) {
> > > +                                       struct thread_data *thread =
> > > (struct thread_data *)thr_data;
> > > +
> > > +                                       thread->encoder =
> > > btf_encoder;
> > > +                                       thread->btf =
> > > btf_encoder__btf(btf_encoder);
> >
> > Can you summarize the relationship between thr_data, btf_encoder and
> > thread->encoder? I'm re-reading this code (including `if (thr_data)`
> > piece below) over and over and can't figure out why the initialization
> > pattern is so complicated.
>
> thr_data keeps per-thread information for the worker threads, so every
> worker thread has its own instance to keep an encoder. The main thread
> uses btf_encoder to encode BTF; thread->encoder is what a worker
> thread uses to add type info. Once all worker threads have finished,
> the main thread adds the btf instances of the thread->encoders to the
> btf instance of btf_encoder. However, I reuse btf_encoder for the
> first worker thread, the one that reaches this function and creates
> btf_encoder. That avoids copying data for the first worker thread.
>
> Does it make sense? If so, I will put the explanation in the code.
>

Oh, I see, thanks, that makes it a bit clearer. Yes, please leave a
comment describing this, thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-01-25 21:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-24 19:18 [PATCH dwarves 0/3] Parallelize BTF type info generating of pahole Kui-Feng Lee
2022-01-24 19:18 ` [PATCH dwarves v2 1/3] dwarf_loader: Receive per-thread data on worker threads Kui-Feng Lee
2022-01-24 19:18 ` [PATCH dwarves v2 2/3] dwarf_loader: Prepare and pass per-thread data to " Kui-Feng Lee
2022-01-24 19:18 ` [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking Kui-Feng Lee
2022-01-24 20:13   ` Andrii Nakryiko
2022-01-25 19:38     ` Kui-Feng Lee
2022-01-25 21:49       ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).