[PATCH dwarves 0/3] add option to merge more dwarf cu's into


* [PATCH dwarves 0/3] add option to merge more dwarf cu's into
@ 2021-03-25  6:53 Yonghong Song
  2021-03-25  6:53 ` [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-25  6:53 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

For a vmlinux built from the latest bpf-next with clang thin-LTO or
LTO, the debuginfo contains cross-cu type references. For example,
      compile unit 1:
         tag 10:  type A
      compile unit 2:
         ...
           refer to type A (tag 10 in compile unit 1)
I only checked a few cases, but I have seen that type A may be a
simple type like "unsigned char" or a complex type like an array of
base types. I am using the latest llvm trunk and bpf-next. I suspect
llvm12 or Linus' tree >= 5.12-rc2 should exhibit the issue as well.
Both thin-LTO and LTO have the same issue.

Current pahole cannot handle this; it reports errors saying that
types cannot be found. Bill Wendling attempted to fix the issue
in [1] by hashing all tags/types into the same hash table and
then processing cu's one by one. This does not really work,
because each cu resolves types locally. For the above example
we may have
  compile unit 1:
    type A : type_id = 10
  compile unit 2:
    refer to type A : type A will be resolved as type id = 10
But id 10 belongs to compile unit 1, so in compile unit 2 we get
either an out-of-bounds type id or an incorrect type.
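
As a rough illustration of the failure mode (a toy model, not pahole
code; every toy_* name and the table contents below are made up for
the example), treat each cu as a flat table indexed by type id. An id
that is valid in compile unit 1 is out of bounds in a smaller compile
unit 2, while a single merged table resolves it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for a cu: type names indexed by cu-local type id. */
struct toy_cu {
	const char *const *types;
	size_t nr_types;
};

/* Compile unit 1: "A" ends up with type id 10. */
static const char *const toy_cu1_types[] = {
	"void", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "A",
};

/* Compile unit 2: only three local types, so id 10 does not exist here. */
static const char *const toy_cu2_types[] = { "void", "int", "long" };

/* One merged cu: all types share a single id space. */
static const char *const toy_merged_types[] = {
	"void", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "A",
	"void", "int", "long",
};

static const struct toy_cu toy_cu1 = { toy_cu1_types, TOY_ARRAY_SIZE(toy_cu1_types) };
static const struct toy_cu toy_cu2 = { toy_cu2_types, TOY_ARRAY_SIZE(toy_cu2_types) };
static const struct toy_cu toy_merged = { toy_merged_types, TOY_ARRAY_SIZE(toy_merged_types) };

/* Resolve an id against one cu's own table; NULL when out of bounds. */
static const char *toy_cu__type(const struct toy_cu *cu, size_t id)
{
	assert(cu != NULL);
	return id < cu->nr_types ? cu->types[id] : NULL;
}

/* True if the id resolves in this cu and names the expected type. */
static int toy_cu__type_is(const struct toy_cu *cu, size_t id, const char *name)
{
	const char *type = toy_cu__type(cu, id);

	return type != NULL && strcmp(type, name) == 0;
}
```

If compile unit 2 happened to have more than ten types, the same
lookup would succeed but return an unrelated type, which is the
"incorrect one" case.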

This patch set is a continuation of Bill's work. We still
increase the hashtable size and traverse all cu's before
recoding and finalization. But instead of creating a one-to-one
mapping between debuginfo cu's and pahole cu's, we create just
one pahole cu, which should solve the above incorrect type
id issue.

Patches #1 and #2 refactor the existing code, and
patch #3 adds an option "merge_cus" to permit
merging all debuginfo cu's into one pahole cu.
When building vmlinux, the build can detect whether LTO or
Thin-LTO is enabled and, if so, add "merge_cus" to the pahole
command line.

  [1] https://www.spinics.net/lists/dwarves/msg00999.html

Yonghong Song (3):
  dwarf_loader: permits flexible HASHTAGS__BITS
  dwarf_loader: factor out common code to initialize a cu
  dwarf_loader: add option to merge more dwarf cu's into one pahole cu

 dwarf_loader.c | 179 +++++++++++++++++++++++++++++++++++++++----------
 dwarves.h      |   2 +
 pahole.c       |   8 +++
 3 files changed, 155 insertions(+), 34 deletions(-)

-- 
2.30.2



* [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-25  6:53 [PATCH dwarves 0/3] add option to merge more dwarf cu's into Yonghong Song
@ 2021-03-25  6:53 ` Yonghong Song
  2021-03-26 23:13   ` Andrii Nakryiko
  2021-03-25  6:53 ` [PATCH dwarves 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Yonghong Song @ 2021-03-25  6:53 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Currently, the types/tags hash tables have a fixed HASHTAGS__BITS = 15,
which means the number of buckets is 1UL << 15 = 32768.
In my experiments, a thin-LTO built vmlinux has roughly 9M entries
in the types table and 5.2M entries in the tags table, so the number
of buckets is too small for an efficient lookup. This patch
refactors the code to allow the number of buckets to be changed.

In addition, the return value of hashtags__fn(key) is currently
assigned to a uint16_t. Change it to uint32_t, since a later patch
can increase the number of hashtag bits to more than 16.
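
A small sketch of the arithmetic behind both points (the toy_*
helpers are illustrative names, not part of dwarf_loader.c, and an
even spread of entries over buckets is assumed): with 15 bits, 9M
entries average a few hundred per bucket, and a bucket index above
65535 silently loses its high bits when stored in a uint16_t.

```c
#include <assert.h>
#include <stdint.h>

/* Number of buckets for a given number of hash bits. */
static uint64_t toy_nr_buckets(uint32_t bits)
{
	assert(bits < 64);
	return UINT64_C(1) << bits;
}

/* Average chain length (rounded down), assuming entries spread evenly. */
static uint64_t toy_avg_chain(uint64_t nr_entries, uint32_t bits)
{
	return nr_entries / toy_nr_buckets(bits);
}

/* What a bucket index becomes when it is stored in a uint16_t. */
static uint16_t toy_store_in_u16(uint32_t bucket)
{
	return (uint16_t)bucket; /* keeps only the low 16 bits */
}
```

For example, toy_avg_chain(9000000, 15) is 274, and
toy_store_in_u16(70000) is 4464, i.e. a different bucket than the
one the hash selected.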

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index c106919..a02ef23 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -50,7 +50,12 @@ struct strings *strings;
 #define DW_FORM_implicit_const 0x21
 #endif
 
-#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
+static uint32_t hashtags__bits = 15;
+
+uint32_t hashtags__fn(Dwarf_Off key)
+{
+	return hash_64(key, hashtags__bits);
+}
 
 bool no_bitfield_type_recode = true;
 
@@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
 	*(dwarf_off_ref *)(dtag + 1) = spec;
 }
 
-#define HASHTAGS__BITS 15
-#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
-
 #define obstack_chunk_alloc malloc
 #define obstack_chunk_free free
 
@@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
 }
 
 struct dwarf_cu {
-	struct hlist_head hash_tags[HASHTAGS__SIZE];
-	struct hlist_head hash_types[HASHTAGS__SIZE];
+	struct hlist_head *hash_tags;
+	struct hlist_head *hash_types;
 	struct obstack obstack;
 	struct cu *cu;
 	struct dwarf_cu *type_unit;
 };
 
-static void dwarf_cu__init(struct dwarf_cu *dcu)
+static int dwarf_cu__init(struct dwarf_cu *dcu)
 {
+	uint64_t hashtags_size = 1UL << hashtags__bits;
+	dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
+	if (!dcu->hash_tags)
+		return -ENOMEM;
+
+	dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
+	if (!dcu->hash_types) {
+		free(dcu->hash_tags);
+		return -ENOMEM;
+	}
+
 	unsigned int i;
-	for (i = 0; i < HASHTAGS__SIZE; ++i) {
+	for (i = 0; i < hashtags_size; ++i) {
 		INIT_HLIST_HEAD(&dcu->hash_tags[i]);
 		INIT_HLIST_HEAD(&dcu->hash_types[i]);
 	}
 	obstack_init(&dcu->obstack);
 	dcu->type_unit = NULL;
+	return 0;
+}
+
+static void dwarf_cu__delete(struct cu *cu)
+{
+	struct dwarf_cu *dcu = cu->priv;
+	free(dcu->hash_tags);
+	free(dcu->hash_types);
 }
 
 static void hashtags__hash(struct hlist_head *hashtable,
@@ -151,7 +172,7 @@ static struct dwarf_tag *hashtags__find(const struct hlist_head *hashtable,
 
 	struct dwarf_tag *tpos;
 	struct hlist_node *pos;
-	uint16_t bucket = hashtags__fn(id);
+	uint32_t bucket = hashtags__fn(id);
 	const struct hlist_head *head = hashtable + bucket;
 
 	hlist_for_each_entry(tpos, pos, head, hash_node) {
@@ -2429,7 +2450,9 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 			}
 			cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
 
-			dwarf_cu__init(dcup);
+			if (dwarf_cu__init(dcup) != 0)
+				return DWARF_CB_ABORT;
+
 			dcup->cu = cu;
 			/* Funny hack.  */
 			dcup->type_unit = dcup;
@@ -2521,7 +2544,9 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 
 		struct dwarf_cu dcu;
 
-		dwarf_cu__init(&dcu);
+		if (dwarf_cu__init(&dcu) != 0)
+			return DWARF_CB_ABORT;
+
 		dcu.cu = cu;
 		dcu.type_unit = type_cu ? &type_dcu : NULL;
 		cu->priv = &dcu;
@@ -2672,5 +2697,6 @@ struct debug_fmt_ops dwarf__ops = {
 	.tag__decl_file	     = dwarf_tag__decl_file,
 	.tag__decl_line	     = dwarf_tag__decl_line,
 	.tag__orig_id	     = dwarf_tag__orig_id,
+	.cu__delete	     = dwarf_cu__delete,
 	.has_alignment_info  = true,
 };
-- 
2.30.2



* [PATCH dwarves 2/3] dwarf_loader: factor out common code to initialize a cu
  2021-03-25  6:53 [PATCH dwarves 0/3] add option to merge more dwarf cu's into Yonghong Song
  2021-03-25  6:53 ` [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
@ 2021-03-25  6:53 ` Yonghong Song
  2021-03-25  6:53 ` [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu Yonghong Song
  2021-03-25 13:10 ` [PATCH dwarves 0/3] add option to merge more dwarf cu's into Arnaldo Carvalho de Melo
  3 siblings, 0 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-25  6:53 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Both cus__load_debug_types() and cus__load_module()
create new cu's and then initialize them. The
initialization code is identical, so refactor it
into a common function, which can also be used later
when merging cu's.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index a02ef23..dc66df0 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -2411,6 +2411,23 @@ static int finalize_cu_immediately(struct cus *cus, struct cu *cu,
 	return lsk;
 }
 
+static int cu__set_common(struct cu *cu, struct conf_load *conf,
+			  Dwfl_Module *mod, Elf *elf)
+{
+	cu->uses_global_strings = true;
+	cu->elf = elf;
+	cu->dwfl = mod;
+	cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
+	cu->has_addr_info = conf ? conf->get_addr_info : 0;
+
+	GElf_Ehdr ehdr;
+	if (gelf_getehdr(elf, &ehdr) == NULL)
+		return DWARF_CB_ABORT;
+
+	cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
+	return 0;
+}
+
 static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 				 Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 				 const char *filename,
@@ -2434,22 +2451,11 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 
 			cu = cu__new("", pointer_size, build_id,
 				     build_id_len, filename);
-			if (cu == NULL) {
+			if (cu == NULL ||
+			    cu__set_common(cu, conf, mod, elf) != 0) {
 				return DWARF_CB_ABORT;
 			}
 
-			cu->uses_global_strings = true;
-			cu->elf = elf;
-			cu->dwfl = mod;
-			cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
-			cu->has_addr_info = conf ? conf->get_addr_info : 0;
-
-			GElf_Ehdr ehdr;
-			if (gelf_getehdr(elf, &ehdr) == NULL) {
-				return DWARF_CB_ABORT;
-			}
-			cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
-
 			if (dwarf_cu__init(dcup) != 0)
 				return DWARF_CB_ABORT;
 
@@ -2528,19 +2534,8 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		const char *name = attr_string(cu_die, DW_AT_name);
 		struct cu *cu = cu__new(name ?: "", pointer_size,
 					build_id, build_id_len, filename);
-		if (cu == NULL)
-			return DWARF_CB_ABORT;
-		cu->uses_global_strings = true;
-		cu->elf = elf;
-		cu->dwfl = mod;
-		cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
-		cu->has_addr_info = conf ? conf->get_addr_info : 0;
-
-		GElf_Ehdr ehdr;
-		if (gelf_getehdr(elf, &ehdr) == NULL) {
+		if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
 			return DWARF_CB_ABORT;
-		}
-		cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
 
 		struct dwarf_cu dcu;
 
-- 
2.30.2



* [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-25  6:53 [PATCH dwarves 0/3] add option to merge more dwarf cu's into Yonghong Song
  2021-03-25  6:53 ` [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
  2021-03-25  6:53 ` [PATCH dwarves 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
@ 2021-03-25  6:53 ` Yonghong Song
  2021-03-26 14:41   ` Arnaldo Carvalho de Melo
  2021-03-26 23:21   ` Andrii Nakryiko
  2021-03-25 13:10 ` [PATCH dwarves 0/3] add option to merge more dwarf cu's into Arnaldo Carvalho de Melo
  3 siblings, 2 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-25  6:53 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

This patch adds an option "merge_cus", which merges
all debuginfo cu's into one pahole cu.
For vmlinux built with clang thin-LTO or LTO, there exist
cross-cu type references. For example, you could have
  compile unit 1:
     tag 10:  type A
  compile unit 2:
     ...
       refer to type A (tag 10 in compile unit 1)
I only checked a few cases, but I have seen that type A may be a simple
type like "unsigned char" or a complex type like an array of base types.

There are two different ways to resolve this issue:
(1). merge all compile units into one pahole cu so tags/types
     can be resolved easily, or
(2). do on-demand type traversal in other debuginfo cu's
     when we do die_process().
Method (2) is much more complicated, so I picked method (1).
An option "merge_cus" is added to permit such an operation.

Merging cu's creates a single cu with lots of types, tags
and functions. For example, with a clang thin-LTO built vmlinux,
I saw 9M entries in the types table and 5.2M in the tags table.
Below are pahole wallclock times for different numbers of hashbits:
command line: time pahole -J --merge_cus vmlinux
      # of hashbits            wallclock time in seconds
          15                       460
          16                       255
          17                       131
          18                       97
          19                       75
          20                       69
          21                       64
          22                       62
          23                       58
          24                       64

Note that 24 hashbits performs worse than 23. The reason
could be that 23 hashbits already give 8M buckets (close to
the 9M entries in the types table). A higher number of
hashbits allocates more memory and becomes less cache
efficient than 23 hashbits.
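
To put rough numbers on the memory side, here is a sketch that
computes the size of the two bucket arrays (tags and types). It
assumes each bucket head is a single pointer; toy_tables_bytes and
its ptr_size parameter are illustrative, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for the two bucket arrays (tags and types), assuming
 * one pointer of ptr_size bytes per bucket head.
 */
static uint64_t toy_tables_bytes(uint32_t bits, uint64_t ptr_size)
{
	assert(bits < 32);
	return 2 * (UINT64_C(1) << bits) * ptr_size;
}
```

With 8-byte pointers this gives 512 KiB of bucket heads at 15 bits,
32 MiB at 21 bits and 256 MiB at 24 bits, before counting any of the
entries themselves.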

This patch picks 21 hashbits as the starting value
and tries to allocate memory based on that. If the memory
allocation fails, we retry with fewer hashbits until
we reach 15 hashbits, which is the default for the
non-merge_cus case.
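
The fallback can be sketched in isolation as follows. This is a
simplified model, not the code in the diff below: it allocates one
table instead of two, assumes 8 bytes per bucket head, keeps the bit
count in a local variable rather than in the global hashtags__bits,
and takes the allocator as a parameter so a failing one can be
plugged in. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocator signature, so a failing allocator can be swapped in. */
typedef void *(*toy_alloc_fn)(size_t size);

/*
 * Try max_bits first and step down to min_bits until an allocation
 * succeeds. Returns the table (caller frees) and stores the bits used
 * in *chosen_bits, or returns NULL if even min_bits failed.
 */
static void *toy_alloc_table(uint32_t max_bits, uint32_t min_bits,
			     toy_alloc_fn alloc, uint32_t *chosen_bits)
{
	assert(min_bits <= max_bits && max_bits <= 28);

	for (uint32_t bits = max_bits; ; bits--) {
		void *table = alloc((size_t)8 << bits);

		if (table != NULL) {
			*chosen_bits = bits;
			return table;
		}
		if (bits == min_bits)
			break; /* stop before going below the default */
	}
	return NULL;
}

/* Hypothetical allocator that refuses anything above 1 MiB. */
static void *toy_capped_malloc(size_t size)
{
	return size <= ((size_t)1 << 20) ? malloc(size) : NULL;
}

/* Hypothetical allocator that always fails. */
static void *toy_failing_malloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Convenience wrapper: the bits that worked, or 0 on total failure. */
static uint32_t toy_pick_bits(uint32_t max_bits, uint32_t min_bits,
			      toy_alloc_fn alloc)
{
	uint32_t bits = 0;
	void *table = toy_alloc_table(max_bits, min_bits, alloc, &bits);
	int ok = table != NULL;

	free(table); /* free(NULL) is a no-op */
	return ok ? bits : 0;
}
```

With the 1 MiB cap, toy_pick_bits(21, 15, toy_capped_malloc) settles
on 17 bits, since 8 << 17 is exactly 1 MiB; with the always-failing
allocator it returns 0, the case where the patch returns
DWARF_CB_ABORT.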

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
 dwarves.h      |  2 ++
 pahole.c       |  8 +++++
 3 files changed, 100 insertions(+)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index dc66df0..ed4f0da 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -51,6 +51,7 @@ struct strings *strings;
 #endif
 
 static uint32_t hashtags__bits = 15;
+static uint32_t max_hashtags__bits = 21;
 
 uint32_t hashtags__fn(Dwarf_Off key)
 {
@@ -2484,6 +2485,85 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 	return 0;
 }
 
+static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
+				     Dwfl_Module *mod, Dwarf *dw, Elf *elf,
+				     const char *filename,
+				     const unsigned char *build_id,
+				     int build_id_len,
+				     struct dwarf_cu *type_dcu)
+{
+	uint8_t pointer_size, offset_size;
+	struct dwarf_cu *dcu = NULL;
+	Dwarf_Off off = 0, noff;
+	struct cu *cu = NULL;
+	size_t cuhl;
+
+	/* Merge all cus */
+	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
+			    &offset_size) == 0) {
+		Dwarf_Die die_mem;
+		Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
+
+		if (cu_die == NULL)
+			break;
+
+		if (cu == NULL) {
+			cu = cu__new("", pointer_size, build_id, build_id_len,
+				     filename);
+			if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
+				return DWARF_CB_ABORT;
+
+			dcu = malloc(sizeof(struct dwarf_cu));
+			if (dcu == NULL)
+				return DWARF_CB_ABORT;
+
+			/* Merged cu tends to need a lot more memory.
+			 * Let us start with max_hashtags__bits and
+			 * go down to find a proper hashtag bit value.
+			 */
+			uint32_t default_hbits = hashtags__bits;
+			for (hashtags__bits = max_hashtags__bits;
+			     hashtags__bits >= default_hbits;
+			     hashtags__bits--) {
+				if (dwarf_cu__init(dcu) == 0)
+					break;
+			}
+			if (hashtags__bits < default_hbits)
+				return DWARF_CB_ABORT;
+
+			dcu->cu = cu;
+			dcu->type_unit = type_dcu;
+			cu->priv = dcu;
+			cu->dfops = &dwarf__ops;
+			cu->language = attr_numeric(cu_die, DW_AT_language);
+		}
+
+		const uint16_t tag = dwarf_tag(cu_die);
+		if (tag != DW_TAG_compile_unit && tag != DW_TAG_type_unit) {
+			fprintf(stderr, "%s: DW_TAG_compile_unit or DW_TAG_type_unit expected got %s!\n",
+				__FUNCTION__, dwarf_tag_name(tag));
+			return DWARF_CB_ABORT;
+		}
+
+		Dwarf_Die child;
+		if (dwarf_child(cu_die, &child) == 0) {
+			if (die__process_unit(&child, cu) != 0)
+				return DWARF_CB_ABORT;
+		}
+
+		off = noff;
+	}
+
+	/* process merged cu */
+	if (cu__recode_dwarf_types(cu) != LSK__KEEPIT)
+		return DWARF_CB_ABORT;
+	if (finalize_cu_immediately(cus, cu, dcu, conf)
+	    == LSK__STOP_LOADING)
+		return DWARF_CB_ABORT;
+
+	return 0;
+}
+
 static int cus__load_module(struct cus *cus, struct conf_load *conf,
 			    Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 			    const char *filename)
@@ -2518,6 +2598,15 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		}
 	}
 
+	if (conf->merge_cus == true) {
+		res = cus__merge_and_process_cu(cus, conf, mod, dw, elf, filename,
+						build_id, build_id_len,
+						type_cu ? &type_dcu : NULL);
+		if (res != 0)
+			return res;
+		goto out;
+	}
+
 	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
 			    &offset_size) == 0) {
 		Dwarf_Die die_mem;
@@ -2557,6 +2646,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		off = noff;
 	}
 
+out:
 	if (type_lsk == LSK__DELETE)
 		cu__delete(type_cu);
 
diff --git a/dwarves.h b/dwarves.h
index 98caf1a..29b518d 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -40,6 +40,7 @@ struct conf_fprintf;
  * @extra_dbg_info - keep original debugging format extra info
  *		     (e.g. DWARF's decl_{line,file}, id, etc)
  * @fixup_silly_bitfields - Fixup silly things such as "int foo:32;"
+ * @merge_cus - Merge compile units except possible types_cu
  * @get_addr_info - wheter to load DW_AT_location and other addr info
  */
 struct conf_load {
@@ -50,6 +51,7 @@ struct conf_load {
 	bool			extra_dbg_info;
 	bool			fixup_silly_bitfields;
 	bool			get_addr_info;
+	bool			merge_cus;
 	struct conf_fprintf	*conf_fprintf;
 };
 
diff --git a/pahole.c b/pahole.c
index df6aa83..29fbe1d 100644
--- a/pahole.c
+++ b/pahole.c
@@ -827,6 +827,7 @@ ARGP_PROGRAM_VERSION_HOOK_DEF = dwarves_print_version;
 #define ARGP_btf_base		   321
 #define ARGP_btf_gen_floats	   322
 #define ARGP_btf_gen_all	   323
+#define ARGP_merge_cus		   324
 
 static const struct argp_option pahole__options[] = {
 	{
@@ -1151,6 +1152,11 @@ static const struct argp_option pahole__options[] = {
 		.key  = ARGP_numeric_version,
 		.doc  = "Print a numeric version, i.e. 119 instead of v1.19"
 	},
+	{
+		.name = "merge_cus",
+		.key  = ARGP_merge_cus,
+		.doc  = "Merge all cus (except possible types_cu)"
+	},
 	{
 		.name = NULL,
 	}
@@ -1270,6 +1276,8 @@ static error_t pahole__options_parser(int key, char *arg,
 		btf_gen_floats = true;			break;
 	case ARGP_btf_gen_all:
 		btf_gen_floats = true;			break;
+	case ARGP_merge_cus:
+		conf_load.merge_cus = true;		break;
 	default:
 		return ARGP_ERR_UNKNOWN;
 	}
-- 
2.30.2



* Re: [PATCH dwarves 0/3] add option to merge more dwarf cu's into
  2021-03-25  6:53 [PATCH dwarves 0/3] add option to merge more dwarf cu's into Yonghong Song
                   ` (2 preceding siblings ...)
  2021-03-25  6:53 ` [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu Yonghong Song
@ 2021-03-25 13:10 ` Arnaldo Carvalho de Melo
  2021-03-26  1:41   ` Yonghong Song
  3 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-25 13:10 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

On Wed, Mar 24, 2021 at 11:53:16PM -0700, Yonghong Song wrote:
> For vmlinux built with clang thin-lto or lto for latest bpf-next,
> there exist cross cu debuginfo type references. For example,
>       compile unit 1:
>          tag 10:  type A
>       compile unit 2:
>          ...
>            refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
> I am using latest llvm trunk and bpf-next. I suspect llvm12 or
> linus tree >= 5.12 rc2 should be able to exhibit the issue as well.
> Both thin-lto and lto have the same issues.
> 
> Current pahole cannot handle this. It will report types cannot
> be found error. Bill Wendling has attempted to fix the issue
> with [1] by permitting all tags/types are hashed to the same
> hash table and then process cu's one by one. This does not
> really work. The reason is that each cu resolves types locally
> so for the above example we may have
>   compile unit 1:
>     type A : type_id = 10
>   compile unit 2:
>     refer to type A : type A will be resolved as type id = 10
> But id 10 refers to compile unit 1, we will get either out
> of bound type id or incorrect one.
> 
> This patch set is a continuation of Bill's work. We still
> increase the hashtable size and traverse all cu's before
> recoding and finalization. But instead of creating one-to-one
> mapping between debuginfo cu and pahole cu, we just create
> one pahole cu, which should solve the above incorrect type
> id issue.
> 
> Patches #1 and #2 are refactoring the existing code
> and Patch #3 added an option "merge_cus" to permit
> merging all debuginfo cu's into one pahole cu.
> For vmlinux built, it can be detected that if LTO or Thin-LTO
> is enabled, "merge_cus" can be added into pahole
> command line.
> 
>   [1] https://www.spinics.net/lists/dwarves/msg00999.html

Thanks for working on this, I'll look at it today.

- Arnaldo


* Re: [PATCH dwarves 0/3] add option to merge more dwarf cu's into
  2021-03-25 13:10 ` [PATCH dwarves 0/3] add option to merge more dwarf cu's into Arnaldo Carvalho de Melo
@ 2021-03-26  1:41   ` Yonghong Song
  0 siblings, 0 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-26  1:41 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2248 bytes --]



On 3/25/21 6:10 AM, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 24, 2021 at 11:53:16PM -0700, Yonghong Song wrote:
>> For vmlinux built with clang thin-lto or lto for latest bpf-next,
>> there exist cross cu debuginfo type references. For example,
>>        compile unit 1:
>>           tag 10:  type A
>>        compile unit 2:
>>           ...
>>             refer to type A (tag 10 in compile unit 1)
>> I only checked a few but have seen type A may be a simple type
>> like "unsigned char" or a complex type like an array of base types.
>> I am using latest llvm trunk and bpf-next. I suspect llvm12 or
>> linus tree >= 5.12 rc2 should be able to exhibit the issue as well.
>> Both thin-lto and lto have the same issues.
>>
>> Current pahole cannot handle this. It will report types cannot
>> be found error. Bill Wendling has attempted to fix the issue
>> with [1] by permitting all tags/types are hashed to the same
>> hash table and then process cu's one by one. This does not
>> really work. The reason is that each cu resolves types locally
>> so for the above example we may have
>>    compile unit 1:
>>      type A : type_id = 10
>>    compile unit 2:
>>      refer to type A : type A will be resolved as type id = 10
>> But id 10 refers to compile unit 1, we will get either out
>> of bound type id or incorrect one.
>>
>> This patch set is a continuation of Bill's work. We still
>> increase the hashtable size and traverse all cu's before
>> recoding and finalization. But instead of creating one-to-one
>> mapping between debuginfo cu and pahole cu, we just create
>> one pahole cu, which should solve the above incorrect type
>> id issue.
>>
>> Patches #1 and #2 are refactoring the existing code
>> and Patch #3 added an option "merge_cus" to permit
>> merging all debuginfo cu's into one pahole cu.
>> For vmlinux built, it can be detected that if LTO or Thin-LTO
>> is enabled, "merge_cus" can be added into pahole
>> command line.
>>
>>    [1] https://www.spinics.net/lists/dwarves/msg00999.html
> 
> Thanks for working on this, I'll look at it today.

Thanks! In case you want to test with the kernel, I attached a
patch on top of bpf-next that uses --merge_cus when building the
kernel and modules.

> 
> - Arnaldo
> 

[-- Attachment #2: 0001-scripts-bpf-add-pahole-merge_cus-support.patch --]
[-- Type: text/plain, Size: 2300 bytes --]

From 0dfb561a14a9eb1c5bd077fb9b4729455dbb5ec4 Mon Sep 17 00:00:00 2001
From: Yonghong Song <yhs@fb.com>
Date: Thu, 25 Mar 2021 18:15:38 -0700
Subject: [PATCH] scripts: bpf: add pahole --merge_cus support

The following is the command line I used to build the kernel:
  make LLVM=1 LLVM_IAS=1 -j20 && make LLVM=1 LLVM_IAS=1 -j60 vmlinux
Make sure your config has CONFIG_LTO_CLANG_THIN enabled.
You may also try CONFIG_LTO_CLANG_FULL, but on my box it takes
quite some time, and the llvm linker (ld.lld) takes more than
13 minutes.

The following is the command line to build the bpf selftests:
  make -C tools/testing/selftests/bpf -j60 LLVM=1

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 scripts/Makefile.modfinal | 9 ++++++++-
 scripts/link-vmlinux.sh   | 7 ++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.modfinal b/scripts/Makefile.modfinal
index 735e11e9041b..5fc9d91c6976 100644
--- a/scripts/Makefile.modfinal
+++ b/scripts/Makefile.modfinal
@@ -47,6 +47,13 @@ cmd_ld_ko_o +=								\
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+ifdef CONFIG_LTO_CLANG_THIN
+merge_cus = "--merge_cus"
+endif
+ifdef CONFIG_LTO_CLANG_FULL
+merge_cus = "--merge_cus"
+endif
+
 endif # CONFIG_LTO_CLANG
 
 quiet_cmd_ld_ko_o = LD [M]  $@
@@ -59,7 +66,7 @@ quiet_cmd_ld_ko_o = LD [M]  $@
 quiet_cmd_btf_ko = BTF [M] $@
       cmd_btf_ko = 							\
 	if [ -f vmlinux ]; then						\
-		LLVM_OBJCOPY=$(OBJCOPY) $(PAHOLE) -J --btf_base vmlinux $@; \
+		LLVM_OBJCOPY=$(OBJCOPY) $(PAHOLE) -J --btf_base vmlinux $(merge_cus) $@; \
 	else								\
 		printf "Skipping BTF generation for %s due to unavailability of vmlinux\n" $@ 1>&2; \
 	fi;
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index 3b261b0f74f0..6b52c86acdad 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -227,8 +227,13 @@ gen_btf()
 
 	vmlinux_link ${1}
 
+	merge_cus=
+	if [ -n "${CONFIG_LTO_CLANG_THIN}" -o -n "${CONFIG_LTO_CLANG_FULL}" ]; then
+		merge_cus="--merge_cus"
+	fi
+
 	info "BTF" ${2}
-	LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1}
+	LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1} ${merge_cus}
 
 	# Create ${2} which contains just .BTF section but no symbols. Add
 	# SHF_ALLOC because .BTF will be part of the vmlinux image. --strip-all
-- 
2.30.2



* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-25  6:53 ` [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu Yonghong Song
@ 2021-03-26 14:41   ` Arnaldo Carvalho de Melo
  2021-03-26 15:18     ` Yonghong Song
  2021-03-26 15:18     ` Arnaldo Carvalho de Melo
  2021-03-26 23:21   ` Andrii Nakryiko
  1 sibling, 2 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-26 14:41 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
> This patch added an option "merge_cus", which will permit
> to merge all debug info cu's into one pahole cu.
> For vmlinux built with clang thin-lto or lto, there exist
> cross cu type references. For example, you could have
>   compile unit 1:
>      tag 10:  type A
>   compile unit 2:
>      ...
>        refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
> 
> There are two different ways to resolve this issue:
> (1). merge all compile units as one pahole cu so tags/types
>      can be resolved easily, or
> (2). try to do on-demand type traversal in other debuginfo cu's
>      when we do die_process().
> The method (2) is much more complicated so I picked method (1).
> An option "merge_cus" is added to permit such an operation.
> 
> Merging cu's will create a single cu with lots of types, tags
> and functions. For example with clang thin-lto built vmlinux,
> I saw 9M entries in types table, 5.2M in tags table. The
> below are pahole wallclock time for different hashbits:
> command line: time pahole -J --merge_cus vmlinux
>       # of hashbits            wallclock time in seconds
>           15                       460
>           16                       255
>           17                       131
>           18                       97
>           19                       75
>           20                       69
>           21                       64
>           22                       62
>           23                       58
>           24                       64
> 
> Note that the number of hashbits 24 makes performance worse
> than 23. The reason could be that 23 hashbits can cover 8M
> buckets (close to 9M for the number of entries in types table).
> Higher number of hash bits allocates more memory and becomes
> less cache efficient compared to 23 hashbits.
> 
> This patch picks # of hashbits 21 as the starting value
> and will try to allocate memory based on that, if memory
> allocation fails, we will go with less hashbits until
> we reach hashbits 15 which is the default for
> non merge-cu case.

I'll probably add a way to specify the starting max_hashbits, so that
'perf stat' can be used to show what causes the performance difference.

I'm also adding the man page patch below. Next, I'll build the kernel
with your bpf-next patch to test it.

- Arnaldo

[acme@five pahole]$ git diff
diff --git a/man-pages/pahole.1 b/man-pages/pahole.1
index cbbefbf22556412c..1be2a293ad4bcc50 100644
--- a/man-pages/pahole.1
+++ b/man-pages/pahole.1
@@ -208,6 +208,12 @@ information has float types.
 .B \-\-btf_gen_all
 Allow using all the BTF features supported by pahole.

+.TP
+.B \-\-merge_cus
+Merge all cus (except possible types_cu) when loading DWARF, this is needed
+when processing files that have inter-CU references, this happens, for instance
+when building the Linux kernel with clang using thin-LTO or LTO.
+
 .TP
 .B \-l, \-\-show_first_biggest_size_base_type_member
 Show first biggest size base_type member.
[acme@five pahole]$


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 14:41   ` Arnaldo Carvalho de Melo
@ 2021-03-26 15:18     ` Yonghong Song
  2021-03-26 17:35       ` Arnaldo Carvalho de Melo
  2021-03-26 18:19       ` Arnaldo Carvalho de Melo
  2021-03-26 15:18     ` Arnaldo Carvalho de Melo
  1 sibling, 2 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-26 15:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team



On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
>> This patch added an option "merge_cus", which will permit
>> to merge all debug info cu's into one pahole cu.
>> For vmlinux built with clang thin-lto or lto, there exist
>> cross cu type references. For example, you could have
>>    compile unit 1:
>>       tag 10:  type A
>>    compile unit 2:
>>       ...
>>         refer to type A (tag 10 in compile unit 1)
>> I only checked a few but have seen type A may be a simple type
>> like "unsigned char" or a complex type like an array of base types.
>>
>> There are two different ways to resolve this issue:
>> (1). merge all compile units as one pahole cu so tags/types
>>       can be resolved easily, or
>> (2). try to do on-demand type traversal in other debuginfo cu's
>>       when we do die_process().
>> The method (2) is much more complicated so I picked method (1).
>> An option "merge_cus" is added to permit such an operation.
>>
>> Merging cu's will create a single cu with lots of types, tags
>> and functions. For example with clang thin-lto built vmlinux,
>> I saw 9M entries in types table, 5.2M in tags table. The
>> below are pahole wallclock time for different hashbits:
>> command line: time pahole -J --merge_cus vmlinux
>>        # of hashbits            wallclock time in seconds
>>            15                       460
>>            16                       255
>>            17                       131
>>            18                       97
>>            19                       75
>>            20                       69
>>            21                       64
>>            22                       62
>>            23                       58
>>            24                       64
>>
>> Note that the number of hashbits 24 makes performance worse
>> than 23. The reason could be that 23 hashbits can cover 8M
>> buckets (close to 9M for the number of entries in types table).
>> Higher number of hash bits allocates more memory and becomes
>> less cache efficient compared to 23 hashbits.
>>
>> This patch picks # of hashbits 21 as the starting value
>> and will try to allocate memory based on that, if memory
>> allocation fails, we will go with less hashbits until
>> we reach hashbits 15 which is the default for
>> non merge-cu case.
> 
> I'll probably add a way to specify the starting max_hashbits to be able
> to use 'perf stat' to show what causes the performance difference.

The problem is with hashtags__find(), esp. the loop

         uint32_t bucket = hashtags__fn(id);
         const struct hlist_head *head = hashtable + bucket;

         hlist_for_each_entry(tpos, pos, head, hash_node) {
                 if (tpos->id == id)
                         return tpos;
         }

Say we have 8M types and (1 << 15) buckets, that means
each bucket will have 256 elements. So each lookup will traverse
the loop 128 iterations on average.

If we have 1 << 21 buckets, then each bucket will have 4 elements,
and the average number of loop iterations for hashtags__find()
will be 2.
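
A rough sketch of that arithmetic, not pahole code: it assumes ids spread
evenly over the buckets and that a successful lookup walks about half of
its chain. The helper names avg_chain_len() and avg_lookup_iterations()
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Average chain length if `entries` ids spread evenly over (1 << bits) buckets. */
static uint64_t avg_chain_len(uint64_t entries, unsigned int bits)
{
	assert(bits < 64);
	return entries >> bits;
}

/* A successful lookup walks about half of its chain on average. */
static uint64_t avg_lookup_iterations(uint64_t entries, unsigned int bits)
{
	return avg_chain_len(entries, bits) / 2;
}
```

Taking 8M as 1 << 23, avg_chain_len(1ULL << 23, 21) gives 4 and
avg_lookup_iterations(1ULL << 23, 21) gives 2, matching the 1 << 21 case
above.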

If the patch needs respin, I can add the above descriptions
in the commit message.

> 
> I'm also adding the man page patch below, now to build the kernel with
> your bpf-next patch to test it.

Thanks for adding man page and testing, let me know if you
need any help!

> 
> - Arnaldo
> 
> [acme@five pahole]$ git diff
> diff --git a/man-pages/pahole.1 b/man-pages/pahole.1
> index cbbefbf22556412c..1be2a293ad4bcc50 100644
> --- a/man-pages/pahole.1
> +++ b/man-pages/pahole.1
> @@ -208,6 +208,12 @@ information has float types.
>   .B \-\-btf_gen_all
>   Allow using all the BTF features supported by pahole.
> 
> +.TP
> +.B \-\-merge_cus
> +Merge all cus (except possible types_cu) when loading DWARF, this is needed
> +when processing files that have inter-CU references, this happens, for instance
> +when building the Linux kernel with clang using thin-LTO or LTO.
> +
>   .TP
>   .B \-l, \-\-show_first_biggest_size_base_type_member
>   Show first biggest size base_type member.
> [acme@five pahole]$
> 


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 14:41   ` Arnaldo Carvalho de Melo
  2021-03-26 15:18     ` Yonghong Song
@ 2021-03-26 15:18     ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-26 15:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Fri, Mar 26, 2021 at 11:41:32AM -0300, Arnaldo Carvalho de Melo escreveu:
> I'm also adding the man page patch below, now to build the kernel with
> your bpf-next patch to test it.

[acme@five bpf]$ grep CONFIG_CLANG ../build/bpf_clang_thin_lto/.config
CONFIG_CLANG_VERSION=110000
[acme@five bpf]$ grep CLANG ../build/bpf_clang_thin_lto/.config
CONFIG_CC_IS_CLANG=y
CONFIG_CLANG_VERSION=110000
CONFIG_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_HAS_LTO_CLANG=y
# CONFIG_LTO_CLANG_FULL is not set
CONFIG_LTO_CLANG_THIN=y
[acme@five bpf]$


Building now.

- Arnaldo


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 15:18     ` Yonghong Song
@ 2021-03-26 17:35       ` Arnaldo Carvalho de Melo
  2021-03-26 18:19       ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-26 17:35 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song escreveu:
> 
> 
> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song escreveu:
> > > This patch added an option "merge_cus", which will permit
> > > to merge all debug info cu's into one pahole cu.
> > > For vmlinux built with clang thin-lto or lto, there exist
> > > cross cu type references. For example, you could have
> > >    compile unit 1:
> > >       tag 10:  type A
> > >    compile unit 2:
> > >       ...
> > >         refer to type A (tag 10 in compile unit 1)
> > > I only checked a few but have seen type A may be a simple type
> > > like "unsigned char" or a complex type like an array of base types.
> > > 
> > > There are two different ways to resolve this issue:
> > > (1). merge all compile units as one pahole cu so tags/types
> > >       can be resolved easily, or
> > > (2). try to do on-demand type traversal in other debuginfo cu's
> > >       when we do die_process().
> > > The method (2) is much more complicated so I picked method (1).
> > > An option "merge_cus" is added to permit such an operation.
> > > 
> > > Merging cu's will create a single cu with lots of types, tags
> > > and functions. For example with clang thin-lto built vmlinux,
> > > I saw 9M entries in types table, 5.2M in tags table. The
> > > below are pahole wallclock time for different hashbits:
> > > command line: time pahole -J --merge_cus vmlinux
> > >        # of hashbits            wallclock time in seconds
> > >            15                       460
> > >            16                       255
> > >            17                       131
> > >            18                       97
> > >            19                       75
> > >            20                       69
> > >            21                       64
> > >            22                       62
> > >            23                       58
> > >            24                       64
> > > 
> > > Note that the number of hashbits 24 makes performance worse
> > > than 23. The reason could be that 23 hashbits can cover 8M
> > > buckets (close to 9M for the number of entries in types table).
> > > Higher number of hash bits allocates more memory and becomes
> > > less cache efficient compared to 23 hashbits.
> > > 
> > > This patch picks # of hashbits 21 as the starting value
> > > and will try to allocate memory based on that, if memory
> > > allocation fails, we will go with less hashbits until
> > > we reach hashbits 15 which is the default for
> > > non merge-cu case.
> > 
> > I'll probably add a way to specify the starting max_hashbits to be able
> > to use 'perf stat' to show what causes the performance difference.
> 
> The problem is with hashtags__find(), esp. the loop
> 
>         uint32_t bucket = hashtags__fn(id);
>         const struct hlist_head *head = hashtable + bucket;
> 
>         hlist_for_each_entry(tpos, pos, head, hash_node) {
>                 if (tpos->id == id)
>                         return tpos;
>         }
> 
> Say we have 8M types and (1 << 15) buckets, that means
> each bucket will have 256 elements. So each lookup will traverse
> the loop 128 iterations on average.
> 
> If we have 1 << 21 buckets, then each bucket will have 4 elements,
> and the average number of loop iterations for hashtags__find()
> will be 2.
> 
> If the patch needs respin, I can add the above descriptions
> in the commit message.

I can add that, as a comment.

- Arnaldo


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 15:18     ` Yonghong Song
  2021-03-26 17:35       ` Arnaldo Carvalho de Melo
@ 2021-03-26 18:19       ` Arnaldo Carvalho de Melo
  2021-03-26 23:05         ` Yonghong Song
  1 sibling, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-26 18:19 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song escreveu:
> 
> 
> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song escreveu:
> > I'm also adding the man page patch below, now to build the kernel with
> > your bpf-next patch to test it.
 
> Thanks for adding man page and testing, let me know if you
> need any help!

So, this is also needed if the vmlinux was built with LTO:

[acme@seventh pahole]$ git diff btfdiff
diff --git a/btfdiff b/btfdiff
index 4db703245e7d..440241de7c2e 100755
--- a/btfdiff
+++ b/btfdiff
@@ -18,6 +18,7 @@ dwarf_output=$(mktemp /tmp/btfdiff.dwarf.XXXXXX)
 pahole_bin=${PAHOLE-"pahole"}

 ${pahole_bin} -F dwarf \
+             --merge_cus \
              --flat_arrays \
              --suppress_aligned_attribute \
              --suppress_force_paddings \
[acme@seventh pahole]$

After that we're down to this diff, which probably isn't related to the
patches being tested, but some difference in how clang encodes this in
DWARF and then how the BTF encoder does it, or perhaps some problem in
the dwarves_fprintf.c routine, I'll check:

[acme@seventh pahole]$ ./btfdiff vmlinux
--- /tmp/btfdiff.dwarf.ik3LN3	2021-03-26 15:08:05.833806712 -0300
+++ /tmp/btfdiff.btf.69SSZs	2021-03-26 15:08:06.124802727 -0300
@@ -67233,7 +67233,7 @@ struct cpu_rmap {
 	struct {
 		u16                index;                /*    16     2 */
 		u16                dist;                 /*    18     2 */
-	} near[0]; /*    16     0 */
+	} near[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 5 */
 	/* last cacheline: 16 bytes */
@@ -101159,7 +101159,7 @@ struct linux_efi_memreserve {
 	struct {
 		phys_addr_t        base;                 /*    16     8 */
 		phys_addr_t        size;                 /*    24     8 */
-	} entry[0]; /*    16     0 */
+	} entry[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 4 */
 	/* last cacheline: 16 bytes */
@@ -113494,7 +113494,7 @@ struct netlink_policy_dump_state {
 	struct {
 		const struct nla_policy  * policy;       /*    16     8 */
 		unsigned int       maxtype;              /*    24     4 */
-	} policies[0]; /*    16     0 */
+	} policies[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 4 */
 	/* sum members: 12, holes: 1, sum holes: 4 */
[acme@seventh pahole]$

But we need to find a way to discover if the costly --merge_cus needs to
be used...

For the kernel it's just a matter of looking if that CONFIG_ asking for
one of the CLANG LTO variants is present, but for pahole users wanting
to work with a LTO vmlinux this gets confusing as it crashes, perhaps I
need to count how many lookups fail, fix the segfaults and at the end
emit a warning...

OR we can look at...

[acme@five bpf]$ eu-readelf -winfo ../build/bpf_clang_thin_lto/vmlinux | grep -i producer -m1
           producer             (strp) "clang version 11.0.0 (Fedora 11.0.0-2.fc33)"
[acme@five bpf]$

oops, it seems a kernel built with clang doesn't come with the compiler
options used like when using gcc:

[acme@five bpf]$ eu-readelf -winfo ../build/v5.12.0-rc4+/vmlinux | grep -i producer -m2
           producer             (strp) "GNU AS 2.35"
           producer             (strp) "GNU C89 10.2.1 20201125 (Red Hat 10.2.1-9) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -gdwarf-4 -O2 -std=gnu90 -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -fcf-protection=none -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-strict-overflow -fstack-check=no -fconserve-stack -fno-stack-protector"
[acme@five bpf]$

Humm, can't we automagically detect that we need to merge the CUs and do
it if needed?

Have to go AFK now, will try to think about it while driving Pedro from
school...

Did a last test, may be unrelated:

[acme@five pahole]$ fullcircle ./tcp_ipv4.o
/home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
[acme@five pahole]$ pahole --help | grep merge
      --merge_cus            Merge all cus (except possible types_cu)
[acme@five pahole]$


- Arnaldo


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 18:19       ` Arnaldo Carvalho de Melo
@ 2021-03-26 23:05         ` Yonghong Song
  2021-03-26 23:12           ` Alexei Starovoitov
  2021-03-29 14:04           ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-26 23:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team



On 3/26/21 11:19 AM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song escreveu:
>>
>>
>> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
>>> Em Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song escreveu:
>>> I'm also adding the man page patch below, now to build the kernel with
>>> your bpf-next patch to test it.
>   
>> Thanks for adding man page and testing, let me know if you
>> need any help!
> 
> So, this is also needed if the vmlinux was built with LTO:
> 
> [acme@seventh pahole]$ git diff btfdiff
> diff --git a/btfdiff b/btfdiff
> index 4db703245e7d..440241de7c2e 100755
> --- a/btfdiff
> +++ b/btfdiff
> @@ -18,6 +18,7 @@ dwarf_output=$(mktemp /tmp/btfdiff.dwarf.XXXXXX)
>   pahole_bin=${PAHOLE-"pahole"}
> 
>   ${pahole_bin} -F dwarf \
> +             --merge_cus \
>                --flat_arrays \
>                --suppress_aligned_attribute \
>                --suppress_force_paddings \
> [acme@seventh pahole]$
> 
> After that we're down to this diff, which probably isn't related to the
> patches being tested, but some difference in how clang encodes this in
> DWARF and then how the BTF encoder does it, or perhaps some problem in
> the dwarves_fprintf.c routine, I'll check:
> 
> [acme@seventh pahole]$ ./btfdiff vmlinux
> --- /tmp/btfdiff.dwarf.ik3LN3	2021-03-26 15:08:05.833806712 -0300
> +++ /tmp/btfdiff.btf.69SSZs	2021-03-26 15:08:06.124802727 -0300
> @@ -67233,7 +67233,7 @@ struct cpu_rmap {
>   	struct {
>   		u16                index;                /*    16     2 */
>   		u16                dist;                 /*    18     2 */
> -	} near[0]; /*    16     0 */
> +	} near[]; /*    16     0 */
> 
>   	/* size: 16, cachelines: 1, members: 5 */
>   	/* last cacheline: 16 bytes */
> @@ -101159,7 +101159,7 @@ struct linux_efi_memreserve {
>   	struct {
>   		phys_addr_t        base;                 /*    16     8 */
>   		phys_addr_t        size;                 /*    24     8 */
> -	} entry[0]; /*    16     0 */
> +	} entry[]; /*    16     0 */
> 
>   	/* size: 16, cachelines: 1, members: 4 */
>   	/* last cacheline: 16 bytes */
> @@ -113494,7 +113494,7 @@ struct netlink_policy_dump_state {
>   	struct {
>   		const struct nla_policy  * policy;       /*    16     8 */
>   		unsigned int       maxtype;              /*    24     4 */
> -	} policies[0]; /*    16     0 */
> +	} policies[]; /*    16     0 */
> 
>   	/* size: 16, cachelines: 1, members: 4 */
>   	/* sum members: 12, holes: 1, sum holes: 4 */
> [acme@seventh pahole]$
> 
> But we need to find a way to discover if the costly --merge_cus needs to
> be used...
> 
> For the kernel it's just a matter of looking if that CONFIG_ asking for
> one of the CLANG LTO variants is present, but for pahole users wanting
> to work with a LTO vmlinux this gets confusing as it crashes, perhaps I
> need to count how many lookups fail, fix the segfaults and at the end
> emit a warning...
> 
> OR we can look at...
> 
> [acme@five bpf]$ eu-readelf -winfo ../build/bpf_clang_thin_lto/vmlinux | grep -i producer -m1
>             producer             (strp) "clang version 11.0.0 (Fedora 11.0.0-2.fc33)"
> [acme@five bpf]$
> 
> oops, it seems a kernel built with clang doesn't come with the compiler
> options used like when using gcc:
> 
> [acme@five bpf]$ eu-readelf -winfo ../build/v5.12.0-rc4+/vmlinux | grep -i producer -m2
>             producer             (strp) "GNU AS 2.35"
>             producer             (strp) "GNU C89 10.2.1 20201125 (Red Hat 10.2.1-9) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -gdwarf-4 -O2 -std=gnu90 -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -fcf-protection=none -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-strict-overflow -fstack-check=no -fconserve-stack -fno-stack-protector"
> [acme@five bpf]$
> 
> Humm, can't we automagically detect that we need to merge the CUs and do
> it if needed?

This is a good question. In the beginning, I wanted to automatically
detect lto mode as well so we don't have to invent this option.
Since we cannot get hints from the dwarf, the only thing we can do is
to actually scan through each cu and, if somehow we cannot resolve
a tag, then try the merging-cu mechanism. This is a little
bit heavyweight. That is why I invented this option.

Now since you found gcc actually has flags in the dwarf producer tag which
will tell whether lto is used, I went to the clang side and found that
the following flag is needed in clang in order to embed flags in
the producer tag:
    -grecord-gcc-switches

So I am going to make the following changes:
   In pahole:
      - check one DW_AT_producer, if lto flag is in flags,
        pahole will merge cus,
      - otherwise, old way, one cu at a time.
   In Linux:
      - add flag -grecord-gcc-switches if clang lto is enabled.

Then just for vmlinux-lto, we won't need merge_cus option.
But for other lto built binaries without -grecord-gcc-switches,
pahole will not work. Maybe we still need --merge_cus option
eventually, but we can delay this until a later point.
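
A minimal sketch of the producer check outlined above, not the v2 code.
It assumes the build recorded its command line in DW_AT_producer (e.g.
with -grecord-gcc-switches) and that LTO shows up there as "-flto" or
"-flto=<mode>". The helper name producer_has_lto_flag() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when a DW_AT_producer string carries
 * an -flto switch. Assumes the compiler recorded its command line there.
 */
static bool producer_has_lto_flag(const char *producer)
{
	const char *p = producer;

	if (producer == NULL)
		return false;

	while ((p = strstr(p, "-flto")) != NULL) {
		/* must start a switch: beginning of string or after a space */
		bool starts_switch = (p == producer) || (p[-1] == ' ');
		char next = p[5];

		/* accept "-flto" and "-flto=thin", reject longer names */
		if (starts_switch && (next == '\0' || next == ' ' || next == '='))
			return true;
		p += 5;
	}
	return false;
}
```

With the clang producer string shown earlier in the thread, which has no
switches recorded, this returns false, so such a binary would still need
an explicit --merge_cus.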

Any further suggestions? I will start to do a v2 based on
my above outline.

> 
> Have to go AFK now, will try to think about it while driving Pedro from
> school...
> 
> Did a last test, may be unrelated:
> 
> [acme@five pahole]$ fullcircle ./tcp_ipv4.o
> /home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output

The .o file in an lto build is not really an ELF .o; it is LLVM internal
IR bitcode.

> [acme@five pahole]$ pahole --help | grep merge
>        --merge_cus            Merge all cus (except possible types_cu)
> [acme@five pahole]$
> 
> 
> - Arnaldo
> 


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 23:05         ` Yonghong Song
@ 2021-03-26 23:12           ` Alexei Starovoitov
  2021-03-26 23:17             ` Yonghong Song
  2021-03-29 14:04           ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2021-03-26 23:12 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo, dwarves,
	Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	Kernel Team

On Fri, Mar 26, 2021 at 4:05 PM Yonghong Song <yhs@fb.com> wrote:
>
> Now since you found gcc actually has flags in the dwarf producer tag which
> will tell whether lto is used, I went to the clang side and found that
> the following flag is needed in clang in order to embed flags in
> the producer tag:
>     -grecord-gcc-switches
...
>    In Linux:
>       - add flag -grecord-gcc-switches if clang lto is enabled.

I think that will help to make dwarf output a bit more uniform between
gcc and clang. So it's a good thing on its own.
Recording compilation flags in the debug info could be useful in
other cases too. I would pass it for both lto and non-lto builds.


* Re: [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-25  6:53 ` [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
@ 2021-03-26 23:13   ` Andrii Nakryiko
  2021-03-26 23:26     ` Yonghong Song
  0 siblings, 1 reply; 21+ messages in thread
From: Andrii Nakryiko @ 2021-03-26 23:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, Kernel Team

On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> Currently, types/tags hash table has fixed HASHTAGS__BITS = 15.
> That means the number of buckets will be 1UL << 15 = 32768.
> In my experiments, a thin-LTO built vmlinux has roughly 9M entries
> in types table and 5.2M entries in tags table. So the number
> of buckets is too small for an efficient lookup. This patch
> refactored the code to allow the number of buckets to be changed.
>
> In addition, currently hashtags__fn(key) return value is
> assigned to uint16_t. Change to uint32_t as in a later patch
> the number of hashtag bits can be increased to be more than 16.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/dwarf_loader.c b/dwarf_loader.c
> index c106919..a02ef23 100644
> --- a/dwarf_loader.c
> +++ b/dwarf_loader.c
> @@ -50,7 +50,12 @@ struct strings *strings;
>  #define DW_FORM_implicit_const 0x21
>  #endif
>
> -#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
> +static uint32_t hashtags__bits = 15;
> +
> +uint32_t hashtags__fn(Dwarf_Off key)
> +{
> +       return hash_64(key, hashtags__bits);

I vaguely remember pahole patch that updated hash function to use the
same one as libbpf's hashmap is using. Arnaldo, wasn't that patch
accepted?

But more to the point, I think hashtags__fn() should probably preserve
all 64 bits of the hash?

> +}
>
>  bool no_bitfield_type_recode = true;
>
> @@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
>         *(dwarf_off_ref *)(dtag + 1) = spec;
>  }
>
> -#define HASHTAGS__BITS 15
> -#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
> -
>  #define obstack_chunk_alloc malloc
>  #define obstack_chunk_free free
>
> @@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
>  }
>
>  struct dwarf_cu {
> -       struct hlist_head hash_tags[HASHTAGS__SIZE];
> -       struct hlist_head hash_types[HASHTAGS__SIZE];
> +       struct hlist_head *hash_tags;
> +       struct hlist_head *hash_types;
>         struct obstack obstack;
>         struct cu *cu;
>         struct dwarf_cu *type_unit;
>  };
>
> -static void dwarf_cu__init(struct dwarf_cu *dcu)
> +static int dwarf_cu__init(struct dwarf_cu *dcu)
>  {
> +       uint64_t hashtags_size = 1UL << hashtags__bits;

I wish pahole could just use libbpf's dynamically resized hashmap,
instead of hard-coding maximum size like this :(

Arnaldo, libbpf is not going to expose its hashmap as public API, but
if you'd like to use it, feel free to just copy/paste the code. It
hasn't changed for a while and is unlikely to change (unless some day
we decide to make a more efficient open-addressing implementation).

> +       dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
> +       if (!dcu->hash_tags)
> +               return -ENOMEM;
> +
> +       dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
> +       if (!dcu->hash_types) {
> +               free(dcu->hash_tags);
> +               return -ENOMEM;
> +       }
> +

[...]


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 23:12           ` Alexei Starovoitov
@ 2021-03-26 23:17             ` Yonghong Song
  0 siblings, 0 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-26 23:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo, dwarves,
	Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	Kernel Team



On 3/26/21 4:12 PM, Alexei Starovoitov wrote:
> On Fri, Mar 26, 2021 at 4:05 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Now since you found gcc actually has flags in the dwarf producer tag which
>> will tell whether lto is used, I went to the clang side and found that
>> the following flag is needed in clang in order to embed flags in
>> the producer tag:
>>      -grecord-gcc-switches
> ...
>>     In Linux:
>>        - add flag -grecord-gcc-switches if clang lto is enabled.
> 
> I think that will help to make dwarf output a bit more uniform between
> gcc and clang. So it's a good thing on its own.
> Recording compilation flags in the debug info could be useful in
> other cases too. I would pass it for both lto and non-lto builds.

Good point. Will do this.


* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-25  6:53 ` [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu Yonghong Song
  2021-03-26 14:41   ` Arnaldo Carvalho de Melo
@ 2021-03-26 23:21   ` Andrii Nakryiko
  2021-03-27  0:19     ` Yonghong Song
  1 sibling, 1 reply; 21+ messages in thread
From: Andrii Nakryiko @ 2021-03-26 23:21 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, Kernel Team

On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> This patch added an option "merge_cus", which will permit
> to merge all debug info cu's into one pahole cu.
> For vmlinux built with clang thin-lto or lto, there exist
> cross cu type references. For example, you could have
>   compile unit 1:
>      tag 10:  type A
>   compile unit 2:
>      ...
>        refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
>
> There are two different ways to resolve this issue:
> (1). merge all compile units as one pahole cu so tags/types
>      can be resolved easily, or
> (2). try to do on-demand type traversal in other debuginfo cu's
>      when we do die_process().
> The method (2) is much more complicated so I picked method (1).
> An option "merge_cus" is added to permit such an operation.
>
> Merging cu's will create a single cu with lots of types, tags
> and functions. For example with clang thin-lto built vmlinux,
> I saw 9M entries in types table, 5.2M in tags table. The
> below are pahole wallclock time for different hashbits:
> command line: time pahole -J --merge_cus vmlinux
>       # of hashbits            wallclock time in seconds
>           15                       460
>           16                       255
>           17                       131
>           18                       97
>           19                       75
>           20                       69
>           21                       64
>           22                       62
>           23                       58
>           24                       64

What were the numbers for different hashbits without --merge_cus?

>
> Note that the number of hashbits 24 makes performance worse
> than 23. The reason could be that 23 hashbits can cover 8M
> buckets (close to 9M for the number of entries in types table).
> Higher number of hash bits allocates more memory and becomes
> less cache efficient compared to 23 hashbits.
>
> This patch picks # of hashbits 21 as the starting value
> and will try to allocate memory based on that, if memory
> allocation fails, we will go with less hashbits until
> we reach hashbits 15 which is the default for
> non merge-cu case.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  dwarf_loader.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  dwarves.h      |  2 ++
>  pahole.c       |  8 +++++
>  3 files changed, 100 insertions(+)
>

[...]
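
For reference, the fallback described in the quoted commit message (start
at 21 hashbits, retry with fewer bits on allocation failure, stop at 15)
could look roughly like the sketch below. This is not the code from the
patch: alloc_buckets() and the two macros are hypothetical names, and a
plain array of pointers stands in for the struct hlist_head tables.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HASHBITS_MIN   15 /* default for the non-merge case, per the commit message */
#define HASHBITS_START 21 /* starting value picked by the patch */

/*
 * Hypothetical helper: tries to allocate a zeroed bucket array of
 * (1 << bits) pointers, starting at start_bits and going down to
 * min_bits. On success stores the bit count actually used in *bits_out.
 * Returns NULL if even min_bits cannot be allocated.
 */
static void **alloc_buckets(unsigned int start_bits, unsigned int min_bits,
			    unsigned int *bits_out)
{
	assert(start_bits >= min_bits && start_bits < 32);

	for (unsigned int bits = start_bits; ; bits--) {
		void **buckets = calloc((size_t)1 << bits, sizeof(*buckets));

		if (buckets != NULL) {
			*bits_out = bits;
			return buckets;
		}
		if (bits == min_bits)
			return NULL;
	}
}
```

The caller would keep the returned bit count around so that the hash
function and the table size stay in agreement.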


* Re: [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-26 23:13   ` Andrii Nakryiko
@ 2021-03-26 23:26     ` Yonghong Song
  2021-03-29 14:02       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 21+ messages in thread
From: Yonghong Song @ 2021-03-26 23:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, Kernel Team



On 3/26/21 4:13 PM, Andrii Nakryiko wrote:
> On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Currently, types/tags hash table has fixed HASHTAGS__BITS = 15.
>> That means the number of buckets will be 1UL << 15 = 32768.
>> In my experiments, a thin-LTO built vmlinux has roughly 9M entries
>> in types table and 5.2M entries in tags table. So the number
>> of buckets is too small for an efficient lookup. This patch
>> refactored the code to allow the number of buckets to be changed.
>>
>> In addition, currently hashtags__fn(key) return value is
>> assigned to uint16_t. Change to uint32_t as in a later patch
>> the number of hashtag bits can be increased to be more than 16.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 37 insertions(+), 11 deletions(-)
>>
>> diff --git a/dwarf_loader.c b/dwarf_loader.c
>> index c106919..a02ef23 100644
>> --- a/dwarf_loader.c
>> +++ b/dwarf_loader.c
>> @@ -50,7 +50,12 @@ struct strings *strings;
>>   #define DW_FORM_implicit_const 0x21
>>   #endif
>>
>> -#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
>> +static uint32_t hashtags__bits = 15;
>> +
>> +uint32_t hashtags__fn(Dwarf_Off key)
>> +{
>> +       return hash_64(key, hashtags__bits);
> 
> I vaguely remember pahole patch that updated hash function to use the
> same one as libbpf's hashmap is using. Arnaldo, wasn't that patch
> accepted?
> 
> But more to the point, I think hashtags__fn() should probably preserve
> all 64 bits of the hash?

I don't know the context. If the purpose is to avoid future changes
in case hashtags__bits ever exceeds 32, then yes, the change may
make sense.
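
For reference, a minimal sketch of what "preserving all 64 bits" could
look like: compute a full-width hash once and reduce it to a bucket
index only at lookup time. The names hash64_full() and bucket_index()
are hypothetical, and the multiplier is the common 64-bit golden-ratio
constant, which is an assumption here and not necessarily what pahole
uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: full-width multiplicative (Fibonacci) hash. */
static inline uint64_t hash64_full(uint64_t key)
{
	return key * 11400714819323198485ULL;	/* ~ 2^64 / golden ratio */
}

/* Reduce the full hash to a bucket index by keeping the top `bits` bits. */
static inline uint64_t bucket_index(uint64_t key, unsigned int bits)
{
	assert(bits >= 1 && bits <= 63);
	return hash64_full(key) >> (64 - bits);
}
```

With this split the return type no longer limits the number of hash
bits; only bucket_index() needs to know the current table size.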

> 
>> +}
>>
>>   bool no_bitfield_type_recode = true;
>>
>> @@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
>>          *(dwarf_off_ref *)(dtag + 1) = spec;
>>   }
>>
>> -#define HASHTAGS__BITS 15
>> -#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
>> -
>>   #define obstack_chunk_alloc malloc
>>   #define obstack_chunk_free free
>>
>> @@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
>>   }
>>
>>   struct dwarf_cu {
>> -       struct hlist_head hash_tags[HASHTAGS__SIZE];
>> -       struct hlist_head hash_types[HASHTAGS__SIZE];
>> +       struct hlist_head *hash_tags;
>> +       struct hlist_head *hash_types;
>>          struct obstack obstack;
>>          struct cu *cu;
>>          struct dwarf_cu *type_unit;
>>   };
>>
>> -static void dwarf_cu__init(struct dwarf_cu *dcu)
>> +static int dwarf_cu__init(struct dwarf_cu *dcu)
>>   {
>> +       uint64_t hashtags_size = 1UL << hashtags__bits;
> 
> I wish pahole could just use libbpf's dynamically resized hashmap,
> instead of hard-coding maximum size like this :(
> 
> Arnaldo, libbpf is not going to expose its hashmap as public API, but
> if you'd like to use it, feel free to just copy/paste the code. It
> hasn't change for a while and is unlikely to change (unless some day
> we decide to make more efficient open-addressing implementation).
> 
>> +       dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
>> +       if (!dcu->hash_tags)
>> +               return -ENOMEM;
>> +
>> +       dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
>> +       if (!dcu->hash_types) {
>> +               free(dcu->hash_tags);
>> +               return -ENOMEM;
>> +       }
>> +
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 23:21   ` Andrii Nakryiko
@ 2021-03-27  0:19     ` Yonghong Song
  0 siblings, 0 replies; 21+ messages in thread
From: Yonghong Song @ 2021-03-27  0:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, Kernel Team



On 3/26/21 4:21 PM, Andrii Nakryiko wrote:
> On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> This patch added an option "merge_cus", which will permit
>> to merge all debug info cu's into one pahole cu.
>> For vmlinux built with clang thin-lto or lto, there exist
>> cross cu type references. For example, you could have
>>    compile unit 1:
>>       tag 10:  type A
>>    compile unit 2:
>>       ...
>>         refer to type A (tag 10 in compile unit 1)
>> I only checked a few but have seen type A may be a simple type
>> like "unsigned char" or a complex type like an array of base types.
>>
>> There are two different ways to resolve this issue:
>> (1). merge all compile units as one pahole cu so tags/types
>>       can be resolved easily, or
>> (2). try to do on-demand type traversal in other debuginfo cu's
>>       when we do die_process().
>> The method (2) is much more complicated so I picked method (1).
>> An option "merge_cus" is added to permit such an operation.
>>
>> Merging cu's will create a single cu with lots of types, tags
>> and functions. For example with clang thin-lto built vmlinux,
>> I saw 9M entries in types table, 5.2M in tags table. The
>> below are pahole wallclock time for different hashbits:
>> command line: time pahole -J --merge_cus vmlinux
>>        # of hashbits            wallclock time in seconds
>>            15                       460
>>            16                       255
>>            17                       131
>>            18                       97
>>            19                       75
>>            20                       69
>>            21                       64
>>            22                       62
>>            23                       58
>>            24                       64
> 
> What were the numbers for different hashbits without --merge_cus?

Without --merge_cus means a non-LTO vmlinux.
I just did a quick measurement: for hashbits 10 - 18,
"pahole -J vmlinux" takes 37s - 39s in all cases,
with hashbits 10 - 15 at 37s - 38s and the rest at 38s - 39s.

The number of cus for my particular vmlinux is 2915.
The total number of types among all cus is roughly 8M based
on a rough regex matching, so each cu has roughly 2.7K types
on average.

So the current default setting is okay for
non-lto vmlinux.
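
The arithmetic behind that conclusion, as a rough sketch: a per-cu
table with 2^15 buckets stays far below one entry per bucket, while a
merged table with about 9M entries at 15 bits averages hundreds of
entries per chain. The helper name avg_chain_len() is made up for
illustration; the inputs are the approximate counts quoted in this
thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average entries per bucket for 2^bits buckets. */
static double avg_chain_len(uint64_t nr_entries, unsigned int bits)
{
	assert(bits < 64);
	return (double)nr_entries / (double)(1ULL << bits);
}
```

For example, avg_chain_len(8000000 / 2915, 15) is about 0.08 for a
typical non-lto cu, avg_chain_len(9000000, 15) is about 275 for the
merged types table, and avg_chain_len(9000000, 21) is about 4.3.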

> 
>>
>> Note that the number of hashbits 24 makes performance worse
>> than 23. The reason could be that 23 hashbits can cover 8M
>> buckets (close to 9M for the number of entries in types table).
>> Higher number of hash bits allocates more memory and becomes
>> less cache efficient compared to 23 hashbits.
>>
>> This patch picks # of hashbits 21 as the starting value
>> and will try to allocate memory based on that, if memory
>> allocation fails, we will go with less hashbits until
>> we reach hashbits 15 which is the default for
>> non merge-cu case.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   dwarf_loader.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   dwarves.h      |  2 ++
>>   pahole.c       |  8 +++++
>>   3 files changed, 100 insertions(+)
>>
> 
> [...]
> 
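
The fallback described in the commit message above (start at 21
hashbits, retry with fewer bits when allocation fails, stop at 15)
could be sketched roughly as below. This is not the code from the
patch: alloc_buckets(), struct bucket_head and the two macros are
hypothetical names, and calloc() is used here only to get zeroed
bucket heads in one step.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_HASHBITS 21	/* starting value from the commit message */
#define MIN_HASHBITS 15	/* default for the non merge-cu case */

struct bucket_head { void *first; };	/* stand-in for struct hlist_head */

/*
 * Hypothetical helper: try to allocate 2^bits zeroed buckets, halving the
 * table on each failure. Stores the bits actually used in *bits_out, or
 * returns NULL if even MIN_HASHBITS cannot be allocated.
 */
static struct bucket_head *alloc_buckets(unsigned int *bits_out)
{
	assert(bits_out != NULL);

	for (unsigned int bits = MAX_HASHBITS; bits >= MIN_HASHBITS; bits--) {
		struct bucket_head *tbl = calloc((size_t)1 << bits, sizeof(*tbl));

		if (tbl) {
			*bits_out = bits;
			return tbl;
		}
	}
	return NULL;
}
```

The caller would then use the returned bit count for hashing, so that
the hash function and the table size always agree.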

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-26 23:26     ` Yonghong Song
@ 2021-03-29 14:02       ` Arnaldo Carvalho de Melo
  2021-03-31  4:30         ` Andrii Nakryiko
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-29 14:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, Arnaldo Carvalho de Melo, dwarves,
	Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	Kernel Team

Em Fri, Mar 26, 2021 at 04:26:20PM -0700, Yonghong Song escreveu:
> 
> 
> On 3/26/21 4:13 PM, Andrii Nakryiko wrote:
> > On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
> > > 
> > > Currently, types/tags hash table has fixed HASHTAGS__BITS = 15.
> > > That means the number of buckets will be 1UL << 15 = 32768.
> > > In my experiments, a thin-LTO built vmlinux has roughly 9M entries
> > > in types table and 5.2M entries in tags table. So the number
> > > of buckets is too less for an efficient lookup. This patch
> > > refactored the code to allow the number of buckets to be changed.
> > > 
> > > In addition, currently hashtags__fn(key) return value is
> > > assigned to uint16_t. Change to uint32_t as in a later patch
> > > the number of hashtag bits can be increased to be more than 16.
> > > 
> > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > > ---
> > >   dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
> > >   1 file changed, 37 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/dwarf_loader.c b/dwarf_loader.c
> > > index c106919..a02ef23 100644
> > > --- a/dwarf_loader.c
> > > +++ b/dwarf_loader.c
> > > @@ -50,7 +50,12 @@ struct strings *strings;
> > >   #define DW_FORM_implicit_const 0x21
> > >   #endif
> > > 
> > > -#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
> > > +static uint32_t hashtags__bits = 15;
> > > +
> > > +uint32_t hashtags__fn(Dwarf_Off key)
> > > +{
> > > +       return hash_64(key, hashtags__bits);
> > 
> > I vaguely remember pahole patch that updated hash function to use the
> > same one as libbpf's hashmap is using. Arnaldo, wasn't that patch
> > accepted?

I guess so:

https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=9fecc77ed82d429fd3fe49ba275465813228e617

dwarf_loader: Use a better hashing function, from libbpf

This hashing function[1] produces better hash table bucket
distributions. The original hashing function always produced zeros in
the three least significant bits. The new hashing function gives a
modest performance boost:

  Original: 0:11.373s
  New:      0:11.110s

for a performance improvement of ~2%.

[1] From the hash function used in libbpf.

Committer notes:

Bill found the suboptimality of the hash function being used, Andrii
suggested using the libbpf one, which ended up being better.

Signed-off-by: Bill Wendling <morbo@google.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Cc: bpf@vger.kernel.org
Cc: dwarves@vger.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 
> > But more to the point, I think hashtags__fn() should probably preserve
> > all 64 bits of the hash?
> 
> I don't know the context. If the purpose is to avoid future changes
> in case that the hashtags__bits > 32 happens, yes, the change may
> make sense.
> 
> > 
> > > +}
> > > 
> > >   bool no_bitfield_type_recode = true;
> > > 
> > > @@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
> > >          *(dwarf_off_ref *)(dtag + 1) = spec;
> > >   }
> > > 
> > > -#define HASHTAGS__BITS 15
> > > -#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
> > > -
> > >   #define obstack_chunk_alloc malloc
> > >   #define obstack_chunk_free free
> > > 
> > > @@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
> > >   }
> > > 
> > >   struct dwarf_cu {
> > > -       struct hlist_head hash_tags[HASHTAGS__SIZE];
> > > -       struct hlist_head hash_types[HASHTAGS__SIZE];
> > > +       struct hlist_head *hash_tags;
> > > +       struct hlist_head *hash_types;
> > >          struct obstack obstack;
> > >          struct cu *cu;
> > >          struct dwarf_cu *type_unit;
> > >   };
> > > 
> > > -static void dwarf_cu__init(struct dwarf_cu *dcu)
> > > +static int dwarf_cu__init(struct dwarf_cu *dcu)
> > >   {
> > > +       uint64_t hashtags_size = 1UL << hashtags__bits;
> > 
> > I wish pahole could just use libbpf's dynamically resized hashmap,
> > instead of hard-coding maximum size like this :(
> > 
> > Arnaldo, libbpf is not going to expose its hashmap as public API, but
> > if you'd like to use it, feel free to just copy/paste the code. It
> > hasn't change for a while and is unlikely to change (unless some day
> > we decide to make more efficient open-addressing implementation).
> > 
> > > +       dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
> > > +       if (!dcu->hash_tags)
> > > +               return -ENOMEM;
> > > +
> > > +       dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
> > > +       if (!dcu->hash_types) {
> > > +               free(dcu->hash_tags);
> > > +               return -ENOMEM;
> > > +       }
> > > +
> > 
> > [...]
> > 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu
  2021-03-26 23:05         ` Yonghong Song
  2021-03-26 23:12           ` Alexei Starovoitov
@ 2021-03-29 14:04           ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-29 14:04 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Fri, Mar 26, 2021 at 04:05:45PM -0700, Yonghong Song escreveu:
> On 3/26/21 11:19 AM, Arnaldo Carvalho de Melo wrote:
> > [acme@five pahole]$ fullcircle ./tcp_ipv4.o
> > /home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
> 
> The .o file in lto build is not really an elf .o, it is llvm internal
> ir bitcode.

This one wasn't from an LTO build; I'll revisit this soon. Testing v3
now.

- Arnaldo
 
> > [acme@five pahole]$ pahole --help | grep merge
> >        --merge_cus            Merge all cus (except possible types_cu)
> > [acme@five pahole]$
> > 
> > 
> > - Arnaldo
> > 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-29 14:02       ` Arnaldo Carvalho de Melo
@ 2021-03-31  4:30         ` Andrii Nakryiko
  0 siblings, 0 replies; 21+ messages in thread
From: Andrii Nakryiko @ 2021-03-31  4:30 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Yonghong Song, Arnaldo Carvalho de Melo, dwarves,
	Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	Kernel Team

On Mon, Mar 29, 2021 at 7:02 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> Em Fri, Mar 26, 2021 at 04:26:20PM -0700, Yonghong Song escreveu:
> >
> >
> > On 3/26/21 4:13 PM, Andrii Nakryiko wrote:
> > > On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
> > > >
> > > > Currently, types/tags hash table has fixed HASHTAGS__BITS = 15.
> > > > That means the number of buckets will be 1UL << 15 = 32768.
> > > > In my experiments, a thin-LTO built vmlinux has roughly 9M entries
> > > > in types table and 5.2M entries in tags table. So the number
> > > > of buckets is too less for an efficient lookup. This patch
> > > > refactored the code to allow the number of buckets to be changed.
> > > >
> > > > In addition, currently hashtags__fn(key) return value is
> > > > assigned to uint16_t. Change to uint32_t as in a later patch
> > > > the number of hashtag bits can be increased to be more than 16.
> > > >
> > > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > > > ---
> > > >   dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
> > > >   1 file changed, 37 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/dwarf_loader.c b/dwarf_loader.c
> > > > index c106919..a02ef23 100644
> > > > --- a/dwarf_loader.c
> > > > +++ b/dwarf_loader.c
> > > > @@ -50,7 +50,12 @@ struct strings *strings;
> > > >   #define DW_FORM_implicit_const 0x21
> > > >   #endif
> > > >
> > > > -#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
> > > > +static uint32_t hashtags__bits = 15;
> > > > +
> > > > +uint32_t hashtags__fn(Dwarf_Off key)
> > > > +{
> > > > +       return hash_64(key, hashtags__bits);
> > >
> > > I vaguely remember pahole patch that updated hash function to use the
> > > same one as libbpf's hashmap is using. Arnaldo, wasn't that patch
> > > accepted?
>
> I guess so:
>
> https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=9fecc77ed82d429fd3fe49ba275465813228e617

Oh, my bad. I fetched the latest master but didn't notice that I had
some local changes that conflicted, so my master didn't actually
update. Sorry about the noise.

>
> dwarf_loader: Use a better hashing function, from libbpf
>
> This hashing function[1] produces better hash table bucket
> distributions. The original hashing function always produced zeros in
> the three least significant bits. The new hashing function gives a
> modest performance boost:
>
>   Original: 0:11.373s
>   New:      0:11.110s
>
> for a performance improvement of ~2%.
>
> [1] From the hash function used in libbpf.
>
> Committer notes:
>
> Bill found the suboptimality of the hash function being used, Andrii
> suggested using the libbpf one, which ended up being better.
>
> Signed-off-by: Bill Wendling <morbo@google.com>
> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
> Cc: bpf@vger.kernel.org
> Cc: dwarves@vger.kernel.org
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>
> > > But more to the point, I think hashtags__fn() should probably preserve
> > > all 64 bits of the hash?
> >
> > I don't know the context. If the purpose is to avoid future changes
> > in case that the hashtags__bits > 32 happens, yes, the change may
> > make sense.
> >
> > >
> > > > +}
> > > >
> > > >   bool no_bitfield_type_recode = true;
> > > >
> > > > @@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
> > > >          *(dwarf_off_ref *)(dtag + 1) = spec;
> > > >   }
> > > >
> > > > -#define HASHTAGS__BITS 15
> > > > -#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
> > > > -
> > > >   #define obstack_chunk_alloc malloc
> > > >   #define obstack_chunk_free free
> > > >
> > > > @@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
> > > >   }
> > > >
> > > >   struct dwarf_cu {
> > > > -       struct hlist_head hash_tags[HASHTAGS__SIZE];
> > > > -       struct hlist_head hash_types[HASHTAGS__SIZE];
> > > > +       struct hlist_head *hash_tags;
> > > > +       struct hlist_head *hash_types;
> > > >          struct obstack obstack;
> > > >          struct cu *cu;
> > > >          struct dwarf_cu *type_unit;
> > > >   };
> > > >
> > > > -static void dwarf_cu__init(struct dwarf_cu *dcu)
> > > > +static int dwarf_cu__init(struct dwarf_cu *dcu)
> > > >   {
> > > > +       uint64_t hashtags_size = 1UL << hashtags__bits;
> > >
> > > I wish pahole could just use libbpf's dynamically resized hashmap,
> > > instead of hard-coding maximum size like this :(
> > >
> > > Arnaldo, libbpf is not going to expose its hashmap as public API, but
> > > if you'd like to use it, feel free to just copy/paste the code. It
> > > hasn't change for a while and is unlikely to change (unless some day
> > > we decide to make more efficient open-addressing implementation).
> > >
> > > > +       dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
> > > > +       if (!dcu->hash_tags)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
> > > > +       if (!dcu->hash_types) {
> > > > +               free(dcu->hash_tags);
> > > > +               return -ENOMEM;
> > > > +       }
> > > > +
> > >
> > > [...]
> > >
>
> --
>
> - Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-03-31  4:31 UTC | newest]

Thread overview: 21+ messages
2021-03-25  6:53 [PATCH dwarves 0/3] add option to merge more dwarf cu's into Yonghong Song
2021-03-25  6:53 ` [PATCH dwarves 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
2021-03-26 23:13   ` Andrii Nakryiko
2021-03-26 23:26     ` Yonghong Song
2021-03-29 14:02       ` Arnaldo Carvalho de Melo
2021-03-31  4:30         ` Andrii Nakryiko
2021-03-25  6:53 ` [PATCH dwarves 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
2021-03-25  6:53 ` [PATCH dwarves 3/3] dwarf_loader: add option to merge more dwarf cu's into one pahole cu Yonghong Song
2021-03-26 14:41   ` Arnaldo Carvalho de Melo
2021-03-26 15:18     ` Yonghong Song
2021-03-26 17:35       ` Arnaldo Carvalho de Melo
2021-03-26 18:19       ` Arnaldo Carvalho de Melo
2021-03-26 23:05         ` Yonghong Song
2021-03-26 23:12           ` Alexei Starovoitov
2021-03-26 23:17             ` Yonghong Song
2021-03-29 14:04           ` Arnaldo Carvalho de Melo
2021-03-26 15:18     ` Arnaldo Carvalho de Melo
2021-03-26 23:21   ` Andrii Nakryiko
2021-03-27  0:19     ` Yonghong Song
2021-03-25 13:10 ` [PATCH dwarves 0/3] add option to merge more dwarf cu's into Arnaldo Carvalho de Melo
2021-03-26  1:41   ` Yonghong Song
