All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
@ 2021-03-28 20:14 Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Yonghong Song @ 2021-03-28 20:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

For vmlinux built with clang thin-lto or lto for latest bpf-next,
there exist cross cu debuginfo type references. For example,
      compile unit 1:
         tag 10:  type A
      compile unit 2:
         ...
           refer to type A (tag 10 in compile unit 1)
I only checked a few but have seen type A may be a simple type
like "unsigned char" or a complex type like an array of base types.
I am using latest llvm trunk and bpf-next. I suspect llvm12 or
linus tree >= 5.12 rc2 should be able to exhibit the issue as well.
Both thin-lto and lto have the same issues.

Current pahole cannot handle this. It will report types cannot
be found error. Bill Wendling has attempted to fix the issue
with [1] by permitting all tags/types are hashed to the same
hash table and then process cu's one by one. This does not
really work. The reason is that each cu resolves types locally
so for the above example we may have
  compile unit 1:
    type A : type_id = 10
  compile unit 2:
    refer to type A : type A will be resolved as type id = 10
But id 10 refers to compile unit 1, we will get either out
of bound type id or incorrect one.

This patch set is a continuation of Bill's work. We still
increase the hashtable size and traverse all cu's before
recoding and finalization. But instead of creating one-to-one
mapping between debuginfo cu and pahole cu, we just create
one pahole cu, which should solve the above incorrect type
id issue.

This patch set depends on kernel patch [2]
to emit compilation flags for clang lto build so pahole
can properly discover whether to merge cu's or not.

Patch #1 and #2 are refactoring the existing code and
Patch #3 added logic to premit merging all debuginfo cu's
into one pahole cu. The detection for whether merging is
desirable is done by checking the existence of
"clang" compiler and its "lto" option in dwarf producer tag.

[1] https://lore.kernel.org/bpf/20210212211607.2890660-1-morbo@google.com/
[2] https://lore.kernel.org/bpf/20210328064121.2062927-1-yhs@fb.com/

Changelogs:
  v2 -> v3:
    . change "return 1" to "return DWARF_CB_ABORT" in
      cus__merge_and_process_cu().
    . add kbuild/bpf link (above [2]) for kernel patch reference.
  v1 -> v2:
    . removed "--merge_cus" option, relied on detections on
      clang compiler and its lto flags.

Yonghong Song (3):
  dwarf_loader: permits flexible HASHTAGS__BITS
  dwarf_loader: factor out common code to initialize a cu
  dwarf_loader: permit merging all dwarf cu's for clang lto built binary

 dwarf_loader.c | 209 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 175 insertions(+), 34 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH dwarves v3 1/3] dwarf_loader: permits flexible HASHTAGS__BITS
  2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
@ 2021-03-28 20:14 ` Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2021-03-28 20:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Currently, types/tags hash table has fixed HASHTAGS__BITS = 15.
That means the number of buckets will be 1UL << 15 = 32768.
In my experiments, a thin-LTO built vmlinux has roughly 9M entries
in types table and 5.2M entries in tags table. So the number
of buckets is too less for an efficient lookup. This patch
refactored the code to allow the number of buckets to be changed.

In addition, currently hashtags__fn(key) return value is
assigned to uint16_t. Change to uint32_t as in a later patch
the number of hashtag bits can be increased to be more than 16.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index c106919..5a1e860 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -50,7 +50,12 @@ struct strings *strings;
 #define DW_FORM_implicit_const 0x21
 #endif
 
-#define hashtags__fn(key) hash_64(key, HASHTAGS__BITS)
+static uint32_t hashtags__bits = 15;
+
+static uint32_t hashtags__fn(Dwarf_Off key)
+{
+	return hash_64(key, hashtags__bits);
+}
 
 bool no_bitfield_type_recode = true;
 
@@ -102,9 +107,6 @@ static void dwarf_tag__set_spec(struct dwarf_tag *dtag, dwarf_off_ref spec)
 	*(dwarf_off_ref *)(dtag + 1) = spec;
 }
 
-#define HASHTAGS__BITS 15
-#define HASHTAGS__SIZE (1UL << HASHTAGS__BITS)
-
 #define obstack_chunk_alloc malloc
 #define obstack_chunk_free free
 
@@ -118,22 +120,41 @@ static void *obstack_zalloc(struct obstack *obstack, size_t size)
 }
 
 struct dwarf_cu {
-	struct hlist_head hash_tags[HASHTAGS__SIZE];
-	struct hlist_head hash_types[HASHTAGS__SIZE];
+	struct hlist_head *hash_tags;
+	struct hlist_head *hash_types;
 	struct obstack obstack;
 	struct cu *cu;
 	struct dwarf_cu *type_unit;
 };
 
-static void dwarf_cu__init(struct dwarf_cu *dcu)
+static int dwarf_cu__init(struct dwarf_cu *dcu)
 {
+	uint64_t hashtags_size = 1UL << hashtags__bits;
+	dcu->hash_tags = malloc(sizeof(struct hlist_head) * hashtags_size);
+	if (!dcu->hash_tags)
+		return -ENOMEM;
+
+	dcu->hash_types = malloc(sizeof(struct hlist_head) * hashtags_size);
+	if (!dcu->hash_types) {
+		free(dcu->hash_tags);
+		return -ENOMEM;
+	}
+
 	unsigned int i;
-	for (i = 0; i < HASHTAGS__SIZE; ++i) {
+	for (i = 0; i < hashtags_size; ++i) {
 		INIT_HLIST_HEAD(&dcu->hash_tags[i]);
 		INIT_HLIST_HEAD(&dcu->hash_types[i]);
 	}
 	obstack_init(&dcu->obstack);
 	dcu->type_unit = NULL;
+	return 0;
+}
+
+static void dwarf_cu__delete(struct cu *cu)
+{
+	struct dwarf_cu *dcu = cu->priv;
+	free(dcu->hash_tags);
+	free(dcu->hash_types);
 }
 
 static void hashtags__hash(struct hlist_head *hashtable,
@@ -151,7 +172,7 @@ static struct dwarf_tag *hashtags__find(const struct hlist_head *hashtable,
 
 	struct dwarf_tag *tpos;
 	struct hlist_node *pos;
-	uint16_t bucket = hashtags__fn(id);
+	uint32_t bucket = hashtags__fn(id);
 	const struct hlist_head *head = hashtable + bucket;
 
 	hlist_for_each_entry(tpos, pos, head, hash_node) {
@@ -2429,7 +2450,9 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 			}
 			cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
 
-			dwarf_cu__init(dcup);
+			if (dwarf_cu__init(dcup) != 0)
+				return DWARF_CB_ABORT;
+
 			dcup->cu = cu;
 			/* Funny hack.  */
 			dcup->type_unit = dcup;
@@ -2521,7 +2544,9 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 
 		struct dwarf_cu dcu;
 
-		dwarf_cu__init(&dcu);
+		if (dwarf_cu__init(&dcu) != 0)
+			return DWARF_CB_ABORT;
+
 		dcu.cu = cu;
 		dcu.type_unit = type_cu ? &type_dcu : NULL;
 		cu->priv = &dcu;
@@ -2672,5 +2697,6 @@ struct debug_fmt_ops dwarf__ops = {
 	.tag__decl_file	     = dwarf_tag__decl_file,
 	.tag__decl_line	     = dwarf_tag__decl_line,
 	.tag__orig_id	     = dwarf_tag__orig_id,
+	.cu__delete	     = dwarf_cu__delete,
 	.has_alignment_info  = true,
 };
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH dwarves v3 2/3] dwarf_loader: factor out common code to initialize a cu
  2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
@ 2021-03-28 20:14 ` Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary Yonghong Song
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2021-03-28 20:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Both cus__load_debug_types() and cus__load_module()
created new cu's followed by initialization. The
initialization codes are identical so let us refactor
into a common function which can be used later as
well when dealing with merging cu's.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index 5a1e860..aa6372a 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -2411,6 +2411,23 @@ static int finalize_cu_immediately(struct cus *cus, struct cu *cu,
 	return lsk;
 }
 
+static int cu__set_common(struct cu *cu, struct conf_load *conf,
+			  Dwfl_Module *mod, Elf *elf)
+{
+	cu->uses_global_strings = true;
+	cu->elf = elf;
+	cu->dwfl = mod;
+	cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
+	cu->has_addr_info = conf ? conf->get_addr_info : 0;
+
+	GElf_Ehdr ehdr;
+	if (gelf_getehdr(elf, &ehdr) == NULL)
+		return DWARF_CB_ABORT;
+
+	cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
+	return 0;
+}
+
 static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 				 Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 				 const char *filename,
@@ -2434,22 +2451,11 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 
 			cu = cu__new("", pointer_size, build_id,
 				     build_id_len, filename);
-			if (cu == NULL) {
+			if (cu == NULL ||
+			    cu__set_common(cu, conf, mod, elf) != 0) {
 				return DWARF_CB_ABORT;
 			}
 
-			cu->uses_global_strings = true;
-			cu->elf = elf;
-			cu->dwfl = mod;
-			cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
-			cu->has_addr_info = conf ? conf->get_addr_info : 0;
-
-			GElf_Ehdr ehdr;
-			if (gelf_getehdr(elf, &ehdr) == NULL) {
-				return DWARF_CB_ABORT;
-			}
-			cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
-
 			if (dwarf_cu__init(dcup) != 0)
 				return DWARF_CB_ABORT;
 
@@ -2528,19 +2534,8 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		const char *name = attr_string(cu_die, DW_AT_name);
 		struct cu *cu = cu__new(name ?: "", pointer_size,
 					build_id, build_id_len, filename);
-		if (cu == NULL)
-			return DWARF_CB_ABORT;
-		cu->uses_global_strings = true;
-		cu->elf = elf;
-		cu->dwfl = mod;
-		cu->extra_dbg_info = conf ? conf->extra_dbg_info : 0;
-		cu->has_addr_info = conf ? conf->get_addr_info : 0;
-
-		GElf_Ehdr ehdr;
-		if (gelf_getehdr(elf, &ehdr) == NULL) {
+		if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
 			return DWARF_CB_ABORT;
-		}
-		cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
 
 		struct dwarf_cu dcu;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
  2021-03-28 20:14 ` [PATCH dwarves v3 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
@ 2021-03-28 20:14 ` Yonghong Song
  2021-03-30 20:08   ` Bill Wendling
  2021-03-29 17:40 ` [PATCH dwarves v3 0/3] " Arnaldo Carvalho de Melo
  2021-03-29 23:14 ` Nick Desaulniers
  4 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2021-03-28 20:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, dwarves
  Cc: Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf, kernel-team

For vmlinux built with clang thin-lto or lto, there exist
cross cu type references. For example, the below can happen:
  compile unit 1:
     tag 10:  type A
  compile unit 2:
     ...
       refer to type A (tag 10 in compile unit 1)
I only checked a few but have seen type A may be a simple type
like "unsigned char" or a complex type like an array of base types.

To resolve this issue, the tag DW_AT_producer of the first
DW_TAG_compile_unit is checked. If the binary is built
with clang lto, all debuginfo dwarf cu's will be merged
into one pahole cu which will resolve the above
cross-cu tag reference issue. To test whether a binary
is built with clang lto or not, The "clang version"
and "-flto" will be checked against DW_AT_producer string
for the first 5 debuginfo cu's. The reason is that
a few linux files disabled lto for various reasons.

Merging cu's will create a single cu with lots of types, tags
and functions. For example with clang thin-lto built vmlinux,
I saw 9M entries in types table, 5.2M in tags table. The
below are pahole wallclock time for different hashbits:
command line: time pahole -J vmlinux
      # of hashbits            wallclock time in seconds
          15                       460
          16                       255
          17                       131
          18                       97
          19                       75
          20                       69
          21                       64
          22                       62
          23                       58
          24                       64

The problem is with hashtags__find(), esp. the loop
    uint32_t bucket = hashtags__fn(id);
    const struct hlist_head *head = hashtable + bucket;
    hlist_for_each_entry(tpos, pos, head, hash_node) {
            if (tpos->id == id)
                    return tpos;
    }

Say we have 9M types and (1 << 15) buckets, that means each bucket
will have roughly 64 elements. So each lookup will traverse
the loop 32 iterations on average.

If we have 1 << 21 buckets, then each buckets will have 4 elements,
and the average number of loop iterations for hashtags__find()
will be 2.

Note that the number of hashbits 24 makes performance worse
than 23. The reason could be that 23 hashbits can cover 8M
buckets (close to 9M for the number of entries in types table).
Higher number of hash bits allocates more memory and becomes
less cache efficient compared to 23 hashbits.

This patch picks # of hashbits 21 as the starting value
and will try to allocate memory based on that, if memory
allocation fails, we will go with less hashbits until
we reach hashbits 15 which is the default for
non merge-cu case.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index aa6372a..a51391e 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -51,6 +51,7 @@ struct strings *strings;
 #endif
 
 static uint32_t hashtags__bits = 15;
+static uint32_t max_hashtags__bits = 21;
 
 static uint32_t hashtags__fn(Dwarf_Off key)
 {
@@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 	return 0;
 }
 
+static bool cus__merging_cu(Dwarf *dw)
+{
+	uint8_t pointer_size, offset_size;
+	Dwarf_Off off = 0, noff;
+	size_t cuhl;
+	int cnt = 0;
+
+	/*
+	 * Just checking the first cu is not enough.
+	 * In linux, some C files may have LTO is disabled, e.g.,
+	 *   e242db40be27  x86, vdso: disable LTO only for vDSO
+	 *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
+	 * Fortunately, disabling LTO for a particular file in a LTO build
+	 * is rather an exception. Iterating 5 cu's to check whether
+	 * LTO is used or not should be enough.
+	 */
+	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
+			    &offset_size) == 0) {
+		Dwarf_Die die_mem;
+		Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
+
+		if (cu_die == NULL)
+			break;
+
+		if (++cnt > 5)
+			break;
+
+		const char *producer = attr_string(cu_die, DW_AT_producer);
+		if (strstr(producer, "clang version") != NULL &&
+		    strstr(producer, "-flto") != NULL)
+			return true;
+
+		off = noff;
+	}
+
+	return false;
+}
+
+static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
+				     Dwfl_Module *mod, Dwarf *dw, Elf *elf,
+				     const char *filename,
+				     const unsigned char *build_id,
+				     int build_id_len,
+				     struct dwarf_cu *type_dcu)
+{
+	uint8_t pointer_size, offset_size;
+	struct dwarf_cu *dcu = NULL;
+	Dwarf_Off off = 0, noff;
+	struct cu *cu = NULL;
+	size_t cuhl;
+
+	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
+			    &offset_size) == 0) {
+		Dwarf_Die die_mem;
+		Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
+
+		if (cu_die == NULL)
+			break;
+
+		if (cu == NULL) {
+			cu = cu__new("", pointer_size, build_id, build_id_len,
+				     filename);
+			if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
+				return DWARF_CB_ABORT;
+
+			dcu = malloc(sizeof(struct dwarf_cu));
+			if (dcu == NULL)
+				return DWARF_CB_ABORT;
+
+			/* Merged cu tends to need a lot more memory.
+			 * Let us start with max_hashtags__bits and
+			 * go down to find a proper hashtag bit value.
+			 */
+			uint32_t default_hbits = hashtags__bits;
+			for (hashtags__bits = max_hashtags__bits;
+			     hashtags__bits >= default_hbits;
+			     hashtags__bits--) {
+				if (dwarf_cu__init(dcu) == 0)
+					break;
+			}
+			if (hashtags__bits < default_hbits)
+				return DWARF_CB_ABORT;
+
+			dcu->cu = cu;
+			dcu->type_unit = type_dcu;
+			cu->priv = dcu;
+			cu->dfops = &dwarf__ops;
+			cu->language = attr_numeric(cu_die, DW_AT_language);
+		}
+
+		Dwarf_Die child;
+		if (dwarf_child(cu_die, &child) == 0) {
+			if (die__process_unit(&child, cu) != 0)
+				return DWARF_CB_ABORT;
+		}
+
+		off = noff;
+	}
+
+	/* process merged cu */
+	if (cu__recode_dwarf_types(cu) != LSK__KEEPIT)
+		return DWARF_CB_ABORT;
+	if (finalize_cu_immediately(cus, cu, dcu, conf)
+	    == LSK__STOP_LOADING)
+		return DWARF_CB_ABORT;
+
+	return 0;
+}
+
 static int cus__load_module(struct cus *cus, struct conf_load *conf,
 			    Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 			    const char *filename)
@@ -2518,6 +2628,15 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		}
 	}
 
+	if (cus__merging_cu(dw)) {
+		res = cus__merge_and_process_cu(cus, conf, mod, dw, elf, filename,
+						build_id, build_id_len,
+						type_cu ? &type_dcu : NULL);
+		if (res)
+			return res;
+		goto out;
+	}
+
 	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
 			    &offset_size) == 0) {
 		Dwarf_Die die_mem;
@@ -2557,6 +2676,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		off = noff;
 	}
 
+out:
 	if (type_lsk == LSK__DELETE)
 		cu__delete(type_cu);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
                   ` (2 preceding siblings ...)
  2021-03-28 20:14 ` [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary Yonghong Song
@ 2021-03-29 17:40 ` Arnaldo Carvalho de Melo
  2021-03-30 15:10   ` Arnaldo Carvalho de Melo
  2021-03-29 23:14 ` Nick Desaulniers
  4 siblings, 1 reply; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-29 17:40 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Sun, Mar 28, 2021 at 01:14:00PM -0700, Yonghong Song escreveu:
> For vmlinux built with clang thin-lto or lto for latest bpf-next,
> there exist cross cu debuginfo type references. For example,
>       compile unit 1:
>          tag 10:  type A
>       compile unit 2:
>          ...
>            refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
> I am using latest llvm trunk and bpf-next. I suspect llvm12 or
> linus tree >= 5.12 rc2 should be able to exhibit the issue as well.
> Both thin-lto and lto have the same issues.

Works, now we're again at:

[acme@five pahole]$ time btfdiff vmlinux
real	0m7.679s
user	0m7.337s
sys	0m0.303s
[acme@five pahole]$ time btfdiff vmlinux.clang.thin.LTO
--- /tmp/btfdiff.dwarf.Ls059V	2021-03-29 14:36:02.675859035 -0300
+++ /tmp/btfdiff.btf.rxRd6R	2021-03-29 14:36:02.935864663 -0300
@@ -67255,7 +67255,7 @@ struct cpu_rmap {
 	struct {
 		u16                index;                /*    16     2 */
 		u16                dist;                 /*    18     2 */
-	} near[0]; /*    16     0 */
+	} near[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 5 */
 	/* last cacheline: 16 bytes */
@@ -101181,7 +101181,7 @@ struct linux_efi_memreserve {
 	struct {
 		phys_addr_t        base;                 /*    16     8 */
 		phys_addr_t        size;                 /*    24     8 */
-	} entry[0]; /*    16     0 */
+	} entry[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 4 */
 	/* last cacheline: 16 bytes */
@@ -113516,7 +113516,7 @@ struct netlink_policy_dump_state {
 	struct {
 		const struct nla_policy  * policy;       /*    16     8 */
 		unsigned int       maxtype;              /*    24     4 */
-	} policies[0]; /*    16     0 */
+	} policies[]; /*    16     0 */

 	/* size: 16, cachelines: 1, members: 4 */
 	/* sum members: 12, holes: 1, sum holes: 4 */

real	0m20.402s
user	0m19.163s
sys	0m1.096s
[acme@five pahole]$

And:

[acme@five pahole]$ ulimit -c 10000000
[acme@five pahole]$
[acme@five pahole]$ file tcp_bbr.o
tcp_bbr.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
[acme@five pahole]$ readelf -wi tcp_bbr.o | grep DW_AT_producer
    <d>   DW_AT_producer    : (indirect string, offset: 0x4a97): GNU C89 10.2.1 20200723 (Red Hat 10.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -O2 -std=gnu90 -p -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-var-tracking-assignments -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fstack-check=no -fconserve-stack -fcf-protection=none
[acme@five pahole]$ fullcircle tcp_bbr.o
/home/acme/bin/fullcircle: line 38: 3969006 Segmentation fault      (core dumped) ${pfunct_bin} --compile $file > $c_output
/tmp/fullcircle.4XujnI.c:1435:2: error: unterminated comment
 1435 |  /* si
      |  ^
/tmp/fullcircle.4XujnI.c:1433:2: error: expected specifier-qualifier-list at end of input
 1433 |  u32 *                      saved_syn;            /*  2184     8 */
      |  ^~~
codiff: couldn't load debugging info from /tmp/fullcircle.ZOVXGv.o
/home/acme/bin/fullcircle: line 40: 3969019 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
[acme@five pahole]$

Both seem unrelated to what you've done here, I'm investigating it now.

- Arnaldo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
                   ` (3 preceding siblings ...)
  2021-03-29 17:40 ` [PATCH dwarves v3 0/3] " Arnaldo Carvalho de Melo
@ 2021-03-29 23:14 ` Nick Desaulniers
  4 siblings, 0 replies; 18+ messages in thread
From: Nick Desaulniers @ 2021-03-29 23:14 UTC (permalink / raw)
  To: yhs
  Cc: andrii, arnaldo.melo, ast, bpf, dwarves, kernel-team, morbo,
	clang-built-linux, sedat.dilek, Nick Desaulniers

(replying manually to https://lore.kernel.org/dwarves/20210328201400.1426437-1-yhs@fb.com/)

I didn't validate or try to use the produced data, but with this and the
kernel patch
https://lore.kernel.org/bpf/20210328064121.2062927-1-yhs@fb.com/

I was able to build a x86_64 defconfig + CONFIG_LTO_CLANG_THIN +
CONFIG_DEBUG_INFO_BTF without further errors.  Thank you for the series! FWIW:

Tested-by: Nick Desaulniers <ndesaulniers@google.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-29 17:40 ` [PATCH dwarves v3 0/3] " Arnaldo Carvalho de Melo
@ 2021-03-30 15:10   ` Arnaldo Carvalho de Melo
  2021-03-30 18:08     ` Arnaldo Carvalho de Melo
  2021-03-31  0:29     ` Yonghong Song
  0 siblings, 2 replies; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-30 15:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Yonghong Song, dwarves, Alexei Starovoitov, Andrii Nakryiko,
	Bill Wendling, bpf, kernel-team

Em Mon, Mar 29, 2021 at 02:40:05PM -0300, Arnaldo Carvalho de Melo escreveu:
> [acme@five pahole]$ ulimit -c 10000000
> [acme@five pahole]$
> [acme@five pahole]$ file tcp_bbr.o
> tcp_bbr.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
> [acme@five pahole]$ readelf -wi tcp_bbr.o | grep DW_AT_producer
>     <d>   DW_AT_producer    : (indirect string, offset: 0x4a97): GNU C89 10.2.1 20200723 (Red Hat 10.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -O2 -std=gnu90 -p -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-var-tracking-assignments -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fstack-check=no -fconserve-stack -fcf-protection=none
> [acme@five pahole]$ fullcircle tcp_bbr.o
> /home/acme/bin/fullcircle: line 38: 3969006 Segmentation fault      (core dumped) ${pfunct_bin} --compile $file > $c_output
> /tmp/fullcircle.4XujnI.c:1435:2: error: unterminated comment
>  1435 |  /* si
>       |  ^
> /tmp/fullcircle.4XujnI.c:1433:2: error: expected specifier-qualifier-list at end of input
>  1433 |  u32 *                      saved_syn;            /*  2184     8 */
>       |  ^~~
> codiff: couldn't load debugging info from /tmp/fullcircle.ZOVXGv.o
> /home/acme/bin/fullcircle: line 40: 3969019 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
> [acme@five pahole]$
> 
> Both seem unrelated to what you've done here, I'm investigating it now.

The fullcircle one, that crashes at the 'codiff' utility is related to
the patch that makes dwarf_cu to allocate space for the hash tables, as
you introduced a destructor for the dwarf_cu hashtables and the dwarf_cu
that was assigned to cu->priv was a local variable, which wasn't much of
a problem because we were not freeing it, as it went away at each loop
iteration, the following patch to that first patch in the series seems
to cure it, I'm folding it into your patch + a commiter note.

- Arnaldo

diff --git a/dwarf_loader.c b/dwarf_loader.c
index 5a1e860da079e04c..3e7875d4ab577f1b 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -150,6 +150,18 @@ static int dwarf_cu__init(struct dwarf_cu *dcu)
 	return 0;
 }
 
+static struct dwarf_cu *dwarf_cu__new(void)
+{
+	struct dwarf_cu *dwarf_cu = zalloc(sizeof(*dwarf_cu));
+
+	if (dwarf_cu != NULL && dwarf_cu__init(dwarf_cu) != 0) {
+		free(dwarf_cu);
+		dwarf_cu = NULL;
+	}
+
+	return dwarf_cu;
+}
+
 static void dwarf_cu__delete(struct cu *cu)
 {
 	struct dwarf_cu *dcu = cu->priv;
@@ -2542,21 +2554,20 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		}
 		cu->little_endian = ehdr.e_ident[EI_DATA] == ELFDATA2LSB;
 
-		struct dwarf_cu dcu;
+		struct dwarf_cu *dcu = dwarf_cu__new();
 
-		if (dwarf_cu__init(&dcu) != 0)
+		if (dcu == NULL)
 			return DWARF_CB_ABORT;
 
-		dcu.cu = cu;
-		dcu.type_unit = type_cu ? &type_dcu : NULL;
-		cu->priv = &dcu;
+		dcu->cu = cu;
+		dcu->type_unit = type_cu ? &type_dcu : NULL;
+		cu->priv = dcu;
 		cu->dfops = &dwarf__ops;
 
 		if (die__process_and_recode(cu_die, cu) != 0)
 			return DWARF_CB_ABORT;
 
-		if (finalize_cu_immediately(cus, cu, &dcu, conf)
-		    == LSK__STOP_LOADING)
+		if (finalize_cu_immediately(cus, cu, dcu, conf) == LSK__STOP_LOADING)
 			return DWARF_CB_ABORT;
 
 		off = noff;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-30 15:10   ` Arnaldo Carvalho de Melo
@ 2021-03-30 18:08     ` Arnaldo Carvalho de Melo
  2021-03-30 18:24       ` Arnaldo Carvalho de Melo
  2021-03-31  0:29     ` Yonghong Song
  1 sibling, 1 reply; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-30 18:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Yonghong Song, dwarves, Alexei Starovoitov, Andrii Nakryiko,
	Bill Wendling, bpf, kernel-team

Em Tue, Mar 30, 2021 at 12:10:10PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Mar 29, 2021 at 02:40:05PM -0300, Arnaldo Carvalho de Melo escreveu:
> > [acme@five pahole]$ ulimit -c 10000000
> > [acme@five pahole]$
> > [acme@five pahole]$ file tcp_bbr.o
> > tcp_bbr.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
> > [acme@five pahole]$ readelf -wi tcp_bbr.o | grep DW_AT_producer
> >     <d>   DW_AT_producer    : (indirect string, offset: 0x4a97): GNU C89 10.2.1 20200723 (Red Hat 10.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -O2 -std=gnu90 -p -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-var-tracking-assignments -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fstack-check=no -fconserve-stack -fcf-protection=none
> > [acme@five pahole]$ fullcircle tcp_bbr.o
> > /home/acme/bin/fullcircle: line 38: 3969006 Segmentation fault      (core dumped) ${pfunct_bin} --compile $file > $c_output
> > /tmp/fullcircle.4XujnI.c:1435:2: error: unterminated comment
> >  1435 |  /* si
> >       |  ^
> > /tmp/fullcircle.4XujnI.c:1433:2: error: expected specifier-qualifier-list at end of input
> >  1433 |  u32 *                      saved_syn;            /*  2184     8 */
> >       |  ^~~
> > codiff: couldn't load debugging info from /tmp/fullcircle.ZOVXGv.o
> > /home/acme/bin/fullcircle: line 40: 3969019 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
> > [acme@five pahole]$
> > 
> > Both seem unrelated to what you've done here, I'm investigating it now.
> 
> The fullcircle one, that crashes at the 'codiff' utility is related to
> the patch that makes dwarf_cu to allocate space for the hash tables, as
> you introduced a destructor for the dwarf_cu hashtables and the dwarf_cu
> that was assigned to cu->priv was a local variable, which wasn't much of
> a problem because we were not freeing it, as it went away at each loop
> iteration, the following patch to that first patch in the series seems
> to cure it, I'm folding it into your patch + a commiter note.

[acme@five pahole]$ codiff tcp_bbr.o /tmp/fullcircle.ceBLyj.o
/home/acme/git/linux/net/ipv4/tcp_bbr.c:
  bbr_unregister                    |   -6
  __compiletime_assert_691          |   +0
  bbr_register                      |  -11
  bbr_ssthresh                      |  -76
  bbr_undo_cwnd                     | -101
  bbr_sndbuf_expand                 |  -11
  bbr_init                          | -385
  bbr_main                          | -2640
  bbr_lt_bw_sampling                | -803
  bbr_packets_in_net_at_edt         | -212
  bbr_inflight                      | -172
  __compiletime_assert_655          |   +0
  bbr_set_pacing_rate               | -182
  kcsan_check_access                |   +6
  kasan_check_write                 |  +14
  tcp_unregister_congestion_control |   +0
  tcp_register_congestion_control   |   +0
  minmax_running_max                |   +0
  prandom_u32                       |   +0
  __warn_printk                     |   +0
  __stack_chk_fail                  |   +0
 21 functions changed, 20 bytes added, 4599 bytes removed, diff: -4579
[acme@five pahole]$
[acme@five pahole]$
[acme@five pahole]$ fullcircle tcp_bbr.o
[acme@five pahole]$

This one is dealt with, doing some more tests and looking at that
array[] versus array[0].

- Arnaldo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-30 18:08     ` Arnaldo Carvalho de Melo
@ 2021-03-30 18:24       ` Arnaldo Carvalho de Melo
  2021-03-31  3:20         ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-30 18:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Yonghong Song, dwarves, Alexei Starovoitov, Andrii Nakryiko,
	Bill Wendling, bpf, kernel-team

Em Tue, Mar 30, 2021 at 03:08:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> [acme@five pahole]$
> [acme@five pahole]$
> [acme@five pahole]$ fullcircle tcp_bbr.o
> [acme@five pahole]$
> 
> This one is dealt with, doing some more tests and looking at that
> array[] versus array[0].

I've pushed what I have to the main repos at kernel.org and github,
please check, I'll continue from there.

- Arnaldo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-28 20:14 ` [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary Yonghong Song
@ 2021-03-30 20:08   ` Bill Wendling
  2021-03-30 20:15     ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Bill Wendling @ 2021-03-30 20:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, bpf, kernel-team

On Sun, Mar 28, 2021 at 1:14 PM Yonghong Song <yhs@fb.com> wrote:
>
> For vmlinux built with clang thin-lto or lto, there exist
> cross cu type references. For example, the below can happen:
>   compile unit 1:
>      tag 10:  type A
>   compile unit 2:
>      ...
>        refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
>
> To resolve this issue, the tag DW_AT_producer of the first
> DW_TAG_compile_unit is checked. If the binary is built
> with clang lto, all debuginfo dwarf cu's will be merged
> into one pahole cu which will resolve the above
> cross-cu tag reference issue. To test whether a binary
> is built with clang lto or not, The "clang version"
> and "-flto" will be checked against DW_AT_producer string
> for the first 5 debuginfo cu's. The reason is that
> a few linux files disabled lto for various reasons.
>
> Merging cu's will create a single cu with lots of types, tags
> and functions. For example with clang thin-lto built vmlinux,
> I saw 9M entries in types table, 5.2M in tags table. The
> below are pahole wallclock time for different hashbits:
> command line: time pahole -J vmlinux
>       # of hashbits            wallclock time in seconds
>           15                       460
>           16                       255
>           17                       131
>           18                       97
>           19                       75
>           20                       69
>           21                       64
>           22                       62
>           23                       58
>           24                       64
>
> The problem is with hashtags__find(), esp. the loop
>     uint32_t bucket = hashtags__fn(id);
>     const struct hlist_head *head = hashtable + bucket;
>     hlist_for_each_entry(tpos, pos, head, hash_node) {
>             if (tpos->id == id)
>                     return tpos;
>     }
>
> Say we have 9M types and (1 << 15) buckets, that means each bucket
> will have roughly 64 elements. So each lookup will traverse
> the loop 32 iterations on average.
>
> If we have 1 << 21 buckets, then each buckets will have 4 elements,
> and the average number of loop iterations for hashtags__find()
> will be 2.
>
> Note that the number of hashbits 24 makes performance worse
> than 23. The reason could be that 23 hashbits can cover 8M
> buckets (close to 9M for the number of entries in types table).
> Higher number of hash bits allocates more memory and becomes
> less cache efficient compared to 23 hashbits.
>
> This patch picks # of hashbits 21 as the starting value
> and will try to allocate memory based on that, if memory
> allocation fails, we will go with less hashbits until
> we reach hashbits 15 which is the default for
> non merge-cu case.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 120 insertions(+)
>
> diff --git a/dwarf_loader.c b/dwarf_loader.c
> index aa6372a..a51391e 100644
> --- a/dwarf_loader.c
> +++ b/dwarf_loader.c
> @@ -51,6 +51,7 @@ struct strings *strings;
>  #endif
>
>  static uint32_t hashtags__bits = 15;
> +static uint32_t max_hashtags__bits = 21;
>
>  static uint32_t hashtags__fn(Dwarf_Off key)
>  {
> @@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
>         return 0;
>  }
>
> +static bool cus__merging_cu(Dwarf *dw)
> +{
> +       uint8_t pointer_size, offset_size;
> +       Dwarf_Off off = 0, noff;
> +       size_t cuhl;
> +       int cnt = 0;
> +
> +       /*
> +        * Just checking the first cu is not enough.
> +        * In linux, some C files may have LTO is disabled, e.g.,
> +        *   e242db40be27  x86, vdso: disable LTO only for vDSO
> +        *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
> +        * Fortunately, disabling LTO for a particular file in a LTO build
> +        * is rather an exception. Iterating 5 cu's to check whether
> +        * LTO is used or not should be enough.
> +        */
> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
> +                           &offset_size) == 0) {
> +               Dwarf_Die die_mem;
> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
> +
> +               if (cu_die == NULL)
> +                       break;
> +
> +               if (++cnt > 5)
> +                       break;
> +
> +               const char *producer = attr_string(cu_die, DW_AT_producer);
> +               if (strstr(producer, "clang version") != NULL &&
> +                   strstr(producer, "-flto") != NULL)

Instead of checking for flags, which can be a bit brittle, would it
make more sense to scan the abbreviations to see if there are any
"sec_offset" encodings used for type attributes to indicate that LTO
was used?

Thank you for improving on my hacky patch! :-)

-bw

> +                       return true;
> +
> +               off = noff;
> +       }
> +
> +       return false;
> +}
> +
> +static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
> +                                    Dwfl_Module *mod, Dwarf *dw, Elf *elf,
> +                                    const char *filename,
> +                                    const unsigned char *build_id,
> +                                    int build_id_len,
> +                                    struct dwarf_cu *type_dcu)
> +{
> +       uint8_t pointer_size, offset_size;
> +       struct dwarf_cu *dcu = NULL;
> +       Dwarf_Off off = 0, noff;
> +       struct cu *cu = NULL;
> +       size_t cuhl;
> +
> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
> +                           &offset_size) == 0) {
> +               Dwarf_Die die_mem;
> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
> +
> +               if (cu_die == NULL)
> +                       break;
> +
> +               if (cu == NULL) {
> +                       cu = cu__new("", pointer_size, build_id, build_id_len,
> +                                    filename);
> +                       if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
> +                               return DWARF_CB_ABORT;
> +
> +                       dcu = malloc(sizeof(struct dwarf_cu));
> +                       if (dcu == NULL)
> +                               return DWARF_CB_ABORT;
> +
> +                       /* Merged cu tends to need a lot more memory.
> +                        * Let us start with max_hashtags__bits and
> +                        * go down to find a proper hashtag bit value.
> +                        */
> +                       uint32_t default_hbits = hashtags__bits;
> +                       for (hashtags__bits = max_hashtags__bits;
> +                            hashtags__bits >= default_hbits;
> +                            hashtags__bits--) {
> +                               if (dwarf_cu__init(dcu) == 0)
> +                                       break;
> +                       }
> +                       if (hashtags__bits < default_hbits)
> +                               return DWARF_CB_ABORT;
> +
> +                       dcu->cu = cu;
> +                       dcu->type_unit = type_dcu;
> +                       cu->priv = dcu;
> +                       cu->dfops = &dwarf__ops;
> +                       cu->language = attr_numeric(cu_die, DW_AT_language);
> +               }
> +
> +               Dwarf_Die child;
> +               if (dwarf_child(cu_die, &child) == 0) {
> +                       if (die__process_unit(&child, cu) != 0)
> +                               return DWARF_CB_ABORT;
> +               }
> +
> +               off = noff;
> +       }
> +
> +       /* process merged cu */
> +       if (cu__recode_dwarf_types(cu) != LSK__KEEPIT)
> +               return DWARF_CB_ABORT;
> +       if (finalize_cu_immediately(cus, cu, dcu, conf)
> +           == LSK__STOP_LOADING)
> +               return DWARF_CB_ABORT;
> +
> +       return 0;
> +}
> +
>  static int cus__load_module(struct cus *cus, struct conf_load *conf,
>                             Dwfl_Module *mod, Dwarf *dw, Elf *elf,
>                             const char *filename)
> @@ -2518,6 +2628,15 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
>                 }
>         }
>
> +       if (cus__merging_cu(dw)) {
> +               res = cus__merge_and_process_cu(cus, conf, mod, dw, elf, filename,
> +                                               build_id, build_id_len,
> +                                               type_cu ? &type_dcu : NULL);
> +               if (res)
> +                       return res;
> +               goto out;
> +       }
> +
>         while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
>                             &offset_size) == 0) {
>                 Dwarf_Die die_mem;
> @@ -2557,6 +2676,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
>                 off = noff;
>         }
>
> +out:
>         if (type_lsk == LSK__DELETE)
>                 cu__delete(type_cu);
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-30 20:08   ` Bill Wendling
@ 2021-03-30 20:15     ` Yonghong Song
  2021-03-30 21:44       ` Bill Wendling
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2021-03-30 20:15 UTC (permalink / raw)
  To: Bill Wendling
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, bpf, kernel-team



On 3/30/21 1:08 PM, Bill Wendling wrote:
> On Sun, Mar 28, 2021 at 1:14 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> For vmlinux built with clang thin-lto or lto, there exist
>> cross cu type references. For example, the below can happen:
>>    compile unit 1:
>>       tag 10:  type A
>>    compile unit 2:
>>       ...
>>         refer to type A (tag 10 in compile unit 1)
>> I only checked a few but have seen type A may be a simple type
>> like "unsigned char" or a complex type like an array of base types.
>>
>> To resolve this issue, the tag DW_AT_producer of the first
>> DW_TAG_compile_unit is checked. If the binary is built
>> with clang lto, all debuginfo dwarf cu's will be merged
>> into one pahole cu which will resolve the above
>> cross-cu tag reference issue. To test whether a binary
>> is built with clang lto or not, The "clang version"
>> and "-flto" will be checked against DW_AT_producer string
>> for the first 5 debuginfo cu's. The reason is that
>> a few linux files disabled lto for various reasons.
>>
>> Merging cu's will create a single cu with lots of types, tags
>> and functions. For example with clang thin-lto built vmlinux,
>> I saw 9M entries in types table, 5.2M in tags table. The
>> below are pahole wallclock time for different hashbits:
>> command line: time pahole -J vmlinux
>>        # of hashbits            wallclock time in seconds
>>            15                       460
>>            16                       255
>>            17                       131
>>            18                       97
>>            19                       75
>>            20                       69
>>            21                       64
>>            22                       62
>>            23                       58
>>            24                       64
>>
>> The problem is with hashtags__find(), esp. the loop
>>      uint32_t bucket = hashtags__fn(id);
>>      const struct hlist_head *head = hashtable + bucket;
>>      hlist_for_each_entry(tpos, pos, head, hash_node) {
>>              if (tpos->id == id)
>>                      return tpos;
>>      }
>>
>> Say we have 9M types and (1 << 15) buckets, that means each bucket
>> will have roughly 64 elements. So each lookup will traverse
>> the loop 32 iterations on average.
>>
>> If we have 1 << 21 buckets, then each buckets will have 4 elements,
>> and the average number of loop iterations for hashtags__find()
>> will be 2.
>>
>> Note that the number of hashbits 24 makes performance worse
>> than 23. The reason could be that 23 hashbits can cover 8M
>> buckets (close to 9M for the number of entries in types table).
>> Higher number of hash bits allocates more memory and becomes
>> less cache efficient compared to 23 hashbits.
>>
>> This patch picks # of hashbits 21 as the starting value
>> and will try to allocate memory based on that, if memory
>> allocation fails, we will go with less hashbits until
>> we reach hashbits 15 which is the default for
>> non merge-cu case.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 120 insertions(+)
>>
>> diff --git a/dwarf_loader.c b/dwarf_loader.c
>> index aa6372a..a51391e 100644
>> --- a/dwarf_loader.c
>> +++ b/dwarf_loader.c
>> @@ -51,6 +51,7 @@ struct strings *strings;
>>   #endif
>>
>>   static uint32_t hashtags__bits = 15;
>> +static uint32_t max_hashtags__bits = 21;
>>
>>   static uint32_t hashtags__fn(Dwarf_Off key)
>>   {
>> @@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
>>          return 0;
>>   }
>>
>> +static bool cus__merging_cu(Dwarf *dw)
>> +{
>> +       uint8_t pointer_size, offset_size;
>> +       Dwarf_Off off = 0, noff;
>> +       size_t cuhl;
>> +       int cnt = 0;
>> +
>> +       /*
>> +        * Just checking the first cu is not enough.
>> +        * In linux, some C files may have LTO is disabled, e.g.,
>> +        *   e242db40be27  x86, vdso: disable LTO only for vDSO
>> +        *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
>> +        * Fortunately, disabling LTO for a particular file in a LTO build
>> +        * is rather an exception. Iterating 5 cu's to check whether
>> +        * LTO is used or not should be enough.
>> +        */
>> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
>> +                           &offset_size) == 0) {
>> +               Dwarf_Die die_mem;
>> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
>> +
>> +               if (cu_die == NULL)
>> +                       break;
>> +
>> +               if (++cnt > 5)
>> +                       break;
>> +
>> +               const char *producer = attr_string(cu_die, DW_AT_producer);
>> +               if (strstr(producer, "clang version") != NULL &&
>> +                   strstr(producer, "-flto") != NULL)
> 
> Instead of checking for flags, which can be a bit brittle, would it
> make more sense to scan the abbreviations to see if there are any
> "sec_offset" encodings used for type attributes to indicate that LTO
> was used?

Do you have additional info related to "sec_offset"? I scanned through 
my llvm-dwarfdump result and didn't find it.

> 
> Thank you for improving on my hacky patch! :-)
> 
> -bw
> 
>> +                       return true;
>> +
>> +               off = noff;
>> +       }
>> +
>> +       return false;
>> +}
>> +
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-30 20:15     ` Yonghong Song
@ 2021-03-30 21:44       ` Bill Wendling
  2021-03-30 22:28         ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Bill Wendling @ 2021-03-30 21:44 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, bpf, kernel-team

On Tue, Mar 30, 2021 at 1:15 PM Yonghong Song <yhs@fb.com> wrote:
> On 3/30/21 1:08 PM, Bill Wendling wrote:
> > On Sun, Mar 28, 2021 at 1:14 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> For vmlinux built with clang thin-lto or lto, there exist
> >> cross cu type references. For example, the below can happen:
> >>    compile unit 1:
> >>       tag 10:  type A
> >>    compile unit 2:
> >>       ...
> >>         refer to type A (tag 10 in compile unit 1)
> >> I only checked a few but have seen type A may be a simple type
> >> like "unsigned char" or a complex type like an array of base types.
> >>
> >> To resolve this issue, the tag DW_AT_producer of the first
> >> DW_TAG_compile_unit is checked. If the binary is built
> >> with clang lto, all debuginfo dwarf cu's will be merged
> >> into one pahole cu which will resolve the above
> >> cross-cu tag reference issue. To test whether a binary
> >> is built with clang lto or not, The "clang version"
> >> and "-flto" will be checked against DW_AT_producer string
> >> for the first 5 debuginfo cu's. The reason is that
> >> a few linux files disabled lto for various reasons.
> >>
> >> Merging cu's will create a single cu with lots of types, tags
> >> and functions. For example with clang thin-lto built vmlinux,
> >> I saw 9M entries in types table, 5.2M in tags table. The
> >> below are pahole wallclock time for different hashbits:
> >> command line: time pahole -J vmlinux
> >>        # of hashbits            wallclock time in seconds
> >>            15                       460
> >>            16                       255
> >>            17                       131
> >>            18                       97
> >>            19                       75
> >>            20                       69
> >>            21                       64
> >>            22                       62
> >>            23                       58
> >>            24                       64
> >>
> >> The problem is with hashtags__find(), esp. the loop
> >>      uint32_t bucket = hashtags__fn(id);
> >>      const struct hlist_head *head = hashtable + bucket;
> >>      hlist_for_each_entry(tpos, pos, head, hash_node) {
> >>              if (tpos->id == id)
> >>                      return tpos;
> >>      }
> >>
> >> Say we have 9M types and (1 << 15) buckets, that means each bucket
> >> will have roughly 64 elements. So each lookup will traverse
> >> the loop 32 iterations on average.
> >>
> >> If we have 1 << 21 buckets, then each buckets will have 4 elements,
> >> and the average number of loop iterations for hashtags__find()
> >> will be 2.
> >>
> >> Note that the number of hashbits 24 makes performance worse
> >> than 23. The reason could be that 23 hashbits can cover 8M
> >> buckets (close to 9M for the number of entries in types table).
> >> Higher number of hash bits allocates more memory and becomes
> >> less cache efficient compared to 23 hashbits.
> >>
> >> This patch picks # of hashbits 21 as the starting value
> >> and will try to allocate memory based on that, if memory
> >> allocation fails, we will go with less hashbits until
> >> we reach hashbits 15 which is the default for
> >> non merge-cu case.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 120 insertions(+)
> >>
> >> diff --git a/dwarf_loader.c b/dwarf_loader.c
> >> index aa6372a..a51391e 100644
> >> --- a/dwarf_loader.c
> >> +++ b/dwarf_loader.c
> >> @@ -51,6 +51,7 @@ struct strings *strings;
> >>   #endif
> >>
> >>   static uint32_t hashtags__bits = 15;
> >> +static uint32_t max_hashtags__bits = 21;
> >>
> >>   static uint32_t hashtags__fn(Dwarf_Off key)
> >>   {
> >> @@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
> >>          return 0;
> >>   }
> >>
> >> +static bool cus__merging_cu(Dwarf *dw)
> >> +{
> >> +       uint8_t pointer_size, offset_size;
> >> +       Dwarf_Off off = 0, noff;
> >> +       size_t cuhl;
> >> +       int cnt = 0;
> >> +
> >> +       /*
> >> +        * Just checking the first cu is not enough.
> >> +        * In linux, some C files may have LTO is disabled, e.g.,
> >> +        *   e242db40be27  x86, vdso: disable LTO only for vDSO
> >> +        *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
> >> +        * Fortunately, disabling LTO for a particular file in a LTO build
> >> +        * is rather an exception. Iterating 5 cu's to check whether
> >> +        * LTO is used or not should be enough.
> >> +        */
> >> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
> >> +                           &offset_size) == 0) {
> >> +               Dwarf_Die die_mem;
> >> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
> >> +
> >> +               if (cu_die == NULL)
> >> +                       break;
> >> +
> >> +               if (++cnt > 5)
> >> +                       break;
> >> +
> >> +               const char *producer = attr_string(cu_die, DW_AT_producer);
> >> +               if (strstr(producer, "clang version") != NULL &&
> >> +                   strstr(producer, "-flto") != NULL)
> >
> > Instead of checking for flags, which can be a bit brittle, would it
> > make more sense to scan the abbreviations to see if there are any
> > "sec_offset" encodings used for type attributes to indicate that LTO
> > was used?
>
> Do you have additional info related to "sec_offset"? I scanned through
> my llvm-dwarfdump result and didn't find it.
>
Sorry about that. It was the wrong thing to check. I consulted our
DWARF expert here and he said this.

"DW_FORM_ref_addr is the important thing used for cross-CU references,
like in the DW_AT_abstract_origin of the DW_TAG_inlined_subroutine. In
intra-CU references, DW_FORM_ref4 is used instead."

-bw

> >
> > Thank you for improving on my hacky patch! :-)
> >
> > -bw
> >
> >> +                       return true;
> >> +
> >> +               off = noff;
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >> +
> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-30 21:44       ` Bill Wendling
@ 2021-03-30 22:28         ` Yonghong Song
  2021-03-30 23:25           ` Bill Wendling
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2021-03-30 22:28 UTC (permalink / raw)
  To: Bill Wendling
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, bpf, kernel-team



On 3/30/21 2:44 PM, Bill Wendling wrote:
> On Tue, Mar 30, 2021 at 1:15 PM Yonghong Song <yhs@fb.com> wrote:
>> On 3/30/21 1:08 PM, Bill Wendling wrote:
>>> On Sun, Mar 28, 2021 at 1:14 PM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>> For vmlinux built with clang thin-lto or lto, there exist
>>>> cross cu type references. For example, the below can happen:
>>>>     compile unit 1:
>>>>        tag 10:  type A
>>>>     compile unit 2:
>>>>        ...
>>>>          refer to type A (tag 10 in compile unit 1)
>>>> I only checked a few but have seen type A may be a simple type
>>>> like "unsigned char" or a complex type like an array of base types.
>>>>
>>>> To resolve this issue, the tag DW_AT_producer of the first
>>>> DW_TAG_compile_unit is checked. If the binary is built
>>>> with clang lto, all debuginfo dwarf cu's will be merged
>>>> into one pahole cu which will resolve the above
>>>> cross-cu tag reference issue. To test whether a binary
>>>> is built with clang lto or not, The "clang version"
>>>> and "-flto" will be checked against DW_AT_producer string
>>>> for the first 5 debuginfo cu's. The reason is that
>>>> a few linux files disabled lto for various reasons.
>>>>
>>>> Merging cu's will create a single cu with lots of types, tags
>>>> and functions. For example with clang thin-lto built vmlinux,
>>>> I saw 9M entries in types table, 5.2M in tags table. The
>>>> below are pahole wallclock time for different hashbits:
>>>> command line: time pahole -J vmlinux
>>>>         # of hashbits            wallclock time in seconds
>>>>             15                       460
>>>>             16                       255
>>>>             17                       131
>>>>             18                       97
>>>>             19                       75
>>>>             20                       69
>>>>             21                       64
>>>>             22                       62
>>>>             23                       58
>>>>             24                       64
>>>>
>>>> The problem is with hashtags__find(), esp. the loop
>>>>       uint32_t bucket = hashtags__fn(id);
>>>>       const struct hlist_head *head = hashtable + bucket;
>>>>       hlist_for_each_entry(tpos, pos, head, hash_node) {
>>>>               if (tpos->id == id)
>>>>                       return tpos;
>>>>       }
>>>>
>>>> Say we have 9M types and (1 << 15) buckets, that means each bucket
>>>> will have roughly 64 elements. So each lookup will traverse
>>>> the loop 32 iterations on average.
>>>>
>>>> If we have 1 << 21 buckets, then each buckets will have 4 elements,
>>>> and the average number of loop iterations for hashtags__find()
>>>> will be 2.
>>>>
>>>> Note that the number of hashbits 24 makes performance worse
>>>> than 23. The reason could be that 23 hashbits can cover 8M
>>>> buckets (close to 9M for the number of entries in types table).
>>>> Higher number of hash bits allocates more memory and becomes
>>>> less cache efficient compared to 23 hashbits.
>>>>
>>>> This patch picks # of hashbits 21 as the starting value
>>>> and will try to allocate memory based on that, if memory
>>>> allocation fails, we will go with less hashbits until
>>>> we reach hashbits 15 which is the default for
>>>> non merge-cu case.
>>>>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 120 insertions(+)
>>>>
>>>> diff --git a/dwarf_loader.c b/dwarf_loader.c
>>>> index aa6372a..a51391e 100644
>>>> --- a/dwarf_loader.c
>>>> +++ b/dwarf_loader.c
>>>> @@ -51,6 +51,7 @@ struct strings *strings;
>>>>    #endif
>>>>
>>>>    static uint32_t hashtags__bits = 15;
>>>> +static uint32_t max_hashtags__bits = 21;
>>>>
>>>>    static uint32_t hashtags__fn(Dwarf_Off key)
>>>>    {
>>>> @@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
>>>>           return 0;
>>>>    }
>>>>
>>>> +static bool cus__merging_cu(Dwarf *dw)
>>>> +{
>>>> +       uint8_t pointer_size, offset_size;
>>>> +       Dwarf_Off off = 0, noff;
>>>> +       size_t cuhl;
>>>> +       int cnt = 0;
>>>> +
>>>> +       /*
>>>> +        * Just checking the first cu is not enough.
>>>> +        * In linux, some C files may have LTO is disabled, e.g.,
>>>> +        *   e242db40be27  x86, vdso: disable LTO only for vDSO
>>>> +        *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
>>>> +        * Fortunately, disabling LTO for a particular file in a LTO build
>>>> +        * is rather an exception. Iterating 5 cu's to check whether
>>>> +        * LTO is used or not should be enough.
>>>> +        */
>>>> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
>>>> +                           &offset_size) == 0) {
>>>> +               Dwarf_Die die_mem;
>>>> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
>>>> +
>>>> +               if (cu_die == NULL)
>>>> +                       break;
>>>> +
>>>> +               if (++cnt > 5)
>>>> +                       break;
>>>> +
>>>> +               const char *producer = attr_string(cu_die, DW_AT_producer);
>>>> +               if (strstr(producer, "clang version") != NULL &&
>>>> +                   strstr(producer, "-flto") != NULL)
>>>
>>> Instead of checking for flags, which can be a bit brittle, would it
>>> make more sense to scan the abbreviations to see if there are any
>>> "sec_offset" encodings used for type attributes to indicate that LTO
>>> was used?
>>
>> Do you have additional info related to "sec_offset"? I scanned through
>> my llvm-dwarfdump result and didn't find it.
>>
> Sorry about that. It was the wrong thing to check. I consulted our
> DWARF expert here and he said this.
> 
> "DW_FORM_ref_addr is the important thing used for cross-CU references,
> like in the DW_AT_abstract_origin of the DW_TAG_inlined_subroutine. In
> intra-CU references, DW_FORM_ref4 is used instead."

Thanks. I checked with lto dwarf, it should work. The only drawback is 
that it needs to traverse all DW_TAG_inlined_subroutine
and DW_TAG_type's in the cu. If nothing found, going to the next.
This is exactly what I tried to avoid in the beginning, going deep
in the dwarf cu. I could remember the maximum and minimum of the ref. 
type tags for each cu, if it is beyond the current cu
range, it means cross-cu access. But I didn't use this approach
as it is something every pahole user will pay the price regardless
of whether their binary is lto or not (mostly probably not.)

I checked my current vmlinux-lto. I will be able
to find a DW_FORM_ref_addr in the 5th cu. May not be too bad.
We could have a heuristic to only check the first 10 cu's.
If no cross cu reference, that means not lto. Not perfect.
But for sure we don't want to go through all cu's just to
find it is not lto build.

Still feel we should get whether it is a lto built binary
as soon as possible. With an explicit option, compiler flags
seem second choice...

> 
> -bw
> 
>>>
>>> Thank you for improving on my hacky patch! :-)
>>>
>>> -bw
>>>
>>>> +                       return true;
>>>> +
>>>> +               off = noff;
>>>> +       }
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary
  2021-03-30 22:28         ` Yonghong Song
@ 2021-03-30 23:25           ` Bill Wendling
  0 siblings, 0 replies; 18+ messages in thread
From: Bill Wendling @ 2021-03-30 23:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, bpf, kernel-team

On Tue, Mar 30, 2021 at 3:28 PM Yonghong Song <yhs@fb.com> wrote:
> On 3/30/21 2:44 PM, Bill Wendling wrote:
> > On Tue, Mar 30, 2021 at 1:15 PM Yonghong Song <yhs@fb.com> wrote:
> >> On 3/30/21 1:08 PM, Bill Wendling wrote:
> >>> On Sun, Mar 28, 2021 at 1:14 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>> For vmlinux built with clang thin-lto or lto, there exist
> >>>> cross cu type references. For example, the below can happen:
> >>>>     compile unit 1:
> >>>>        tag 10:  type A
> >>>>     compile unit 2:
> >>>>        ...
> >>>>          refer to type A (tag 10 in compile unit 1)
> >>>> I only checked a few but have seen type A may be a simple type
> >>>> like "unsigned char" or a complex type like an array of base types.
> >>>>
> >>>> To resolve this issue, the tag DW_AT_producer of the first
> >>>> DW_TAG_compile_unit is checked. If the binary is built
> >>>> with clang lto, all debuginfo dwarf cu's will be merged
> >>>> into one pahole cu which will resolve the above
> >>>> cross-cu tag reference issue. To test whether a binary
> >>>> is built with clang lto or not, The "clang version"
> >>>> and "-flto" will be checked against DW_AT_producer string
> >>>> for the first 5 debuginfo cu's. The reason is that
> >>>> a few linux files disabled lto for various reasons.
> >>>>
> >>>> Merging cu's will create a single cu with lots of types, tags
> >>>> and functions. For example with clang thin-lto built vmlinux,
> >>>> I saw 9M entries in types table, 5.2M in tags table. The
> >>>> below are pahole wallclock time for different hashbits:
> >>>> command line: time pahole -J vmlinux
> >>>>         # of hashbits            wallclock time in seconds
> >>>>             15                       460
> >>>>             16                       255
> >>>>             17                       131
> >>>>             18                       97
> >>>>             19                       75
> >>>>             20                       69
> >>>>             21                       64
> >>>>             22                       62
> >>>>             23                       58
> >>>>             24                       64
> >>>>
> >>>> The problem is with hashtags__find(), esp. the loop
> >>>>       uint32_t bucket = hashtags__fn(id);
> >>>>       const struct hlist_head *head = hashtable + bucket;
> >>>>       hlist_for_each_entry(tpos, pos, head, hash_node) {
> >>>>               if (tpos->id == id)
> >>>>                       return tpos;
> >>>>       }
> >>>>
> >>>> Say we have 9M types and (1 << 15) buckets, that means each bucket
> >>>> will have roughly 64 elements. So each lookup will traverse
> >>>> the loop 32 iterations on average.
> >>>>
> >>>> If we have 1 << 21 buckets, then each buckets will have 4 elements,
> >>>> and the average number of loop iterations for hashtags__find()
> >>>> will be 2.
> >>>>
> >>>> Note that the number of hashbits 24 makes performance worse
> >>>> than 23. The reason could be that 23 hashbits can cover 8M
> >>>> buckets (close to 9M for the number of entries in types table).
> >>>> Higher number of hash bits allocates more memory and becomes
> >>>> less cache efficient compared to 23 hashbits.
> >>>>
> >>>> This patch picks # of hashbits 21 as the starting value
> >>>> and will try to allocate memory based on that, if memory
> >>>> allocation fails, we will go with less hashbits until
> >>>> we reach hashbits 15 which is the default for
> >>>> non merge-cu case.
> >>>>
> >>>> Signed-off-by: Yonghong Song <yhs@fb.com>
> >>>> ---
> >>>>    dwarf_loader.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 120 insertions(+)
> >>>>
> >>>> diff --git a/dwarf_loader.c b/dwarf_loader.c
> >>>> index aa6372a..a51391e 100644
> >>>> --- a/dwarf_loader.c
> >>>> +++ b/dwarf_loader.c
> >>>> @@ -51,6 +51,7 @@ struct strings *strings;
> >>>>    #endif
> >>>>
> >>>>    static uint32_t hashtags__bits = 15;
> >>>> +static uint32_t max_hashtags__bits = 21;
> >>>>
> >>>>    static uint32_t hashtags__fn(Dwarf_Off key)
> >>>>    {
> >>>> @@ -2484,6 +2485,115 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
> >>>>           return 0;
> >>>>    }
> >>>>
> >>>> +static bool cus__merging_cu(Dwarf *dw)
> >>>> +{
> >>>> +       uint8_t pointer_size, offset_size;
> >>>> +       Dwarf_Off off = 0, noff;
> >>>> +       size_t cuhl;
> >>>> +       int cnt = 0;
> >>>> +
> >>>> +       /*
> >>>> +        * Just checking the first cu is not enough.
> >>>> +        * In linux, some C files may have LTO is disabled, e.g.,
> >>>> +        *   e242db40be27  x86, vdso: disable LTO only for vDSO
> >>>> +        *   d2dcd3e37475  x86, cpu: disable LTO for cpu.c
> >>>> +        * Fortunately, disabling LTO for a particular file in a LTO build
> >>>> +        * is rather an exception. Iterating 5 cu's to check whether
> >>>> +        * LTO is used or not should be enough.
> >>>> +        */
> >>>> +       while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
> >>>> +                           &offset_size) == 0) {
> >>>> +               Dwarf_Die die_mem;
> >>>> +               Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
> >>>> +
> >>>> +               if (cu_die == NULL)
> >>>> +                       break;
> >>>> +
> >>>> +               if (++cnt > 5)
> >>>> +                       break;
> >>>> +
> >>>> +               const char *producer = attr_string(cu_die, DW_AT_producer);
> >>>> +               if (strstr(producer, "clang version") != NULL &&
> >>>> +                   strstr(producer, "-flto") != NULL)
> >>>
> >>> Instead of checking for flags, which can be a bit brittle, would it
> >>> make more sense to scan the abbreviations to see if there are any
> >>> "sec_offset" encodings used for type attributes to indicate that LTO
> >>> was used?
> >>
> >> Do you have additional info related to "sec_offset"? I scanned through
> >> my llvm-dwarfdump result and didn't find it.
> >>
> > Sorry about that. It was the wrong thing to check. I consulted our
> > DWARF expert here and he said this.
> >
> > "DW_FORM_ref_addr is the important thing used for cross-CU references,
> > like in the DW_AT_abstract_origin of the DW_TAG_inlined_subroutine. In
> > intra-CU references, DW_FORM_ref4 is used instead."
>
> Thanks. I checked with lto dwarf, it should work. The only drawback is
> that it needs to traverse all DW_TAG_inlined_subroutine
> and DW_TAG_type's in the cu. If nothing found, going to the next.
> This is exactly what I tried to avoid in the beginning, going deep
> in the dwarf cu. I could remember the maximum and minimum of the ref.
> type tags for each cu, if it is beyond the current cu
> range, it means cross-cu access. But I didn't use this approach
> as it is something every pahole user will pay the price regardless
> of whether their binary is lto or not (mostly probably not.)
>
> I checked my current vmlinux-lto. I will be able
> to find a DW_FORM_ref_addr in the 5th cu. May not be too bad.
> We could have a heuristic to only check the first 10 cu's.
> If no cross cu reference, that means not lto. Not perfect.
> But for sure we don't want to go through all cu's just to
> find it is not lto build.
>
> Still feel we should get whether it is a lto built binary
> as soon as possible. With an explicit option, compiler flags
> seem second choice...
>
That's a good point. It sounds like restricting the search to the
first X CUs will be even more brittle than looking at the command line
flags. It's probably fine then to go with what you originally have.
Sorry for the noise.

-bw

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-30 15:10   ` Arnaldo Carvalho de Melo
  2021-03-30 18:08     ` Arnaldo Carvalho de Melo
@ 2021-03-31  0:29     ` Yonghong Song
  1 sibling, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2021-03-31  0:29 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo
  Cc: dwarves, Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	kernel-team



On 3/30/21 8:10 AM, Arnaldo Carvalho de Melo wrote:
> Em Mon, Mar 29, 2021 at 02:40:05PM -0300, Arnaldo Carvalho de Melo escreveu:
>> [acme@five pahole]$ ulimit -c 10000000
>> [acme@five pahole]$
>> [acme@five pahole]$ file tcp_bbr.o
>> tcp_bbr.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
>> [acme@five pahole]$ readelf -wi tcp_bbr.o | grep DW_AT_producer
>>      <d>   DW_AT_producer    : (indirect string, offset: 0x4a97): GNU C89 10.2.1 20200723 (Red Hat 10.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -O2 -std=gnu90 -p -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-var-tracking-assignments -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fstack-check=no -fconserve-stack -fcf-protection=none
>> [acme@five pahole]$ fullcircle tcp_bbr.o
>> /home/acme/bin/fullcircle: line 38: 3969006 Segmentation fault      (core dumped) ${pfunct_bin} --compile $file > $c_output
>> /tmp/fullcircle.4XujnI.c:1435:2: error: unterminated comment
>>   1435 |  /* si
>>        |  ^
>> /tmp/fullcircle.4XujnI.c:1433:2: error: expected specifier-qualifier-list at end of input
>>   1433 |  u32 *                      saved_syn;            /*  2184     8 */
>>        |  ^~~
>> codiff: couldn't load debugging info from /tmp/fullcircle.ZOVXGv.o
>> /home/acme/bin/fullcircle: line 40: 3969019 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
>> [acme@five pahole]$
>>
>> Both seem unrelated to what you've done here, I'm investigating it now.
> 
> The fullcircle one, that crashes at the 'codiff' utility is related to
> the patch that makes dwarf_cu to allocate space for the hash tables, as
> you introduced a destructor for the dwarf_cu hashtables and the dwarf_cu
> that was assigned to cu->priv was a local variable, which wasn't much of
> a problem because we were not freeing it, as it went away at each loop
> iteration, the following patch to that first patch in the series seems
> to cure it, I'm folding it into your patch + a commiter note.

Thanks for the fix!

> 
> - Arnaldo
> 
> diff --git a/dwarf_loader.c b/dwarf_loader.c
> index 5a1e860da079e04c..3e7875d4ab577f1b 100644
> --- a/dwarf_loader.c
> +++ b/dwarf_loader.c
> @@ -150,6 +150,18 @@ static int dwarf_cu__init(struct dwarf_cu *dcu)
>   	return 0;
>   }
>   
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-30 18:24       ` Arnaldo Carvalho de Melo
@ 2021-03-31  3:20         ` Yonghong Song
  2021-03-31 13:54           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2021-03-31  3:20 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Arnaldo Carvalho de Melo
  Cc: dwarves, Alexei Starovoitov, Andrii Nakryiko, Bill Wendling, bpf,
	kernel-team



On 3/30/21 11:24 AM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Mar 30, 2021 at 03:08:06PM -0300, Arnaldo Carvalho de Melo escreveu:
>> [acme@five pahole]$
>> [acme@five pahole]$
>> [acme@five pahole]$ fullcircle tcp_bbr.o
>> [acme@five pahole]$
>>
>> This one is dealt with, doing some more tests and looking at that
>> array[] versus array[0].
> 
> I've pushed what I have to the main repos at kernel.org and github,
> please check, I'll continue from there.

Looks good. Thanks!

I will try to experiment with an alternative way ([1]) to check whether
cross-cu reference happens or not. But at least checking flags
approach can be adapted to gcc (if we want after comparing the 
alternative) since gcc always has flags in dwarf.

[1] 
https://lore.kernel.org/bpf/d34a3d62-bae8-3a30-26b6-4e5e8efcd0af@fb.com/T/#m1b0b1206091c19a90b15d054aa26239101289f84

> 
> - Arnaldo
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-31  3:20         ` Yonghong Song
@ 2021-03-31 13:54           ` Arnaldo Carvalho de Melo
  2021-03-31 15:08             ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-31 13:54 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team

Em Tue, Mar 30, 2021 at 08:20:20PM -0700, Yonghong Song escreveu:
> On 3/30/21 11:24 AM, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Mar 30, 2021 at 03:08:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > [acme@five pahole]$ fullcircle tcp_bbr.o
> > > [acme@five pahole]$

> > > This one is dealt with, doing some more tests and looking at that
> > > array[] versus array[0].

> > I've pushed what I have to the main repos at kernel.org and github,
> > please check, I'll continue from there.

> Looks good. Thanks!

> I will try to experiment with an alternative way ([1]) to check whether
> cross-cu reference happens or not. But at least checking flags
> approach can be adapted to gcc (if we want after comparing the alternative)
> since gcc always has flags in dwarf.
 
> [1] https://lore.kernel.org/bpf/d34a3d62-bae8-3a30-26b6-4e5e8efcd0af@fb.com/T/#m1b0b1206091c19a90b15d054aa26239101289f84

I thought about some other method, like adding a ELF note to vmlinux
stating that this was built with LTO, that would be the fastest way, I
think. If that note wasn't there, then we would fallback to looking at
inter CU references, that way we would have the best of both worlds and
wouldn't incur in per-CU DW_AT_producer overheads with the flags for
each object file.

- Arnaldo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary
  2021-03-31 13:54           ` Arnaldo Carvalho de Melo
@ 2021-03-31 15:08             ` Yonghong Song
  0 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2021-03-31 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Arnaldo Carvalho de Melo, dwarves, Alexei Starovoitov,
	Andrii Nakryiko, Bill Wendling, bpf, kernel-team



On 3/31/21 6:54 AM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Mar 30, 2021 at 08:20:20PM -0700, Yonghong Song escreveu:
>> On 3/30/21 11:24 AM, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Mar 30, 2021 at 03:08:06PM -0300, Arnaldo Carvalho de Melo escreveu:
>>>> [acme@five pahole]$ fullcircle tcp_bbr.o
>>>> [acme@five pahole]$
> 
>>>> This one is dealt with, doing some more tests and looking at that
>>>> array[] versus array[0].
> 
>>> I've pushed what I have to the main repos at kernel.org and github,
>>> please check, I'll continue from there.
> 
>> Looks good. Thanks!
> 
>> I will try to experiment with an alternative way ([1]) to check whether
>> cross-cu reference happens or not. But at least checking flags
>> approach can be adapted to gcc (if we want after comparing the alternative)
>> since gcc always has flags in dwarf.
>   
>> [1] https://lore.kernel.org/bpf/d34a3d62-bae8-3a30-26b6-4e5e8efcd0af@fb.com/T/#m1b0b1206091c19a90b15d054aa26239101289f84
> 
> I thought about some other method, like adding a ELF note to vmlinux
> stating that this was built with LTO, that would be the fastest way, I

Adding to the ELF .notes is a great idea. Let me explore it. Thanks!

> think. If that note wasn't there, then we would fallback to looking at
> inter CU references, that way we would have the best of both worlds and
> wouldn't incur in per-CU DW_AT_producer overheads with the flags for
> each object file.

Totally agree.

> 
> - Arnaldo
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-03-31 15:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-28 20:14 [PATCH dwarves v3 0/3] permit merging all dwarf cu's for clang lto built binary Yonghong Song
2021-03-28 20:14 ` [PATCH dwarves v3 1/3] dwarf_loader: permits flexible HASHTAGS__BITS Yonghong Song
2021-03-28 20:14 ` [PATCH dwarves v3 2/3] dwarf_loader: factor out common code to initialize a cu Yonghong Song
2021-03-28 20:14 ` [PATCH dwarves v3 3/3] dwarf_loader: permit merging all dwarf cu's for clang lto built binary Yonghong Song
2021-03-30 20:08   ` Bill Wendling
2021-03-30 20:15     ` Yonghong Song
2021-03-30 21:44       ` Bill Wendling
2021-03-30 22:28         ` Yonghong Song
2021-03-30 23:25           ` Bill Wendling
2021-03-29 17:40 ` [PATCH dwarves v3 0/3] " Arnaldo Carvalho de Melo
2021-03-30 15:10   ` Arnaldo Carvalho de Melo
2021-03-30 18:08     ` Arnaldo Carvalho de Melo
2021-03-30 18:24       ` Arnaldo Carvalho de Melo
2021-03-31  3:20         ` Yonghong Song
2021-03-31 13:54           ` Arnaldo Carvalho de Melo
2021-03-31 15:08             ` Yonghong Song
2021-03-31  0:29     ` Yonghong Song
2021-03-29 23:14 ` Nick Desaulniers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.