linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
@ 2020-06-05 22:22 Saravana Kannan
  2020-06-16 21:39 ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Saravana Kannan @ 2020-06-05 22:22 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Saravana Kannan, Ard Biesheuvel, kernel-team, linux-arm-kernel,
	linux-kernel

When loading a module, module_frob_arch_sections() tries to figure out
the number of PLTs that'll be needed to handle all the RELAs. While
doing this, it tries to dedupe PLT allocations for multiple
R_AARCH64_CALL26 relocations to the same symbol. It does the same for
R_AARCH64_JUMP26 relocations too.

To make checks for duplicates easier/faster, it sorts the relocation
list by type, symbol and addend. That way, to check for a duplicate
relocation, it just needs to compare with the previous entry.
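
For illustration, once the list is sorted that check reduces to
something like the sketch below (a sketch only; the in-tree helpers
are cmp_rela() and duplicate_rel(), whose exact shape may differ):

static bool is_dup_of_prev(const Elf64_Rela *rela, int i)
{
	/* Assumes rela[0..i] is sorted by type, symbol index and addend. */
	if (i == 0)
		return false;

	return ELF64_R_TYPE(rela[i].r_info) == ELF64_R_TYPE(rela[i - 1].r_info) &&
	       ELF64_R_SYM(rela[i].r_info) == ELF64_R_SYM(rela[i - 1].r_info) &&
	       rela[i].r_addend == rela[i - 1].r_addend;
}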

However, sorting the entire relocation array is unnecessary and
expensive (O(n log n)) because there are a lot of other relocation types
that don't need deduping or can't be deduped.

So this commit partitions the array into entries that need deduping and
those that don't, and then sorts only the part that needs deduping.

This gives a significant reduction in module load time for modules with
a large number of relocations, with no measurable impact on modules with
a small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
enabled, the load time for one module went down from 268ms to 100ms.
Another module went down from 143ms to 83ms.

The sorting is also skipped entirely when CONFIG_RANDOMIZE_BASE is
disabled, because PLTs are not allocated for R_AARCH64_CALL26 and
R_AARCH64_JUMP26 relocations in that case.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
---
 arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index 65b08a74aec6..bf5118b3b828 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
 	return ret;
 }
 
+static bool rela_needs_dedup(Elf64_Rela *rela)
+{
+	return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
+	       || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
+}
+
+/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
+static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
+{
+	int i = 0, j = numrels - 1;
+	Elf64_Rela t;
+
+	while (i < j) {
+		while (rela_needs_dedup(rela + i) && i < j)
+			i++;
+		while (!rela_needs_dedup(rela + j) && i < j)
+			j--;
+		if (i < j) {
+			t = *(rela + j);
+			*(rela + j) = *(rela + i);
+			*(rela + i) = t;
+		}
+	}
+	/* If the entire array needs dedup, make sure i == numrels */
+	if (rela_needs_dedup(rela + i))
+		i++;
+
+	return i;
+}
+
 int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 			      char *secstrings, struct module *mod)
 {
@@ -291,6 +321,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 	for (i = 0; i < ehdr->e_shnum; i++) {
 		Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
 		int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+		int num_dedup_rels = 0;
 		Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
 
 		if (sechdrs[i].sh_type != SHT_RELA)
@@ -300,8 +331,12 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 		if (!(dstsec->sh_flags & SHF_EXECINSTR))
 			continue;
 
+		if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+			num_dedup_rels = partition_dedup_relas(rels, numrels);
 		/* sort by type, symbol index and addend */
-		sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+		if (num_dedup_rels)
+			sort(rels, num_dedup_rels, sizeof(Elf64_Rela),
+			     cmp_rela, NULL);
 
 		if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
 			core_plts += count_plts(syms, rels, numrels,
-- 
2.27.0.278.ge193c7cf3a9-goog



* Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
  2020-06-05 22:22 [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting Saravana Kannan
@ 2020-06-16 21:39 ` Will Deacon
  2020-06-17  3:38   ` Saravana Kannan
  2020-06-17  8:17   ` Ard Biesheuvel
  0 siblings, 2 replies; 6+ messages in thread
From: Will Deacon @ 2020-06-16 21:39 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Catalin Marinas, Ard Biesheuvel, kernel-team, linux-arm-kernel,
	linux-kernel

On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> When loading a module, module_frob_arch_sections() tries to figure out
> the number of PLTs that'll be needed to handle all the RELAs. While
> doing this, it tries to dedupe PLT allocations for multiple
> R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> R_AARCH64_JUMP26 relocations too.
> 
> To make checks for duplicates easier/faster, it sorts the relocation
> list by type, symbol and addend. That way, to check for a duplicate
> relocation, it just needs to compare with the previous entry.
> 
> However, sorting the entire relocation array is unnecessary and
> expensive (O(n log n)) because there are a lot of other relocation types
> that don't need deduping or can't be deduped.
> 
> So this commit partitions the array into entries that need deduping and
> those that don't. And then sorts just the part that needs deduping. And
> when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> if it's disabled.
> 
> This gives significant reduction in module load time for modules with
> large number of relocations with no measurable impact on modules with a
> small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> enabled, the load time for one module went down from 268ms to 100ms.
> Another module went down from 143ms to 83ms.

Whilst I can see that's a significant relative saving, what proportion of
actual boot time are we talking about here? It would be interesting to
know if there are bigger potential savings elsewhere.

> This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> disabled because it looks like PLTs are not allocated for
> R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> 
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
>  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> index 65b08a74aec6..bf5118b3b828 100644
> --- a/arch/arm64/kernel/module-plts.c
> +++ b/arch/arm64/kernel/module-plts.c
> @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
>  	return ret;
>  }
>  
> +static bool rela_needs_dedup(Elf64_Rela *rela)
> +{
> +	return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> +	       || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> +}

Does this handle A53 erratum 843419 correctly? I'm worried that we skip
the ADRP PLTs there.

> +
> +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> +{
> +	int i = 0, j = numrels - 1;
> +	Elf64_Rela t;
> +
> +	while (i < j) {
> +		while (rela_needs_dedup(rela + i) && i < j)
> +			i++;
> +		while (!rela_needs_dedup(rela + j) && i < j)
> +			j--;
> +		if (i < j) {
> +			t = *(rela + j);
> +			*(rela + j) = *(rela + i);
> +			*(rela + i) = t;
> +		}
> +	}

This is very hard to read and I think some of the 'i < j' comparisons are
redundant. Would it make more sense to assign a temporary rather than
post-inc/decrement and recheck?

Will


* Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
  2020-06-16 21:39 ` Will Deacon
@ 2020-06-17  3:38   ` Saravana Kannan
  2020-06-17  8:17   ` Ard Biesheuvel
  1 sibling, 0 replies; 6+ messages in thread
From: Saravana Kannan @ 2020-06-17  3:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Ard Biesheuvel, Android Kernel Team,
	linux-arm-kernel, LKML

On Tue, Jun 16, 2020 at 2:40 PM Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.

100ms is pretty big in terms of boot time from a phone OEM
perspective. It adds up: for the two modules I profiled, the savings
come to (268 - 100) + (143 - 83) = 228 ms. That's quite a lot.

Also, if you look at just the kernel init time plus all the module load
time, it comes to around 2.2 seconds on the hardware I'm testing on, so
228 ms is roughly a 10% reduction in the "kernel init" time. And this
2.2 seconds is without async probing; with async probing, the saving
becomes an even larger percentage of the kernel init time.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> >       return ret;
> >  }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>
> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.

So this isn't really skipping any PLTs or RELAs. It's just picking out
the RELAs that can benefit from deduping via the sort. Even if I delete
the entire sort() call, all the modules load and the device seems to
work fine (video streaming over Wi-Fi works). The catch is that the
module takes up more memory (because PLTs get duplicated more often) and
you might lose some caching benefit for PLT targets that are branched to
from many places (say, printk or something like that).

You'll see that duplicate_rel() isn't even called from any code related
to that erratum, so I think this is safe.
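
Roughly, the relevant part of the count_plts() loop looks like the
condensed paraphrase below (not the verbatim in-tree code, and
erratum_843419_applies() is a made-up stand-in for the real alignment
and CPU-capability checks):

	switch (ELF64_R_TYPE(rela[i].r_info)) {
	case R_AARCH64_JUMP26:
	case R_AARCH64_CALL26:
		/* Only branch PLTs are deduped against the sorted relas. */
		if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) &&
		    !duplicate_rel(rela, i))
			ret++;
		break;
	case R_AARCH64_ADR_PREL_PG_HI21:
	case R_AARCH64_ADR_PREL_PG_HI21_NC:
		/*
		 * Erratum 843419 ADRP veneers are counted without any
		 * dedup, so the partitioning/sorting never affects them.
		 */
		if (erratum_843419_applies(&rela[i])) /* hypothetical helper */
			ret++;
		break;
	}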

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > +     int i = 0, j = numrels - 1;
> > +     Elf64_Rela t;
> > +
> > +     while (i < j) {
> > +             while (rela_needs_dedup(rela + i) && i < j)
> > +                     i++;
> > +             while (!rela_needs_dedup(rela + j) && i < j)
> > +                     j--;
> > +             if (i < j) {
> > +                     t = *(rela + j);
> > +                     *(rela + j) = *(rela + i);
> > +                     *(rela + i) = t;
> > +             }
> > +     }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant.

I have no strong opinion on how this function is written. It's trivial
partitioning code, and I'll be happy to copy-paste any cleanup anyone
wants to suggest. I do think all the i < j checks are necessary in my
current implementation, though.

> Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?

I'm not sure I get what you are trying to say.

-Saravana


* Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
  2020-06-16 21:39 ` Will Deacon
  2020-06-17  3:38   ` Saravana Kannan
@ 2020-06-17  8:17   ` Ard Biesheuvel
  2020-06-17 14:05     ` Will Deacon
  1 sibling, 1 reply; 6+ messages in thread
From: Ard Biesheuvel @ 2020-06-17  8:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Saravana Kannan, Catalin Marinas, kernel-team,
	Linux Kernel Mailing List, Linux ARM, Ard Biesheuvel

On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.
>

Also, 'some module' vs 'some other module' doesn't really say
anything. Please name the modules and give their sizes.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> >       return ret;
> >  }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>

Would it help to check the section index here as well? Call/jump
instructions within a section are never sent through a PLT entry.

> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.
>

ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
back to the caller.

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > +     int i = 0, j = numrels - 1;
> > +     Elf64_Rela t;
> > +
> > +     while (i < j) {
> > +             while (rela_needs_dedup(rela + i) && i < j)
> > +                     i++;
> > +             while (!rela_needs_dedup(rela + j) && i < j)
> > +                     j--;
> > +             if (i < j) {
> > +                     t = *(rela + j);
> > +                     *(rela + j) = *(rela + i);
> > +                     *(rela + i) = t;
> > +             }
> > +     }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant. Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?
>

Agreed.

Also, what's wrong with [] array indexing?
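
For example, the same swap written with indexing would be:

	Elf64_Rela tmp = rela[i];

	rela[i] = rela[j];
	rela[j] = tmp;

or just swap(rela[i], rela[j]) using the kernel's swap() helper.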


* Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
  2020-06-17  8:17   ` Ard Biesheuvel
@ 2020-06-17 14:05     ` Will Deacon
  2020-06-18  2:06       ` Saravana Kannan
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2020-06-17 14:05 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Saravana Kannan, Catalin Marinas, kernel-team,
	Linux Kernel Mailing List, Linux ARM, Ard Biesheuvel

Hi all,

On Wed, Jun 17, 2020 at 10:17:33AM +0200, Ard Biesheuvel wrote:
> On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote:
> > On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > > This gives significant reduction in module load time for modules with
> > > large number of relocations with no measurable impact on modules with a
> > > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > > enabled, the load time for one module went down from 268ms to 100ms.
> > > Another module went down from 143ms to 83ms.
> >
> > Whilst I can see that's a significant relative saving, what proportion of
> > actual boot time are we talking about here? It would be interesting to
> > know if there are bigger potential savings elsewhere.
> >
> 
> Also, 'some module' vs 'some other module' doesn't really say
> anything. Please explain which modules and their sizes.

I suspect these are all out-of-tree modules, but yes, some metadata such as
the module sizes, number of relocs, etc. would be good to have in the
commit message.

> > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > > index 65b08a74aec6..bf5118b3b828 100644
> > > --- a/arch/arm64/kernel/module-plts.c
> > > +++ b/arch/arm64/kernel/module-plts.c
> > > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > >       return ret;
> > >  }
> > >
> > > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > > +{
> > > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > > +}
> >
> 
> Would it help to check the section index here as well? Call/jump
> instructions within a section are never sent through a PLT entry.

(I tried hacking this in below)

> > Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> > the ADRP PLTs there.
> >
> 
> ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
> back to the caller.

Duh yes, thanks. We can't trash the link register here.

> > > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > > +{
> > > +     int i = 0, j = numrels - 1;
> > > +     Elf64_Rela t;
> > > +
> > > +     while (i < j) {
> > > +             while (rela_needs_dedup(rela + i) && i < j)
> > > +                     i++;
> > > +             while (!rela_needs_dedup(rela + j) && i < j)
> > > +                     j--;
> > > +             if (i < j) {
> > > +                     t = *(rela + j);
> > > +                     *(rela + j) = *(rela + i);
> > > +                     *(rela + i) = t;
> > > +             }
> > > +     }
> >
> > This is very hard to read and I think some of the 'i < j' comparisons are
> > redundant. Would it make more sense to assign a temporary rather than
> > post-inc/decrement and recheck?
> >
> 
> Agreed.
> 
> Also, what's wrong with [] array indexing?

Saravana, since our stylistic objections are reasonably vague, I tried
to clean this up so you can get an idea of how I'd prefer it to look (can't
speak for Ard). I haven't tried running this, but please feel free to adapt
it. Replacement diff below.

Will

--->8

diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index 65b08a74aec6..204290314054 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -253,6 +253,39 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
 	return ret;
 }
 
+static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
+				  Elf64_Word dstidx)
+{
+
+	Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
+
+	if (s->st_shndx == dstidx)
+		return false;
+
+	return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
+	       ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
+}
+
+static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
+				      int numrels, Elf64_Word dstidx)
+{
+	int i = 0, j = numrels - 1;
+
+	if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+		return 0;
+
+	while (i < j) {
+		if (branch_rela_needs_plt(syms, &rela[i], dstidx))
+			i++;
+		else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
+			swap(rela[i], rela[j]);
+		else
+			j--;
+	}
+
+	return i;
+}
+
 int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 			      char *secstrings, struct module *mod)
 {
@@ -290,7 +323,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 
 	for (i = 0; i < ehdr->e_shnum; i++) {
 		Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
-		int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+		int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
 		Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
 
 		if (sechdrs[i].sh_type != SHT_RELA)
@@ -300,8 +333,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 		if (!(dstsec->sh_flags & SHF_EXECINSTR))
 			continue;
 
-		/* sort by type, symbol index and addend */
-		sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+		/*
+		 * sort branch relocations requiring a PLT by type, symbol index
+		 * and addend
+		 */
+		nents = partition_branch_plt_relas(syms, rels, numrels,
+						   sechdrs[i].sh_info);
+		if (nents)
+			sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);
 
 		if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
 			core_plts += count_plts(syms, rels, numrels,


* Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting
  2020-06-17 14:05     ` Will Deacon
@ 2020-06-18  2:06       ` Saravana Kannan
  0 siblings, 0 replies; 6+ messages in thread
From: Saravana Kannan @ 2020-06-18  2:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ard Biesheuvel, Catalin Marinas, Android Kernel Team,
	Linux Kernel Mailing List, Linux ARM, Ard Biesheuvel

On Wed, Jun 17, 2020 at 7:05 AM Will Deacon <will@kernel.org> wrote:
>
> Hi all,
>
> On Wed, Jun 17, 2020 at 10:17:33AM +0200, Ard Biesheuvel wrote:
> > On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote:
> > > On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > > > This gives significant reduction in module load time for modules with
> > > > large number of relocations with no measurable impact on modules with a
> > > > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > > > enabled, the load time for one module went down from 268ms to 100ms.
> > > > Another module went down from 143ms to 83ms.
> > >
> > > Whilst I can see that's a significant relative saving, what proportion of
> > > actual boot time are we talking about here? It would be interesting to
> > > know if there are bigger potential savings elsewhere.
> > >
> >
> > Also, 'some module' vs 'some other module' doesn't really say
> > anything. Please explain which modules and their sizes.
>
> I suspect these are all out-of-tree modules, but yes, some metadata such as
> sizes, nr or relocs etc would be good to have in the commit message.
>
> > > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > > > index 65b08a74aec6..bf5118b3b828 100644
> > > > --- a/arch/arm64/kernel/module-plts.c
> > > > +++ b/arch/arm64/kernel/module-plts.c
> > > > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > > >       return ret;
> > > >  }
> > > >
> > > > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > > > +{
> > > > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > > > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > > > +}
> > >
> >
> > Would it help to check the section index here as well? Call/jump
> > instructions within a section are never sent through a PLT entry.
>
> (I tried hacking this in below)
>
> > > Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> > > the ADRP PLTs there.
> > >
> >
> > ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
> > back to the caller.
>
> Duh yes, thanks. We can't trash the link register here.
>
> > > > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > > > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > > > +{
> > > > +     int i = 0, j = numrels - 1;
> > > > +     Elf64_Rela t;
> > > > +
> > > > +     while (i < j) {
> > > > +             while (rela_needs_dedup(rela + i) && i < j)
> > > > +                     i++;
> > > > +             while (!rela_needs_dedup(rela + j) && i < j)
> > > > +                     j--;
> > > > +             if (i < j) {
> > > > +                     t = *(rela + j);
> > > > +                     *(rela + j) = *(rela + i);
> > > > +                     *(rela + i) = t;
> > > > +             }
> > > > +     }
> > >
> > > This is very hard to read and I think some of the 'i < j' comparisons are
> > > redundant. Would it make more sense to assign a temporary rather than
> > > post-inc/decrement and recheck?
> > >
> >
> > Agreed.
> >
> > Also, what's wrong with [] array indexing?
>
> Saravana, since our stylistic objections are reasonably vague, I tried
> to clean this up so you can get an idea of how I'd prefer it to look (can't
> speak for Ard). I haven't tried running this, but please feel free to adapt
> it. Replacement diff below.
>
> Will
>
> --->8
>
> diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> index 65b08a74aec6..204290314054 100644
> --- a/arch/arm64/kernel/module-plts.c
> +++ b/arch/arm64/kernel/module-plts.c
> @@ -253,6 +253,39 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
>         return ret;
>  }
>
> +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> +                                 Elf64_Word dstidx)
> +{
> +
> +       Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> +
> +       if (s->st_shndx == dstidx)
> +               return false;
> +
> +       return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> +              ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> +}
> +
> +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> +                                     int numrels, Elf64_Word dstidx)
> +{
> +       int i = 0, j = numrels - 1;
> +
> +       if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> +               return 0;
> +
> +       while (i < j) {
> +               if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> +                       i++;
> +               else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> +                       swap(rela[i], rela[j]);
> +               else
> +                       j--;
> +       }
> +
> +       return i;
> +}
> +
>  int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
>                               char *secstrings, struct module *mod)
>  {
> @@ -290,7 +323,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
>
>         for (i = 0; i < ehdr->e_shnum; i++) {
>                 Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
> -               int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
> +               int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
>                 Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
>
>                 if (sechdrs[i].sh_type != SHT_RELA)
> @@ -300,8 +333,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
>                 if (!(dstsec->sh_flags & SHF_EXECINSTR))
>                         continue;
>
> -               /* sort by type, symbol index and addend */
> -               sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
> +               /*
> +                * sort branch relocations requiring a PLT by type, symbol index
> +                * and addend
> +                */
> +               nents = partition_branch_plt_relas(syms, rels, numrels,
> +                                                  sechdrs[i].sh_info);
> +               if (nents)
> +                       sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);
>
>                 if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
>                         core_plts += count_plts(syms, rels, numrels,

Thanks Will & Ard. I'll incorporate your feedback and send a v2 within
a few days.

-Saravana

