* [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
@ 2020-06-23 1:18 Saravana Kannan
2020-06-23 16:12 ` Will Deacon
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Saravana Kannan @ 2020-06-23 1:18 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Saravana Kannan, Ard Biesheuvel, kernel-team, linux-arm-kernel,
linux-kernel
When loading a module, module_frob_arch_sections() tries to figure out
the number of PLTs that'll be needed to handle all the RELAs. While
doing this, it tries to dedupe PLT allocations for multiple
R_AARCH64_CALL26 relocations to the same symbol. It does the same for
R_AARCH64_JUMP26 relocations.
To make checks for duplicates easier/faster, it sorts the relocation
list by type, symbol and addend. That way, to check for a duplicate
relocation, it just needs to compare with the previous entry.
However, sorting the entire relocation array is unnecessary and
expensive (O(n log n)) because there are a lot of other relocation types
that don't need deduping or can't be deduped.
So this commit partitions the array into entries that need deduping and
those that don't. And then sorts just the part that needs deduping. And
when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
if it's disabled.
This gives significant reduction in module load time for modules with
large number of relocations with no measurable impact on modules with a
small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
enabled, these were the results for a few downstream modules:
Module Size (MB)
wlan 14
video codec 3.8
drm 1.8
IPA 2.5
audio 1.2
gpu 1.8
Without this patch:
Module Number of entries sorted Module load time (ms)
wlan 243739 283
video codec 74029 138
drm 53837 67
IPA 42800 90
audio 21326 27
gpu 20967 32
Total time to load all these module: 637 ms
With this patch:
Module Number of entries sorted Module load time (ms)
wlan 22454 61
video codec 10150 47
drm 13014 40
IPA 8097 63
audio 4606 16
gpu 6527 20
Total time to load all these modules: 247
Time saved during boot for just these 6 modules: 390 ms
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
---
v1 -> v2:
- Provided more details in the commit text
- Pulled in Will's comments on the coding style
- Pulled in Ard's suggestion about skipping jumps with the same section
index (parts of Will's suggested code)
arch/arm64/kernel/module-plts.c | 46 ++++++++++++++++++++++++++++++---
1 file changed, 43 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index 65b08a74aec6..0ce3a28e3347 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
return ret;
}
+static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
+ Elf64_Word dstidx)
+{
+
+ Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
+
+ if (s->st_shndx == dstidx)
+ return false;
+
+ return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
+ ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
+}
+
+/* Group branch PLT relas at the front end of the array. */
+static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
+ int numrels, Elf64_Word dstidx)
+{
+ int i = 0, j = numrels - 1;
+
+ if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+ return 0;
+
+ while (i < j) {
+ if (branch_rela_needs_plt(syms, &rela[i], dstidx))
+ i++;
+ else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
+ swap(rela[i], rela[j]);
+ else
+ j--;
+ }
+
+ return i;
+}
+
int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
char *secstrings, struct module *mod)
{
@@ -290,7 +324,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
for (i = 0; i < ehdr->e_shnum; i++) {
Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
- int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+ int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
if (sechdrs[i].sh_type != SHT_RELA)
@@ -300,8 +334,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
if (!(dstsec->sh_flags & SHF_EXECINSTR))
continue;
- /* sort by type, symbol index and addend */
- sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+ /*
+ * sort branch relocations requiring a PLT by type, symbol index
+ * and addend
+ */
+ nents = partition_branch_plt_relas(syms, rels, numrels,
+ sechdrs[i].sh_info);
+ if (nents)
+ sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);
if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
core_plts += count_plts(syms, rels, numrels,
--
2.27.0.111.gc72c7da667-goog
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-06-23 1:18 [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting Saravana Kannan
@ 2020-06-23 16:12 ` Will Deacon
2020-07-02 11:24 ` Catalin Marinas
2020-07-02 15:29 ` Ard Biesheuvel
2 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2020-06-23 16:12 UTC (permalink / raw)
To: Saravana Kannan
Cc: Catalin Marinas, Ard Biesheuvel, kernel-team, linux-arm-kernel,
linux-kernel
On Mon, Jun 22, 2020 at 06:18:02PM -0700, Saravana Kannan wrote:
> When loading a module, module_frob_arch_sections() tries to figure out
> the number of PLTs that'll be needed to handle all the RELAs. While
> doing this, it tries to dedupe PLT allocations for multiple
> R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> R_AARCH64_JUMP26 relocations.
>
> To make checks for duplicates easier/faster, it sorts the relocation
> list by type, symbol and addend. That way, to check for a duplicate
> relocation, it just needs to compare with the previous entry.
>
> However, sorting the entire relocation array is unnecessary and
> expensive (O(n log n)) because there are a lot of other relocation types
> that don't need deduping or can't be deduped.
>
> So this commit partitions the array into entries that need deduping and
> those that don't. And then sorts just the part that needs deduping. And
> when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> if it's disabled.
>
> This gives significant reduction in module load time for modules with
> large number of relocations with no measurable impact on modules with a
> small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> enabled, these were the results for a few downstream modules:
>
> Module Size (MB)
> wlan 14
> video codec 3.8
> drm 1.8
> IPA 2.5
> audio 1.2
> gpu 1.8
>
> Without this patch:
> Module Number of entries sorted Module load time (ms)
> wlan 243739 283
> video codec 74029 138
> drm 53837 67
> IPA 42800 90
> audio 21326 27
> gpu 20967 32
>
> Total time to load all these module: 637 ms
>
> With this patch:
> Module Number of entries sorted Module load time (ms)
> wlan 22454 61
> video codec 10150 47
> drm 13014 40
> IPA 8097 63
> audio 4606 16
> gpu 6527 20
>
> Total time to load all these modules: 247
>
> Time saved during boot for just these 6 modules: 390 ms
>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
>
> v1 -> v2:
> - Provided more details in the commit text
> - Pulled in Will's comments on the coding style
> - Pulled in Ard's suggestion about skipping jumps with the same section
> index (parts of Will's suggested code)
>
> arch/arm64/kernel/module-plts.c | 46 ++++++++++++++++++++++++++++++---
> 1 file changed, 43 insertions(+), 3 deletions(-)
Nice, it looks like you were more-or-less able to use my suggestion
directly! Commit message looks much better to, so:
Acked-by: Will Deacon <will@kernel.org>
Catalin can pick this up when he starts queuing patches for 5.9.
Cheers,
Will
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-06-23 1:18 [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting Saravana Kannan
2020-06-23 16:12 ` Will Deacon
@ 2020-07-02 11:24 ` Catalin Marinas
2020-07-02 15:29 ` Ard Biesheuvel
2 siblings, 0 replies; 7+ messages in thread
From: Catalin Marinas @ 2020-07-02 11:24 UTC (permalink / raw)
To: Will Deacon, Saravana Kannan
Cc: Ard Biesheuvel, kernel-team, linux-arm-kernel, linux-kernel
On Mon, 22 Jun 2020 18:18:02 -0700, Saravana Kannan wrote:
> When loading a module, module_frob_arch_sections() tries to figure out
> the number of PLTs that'll be needed to handle all the RELAs. While
> doing this, it tries to dedupe PLT allocations for multiple
> R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> R_AARCH64_JUMP26 relocations.
>
> To make checks for duplicates easier/faster, it sorts the relocation
> list by type, symbol and addend. That way, to check for a duplicate
> relocation, it just needs to compare with the previous entry.
>
> [...]
Applied to arm64 (for-next/misc), thanks!
[1/1] arm64/module: Optimize module load time by optimizing PLT counting
https://git.kernel.org/arm64/c/d4e0340919fb
--
Catalin
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-06-23 1:18 [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting Saravana Kannan
2020-06-23 16:12 ` Will Deacon
2020-07-02 11:24 ` Catalin Marinas
@ 2020-07-02 15:29 ` Ard Biesheuvel
2020-07-04 0:47 ` Saravana Kannan
2 siblings, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2020-07-02 15:29 UTC (permalink / raw)
To: Saravana Kannan
Cc: Catalin Marinas, Will Deacon, Android Kernel Team,
linux-arm-kernel, Linux Kernel Mailing List
On Tue, 23 Jun 2020 at 03:27, Saravana Kannan <saravanak@google.com> wrote:
>
> When loading a module, module_frob_arch_sections() tries to figure out
> the number of PLTs that'll be needed to handle all the RELAs. While
> doing this, it tries to dedupe PLT allocations for multiple
> R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> R_AARCH64_JUMP26 relocations.
>
> To make checks for duplicates easier/faster, it sorts the relocation
> list by type, symbol and addend. That way, to check for a duplicate
> relocation, it just needs to compare with the previous entry.
>
> However, sorting the entire relocation array is unnecessary and
> expensive (O(n log n)) because there are a lot of other relocation types
> that don't need deduping or can't be deduped.
>
> So this commit partitions the array into entries that need deduping and
> those that don't. And then sorts just the part that needs deduping. And
> when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> if it's disabled.
>
> This gives significant reduction in module load time for modules with
> large number of relocations with no measurable impact on modules with a
> small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> enabled, these were the results for a few downstream modules:
>
> Module Size (MB)
> wlan 14
> video codec 3.8
> drm 1.8
> IPA 2.5
> audio 1.2
> gpu 1.8
>
> Without this patch:
> Module Number of entries sorted Module load time (ms)
> wlan 243739 283
> video codec 74029 138
> drm 53837 67
> IPA 42800 90
> audio 21326 27
> gpu 20967 32
>
> Total time to load all these module: 637 ms
>
> With this patch:
> Module Number of entries sorted Module load time (ms)
> wlan 22454 61
> video codec 10150 47
> drm 13014 40
> IPA 8097 63
> audio 4606 16
> gpu 6527 20
>
> Total time to load all these modules: 247
>
> Time saved during boot for just these 6 modules: 390 ms
>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[I am no longer at Linaro so please don't use my @linaro.org address]
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
>
> v1 -> v2:
> - Provided more details in the commit text
> - Pulled in Will's comments on the coding style
> - Pulled in Ard's suggestion about skipping jumps with the same section
> index (parts of Will's suggested code)
>
> arch/arm64/kernel/module-plts.c | 46 ++++++++++++++++++++++++++++++---
> 1 file changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> index 65b08a74aec6..0ce3a28e3347 100644
> --- a/arch/arm64/kernel/module-plts.c
> +++ b/arch/arm64/kernel/module-plts.c
> @@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> return ret;
> }
>
> +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> + Elf64_Word dstidx)
> +{
> +
> + Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> +
> + if (s->st_shndx == dstidx)
> + return false;
> +
> + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> + ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> +}
> +
> +/* Group branch PLT relas at the front end of the array. */
> +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> + int numrels, Elf64_Word dstidx)
> +{
> + int i = 0, j = numrels - 1;
> +
> + if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> + return 0;
> +
> + while (i < j) {
> + if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> + i++;
> + else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> + swap(rela[i], rela[j]);
Nit: would be slightly better to put
swap(rela[i++], rela[j]);
here so the next iteration of the loop will not call
branch_rela_needs_plt() on rela[i] redundantly. But the current code
is also correct.
> + else
> + j--;
> + }
> +
> + return i;
> +}
> +
> int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
> char *secstrings, struct module *mod)
> {
> @@ -290,7 +324,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
>
> for (i = 0; i < ehdr->e_shnum; i++) {
> Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
> - int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
> + int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
> Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
>
> if (sechdrs[i].sh_type != SHT_RELA)
> @@ -300,8 +334,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
> if (!(dstsec->sh_flags & SHF_EXECINSTR))
> continue;
>
> - /* sort by type, symbol index and addend */
> - sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
> + /*
> + * sort branch relocations requiring a PLT by type, symbol index
> + * and addend
> + */
> + nents = partition_branch_plt_relas(syms, rels, numrels,
> + sechdrs[i].sh_info);
> + if (nents)
> + sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);
>
> if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
> core_plts += count_plts(syms, rels, numrels,
> --
> 2.27.0.111.gc72c7da667-goog
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-07-02 15:29 ` Ard Biesheuvel
@ 2020-07-04 0:47 ` Saravana Kannan
2020-07-04 12:09 ` Will Deacon
0 siblings, 1 reply; 7+ messages in thread
From: Saravana Kannan @ 2020-07-04 0:47 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Catalin Marinas, Will Deacon, Android Kernel Team,
linux-arm-kernel, Linux Kernel Mailing List
On Thu, Jul 2, 2020 at 8:30 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 23 Jun 2020 at 03:27, Saravana Kannan <saravanak@google.com> wrote:
> >
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, these were the results for a few downstream modules:
> >
> > Module Size (MB)
> > wlan 14
> > video codec 3.8
> > drm 1.8
> > IPA 2.5
> > audio 1.2
> > gpu 1.8
> >
> > Without this patch:
> > Module Number of entries sorted Module load time (ms)
> > wlan 243739 283
> > video codec 74029 138
> > drm 53837 67
> > IPA 42800 90
> > audio 21326 27
> > gpu 20967 32
> >
> > Total time to load all these module: 637 ms
> >
> > With this patch:
> > Module Number of entries sorted Module load time (ms)
> > wlan 22454 61
> > video codec 10150 47
> > drm 13014 40
> > IPA 8097 63
> > audio 4606 16
> > gpu 6527 20
> >
> > Total time to load all these modules: 247
> >
> > Time saved during boot for just these 6 modules: 390 ms
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>
> [I am no longer at Linaro so please don't use my @linaro.org address]
Hmm... I'm pretty sure I got this using the get_maintainers script.
Maybe update the MAINTAINERS file if you haven't already (I didn't
check)?
But if I ever manually add in your email, I'll keep this in mind.
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >
> > v1 -> v2:
> > - Provided more details in the commit text
> > - Pulled in Will's comments on the coding style
> > - Pulled in Ard's suggestion about skipping jumps with the same section
> > index (parts of Will's suggested code)
> >
> > arch/arm64/kernel/module-plts.c | 46 ++++++++++++++++++++++++++++++---
> > 1 file changed, 43 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..0ce3a28e3347 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > return ret;
> > }
> >
> > +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> > + Elf64_Word dstidx)
> > +{
> > +
> > + Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> > +
> > + if (s->st_shndx == dstidx)
> > + return false;
> > +
> > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> > + ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
> > +
> > +/* Group branch PLT relas at the front end of the array. */
> > +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> > + int numrels, Elf64_Word dstidx)
> > +{
> > + int i = 0, j = numrels - 1;
> > +
> > + if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> > + return 0;
> > +
> > + while (i < j) {
> > + if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> > + i++;
> > + else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> > + swap(rela[i], rela[j]);
>
> Nit: would be slightly better to put
>
> swap(rela[i++], rela[j]);
>
> here so the next iteration of the loop will not call
> branch_rela_needs_plt() on rela[i] redundantly. But the current code
> is also correct.
Oh yeah, I noticed that unnecessary repeat of branch_rela_needs_plt()
on rela[i] when j had to be decremented, but forgot to handle it after
I was done with all the testing.
But I did compare it to the code I had written in v1 that didn't have
this extra check for rela[i]. I couldn't find any measurable
difference in the module load time. Maybe 1ms for the worst case
module, but that could have been just run to run variation.
Anyway, maybe send this as another patch since Catalin has already
picked this mine?
Thanks,
Saravana
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-07-04 0:47 ` Saravana Kannan
@ 2020-07-04 12:09 ` Will Deacon
2020-07-04 13:47 ` Ard Biesheuvel
0 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2020-07-04 12:09 UTC (permalink / raw)
To: Saravana Kannan
Cc: Ard Biesheuvel, Catalin Marinas, Android Kernel Team,
linux-arm-kernel, Linux Kernel Mailing List
On Fri, Jul 03, 2020 at 05:47:24PM -0700, Saravana Kannan wrote:
> On Thu, Jul 2, 2020 at 8:30 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Tue, 23 Jun 2020 at 03:27, Saravana Kannan <saravanak@google.com> wrote:
> > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > > index 65b08a74aec6..0ce3a28e3347 100644
> > > --- a/arch/arm64/kernel/module-plts.c
> > > +++ b/arch/arm64/kernel/module-plts.c
> > > @@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > > return ret;
> > > }
> > >
> > > +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> > > + Elf64_Word dstidx)
> > > +{
> > > +
> > > + Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> > > +
> > > + if (s->st_shndx == dstidx)
> > > + return false;
> > > +
> > > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> > > + ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > > +}
> > > +
> > > +/* Group branch PLT relas at the front end of the array. */
> > > +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> > > + int numrels, Elf64_Word dstidx)
> > > +{
> > > + int i = 0, j = numrels - 1;
> > > +
> > > + if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> > > + return 0;
> > > +
> > > + while (i < j) {
> > > + if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> > > + i++;
> > > + else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> > > + swap(rela[i], rela[j]);
> >
> > Nit: would be slightly better to put
> >
> > swap(rela[i++], rela[j]);
> >
> > here so the next iteration of the loop will not call
> > branch_rela_needs_plt() on rela[i] redundantly. But the current code
> > is also correct.
>
> Oh yeah, I noticed that unnecessary repeat of branch_rela_needs_plt()
> on rela[i] when j had to be decremented, but forgot to handle it after
> I was done with all the testing.
Yeah, I guess you can decrement j as well, but I just think it makes the
logic harder to read and more error-prone if we change it later.
> But I did compare it to the code I had written in v1 that didn't have
> this extra check for rela[i]. I couldn't find any measurable
> difference in the module load time. Maybe 1ms for the worst case
> module, but that could have been just run to run variation.
>
> Anyway, maybe send this as another patch since Catalin has already
> picked this mine?
I think the queued code is fine, so we don't need to micro-optimise it.
Will
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting
2020-07-04 12:09 ` Will Deacon
@ 2020-07-04 13:47 ` Ard Biesheuvel
0 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2020-07-04 13:47 UTC (permalink / raw)
To: Will Deacon
Cc: Saravana Kannan, Catalin Marinas, Android Kernel Team,
linux-arm-kernel, Linux Kernel Mailing List
On Sat, 4 Jul 2020 at 14:09, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jul 03, 2020 at 05:47:24PM -0700, Saravana Kannan wrote:
> > On Thu, Jul 2, 2020 at 8:30 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > On Tue, 23 Jun 2020 at 03:27, Saravana Kannan <saravanak@google.com> wrote:
> > > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > > > index 65b08a74aec6..0ce3a28e3347 100644
> > > > --- a/arch/arm64/kernel/module-plts.c
> > > > +++ b/arch/arm64/kernel/module-plts.c
> > > > @@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > > > return ret;
> > > > }
> > > >
> > > > +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> > > > + Elf64_Word dstidx)
> > > > +{
> > > > +
> > > > + Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> > > > +
> > > > + if (s->st_shndx == dstidx)
> > > > + return false;
> > > > +
> > > > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> > > > + ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > > > +}
> > > > +
> > > > +/* Group branch PLT relas at the front end of the array. */
> > > > +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> > > > + int numrels, Elf64_Word dstidx)
> > > > +{
> > > > + int i = 0, j = numrels - 1;
> > > > +
> > > > + if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> > > > + return 0;
> > > > +
> > > > + while (i < j) {
> > > > + if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> > > > + i++;
> > > > + else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> > > > + swap(rela[i], rela[j]);
> > >
> > > Nit: would be slightly better to put
> > >
> > > swap(rela[i++], rela[j]);
> > >
> > > here so the next iteration of the loop will not call
> > > branch_rela_needs_plt() on rela[i] redundantly. But the current code
> > > is also correct.
> >
> > Oh yeah, I noticed that unnecessary repeat of branch_rela_needs_plt()
> > on rela[i] when j had to be decremented, but forgot to handle it after
> > I was done with all the testing.
>
> Yeah, I guess you can decrement j as well, but I just think it makes the
> logic harder to read and more error-prone if we change it later.
>
Indeed,
swap(rela[i++], rela[j--]);
looks even bettter!
But you're right, it's not a big deal.
> > But I did compare it to the code I had written in v1 that didn't have
> > this extra check for rela[i]. I couldn't find any measurable
> > difference in the module load time. Maybe 1ms for the worst case
> > module, but that could have been just run to run variation.
> >
> > Anyway, maybe send this as another patch since Catalin has already
> > picked this mine?
>
> I think the queued code is fine, so we don't need to micro-optimise it.
>
> Will
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-07-04 13:48 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-23 1:18 [PATCH v2] arm64/module: Optimize module load time by optimizing PLT counting Saravana Kannan
2020-06-23 16:12 ` Will Deacon
2020-07-02 11:24 ` Catalin Marinas
2020-07-02 15:29 ` Ard Biesheuvel
2020-07-04 0:47 ` Saravana Kannan
2020-07-04 12:09 ` Will Deacon
2020-07-04 13:47 ` Ard Biesheuvel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).