All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel
@ 2022-04-27 17:12 Ard Biesheuvel
  2022-04-27 17:12 ` [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints Ard Biesheuvel
  2022-04-27 17:12 ` [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels Ard Biesheuvel
  0 siblings, 2 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-27 17:12 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: clang-built-linux, will, catalin.marinas, keescook, mark.rutland,
	nathan, Ard Biesheuvel, Sami Tolvanen, Nick Desaulniers

Building the KASLR kernel without -fpie but linking it with -pie works
in practice, but it is not something that is explicitly supported by the
toolchains - it happens to work because the default 'small' code model
used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
symbol references.

Code generation with -fpie used to result in unnecessary overhead, as
all references to symbols with external visibility use emitted via
entries in the GOT, resulting in an additional load from memory for each
global variable access.

However, we can now manage this my using 'hidden' visibility (which is
already used in places such as the decompressor or the EFI stub), so we
can enable -fpie code generation without the overhead.

This series is RFC given that, beyond switching to a better supported
combination of compiler and linker options, I am not aware of any
advantages or disadvantages of making this change.

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>

Ard Biesheuvel (2):
  arm64: jump_label: use more precise asm constraints
  arm64: kernel: switch to PIE code generation for relocatable kernels

 arch/arm64/Makefile                 | 4 ++++
 arch/arm64/include/asm/jump_label.h | 8 ++++----
 arch/arm64/kernel/vmlinux.lds.S     | 9 ++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-27 17:12 [RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel Ard Biesheuvel
@ 2022-04-27 17:12 ` Ard Biesheuvel
  2022-04-27 18:58   ` Nick Desaulniers
  2022-04-28  9:51   ` Mark Rutland
  2022-04-27 17:12 ` [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels Ard Biesheuvel
  1 sibling, 2 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-27 17:12 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: clang-built-linux, will, catalin.marinas, keescook, mark.rutland,
	nathan, Ard Biesheuvel, Sami Tolvanen, Nick Desaulniers

In order to set bit #0 of the struct static_key pointer in the the jump
label struct, we currently cast the pointer to char[], and take the
address of either the 0th or 1st array member, depending on the value of
'branch'. This works but creates problems with -fpie code generation:
GCC complains about the constraint being unsatisfiable, and Clang
miscompiles the code in a way that causes stability issues (immediate
panic on 'attempt to kill init')

So instead, pass the struct static_key reference and the 'branch'
immediate individually, in a way that satisfies both GCC and Clang (GCC
wants the 'S' constraint, whereas Clang wants the 'i' constraint for
argument %0)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/jump_label.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
index cea441b6aa5d..f741bbacf175 100644
--- a/arch/arm64/include/asm/jump_label.h
+++ b/arch/arm64/include/asm/jump_label.h
@@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
 		 "	.pushsection	__jump_table, \"aw\"	\n\t"
 		 "	.align		3			\n\t"
 		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%c0 - . + %1		\n\t"
 		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  "Si"(key), "i"(branch) :  : l_yes);
 
 	return false;
 l_yes:
@@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
 		 "	.pushsection	__jump_table, \"aw\"	\n\t"
 		 "	.align		3			\n\t"
 		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%c0 - . + %1		\n\t"
 		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  "Si"(key), "i"(branch) :  : l_yes);
 
 	return false;
 l_yes:
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-27 17:12 [RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel Ard Biesheuvel
  2022-04-27 17:12 ` [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints Ard Biesheuvel
@ 2022-04-27 17:12 ` Ard Biesheuvel
  2022-04-28  2:40   ` Fangrui Song
  1 sibling, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-27 17:12 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: clang-built-linux, will, catalin.marinas, keescook, mark.rutland,
	nathan, Ard Biesheuvel, Sami Tolvanen, Nick Desaulniers

We currently use ordinary, position dependent code generation for the
core kernel, which happens to default to the 'small' code model on both
GCC and Clang. This is the code model that relies on ADRP/ADD or
ADRP/LDR pairs for symbol references, which are PC-relative with a range
of -/+ 4 GiB, and therefore happen to be position independent in
practice.

This means that the fact that we can link the relocatable KASLR kernel
using the -pie linker flag (which generates the runtime relocations and
inserts them into the binary) is somewhat of a coincidence, and not
something which is explicitly supported by the toolchains.

The reason we have not used -fpie for code generation so far (which is
the compiler flag that should be used to generate code that is to be
linked with -pie) is that by default, it generates code based on
assumptions that only hold for shared libraries and PIE executables,
i.e., that gathering all relocatable quantities into a Global Offset
Table (GOT) is desirable because it reduces the CoW footprint, and
because it permits ELF symbol preemption (which lets an executable
override symbols defined in a shared library, in a way that forces the
shared library to update all of its internal references as well).
Ironically, this means we end up with many more absolute references that
all need to be fixed up at boot.

Fortunately, we can convince the compiler to handle this in a way that
is a bit more suitable for freestanding binaries such as the kernel, by
setting the 'hidden' visibility #pragma, which informs the compiler that
symbol preemption or CoW footprint are of no concern to us, and so
PC-relative references that are resolved at link time are perfectly
fine.

So let's enable this #pragma and build with -fpie when building a
relocatable kernel. This also means that all constant data items that
carry statically initialized pointer variables are now emitted into the
.data.rel.ro* sections, so move these into .rodata where they belong.

Code size impact (GCC):

Before:

      text       data        bss      total filename
  16712396   18659064     534556   35906016 vmlinux

After:

      text       data        bss      total filename
  16804400   18612876     534556   35951832 vmlinux

Code size impact (Clang):

Before:

      text       data        bss      total filename
  17194584   13335060     535268   31064912 vmlinux

After:

      text       data        bss      total filename
  17194536   13310032     535268   31039836 vmlinux

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/Makefile             | 4 ++++
 arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 2f1de88651e6..94b6c51f5de6 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
 # with the relocation offsets always being zero.
 LDFLAGS_vmlinux		+= -shared -Bsymbolic -z notext \
 			$(call ld-option, --no-apply-dynamic-relocs)
+
+# Generate position independent code without relying on a Global Offset Table
+KBUILD_CFLAGS_KERNEL   += -fpie -include $(srctree)/include/linux/hidden.h
+
 endif
 
 ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index edaf0faf766f..b1e071ac1acf 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -174,8 +174,6 @@ SECTIONS
 			KEXEC_TEXT
 			TRAMP_TEXT
 			*(.gnu.warning)
-		. = ALIGN(16);
-		*(.got)			/* Global offset table		*/
 	}
 
 	/*
@@ -192,6 +190,8 @@ SECTIONS
 	/* everything from this point to __init_begin will be marked RO NX */
 	RO_DATA(PAGE_SIZE)
 
+	.data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
+
 	HYPERVISOR_DATA_SECTIONS
 
 	idmap_pg_dir = .;
@@ -273,6 +273,8 @@ SECTIONS
 	_sdata = .;
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
 
+	.data.rel : ALIGN(8) { *(.data.rel*) }
+
 	/*
 	 * Data written with the MMU off but read with the MMU on requires
 	 * cache lines to be invalidated, discarding up to a Cache Writeback
@@ -320,9 +322,6 @@ SECTIONS
 		*(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
 	}
 	ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
-
-	.data.rel.ro : { *(.data.rel.ro) }
-	ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
 }
 
 #include "image-vars.h"
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-27 17:12 ` [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints Ard Biesheuvel
@ 2022-04-27 18:58   ` Nick Desaulniers
  2022-04-27 21:50     ` Ard Biesheuvel
  2022-04-28  9:51   ` Mark Rutland
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2022-04-27 18:58 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, clang-built-linux, will, catalin.marinas,
	keescook, mark.rutland, nathan, Sami Tolvanen

On Wed, Apr 27, 2022 at 10:13 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> In order to set bit #0 of the struct static_key pointer in the the jump
> label struct, we currently cast the pointer to char[], and take the
> address of either the 0th or 1st array member, depending on the value of
> 'branch'. This works but creates problems with -fpie code generation:
> GCC complains about the constraint being unsatisfiable, and Clang
> miscompiles the code in a way that causes stability issues (immediate
> panic on 'attempt to kill init')

Any more info on the "miscompile?" Perhaps worth an upstream bug report?

>
> So instead, pass the struct static_key reference and the 'branch'
> immediate individually, in a way that satisfies both GCC and Clang (GCC
> wants the 'S' constraint, whereas Clang wants the 'i' constraint for
> argument %0)

Anything we can do to improve Clang's handling of S constraints?
https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
>> An absolute symbolic address or a label reference

>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/jump_label.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> index cea441b6aa5d..f741bbacf175 100644
> --- a/arch/arm64/include/asm/jump_label.h
> +++ b/arch/arm64/include/asm/jump_label.h
> @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
>                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>                  "      .align          3                       \n\t"
>                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> -                "      .quad           %c0 - .                 \n\t"
> +                "      .quad           %c0 - . + %1            \n\t"

If %1 is "i" constrained, then I think we can use the %c output
template for it as well?
https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html

Is the expression clearer if we keep the `- .` at the end of each expression?

>                  "      .popsection                             \n\t"
> -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +                :  :  "Si"(key), "i"(branch) :  : l_yes);
>
>         return false;
>  l_yes:
> @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
>                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>                  "      .align          3                       \n\t"
>                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> -                "      .quad           %c0 - .                 \n\t"
> +                "      .quad           %c0 - . + %1            \n\t"
>                  "      .popsection                             \n\t"
> -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +                :  :  "Si"(key), "i"(branch) :  : l_yes);
>
>         return false;
>  l_yes:
> --
> 2.30.2
>


-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-27 18:58   ` Nick Desaulniers
@ 2022-04-27 21:50     ` Ard Biesheuvel
  2022-04-28  9:35       ` Ard Biesheuvel
  0 siblings, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-27 21:50 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Mark Rutland, Nathan Chancellor, Sami Tolvanen

Hi Nick,

On Wed, 27 Apr 2022 at 20:59, Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 10:13 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > In order to set bit #0 of the struct static_key pointer in the the jump
> > label struct, we currently cast the pointer to char[], and take the
> > address of either the 0th or 1st array member, depending on the value of
> > 'branch'. This works but creates problems with -fpie code generation:
> > GCC complains about the constraint being unsatisfiable, and Clang
> > miscompiles the code in a way that causes stability issues (immediate
> > panic on 'attempt to kill init')
>
> Any more info on the "miscompile?" Perhaps worth an upstream bug report?
>

I made very little progress trying to narrow it down: a segfault in
user space caused by who knows what. I'd file a bug if I knew how to
reproduce it.

> >
> > So instead, pass the struct static_key reference and the 'branch'
> > immediate individually, in a way that satisfies both GCC and Clang (GCC
> > wants the 'S' constraint, whereas Clang wants the 'i' constraint for
> > argument %0)
>
> Anything we can do to improve Clang's handling of S constraints?
> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
> >> An absolute symbolic address or a label reference
>

"S" seems more appropriate than "i" for a symbol reference, and GCC
rejects "i" when using -fpie. But the actual literal being emitted is
a relative reference, not an absolute one. This likely needs more
discussion, but using "i" in the way we do now is definitely dodgy.

> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > index cea441b6aa5d..f741bbacf175 100644
> > --- a/arch/arm64/include/asm/jump_label.h
> > +++ b/arch/arm64/include/asm/jump_label.h
> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                  "      .align          3                       \n\t"
> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -                "      .quad           %c0 - .                 \n\t"
> > +                "      .quad           %c0 - . + %1            \n\t"
>
> If %1 is "i" constrained, then I think we can use the %c output
> template for it as well?
> https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html
>

Is %c what prevents the leading # from being emitted? I'm not sure
whether that matters here, as AArch64 asm does not require it (and the
code builds fine with either compiler). But if it reflects the use
more precisely, I agree we should add it.

> Is the expression clearer if we keep the `- .` at the end of each expression?
>

I don't think so. The add sets the enabled bit, so I'd even be
inclined to write this as (%c0 - .) + %c1 to emphasize that this is a
relative reference with bit #0 set separately.

> >                  "      .popsection                             \n\t"
> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +                :  :  "Si"(key), "i"(branch) :  : l_yes);
> >
> >         return false;
> >  l_yes:
> > @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                  "      .align          3                       \n\t"
> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -                "      .quad           %c0 - .                 \n\t"
> > +                "      .quad           %c0 - . + %1            \n\t"
> >                  "      .popsection                             \n\t"
> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +                :  :  "Si"(key), "i"(branch) :  : l_yes);
> >
> >         return false;
> >  l_yes:
> > --
> > 2.30.2
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-27 17:12 ` [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels Ard Biesheuvel
@ 2022-04-28  2:40   ` Fangrui Song
  2022-04-28  6:23     ` Ard Biesheuvel
  0 siblings, 1 reply; 17+ messages in thread
From: Fangrui Song @ 2022-04-28  2:40 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, clang-built-linux, will, catalin.marinas,
	keescook, mark.rutland, nathan, Sami Tolvanen, Nick Desaulniers

On 2022-04-27, Ard Biesheuvel wrote:
>We currently use ordinary, position dependent code generation for the
>core kernel, which happens to default to the 'small' code model on both
>GCC and Clang. This is the code model that relies on ADRP/ADD or
>ADRP/LDR pairs for symbol references, which are PC-relative with a range
>of -/+ 4 GiB, and therefore happen to be position independent in
>practice.
>
>This means that the fact that we can link the relocatable KASLR kernel
>using the -pie linker flag (which generates the runtime relocations and
>inserts them into the binary) is somewhat of a coincidence, and not
>something which is explicitly supported by the toolchains.

Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
conincidence, not guaranteed by the toolchain.

-shared needs -fpic object files. -shared -Bsymbolic is very similar to
-pie and therefore works with -fpie object files, but the usage is not
recommended from the toolchain perspective.

>The reason we have not used -fpie for code generation so far (which is
>the compiler flag that should be used to generate code that is to be
>linked with -pie) is that by default, it generates code based on
>assumptions that only hold for shared libraries and PIE executables,
>i.e., that gathering all relocatable quantities into a Global Offset
>Table (GOT) is desirable because it reduces the CoW footprint, and
>because it permits ELF symbol preemption (which lets an executable
>override symbols defined in a shared library, in a way that forces the
>shared library to update all of its internal references as well).
>Ironically, this means we end up with many more absolute references that
>all need to be fixed up at boot.

This is not about symbol preemption (when the executable and a shared
objectdefine the same symbol, which one wins). An executable using a GOT
which will be resolved to a shared object => this is regular relocation
resolving and there is no preemption.

It is that the compiler prefers code generation which can avoid text
relocations / copy relocations / canonical PLT entries
(https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).

>Fortunately, we can convince the compiler to handle this in a way that
>is a bit more suitable for freestanding binaries such as the kernel, by
>setting the 'hidden' visibility #pragma, which informs the compiler that
>symbol preemption or CoW footprint are of no concern to us, and so
>PC-relative references that are resolved at link time are perfectly
>fine.

Agree

>So let's enable this #pragma and build with -fpie when building a
>relocatable kernel. This also means that all constant data items that
>carry statically initialized pointer variables are now emitted into the
>.data.rel.ro* sections, so move these into .rodata where they belong.

LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata

>Code size impact (GCC):
>
>Before:
>
>      text       data        bss      total filename
>  16712396   18659064     534556   35906016 vmlinux
>
>After:
>
>      text       data        bss      total filename
>  16804400   18612876     534556   35951832 vmlinux
>
>Code size impact (Clang):
>
>Before:
>
>      text       data        bss      total filename
>  17194584   13335060     535268   31064912 vmlinux
>
>After:
>
>      text       data        bss      total filename
>  17194536   13310032     535268   31039836 vmlinux
>
>Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>---
> arch/arm64/Makefile             | 4 ++++
> arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
>index 2f1de88651e6..94b6c51f5de6 100644
>--- a/arch/arm64/Makefile
>+++ b/arch/arm64/Makefile
>@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
> # with the relocation offsets always being zero.
> LDFLAGS_vmlinux		+= -shared -Bsymbolic -z notext \
> 			$(call ld-option, --no-apply-dynamic-relocs)
>+
>+# Generate position independent code without relying on a Global Offset Table
>+KBUILD_CFLAGS_KERNEL   += -fpie -include $(srctree)/include/linux/hidden.h
>+
> endif
>
> ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
>diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>index edaf0faf766f..b1e071ac1acf 100644
>--- a/arch/arm64/kernel/vmlinux.lds.S
>+++ b/arch/arm64/kernel/vmlinux.lds.S
>@@ -174,8 +174,6 @@ SECTIONS
> 			KEXEC_TEXT
> 			TRAMP_TEXT
> 			*(.gnu.warning)
>-		. = ALIGN(16);
>-		*(.got)			/* Global offset table		*/
> 	}
>
> 	/*
>@@ -192,6 +190,8 @@ SECTIONS
> 	/* everything from this point to __init_begin will be marked RO NX */
> 	RO_DATA(PAGE_SIZE)
>
>+	.data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
>+
> 	HYPERVISOR_DATA_SECTIONS
>
> 	idmap_pg_dir = .;
>@@ -273,6 +273,8 @@ SECTIONS
> 	_sdata = .;
> 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
>
>+	.data.rel : ALIGN(8) { *(.data.rel*) }
>+
> 	/*
> 	 * Data written with the MMU off but read with the MMU on requires
> 	 * cache lines to be invalidated, discarding up to a Cache Writeback
>@@ -320,9 +322,6 @@ SECTIONS
> 		*(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
> 	}
> 	ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
>-
>-	.data.rel.ro : { *(.data.rel.ro) }
>-	ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
> }
>
> #include "image-vars.h"
>-- 
>2.30.2
>
>-- 
>You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
>To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20220427171241.2426592-3-ardb%40kernel.org.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28  2:40   ` Fangrui Song
@ 2022-04-28  6:23     ` Ard Biesheuvel
  2022-04-28  6:57       ` Fangrui Song
  0 siblings, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-28  6:23 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Mark Rutland, Nathan Chancellor, Sami Tolvanen,
	Nick Desaulniers

On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-27, Ard Biesheuvel wrote:
> >We currently use ordinary, position dependent code generation for the
> >core kernel, which happens to default to the 'small' code model on both
> >GCC and Clang. This is the code model that relies on ADRP/ADD or
> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
> >of -/+ 4 GiB, and therefore happen to be position independent in
> >practice.
> >
> >This means that the fact that we can link the relocatable KASLR kernel
> >using the -pie linker flag (which generates the runtime relocations and
> >inserts them into the binary) is somewhat of a coincidence, and not
> >something which is explicitly supported by the toolchains.
>
> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
> conincidence, not guaranteed by the toolchain.
>
> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
> -pie and therefore works with -fpie object files, but the usage is not
> recommended from the toolchain perspective.
>

So are you suggesting we should also switch from -shared to -Bsymbol
to -pie if we can? I don't remember the details, but IIRC ld.bfd
didn't set the ELF binary type correctly, but perhaps this has now
been fixed.

> >The reason we have not used -fpie for code generation so far (which is
> >the compiler flag that should be used to generate code that is to be
> >linked with -pie) is that by default, it generates code based on
> >assumptions that only hold for shared libraries and PIE executables,
> >i.e., that gathering all relocatable quantities into a Global Offset
> >Table (GOT) is desirable because it reduces the CoW footprint, and
> >because it permits ELF symbol preemption (which lets an executable
> >override symbols defined in a shared library, in a way that forces the
> >shared library to update all of its internal references as well).
> >Ironically, this means we end up with many more absolute references that
> >all need to be fixed up at boot.
>
> This is not about symbol preemption (when the executable and a shared
> objectdefine the same symbol, which one wins). An executable using a GOT
> which will be resolved to a shared object => this is regular relocation
> resolving and there is no preemption.
>
> It is that the compiler prefers code generation which can avoid text
> relocations / copy relocations / canonical PLT entries
> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
>

Fair enough. So the compiler cannot generate relative references to
undefined external symbols since it doesn't know at codegen time
whether the symbol reference will be satisfied by the executable
itself or by a shared library, and in the latter case, the relative
distance to the symbol is not known at build time, and so a runtime
relocation is required. But how about references to symbols with
external visibility that are defined in the same compilation unit? I
don't quite understand why those references need to go via the GOT as
well.

> >Fortunately, we can convince the compiler to handle this in a way that
> >is a bit more suitable for freestanding binaries such as the kernel, by
> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >symbol preemption or CoW footprint are of no concern to us, and so
> >PC-relative references that are resolved at link time are perfectly
> >fine.
>
> Agree
>

The only unfortunate thing is that -fvisibility=hidden does not give
us the behavior we want, and we are forced to use the #pragma instead.

> >So let's enable this #pragma and build with -fpie when building a
> >relocatable kernel. This also means that all constant data items that
> >carry statically initialized pointer variables are now emitted into the
> >.data.rel.ro* sections, so move these into .rodata where they belong.
>
> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
>

I am referring to the .rodata pseudo-segment that we have in the
kernel, which runs from _etext to __inittext_begin.

> >Code size impact (GCC):
> >
> >Before:
> >
> >      text       data        bss      total filename
> >  16712396   18659064     534556   35906016 vmlinux
> >
> >After:
> >
> >      text       data        bss      total filename
> >  16804400   18612876     534556   35951832 vmlinux
> >
> >Code size impact (Clang):
> >
> >Before:
> >
> >      text       data        bss      total filename
> >  17194584   13335060     535268   31064912 vmlinux
> >
> >After:
> >
> >      text       data        bss      total filename
> >  17194536   13310032     535268   31039836 vmlinux
> >
> >Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >---
> > arch/arm64/Makefile             | 4 ++++
> > arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
> > 2 files changed, 8 insertions(+), 5 deletions(-)
> >
> >diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
> >index 2f1de88651e6..94b6c51f5de6 100644
> >--- a/arch/arm64/Makefile
> >+++ b/arch/arm64/Makefile
> >@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
> > # with the relocation offsets always being zero.
> > LDFLAGS_vmlinux               += -shared -Bsymbolic -z notext \
> >                       $(call ld-option, --no-apply-dynamic-relocs)
> >+
> >+# Generate position independent code without relying on a Global Offset Table
> >+KBUILD_CFLAGS_KERNEL   += -fpie -include $(srctree)/include/linux/hidden.h
> >+
> > endif
> >
> > ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
> >diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> >index edaf0faf766f..b1e071ac1acf 100644
> >--- a/arch/arm64/kernel/vmlinux.lds.S
> >+++ b/arch/arm64/kernel/vmlinux.lds.S
> >@@ -174,8 +174,6 @@ SECTIONS
> >                       KEXEC_TEXT
> >                       TRAMP_TEXT
> >                       *(.gnu.warning)
> >-              . = ALIGN(16);
> >-              *(.got)                 /* Global offset table          */
> >       }
> >
> >       /*
> >@@ -192,6 +190,8 @@ SECTIONS
> >       /* everything from this point to __init_begin will be marked RO NX */
> >       RO_DATA(PAGE_SIZE)
> >
> >+      .data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
> >+
> >       HYPERVISOR_DATA_SECTIONS
> >
> >       idmap_pg_dir = .;
> >@@ -273,6 +273,8 @@ SECTIONS
> >       _sdata = .;
> >       RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
> >
> >+      .data.rel : ALIGN(8) { *(.data.rel*) }
> >+
> >       /*
> >        * Data written with the MMU off but read with the MMU on requires
> >        * cache lines to be invalidated, discarding up to a Cache Writeback
> >@@ -320,9 +322,6 @@ SECTIONS
> >               *(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
> >       }
> >       ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
> >-
> >-      .data.rel.ro : { *(.data.rel.ro) }
> >-      ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
> > }
> >
> > #include "image-vars.h"
> >--
> >2.30.2
> >
> >--
> >You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> >To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
> >To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20220427171241.2426592-3-ardb%40kernel.org.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28  6:23     ` Ard Biesheuvel
@ 2022-04-28  6:57       ` Fangrui Song
  2022-04-28 16:03         ` Ard Biesheuvel
  2022-04-28 18:53         ` Nick Desaulniers
  0 siblings, 2 replies; 17+ messages in thread
From: Fangrui Song @ 2022-04-28  6:57 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Mark Rutland, Nathan Chancellor, Sami Tolvanen,
	Nick Desaulniers

On 2022-04-28, Ard Biesheuvel wrote:
>On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-04-27, Ard Biesheuvel wrote:
>> >We currently use ordinary, position dependent code generation for the
>> >core kernel, which happens to default to the 'small' code model on both
>> >GCC and Clang. This is the code model that relies on ADRP/ADD or
>> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
>> >of -/+ 4 GiB, and therefore happen to be position independent in
>> >practice.
>> >
>> >This means that the fact that we can link the relocatable KASLR kernel
>> >using the -pie linker flag (which generates the runtime relocations and
>> >inserts them into the binary) is somewhat of a coincidence, and not
>> >something which is explicitly supported by the toolchains.
>>
>> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
>> conincidence, not guaranteed by the toolchain.
>>
>> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
>> -pie and therefore works with -fpie object files, but the usage is not
>> recommended from the toolchain perspective.
>>
>
>So are you suggesting we should also switch from -shared to -Bsymbol
>to -pie if we can? I don't remember the details, but IIRC ld.bfd
>didn't set the ELF binary type correctly, but perhaps this has now
>been fixed.

Yes, -shared -Bsymbolic => -pie, but that can be done later.

For e_type: ET_DYN, I think unlikely there was a bug.
-pie was added by binutils in 2003: it's close to -shared but doesn't
allow its definitions to be preempted/interposed. Code earlier than that
might use -shared -Bsymbolic before -pie was available.

>> >The reason we have not used -fpie for code generation so far (which is
>> >the compiler flag that should be used to generate code that is to be
>> >linked with -pie) is that by default, it generates code based on
>> >assumptions that only hold for shared libraries and PIE executables,
>> >i.e., that gathering all relocatable quantities into a Global Offset
>> >Table (GOT) is desirable because it reduces the CoW footprint, and
>> >because it permits ELF symbol preemption (which lets an executable
>> >override symbols defined in a shared library, in a way that forces the
>> >shared library to update all of its internal references as well).
>> >Ironically, this means we end up with many more absolute references that
>> >all need to be fixed up at boot.
>>
>> This is not about symbol preemption (when the executable and a shared
>> objectdefine the same symbol, which one wins). An executable using a GOT
>> which will be resolved to a shared object => this is regular relocation
>> resolving and there is no preemption.
>>
>> It is that the compiler prefers code generation which can avoid text
>> relocations / copy relocations / canonical PLT entries
>> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
>>
>
>Fair enough. So the compiler cannot generate relative references to
>undefined external symbols since it doesn't know at codegen time
>whether the symbol reference will be satisfied by the executable
>itself or by a shared library, and in the latter case, the relative
>distance to the symbol is not known at build time, and so a runtime
>relocation is required.

Right.

>But how about references to symbols with
>external visibility that are defined in the same compilation unit? I
>don't quite understand why those references need to go via the GOT as
>well.

If you mean references to a non-local STV_DEFAULT (default visibility) definition =>

* -fpic: use GOT because the definition may be replaced by another at run time.
   Conservatively use a GOT-generating code sequence to allow potential symbol
   preemption(interposition). The linker may optimize out the GOT (x86-64
   GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
   TOC-relative optimization).
* -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.

-fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
definitions can be assumed non-interposable (b) (irrelevant to the
kernel) TLS can use more optimized models.

>> >Fortunately, we can convince the compiler to handle this in a way that
>> >is a bit more suitable for freestanding binaries such as the kernel, by
>> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> >symbol preemption or CoW footprint are of no concern to us, and so
>> >PC-relative references that are resolved at link time are perfectly
>> >fine.
>>
>> Agree
>>
>
>The only unfortunate thing is that -fvisibility=hidden does not give
>us the behavior we want, and we are forced to use the #pragma instead.

Right. For a very long time there had been no option controlling the
access mode for undefined symbols (-fvisibility= is for defined
symbols).

I added -fdirect-access-external-data to Clang which supports
many architectures (x86, aarch64, arm, riscv, ...).
GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).

The use of `#pragma GCC visibility push(hidden)` looks good as a
portable solution.

>> >So let's enable this #pragma and build with -fpie when building a
>> >relocatable kernel. This also means that all constant data items that
>> >carry statically initialized pointer variables are now emitted into the
>> >.data.rel.ro* sections, so move these into .rodata where they belong.
>>
>> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
>>
>
>I am referring to the .rodata pseudo-segment that we have in the
>kernel, which runs from _etext to __inittext_begin.

OK

>> >Code size impact (GCC):
>> >
>> >Before:
>> >
>> >      text       data        bss      total filename
>> >  16712396   18659064     534556   35906016 vmlinux
>> >
>> >After:
>> >
>> >      text       data        bss      total filename
>> >  16804400   18612876     534556   35951832 vmlinux
>> >
>> >Code size impact (Clang):
>> >
>> >Before:
>> >
>> >      text       data        bss      total filename
>> >  17194584   13335060     535268   31064912 vmlinux
>> >
>> >After:
>> >
>> >      text       data        bss      total filename
>> >  17194536   13310032     535268   31039836 vmlinux

The size difference for Clang matches my expecation:)
I am somewhat surprised that data is smaller, though...

I wonder how GCC makes code bloated so much...

>> >Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> >---
>> > arch/arm64/Makefile             | 4 ++++
>> > arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
>> > 2 files changed, 8 insertions(+), 5 deletions(-)
>> >
>> >diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
>> >index 2f1de88651e6..94b6c51f5de6 100644
>> >--- a/arch/arm64/Makefile
>> >+++ b/arch/arm64/Makefile
>> >@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
>> > # with the relocation offsets always being zero.
>> > LDFLAGS_vmlinux               += -shared -Bsymbolic -z notext \
>> >                       $(call ld-option, --no-apply-dynamic-relocs)
>> >+
>> >+# Generate position independent code without relying on a Global Offset Table
>> >+KBUILD_CFLAGS_KERNEL   += -fpie -include $(srctree)/include/linux/hidden.h
>> >+
>> > endif
>> >
>> > ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
>> >diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>> >index edaf0faf766f..b1e071ac1acf 100644
>> >--- a/arch/arm64/kernel/vmlinux.lds.S
>> >+++ b/arch/arm64/kernel/vmlinux.lds.S
>> >@@ -174,8 +174,6 @@ SECTIONS
>> >                       KEXEC_TEXT
>> >                       TRAMP_TEXT
>> >                       *(.gnu.warning)
>> >-              . = ALIGN(16);
>> >-              *(.got)                 /* Global offset table          */
>> >       }
>> >
>> >       /*
>> >@@ -192,6 +190,8 @@ SECTIONS
>> >       /* everything from this point to __init_begin will be marked RO NX */
>> >       RO_DATA(PAGE_SIZE)
>> >
>> >+      .data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
>> >+
>> >       HYPERVISOR_DATA_SECTIONS
>> >
>> >       idmap_pg_dir = .;
>> >@@ -273,6 +273,8 @@ SECTIONS
>> >       _sdata = .;
>> >       RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
>> >
>> >+      .data.rel : ALIGN(8) { *(.data.rel*) }
>> >+
>> >       /*
>> >        * Data written with the MMU off but read with the MMU on requires
>> >        * cache lines to be invalidated, discarding up to a Cache Writeback
>> >@@ -320,9 +322,6 @@ SECTIONS
>> >               *(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
>> >       }
>> >       ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
>> >-
>> >-      .data.rel.ro : { *(.data.rel.ro) }
>> >-      ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
>> > }
>> >
>> > #include "image-vars.h"
>> >--
>> >2.30.2
>> >
>> >--
>> >You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
>> >To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
>> >To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20220427171241.2426592-3-ardb%40kernel.org.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-27 21:50     ` Ard Biesheuvel
@ 2022-04-28  9:35       ` Ard Biesheuvel
  0 siblings, 0 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-28  9:35 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Mark Rutland, Nathan Chancellor, Sami Tolvanen

On Wed, 27 Apr 2022 at 23:50, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hi Nick,
>
> On Wed, 27 Apr 2022 at 20:59, Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 10:13 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > In order to set bit #0 of the struct static_key pointer in the the jump
> > > label struct, we currently cast the pointer to char[], and take the
> > > address of either the 0th or 1st array member, depending on the value of
> > > 'branch'. This works but creates problems with -fpie code generation:
> > > GCC complains about the constraint being unsatisfiable, and Clang
> > > miscompiles the code in a way that causes stability issues (immediate
> > > panic on 'attempt to kill init')
> >
> > Any more info on the "miscompile?" Perhaps worth an upstream bug report?
> >
>
> I made very little progress trying to narrow it down: a segfault in
> user space caused by who knows what. I'd file a bug if I knew how to
> reproduce it.
>

... and now, I cannot even reproduce it anymore, so I'll drop this
mention altogether.

> > >
> > > So instead, pass the struct static_key reference and the 'branch'
> > > immediate individually, in a way that satisfies both GCC and Clang (GCC
> > > wants the 'S' constraint, whereas Clang wants the 'i' constraint for
> > > argument %0)
> >
> > Anything we can do to improve Clang's handling of S constraints?
> > https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
> > >> An absolute symbolic address or a label reference
> >
>
> "S" seems more appropriate than "i" for a symbol reference, and GCC
> rejects "i" when using -fpie. But the actual literal being emitted is
> a relative reference, not an absolute one. This likely needs more
> discussion, but using "i" in the way we do now is definitely dodgy.
>

I have tried to create a reproducer for this issue, but the code
below, which does essentially the same thing as jump_label.h, works
fine with Clang and GCC with or without -fpie.

extern struct s {
  struct { } m;
} ss;

static inline __attribute__((always_inline)) int inner(struct s *s) {
  asm goto("adrp xzr, %c0" : : "S"(&s->m) : : l);

  return 0;
l:return 1;
}

int outer(void) {
  return inner(&ss);
}

So what's tricky here is that arch_static_branch() [like inner()
above] does not refer to the global directly, which is either struct
static_key_true or struct static_key_false, but to its 'key' member,
which is the first member of either former type. However, Clang does
not seem to care in case of this example, but when building the
kernel, it errors out with

/home/ard/linux/arch/arm64/include/asm/jump_label.h:22:3: error:
invalid operand for inline asm constraint 'S'

Another thing I hate about this code is that it needs optimizations
enabled, or it won't compile.

> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > > index cea441b6aa5d..f741bbacf175 100644
> > > --- a/arch/arm64/include/asm/jump_label.h
> > > +++ b/arch/arm64/include/asm/jump_label.h
> > > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
> > >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> > >                  "      .align          3                       \n\t"
> > >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > > -                "      .quad           %c0 - .                 \n\t"
> > > +                "      .quad           %c0 - . + %1            \n\t"
> >
> > If %1 is "i" constrained, then I think we can use the %c output
> > template for it as well?
> > https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html
> >
>
> Is %c what prevents the leading # from being emitted? I'm not sure
> whether that matters here, as AArch64 asm does not require it (and the
> code builds fine with either compiler). But if it reflects the use
> more precisely, I agree we should add it.
>
> > Is the expression clearer if we keep the `- .` at the end of each expression?
> >
>
> I don't think so. The add sets the enabled bit, so I'd even be
> inclined to write this as (%c0 - .) + %c1 to emphasize that this is a
> relative reference with bit #0 set separately.
>
> > >                  "      .popsection                             \n\t"
> > > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > > +                :  :  "Si"(key), "i"(branch) :  : l_yes);
> > >
> > >         return false;
> > >  l_yes:
> > > @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
> > >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> > >                  "      .align          3                       \n\t"
> > >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > > -                "      .quad           %c0 - .                 \n\t"
> > > +                "      .quad           %c0 - . + %1            \n\t"
> > >                  "      .popsection                             \n\t"
> > > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > > +                :  :  "Si"(key), "i"(branch) :  : l_yes);
> > >
> > >         return false;
> > >  l_yes:
> > > --
> > > 2.30.2
> > >
> >
> >
> > --
> > Thanks,
> > ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-27 17:12 ` [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints Ard Biesheuvel
  2022-04-27 18:58   ` Nick Desaulniers
@ 2022-04-28  9:51   ` Mark Rutland
  2022-04-28 16:05     ` Ard Biesheuvel
  1 sibling, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2022-04-28  9:51 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, clang-built-linux, will, catalin.marinas,
	keescook, nathan, Sami Tolvanen, Nick Desaulniers

Hi Ard,

On Wed, Apr 27, 2022 at 07:12:40PM +0200, Ard Biesheuvel wrote:
> In order to set bit #0 of the struct static_key pointer in the the jump
> label struct

I think you mean jump_entry::key here?

> , we currently cast the pointer to char[], and take the
> address of either the 0th or 1st array member, depending on the value of
> 'branch'. This works but creates problems with -fpie code generation:
> GCC complains about the constraint being unsatisfiable, and Clang
> miscompiles the code in a way that causes stability issues (immediate
> panic on 'attempt to kill init')

I couldn't reproduce that stability issue locally playing with Clang 12.0.0 and
14.0.0 (and just applying patch 2 of this series atop v5.18-rc1). I built
defconfig and booted that under a QEMU HVF VM on an M1 Mac.

FWIW, I used the binaries from llvm.org and built with:

  // magic script to add the toolchains to my PATH
  usellvm 12.0.0 make LLVM=1 ARCH=arm64 defconfig 
  usellvm 12.0.0 make LLVM=1 ARCH=arm64 -j50 Image

... and QEMU isn't providing entropy to the guest, so it's possible that:

* This only goes wrong when randomizing VAs (maybe we get a redundant
  relocation, and corrupt the key offset?).

* This is specific to the LLVM binaries you're using.

* This is down to a combination of LLVM + binutils, if you're not building with
  LLVM=1?

  I had a go with Clang 12.0.0 and the kernel.org crosstool GCC 11.1.0
  release's binutils. I made the constraint "Si" but left the indexing logic,
  and that still worked fine.

> So instead, pass the struct static_key reference and the 'branch'
> immediate individually, in a way that satisfies both GCC and Clang (GCC
> wants the 'S' constraint, whereas Clang wants the 'i' constraint for
> argument %0)
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/jump_label.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> index cea441b6aa5d..f741bbacf175 100644
> --- a/arch/arm64/include/asm/jump_label.h
> +++ b/arch/arm64/include/asm/jump_label.h
> @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
>  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
>  		 "	.align		3			\n\t"
>  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
> -		 "	.quad		%c0 - .			\n\t"
> +		 "	.quad		%c0 - . + %1		\n\t"
>  		 "	.popsection				\n\t"
> -		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +		 :  :  "Si"(key), "i"(branch) :  : l_yes);

Nice! I like that this clearly separate the "set bit 0" portion out, and IMO
that's much clearer than the array indexing.

Thanks,
Mark.

>  
>  	return false;
>  l_yes:
> @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
>  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
>  		 "	.align		3			\n\t"
>  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
> -		 "	.quad		%c0 - .			\n\t"
> +		 "	.quad		%c0 - . + %1		\n\t"
>  		 "	.popsection				\n\t"
> -		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +		 :  :  "Si"(key), "i"(branch) :  : l_yes);
>  
>  	return false;
>  l_yes:
> -- 
> 2.30.2
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28  6:57       ` Fangrui Song
@ 2022-04-28 16:03         ` Ard Biesheuvel
  2022-04-28 18:53         ` Nick Desaulniers
  1 sibling, 0 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-28 16:03 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Mark Rutland, Nathan Chancellor, Sami Tolvanen,
	Nick Desaulniers

On Thu, 28 Apr 2022 at 08:57, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >>
> >> On 2022-04-27, Ard Biesheuvel wrote:
> >> >We currently use ordinary, position dependent code generation for the
> >> >core kernel, which happens to default to the 'small' code model on both
> >> >GCC and Clang. This is the code model that relies on ADRP/ADD or
> >> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
> >> >of -/+ 4 GiB, and therefore happen to be position independent in
> >> >practice.
> >> >
> >> >This means that the fact that we can link the relocatable KASLR kernel
> >> >using the -pie linker flag (which generates the runtime relocations and
> >> >inserts them into the binary) is somewhat of a coincidence, and not
> >> >something which is explicitly supported by the toolchains.
> >>
> >> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
> >> conincidence, not guaranteed by the toolchain.
> >>
> >> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
> >> -pie and therefore works with -fpie object files, but the usage is not
> >> recommended from the toolchain perspective.
> >>
> >
> >So are you suggesting we should also switch from -shared to -Bsymbol
> >to -pie if we can? I don't remember the details, but IIRC ld.bfd
> >didn't set the ELF binary type correctly, but perhaps this has now
> >been fixed.
>
> Yes, -shared -Bsymbolic => -pie, but that can be done later.
>
> For e_type: ET_DYN, I think unlikely there was a bug.
> -pie was added by binutils in 2003: it's close to -shared but doesn't
> allow its definitions to be preempted/interposed. Code earlier than that
> might use -shared -Bsymbolic before -pie was available.
>
> >> >The reason we have not used -fpie for code generation so far (which is
> >> >the compiler flag that should be used to generate code that is to be
> >> >linked with -pie) is that by default, it generates code based on
> >> >assumptions that only hold for shared libraries and PIE executables,
> >> >i.e., that gathering all relocatable quantities into a Global Offset
> >> >Table (GOT) is desirable because it reduces the CoW footprint, and
> >> >because it permits ELF symbol preemption (which lets an executable
> >> >override symbols defined in a shared library, in a way that forces the
> >> >shared library to update all of its internal references as well).
> >> >Ironically, this means we end up with many more absolute references that
> >> >all need to be fixed up at boot.
> >>
> >> This is not about symbol preemption (when the executable and a shared
> >> objectdefine the same symbol, which one wins). An executable using a GOT
> >> which will be resolved to a shared object => this is regular relocation
> >> resolving and there is no preemption.
> >>
> >> It is that the compiler prefers code generation which can avoid text
> >> relocations / copy relocations / canonical PLT entries
> >> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
> >>
> >
> >Fair enough. So the compiler cannot generate relative references to
> >undefined external symbols since it doesn't know at codegen time
> >whether the symbol reference will be satisfied by the executable
> >itself or by a shared library, and in the latter case, the relative
> >distance to the symbol is not known at build time, and so a runtime
> >relocation is required.
>
> Right.
>
> >But how about references to symbols with
> >external visibility that are defined in the same compilation unit? I
> >don't quite understand why those references need to go via the GOT as
> >well.
>
> If you mean references to a non-local STV_DEFAULT (default visibility) definition =>
>
> * -fpic: use GOT because the definition may be replaced by another at run time.
>    Conservatively use a GOT-generating code sequence to allow potential symbol
>    preemption(interposition). The linker may optimize out the GOT (x86-64
>    GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
>    TOC-relative optimization).
> * -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.
>
> -fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
> definitions can be assumed non-interposable (b) (irrelevant to the
> kernel) TLS can use more optimized models.
>
> >> >Fortunately, we can convince the compiler to handle this in a way that
> >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> >PC-relative references that are resolved at link time are perfectly
> >> >fine.
> >>
> >> Agree
> >>
> >
> >The only unfortunate thing is that -fvisibility=hidden does not give
> >us the behavior we want, and we are forced to use the #pragma instead.
>
> Right. For a very long time there had been no option controlling the
> access mode for undefined symbols (-fvisibility= is for defined
> symbols).
>
> I added -fdirect-access-external-data to Clang which supports
> many architectures (x86, aarch64, arm, riscv, ...).
> GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>
> The use of `#pragma GCC visibility push(hidden)` looks good as a
> portable solution.
>

OK

> >> >So let's enable this #pragma and build with -fpie when building a
> >> >relocatable kernel. This also means that all constant data items that
> >> >carry statically initialized pointer variables are now emitted into the
> >> >.data.rel.ro* sections, so move these into .rodata where they belong.
> >>
> >> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
> >>
> >
> >I am referring to the .rodata pseudo-segment that we have in the
> >kernel, which runs from _etext to __inittext_begin.
>
> OK
>
> >> >Code size impact (GCC):
> >> >
> >> >Before:
> >> >
> >> >      text       data        bss      total filename
> >> >  16712396   18659064     534556   35906016 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text       data        bss      total filename
> >> >  16804400   18612876     534556   35951832 vmlinux
> >> >
> >> >Code size impact (Clang):
> >> >
> >> >Before:
> >> >
> >> >      text       data        bss      total filename
> >> >  17194584   13335060     535268   31064912 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text       data        bss      total filename
> >> >  17194536   13310032     535268   31039836 vmlinux
>
> The size difference for Clang matches my expecation:)
> I am somewhat surprised that data is smaller, though...
>
> I wonder how GCC makes code bloated so much...
>

This is caused by the use of RELA format instead of RELR.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints
  2022-04-28  9:51   ` Mark Rutland
@ 2022-04-28 16:05     ` Ard Biesheuvel
  0 siblings, 0 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-28 16:05 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Linux ARM, clang-built-linux, Will Deacon, Catalin Marinas,
	Kees Cook, Nathan Chancellor, Sami Tolvanen, Nick Desaulniers

On Thu, 28 Apr 2022 at 11:51, Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Ard,
>
> On Wed, Apr 27, 2022 at 07:12:40PM +0200, Ard Biesheuvel wrote:
> > In order to set bit #0 of the struct static_key pointer in the the jump
> > label struct
>
> I think you mean jump_entry::key here?
>

Yes, which points to a struct static_key - I'll clarify this in v2.

> > , we currently cast the pointer to char[], and take the
> > address of either the 0th or 1st array member, depending on the value of
> > 'branch'. This works but creates problems with -fpie code generation:
> > GCC complains about the constraint being unsatisfiable, and Clang
> > miscompiles the code in a way that causes stability issues (immediate
> > panic on 'attempt to kill init')
>
> I couldn't reproduce that stability issue locally playing with Clang 12.0.0 and
> 14.0.0 (and just applying patch 2 of this series atop v5.18-rc1). I built
> defconfig and booted that under a QEMU HVF VM on an M1 Mac.
>
> FWIW, I used the binaries from llvm.org and built with:
>
>   // magic script to add the toolchains to my PATH
>   usellvm 12.0.0 make LLVM=1 ARCH=arm64 defconfig
>   usellvm 12.0.0 make LLVM=1 ARCH=arm64 -j50 Image
>
> ... and QEMU isn't providing entropy to the guest, so it's possible that:
>
> * This only goes wrong when randomizing VAs (maybe we get a redundant
>   relocation, and corrupt the key offset?).
>
> * This is specific to the LLVM binaries you're using.
>
> * This is down to a combination of LLVM + binutils, if you're not building with
>   LLVM=1?
>
>   I had a go with Clang 12.0.0 and the kernel.org crosstool GCC 11.1.0
>   release's binutils. I made the constraint "Si" but left the indexing logic,
>   and that still worked fine.
>

Yeah, as I reported in another email, I failed to reproduce this, and
I experienced some other issues yesterday due to the fact that llvm-nm
and clang/lld on my system were out of sync. So I think this was a
false positive.

> > So instead, pass the struct static_key reference and the 'branch'
> > immediate individually, in a way that satisfies both GCC and Clang (GCC
> > wants the 'S' constraint, whereas Clang wants the 'i' constraint for
> > argument %0)
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > index cea441b6aa5d..f741bbacf175 100644
> > --- a/arch/arm64/include/asm/jump_label.h
> > +++ b/arch/arm64/include/asm/jump_label.h
> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key *key,
> >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                "      .align          3                       \n\t"
> >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -              "      .quad           %c0 - .                 \n\t"
> > +              "      .quad           %c0 - . + %1            \n\t"
> >                "      .popsection                             \n\t"
> > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +              :  :  "Si"(key), "i"(branch) :  : l_yes);
>
> Nice! I like that this clearly separate the "set bit 0" portion out, and IMO
> that's much clearer than the array indexing.
>
> Thanks,
> Mark.
>
> >
> >       return false;
> >  l_yes:
> > @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key,
> >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                "      .align          3                       \n\t"
> >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -              "      .quad           %c0 - .                 \n\t"
> > +              "      .quad           %c0 - . + %1            \n\t"
> >                "      .popsection                             \n\t"
> > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +              :  :  "Si"(key), "i"(branch) :  : l_yes);
> >
> >       return false;
> >  l_yes:
> > --
> > 2.30.2
> >

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28  6:57       ` Fangrui Song
  2022-04-28 16:03         ` Ard Biesheuvel
@ 2022-04-28 18:53         ` Nick Desaulniers
  2022-04-28 19:36           ` Ard Biesheuvel
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2022-04-28 18:53 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Ard Biesheuvel, Linux ARM, clang-built-linux, Will Deacon,
	Catalin Marinas, Kees Cook, Mark Rutland, Nathan Chancellor,
	Sami Tolvanen

On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >>
> >> On 2022-04-27, Ard Biesheuvel wrote:
> >> >Fortunately, we can convince the compiler to handle this in a way that
> >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> >PC-relative references that are resolved at link time are perfectly
> >> >fine.
> >>
> >> Agree
> >>
> >
> >The only unfortunate thing is that -fvisibility=hidden does not give
> >us the behavior we want, and we are forced to use the #pragma instead.
>
> Right. For a very long time there had been no option controlling the
> access mode for undefined symbols (-fvisibility= is for defined
> symbols).
>
> I added -fdirect-access-external-data to Clang which supports
> many architectures (x86, aarch64, arm, riscv, ...).
> GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>
> The use of `#pragma GCC visibility push(hidden)` looks good as a
> portable solution.

Portable, sure, which is fine for now.

But there's just something about injecting a header into ever TU via
-include in order to set a pragma and that there's such pragmas
effecting codegen that makes my skin crawl.

Perhaps we can come up with a formal feature request for toolchain
vendors for an actual command line flag?

Does the pragma have the same effect as
`-fdirect-access-external-data`/`-mdirect-extern-access`, or would
this feature request look like yet another distinct flag?
-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28 18:53         ` Nick Desaulniers
@ 2022-04-28 19:36           ` Ard Biesheuvel
  2022-04-29  7:03             ` Fangrui Song
  0 siblings, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-28 19:36 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Fangrui Song, Linux ARM, clang-built-linux, Will Deacon,
	Catalin Marinas, Kees Cook, Mark Rutland, Nathan Chancellor,
	Sami Tolvanen

On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
> >
> > On 2022-04-28, Ard Biesheuvel wrote:
> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> > >>
> > >> On 2022-04-27, Ard Biesheuvel wrote:
> > >> >Fortunately, we can convince the compiler to handle this in a way that
> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> > >> >symbol preemption or CoW footprint are of no concern to us, and so
> > >> >PC-relative references that are resolved at link time are perfectly
> > >> >fine.
> > >>
> > >> Agree
> > >>
> > >
> > >The only unfortunate thing is that -fvisibility=hidden does not give
> > >us the behavior we want, and we are forced to use the #pragma instead.
> >
> > Right. For a very long time there had been no option controlling the
> > access mode for undefined symbols (-fvisibility= is for defined
> > symbols).
> >
> > I added -fdirect-access-external-data to Clang which supports
> > many architectures (x86, aarch64, arm, riscv, ...).
> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
> >
> > The use of `#pragma GCC visibility push(hidden)` looks good as a
> > portable solution.
>
> Portable, sure, which is fine for now.
>
> But there's just something about injecting a header into ever TU via
> -include in order to set a pragma and that there's such pragmas
> effecting codegen that makes my skin crawl.
>
> Perhaps we can come up with a formal feature request for toolchain
> vendors for an actual command line flag?
>
> Does the pragma have the same effect as
> `-fdirect-access-external-data`/`-mdirect-extern-access`, or wvisould
> this feature request look like yet another distinct flag?

I agree that this is rather nasty. What I don't understand is why
-fvisibility=hidden gives different behavior to begin with, or why
-ffreestanding -fpie builds don't default to hidden visibility for
symbol declarations as well as definitions.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-28 19:36           ` Ard Biesheuvel
@ 2022-04-29  7:03             ` Fangrui Song
  2022-04-29  7:27               ` Ard Biesheuvel
  0 siblings, 1 reply; 17+ messages in thread
From: Fangrui Song @ 2022-04-29  7:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Nick Desaulniers, Linux ARM, clang-built-linux, Will Deacon,
	Catalin Marinas, Kees Cook, Mark Rutland, Nathan Chancellor,
	Sami Tolvanen

On 2022-04-28, Ard Biesheuvel wrote:
>On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>>
>> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>> >
>> > On 2022-04-28, Ard Biesheuvel wrote:
>> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>> > >>
>> > >> On 2022-04-27, Ard Biesheuvel wrote:
>> > >> >Fortunately, we can convince the compiler to handle this in a way that
>> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
>> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> > >> >symbol preemption or CoW footprint are of no concern to us, and so
>> > >> >PC-relative references that are resolved at link time are perfectly
>> > >> >fine.
>> > >>
>> > >> Agree
>> > >>
>> > >
>> > >The only unfortunate thing is that -fvisibility=hidden does not give
>> > >us the behavior we want, and we are forced to use the #pragma instead.
>> >
>> > Right. For a very long time there had been no option controlling the
>> > access mode for undefined symbols (-fvisibility= is for defined
>> > symbols).
>> >
>> > I added -fdirect-access-external-data to Clang which supports
>> > many architectures (x86, aarch64, arm, riscv, ...).
>> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>> >
>> > The use of `#pragma GCC visibility push(hidden)` looks good as a
>> > portable solution.
>>
>> Portable, sure, which is fine for now.
>>
>> But there's just something about injecting a header into ever TU via
>> -include in order to set a pragma and that there's such pragmas
>> effecting codegen that makes my skin crawl.
>>
>> Perhaps we can come up with a formal feature request for toolchain
>> vendors for an actual command line flag?
>>
>> Does the pragma have the same effect as
>> `-fdirect-access-external-data`/`-mdirect-extern-access`, or wvisould
>> this feature request look like yet another distinct flag?

`#pragma GCC visibility push(hidden)` is very similar to
-fvisibility=hidden -fdirect-access-external-data with Clang.
In Clang there are only two differences:

   // TLS initial-exec model with -fdirect-access-external-data;
   // TLS local-exec model with `#pragma GCC visibility push(hidden)`
   extern __thread int var;
   int foo() { return var; }

   // hidden visibility suppresses -fno-plt.
   // -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
   extern int bar();
   int foo() { return bar() + 2; }


The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
-fdirect-access-external-data can replace `#pragma GCC visibility
push(hidden)`.

>I agree that this is rather nasty. What I don't understand is why
>-fvisibility=hidden gives different behavior to begin with, or why
>-ffreestanding -fpie builds don't default to hidden visibility for
>symbol declarations as well as definitions.

-ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
musl) may use -ffreestanding to avoid libc dependencies from the host
environment. It may ship several shared objects and export multiple symbols.
Implied -fvisibility=hidden will get in the way.

There is a merit to make options orthogonal.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-29  7:03             ` Fangrui Song
@ 2022-04-29  7:27               ` Ard Biesheuvel
  2022-04-29  7:53                 ` Fangrui Song
  0 siblings, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2022-04-29  7:27 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Nick Desaulniers, Linux ARM, clang-built-linux, Will Deacon,
	Catalin Marinas, Kees Cook, Mark Rutland, Nathan Chancellor,
	Sami Tolvanen

On Fri, 29 Apr 2022 at 09:03, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
> >>
> >> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
> >> >
> >> > On 2022-04-28, Ard Biesheuvel wrote:
> >> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >> > >>
> >> > >> On 2022-04-27, Ard Biesheuvel wrote:
> >> > >> >Fortunately, we can convince the compiler to handle this in a way that
> >> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> > >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> > >> >PC-relative references that are resolved at link time are perfectly
> >> > >> >fine.
> >> > >>
> >> > >> Agree
> >> > >>
> >> > >
> >> > >The only unfortunate thing is that -fvisibility=hidden does not give
> >> > >us the behavior we want, and we are forced to use the #pragma instead.
> >> >
> >> > Right. For a very long time there had been no option controlling the
> >> > access mode for undefined symbols (-fvisibility= is for defined
> >> > symbols).
> >> >
> >> > I added -fdirect-access-external-data to Clang which supports
> >> > many architectures (x86, aarch64, arm, riscv, ...).
> >> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
> >> >
> >> > The use of `#pragma GCC visibility push(hidden)` looks good as a
> >> > portable solution.
> >>
> >> Portable, sure, which is fine for now.
> >>
> >> But there's just something about injecting a header into ever TU via
> >> -include in order to set a pragma and that there's such pragmas
> >> effecting codegen that makes my skin crawl.
> >>
> >> Perhaps we can come up with a formal feature request for toolchain
> >> vendors for an actual command line flag?
> >>
> >> Does the pragma have the same effect as
> >> `-fdirect-access-external-data`/`-mdirect-extern-access`, or wvisould
> >> this feature request look like yet another distinct flag?
>
> `#pragma GCC visibility push(hidden)` is very similar to
> -fvisibility=hidden -fdirect-access-external-data with Clang.
> In Clang there are only two differences:
>
>    // TLS initial-exec model with -fdirect-access-external-data;
>    // TLS local-exec model with `#pragma GCC visibility push(hidden)`
>    extern __thread int var;
>    int foo() { return var; }
>
>    // hidden visibility suppresses -fno-plt.
>    // -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
>    extern int bar();
>    int foo() { return bar() + 2; }
>
>
> The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
> -fdirect-access-external-data can replace `#pragma GCC visibility
> push(hidden)`.
>

OK. But you mentioned that GCC does not implement
-mdirect-extern-access for AArch64, right? So for now, the pragma is
the only portable option we have.

> >I agree that this is rather nasty. What I don't understand is why
> >-fvisibility=hidden gives different behavior to begin with, or why
> >-ffreestanding -fpie builds don't default to hidden visibility for
> >symbol declarations as well as definitions.
>
> -ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
> musl) may use -ffreestanding to avoid libc dependencies from the host
> environment. It may ship several shared objects and export multiple symbols.
> Implied -fvisibility=hidden will get in the way.
>
> There is a merit to make options orthogonal.

Fair enough.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels
  2022-04-29  7:27               ` Ard Biesheuvel
@ 2022-04-29  7:53                 ` Fangrui Song
  0 siblings, 0 replies; 17+ messages in thread
From: Fangrui Song @ 2022-04-29  7:53 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Nick Desaulniers, Linux ARM, clang-built-linux, Will Deacon,
	Catalin Marinas, Kees Cook, Mark Rutland, Nathan Chancellor,
	Sami Tolvanen

On 2022-04-29, Ard Biesheuvel wrote:
>On Fri, 29 Apr 2022 at 09:03, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-04-28, Ard Biesheuvel wrote:
>> >On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>> >>
>> >> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>> >> >
>> >> > On 2022-04-28, Ard Biesheuvel wrote:
>> >> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>> >> > >>
>> >> > >> On 2022-04-27, Ard Biesheuvel wrote:
>> >> > >> >Fortunately, we can convince the compiler to handle this in a way that
>> >> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
>> >> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> >> > >> >symbol preemption or CoW footprint are of no concern to us, and so
>> >> > >> >PC-relative references that are resolved at link time are perfectly
>> >> > >> >fine.
>> >> > >>
>> >> > >> Agree
>> >> > >>
>> >> > >
>> >> > >The only unfortunate thing is that -fvisibility=hidden does not give
>> >> > >us the behavior we want, and we are forced to use the #pragma instead.
>> >> >
>> >> > Right. For a very long time there had been no option controlling the
>> >> > access mode for undefined symbols (-fvisibility= is for defined
>> >> > symbols).
>> >> >
>> >> > I added -fdirect-access-external-data to Clang which supports
>> >> > many architectures (x86, aarch64, arm, riscv, ...).
>> >> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>> >> >
>> >> > The use of `#pragma GCC visibility push(hidden)` looks good as a
>> >> > portable solution.
>> >>
>> >> Portable, sure, which is fine for now.
>> >>
>> >> But there's just something about injecting a header into ever TU via
>> >> -include in order to set a pragma and that there's such pragmas
>> >> effecting codegen that makes my skin crawl.
>> >>
>> >> Perhaps we can come up with a formal feature request for toolchain
>> >> vendors for an actual command line flag?
>> >>
>> >> Does the pragma have the same effect as
>> >> `-fdirect-access-external-data`/`-mdirect-extern-access`, or wvisould
>> >> this feature request look like yet another distinct flag?
>>
>> `#pragma GCC visibility push(hidden)` is very similar to
>> -fvisibility=hidden -fdirect-access-external-data with Clang.
>> In Clang there are only two differences:
>>
>>    // TLS initial-exec model with -fdirect-access-external-data;
>>    // TLS local-exec model with `#pragma GCC visibility push(hidden)`
>>    extern __thread int var;
>>    int foo() { return var; }
>>
>>    // hidden visibility suppresses -fno-plt.
>>    // -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
>>    extern int bar();
>>    int foo() { return bar() + 2; }
>>
>>
>> The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
>> -fdirect-access-external-data can replace `#pragma GCC visibility
>> push(hidden)`.
>>
>
>OK. But you mentioned that GCC does not implement
>-mdirect-extern-access for AArch64, right? So for now, the pragma is
>the only portable option we have.

Right.

>> >I agree that this is rather nasty. What I don't understand is why
>> >-fvisibility=hidden gives different behavior to begin with, or why
>> >-ffreestanding -fpie builds don't default to hidden visibility for
>> >symbol declarations as well as definitions.
>>
>> -ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
>> musl) may use -ffreestanding to avoid libc dependencies from the host
>> environment. It may ship several shared objects and export multiple symbols.
>> Implied -fvisibility=hidden will get in the way.
>>
>> There is a merit to make options orthogonal.
>
>Fair enough.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-04-29  7:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-27 17:12 [RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel Ard Biesheuvel
2022-04-27 17:12 ` [RFC PATCH 1/2] arm64: jump_label: use more precise asm constraints Ard Biesheuvel
2022-04-27 18:58   ` Nick Desaulniers
2022-04-27 21:50     ` Ard Biesheuvel
2022-04-28  9:35       ` Ard Biesheuvel
2022-04-28  9:51   ` Mark Rutland
2022-04-28 16:05     ` Ard Biesheuvel
2022-04-27 17:12 ` [RFC PATCH 2/2] arm64: kernel: switch to PIE code generation for relocatable kernels Ard Biesheuvel
2022-04-28  2:40   ` Fangrui Song
2022-04-28  6:23     ` Ard Biesheuvel
2022-04-28  6:57       ` Fangrui Song
2022-04-28 16:03         ` Ard Biesheuvel
2022-04-28 18:53         ` Nick Desaulniers
2022-04-28 19:36           ` Ard Biesheuvel
2022-04-29  7:03             ` Fangrui Song
2022-04-29  7:27               ` Ard Biesheuvel
2022-04-29  7:53                 ` Fangrui Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.