* [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible
@ 2023-04-28  9:50 Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 01/43] x86/crypto: Adapt assembly for PIE support Hou Wenlong
                   ` (43 more replies)
  0 siblings, 44 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Nathan Chancellor, Nick Desaulniers, Tom Rix, bpf, llvm

Purpose:

These patches make the changes necessary to build the kernel as a
Position Independent Executable (PIE) on x86_64. A PIE kernel can be
relocated below the top 2G of the virtual address space, and this
patchset provides an example that allows the kernel image to be
relocated into the top 512G of the address space.

The ultimate purpose of a PIE kernel is to increase the security of the
kernel and also the flexibility of the kernel image's virtual address,
which can even be in the lower half of the address space. The more
locations the kernel can fit in, the harder it is for an attacker to
guess where it is.

The patchset is based on Thomas Garnier's X86 PIE patchset v6[1] and
v11[2]. However, some design changes are made and some bugs are fixed by
testing with different configurations and compilers.

  Important changes:
  - For the fixmap area, move the vsyscall page out of the fixmap area
    and unify __FIXADDR_TOP for x86. The fixmap area can then be
    relocated together with the kernel image.

  - For the compile-time base address of the kernel image, keep it in
    the top 2G of the address space. Introduce a new variable to store
    the run-time base address and adapt the VA/PA conversions to use it
    at runtime.

  - For the percpu section, keep it zero-based for SMP. Because the
    compile-time base address of the kernel image still resides in the
    top 2G of the address space, RIP-relative references can still be
    used while the percpu section is zero-based. However, when
    relocating percpu variable references, percpu variables should be
    treated as normal variables and their absolute references relocated
    accordingly. In addition, the relocation offset should be subtracted
    from the GS base to ensure correct operation (see the sketch after
    this list).

  - For x86/boot/head64.c, don't build it with mcmodel=large. Instead,
    use a data relocation to acquire a global symbol's value and make
    fixup_pointer() a nop when running in the identity mapping. This is
    because not all global symbol references in the code use
    fixup_pointer(), e.g. variables in macros related to 5-level paging,
    which can be optimized by GCC into relative references. If it were
    built with mcmodel=large, there would be more fixup_pointer() calls,
    resulting in uglier code. Actually, if it were built as PIE even
    when CONFIG_X86_PIE is disabled, all fixup_pointer() calls could be
    dropped. However, the stack protector would be broken if a per-cpu
    stack canary is not supported.
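
  As a rough sketch of the percpu handling above (the symbol name and
  the 'delta' offset are illustrative, not taken from the actual
  patches):

	# Compile time: zero-based percpu access via an absolute disp32
	movq	%gs:this_cpu_var, %rax
	# After relocating the kernel image by delta, the reference is
	# fixed up like any other absolute symbol reference, so the
	# displacement becomes (this_cpu_var + delta).  Setting the GS
	# base to (percpu_base - delta) keeps the effective address
	# unchanged:
	#   (percpu_base - delta) + (this_cpu_var + delta)
	#     == percpu_base + this_cpu_var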

  Limitations:
  - Since I am not familiar with Xen, it has been disabled for now as it
    is not adapted for PIE. This is due to the assignment of wrong
    pointers (with low address values) to x86_init_ops when running in
    the identity mapping. This issue can be resolved by building the
    page tables early and jumping to the high kernel address as soon as
    possible.

  - It is not allowed to reference global variables in an alternative
    section, since RIP-relative addressing is not fixed up in
    apply_alternatives() (see the illustration after this list).
    Fortunately, all disallowed relocations in the alternative section
    can be caught by objtool. I believe this issue could also be fixed
    by using objtool.

  - For module loading, only modules without a GOT are allowed, for
    simplicity. Only weak global variable references use the GOT.
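
  A hypothetical illustration of the alternatives limitation above (the
  feature flag and the symbol are made up for this example):

	ALTERNATIVE "", "movq global_var(%rip), %rax", X86_FEATURE_ALWAYS

  The replacement instruction is assembled into .altinstr_replacement,
  so its RIP-relative displacement is computed against that section.
  apply_alternatives() copies the bytes into the original location in
  .text without adjusting the displacement, so the copied instruction
  would reference the wrong address; objtool is used to catch such
  references instead.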

  Tests:
    I have only tested booting with GCC 5.1.0 (minimum supported
    version), GCC 12.2.0 and Clang 15.0.7. I have also run the following
    tests with both the default configuration and the Ubuntu
    configuration.

Performance/Size impact (GCC 12.2.0):

Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.012%
 - PIE enabled: -2.219%
 instructions:
 - PIE disabled: same
 - PIE enabled: +1.383%
 .text section:
 - PIE disabled: same
 - PIE enabled: +0.589%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: same
 - PIE enabled: +2.391%
 instructions:
 - PIE disabled: +0.013%
 - PIE enabled: +1.566%
 .text section:
 - PIE disabled: same
 - PIE enabled: +0.055%

The .text section size increase is due to the extra instructions
required for PIE code. There are two reasons, both mentioned in
previous mailing list discussions. Firstly, switch folding is disabled
under PIE [3]. Secondly, PIE needs two instructions where
mcmodel=kernel needs only one with a sign-extended absolute
displacement, such as when accessing an array element: PIE has to use
an extra lea to get the base address of the array first, as
illustrated below.
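
For example, loading an array element (the symbol and registers are
illustrative):

	# mcmodel=kernel: one instruction, using a 32-bit sign-extended
	# absolute displacement; this only works because the kernel is
	# linked in the top 2G of the address space
	movl	array(, %rdi, 4), %eax

	# PIE: RIP-relative addressing cannot be combined with an index
	# register, so the base address has to be loaded first
	leaq	array(%rip), %rax
	movl	(%rax, %rdi, 4), %eax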

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg -/+ 0.5% on default config).
 - PIE enabled: -2% to +2% on average (default config).

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.2% on ubuntu config)
 - PIE enabled: average -0.2% to +0.2%
 System Time:
 - PIE disabled: no significant change (avg -0.5% on ubuntu config)
 - PIE enabled: average -0.5% to +0.5%

[1] https://lore.kernel.org/all/20190131192533.34130-1-thgarnie@chromium.org
[2] https://lore.kernel.org/all/20200228000105.165012-1-thgarnie@chromium.org
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

Brian Gerst (1):
  x86-64: Use per-cpu stack canary if supported by compiler

Hou Wenlong (29):
  x86/irq: Adapt assembly for PIE support
  x86,rethook: Adapt assembly for PIE support
  x86/paravirt: Use relative reference for original instruction
  x86/Kconfig: Introduce new Kconfig for PIE kernel building
  x86/PVH: Use fixed_percpu_data to set up GS base
  x86/pie: Enable stack protector only if per-cpu stack canary is
    supported
  x86/percpu: Use PC-relative addressing for percpu variable references
  x86/tools: Explicitly include autoconf.h for hostprogs
  x86/percpu: Adapt percpu references relocation for PIE support
  x86/ftrace: Adapt assembly for PIE support
  x86/pie: Force hidden visibility for all symbol references
  x86/boot/compressed: Adapt sed command to generate voffset.h when PIE
    is enabled
  x86/pie: Add .data.rel.* sections into link script
  KVM: x86: Adapt assembly for PIE support
  x86/PVH: Adapt PVH booting for PIE support
  x86/bpf: Adapt BPF_CALL JIT codegen for PIE support
  x86/modules: Adapt module loading for PIE support
  x86/boot/64: Use data relocation to get absloute address when PIE is
    enabled
  objtool: Add validation for x86 PIE support
  objtool: Adapt indirect call of __fentry__() for PIE support
  x86/pie: Build the kernel as PIE
  x86/vsyscall: Don't use set_fixmap() to map vsyscall page
  x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap
    area
  x86/fixmap: Move vsyscall page out of fixmap area
  x86/fixmap: Unify FIXADDR_TOP
  x86/boot: Fill kernel image puds dynamically
  x86/mm: Sort address_markers array when X86 PIE is enabled
  x86/pie: Allow kernel image to be relocated in top 512G
  x86/boot: Extend relocate range for PIE kernel image

Thomas Garnier (13):
  x86/crypto: Adapt assembly for PIE support
  x86: Add macro to get symbol address for PIE support
  x86: relocate_kernel - Adapt assembly for PIE support
  x86/entry/64: Adapt assembly for PIE support
  x86: pm-trace: Adapt assembly for PIE support
  x86/CPU: Adapt assembly for PIE support
  x86/acpi: Adapt assembly for PIE support
  x86/boot/64: Adapt assembly for PIE support
  x86/power/64: Adapt assembly for PIE support
  x86/alternatives: Adapt assembly for PIE support
  x86/ftrace: Adapt ftrace nop patching for PIE support
  x86/mm: Make the x86 GOT read-only
  x86/relocs: Handle PIE relocations

 Documentation/x86/x86_64/mm.rst              |   4 +
 arch/x86/Kconfig                             |  36 +++++-
 arch/x86/Makefile                            |  33 +++--
 arch/x86/boot/compressed/Makefile            |   2 +-
 arch/x86/boot/compressed/kaslr.c             |  55 +++++++++
 arch/x86/boot/compressed/misc.c              |   4 +-
 arch/x86/boot/compressed/misc.h              |   9 ++
 arch/x86/crypto/aegis128-aesni-asm.S         |   6 +-
 arch/x86/crypto/aesni-intel_asm.S            |   2 +-
 arch/x86/crypto/aesni-intel_avx-x86_64.S     |   3 +-
 arch/x86/crypto/aria-aesni-avx-asm_64.S      |  30 ++---
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |  30 ++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |  30 ++---
 arch/x86/crypto/camellia-x86_64-asm_64.S     |   8 +-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S    |  50 ++++----
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S    |  44 ++++---
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S    |   3 +-
 arch/x86/crypto/des3_ede-asm_64.S            |  96 ++++++++++-----
 arch/x86/crypto/ghash-clmulni-intel_asm.S    |   4 +-
 arch/x86/crypto/sha256-avx2-asm.S            |  18 ++-
 arch/x86/entry/calling.h                     |  17 ++-
 arch/x86/entry/entry_64.S                    |  22 +++-
 arch/x86/entry/vdso/Makefile                 |   2 +-
 arch/x86/entry/vsyscall/vsyscall_64.c        |   7 +-
 arch/x86/include/asm/alternative.h           |   6 +-
 arch/x86/include/asm/asm.h                   |   1 +
 arch/x86/include/asm/fixmap.h                |  28 +----
 arch/x86/include/asm/irq_stack.h             |   2 +-
 arch/x86/include/asm/kmsan.h                 |   6 +-
 arch/x86/include/asm/nospec-branch.h         |  10 +-
 arch/x86/include/asm/page_64.h               |   8 +-
 arch/x86/include/asm/page_64_types.h         |   8 ++
 arch/x86/include/asm/paravirt.h              |  17 ++-
 arch/x86/include/asm/paravirt_types.h        |  12 +-
 arch/x86/include/asm/percpu.h                |  29 ++++-
 arch/x86/include/asm/pgtable_64_types.h      |  10 +-
 arch/x86/include/asm/pm-trace.h              |   2 +-
 arch/x86/include/asm/processor.h             |  17 ++-
 arch/x86/include/asm/sections.h              |   5 +
 arch/x86/include/asm/stackprotector.h        |  16 ++-
 arch/x86/include/asm/sync_core.h             |   6 +-
 arch/x86/include/asm/vsyscall.h              |  13 ++
 arch/x86/kernel/acpi/wakeup_64.S             |  31 ++---
 arch/x86/kernel/alternative.c                |   8 +-
 arch/x86/kernel/asm-offsets_64.c             |   2 +-
 arch/x86/kernel/callthunks.c                 |   2 +-
 arch/x86/kernel/cpu/common.c                 |  15 ++-
 arch/x86/kernel/ftrace.c                     |  46 ++++++-
 arch/x86/kernel/ftrace_64.S                  |   9 +-
 arch/x86/kernel/head64.c                     |  77 +++++++++---
 arch/x86/kernel/head_64.S                    |  68 ++++++++---
 arch/x86/kernel/kvm.c                        |  21 +++-
 arch/x86/kernel/module.c                     |  27 +++++
 arch/x86/kernel/paravirt.c                   |   4 +
 arch/x86/kernel/relocate_kernel_64.S         |   2 +-
 arch/x86/kernel/rethook.c                    |   8 ++
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/kernel/vmlinux.lds.S                |  10 +-
 arch/x86/kvm/svm/vmenter.S                   |  10 +-
 arch/x86/kvm/vmx/vmenter.S                   |   2 +-
 arch/x86/lib/cmpxchg16b_emu.S                |   8 +-
 arch/x86/mm/dump_pagetables.c                |  36 +++++-
 arch/x86/mm/fault.c                          |   1 -
 arch/x86/mm/init_64.c                        |  10 +-
 arch/x86/mm/ioremap.c                        |   5 +-
 arch/x86/mm/kasan_init_64.c                  |   4 +-
 arch/x86/mm/pat/set_memory.c                 |   2 +-
 arch/x86/mm/pgtable.c                        |  13 ++
 arch/x86/mm/pgtable_32.c                     |   3 -
 arch/x86/mm/physaddr.c                       |  14 +--
 arch/x86/net/bpf_jit_comp.c                  |  17 ++-
 arch/x86/platform/efi/efi_thunk_64.S         |   4 +
 arch/x86/platform/pvh/head.S                 |  29 ++++-
 arch/x86/power/hibernate_asm_64.S            |   4 +-
 arch/x86/tools/Makefile                      |   4 +-
 arch/x86/tools/relocs.c                      | 113 ++++++++++++++++-
 arch/x86/xen/mmu_pv.c                        |  32 +++--
 arch/x86/xen/xen-asm.S                       |  10 +-
 arch/x86/xen/xen-head.S                      |  14 ++-
 include/asm-generic/vmlinux.lds.h            |  12 ++
 scripts/Makefile.lib                         |   1 +
 scripts/recordmcount.c                       |  81 ++++++++-----
 tools/objtool/arch/x86/decode.c              |  10 +-
 tools/objtool/builtin-check.c                |   4 +-
 tools/objtool/check.c                        | 121 +++++++++++++++++++
 tools/objtool/include/objtool/builtin.h      |   1 +
 86 files changed, 1202 insertions(+), 410 deletions(-)


Patchset is based on tip/master.
base-commit: 01cbd032298654fe4c85e153dd9a224e5bc10194
--
2.31.1



* [PATCH RFC 01/43] x86/crypto: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 02/43] x86: Add macro to get symbol address " Hou Wenlong
                   ` (42 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Herbert Xu, David S. Miller, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, linux-crypto

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only relative references to symbols so
that the kernel can be PIE compatible.

[Hou Wenlong: Adapt new assembly code in x86/crypto]

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/crypto/aegis128-aesni-asm.S         |  6 +-
 arch/x86/crypto/aesni-intel_asm.S            |  2 +-
 arch/x86/crypto/aesni-intel_avx-x86_64.S     |  3 +-
 arch/x86/crypto/aria-aesni-avx-asm_64.S      | 30 +++---
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  | 30 +++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 30 +++---
 arch/x86/crypto/camellia-x86_64-asm_64.S     |  8 +-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S    | 50 +++++-----
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S    | 44 +++++----
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S    |  3 +-
 arch/x86/crypto/des3_ede-asm_64.S            | 96 +++++++++++++-------
 arch/x86/crypto/ghash-clmulni-intel_asm.S    |  4 +-
 arch/x86/crypto/sha256-avx2-asm.S            | 18 ++--
 13 files changed, 187 insertions(+), 137 deletions(-)

diff --git a/arch/x86/crypto/aegis128-aesni-asm.S b/arch/x86/crypto/aegis128-aesni-asm.S
index cdf3215ec272..ad7f4c891625 100644
--- a/arch/x86/crypto/aegis128-aesni-asm.S
+++ b/arch/x86/crypto/aegis128-aesni-asm.S
@@ -201,8 +201,8 @@ SYM_FUNC_START(crypto_aegis128_aesni_init)
 	movdqa KEY, STATE4
 
 	/* load the constants: */
-	movdqa .Laegis128_const_0, STATE2
-	movdqa .Laegis128_const_1, STATE1
+	movdqa .Laegis128_const_0(%rip), STATE2
+	movdqa .Laegis128_const_1(%rip), STATE1
 	pxor STATE2, STATE3
 	pxor STATE1, STATE4
 
@@ -682,7 +682,7 @@ SYM_TYPED_FUNC_START(crypto_aegis128_aesni_dec_tail)
 	punpcklbw T0, T0
 	punpcklbw T0, T0
 	punpcklbw T0, T0
-	movdqa .Laegis128_counter, T1
+	movdqa .Laegis128_counter(%rip), T1
 	pcmpgtb T1, T0
 	pand T0, MSG
 
diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 837c1e0aa021..ca99a2274d55 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -2717,7 +2717,7 @@ SYM_FUNC_END(aesni_cts_cbc_dec)
  *	BSWAP_MASK == endian swapping mask
  */
 SYM_FUNC_START_LOCAL(_aesni_inc_init)
-	movaps .Lbswap_mask, BSWAP_MASK
+	movaps .Lbswap_mask(%rip), BSWAP_MASK
 	movaps IV, CTR
 	pshufb BSWAP_MASK, CTR
 	mov $1, TCTR_LOW
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 0852ab573fd3..9f3a2fc56c24 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -649,7 +649,8 @@ _get_AAD_rest0\@:
 	vpshufb and an array of shuffle masks */
 	movq    %r12, %r11
 	salq    $4, %r11
-	vmovdqu  aad_shift_arr(%r11), \T1
+	leaq    aad_shift_arr(%rip), %rax
+	vmovdqu  (%rax,%r11,), \T1
 	vpshufb \T1, \T7, \T7
 _get_AAD_rest_final\@:
 	vpshufb SHUF_MASK(%rip), \T7, \T7
diff --git a/arch/x86/crypto/aria-aesni-avx-asm_64.S b/arch/x86/crypto/aria-aesni-avx-asm_64.S
index 9243f6289d34..e4f9b624d98c 100644
--- a/arch/x86/crypto/aria-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/aria-aesni-avx-asm_64.S
@@ -80,7 +80,7 @@
 	transpose_4x4(c0, c1, c2, c3, a0, a1);		\
 	transpose_4x4(d0, d1, d2, d3, a0, a1);		\
 							\
-	vmovdqu .Lshufb_16x16b, a0;			\
+	vmovdqu .Lshufb_16x16b(%rip), a0;		\
 	vmovdqu st1, a1;				\
 	vpshufb a0, a2, a2;				\
 	vpshufb a0, a3, a3;				\
@@ -132,7 +132,7 @@
 	transpose_4x4(c0, c1, c2, c3, a0, a1);		\
 	transpose_4x4(d0, d1, d2, d3, a0, a1);		\
 							\
-	vmovdqu .Lshufb_16x16b, a0;			\
+	vmovdqu .Lshufb_16x16b(%rip), a0;		\
 	vmovdqu st1, a1;				\
 	vpshufb a0, a2, a2;				\
 	vpshufb a0, a3, a3;				\
@@ -300,11 +300,11 @@
 			    x4, x5, x6, x7,		\
 			    t0, t1, t2, t3,		\
 			    t4, t5, t6, t7)		\
-	vmovdqa .Ltf_s2_bitmatrix, t0;			\
-	vmovdqa .Ltf_inv_bitmatrix, t1;			\
-	vmovdqa .Ltf_id_bitmatrix, t2;			\
-	vmovdqa .Ltf_aff_bitmatrix, t3;			\
-	vmovdqa .Ltf_x2_bitmatrix, t4;			\
+	vmovdqa .Ltf_s2_bitmatrix(%rip), t0;		\
+	vmovdqa .Ltf_inv_bitmatrix(%rip), t1;		\
+	vmovdqa .Ltf_id_bitmatrix(%rip), t2;		\
+	vmovdqa .Ltf_aff_bitmatrix(%rip), t3;		\
+	vmovdqa .Ltf_x2_bitmatrix(%rip), t4;		\
 	vgf2p8affineinvqb $(tf_s2_const), t0, x1, x1;	\
 	vgf2p8affineinvqb $(tf_s2_const), t0, x5, x5;	\
 	vgf2p8affineqb $(tf_inv_const), t1, x2, x2;	\
@@ -324,13 +324,13 @@
 		       x4, x5, x6, x7,			\
 		       t0, t1, t2, t3,			\
 		       t4, t5, t6, t7)			\
-	vmovdqa .Linv_shift_row, t0;			\
-	vmovdqa .Lshift_row, t1;			\
-	vbroadcastss .L0f0f0f0f, t6;			\
-	vmovdqa .Ltf_lo__inv_aff__and__s2, t2;		\
-	vmovdqa .Ltf_hi__inv_aff__and__s2, t3;		\
-	vmovdqa .Ltf_lo__x2__and__fwd_aff, t4;		\
-	vmovdqa .Ltf_hi__x2__and__fwd_aff, t5;		\
+	vmovdqa .Linv_shift_row(%rip), t0;		\
+	vmovdqa .Lshift_row(%rip), t1;			\
+	vpbroadcastd .L0f0f0f0f(%rip), t6;		\
+	vmovdqa .Ltf_lo__inv_aff__and__s2(%rip), t2;	\
+	vmovdqa .Ltf_hi__inv_aff__and__s2(%rip), t3;	\
+	vmovdqa .Ltf_lo__x2__and__fwd_aff(%rip), t4;	\
+	vmovdqa .Ltf_hi__x2__and__fwd_aff(%rip), t5;	\
 							\
 	vaesenclast t7, x0, x0;				\
 	vaesenclast t7, x4, x4;				\
@@ -1035,7 +1035,7 @@ SYM_FUNC_START_LOCAL(__aria_aesni_avx_ctr_gen_keystream_16way)
 	/* load IV and byteswap */
 	vmovdqu (%r8), %xmm8;
 
-	vmovdqa .Lbswap128_mask (%rip), %xmm1;
+	vmovdqa .Lbswap128_mask(%rip), %xmm1;
 	vpshufb %xmm1, %xmm8, %xmm3; /* be => le */
 
 	vpcmpeqd %xmm0, %xmm0, %xmm0;
diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index 4a30618281ec..646477a13e11 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -52,10 +52,10 @@
 	/* \
 	 * S-function with AES subbytes \
 	 */ \
-	vmovdqa .Linv_shift_row, t4; \
-	vbroadcastss .L0f0f0f0f, t7; \
-	vmovdqa .Lpre_tf_lo_s1, t0; \
-	vmovdqa .Lpre_tf_hi_s1, t1; \
+	vmovdqa .Linv_shift_row(%rip), t4; \
+	vbroadcastss .L0f0f0f0f(%rip), t7; \
+	vmovdqa .Lpre_tf_lo_s1(%rip), t0; \
+	vmovdqa .Lpre_tf_hi_s1(%rip), t1; \
 	\
 	/* AES inverse shift rows */ \
 	vpshufb t4, x0, x0; \
@@ -68,8 +68,8 @@
 	vpshufb t4, x6, x6; \
 	\
 	/* prefilter sboxes 1, 2 and 3 */ \
-	vmovdqa .Lpre_tf_lo_s4, t2; \
-	vmovdqa .Lpre_tf_hi_s4, t3; \
+	vmovdqa .Lpre_tf_lo_s4(%rip), t2; \
+	vmovdqa .Lpre_tf_hi_s4(%rip), t3; \
 	filter_8bit(x0, t0, t1, t7, t6); \
 	filter_8bit(x7, t0, t1, t7, t6); \
 	filter_8bit(x1, t0, t1, t7, t6); \
@@ -83,8 +83,8 @@
 	filter_8bit(x6, t2, t3, t7, t6); \
 	\
 	/* AES subbytes + AES shift rows */ \
-	vmovdqa .Lpost_tf_lo_s1, t0; \
-	vmovdqa .Lpost_tf_hi_s1, t1; \
+	vmovdqa .Lpost_tf_lo_s1(%rip), t0; \
+	vmovdqa .Lpost_tf_hi_s1(%rip), t1; \
 	vaesenclast t4, x0, x0; \
 	vaesenclast t4, x7, x7; \
 	vaesenclast t4, x1, x1; \
@@ -95,16 +95,16 @@
 	vaesenclast t4, x6, x6; \
 	\
 	/* postfilter sboxes 1 and 4 */ \
-	vmovdqa .Lpost_tf_lo_s3, t2; \
-	vmovdqa .Lpost_tf_hi_s3, t3; \
+	vmovdqa .Lpost_tf_lo_s3(%rip), t2; \
+	vmovdqa .Lpost_tf_hi_s3(%rip), t3; \
 	filter_8bit(x0, t0, t1, t7, t6); \
 	filter_8bit(x7, t0, t1, t7, t6); \
 	filter_8bit(x3, t0, t1, t7, t6); \
 	filter_8bit(x6, t0, t1, t7, t6); \
 	\
 	/* postfilter sbox 3 */ \
-	vmovdqa .Lpost_tf_lo_s2, t4; \
-	vmovdqa .Lpost_tf_hi_s2, t5; \
+	vmovdqa .Lpost_tf_lo_s2(%rip), t4; \
+	vmovdqa .Lpost_tf_hi_s2(%rip), t5; \
 	filter_8bit(x2, t2, t3, t7, t6); \
 	filter_8bit(x5, t2, t3, t7, t6); \
 	\
@@ -443,7 +443,7 @@ SYM_FUNC_END(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 	transpose_4x4(c0, c1, c2, c3, a0, a1); \
 	transpose_4x4(d0, d1, d2, d3, a0, a1); \
 	\
-	vmovdqu .Lshufb_16x16b, a0; \
+	vmovdqu .Lshufb_16x16b(%rip), a0; \
 	vmovdqu st1, a1; \
 	vpshufb a0, a2, a2; \
 	vpshufb a0, a3, a3; \
@@ -482,7 +482,7 @@ SYM_FUNC_END(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 #define inpack16_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
 		     y6, y7, rio, key) \
 	vmovq key, x0; \
-	vpshufb .Lpack_bswap, x0, x0; \
+	vpshufb .Lpack_bswap(%rip), x0, x0; \
 	\
 	vpxor 0 * 16(rio), x0, y7; \
 	vpxor 1 * 16(rio), x0, y6; \
@@ -533,7 +533,7 @@ SYM_FUNC_END(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 	vmovdqu x0, stack_tmp0; \
 	\
 	vmovq key, x0; \
-	vpshufb .Lpack_bswap, x0, x0; \
+	vpshufb .Lpack_bswap(%rip), x0, x0; \
 	\
 	vpxor x0, y7, y7; \
 	vpxor x0, y6, y6; \
diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index deaf62aa73a6..a0eb94e53b1b 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -64,12 +64,12 @@
 	/* \
 	 * S-function with AES subbytes \
 	 */ \
-	vbroadcasti128 .Linv_shift_row, t4; \
-	vpbroadcastd .L0f0f0f0f, t7; \
-	vbroadcasti128 .Lpre_tf_lo_s1, t5; \
-	vbroadcasti128 .Lpre_tf_hi_s1, t6; \
-	vbroadcasti128 .Lpre_tf_lo_s4, t2; \
-	vbroadcasti128 .Lpre_tf_hi_s4, t3; \
+	vbroadcasti128 .Linv_shift_row(%rip), t4; \
+	vpbroadcastd .L0f0f0f0f(%rip), t7; \
+	vbroadcasti128 .Lpre_tf_lo_s1(%rip), t5; \
+	vbroadcasti128 .Lpre_tf_hi_s1(%rip), t6; \
+	vbroadcasti128 .Lpre_tf_lo_s4(%rip), t2; \
+	vbroadcasti128 .Lpre_tf_hi_s4(%rip), t3; \
 	\
 	/* AES inverse shift rows */ \
 	vpshufb t4, x0, x0; \
@@ -115,8 +115,8 @@
 	vinserti128 $1, t2##_x, x6, x6; \
 	vextracti128 $1, x1, t3##_x; \
 	vextracti128 $1, x4, t2##_x; \
-	vbroadcasti128 .Lpost_tf_lo_s1, t0; \
-	vbroadcasti128 .Lpost_tf_hi_s1, t1; \
+	vbroadcasti128 .Lpost_tf_lo_s1(%rip), t0; \
+	vbroadcasti128 .Lpost_tf_hi_s1(%rip), t1; \
 	vaesenclast t4##_x, x2##_x, x2##_x; \
 	vaesenclast t4##_x, t6##_x, t6##_x; \
 	vinserti128 $1, t6##_x, x2, x2; \
@@ -131,16 +131,16 @@
 	vinserti128 $1, t2##_x, x4, x4; \
 	\
 	/* postfilter sboxes 1 and 4 */ \
-	vbroadcasti128 .Lpost_tf_lo_s3, t2; \
-	vbroadcasti128 .Lpost_tf_hi_s3, t3; \
+	vbroadcasti128 .Lpost_tf_lo_s3(%rip), t2; \
+	vbroadcasti128 .Lpost_tf_hi_s3(%rip), t3; \
 	filter_8bit(x0, t0, t1, t7, t6); \
 	filter_8bit(x7, t0, t1, t7, t6); \
 	filter_8bit(x3, t0, t1, t7, t6); \
 	filter_8bit(x6, t0, t1, t7, t6); \
 	\
 	/* postfilter sbox 3 */ \
-	vbroadcasti128 .Lpost_tf_lo_s2, t4; \
-	vbroadcasti128 .Lpost_tf_hi_s2, t5; \
+	vbroadcasti128 .Lpost_tf_lo_s2(%rip), t4; \
+	vbroadcasti128 .Lpost_tf_hi_s2(%rip), t5; \
 	filter_8bit(x2, t2, t3, t7, t6); \
 	filter_8bit(x5, t2, t3, t7, t6); \
 	\
@@ -475,7 +475,7 @@ SYM_FUNC_END(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 	transpose_4x4(c0, c1, c2, c3, a0, a1); \
 	transpose_4x4(d0, d1, d2, d3, a0, a1); \
 	\
-	vbroadcasti128 .Lshufb_16x16b, a0; \
+	vbroadcasti128 .Lshufb_16x16b(%rip), a0; \
 	vmovdqu st1, a1; \
 	vpshufb a0, a2, a2; \
 	vpshufb a0, a3, a3; \
@@ -514,7 +514,7 @@ SYM_FUNC_END(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 #define inpack32_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
 		     y6, y7, rio, key) \
 	vpbroadcastq key, x0; \
-	vpshufb .Lpack_bswap, x0, x0; \
+	vpshufb .Lpack_bswap(%rip), x0, x0; \
 	\
 	vpxor 0 * 32(rio), x0, y7; \
 	vpxor 1 * 32(rio), x0, y6; \
@@ -565,7 +565,7 @@ SYM_FUNC_END(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 	vmovdqu x0, stack_tmp0; \
 	\
 	vpbroadcastq key, x0; \
-	vpshufb .Lpack_bswap, x0, x0; \
+	vpshufb .Lpack_bswap(%rip), x0, x0; \
 	\
 	vpxor x0, y7, y7; \
 	vpxor x0, y6, y6; \
diff --git a/arch/x86/crypto/camellia-x86_64-asm_64.S b/arch/x86/crypto/camellia-x86_64-asm_64.S
index 347c059f5940..b7c822d813a8 100644
--- a/arch/x86/crypto/camellia-x86_64-asm_64.S
+++ b/arch/x86/crypto/camellia-x86_64-asm_64.S
@@ -77,11 +77,13 @@
 #define RXORbl %r9b
 
 #define xor2ror16(T0, T1, tmp1, tmp2, ab, dst) \
+	leaq T0(%rip), 			tmp1; \
 	movzbl ab ## bl,		tmp2 ## d; \
+	xorq (tmp1, tmp2, 8),		dst; \
+	leaq T1(%rip), 			tmp2; \
 	movzbl ab ## bh,		tmp1 ## d; \
-	rorq $16,			ab; \
-	xorq T0(, tmp2, 8),		dst; \
-	xorq T1(, tmp1, 8),		dst;
+	xorq (tmp2, tmp1, 8),		dst; \
+	rorq $16,			ab;
 
 /**********************************************************************
   1-way camellia
diff --git a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
index 0326a01503c3..438c404a03bc 100644
--- a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
@@ -83,16 +83,20 @@
 
 
 #define lookup_32bit(src, dst, op1, op2, op3, interleave_op, il_reg) \
-	movzbl		src ## bh,     RID1d;    \
-	movzbl		src ## bl,     RID2d;    \
-	shrq $16,	src;                     \
-	movl		s1(, RID1, 4), dst ## d; \
-	op1		s2(, RID2, 4), dst ## d; \
-	movzbl		src ## bh,     RID1d;    \
-	movzbl		src ## bl,     RID2d;    \
-	interleave_op(il_reg);			 \
-	op2		s3(, RID1, 4), dst ## d; \
-	op3		s4(, RID2, 4), dst ## d;
+	movzbl		src ## bh,       RID1d;    \
+	leaq		s1(%rip),        RID2;     \
+	movl		(RID2, RID1, 4), dst ## d; \
+	movzbl		src ## bl,       RID2d;    \
+	leaq		s2(%rip),        RID1;     \
+	op1		(RID1, RID2, 4), dst ## d; \
+	shrq $16,	src;                       \
+	movzbl		src ## bh,     RID1d;      \
+	leaq		s3(%rip),        RID2;     \
+	op2		(RID2, RID1, 4), dst ## d; \
+	movzbl		src ## bl,     RID2d;      \
+	leaq		s4(%rip),        RID1;     \
+	op3		(RID1, RID2, 4), dst ## d; \
+	interleave_op(il_reg);
 
 #define dummy(d) /* do nothing */
 
@@ -151,15 +155,15 @@
 	subround(l ## 3, r ## 3, l ## 4, r ## 4, f);
 
 #define enc_preload_rkr() \
-	vbroadcastss	.L16_mask,                RKR;      \
+	vbroadcastss	.L16_mask(%rip),          RKR;      \
 	/* add 16-bit rotation to key rotations (mod 32) */ \
 	vpxor		kr(CTX),                  RKR, RKR;
 
 #define dec_preload_rkr() \
-	vbroadcastss	.L16_mask,                RKR;      \
+	vbroadcastss	.L16_mask(%rip),          RKR;      \
 	/* add 16-bit rotation to key rotations (mod 32) */ \
 	vpxor		kr(CTX),                  RKR, RKR; \
-	vpshufb		.Lbswap128_mask,          RKR, RKR;
+	vpshufb		.Lbswap128_mask(%rip),    RKR, RKR;
 
 #define transpose_2x4(x0, x1, t0, t1) \
 	vpunpckldq		x1, x0, t0; \
@@ -235,9 +239,9 @@ SYM_FUNC_START_LOCAL(__cast5_enc_blk16)
 
 	movq %rdi, CTX;
 
-	vmovdqa .Lbswap_mask, RKM;
-	vmovd .Lfirst_mask, R1ST;
-	vmovd .L32_mask, R32;
+	vmovdqa .Lbswap_mask(%rip), RKM;
+	vmovd .Lfirst_mask(%rip), R1ST;
+	vmovd .L32_mask(%rip), R32;
 	enc_preload_rkr();
 
 	inpack_blocks(RL1, RR1, RTMP, RX, RKM);
@@ -271,7 +275,7 @@ SYM_FUNC_START_LOCAL(__cast5_enc_blk16)
 	popq %rbx;
 	popq %r15;
 
-	vmovdqa .Lbswap_mask, RKM;
+	vmovdqa .Lbswap_mask(%rip), RKM;
 
 	outunpack_blocks(RR1, RL1, RTMP, RX, RKM);
 	outunpack_blocks(RR2, RL2, RTMP, RX, RKM);
@@ -308,9 +312,9 @@ SYM_FUNC_START_LOCAL(__cast5_dec_blk16)
 
 	movq %rdi, CTX;
 
-	vmovdqa .Lbswap_mask, RKM;
-	vmovd .Lfirst_mask, R1ST;
-	vmovd .L32_mask, R32;
+	vmovdqa .Lbswap_mask(%rip), RKM;
+	vmovd .Lfirst_mask(%rip), R1ST;
+	vmovd .L32_mask(%rip), R32;
 	dec_preload_rkr();
 
 	inpack_blocks(RL1, RR1, RTMP, RX, RKM);
@@ -341,7 +345,7 @@ SYM_FUNC_START_LOCAL(__cast5_dec_blk16)
 	round(RL, RR, 1, 2);
 	round(RR, RL, 0, 1);
 
-	vmovdqa .Lbswap_mask, RKM;
+	vmovdqa .Lbswap_mask(%rip), RKM;
 	popq %rbx;
 	popq %r15;
 
@@ -504,8 +508,8 @@ SYM_FUNC_START(cast5_ctr_16way)
 
 	vpcmpeqd RKR, RKR, RKR;
 	vpaddq RKR, RKR, RKR; /* low: -2, high: -2 */
-	vmovdqa .Lbswap_iv_mask, R1ST;
-	vmovdqa .Lbswap128_mask, RKM;
+	vmovdqa .Lbswap_iv_mask(%rip), R1ST;
+	vmovdqa .Lbswap128_mask(%rip), RKM;
 
 	/* load IV and byteswap */
 	vmovq (%rcx), RX;
diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
index 82b716fd5dba..180fb9c78de2 100644
--- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
@@ -83,16 +83,20 @@
 
 
 #define lookup_32bit(src, dst, op1, op2, op3, interleave_op, il_reg) \
-	movzbl		src ## bh,     RID1d;    \
-	movzbl		src ## bl,     RID2d;    \
-	shrq $16,	src;                     \
-	movl		s1(, RID1, 4), dst ## d; \
-	op1		s2(, RID2, 4), dst ## d; \
-	movzbl		src ## bh,     RID1d;    \
-	movzbl		src ## bl,     RID2d;    \
-	interleave_op(il_reg);			 \
-	op2		s3(, RID1, 4), dst ## d; \
-	op3		s4(, RID2, 4), dst ## d;
+	movzbl		src ## bh,       RID1d;    \
+	leaq		s1(%rip),        RID2;     \
+	movl		(RID2, RID1, 4), dst ## d; \
+	movzbl		src ## bl,       RID2d;    \
+	leaq		s2(%rip),        RID1;     \
+	op1		(RID1, RID2, 4), dst ## d; \
+	shrq $16,	src;                       \
+	movzbl		src ## bh,     RID1d;      \
+	leaq		s3(%rip),        RID2;     \
+	op2		(RID2, RID1, 4), dst ## d; \
+	movzbl		src ## bl,     RID2d;      \
+	leaq		s4(%rip),        RID1;     \
+	op3		(RID1, RID2, 4), dst ## d; \
+	interleave_op(il_reg);
 
 #define dummy(d) /* do nothing */
 
@@ -175,10 +179,10 @@
 	qop(RD, RC, 1);
 
 #define shuffle(mask) \
-	vpshufb		mask,            RKR, RKR;
+	vpshufb		mask(%rip),            RKR, RKR;
 
 #define preload_rkr(n, do_mask, mask) \
-	vbroadcastss	.L16_mask,                RKR;      \
+	vbroadcastss	.L16_mask(%rip),          RKR;      \
 	/* add 16-bit rotation to key rotations (mod 32) */ \
 	vpxor		(kr+n*16)(CTX),           RKR, RKR; \
 	do_mask(mask);
@@ -258,9 +262,9 @@ SYM_FUNC_START_LOCAL(__cast6_enc_blk8)
 
 	movq %rdi, CTX;
 
-	vmovdqa .Lbswap_mask, RKM;
-	vmovd .Lfirst_mask, R1ST;
-	vmovd .L32_mask, R32;
+	vmovdqa .Lbswap_mask(%rip), RKM;
+	vmovd .Lfirst_mask(%rip), R1ST;
+	vmovd .L32_mask(%rip), R32;
 
 	inpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
 	inpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -284,7 +288,7 @@ SYM_FUNC_START_LOCAL(__cast6_enc_blk8)
 	popq %rbx;
 	popq %r15;
 
-	vmovdqa .Lbswap_mask, RKM;
+	vmovdqa .Lbswap_mask(%rip), RKM;
 
 	outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
 	outunpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -306,9 +310,9 @@ SYM_FUNC_START_LOCAL(__cast6_dec_blk8)
 
 	movq %rdi, CTX;
 
-	vmovdqa .Lbswap_mask, RKM;
-	vmovd .Lfirst_mask, R1ST;
-	vmovd .L32_mask, R32;
+	vmovdqa .Lbswap_mask(%rip), RKM;
+	vmovd .Lfirst_mask(%rip), R1ST;
+	vmovd .L32_mask(%rip), R32;
 
 	inpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
 	inpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -332,7 +336,7 @@ SYM_FUNC_START_LOCAL(__cast6_dec_blk8)
 	popq %rbx;
 	popq %r15;
 
-	vmovdqa .Lbswap_mask, RKM;
+	vmovdqa .Lbswap_mask(%rip), RKM;
 	outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
 	outunpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
 
diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index ec35915f0901..5f843dce77f1 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -168,7 +168,8 @@ continue_block:
 	xor     crc2, crc2
 
 	## branch into array
-	mov	jump_table(,%rax,8), %bufp
+	leaq	jump_table(%rip), %bufp
+	mov	(%bufp,%rax,8), %bufp
 	JMP_NOSPEC bufp
 
 	################################################################
diff --git a/arch/x86/crypto/des3_ede-asm_64.S b/arch/x86/crypto/des3_ede-asm_64.S
index f4c760f4cade..cf21b998e77c 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -129,21 +129,29 @@
 	movzbl RW0bl, RT2d; \
 	movzbl RW0bh, RT3d; \
 	shrq $16, RW0; \
-	movq s8(, RT0, 8), RT0; \
-	xorq s6(, RT1, 8), to; \
+	leaq s8(%rip), RW1; \
+	movq (RW1, RT0, 8), RT0; \
+	leaq s6(%rip), RW1; \
+	xorq (RW1, RT1, 8), to; \
 	movzbl RW0bl, RL1d; \
 	movzbl RW0bh, RT1d; \
 	shrl $16, RW0d; \
-	xorq s4(, RT2, 8), RT0; \
-	xorq s2(, RT3, 8), to; \
+	leaq s4(%rip), RW1; \
+	xorq (RW1, RT2, 8), RT0; \
+	leaq s2(%rip), RW1; \
+	xorq (RW1, RT3, 8), to; \
 	movzbl RW0bl, RT2d; \
 	movzbl RW0bh, RT3d; \
-	xorq s7(, RL1, 8), RT0; \
-	xorq s5(, RT1, 8), to; \
-	xorq s3(, RT2, 8), RT0; \
+	leaq s7(%rip), RW1; \
+	xorq (RW1, RL1, 8), RT0; \
+	leaq s5(%rip), RW1; \
+	xorq (RW1, RT1, 8), to; \
+	leaq s3(%rip), RW1; \
+	xorq (RW1, RT2, 8), RT0; \
 	load_next_key(n, RW0); \
 	xorq RT0, to; \
-	xorq s1(, RT3, 8), to; \
+	leaq s1(%rip), RW1; \
+	xorq (RW1, RT3, 8), to; \
 
 #define load_next_key(n, RWx) \
 	movq (((n) + 1) * 8)(CTX), RWx;
@@ -355,65 +363,89 @@ SYM_FUNC_END(des3_ede_x86_64_crypt_blk)
 	movzbl RW0bl, RT3d; \
 	movzbl RW0bh, RT1d; \
 	shrq $16, RW0; \
-	xorq s8(, RT3, 8), to##0; \
-	xorq s6(, RT1, 8), to##0; \
+	leaq s8(%rip), RT2; \
+	xorq (RT2, RT3, 8), to##0; \
+	leaq s6(%rip), RT2; \
+	xorq (RT2, RT1, 8), to##0; \
 	movzbl RW0bl, RT3d; \
 	movzbl RW0bh, RT1d; \
 	shrq $16, RW0; \
-	xorq s4(, RT3, 8), to##0; \
-	xorq s2(, RT1, 8), to##0; \
+	leaq s4(%rip), RT2; \
+	xorq (RT2, RT3, 8), to##0; \
+	leaq s2(%rip), RT2; \
+	xorq (RT2, RT1, 8), to##0; \
 	movzbl RW0bl, RT3d; \
 	movzbl RW0bh, RT1d; \
 	shrl $16, RW0d; \
-	xorq s7(, RT3, 8), to##0; \
-	xorq s5(, RT1, 8), to##0; \
+	leaq s7(%rip), RT2; \
+	xorq (RT2, RT3, 8), to##0; \
+	leaq s5(%rip), RT2; \
+	xorq (RT2, RT1, 8), to##0; \
 	movzbl RW0bl, RT3d; \
 	movzbl RW0bh, RT1d; \
 	load_next_key(n, RW0); \
-	xorq s3(, RT3, 8), to##0; \
-	xorq s1(, RT1, 8), to##0; \
+	leaq s3(%rip), RT2; \
+	xorq (RT2, RT3, 8), to##0; \
+	leaq s1(%rip), RT2; \
+	xorq (RT2, RT1, 8), to##0; \
 		xorq from##1, RW1; \
 		movzbl RW1bl, RT3d; \
 		movzbl RW1bh, RT1d; \
 		shrq $16, RW1; \
-		xorq s8(, RT3, 8), to##1; \
-		xorq s6(, RT1, 8), to##1; \
+		leaq s8(%rip), RT2; \
+		xorq (RT2, RT3, 8), to##1; \
+		leaq s6(%rip), RT2; \
+		xorq (RT2, RT1, 8), to##1; \
 		movzbl RW1bl, RT3d; \
 		movzbl RW1bh, RT1d; \
 		shrq $16, RW1; \
-		xorq s4(, RT3, 8), to##1; \
-		xorq s2(, RT1, 8), to##1; \
+		leaq s4(%rip), RT2; \
+		xorq (RT2, RT3, 8), to##1; \
+		leaq s2(%rip), RT2; \
+		xorq (RT2, RT1, 8), to##1; \
 		movzbl RW1bl, RT3d; \
 		movzbl RW1bh, RT1d; \
 		shrl $16, RW1d; \
-		xorq s7(, RT3, 8), to##1; \
-		xorq s5(, RT1, 8), to##1; \
+		leaq s7(%rip), RT2; \
+		xorq (RT2, RT3, 8), to##1; \
+		leaq s5(%rip), RT2; \
+		xorq (RT2, RT1, 8), to##1; \
 		movzbl RW1bl, RT3d; \
 		movzbl RW1bh, RT1d; \
 		do_movq(RW0, RW1); \
-		xorq s3(, RT3, 8), to##1; \
-		xorq s1(, RT1, 8), to##1; \
+		leaq s3(%rip), RT2; \
+		xorq (RT2, RT3, 8), to##1; \
+		leaq s1(%rip), RT2; \
+		xorq (RT2, RT1, 8), to##1; \
 			xorq from##2, RW2; \
 			movzbl RW2bl, RT3d; \
 			movzbl RW2bh, RT1d; \
 			shrq $16, RW2; \
-			xorq s8(, RT3, 8), to##2; \
-			xorq s6(, RT1, 8), to##2; \
+			leaq s8(%rip), RT2; \
+			xorq (RT2, RT3, 8), to##2; \
+			leaq s6(%rip), RT2; \
+			xorq (RT2, RT1, 8), to##2; \
 			movzbl RW2bl, RT3d; \
 			movzbl RW2bh, RT1d; \
 			shrq $16, RW2; \
-			xorq s4(, RT3, 8), to##2; \
-			xorq s2(, RT1, 8), to##2; \
+			leaq s4(%rip), RT2; \
+			xorq (RT2, RT3, 8), to##2; \
+			leaq s2(%rip), RT2; \
+			xorq (RT2, RT1, 8), to##2; \
 			movzbl RW2bl, RT3d; \
 			movzbl RW2bh, RT1d; \
 			shrl $16, RW2d; \
-			xorq s7(, RT3, 8), to##2; \
-			xorq s5(, RT1, 8), to##2; \
+			leaq s7(%rip), RT2; \
+			xorq (RT2, RT3, 8), to##2; \
+			leaq s5(%rip), RT2; \
+			xorq (RT2, RT1, 8), to##2; \
 			movzbl RW2bl, RT3d; \
 			movzbl RW2bh, RT1d; \
 			do_movq(RW0, RW2); \
-			xorq s3(, RT3, 8), to##2; \
-			xorq s1(, RT1, 8), to##2;
+			leaq s3(%rip), RT2; \
+			xorq (RT2, RT3, 8), to##2; \
+			leaq s1(%rip), RT2; \
+			xorq (RT2, RT1, 8), to##2;
 
 #define __movq(src, dst) \
 	movq src, dst;
diff --git a/arch/x86/crypto/ghash-clmulni-intel_asm.S b/arch/x86/crypto/ghash-clmulni-intel_asm.S
index 257ed9446f3e..99cb983ded9e 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_asm.S
+++ b/arch/x86/crypto/ghash-clmulni-intel_asm.S
@@ -93,7 +93,7 @@ SYM_FUNC_START(clmul_ghash_mul)
 	FRAME_BEGIN
 	movups (%rdi), DATA
 	movups (%rsi), SHASH
-	movaps .Lbswap_mask, BSWAP
+	movaps .Lbswap_mask(%rip), BSWAP
 	pshufb BSWAP, DATA
 	call __clmul_gf128mul_ble
 	pshufb BSWAP, DATA
@@ -110,7 +110,7 @@ SYM_FUNC_START(clmul_ghash_update)
 	FRAME_BEGIN
 	cmp $16, %rdx
 	jb .Lupdate_just_ret	# check length
-	movaps .Lbswap_mask, BSWAP
+	movaps .Lbswap_mask(%rip), BSWAP
 	movups (%rdi), DATA
 	movups (%rcx), SHASH
 	pshufb BSWAP, DATA
diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S
index 3eada9416852..10a3396bad35 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -589,19 +589,23 @@ last_block_enter:
 
 .align 16
 loop1:
-	vpaddd	K256+0*32(SRND), X0, XFER
+	leaq	K256(%rip), INP
+	vpaddd	0*32(INP, SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 0*32
 
-	vpaddd	K256+1*32(SRND), X0, XFER
+	leaq	K256(%rip), INP
+	vpaddd	1*32(INP, SRND), X0, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 1*32
 
-	vpaddd	K256+2*32(SRND), X0, XFER
+	leaq	K256(%rip), INP
+	vpaddd	2*32(INP, SRND), X0, XFER
 	vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 2*32
 
-	vpaddd	K256+3*32(SRND), X0, XFER
+	leaq	K256(%rip), INP
+	vpaddd	3*32(INP, SRND), X0, XFER
 	vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 3*32
 
@@ -611,11 +615,13 @@ loop1:
 
 loop2:
 	## Do last 16 rounds with no scheduling
-	vpaddd	K256+0*32(SRND), X0, XFER
+	leaq	K256(%rip), INP
+	vpaddd	0*32(INP, SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 0*32
 
-	vpaddd	K256+1*32(SRND), X1, XFER
+	leaq	K256(%rip), INP
+	vpaddd	1*32(INP, SRND), X1, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 1*32
 	add	$2*32, SRND
-- 
2.31.1



* [PATCH RFC 02/43] x86: Add macro to get symbol address for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 01/43] x86/crypto: Adapt assembly for PIE support Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 03/43] x86: relocate_kernel - Adapt assembly " Hou Wenlong
                   ` (41 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

From: Thomas Garnier <thgarnie@chromium.org>

Add a new _ASM_MOVABS macro to fetch a symbol address. Replace the
"_ASM_MOV $<symbol>, %dst" code constructs that are not compatible with
PIE.
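
A hypothetical example of the difference (the symbol is illustrative):

	# _ASM_MOV expands to movq on 64-bit; with a symbol immediate it
	# needs a 32-bit sign-extended absolute address (R_X86_64_32S),
	# which is not PIE compatible:
	_ASM_MOV	$some_symbol, %rax
	# _ASM_MOVABS expands to movabsq, which takes a full 64-bit
	# immediate (R_X86_64_64) that can be relocated at boot:
	_ASM_MOVABS	$some_symbol, %rax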

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/asm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..05974cc060c6 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -35,6 +35,7 @@
 #define _ASM_ALIGN	__ASM_SEL(.balign 4, .balign 8)
 
 #define _ASM_MOV	__ASM_SIZE(mov)
+#define _ASM_MOVABS	__ASM_SEL(movl, movabsq)
 #define _ASM_INC	__ASM_SIZE(inc)
 #define _ASM_DEC	__ASM_SIZE(dec)
 #define _ASM_ADD	__ASM_SIZE(add)
-- 
2.31.1



* [PATCH RFC 03/43] x86: relocate_kernel - Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 01/43] x86/crypto: Adapt assembly for PIE support Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 02/43] x86: Add macro to get symbol address " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 04/43] x86/entry/64: " Hou Wenlong
                   ` (40 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra (Intel),
	Alexandre Chartre, Josh Poimboeuf, Konrad Rzeszutek Wilk

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only absolute references to symbols so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/relocate_kernel_64.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..05d916e9df47 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -223,7 +223,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%rax, %cr3
 	lea	PAGE_SIZE(%r8), %rsp
 	call	swap_pages
-	movq	$virtual_mapped, %rax
+	movabsq	$virtual_mapped, %rax
 	pushq	%rax
 	ANNOTATE_UNRET_SAFE
 	ret
-- 
2.31.1



* [PATCH RFC 04/43] x86/entry/64: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (2 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 03/43] x86: relocate_kernel - Adapt assembly " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 05/43] x86: pm-trace: " Hou Wenlong
                   ` (39 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only relative references to symbols so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/entry/entry_64.S | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 21dca946955e..6f2297ebb15f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1089,7 +1089,8 @@ SYM_CODE_START(error_entry)
 	movl	%ecx, %eax			/* zero extend */
 	cmpq	%rax, RIP+8(%rsp)
 	je	.Lbstep_iret
-	cmpq	$.Lgs_change, RIP+8(%rsp)
+	leaq	.Lgs_change(%rip), %rcx
+	cmpq	%rcx, RIP+8(%rsp)
 	jne	.Lerror_entry_done_lfence
 
 	/*
@@ -1302,10 +1303,10 @@ SYM_CODE_START(asm_exc_nmi)
 	 * resume the outer NMI.
 	 */
 
-	movq	$repeat_nmi, %rdx
+	leaq	repeat_nmi(%rip), %rdx
 	cmpq	8(%rsp), %rdx
 	ja	1f
-	movq	$end_repeat_nmi, %rdx
+	leaq	end_repeat_nmi(%rip), %rdx
 	cmpq	8(%rsp), %rdx
 	ja	nested_nmi_out
 1:
@@ -1359,7 +1360,8 @@ nested_nmi:
 	pushq	%rdx
 	pushfq
 	pushq	$__KERNEL_CS
-	pushq	$repeat_nmi
+	leaq	repeat_nmi(%rip), %rdx
+	pushq	%rdx
 
 	/* Put stack back */
 	addq	$(6*8), %rsp
@@ -1398,7 +1400,11 @@ first_nmi:
 	addq	$8, (%rsp)	/* Fix up RSP */
 	pushfq			/* RFLAGS */
 	pushq	$__KERNEL_CS	/* CS */
-	pushq	$1f		/* RIP */
+	pushq	$0		/* Space for RIP */
+	pushq	%rdx		/* Save RDX */
+	leaq	1f(%rip), %rdx	/* Put the address of 1f label into RDX */
+	movq	%rdx, 8(%rsp)   /* Store it in RIP field */
+	popq	%rdx		/* Restore RDX */
 	iretq			/* continues at repeat_nmi below */
 	UNWIND_HINT_IRET_REGS
 1:
-- 
2.31.1



* [PATCH RFC 05/43] x86: pm-trace: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (3 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 04/43] x86/entry/64: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 06/43] x86/CPU: " Hou Wenlong
                   ` (38 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly to use the new _ASM_MOVABS macro instead of
_ASM_MOV so that the assembly is PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pm-trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pm-trace.h b/arch/x86/include/asm/pm-trace.h
index bfa32aa428e5..972070806ce9 100644
--- a/arch/x86/include/asm/pm-trace.h
+++ b/arch/x86/include/asm/pm-trace.h
@@ -8,7 +8,7 @@
 do {								\
 	if (pm_trace_enabled) {					\
 		const void *tracedata;				\
-		asm volatile(_ASM_MOV " $1f,%0\n"		\
+		asm volatile(_ASM_MOVABS " $1f,%0\n"		\
 			     ".section .tracedata,\"a\"\n"	\
 			     "1:\t.word %c1\n\t"		\
 			     _ASM_PTR " %c2\n"			\
-- 
2.31.1



* [PATCH RFC 06/43] x86/CPU: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (4 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 05/43] x86: pm-trace: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 07/43] x86/acpi: " Hou Wenlong
                   ` (37 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only relative references to symbols so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/sync_core.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..fa5b1fe1a692 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -31,10 +31,12 @@ static inline void iret_to_self(void)
 		"pushfq\n\t"
 		"mov %%cs, %0\n\t"
 		"pushq %q0\n\t"
-		"pushq $1f\n\t"
+		"leaq 1f(%%rip), %q0\n\t"
+		"pushq %q0\n\t"
 		"iretq\n\t"
 		"1:"
-		: "=&r" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+		: "=&r" (tmp), ASM_CALL_CONSTRAINT
+		: : "cc", "memory");
 }
 #endif /* CONFIG_X86_32 */
 
-- 
2.31.1



* [PATCH RFC 07/43] x86/acpi: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (5 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 06/43] x86/CPU: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28 11:32   ` Rafael J. Wysocki
  2023-04-28  9:50 ` [PATCH RFC 08/43] x86/boot/64: " Hou Wenlong
                   ` (36 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-pm, linux-acpi

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only relative references to symbols so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/acpi/wakeup_64.S | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index d5d8a352eafa..fe688bd87d72 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -17,7 +17,7 @@
 	 * Hooray, we are in Long 64-bit mode (but still running in low memory)
 	 */
 SYM_FUNC_START(wakeup_long64)
-	movq	saved_magic, %rax
+	movq	saved_magic(%rip), %rax
 	movq	$0x123456789abcdef0, %rdx
 	cmpq	%rdx, %rax
 	je	2f
@@ -33,14 +33,14 @@ SYM_FUNC_START(wakeup_long64)
 	movw	%ax, %es
 	movw	%ax, %fs
 	movw	%ax, %gs
-	movq	saved_rsp, %rsp
+	movq	saved_rsp(%rip), %rsp
 
-	movq	saved_rbx, %rbx
-	movq	saved_rdi, %rdi
-	movq	saved_rsi, %rsi
-	movq	saved_rbp, %rbp
+	movq	saved_rbx(%rip), %rbx
+	movq	saved_rdi(%rip), %rdi
+	movq	saved_rsi(%rip), %rsi
+	movq	saved_rbp(%rip), %rbp
 
-	movq	saved_rip, %rax
+	movq	saved_rip(%rip), %rax
 	ANNOTATE_RETPOLINE_SAFE
 	jmp	*%rax
 SYM_FUNC_END(wakeup_long64)
@@ -51,7 +51,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
 	xorl	%eax, %eax
 	call	save_processor_state
 
-	movq	$saved_context, %rax
+	leaq	saved_context(%rip), %rax
 	movq	%rsp, pt_regs_sp(%rax)
 	movq	%rbp, pt_regs_bp(%rax)
 	movq	%rsi, pt_regs_si(%rax)
@@ -70,13 +70,14 @@ SYM_FUNC_START(do_suspend_lowlevel)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	movq	$.Lresume_point, saved_rip(%rip)
+	leaq	.Lresume_point(%rip), %rax
+	movq	%rax, saved_rip(%rip)
 
-	movq	%rsp, saved_rsp
-	movq	%rbp, saved_rbp
-	movq	%rbx, saved_rbx
-	movq	%rdi, saved_rdi
-	movq	%rsi, saved_rsi
+	movq	%rsp, saved_rsp(%rip)
+	movq	%rbp, saved_rbp(%rip)
+	movq	%rbx, saved_rbx(%rip)
+	movq	%rdi, saved_rdi(%rip)
+	movq	%rsi, saved_rsi(%rip)
 
 	addq	$8, %rsp
 	movl	$3, %edi
@@ -88,7 +89,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
 	.align 4
 .Lresume_point:
 	/* We don't restore %rax, it must be 0 anyway */
-	movq	$saved_context, %rax
+	leaq	saved_context(%rip), %rax
 	movq	saved_context_cr4(%rax), %rbx
 	movq	%rbx, %cr4
 	movq	saved_context_cr3(%rax), %rbx
-- 
2.31.1



* [PATCH RFC 08/43] x86/boot/64: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (6 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 07/43] x86/acpi: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 09/43] x86/power/64: " Hou Wenlong
                   ` (35 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, David Woodhouse, Peter Zijlstra, Brian Gerst,
	Josh Poimboeuf

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use absolute references for transitions
between address spaces and relative references when referencing global
variables within the same address space. Ensure that a kernel built as
PIE references the correct addresses based on context.

[Hou Wenlong: Adapt new assembly code and remove change for
initial_code(%rip)]

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/head_64.S | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index a5df3e994f04..21f0556d3ac0 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -114,7 +114,8 @@ SYM_CODE_START_NOALIGN(startup_64)
 	popq	%rsi
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
-	addq	$(early_top_pgt - __START_KERNEL_map), %rax
+	movabs  $(early_top_pgt - __START_KERNEL_map), %rcx
+	addq    %rcx, %rax
 	jmp 1f
 SYM_CODE_END(startup_64)
 
@@ -156,13 +157,14 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 * added to the initial pgdir entry that will be programmed into CR3.
 	 */
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-	movq	sme_me_mask, %rax
+	movq	sme_me_mask(%rip), %rax
 #else
 	xorq	%rax, %rax
 #endif
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
-	addq	$(init_top_pgt - __START_KERNEL_map), %rax
+	movabs	$(init_top_pgt - __START_KERNEL_map), %rcx
+	addq    %rcx, %rax
 1:
 
 #ifdef CONFIG_X86_MCE
@@ -226,7 +228,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	movq	%rax, %cr4
 
 	/* Ensure I am executing from virtual addresses */
-	movq	$1f, %rax
+	movabs  $1f, %rax
 	ANNOTATE_RETPOLINE_SAFE
 	jmp	*%rax
 1:
@@ -237,7 +239,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	movl	smpboot_control(%rip), %ecx
 
 	/* Get the per cpu offset for the given CPU# which is in ECX */
-	movq	__per_cpu_offset(,%rcx,8), %rdx
+	leaq	__per_cpu_offset(%rip), %rdx
+	movq	(%rdx,%rcx,8), %rdx
 #else
 	xorl	%edx, %edx /* zero-extended to clear all of RDX */
 #endif /* CONFIG_SMP */
@@ -248,7 +251,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 *
 	 * RDX contains the per-cpu offset
 	 */
-	movq	pcpu_hot + X86_current_task(%rdx), %rax
+	leaq	(pcpu_hot + X86_current_task)(%rip), %rax
+	movq	(%rdx,%rax,1), %rax
 	movq	TASK_threadsp(%rax), %rsp
 
 	/*
@@ -259,7 +263,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 */
 	subq	$16, %rsp
 	movw	$(GDT_SIZE-1), (%rsp)
-	leaq	gdt_page(%rdx), %rax
+	leaq	gdt_page(%rip), %rax
+	addq	%rdx, %rax
 	movq	%rax, 2(%rsp)
 	lgdt	(%rsp)
 	addq	$16, %rsp
@@ -362,7 +367,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 *	REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
 	 *		address given in m16:64.
 	 */
-	pushq	$.Lafter_lret	# put return address on stack for unwinder
+	movabs  $.Lafter_lret, %rax
+	pushq	%rax		# put return address on stack for unwinder
 	xorl	%ebp, %ebp	# clear frame pointer
 	movq	initial_code(%rip), %rax
 	pushq	$__KERNEL_CS	# set correct cs
-- 
2.31.1



* [PATCH RFC 09/43] x86/power/64: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (7 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 08/43] x86/boot/64: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 10/43] x86/alternatives: " Hou Wenlong
                   ` (34 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Rafael J. Wysocki, Pavel Machek, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, linux-pm

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly code to use only relative references to symbols for the
kernel to be PIE compatible.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/power/hibernate_asm_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/hibernate_asm_64.S b/arch/x86/power/hibernate_asm_64.S
index 0a0539e1cc81..1d96a119d29d 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -39,7 +39,7 @@ SYM_FUNC_START(restore_registers)
 	movq	%rax, %cr4;  # turn PGE back on
 
 	/* We don't restore %rax, it must be 0 anyway */
-	movq	$saved_context, %rax
+	leaq	saved_context(%rip), %rax
 	movq	pt_regs_sp(%rax), %rsp
 	movq	pt_regs_bp(%rax), %rbp
 	movq	pt_regs_si(%rax), %rsi
@@ -70,7 +70,7 @@ SYM_FUNC_START(restore_registers)
 SYM_FUNC_END(restore_registers)
 
 SYM_FUNC_START(swsusp_arch_suspend)
-	movq	$saved_context, %rax
+	leaq	saved_context(%rip), %rax
 	movq	%rsp, pt_regs_sp(%rax)
 	movq	%rbp, pt_regs_bp(%rax)
 	movq	%rsi, pt_regs_si(%rax)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 10/43] x86/alternatives: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (8 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 09/43] x86/power/64: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 11/43] x86/irq: " Hou Wenlong
                   ` (33 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Willy Tarreau

From: Thomas Garnier <thgarnie@chromium.org>

Change the assembly constraints to work with pointers instead of integers.
The generated code is the same; for PIE, this just ensures the input is a
pointer.
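
As a minimal sketch (not part of the patch; my_func is a made-up symbol):
an "i" constraint requires an integer constant, which is problematic for
symbol addresses under -fPIE, while "X" places no restriction on the
operand, so the symbol can still be referenced directly:

	extern void my_func(void);

	static inline void call_example(void)
	{
		/* same pattern as the alternative_call() change below */
		asm volatile("call %P[fn]" : : [fn] "X" (my_func));
	}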

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/alternative.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index d7da28fada87..cbf7c93087c8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -307,7 +307,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, ft_flags, output, input...)	\
 	asm_inline volatile (ALTERNATIVE("call %P[old]", "call %P[new]", ft_flags) \
-		: output : [old] "i" (oldfunc), [new] "i" (newfunc), ## input)
+		: output : [old] "X" (oldfunc), [new] "X" (newfunc), ## input)
 
 /*
  * Like alternative_call, but there are two features and respective functions.
@@ -320,8 +320,8 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	asm_inline volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", ft_flags1,\
 		"call %P[new2]", ft_flags2)				      \
 		: output, ASM_CALL_CONSTRAINT				      \
-		: [old] "i" (oldfunc), [new1] "i" (newfunc1),		      \
-		  [new2] "i" (newfunc2), ## input)
+		: [old] "X" (oldfunc), [new1] "X" (newfunc1),		      \
+		  [new2] "X" (newfunc2), ## input)
 
 /*
  * use this macro(s) if you need more than one output parameter
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 11/43] x86/irq: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (9 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 10/43] x86/alternatives: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 12/43] x86,rethook: " Hou Wenlong
                   ` (32 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra (Intel),
	Sebastian Andrzej Siewior, Arnd Bergmann

Change the assembly constraints to work with pointers instead of integers.
The generated code is the same; for PIE, this just ensures the input is a
pointer.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/irq_stack.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 798183867d78..caba5d1d0800 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -93,7 +93,7 @@
 	"popq	%%rsp					\n"		\
 									\
 	: "+r" (tos), ASM_CALL_CONSTRAINT				\
-	: [__func] "i" (func), [tos] "r" (tos) argconstr		\
+	: [__func] "X" (func), [tos] "r" (tos) argconstr		\
 	: "cc", "rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10",	\
 	  "memory"							\
 	);								\
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 12/43] x86,rethook: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (10 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 11/43] x86/irq: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction Hou Wenlong
                   ` (31 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/rethook.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
index 8a1c0111ae79..ff3733b765e0 100644
--- a/arch/x86/kernel/rethook.c
+++ b/arch/x86/kernel/rethook.c
@@ -27,7 +27,15 @@ asm(
 #ifdef CONFIG_X86_64
 	ANNOTATE_NOENDBR	/* This is only jumped from ret instruction */
 	/* Push a fake return address to tell the unwinder it's a rethook. */
+#ifdef CONFIG_X86_PIE
+	"	pushq $0\n"
+	"	pushq %rdi\n"
+	"	leaq arch_rethook_trampoline(%rip), %rdi\n"
+	"	movq %rdi, 8(%rsp)\n"
+	"	popq %rdi\n"
+#else
 	"	pushq $arch_rethook_trampoline\n"
+#endif
 	UNWIND_HINT_FUNC
 	"       pushq $" __stringify(__KERNEL_DS) "\n"
 	/* Save the 'sp - 16', this will be fixed later. */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (11 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 12/43] x86,rethook: " Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-06-01  9:29     ` Juergen Gross via Virtualization
  2023-04-28  9:50 ` [PATCH RFC 14/43] x86/Kconfig: Introduce new Kconfig for PIE kernel building Hou Wenlong
                   ` (30 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Juergen Gross, Srivatsa S. Bhat (VMware),
	Alexey Makhalov, VMware PV-Drivers Reviewers, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Song Liu, Nadav Amit, Arnd Bergmann,
	virtualization

Similar to alternative patching, use a relative reference for the original
instruction rather than an absolute one, which saves 8 bytes per entry
on x86_64. It also generates an R_X86_64_PC32 relocation instead of an
R_X86_64_64 relocation, which reduces relocation metadata on relocatable
builds. The alignment can now be hard-coded to 4.
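
A minimal sketch of the decoding (the helper name is made up; the
expression matches the one used in apply_paravirt() below):

	struct paravirt_patch_site {
		s32 instr_offset;	/* self-relative offset to original insn */
		u8 type;		/* type of this instruction */
		u8 len;			/* length of original instruction */
	};

	/* recover the absolute address of the original instructions */
	static inline u8 *pv_site_instr(struct paravirt_patch_site *p)
	{
		return (u8 *)&p->instr_offset + p->instr_offset;
	}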

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/paravirt.h       | 10 +++++-----
 arch/x86/include/asm/paravirt_types.h |  8 ++++----
 arch/x86/kernel/alternative.c         |  8 +++++---
 arch/x86/kernel/callthunks.c          |  2 +-
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b49778664d2b..2350ceb43db0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -742,16 +742,16 @@ extern void default_banner(void);
 
 #else  /* __ASSEMBLY__ */
 
-#define _PVSITE(ptype, ops, word, algn)		\
+#define _PVSITE(ptype, ops)			\
 771:;						\
 	ops;					\
 772:;						\
 	.pushsection .parainstructions,"a";	\
-	 .align	algn;				\
-	 word 771b;				\
+	 .align	4;				\
+	 .long 771b-.;				\
 	 .byte ptype;				\
 	 .byte 772b-771b;			\
-	 _ASM_ALIGN;				\
+	 .align 4;				\
 	.popsection
 
 
@@ -759,7 +759,7 @@ extern void default_banner(void);
 #ifdef CONFIG_PARAVIRT_XXL
 
 #define PARA_PATCH(off)		((off) / 8)
-#define PARA_SITE(ptype, ops)	_PVSITE(ptype, ops, .quad, 8)
+#define PARA_SITE(ptype, ops)	_PVSITE(ptype, ops)
 #define PARA_INDIRECT(addr)	*addr(%rip)
 
 #ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4acbcddddc29..982a234f5a06 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -5,7 +5,7 @@
 #ifndef __ASSEMBLY__
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
-	u8 *instr;		/* original instructions */
+	s32 instr_offset;	/* original instructions */
 	u8 type;		/* type of this instruction */
 	u8 len;			/* length of original instruction */
 };
@@ -270,11 +270,11 @@ extern struct paravirt_patch_template pv_ops;
 #define _paravirt_alt(insn_string, type)		\
 	"771:\n\t" insn_string "\n" "772:\n"		\
 	".pushsection .parainstructions,\"a\"\n"	\
-	_ASM_ALIGN "\n"					\
-	_ASM_PTR " 771b\n"				\
+	"  .align 4\n"					\
+	"  .long 771b-.\n"				\
 	"  .byte " type "\n"				\
 	"  .byte 772b-771b\n"				\
-	_ASM_ALIGN "\n"					\
+	"  .align 4\n"					\
 	".popsection\n"
 
 /* Generate patchable code, with the default asm parameters. */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d93..25c59da6c53b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1230,20 +1230,22 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 {
 	struct paravirt_patch_site *p;
 	char insn_buff[MAX_PATCH_LEN];
+	u8 *instr;
 
 	for (p = start; p < end; p++) {
 		unsigned int used;
 
+		instr = (u8 *)&p->instr_offset + p->instr_offset;
 		BUG_ON(p->len > MAX_PATCH_LEN);
 		/* prep the buffer with the original instructions */
-		memcpy(insn_buff, p->instr, p->len);
-		used = paravirt_patch(p->type, insn_buff, (unsigned long)p->instr, p->len);
+		memcpy(insn_buff, instr, p->len);
+		used = paravirt_patch(p->type, insn_buff, (unsigned long)instr, p->len);
 
 		BUG_ON(used > p->len);
 
 		/* Pad the rest with nops */
 		add_nops(insn_buff + used, p->len - used);
-		text_poke_early(p->instr, insn_buff, p->len);
+		text_poke_early(instr, insn_buff, p->len);
 	}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
index ffea98f9064b..f15405acfd42 100644
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -245,7 +245,7 @@ patch_paravirt_call_sites(struct paravirt_patch_site *start,
 	struct paravirt_patch_site *p;
 
 	for (p = start; p < end; p++)
-		patch_call(p->instr, ct);
+		patch_call((void *)&p->instr_offset + p->instr_offset, ct);
 }
 
 static __init_or_module void
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 14/43] x86/Kconfig: Introduce new Kconfig for PIE kernel building
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (12 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 15/43] x86/PVH: Use fixed_percpu_data to set up GS base Hou Wenlong
                   ` (29 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

Add a new Kconfig option to control PIE kernel building; it is disabled
for now.

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c94297369448..68e5da464b96 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2208,6 +2208,10 @@ config RELOCATABLE
 	  it has been loaded at and the compile time physical address
 	  (CONFIG_PHYSICAL_START) is used as the minimum location.
 
+config X86_PIE
+	def_bool n
+	depends on X86_64
+
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image (KASLR)"
 	depends on RELOCATABLE
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 15/43] x86/PVH: Use fixed_percpu_data to set up GS base
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (13 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 14/43] x86/Kconfig: Introduce new Kconfig for PIE kernel building Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler Hou Wenlong
                   ` (28 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Juergen Gross, Boris Ostrovsky, Darren Hart, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, xen-devel, platform-driver-x86

startup_64() and startup_xen() both use fixed_percpu_data to set up the
GS base. For consistency, use it in the PVH entry path as well.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/platform/pvh/head.S | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index c4365a05ab83..b093996b7e19 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -96,7 +96,7 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 1:
 	/* Set base address in stack canary descriptor. */
 	mov $MSR_GS_BASE,%ecx
-	mov $_pa(canary), %eax
+	mov $_pa(INIT_PER_CPU_VAR(fixed_percpu_data)), %eax
 	xor %edx, %edx
 	wrmsr
 
@@ -156,8 +156,6 @@ SYM_DATA_START_LOCAL(gdt_start)
 SYM_DATA_END_LABEL(gdt_start, SYM_L_LOCAL, gdt_end)
 
 	.balign 16
-SYM_DATA_LOCAL(canary, .fill 48, 1, 0)
-
 SYM_DATA_START_LOCAL(early_stack)
 	.fill BOOT_STACK_SIZE, 1, 0
 SYM_DATA_END_LABEL(early_stack, SYM_L_LOCAL, early_stack_end)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (14 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 15/43] x86/PVH: Use fixed_percpu_data to set up GS base Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-05-01 17:27   ` Nick Desaulniers
  2023-05-04 10:31   ` Juergen Gross
  2023-04-28  9:50 ` [PATCH RFC 17/43] x86/pie: Enable stack protector only if per-cpu stack canary is supported Hou Wenlong
                   ` (27 subsequent siblings)
  43 siblings, 2 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Brian Gerst, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski, Juergen Gross,
	Boris Ostrovsky, Darren Hart, Andy Shevchenko, Nathan Chancellor,
	Nick Desaulniers, Tom Rix, Peter Zijlstra, Mike Rapoport (IBM),
	Ashok Raj, Rick Edgecombe, Catalin Marinas, Guo Ren,
	Greg Kroah-Hartman, Jason A. Donenfeld, Pawan Gupta,
	Kim Phillips, David Woodhouse, Josh Poimboeuf, xen-devel,
	platform-driver-x86, llvm

From: Brian Gerst <brgerst@gmail.com>

If the compiler supports it, use a standard per-cpu variable for the
stack protector instead of the old fixed location.  Keep the fixed
location code for compatibility with older compilers.

[Hou Wenlong: Disable it on Clang, adapt the new code changes and add the
missing GS setup path in pvh_start_xen()]
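
A minimal sketch of the non-fixed scheme wired up below, assuming the
flags from the Makefile hunk: the canary becomes an ordinary per-cpu
variable and the compiler is told where to find it, so instrumented
functions compare against %gs:__stack_chk_guard instead of the hardcoded
%gs:40 slot in fixed_percpu_data:

	/*
	 * Built with:
	 *   -mstack-protector-guard-reg=gs
	 *   -mstack-protector-guard-symbol=__stack_chk_guard
	 */
	DEFINE_PER_CPU(unsigned long, __stack_chk_guard);
	EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);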

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                      | 12 ++++++++++++
 arch/x86/Makefile                     | 21 ++++++++++++++-------
 arch/x86/entry/entry_64.S             |  6 +++++-
 arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
 arch/x86/include/asm/stackprotector.h | 16 +++++++---------
 arch/x86/kernel/asm-offsets_64.c      |  2 +-
 arch/x86/kernel/cpu/common.c          | 15 +++++++--------
 arch/x86/kernel/head_64.S             | 16 ++++++++++------
 arch/x86/kernel/vmlinux.lds.S         |  4 +++-
 arch/x86/platform/pvh/head.S          |  8 ++++++++
 arch/x86/xen/xen-head.S               | 14 +++++++++-----
 11 files changed, 88 insertions(+), 43 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68e5da464b96..55cce8cdf9bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -410,6 +410,18 @@ config CC_HAS_SANE_STACKPROTECTOR
 	  the compiler produces broken code or if it does not let us control
 	  the segment on 32-bit kernels.
 
+config CC_HAS_CUSTOMIZED_STACKPROTECTOR
+	bool
+	# Although clang supports -mstack-protector-guard-reg option, it
+	# would generate GOT reference for __stack_chk_guard even with
+	# -fno-PIE flag.
+	default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
+
+config STACKPROTECTOR_FIXED
+	bool
+	depends on X86_64 && STACKPROTECTOR
+	default !CC_HAS_CUSTOMIZED_STACKPROTECTOR
+
 menu "Processor type and features"
 
 config SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b39975977c03..57e4dbbf501d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -111,13 +111,7 @@ ifeq ($(CONFIG_X86_32),y)
         # temporary until string.h is fixed
         KBUILD_CFLAGS += -ffreestanding
 
-	ifeq ($(CONFIG_STACKPROTECTOR),y)
-		ifeq ($(CONFIG_SMP),y)
-			KBUILD_CFLAGS += -mstack-protector-guard-reg=fs -mstack-protector-guard-symbol=__stack_chk_guard
-		else
-			KBUILD_CFLAGS += -mstack-protector-guard=global
-		endif
-	endif
+	percpu_seg := fs
 else
         BITS := 64
         UTS_MACHINE := x86_64
@@ -167,6 +161,19 @@ else
         KBUILD_CFLAGS += -mcmodel=kernel
         KBUILD_RUSTFLAGS += -Cno-redzone=y
         KBUILD_RUSTFLAGS += -Ccode-model=kernel
+
+	percpu_seg := gs
+endif
+
+ifeq ($(CONFIG_STACKPROTECTOR),y)
+	ifneq ($(CONFIG_STACKPROTECTOR_FIXED),y)
+		ifeq ($(CONFIG_SMP),y)
+			KBUILD_CFLAGS += -mstack-protector-guard-reg=$(percpu_seg) \
+					 -mstack-protector-guard-symbol=__stack_chk_guard
+		else
+			KBUILD_CFLAGS += -mstack-protector-guard=global
+		endif
+	endif
 endif
 
 #
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6f2297ebb15f..df79b7aa65bb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,10 @@ SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
 	int3
 SYM_CODE_END(entry_SYSCALL_64)
 
+#ifdef CONFIG_STACKPROTECTOR_FIXED
+#define __stack_chk_guard fixed_percpu_data + FIXED_stack_canary
+#endif
+
 /*
  * %rdi: prev task
  * %rsi: next task
@@ -252,7 +256,7 @@ SYM_FUNC_START(__switch_to_asm)
 
 #ifdef CONFIG_STACKPROTECTOR
 	movq	TASK_stack_canary(%rsi), %rbx
-	movq	%rbx, PER_CPU_VAR(fixed_percpu_data) + FIXED_stack_canary
+	movq	%rbx, PER_CPU_VAR(__stack_chk_guard)
 #endif
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2a5ec5750ba7..3890f609569d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -379,6 +379,8 @@ struct irq_stack {
 } __aligned(IRQ_STACK_SIZE);
 
 #ifdef CONFIG_X86_64
+
+#ifdef CONFIG_STACKPROTECTOR_FIXED
 struct fixed_percpu_data {
 	/*
 	 * GCC hardcodes the stack canary as %gs:40.  Since the
@@ -394,21 +396,26 @@ struct fixed_percpu_data {
 
 DECLARE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __visible;
 DECLARE_INIT_PER_CPU(fixed_percpu_data);
+#endif /* CONFIG_STACKPROTECTOR_FIXED */
 
 static inline unsigned long cpu_kernelmode_gs_base(int cpu)
 {
+#ifdef CONFIG_STACKPROTECTOR_FIXED
 	return (unsigned long)per_cpu(fixed_percpu_data.gs_base, cpu);
+#else
+#ifdef CONFIG_SMP
+	return per_cpu_offset(cpu);
+#else
+	return 0;
+#endif
+#endif
 }
 
 extern asmlinkage void ignore_sysret(void);
 
 /* Save actual FS/GS selectors and bases to current->thread */
 void current_save_fsgs(void);
-#else	/* X86_64 */
-#ifdef CONFIG_STACKPROTECTOR
-DECLARE_PER_CPU(unsigned long, __stack_chk_guard);
-#endif
-#endif	/* !X86_64 */
+#endif	/* X86_64 */
 
 struct perf_event;
 
diff --git a/arch/x86/include/asm/stackprotector.h b/arch/x86/include/asm/stackprotector.h
index 00473a650f51..24aa0e2ad0dd 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -36,6 +36,12 @@
 
 #include <linux/sched.h>
 
+#ifdef CONFIG_STACKPROTECTOR_FIXED
+#define __stack_chk_guard fixed_percpu_data.stack_canary
+#else
+DECLARE_PER_CPU(unsigned long, __stack_chk_guard);
+#endif
+
 /*
  * Initialize the stackprotector canary value.
  *
@@ -51,25 +57,17 @@ static __always_inline void boot_init_stack_canary(void)
 {
 	unsigned long canary = get_random_canary();
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_STACKPROTECTOR_FIXED
 	BUILD_BUG_ON(offsetof(struct fixed_percpu_data, stack_canary) != 40);
 #endif
 
 	current->stack_canary = canary;
-#ifdef CONFIG_X86_64
-	this_cpu_write(fixed_percpu_data.stack_canary, canary);
-#else
 	this_cpu_write(__stack_chk_guard, canary);
-#endif
 }
 
 static inline void cpu_init_stack_canary(int cpu, struct task_struct *idle)
 {
-#ifdef CONFIG_X86_64
-	per_cpu(fixed_percpu_data.stack_canary, cpu) = idle->stack_canary;
-#else
 	per_cpu(__stack_chk_guard, cpu) = idle->stack_canary;
-#endif
 }
 
 #else	/* STACKPROTECTOR */
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..f39baf90126c 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -56,7 +56,7 @@ int main(void)
 
 	BLANK();
 
-#ifdef CONFIG_STACKPROTECTOR
+#ifdef CONFIG_STACKPROTECTOR_FIXED
 	OFFSET(FIXED_stack_canary, fixed_percpu_data, stack_canary);
 	BLANK();
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3ea06b0b4570..972b1babf731 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2051,10 +2051,6 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
 EXPORT_PER_CPU_SYMBOL(pcpu_hot);
 
 #ifdef CONFIG_X86_64
-DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
-		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
-EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
-
 static void wrmsrl_cstar(unsigned long val)
 {
 	/*
@@ -2102,15 +2098,18 @@ void syscall_init(void)
 	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
 	       X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
-
-#else	/* CONFIG_X86_64 */
+#endif	/* CONFIG_X86_64 */
 
 #ifdef CONFIG_STACKPROTECTOR
+#ifdef CONFIG_STACKPROTECTOR_FIXED
+DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
+		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
+EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
+#else
 DEFINE_PER_CPU(unsigned long, __stack_chk_guard);
 EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);
 #endif
-
-#endif	/* CONFIG_X86_64 */
+#endif
 
 /*
  * Clear all 6 debug registers:
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 21f0556d3ac0..61f1873d0ff7 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -68,7 +68,13 @@ SYM_CODE_START_NOALIGN(startup_64)
 
 	/* Setup GSBASE to allow stack canary access for C code */
 	movl	$MSR_GS_BASE, %ecx
+#if defined(CONFIG_STACKPROTECTOR_FIXED)
 	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
+#elif defined(CONFIG_SMP)
+	movabs	$__per_cpu_load, %rdx
+#else
+	xorl	%edx, %edx
+#endif
 	movl	%edx, %eax
 	shrq	$32,  %rdx
 	wrmsr
@@ -283,16 +289,14 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	movl %eax,%fs
 	movl %eax,%gs
 
-	/* Set up %gs.
-	 *
-	 * The base of %gs always points to fixed_percpu_data. If the
-	 * stack protector canary is enabled, it is located at %gs:40.
+	/*
+	 * Set up GS base.
 	 * Note that, on SMP, the boot cpu uses init data section until
 	 * the per cpu areas are set up.
 	 */
 	movl	$MSR_GS_BASE,%ecx
-#ifndef CONFIG_SMP
-	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
+#if !defined(CONFIG_SMP) && defined(CONFIG_STACKPROTECTOR_FIXED)
+	leaq	__per_cpu_load(%rip), %rdx
 #endif
 	movl	%edx, %eax
 	shrq	$32, %rdx
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 25f155205770..f02dcde9f8a8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -500,12 +500,14 @@ SECTIONS
  */
 #define INIT_PER_CPU(x) init_per_cpu__##x = ABSOLUTE(x) + __per_cpu_load
 INIT_PER_CPU(gdt_page);
-INIT_PER_CPU(fixed_percpu_data);
 INIT_PER_CPU(irq_stack_backing_store);
 
+#ifdef CONFIG_STACKPROTECTOR_FIXED
+INIT_PER_CPU(fixed_percpu_data);
 #ifdef CONFIG_SMP
 . = ASSERT((fixed_percpu_data == 0),
            "fixed_percpu_data is not at start of per-cpu area");
 #endif
+#endif
 
 #endif /* CONFIG_X86_64 */
diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index b093996b7e19..5842fe0e4f96 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -96,8 +96,16 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 1:
 	/* Set base address in stack canary descriptor. */
 	mov $MSR_GS_BASE,%ecx
+#if defined(CONFIG_STACKPROTECTOR_FIXED)
 	mov $_pa(INIT_PER_CPU_VAR(fixed_percpu_data)), %eax
 	xor %edx, %edx
+#elif defined(CONFIG_SMP)
+	mov $__per_cpu_load, %rax
+	cdq
+#else
+	xor %eax, %eax
+	xor %edx, %edx
+#endif
 	wrmsr
 
 	call xen_prepare_pvh
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 643d02900fbb..09eaf59e8066 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -51,15 +51,19 @@ SYM_CODE_START(startup_xen)
 
 	leaq	(__end_init_task - PTREGS_SIZE)(%rip), %rsp
 
-	/* Set up %gs.
-	 *
-	 * The base of %gs always points to fixed_percpu_data.  If the
-	 * stack protector canary is enabled, it is located at %gs:40.
+	/*
+	 * Set up GS base.
 	 * Note that, on SMP, the boot cpu uses init data section until
 	 * the per cpu areas are set up.
 	 */
 	movl	$MSR_GS_BASE,%ecx
-	movq	$INIT_PER_CPU_VAR(fixed_percpu_data),%rax
+#if defined(CONFIG_STACKPROTECTOR_FIXED)
+	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
+#elif defined(CONFIG_SMP)
+	movabs	$__per_cpu_load, %rdx
+#else
+	xorl	%eax, %eax
+#endif
 	cdq
 	wrmsr
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 17/43] x86/pie: Enable stack protector only if per-cpu stack canary is supported
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (15 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 18/43] x86/percpu: Use PC-relative addressing for percpu variable references Hou Wenlong
                   ` (26 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

Since the -fPIE option is not compatible with the -mcmodel=kernel option,
a PIE kernel has to drop -mcmodel=kernel. However, GCC uses %fs as the
segment register for the stack protector when -mcmodel=kernel is dropped.
So enable the stack protector for the PIE kernel only if the per-cpu
stack canary is supported.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55cce8cdf9bd..b26941ef50ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -403,6 +403,7 @@ config PGTABLE_LEVELS
 
 config CC_HAS_SANE_STACKPROTECTOR
 	bool
+	default CC_HAS_CUSTOMIZED_STACKPROTECTOR if X86_PIE
 	default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC) $(CLANG_FLAGS)) if 64BIT
 	default $(success,$(srctree)/scripts/gcc-x86_32-has-stack-protector.sh $(CC) $(CLANG_FLAGS))
 	help
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 18/43] x86/percpu: Use PC-relative addressing for percpu variable references
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (16 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 17/43] x86/pie: Enable stack protector only if per-cpu stack canary is supported Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:50 ` [PATCH RFC 19/43] x86/tools: Explicitly include autoconf.h for hostprogs Hou Wenlong
                   ` (25 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Pawan Gupta, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Juergen Gross,
	Boris Ostrovsky, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	David Woodhouse, Brian Gerst, linux-mm, kvm, xen-devel, llvm

In a PIE binary, all symbol references use PC-relative addressing, even
references to percpu variables. So, to stay compatible with PIE, add the
%rip suffix in the percpu assembly macros when PIE is enabled. However,
relocation of percpu variable references is broken for PIE at this point;
it is fixed in a later patch.
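
A rough sketch of the resulting addressing, with the macro bodies taken
from the percpu.h hunk below: under CONFIG_X86_PIE, PER_CPU_VAR(var)
expands to a segment-based but RIP-relative operand, e.g.
PER_CPU_VAR(this_cpu_off) becomes %gs:(this_cpu_off)(%rip), while without
PIE it remains a plain segment reference:

	#ifdef CONFIG_X86_PIE
	#define __percpu_rel		(%rip)
	#else
	#define __percpu_rel
	#endif

	#ifdef CONFIG_SMP
	#define PER_CPU_VAR(var)	%__percpu_seg:(var)##__percpu_rel
	#else
	#define PER_CPU_VAR(var)	(var)##__percpu_rel
	#endif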

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/entry/calling.h             | 17 ++++++++++++----
 arch/x86/include/asm/nospec-branch.h | 10 +++++-----
 arch/x86/include/asm/percpu.h        | 29 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S            |  2 +-
 arch/x86/kernel/kvm.c                | 21 ++++++++++++++++----
 arch/x86/lib/cmpxchg16b_emu.S        |  8 ++++----
 arch/x86/xen/xen-asm.S               | 10 +++++-----
 7 files changed, 71 insertions(+), 26 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..11328578741d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -173,7 +173,7 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 #define THIS_CPU_user_pcid_flush_mask   \
-	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
+	PER_CPU_VAR(cpu_tlbstate + TLB_STATE_user_pcid_flush_mask)
 
 .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
@@ -370,8 +370,8 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
+	GET_PERCPU_BASE \scratch_reg \save_reg
 	rdgsbase \save_reg
-	GET_PERCPU_BASE \scratch_reg
 	wrgsbase \scratch_reg
 .endm
 
@@ -407,15 +407,24 @@ For 32-bit we have the following conventions - kernel is built with
  * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
  * while running KVM's run loop.
  */
-.macro GET_PERCPU_BASE reg:req
+#ifdef CONFIG_X86_PIE
+.macro GET_PERCPU_BASE reg:req scratch_reg:req
+	LOAD_CPU_AND_NODE_SEG_LIMIT \reg
+	andq	$VDSO_CPUNODE_MASK, \reg
+	leaq	__per_cpu_offset(%rip), \scratch_reg
+	movq	(\scratch_reg, \reg, 8), \reg
+.endm
+#else
+.macro GET_PERCPU_BASE reg:req scratch_reg:req
 	LOAD_CPU_AND_NODE_SEG_LIMIT \reg
 	andq	$VDSO_CPUNODE_MASK, \reg
 	movq	__per_cpu_offset(, \reg, 8), \reg
 .endm
+#endif /* CONFIG_X86_PIE */
 
 #else
 
-.macro GET_PERCPU_BASE reg:req
+.macro GET_PERCPU_BASE reg:req scratch_reg:req
 	movq	pcpu_unit_offsets(%rip), \reg
 .endm
 
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index edb2b0cb8efe..d8fd935e0697 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -59,13 +59,13 @@
 
 #ifdef CONFIG_CALL_THUNKS_DEBUG
 # define CALL_THUNKS_DEBUG_INC_CALLS				\
-	incq	%gs:__x86_call_count;
+	incq	%gs:(__x86_call_count)__percpu_rel;
 # define CALL_THUNKS_DEBUG_INC_RETS				\
-	incq	%gs:__x86_ret_count;
+	incq	%gs:(__x86_ret_count)__percpu_rel;
 # define CALL_THUNKS_DEBUG_INC_STUFFS				\
-	incq	%gs:__x86_stuffs_count;
+	incq	%gs:(__x86_stuffs_count)__percpu_rel;
 # define CALL_THUNKS_DEBUG_INC_CTXSW				\
-	incq	%gs:__x86_ctxsw_count;
+	incq	%gs:(__x86_ctxsw_count)__percpu_rel;
 #else
 # define CALL_THUNKS_DEBUG_INC_CALLS
 # define CALL_THUNKS_DEBUG_INC_RETS
@@ -95,7 +95,7 @@
 	CALL_THUNKS_DEBUG_INC_CALLS
 
 #define INCREMENT_CALL_DEPTH					\
-	sarq	$5, %gs:pcpu_hot + X86_call_depth;		\
+	sarq	$5, %gs:(pcpu_hot + X86_call_depth)__percpu_rel;\
 	CALL_THUNKS_DEBUG_INC_CALLS
 
 #define ASM_INCREMENT_CALL_DEPTH				\
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13c0d63ed55e..a627a073c6ea 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -4,16 +4,26 @@
 
 #ifdef CONFIG_X86_64
 #define __percpu_seg		gs
+#ifdef CONFIG_X86_PIE
+#define __percpu_rel		(%rip)
+#else
+#define __percpu_rel
+#endif /* CONFIG_X86_PIE */
 #else
 #define __percpu_seg		fs
+#define __percpu_rel
 #endif
 
 #ifdef __ASSEMBLY__
 
 #ifdef CONFIG_SMP
-#define PER_CPU_VAR(var)	%__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var)	%__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var)	%__percpu_seg:var
 #else /* ! SMP */
-#define PER_CPU_VAR(var)	var
+#define PER_CPU_VAR(var)	(var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var)	var
 #endif	/* SMP */
 
 #ifdef CONFIG_X86_64_SMP
@@ -148,10 +158,23 @@ do {									\
 	(typeof(_var))(unsigned long) pfo_val__;			\
 })
 
+/*
+ * Position Independent code uses relative addresses only.
+ * The 'P' modifier prevents RIP-relative addressing in GCC,
+ * so use 'a' modifier instead. Howerver, 'P' modifier allows
+ * RIP-relative addressing in Clang but Clang doesn't support
+ * 'a' modifier.
+ */
+#if defined(CONFIG_X86_PIE) && defined(CONFIG_CC_IS_GCC)
+#define __percpu_stable_arg	__percpu_arg(a[var])
+#else
+#define __percpu_stable_arg	__percpu_arg(P[var])
+#endif
+
 #define percpu_stable_op(size, op, _var)				\
 ({									\
 	__pcpu_type_##size pfo_val__;					\
-	asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]")	\
+	asm(__pcpu_op2_##size(op, __percpu_stable_arg, "%[val]")	\
 	    : [val] __pcpu_reg_##size("=", pfo_val__)			\
 	    : [var] "p" (&(_var)));					\
 	(typeof(_var))(unsigned long) pfo_val__;			\
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 61f1873d0ff7..1eed50b7d1ac 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -396,7 +396,7 @@ SYM_CODE_START(start_cpu0)
 	UNWIND_HINT_END_OF_STACK
 
 	/* Find the idle task stack */
-	movq	PER_CPU_VAR(pcpu_hot) + X86_current_task, %rcx
+	movq	PER_CPU_VAR(pcpu_hot + X86_current_task), %rcx
 	movq	TASK_threadsp(%rcx), %rsp
 
 	jmp	.Ljump_to_C_code
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984da..32d7b201f4f0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -794,14 +794,27 @@ PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
 extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
 
+#ifndef CONFIG_X86_PIE
+#define KVM_CHECK_VCPU_PREEMPTED			\
+	"movq	__per_cpu_offset(,%rdi,8), %rax;"	\
+	"cmpb	$0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+#else
+#define KVM_CHECK_VCPU_PREEMPTED			\
+	"pushq	%rdi;"					\
+	"leaq	__per_cpu_offset(%rip), %rax;"		\
+	"movq	(%rax,%rdi,8), %rax;"			\
+	"leaq	steal_time(%rip), %rdi;"		\
+	"cmpb	$0, (%rax, %rdi, 1);"			\
+	"popq	%rdi;"
+#endif
+
 /*
  * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
  * restoring to/from the stack.
  */
-#define PV_VCPU_PREEMPTED_ASM						     \
- "movq   __per_cpu_offset(,%rdi,8), %rax\n\t"				     \
- "cmpb   $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax)\n\t" \
- "setne  %al\n\t"
+#define PV_VCPU_PREEMPTED_ASM		\
+	KVM_CHECK_VCPU_PREEMPTED	\
+	"setne  %al\n\t"
 
 DEFINE_PARAVIRT_ASM(__raw_callee_save___kvm_vcpu_is_preempted,
 		    PV_VCPU_PREEMPTED_ASM, .text);
diff --git a/arch/x86/lib/cmpxchg16b_emu.S b/arch/x86/lib/cmpxchg16b_emu.S
index 33c70c0160ea..891c5e9fd868 100644
--- a/arch/x86/lib/cmpxchg16b_emu.S
+++ b/arch/x86/lib/cmpxchg16b_emu.S
@@ -27,13 +27,13 @@ SYM_FUNC_START(this_cpu_cmpxchg16b_emu)
 	pushfq
 	cli
 
-	cmpq PER_CPU_VAR((%rsi)), %rax
+	cmpq PER_CPU_VAR_ABS((%rsi)), %rax
 	jne .Lnot_same
-	cmpq PER_CPU_VAR(8(%rsi)), %rdx
+	cmpq PER_CPU_VAR_ABS(8(%rsi)), %rdx
 	jne .Lnot_same
 
-	movq %rbx, PER_CPU_VAR((%rsi))
-	movq %rcx, PER_CPU_VAR(8(%rsi))
+	movq %rbx, PER_CPU_VAR_ABS((%rsi))
+	movq %rcx, PER_CPU_VAR_ABS(8(%rsi))
 
 	popfq
 	mov $1, %al
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 9e5e68008785..448958ddbaf8 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -28,7 +28,7 @@
  * non-zero.
  */
 SYM_FUNC_START(xen_irq_disable_direct)
-	movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $1, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	RET
 SYM_FUNC_END(xen_irq_disable_direct)
 
@@ -69,7 +69,7 @@ SYM_FUNC_END(check_events)
 SYM_FUNC_START(xen_irq_enable_direct)
 	FRAME_BEGIN
 	/* Unmask events */
-	movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $0, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 
 	/*
 	 * Preempt here doesn't matter because that will deal with any
@@ -78,7 +78,7 @@ SYM_FUNC_START(xen_irq_enable_direct)
 	 */
 
 	/* Test for pending */
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
 	jz 1f
 
 	call check_events
@@ -97,7 +97,7 @@ SYM_FUNC_END(xen_irq_enable_direct)
  * x86 use opposite senses (mask vs enable).
  */
 SYM_FUNC_START(xen_save_fl_direct)
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	setz %ah
 	addb %ah, %ah
 	RET
@@ -113,7 +113,7 @@ SYM_FUNC_END(xen_read_cr2);
 
 SYM_FUNC_START(xen_read_cr2_direct)
 	FRAME_BEGIN
-	_ASM_MOV PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_arch_cr2, %_ASM_AX
+	_ASM_MOV PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_arch_cr2), %_ASM_AX
 	FRAME_END
 	RET
 SYM_FUNC_END(xen_read_cr2_direct);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 19/43] x86/tools: Explicitly include autoconf.h for hostprogs
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (17 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 18/43] x86/percpu: Use PC-relative addressing for percpu variable references Hou Wenlong
@ 2023-04-28  9:50 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 20/43] x86/percpu: Adapt percpu references relocation for PIE support Hou Wenlong
                   ` (24 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Nicolas Schier, Masahiro Yamada

The relocs tool needs access to the CONFIG_* symbols found in
include/generated/autoconf.h; however, the header file is not included,
so the #if CONFIG_FW_LOADER code in arch/x86/tools/relocs.c is never
compiled.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/tools/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
index 90e820ac9771..8af4aeeb72af 100644
--- a/arch/x86/tools/Makefile
+++ b/arch/x86/tools/Makefile
@@ -38,7 +38,9 @@ $(obj)/insn_decoder_test.o: $(srctree)/tools/arch/x86/lib/insn.c $(srctree)/tool
 
 $(obj)/insn_sanity.o: $(srctree)/tools/arch/x86/lib/insn.c $(srctree)/tools/arch/x86/lib/inat.c $(srctree)/tools/arch/x86/include/asm/inat_types.h $(srctree)/tools/arch/x86/include/asm/inat.h $(srctree)/tools/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
 
-HOST_EXTRACFLAGS += -I$(srctree)/tools/include
+HOST_EXTRACFLAGS += -I$(srctree)/tools/include \
+		    -include include/generated/autoconf.h
+
 hostprogs	+= relocs
 relocs-objs     := relocs_32.o relocs_64.o relocs_common.o
 PHONY += relocs
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 20/43] x86/percpu: Adapt percpu references relocation for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (18 preceding siblings ...)
  2023-04-28  9:50 ` [PATCH RFC 19/43] x86/tools: Explicitly include autoconf.h for hostprogs Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 21/43] x86/ftrace: Adapt assembly " Hou Wenlong
                   ` (23 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, David Woodhouse, Peter Zijlstra, Brian Gerst,
	Josh Poimboeuf, Sami Tolvanen

The original design of percpu reference relocation only handles relative
references and ignores absolute references, because percpu variables are
already addressed relative to the segment base, and the .percpu ELF
section has a virtual address of zero, so absolute references can be kept
as-is when KASLR is enabled. The few relative references need to be
relocated by a negative offset.

However, this is not compatible with PIE, because almost all percpu
references become RIP-relative, and RIP-relative addressing only covers
-2G ~ +2G. In order to move the kernel address below the top 2G, percpu
relative references are not relocated; instead, the percpu base is
adjusted. Absolute references are relocated like references to normal
variables. After that, percpu references in the .altinstr_replacement
section no longer work correctly, because no fixups are applied for
percpu references in apply_alternatives(); such references can be caught
by objtool. Currently, only call depth tracking uses them, so disable it
when X86_PIE is enabled.
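
In rough C terms, the boot-time adjustment added to head_64.S in this
patch does something like the following (the function name is made up;
__per_cpu_offset[0] is the boot CPU's entry):

	extern char __per_cpu_start[], __per_cpu_load[];
	extern unsigned long __per_cpu_offset[];

	static void __init pie_adjust_boot_percpu_base(void)
	{
		/*
		 * Unrelocated RIP-relative percpu references still point at
		 * the percpu section's link addresses, so fold the distance
		 * to the actual load address into the percpu base instead of
		 * relocating every reference.
		 */
		__per_cpu_offset[0] = (unsigned long)__per_cpu_load -
				      (unsigned long)__per_cpu_start;
	}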

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig          |  2 +-
 arch/x86/kernel/head_64.S | 10 ++++++++++
 arch/x86/tools/relocs.c   | 17 ++++++++++++++---
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b26941ef50ee..715f0734d065 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2635,7 +2635,7 @@ config CPU_UNRET_ENTRY
 
 config CALL_DEPTH_TRACKING
 	bool "Mitigate RSB underflow with call depth tracking"
-	depends on CPU_SUP_INTEL && HAVE_CALL_THUNKS
+	depends on CPU_SUP_INTEL && HAVE_CALL_THUNKS && !X86_PIE
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
 	select CALL_THUNKS
 	default y
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1eed50b7d1ac..94c5defec8cc 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,6 +72,11 @@ SYM_CODE_START_NOALIGN(startup_64)
 	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
 #elif defined(CONFIG_SMP)
 	movabs	$__per_cpu_load, %rdx
+#ifdef CONFIG_X86_PIE
+	movabs	$__per_cpu_start, %rax
+	subq	%rax, %rdx
+	movq	%rdx, __per_cpu_offset(%rip)
+#endif
 #else
 	xorl	%edx, %edx
 #endif
@@ -79,6 +84,11 @@ SYM_CODE_START_NOALIGN(startup_64)
 	shrq	$32,  %rdx
 	wrmsr
 
+#if defined(CONFIG_X86_PIE) && defined(CONFIG_SMP)
+	movq	__per_cpu_offset(%rip), %rdx
+	movq	%rdx, PER_CPU_VAR(this_cpu_off)
+#endif
+
 	pushq	%rsi
 	call	startup_64_setup_env
 	popq	%rsi
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 2925074b9a58..038e9c12fad3 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -848,6 +848,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 
 	case R_X86_64_PC32:
 	case R_X86_64_PLT32:
+#ifndef CONFIG_X86_PIE
 		/*
 		 * PC relative relocations don't need to be adjusted unless
 		 * referencing a percpu symbol.
@@ -856,6 +857,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 		 */
 		if (is_percpu_sym(sym, symname))
 			add_reloc(&relocs32neg, offset);
+#endif
 		break;
 
 	case R_X86_64_PC64:
@@ -871,10 +873,18 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 	case R_X86_64_32S:
 	case R_X86_64_64:
 		/*
-		 * References to the percpu area don't need to be adjusted.
+		 * References to the percpu area don't need to be adjusted when
+		 * CONFIG_X86_PIE is not enabled.
 		 */
-		if (is_percpu_sym(sym, symname))
+		if (is_percpu_sym(sym, symname)) {
+#if CONFIG_X86_PIE
+			if (r_type != R_X86_64_64)
+				die("Invalid absolute reference against per-CPU symbol %s\n",
+				    symname);
+			add_reloc(&relocs64, offset);
+#endif
 			break;
+		}
 
 		if (shn_abs) {
 			/*
@@ -1044,7 +1054,8 @@ static int cmp_relocs(const void *va, const void *vb)
 
 static void sort_relocs(struct relocs *r)
 {
-	qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
+	if (r->count)
+		qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
 }
 
 static int write32(uint32_t v, FILE *f)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (19 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 20/43] x86/percpu: Adapt percpu references relocation for PIE support Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 13:37   ` Steven Rostedt
  2023-04-28  9:51 ` [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching " Hou Wenlong
                   ` (22 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Steven Rostedt, Masami Hiramatsu, Mark Rutland, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-trace-kernel

Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/ftrace_64.S | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index eddb4fabc16f..411fa4148e18 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
 SYM_FUNC_START(__fentry__)
 	CALL_DEPTH_ACCOUNT
 
+#ifdef CONFIG_X86_PIE
+	pushq %r8
+	leaq ftrace_stub(%rip), %r8
+	cmpq %r8, ftrace_trace_function(%rip)
+	popq %r8
+#else
 	cmpq $ftrace_stub, ftrace_trace_function
+#endif
 	jnz trace
 	RET
 
@@ -329,7 +336,7 @@ trace:
 	 * ip and parent ip are used and the list function is called when
 	 * function tracing is enabled.
 	 */
-	movq ftrace_trace_function, %r8
+	movq ftrace_trace_function(%rip), %r8
 	CALL_NOSPEC r8
 	restore_mcount_regs
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (20 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 21/43] x86/ftrace: Adapt assembly " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 13:44   ` Steven Rostedt
  2023-04-28  9:51 ` [PATCH RFC 23/43] x86/pie: Force hidden visibility for all symbol references Hou Wenlong
                   ` (21 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Steven Rostedt, Masami Hiramatsu, Mark Rutland, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Huacai Chen, Qing Zhang, linux-trace-kernel

From: Thomas Garnier <thgarnie@chromium.org>

When using PIE with function tracing, the compiler generates a call
through the GOT (call *__fentry__@GOTPCREL). This instruction takes 6
bytes instead of the 5 bytes of a relative call, and the -mnop-mcount
option is not implemented for -fPIE yet.

If PIE is enabled, replace the 6th byte of the GOT call with a 1-byte nop
so ftrace can handle the first 5 bytes as before.

[Hou Wenlong: Adapt code change and fix wrong offset calculation in
make_nop_x86()]
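
For reference, a sketch of the byte patterns involved (values taken from
the hunks below; the array names here are made up): a PIE build emits a
6-byte call through the GOT, which gets rewritten to the usual 5-byte nop
followed by a 1-byte nop so the existing 5-byte patching logic keeps
working:

	/* call *__fentry__@GOTPCREL(%rip): ff 15 <32-bit offset> */
	static const unsigned char got_call_insn[2] = { 0xff, 0x15 };

	/* replacement: the usual 5-byte nop plus a trailing 1-byte nop */
	static const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
	static const unsigned char nop1 = 0x90;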

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/ftrace.c | 46 ++++++++++++++++++++++-
 scripts/recordmcount.c   | 81 ++++++++++++++++++++++++++--------------
 2 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..b795f9dde561 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -124,6 +124,50 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
 	return 0;
 }
 
+/* Bytes before call GOT offset */
+static const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
+
+static int __ref
+ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
+			   unsigned const char *new_code)
+{
+	unsigned char replaced[MCOUNT_INSN_SIZE + 1];
+
+	/*
+	 * If PIE is not enabled default to the original approach to code
+	 * modification.
+	 */
+	if (!IS_ENABLED(CONFIG_X86_PIE))
+		return ftrace_modify_code_direct(ip, old_code, new_code);
+
+	ftrace_expected = old_code;
+
+	/* Ensure the instructions point to a call to the GOT */
+	if (copy_from_kernel_nofault(replaced, (void *)ip, sizeof(replaced))) {
+		WARN_ONCE(1, "invalid function");
+		return -EFAULT;
+	}
+
+	if (memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn))) {
+		WARN_ONCE(1, "invalid function call");
+		return -EINVAL;
+	}
+
+	/*
+	 * Build a nop slide with a 5-byte nop and 1-byte nop to keep the ftrace
+	 * hooking algorithm working with the expected 5 bytes instruction.
+	 */
+	memset(replaced, x86_nops[1][0], sizeof(replaced));
+	memcpy(replaced, new_code, MCOUNT_INSN_SIZE);
+
+	/* replace the text with the new text */
+	if (ftrace_poke_late)
+		text_poke_queue((void *)ip, replaced, MCOUNT_INSN_SIZE + 1, NULL);
+	else
+		text_poke_early((void *)ip, replaced, MCOUNT_INSN_SIZE + 1);
+	return 0;
+}
+
 int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
 {
 	unsigned long ip = rec->ip;
@@ -141,7 +185,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long ad
 	 * just modify the code directly.
 	 */
 	if (addr == MCOUNT_ADDR)
-		return ftrace_modify_code_direct(ip, old, new);
+		return ftrace_modify_initial_code(ip, old, new);
 
 	/*
 	 * x86 overrides ftrace_replace_code -- this function will never be used
diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
index e30216525325..02783a29d428 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -218,36 +218,10 @@ static void *mmap_file(char const *fname)
 	return file_map;
 }
 
-
-static unsigned char ideal_nop5_x86_64[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
-static unsigned char ideal_nop5_x86_32[5] = { 0x3e, 0x8d, 0x74, 0x26, 0x00 };
-static unsigned char *ideal_nop;
-
 static char rel_type_nop;
-
 static int (*make_nop)(void *map, size_t const offset);
 
-static int make_nop_x86(void *map, size_t const offset)
-{
-	uint32_t *ptr;
-	unsigned char *op;
-
-	/* Confirm we have 0xe8 0x0 0x0 0x0 0x0 */
-	ptr = map + offset;
-	if (*ptr != 0)
-		return -1;
-
-	op = map + offset - 1;
-	if (*op != 0xe8)
-		return -1;
-
-	/* convert to nop */
-	if (ulseek(offset - 1, SEEK_SET) < 0)
-		return -1;
-	if (uwrite(ideal_nop, 5) < 0)
-		return -1;
-	return 0;
-}
+static unsigned char *ideal_nop;
 
 static unsigned char ideal_nop4_arm_le[4] = { 0x00, 0x00, 0xa0, 0xe1 }; /* mov r0, r0 */
 static unsigned char ideal_nop4_arm_be[4] = { 0xe1, 0xa0, 0x00, 0x00 }; /* mov r0, r0 */
@@ -504,6 +478,50 @@ static void MIPS64_r_info(Elf64_Rel *const rp, unsigned sym, unsigned type)
 	}).r_info;
 }
 
+static unsigned char ideal_nop5_x86_64[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char ideal_nop6_x86_64[6] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char ideal_nop5_x86_32[5] = { 0x3e, 0x8d, 0x74, 0x26, 0x00 };
+static size_t ideal_nop_x86_size;
+
+static unsigned char stub_default_x86[2] = { 0xe8, 0x00 };   /* call relative */
+static unsigned char stub_got_x86[3] = { 0xff, 0x15, 0x00 }; /* call .got */
+static unsigned char *stub_x86;
+static size_t stub_x86_size;
+
+static int make_nop_x86(void *map, size_t const offset)
+{
+	uint32_t *ptr;
+	size_t stub_offset = offset + 1 - stub_x86_size;
+
+	/* Confirm we have the expected stub */
+	ptr = map + stub_offset;
+	if (memcmp(ptr, stub_x86, stub_x86_size))
+		return -1;
+
+	/* convert to nop */
+	if (ulseek(stub_offset, SEEK_SET) < 0)
+		return -1;
+	if (uwrite(ideal_nop, ideal_nop_x86_size) < 0)
+		return -1;
+	return 0;
+}
+
+/* Switch the stub and nop to the GOT-call variants if the binary is built with PIE */
+static int is_fake_mcount_x86_x64(Elf64_Rel const *rp)
+{
+	if (ELF64_R_TYPE(rp->r_info) == R_X86_64_GOTPCREL) {
+		ideal_nop = ideal_nop6_x86_64;
+		ideal_nop_x86_size = sizeof(ideal_nop6_x86_64);
+		stub_x86 = stub_got_x86;
+		stub_x86_size = sizeof(stub_got_x86);
+		mcount_adjust_64 = 1 - stub_x86_size;
+	}
+
+	/* Once the relocation has been checked, roll back to the default */
+	is_fake_mcount64 = fn_is_fake_mcount64;
+	return is_fake_mcount64(rp);
+}
+
 static int do_file(char const *const fname)
 {
 	unsigned int reltype = 0;
@@ -568,6 +586,9 @@ static int do_file(char const *const fname)
 		rel_type_nop = R_386_NONE;
 		make_nop = make_nop_x86;
 		ideal_nop = ideal_nop5_x86_32;
+		ideal_nop_x86_size = sizeof(ideal_nop5_x86_32);
+		stub_x86 = stub_default_x86;
+		stub_x86_size = sizeof(stub_default_x86);
 		mcount_adjust_32 = -1;
 		gpfx = 0;
 		break;
@@ -597,9 +618,13 @@ static int do_file(char const *const fname)
 	case EM_X86_64:
 		make_nop = make_nop_x86;
 		ideal_nop = ideal_nop5_x86_64;
+		ideal_nop_x86_size = sizeof(ideal_nop5_x86_64);
+		stub_x86 = stub_default_x86;
+		stub_x86_size = sizeof(stub_default_x86);
 		reltype = R_X86_64_64;
 		rel_type_nop = R_X86_64_NONE;
-		mcount_adjust_64 = -1;
+		is_fake_mcount64 = is_fake_mcount_x86_x64;
+		mcount_adjust_64 = 1 - stub_x86_size;
 		gpfx = 0;
 		break;
 	}  /* end switch */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 23/43] x86/pie: Force hidden visibility for all symbol references
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (21 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 24/43] x86/boot/compressed: Adapt sed command to generate voffset.h when PIE is enabled Hou Wenlong
                   ` (20 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski

Eliminate all GOT entries in the kernel by forcing hidden visibility
for all symbol references, which tells the compiler that such references
will be resolved at link time without the need to allocate GOT entries.
However, a few GOT entries remain after this: one for the indirect call
to __fentry__(), and the others caused by global weak symbol
references.
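
For illustration only (not from the patch), the effect on code
generation is roughly the following; include/linux/hidden.h applies
hidden visibility globally via a pragma, the attribute below just makes
the difference explicit:

  /* default visibility under -fPIE: the access goes through the GOT */
  extern int foo;   /* movq foo@GOTPCREL(%rip),%rax; movl (%rax),%eax */

  /* hidden visibility: resolved at link time, direct RIP-relative access */
  __attribute__((visibility("hidden")))
  extern int bar;   /* movl bar(%rip),%eax */

  int sum(void)
  {
  	return foo + bar;
  }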

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Makefile            | 7 +++++++
 arch/x86/entry/vdso/Makefile | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57e4dbbf501d..81500011396d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -158,6 +158,11 @@ else
         KBUILD_RUSTFLAGS += $(rustflags-y)
 
         KBUILD_CFLAGS += -mno-red-zone
+
+ifdef CONFIG_X86_PIE
+        PIE_CFLAGS := -include $(srctree)/include/linux/hidden.h
+        KBUILD_CFLAGS += $(PIE_CFLAGS)
+endif
         KBUILD_CFLAGS += -mcmodel=kernel
         KBUILD_RUSTFLAGS += -Cno-redzone=y
         KBUILD_RUSTFLAGS += -Ccode-model=kernel
@@ -176,6 +181,8 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
 	endif
 endif
 
+export PIE_CFLAGS
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 6a1821bd7d5e..9437653a9de2 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -92,7 +92,7 @@ ifneq ($(RETPOLINE_VDSO_CFLAGS),)
 endif
 endif
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(PIE_CFLAGS) $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
 $(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO
 
 #
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 24/43] x86/boot/compressed: Adapt sed command to generate voffset.h when PIE is enabled
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (22 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 23/43] x86/pie: Force hidden visibility for all symbol references Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only Hou Wenlong
                   ` (19 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Nathan Chancellor, Ard Biesheuvel,
	Nick Desaulniers, Andrew Morton, Alexander Potapenko, Xin Li

When PIE is enabled, all symbols are marked hidden to reduce GOT
references. According to the generic ABI, a hidden symbol contained in a
relocatable object must be either removed or converted to STB_LOCAL
binding by the link-editor when the relocatable object is included in an
executable file or shared object. Both gold and ld.lld change the
binding of a STV_HIDDEN symbol to STB_LOCAL, but GNU ld keeps such
symbols global (and hidden). However, the sed command that generates
voffset.h only captures global symbols, so an empty voffset.h is
generated when PIE is enabled with lld. Capture local symbols in the sed
command as well.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..678881496c44 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -79,7 +79,7 @@ LDFLAGS_vmlinux += -T
 hostprogs	:= mkpiggy
 HOST_EXTRACFLAGS += -I$(srctree)/tools/include
 
-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVWabcdgrstvw] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
 
 quiet_cmd_voffset = VOFFSET $@
       cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (23 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 24/43] x86/boot/compressed: Adapt sed command to generate voffset.h when PIE is enabled Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-30 14:23   ` Ard Biesheuvel
  2023-04-28  9:51 ` [PATCH RFC 26/43] x86/pie: Add .data.rel.* sections into link script Hou Wenlong
                   ` (18 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Arnd Bergmann, Peter Zijlstra, Josh Poimboeuf,
	Juergen Gross, Brian Gerst, linux-arch

From: Thomas Garnier <thgarnie@chromium.org>

The GOT is changed during early boot when relocations are applied, so
make it read-only directly. This table exists only for a PIE binary.
Since a weak symbol reference always becomes a GOT reference, there are
8 entries in the GOT, but only the entry for __fentry__() is in use; the
other GOT references have been optimized away by the linker.

[Hou Wenlong: Change commit message and skip GOT size check]
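
As a rough sketch (not part of this patch; the actual consumer appears
later in the series when module loading is adapted), the bounded table
can be walked using the __start_got/__end_got markers added here:

  extern char __start_got[], __end_got[];

  /* Return the kernel GOT slot that holds @addr, or NULL if there is none. */
  static u64 *got_find_slot(unsigned long addr)
  {
  	u64 *slot;

  	for (slot = (u64 *)__start_got; slot < (u64 *)__end_got; slot++)
  		if (*slot == addr)
  			return slot;
  	return NULL;
  }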

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/vmlinux.lds.S     |  2 ++
 include/asm-generic/vmlinux.lds.h | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index f02dcde9f8a8..fa4c6582663f 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -462,6 +462,7 @@ SECTIONS
 #endif
 	       "Unexpected GOT/PLT entries detected!")
 
+#ifndef CONFIG_X86_PIE
 	/*
 	 * Sections that should stay zero sized, which is safer to
 	 * explicitly check instead of blindly discarding.
@@ -470,6 +471,7 @@ SECTIONS
 		*(.got) *(.igot.*)
 	}
 	ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")
+#endif
 
 	.plt : {
 		*(.plt) *(.plt.*) *(.iplt)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d1f57e4868ed..438ed8b39896 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -441,6 +441,17 @@
 	__end_ro_after_init = .;
 #endif
 
+#ifdef CONFIG_X86_PIE
+#define RO_GOT_X86							\
+	.got        : AT(ADDR(.got) - LOAD_OFFSET) {			\
+		__start_got = .;					\
+		*(.got) *(.igot.*);					\
+		__end_got = .;						\
+	}
+#else
+#define RO_GOT_X86
+#endif
+
 /*
  * .kcfi_traps contains a list KCFI trap locations.
  */
@@ -486,6 +497,7 @@
 		BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend_late, _pci_fixups_suspend_late, __start, __end) \
 	}								\
 									\
+	RO_GOT_X86							\
 	FW_LOADER_BUILT_IN_DATA						\
 	TRACEDATA							\
 									\
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 26/43] x86/pie: Add .data.rel.* sections into link script
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (24 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 27/43] x86/relocs: Handle PIE relocations Hou Wenlong
                   ` (17 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Juergen Gross,
	Brian Gerst

The .data.rel.local and .data.rel.ro.local sections are generated when
PIE is enabled. Merge them into the .data section, since no dynamic
loader is used.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/vmlinux.lds.S | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index fa4c6582663f..71e0769d2b52 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -177,6 +177,10 @@ SECTIONS
 		/* rarely changed data like cpu maps */
 		READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)
 
+#ifdef CONFIG_X86_PIE
+		*(.data.rel)
+		*(.data.rel.*)
+#endif
 		/* End of data section */
 		_edata = .;
 	} :data
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 27/43] x86/relocs: Handle PIE relocations
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (25 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 26/43] x86/pie: Add .data.rel.* sections into link script Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 28/43] KVM: x86: Adapt assembly for PIE support Hou Wenlong
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Sami Tolvanen, Peter Zijlstra (Intel)

From: Thomas Garnier <thgarnie@chromium.org>

Change the relocation tool to correctly handle relocations generated by
the -fPIE option:

 - Add relocation for each entry of the .got section given the linker
   does not generate R_X86_64_GLOB_DAT on a simple link.
 - Ignore R_X86_64_GOTPCREL.
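
For illustration (the assembly below is assumed, not taken from the
patch): a GOTPCREL reference is itself PC-relative and needs no
boot-time fixup, whereas the 64-bit GOT slot it points at does, which is
why a synthetic R_X86_64_GLOB_DAT relocation is emitted for each GOT
entry:

  /*
   * movq __fentry__@GOTPCREL(%rip), %rax  - disp32 stays valid when the
   *                                         image is relocated
   * .got: .quad __fentry__                - absolute address, needs a
   *                                         64-bit boot-time relocation
   */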

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/tools/relocs.c | 96 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 038e9c12fad3..97ac96195232 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -42,6 +42,7 @@ struct section {
 	Elf32_Word     *xsymtab;
 	Elf_Rel        *reltab;
 	char           *strtab;
+	Elf_Addr       *got;
 };
 static struct section *secs;
 
@@ -308,6 +309,36 @@ static Elf_Sym *sym_lookup(const char *symname)
 	return 0;
 }
 
+static Elf_Sym *sym_lookup_addr(Elf_Addr addr, const char **name)
+{
+	int i;
+
+	for (i = 0; i < ehdr.e_shnum; i++) {
+		struct section *sec = &secs[i];
+		long nsyms;
+		Elf_Sym *symtab;
+		Elf_Sym *sym;
+
+		if (sec->shdr.sh_type != SHT_SYMTAB)
+			continue;
+
+		nsyms = sec->shdr.sh_size/sizeof(Elf_Sym);
+		symtab = sec->symtab;
+
+		for (sym = symtab; --nsyms >= 0; sym++) {
+			if (sym->st_value == addr) {
+				if (name) {
+					*name = sym_name(sec->link->strtab,
+							 sym);
+				}
+				return sym;
+			}
+		}
+	}
+	return 0;
+}
+
+
 #if BYTE_ORDER == LITTLE_ENDIAN
 #define le16_to_cpu(val) (val)
 #define le32_to_cpu(val) (val)
@@ -588,6 +619,35 @@ static void read_relocs(FILE *fp)
 	}
 }
 
+static void read_got(FILE *fp)
+{
+	int i;
+
+	for (i = 0; i < ehdr.e_shnum; i++) {
+		struct section *sec = &secs[i];
+
+		sec->got = NULL;
+		if (sec->shdr.sh_type != SHT_PROGBITS ||
+		    strcmp(sec_name(i), ".got")) {
+			continue;
+		}
+		sec->got = malloc(sec->shdr.sh_size);
+		if (!sec->got) {
+			die("malloc of %" FMT " bytes for got failed\n",
+				sec->shdr.sh_size);
+		}
+		if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+			die("Seek to %" FMT " failed: %s\n",
+				sec->shdr.sh_offset, strerror(errno));
+		}
+		if (fread(sec->got, 1, sec->shdr.sh_size, fp)
+		    != sec->shdr.sh_size) {
+			die("Cannot read got: %s\n",
+				strerror(errno));
+		}
+	}
+}
+
 
 static void print_absolute_symbols(void)
 {
@@ -718,6 +778,32 @@ static void add_reloc(struct relocs *r, uint32_t offset)
 	r->offset[r->count++] = offset;
 }
 
+/*
+ * The linker does not generate relocations for the GOT for the kernel.
+ * If a GOT is found, simulate the relocations that should have been included.
+ */
+static void walk_got_table(int (*process)(struct section *sec, Elf_Rel *rel,
+					  Elf_Sym *sym, const char *symname),
+			   struct section *sec)
+{
+	int i;
+	Elf_Addr entry;
+	Elf_Sym *sym;
+	const char *symname;
+	Elf_Rel rel;
+
+	for (i = 0; i < sec->shdr.sh_size/sizeof(Elf_Addr); i++) {
+		entry = sec->got[i];
+		sym = sym_lookup_addr(entry, &symname);
+		if (!sym)
+			die("Could not find GOT symbol for entry %d\n", i);
+		rel.r_offset = sec->shdr.sh_addr + i * sizeof(Elf_Addr);
+		rel.r_info = ELF_BITS == 64 ? R_X86_64_GLOB_DAT
+			     : R_386_GLOB_DAT;
+		process(sec, &rel, sym, symname);
+	}
+}
+
 static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
 			Elf_Sym *sym, const char *symname))
 {
@@ -731,6 +817,8 @@ static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
 		struct section *sec = &secs[i];
 
 		if (sec->shdr.sh_type != SHT_REL_TYPE) {
+			if (sec->got)
+				walk_got_table(process, sec);
 			continue;
 		}
 		sec_symtab  = sec->link;
@@ -842,6 +930,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 		offset += per_cpu_load_addr;
 
 	switch (r_type) {
+	case R_X86_64_GOTPCREL:
 	case R_X86_64_NONE:
 		/* NONE can be ignored. */
 		break;
@@ -905,7 +994,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 		 * the relocations are processed.
 		 * Make sure that the offset will fit.
 		 */
-		if ((int32_t)offset != (int64_t)offset)
+		if (r_type != R_X86_64_64 && (int32_t)offset != (int64_t)offset)
 			die("Relocation offset doesn't fit in 32 bits\n");
 
 		if (r_type == R_X86_64_64)
@@ -914,6 +1003,10 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 			add_reloc(&relocs32, offset);
 		break;
 
+	case R_X86_64_GLOB_DAT:
+		add_reloc(&relocs64, offset);
+		break;
+
 	default:
 		die("Unsupported relocation type: %s (%d)\n",
 		    rel_type(r_type), r_type);
@@ -1188,6 +1281,7 @@ void process(FILE *fp, int use_real_mode, int as_text,
 	read_strtabs(fp);
 	read_symtabs(fp);
 	read_relocs(fp);
+	read_got(fp);
 	if (ELF_BITS == 64)
 		percpu_init();
 	if (show_absolute_syms) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 28/43] KVM: x86: Adapt assembly for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (26 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 27/43] x86/relocs: Handle PIE relocations Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 29/43] x86/PVH: Adapt PVH booting " Hou Wenlong
                   ` (15 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, kvm

Change the assembly code to use only relative references to symbols so
that the kernel can be PIE compatible.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kvm/svm/vmenter.S | 10 +++++-----
 arch/x86/kvm/vmx/vmenter.S |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 8e8295e774f0..25be1a66c59d 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -270,16 +270,16 @@ SYM_FUNC_START(__svm_vcpu_run)
 	RESTORE_GUEST_SPEC_CTRL_BODY
 	RESTORE_HOST_SPEC_CTRL_BODY
 
-10:	cmpb $0, kvm_rebooting
+10:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 2b
 	ud2
-30:	cmpb $0, kvm_rebooting
+30:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 4b
 	ud2
-50:	cmpb $0, kvm_rebooting
+50:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 6b
 	ud2
-70:	cmpb $0, kvm_rebooting
+70:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 8b
 	ud2
 
@@ -381,7 +381,7 @@ SYM_FUNC_START(__svm_sev_es_vcpu_run)
 	RESTORE_GUEST_SPEC_CTRL_BODY
 	RESTORE_HOST_SPEC_CTRL_BODY
 
-3:	cmpb $0, kvm_rebooting
+3:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 2b
 	ud2
 
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 631fd7da2bc3..b7cc3c17736a 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -289,7 +289,7 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
 	RET
 
 .Lfixup:
-	cmpb $0, kvm_rebooting
+	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne .Lvmfail
 	ud2
 .Lvmfail:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 29/43] x86/PVH: Adapt PVH booting for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (27 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 28/43] KVM: x86: Adapt assembly for PIE support Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 30/43] x86/bpf: Adapt BPF_CALL JIT codegen " Hou Wenlong
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Juergen Gross, Boris Ostrovsky, Darren Hart, Andy Shevchenko,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, xen-devel, platform-driver-x86

If PIE is enabled, all symbol references are RIP-relative. However, PVH
booting runs in the low address space, which could cause the wrong
x86_init callbacks to be assigned. Since init_top_pgt already builds the
high kernel address mapping, let PVH booting run in the high address
space so that everything works correctly.

PVH booting assumes that no relocation has happened. Since the kernel's
compile-time address is still in the top 2G, it is allowed to use
R_X86_64_32S for symbol references in pvh_start_xen().

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/platform/pvh/head.S | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index 5842fe0e4f96..09518d4de042 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -94,6 +94,13 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 	/* 64-bit entry point. */
 	.code64
 1:
+#ifdef CONFIG_X86_PIE
+	movabs  $2f, %rax
+	ANNOTATE_RETPOLINE_SAFE
+	jmp *%rax
+2:
+	ANNOTATE_NOENDBR // above
+#endif
 	/* Set base address in stack canary descriptor. */
 	mov $MSR_GS_BASE,%ecx
 #if defined(CONFIG_STACKPROTECTOR_FIXED)
@@ -149,9 +156,15 @@ SYM_CODE_END(pvh_start_xen)
 	.section ".init.data","aw"
 	.balign 8
 SYM_DATA_START_LOCAL(gdt)
+	/*
+	 * Use _ASM_PTR (quad on x86-64) for _pa(gdt_start) because PIE requires
+	 * a pointer-sized storage location for the relocation to be applied to.
+	 * On 32-bit, _ASM_PTR is a long, which already matches the space needed
+	 * for the relocation.
+	 */
 	.word gdt_end - gdt_start
-	.long _pa(gdt_start)
-	.word 0
+	_ASM_PTR _pa(gdt_start)
+	.balign 8
 SYM_DATA_END(gdt)
 SYM_DATA_START_LOCAL(gdt_start)
 	.quad 0x0000000000000000            /* NULL descriptor */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 30/43] x86/bpf: Adapt BPF_CALL JIT codegen for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (28 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 29/43] x86/PVH: Adapt PVH booting " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 31/43] x86/modules: Adapt module loading " Hou Wenlong
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, bpf, netdev

If image is NULL, the calculated ip is a low address while func is a
kernel image address, so the offset is only valid when the kernel stays
in the top 2G of the address space. However, a PIE kernel image could be
below the top 2G, which puts the offset out of range. Since the length
of the PC-relative call instruction is fixed, it is pointless to
calculate the offset without the proper image base (it has been zero
until the last pass). Use 1 as a dummy offset to generate the
instruction in the first pass.
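
A minimal sketch of the range check involved, using a hypothetical
helper that is not part of the JIT:

  /* Can a "call rel32" (0xe8) at @ip reach @func? */
  static bool fits_in_call_rel32(u64 func, u64 ip)
  {
  	s64 off = (s64)(func - (ip + 5));	/* 5 == length of the call insn */

  	return off == (s64)(s32)off;
  }

With image == NULL, ip is close to zero, so a callee in a kernel image
relocated below the top 2G can be farther than +/-2G away and the
displacement cannot be encoded until the real image base is known.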

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/net/bpf_jit_comp.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 1056bbf55b17..0da41833e426 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1549,8 +1549,21 @@ st:			if (is_imm8(insn->off))
 					return -EINVAL;
 				offs = x86_call_depth_emit_accounting(&prog, func);
 			}
-			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
-				return -EINVAL;
+			/*
+			 * If image is NULL, ip is a low address and func is a
+			 * kernel image address (top 2G), so the offset is valid.
+			 * However, a PIE kernel image could be below the top 2G,
+			 * which would put the offset out of range. Since the
+			 * length of the PC-relative call (0xe8) is fixed, it is
+			 * pointless to calculate the offset until the last pass.
+			 * Use 1 as a dummy offset if image is NULL.
+			 */
+			if (image)
+				err = emit_call(&prog, func, image + addrs[i - 1] + offs);
+			else
+				err = emit_call(&prog, (void *)(X86_PATCH_SIZE + 1UL), 0);
+			if (err)
+				return err;
 			break;
 		}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (29 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 30/43] x86/bpf: Adapt BPF_CALL JIT codegen " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 19:29   ` Ard Biesheuvel
  2023-04-28  9:51 ` [PATCH RFC 32/43] x86/boot/64: Use data relocation to get absolute address when PIE is enabled Hou Wenlong
                   ` (12 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Ard Biesheuvel

Adapt module loading to support PIE relocations. No GOT is generated
for modules; every GOT reference in a module must resolve to an entry
that already exists in the kernel GOT. Currently, the only usable GOT
reference is the one for __fentry__().
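
A worked example with made-up addresses (illustration only): the
displacement of a module's "call *__fentry__@GOTPCREL(%rip)" is computed
like a PC32 relocation, but against the kernel's GOT slot instead of the
symbol itself:

  u64 loc    = 0xffffffffc0001002ULL;	/* address of the disp32 field (made up)   */
  u64 slot   = 0xffffffff82000018ULL;	/* kernel GOT slot of __fentry__ (made up) */
  s64 addend = -4;			/* typical GOTPCREL addend                 */
  u32 disp   = (u32)(slot + addend - loc);
  /* at run time %rip == loc + 4, so %rip + (s32)disp == slot */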

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/sections.h |  5 +++++
 arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index a6e8373a5170..dc1c2b08ec48 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
+
+#ifdef CONFIG_X86_PIE
+extern char __start_got[], __end_got[];
+#endif
+
 #endif
 
 extern char __end_of_kernel_reserve[];
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 84ad0e61ba6e..051f88e6884e 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
 	return 0;
 }
 #else /*X86_64*/
+#ifdef CONFIG_X86_PIE
+static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
+{
+	u64 *pos;
+
+	for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
+		if (*pos == sym->st_value)
+			return (u64)pos + rela->r_addend;
+	return 0;
+}
+#endif
+
 static int __write_relocate_add(Elf64_Shdr *sechdrs,
 		   const char *strtab,
 		   unsigned int symindex,
@@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 		case R_X86_64_64:
 			size = 8;
 			break;
+#ifndef CONFIG_X86_PIE
 		case R_X86_64_32:
 			if (val != *(u32 *)&val)
 				goto overflow;
@@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 				goto overflow;
 			size = 4;
 			break;
+#else
+		case R_X86_64_GOTPCREL:
+			val = find_got_kernel_entry(sym, rel);
+			if (!val)
+				goto unexpected_got_reference;
+			fallthrough;
+#endif
 		case R_X86_64_PC32:
 		case R_X86_64_PLT32:
 			val -= (u64)loc;
@@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 	}
 	return 0;
 
+#ifdef CONFIG_X86_PIE
+unexpected_got_reference:
+	pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
+	return -ENOEXEC;
+#else
 overflow:
 	pr_err("overflow in relocation type %d val %Lx\n",
 	       (int)ELF64_R_TYPE(rel[i].r_info), val);
 	pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
 	       me->name);
+#endif
+
 	return -ENOEXEC;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 32/43] x86/boot/64: Use data relocation to get absolute address when PIE is enabled
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (30 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 31/43] x86/modules: Adapt module loading " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 33/43] objtool: Add validation for x86 PIE support Hou Wenlong
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Juergen Gross, Anshuman Khandual, Mike Rapoport,
	Josh Poimboeuf, Pasha Tatashin

When PIE is enabled, all symbol references are RIP-relative, so there
is no need to fix up global symbol references when running at a low
address. However, in order to acquire the absolute virtual address of a
symbol, introduce a macro that uses a data relocation to obtain it.
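
A minimal sketch of the idea (simplified; the real macro wraps this in a
statement expression): the initializer forces a 64-bit data relocation,
so after boot-time relocation processing the variable holds the symbol's
absolute virtual address, while the load of the variable itself stays
RIP-relative and therefore also works from the identity mapping:

  static unsigned long __initdata __abs_text = (unsigned long)_text;

  /* read RIP-relatively, but the value is the absolute virtual address */
  unsigned long text_va = __abs_text;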

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/head64.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 49f7629b17f7..ef7ad96f2154 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -86,10 +86,22 @@ static struct desc_ptr startup_gdt_descr = {
 
 #define __head	__section(".head.text")
 
+#ifdef CONFIG_X86_PIE
+#define SYM_ABS_VAL(sym)	\
+	({ static unsigned long __initdata __##sym = (unsigned long)sym; __##sym; })
+
+static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr;
+}
+#else
+#define SYM_ABS_VAL(sym) ((unsigned long)sym)
+
 static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
 {
 	return ptr - (void *)_text + (void *)physaddr;
 }
+#endif /* CONFIG_X86_PIE */
 
 static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
 {
@@ -142,8 +154,8 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
 	 * attribute.
 	 */
 	if (sme_get_me_mask()) {
-		vaddr = (unsigned long)__start_bss_decrypted;
-		vaddr_end = (unsigned long)__end_bss_decrypted;
+		vaddr = SYM_ABS_VAL(__start_bss_decrypted);
+		vaddr_end = SYM_ABS_VAL(__end_bss_decrypted);
 
 		for (; vaddr < vaddr_end; vaddr += PMD_SIZE) {
 			/*
@@ -189,6 +201,8 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	bool la57;
 	int i;
 	unsigned int *next_pgt_ptr;
+	unsigned long text_base = SYM_ABS_VAL(_text);
+	unsigned long end_base = SYM_ABS_VAL(_end);
 
 	la57 = check_la57_support(physaddr);
 
@@ -200,7 +214,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 * Compute the delta between the address I am compiled to run at
 	 * and the address I am actually running at.
 	 */
-	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+	load_delta = physaddr - (text_base - __START_KERNEL_map);
 
 	/* Is the address not 2M aligned? */
 	if (load_delta & ~PMD_MASK)
@@ -214,9 +228,9 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	p = pgd + pgd_index(__START_KERNEL_map);
 	if (la57)
-		*p = (unsigned long)level4_kernel_pgt;
+		*p = SYM_ABS_VAL(level4_kernel_pgt);
 	else
-		*p = (unsigned long)level3_kernel_pgt;
+		*p = SYM_ABS_VAL(level3_kernel_pgt);
 	*p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
 
 	if (la57) {
@@ -273,7 +287,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	pmd_entry += sme_get_me_mask();
 	pmd_entry +=  physaddr;
 
-	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
+	for (i = 0; i < DIV_ROUND_UP(end_base - text_base, PMD_SIZE); i++) {
 		int idx = i + (physaddr >> PMD_SHIFT);
 
 		pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
@@ -298,11 +312,11 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
 
 	/* invalidate pages before the kernel image */
-	for (i = 0; i < pmd_index((unsigned long)_text); i++)
+	for (i = 0; i < pmd_index(text_base); i++)
 		pmd[i] &= ~_PAGE_PRESENT;
 
 	/* fixup pages that are part of the kernel image */
-	for (; i <= pmd_index((unsigned long)_end); i++)
+	for (; i <= pmd_index(end_base); i++)
 		if (pmd[i] & _PAGE_PRESENT)
 			pmd[i] += load_delta;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 33/43] objtool: Add validation for x86 PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (31 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 32/43] x86/boot/64: Use data relocation to get absolute address when PIE is enabled Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 10:28   ` Christophe Leroy
  2023-04-28  9:51 ` [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for " Hou Wenlong
                   ` (10 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Masahiro Yamada, Nathan Chancellor,
	Nick Desaulniers, Nicolas Schier, Josh Poimboeuf, Peter Zijlstra,
	Christophe Leroy, Sathvika Vasireddy, Thomas Weißschuh,
	linux-kbuild

For an x86 PIE binary, only RIP-relative addressing is allowed.
However, there are still a few absolute references of the R_X86_64_64
relocation type in the data sections, and a few absolute references of
the R_X86_64_32S relocation type in the pvh_start_xen() function. Add a
--pie option to objtool to validate that no other absolute relocations
remain.

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                        |  1 +
 scripts/Makefile.lib                    |  1 +
 tools/objtool/builtin-check.c           |  4 +-
 tools/objtool/check.c                   | 82 +++++++++++++++++++++++++
 tools/objtool/include/objtool/builtin.h |  1 +
 5 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 715f0734d065..b753a54e5ea7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2224,6 +2224,7 @@ config RELOCATABLE
 config X86_PIE
 	def_bool n
 	depends on X86_64
+	select OBJTOOL if HAVE_OBJTOOL
 
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image (KASLR)"
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 100a386fcd71..e3c804fbc421 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE)		+= --static-call
 objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION)		+= --uaccess
 objtool-args-$(CONFIG_GCOV_KERNEL)			+= --no-unreachable
 objtool-args-$(CONFIG_PREFIX_SYMBOLS)			+= --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
+objtool-args-$(CONFIG_X86_PIE)			        += --pie
 
 objtool-args = $(objtool-args-y)					\
 	$(if $(delay-objtool), --link)					\
diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
index 7c175198d09f..1cf1d00464e0 100644
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -81,6 +81,7 @@ static const struct option check_options[] = {
 	OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
 	OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
 	OPT_BOOLEAN(0  , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
+	OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
 	OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),
 
 	OPT_GROUP("Options:"),
@@ -137,7 +138,8 @@ static bool opts_valid(void)
 	    opts.sls			||
 	    opts.stackval		||
 	    opts.static_call		||
-	    opts.uaccess) {
+	    opts.uaccess		||
+	    opts.pie) {
 		if (opts.dump_orc) {
 			ERROR("--dump can't be combined with other options");
 			return false;
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5b600bbf2389..d67b80251eec 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
 	for (insn = next_insn_same_sec(file, insn); insn;		\
 	     insn = next_insn_same_sec(file, insn))
 
+static struct instruction *find_insn_containing(struct objtool_file *file,
+						struct section *sec,
+						unsigned long offset)
+{
+	struct instruction *insn;
+
+	insn = find_insn(file, sec, 0);
+	if (!insn)
+		return NULL;
+
+	sec_for_each_insn_from(file, insn) {
+		if (insn->offset > offset)
+			return NULL;
+		if (insn->offset <= offset && (insn->offset + insn->len) > offset)
+			return insn;
+	}
+
+	return NULL;
+}
+
+
 static inline struct symbol *insn_call_dest(struct instruction *insn)
 {
 	if (insn->type == INSN_JUMP_DYNAMIC ||
@@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
 	return 0;
 }
 
+static int is_in_pvh_code(struct instruction *insn)
+{
+	struct symbol *sym = insn->sym;
+
+	return sym && !strcmp(sym->name, "pvh_start_xen");
+}
+
+static int validate_pie(struct objtool_file *file)
+{
+	struct section *sec;
+	struct reloc *reloc;
+	struct instruction *insn;
+	int warnings = 0;
+
+	for_each_sec(file, sec) {
+		if (!sec->reloc)
+			continue;
+		if (!(sec->sh.sh_flags & SHF_ALLOC))
+			continue;
+
+		list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
+			switch (reloc->type) {
+			case R_X86_64_NONE:
+			case R_X86_64_PC32:
+			case R_X86_64_PLT32:
+			case R_X86_64_64:
+			case R_X86_64_PC64:
+			case R_X86_64_GOTPCREL:
+				break;
+			case R_X86_64_32:
+			case R_X86_64_32S:
+				insn = find_insn_containing(file, sec, reloc->offset);
+				if (!insn) {
+					WARN("can't find relocate insn near %s+0x%lx",
+					     sec->name, reloc->offset);
+				} else {
+					if (is_in_pvh_code(insn))
+						break;
+					WARN("insn at %s+0x%lx is not compatible with PIE",
+					     sec->name, insn->offset);
+				}
+				warnings++;
+				break;
+			default:
+				WARN("unexpected relocation type %d at %s+0x%lx",
+				     reloc->type, sec->name, reloc->offset);
+				warnings++;
+				break;
+			}
+		}
+	}
+
+	return warnings;
+}
+
 int check(struct objtool_file *file)
 {
 	int ret, warnings = 0;
@@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
 		warnings += ret;
 	}
 
+	if (opts.pie) {
+		ret = validate_pie(file);
+		if (ret < 0)
+			return ret;
+		warnings += ret;
+	}
 
 	if (opts.stats) {
 		printf("nr_insns_visited: %ld\n", nr_insns_visited);
diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
index 2a108e648b7a..1151211a5cea 100644
--- a/tools/objtool/include/objtool/builtin.h
+++ b/tools/objtool/include/objtool/builtin.h
@@ -26,6 +26,7 @@ struct opts {
 	bool uaccess;
 	int prefix;
 	bool cfi;
+	bool pie;
 
 	/* options: */
 	bool backtrace;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for PIE support
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (32 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 33/43] objtool: Add validation for x86 PIE support Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 15:18   ` Peter Zijlstra
  2023-04-28  9:51 ` [PATCH RFC 35/43] x86/pie: Build the kernel as PIE Hou Wenlong
                   ` (9 subsequent siblings)
  43 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Josh Poimboeuf, Peter Zijlstra, Christophe Leroy,
	Sathvika Vasireddy

When using PIE with function tracing, the compiler generates a call
through the GOT (call *__fentry__@GOTPCREL). This instruction is an
indirect call (INSN_CALL_DYNAMIC) and would not be collected by
add_call_destinations(), so collect those indirect calls to
__fentry__() individually for PIE support. Also replace the 6th byte of
the GOT call with a 1-byte NOP so ftrace can handle the preceding 5
bytes as before.

When RETPOLINE is enabled, the __fentry__() call remains an indirect
call, which generates warnings in objtool. For simplicity, select
DYNAMIC_FTRACE so it gets patched to NOPs, and regard it as INSN_CALL to
suppress the jump table and retpoline warnings in objtool.
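
For illustration, with byte values taken from the standard encodings
(not from the patch): the compiler emits a 6-byte indirect call through
the GOT, which is rewritten as a 5-byte NOP followed by a 1-byte NOP so
that the usual 5-byte ftrace site still starts at the same offset:

  /* call *__fentry__@GOTPCREL(%rip): ff 15 <disp32> */
  static const unsigned char fentry_got_call[6] = { 0xff, 0x15, 0x00, 0x00, 0x00, 0x00 };

  /* replacement: BYTES_NOP5 followed by BYTES_NOP1 */
  static const unsigned char fentry_nops[6] = { 0x0f, 0x1f, 0x44, 0x00, 0x00, 0x90 };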

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                |  1 +
 tools/objtool/arch/x86/decode.c | 10 +++++++--
 tools/objtool/check.c           | 39 +++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b753a54e5ea7..5ac5f335855e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2225,6 +2225,7 @@ config X86_PIE
 	def_bool n
 	depends on X86_64
 	select OBJTOOL if HAVE_OBJTOOL
+	select DYNAMIC_FTRACE if FUNCTION_TRACER && RETPOLINE
 
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image (KASLR)"
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 9ef024fd648c..cd9a81002efe 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -747,15 +747,21 @@ void arch_initial_func_cfi_state(struct cfi_init_state *state)
 
 const char *arch_nop_insn(int len)
 {
-	static const char nops[5][5] = {
+	static const char nops[6][6] = {
 		{ BYTES_NOP1 },
 		{ BYTES_NOP2 },
 		{ BYTES_NOP3 },
 		{ BYTES_NOP4 },
 		{ BYTES_NOP5 },
+		/*
+		 * For a PIE kernel, use a 5-byte nop
+		 * followed by a 1-byte nop to keep the ftrace
+		 * hooking algorithm working correctly.
+		 */
+		{ BYTES_NOP5, BYTES_NOP1 },
 	};
 
-	if (len < 1 || len > 5) {
+	if (len < 1 || len > 6) {
 		WARN("invalid NOP size: %d\n", len);
 		return NULL;
 	}
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d67b80251eec..2456ab931fe5 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1785,6 +1785,38 @@ static int add_call_destinations(struct objtool_file *file)
 	return 0;
 }
 
+static int add_indirect_mcount_calls(struct objtool_file *file)
+{
+	struct instruction *insn;
+	struct reloc *reloc;
+
+	for_each_insn(file, insn) {
+		if (insn->type != INSN_CALL_DYNAMIC)
+			continue;
+
+		reloc = insn_reloc(file, insn);
+		if (!reloc)
+			continue;
+		if (!reloc->sym->fentry)
+			continue;
+
+		/*
+		 * __fentry__() is an indirect call even in RETPOLINE builds
+		 * when X86_PIE is enabled, so DYNAMIC_FTRACE is selected and
+		 * all indirect calls to __fentry__() will be patched to NOPs
+		 * later; regard the call as retpoline safe as a hack here. Also
+		 * regard it as a direct call, otherwise it would be treated as
+		 * a jump to a jump table in insn_jump_table(), because
+		 * _jump_table and _call_dest share the same memory.
+		 */
+		insn->type = INSN_CALL;
+		insn->retpoline_safe = true;
+		add_call_dest(file, insn, reloc->sym, false);
+	}
+
+	return 0;
+}
+
 /*
  * The .alternatives section requires some extra special care over and above
  * other special sections because alternatives are patched in place.
@@ -2668,6 +2700,13 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	/*
+	 * For X86 PIE kernel, __fentry__ call is an indirect call instead
+	 * of direct call.
+	 */
+	if (opts.pie)
+		add_indirect_mcount_calls(file);
+
 	/*
 	 * Must be after add_call_destinations() such that it can override
 	 * dead_end_function() marks.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 35/43] x86/pie: Build the kernel as PIE
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (33 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 36/43] x86/vsyscall: Don't use set_fixmap() to map vsyscall page Hou Wenlong
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

The kernel is currently built with the mcmodel=kernel option, which
forces it to stay in the top 2G of the virtual address space. For PIE,
use the -fPIE option to build the kernel as a Position Independent
Executable (PIE), which uses RIP-relative addressing and can be moved
below the top 2G.

The --emit-relocs linker option was kept instead of using -pie to limit
the impact on mapped sections. Any incompatible relocation will be
caught by objtool at compile time.

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig  | 8 ++++++--
 arch/x86/Makefile | 9 +++++++--
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ac5f335855e..9f8020991184 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2222,10 +2222,14 @@ config RELOCATABLE
 	  (CONFIG_PHYSICAL_START) is used as the minimum location.
 
 config X86_PIE
-	def_bool n
-	depends on X86_64
+	bool "Build a PIE kernel"
+	default n
+	depends on X86_64 && !XEN
 	select OBJTOOL if HAVE_OBJTOOL
 	select DYNAMIC_FTRACE if FUNCTION_TRACER && RETPOLINE
+	help
+	  This builds a PIE kernel image that could be put at any
+	  virtual address.
 
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image (KASLR)"
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 81500011396d..6631974e2003 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -160,10 +160,15 @@ else
         KBUILD_CFLAGS += -mno-red-zone
 
 ifdef CONFIG_X86_PIE
-        PIE_CFLAGS := -include $(srctree)/include/linux/hidden.h
+        PIE_CFLAGS := -fPIE -include $(srctree)/include/linux/hidden.h
         KBUILD_CFLAGS += $(PIE_CFLAGS)
-endif
+        # Disable relocation relaxation in both CFLAGS and LDFLAGS to support older toolchains
+        KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
+        LDFLAGS_vmlinux += $(call ld-option,--no-relax)
+        KBUILD_LDFLAGS_MODULE += $(call ld-option,--no-relax)
+else
         KBUILD_CFLAGS += -mcmodel=kernel
+endif
         KBUILD_RUSTFLAGS += -Cno-redzone=y
         KBUILD_RUSTFLAGS += -Ccode-model=kernel
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 36/43] x86/vsyscall: Don't use set_fixmap() to map vsyscall page
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (34 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 35/43] x86/pie: Build the kernel as PIE Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 37/43] x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap area Hou Wenlong
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Juergen Gross,
	Srivatsa S. Bhat (VMware),
	Alexey Makhalov, VMware PV-Drivers Reviewers, Boris Ostrovsky,
	Andrew Morton, Mike Rapoport (IBM),
	Liam R. Howlett, Suren Baghdasaryan, Kirill A. Shutemov,
	virtualization, xen-devel

In order to unify FIXADDR_TOP for x86 and allow the fixmap area to be
movable, the vsyscall page should be mapped individually. However, for a
XENPV guest, the vsyscall page needs to be mapped into the user
pagetable too. So introduce a new PVMMU op to help map the vsyscall
page.

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c |  3 +--
 arch/x86/include/asm/paravirt.h       |  7 +++++++
 arch/x86/include/asm/paravirt_types.h |  4 ++++
 arch/x86/include/asm/vsyscall.h       | 13 +++++++++++++
 arch/x86/kernel/paravirt.c            |  4 ++++
 arch/x86/xen/mmu_pv.c                 | 20 ++++++++++++++------
 6 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e0ca8120aea8..4373460ebbde 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -385,8 +385,7 @@ void __init map_vsyscall(void)
 	 * page.
 	 */
 	if (vsyscall_mode == EMULATE) {
-		__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
-			     PAGE_KERNEL_VVAR);
+		__set_vsyscall_page(physaddr_vsyscall, PAGE_KERNEL_VVAR);
 		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
 	}
 
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2350ceb43db0..dcc0706287ee 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -576,6 +576,13 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 {
 	pv_ops.mmu.set_fixmap(idx, phys, flags);
 }
+
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+static inline void __set_vsyscall_page(phys_addr_t phys, pgprot_t flags)
+{
+	pv_ops.mmu.set_vsyscall_page(phys, flags);
+}
+#endif
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 982a234f5a06..e79f38232849 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -224,6 +224,10 @@ struct pv_mmu_ops {
 	   an mfn.  We can tell which is which from the index. */
 	void (*set_fixmap)(unsigned /* enum fixed_addresses */ idx,
 			   phys_addr_t phys, pgprot_t flags);
+
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	void (*set_vsyscall_page)(phys_addr_t phys, pgprot_t flags);
+#endif
 #endif
 } __no_randomize_layout;
 
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index ab60a71a8dcb..73691fc60924 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_VSYSCALL_H
 #define _ASM_X86_VSYSCALL_H
 
+#include <asm/pgtable.h>
 #include <linux/seqlock.h>
 #include <uapi/asm/vsyscall.h>
 
@@ -15,6 +16,18 @@ extern void set_vsyscall_pgtable_user_bits(pgd_t *root);
  */
 extern bool emulate_vsyscall(unsigned long error_code,
 			     struct pt_regs *regs, unsigned long address);
+static inline void native_set_vsyscall_page(phys_addr_t phys, pgprot_t flags)
+{
+	pgprot_val(flags) &= __default_kernel_pte_mask;
+	set_pte_vaddr(VSYSCALL_ADDR, pfn_pte(phys >> PAGE_SHIFT, flags));
+}
+
+#ifndef CONFIG_PARAVIRT_XXL
+#define __set_vsyscall_page	native_set_vsyscall_page
+#else
+#include <asm/paravirt.h>
+#endif
+
 #else
 static inline void map_vsyscall(void) {}
 static inline bool emulate_vsyscall(unsigned long error_code,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ac10b46c5832..13c81402f377 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -33,6 +33,7 @@
 #include <asm/tlb.h>
 #include <asm/io_bitmap.h>
 #include <asm/gsseg.h>
+#include <asm/vsyscall.h>
 
 /*
  * nop stub, which must not clobber anything *including the stack* to
@@ -357,6 +358,9 @@ struct paravirt_patch_template pv_ops = {
 	},
 
 	.mmu.set_fixmap		= native_set_fixmap,
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	.mmu.set_vsyscall_page	= native_set_vsyscall_page,
+#endif
 #endif /* CONFIG_PARAVIRT_XXL */
 
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index fdc91deece7e..a59bc013ee5b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -59,6 +59,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
+#include <asm/vsyscall.h>
 #include <asm/mmu_context.h>
 #include <asm/setup.h>
 #include <asm/paravirt.h>
@@ -2020,9 +2021,6 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 
 	switch (idx) {
 	case FIX_BTMAP_END ... FIX_BTMAP_BEGIN:
-#ifdef CONFIG_X86_VSYSCALL_EMULATION
-	case VSYSCALL_PAGE:
-#endif
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
@@ -2058,14 +2056,21 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 	vaddr = __fix_to_virt(idx);
 	if (HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG))
 		BUG();
+}
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
+static void xen_set_vsyscall_page(phys_addr_t phys, pgprot_t prot)
+{
+	pte_t pte = pfn_pte(phys >> PAGE_SHIFT, prot);
+
+	if (HYPERVISOR_update_va_mapping(VSYSCALL_ADDR, pte, UVMF_INVLPG))
+		BUG();
+
 	/* Replicate changes to map the vsyscall page into the user
 	   pagetable vsyscall mapping. */
-	if (idx == VSYSCALL_PAGE)
-		set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
-#endif
+	set_pte_vaddr_pud(level3_user_vsyscall, VSYSCALL_ADDR, pte);
 }
+#endif
 
 static void __init xen_post_allocator_init(void)
 {
@@ -2156,6 +2161,9 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
 		},
 
 		.set_fixmap = xen_set_fixmap,
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+		.set_vsyscall_page = xen_set_vsyscall_page,
+#endif
 	},
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 37/43] x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap area
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (35 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 36/43] x86/vsyscall: Don't use set_fixmap() to map vsyscall page Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 38/43] x86/fixmap: Move vsyscall page " Hou Wenlong
                   ` (6 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Juergen Gross, Boris Ostrovsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, xen-devel

If the vsyscall page is moved out of the fixmap area, then FIXADDR_TOP
will be below the vsyscall page. So pin up to VSYSCALL_ADDR instead if
vsyscall is enabled.
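
As an illustration only (the concrete FIXADDR_TOP value is introduced by a
later patch in this series), the layout at the very top of the address
space then looks roughly like:

/*
 * 0xffffffffff600000  VSYSCALL_ADDR  (4K vsyscall page)
 * 0xffffffffff5ff000  FIXADDR_TOP    (highest fixmap page, now below vsyscall)
 *
 * so the pinning walk has to cover one page more than FIXADDR_TOP:
 */
#ifdef CONFIG_X86_VSYSCALL_EMULATION
#define __KERNEL_MAP_TOP	(VSYSCALL_ADDR + PAGE_SIZE)
#else
#define __KERNEL_MAP_TOP	FIXADDR_TOP
#endif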

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/xen/mmu_pv.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index a59bc013ee5b..28392f3478a0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -587,6 +587,12 @@ static void xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
 	xen_pud_walk(mm, pud, func, last, limit);
 }
 
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+#define __KERNEL_MAP_TOP	(VSYSCALL_ADDR + PAGE_SIZE)
+#else
+#define __KERNEL_MAP_TOP	FIXADDR_TOP
+#endif
+
 /*
  * (Yet another) pagetable walker.  This one is intended for pinning a
  * pagetable.  This means that it walks a pagetable and calls the
@@ -594,7 +600,7 @@ static void xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
  * at every level.  It walks the entire pagetable, but it only bothers
  * pinning pte pages which are below limit.  In the normal case this
  * will be STACK_TOP_MAX, but at boot we need to pin up to
- * FIXADDR_TOP.
+ * __KERNEL_MAP_TOP.
  *
  * We must skip the Xen hole in the middle of the address space, just after
  * the big x86-64 virtual hole.
@@ -609,7 +615,7 @@ static void __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
 
 	/* The limit is the last byte to be touched */
 	limit--;
-	BUG_ON(limit >= FIXADDR_TOP);
+	BUG_ON(limit >= __KERNEL_MAP_TOP);
 
 	/*
 	 * 64-bit has a great big hole in the middle of the address
@@ -797,7 +803,7 @@ static void __init xen_after_bootmem(void)
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 	SetPagePinned(virt_to_page(level3_user_vsyscall));
 #endif
-	xen_pgd_walk(&init_mm, xen_mark_pinned, FIXADDR_TOP);
+	xen_pgd_walk(&init_mm, xen_mark_pinned, __KERNEL_MAP_TOP);
 }
 
 static void xen_unpin_page(struct mm_struct *mm, struct page *page,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 38/43] x86/fixmap: Move vsyscall page out of fixmap area
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (36 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 37/43] x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap area Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 39/43] x86/fixmap: Unify FIXADDR_TOP Hou Wenlong
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	Mike Rapoport (IBM),
	Liam R. Howlett, Suren Baghdasaryan, Kirill A. Shutemov,
	David Woodhouse, Brian Gerst, Josh Poimboeuf

Now that the vsyscall page is mapped individually, it can be moved out
of the fixmap area.
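
The resulting change in the last 1G worth of PMD slots of
level2_fixmap_pgt can be sketched as follows (indices are illustrative,
derived from the constants in the patch below):

/*
 * Before: FIXMAP_PMD_TOP = 507, i.e. the fixmap PMDs include the 2M slot
 * whose first 4K page is the vsyscall page.
 *
 * After:
 *   entry 511            2M hole
 *   entries 508..510     6M reserved space
 *   entry 507            2M slot starting with the 4K vsyscall page,
 *                        now mapped individually, no longer a fixmap PMD
 *   entry 506 and below  fixmap PMDs (FIXMAP_PMD_TOP drops to 506)
 */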

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c |  4 ----
 arch/x86/include/asm/fixmap.h         | 17 +++++------------
 arch/x86/kernel/head_64.S             |  6 +++---
 arch/x86/mm/fault.c                   |  1 -
 arch/x86/mm/init_64.c                 |  2 +-
 5 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 4373460ebbde..f469f8dc36d4 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -35,7 +35,6 @@
 
 #include <asm/vsyscall.h>
 #include <asm/unistd.h>
-#include <asm/fixmap.h>
 #include <asm/traps.h>
 #include <asm/paravirt.h>
 
@@ -391,7 +390,4 @@ void __init map_vsyscall(void)
 
 	if (vsyscall_mode == XONLY)
 		vm_flags_init(&gate_vma, VM_EXEC);
-
-	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
-		     (unsigned long)VSYSCALL_ADDR);
 }
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index d0dcefb5cc59..eeb152ad9682 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -23,13 +23,13 @@
  * covered fully.
  */
 #ifndef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
-# define FIXMAP_PMD_NUM	2
+# define FIXMAP_PMD_NUM	1
 #else
 # define KM_PMDS	(KM_MAX_IDX * ((CONFIG_NR_CPUS + 511) / 512))
-# define FIXMAP_PMD_NUM (KM_PMDS + 2)
+# define FIXMAP_PMD_NUM (KM_PMDS + 1)
 #endif
-/* fixmap starts downwards from the 507th entry in level2_fixmap_pgt */
-#define FIXMAP_PMD_TOP	507
+/* fixmap starts downwards from the 506th entry in level2_fixmap_pgt */
+#define FIXMAP_PMD_TOP	506
 
 #ifndef __ASSEMBLY__
 #include <linux/kernel.h>
@@ -38,8 +38,6 @@
 #include <asm/pgtable_types.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
-#else
-#include <uapi/asm/vsyscall.h>
 #endif
 
 /*
@@ -55,8 +53,7 @@
 extern unsigned long __FIXADDR_TOP;
 #define FIXADDR_TOP	((unsigned long)__FIXADDR_TOP)
 #else
-#define FIXADDR_TOP	(round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<<PMD_SHIFT) - \
-			 PAGE_SIZE)
+#define FIXADDR_TOP	(0xffffffffff600000UL - PAGE_SIZE)
 #endif
 
 /*
@@ -81,10 +78,6 @@ extern unsigned long __FIXADDR_TOP;
 enum fixed_addresses {
 #ifdef CONFIG_X86_32
 	FIX_HOLE,
-#else
-#ifdef CONFIG_X86_VSYSCALL_EMULATION
-	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
-#endif
 #endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 94c5defec8cc..19cb2852238b 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -659,15 +659,15 @@ SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
 SYM_DATA_END(level2_kernel_pgt)
 
 SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
-	.fill	(512 - 4 - FIXMAP_PMD_NUM),8,0
+	.fill	(512 - 5 - FIXMAP_PMD_NUM),8,0
 	pgtno = 0
 	.rept (FIXMAP_PMD_NUM)
 	.quad level1_fixmap_pgt + (pgtno << PAGE_SHIFT) - __START_KERNEL_map \
 		+ _PAGE_TABLE_NOENC;
 	pgtno = pgtno + 1
 	.endr
-	/* 6 MB reserved space + a 2MB hole */
-	.fill	4,8,0
+	/* 2MB (with 4KB vsyscall page inside) + 6 MB reserved space + a 2MB hole */
+	.fill	5,8,0
 SYM_DATA_END(level2_fixmap_pgt)
 
 SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7beb0ba6b2ec..548c0803d9f4 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -22,7 +22,6 @@
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
-#include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
 #include <asm/mmu_context.h>		/* vma_pkey()			*/
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a190aae8ceaf..b7fd05a1ba1d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -40,7 +40,7 @@
 #include <linux/uaccess.h>
 #include <asm/pgalloc.h>
 #include <asm/dma.h>
-#include <asm/fixmap.h>
+#include <asm/vsyscall.h>
 #include <asm/e820/api.h>
 #include <asm/apic.h>
 #include <asm/tlb.h>
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 39/43] x86/fixmap: Unify FIXADDR_TOP
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (37 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 38/43] x86/fixmap: Move vsyscall page " Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 40/43] x86/boot: Fill kernel image puds dynamically Hou Wenlong
                   ` (4 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Juergen Gross,
	Anshuman Khandual, Josh Poimboeuf, Pasha Tatashin

Now that FIXADDR_TOP no longer has anything to do with the vsyscall
page, it can be declared as a variable for x86_64 as well, so unify it
across x86.
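
One consequence, restated here as a sketch (it mirrors the
early_ioremap_init() hunk below): checks involving FIXADDR_TOP can no
longer be made at compile time on x86_64, because fix_to_virt() now
reads a variable there too.

	/* before: compile-time check (previously in head64.c) */
	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

	/* after: runtime check in early_ioremap_init() */
	BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);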

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/fixmap.h | 13 -------------
 arch/x86/kernel/head64.c      |  1 -
 arch/x86/mm/dump_pagetables.c |  3 ++-
 arch/x86/mm/ioremap.c         |  5 ++---
 arch/x86/mm/pgtable.c         | 13 +++++++++++++
 arch/x86/mm/pgtable_32.c      |  3 ---
 6 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index eeb152ad9682..9433109e4853 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -40,21 +40,8 @@
 #include <linux/threads.h>
 #endif
 
-/*
- * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
- * uses fixmaps that relies on FIXADDR_TOP for proper address calculation.
- * Because of this, FIXADDR_TOP x86 integration was left as later work.
- */
-#ifdef CONFIG_X86_32
-/*
- * Leave one empty page between vmalloc'ed areas and
- * the start of the fixmap.
- */
 extern unsigned long __FIXADDR_TOP;
 #define FIXADDR_TOP	((unsigned long)__FIXADDR_TOP)
-#else
-#define FIXADDR_TOP	(0xffffffffff600000UL - PAGE_SIZE)
-#endif
 
 /*
  * Here we define all the compile-time 'special' virtual
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index ef7ad96f2154..8295b547b64f 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -499,7 +499,6 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
 	BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
 	MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
 				(__START_KERNEL & PGDIR_MASK)));
-	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
 
 	cr4_init_shadow();
 
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e1b599ecbbc2..df1a708a038a 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -104,7 +104,7 @@ static struct addr_marker address_markers[] = {
 	[HIGH_KERNEL_NR]	= { __START_KERNEL_map,	"High Kernel Mapping" },
 	[MODULES_VADDR_NR]	= { MODULES_VADDR,	"Modules" },
 	[MODULES_END_NR]	= { MODULES_END,	"End Modules" },
-	[FIXADDR_START_NR]	= { FIXADDR_START,	"Fixmap Area" },
+	[FIXADDR_START_NR]	= { 0UL,		"Fixmap Area" },
 	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
 
@@ -453,6 +453,7 @@ static int __init pt_dump_init(void)
 	address_markers[KASAN_SHADOW_START_NR].start_address = KASAN_SHADOW_START;
 	address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
 #endif
+	address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
 #endif
 #ifdef CONFIG_X86_32
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index aa7d279321ea..44f9c6781c15 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -879,10 +879,9 @@ void __init early_ioremap_init(void)
 	pmd_t *pmd;
 
 #ifdef CONFIG_X86_64
-	BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
-#else
-	WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
+	BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
 #endif
+	WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
 
 	early_ioremap_setup();
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index afab0bc7862b..726c0c369676 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -627,6 +627,19 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifdef CONFIG_X86_32
+/*
+ * Leave one empty page between vmalloc'ed areas and
+ * the start of the fixmap.
+ */
+#define __FIXADDR_TOP_BASE	0xfffff000
+#else
+#define __FIXADDR_TOP_BASE	(0xffffffffff600000UL - PAGE_SIZE)
+#endif
+
+unsigned long __FIXADDR_TOP = __FIXADDR_TOP_BASE;
+EXPORT_SYMBOL(__FIXADDR_TOP);
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index c234634e26ba..2b9a00976fee 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -65,9 +65,6 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 	flush_tlb_one_kernel(vaddr);
 }
 
-unsigned long __FIXADDR_TOP = 0xfffff000;
-EXPORT_SYMBOL(__FIXADDR_TOP);
-
 /*
  * vmalloc=size forces the vmalloc area to be exactly 'size'
  * bytes. This can be used to increase (or decrease) the
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 40/43] x86/boot: Fill kernel image puds dynamically
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (38 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 39/43] x86/fixmap: Unify FIXADDR_TOP Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 41/43] x86/mm: Sort address_markers array when X86 PIE is enabled Hou Wenlong
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Juergen Gross, Anshuman Khandual, Mike Rapoport,
	Josh Poimboeuf, Pasha Tatashin

A PIE kernel can be relocated to any address. Later in this series, the
kernel image may be moved below the top 2G, so fill the kernel image
PUDs dynamically.
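
A minimal sketch of the difference (SYM_ABS_VAL() and text_base come
from earlier patches in this series; this only restates the intent of
the hunk below):

	/*
	 * Non-PIE: the image always sits in the top 2G
	 * (__START_KERNEL_map is 0xffffffff80000000, i.e. pud_index 510),
	 * so the two PUD entries are known at build time and only need
	 * the load delta applied:
	 */
	pud[510] += load_delta;		/* level2_kernel_pgt */
	pud[511] += load_delta;		/* level2_fixmap_pgt */

	/*
	 * PIE: the image may land in any 1G slot, so compute the index
	 * from the run-time text base instead:
	 */
	i = pud_index(text_base);
	pgtable_flags = _KERNPG_TABLE_NOENC - __START_KERNEL_map + load_delta;
	pud[i]     = pgtable_flags + SYM_ABS_VAL(level2_kernel_pgt);
	pud[i + 1] = pgtable_flags + SYM_ABS_VAL(level2_fixmap_pgt);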

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/kernel/head64.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 8295b547b64f..c5cd61aab8ae 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -239,8 +239,18 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	}
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
-	pud[510] += load_delta;
-	pud[511] += load_delta;
+	if (IS_ENABLED(CONFIG_X86_PIE)) {
+		pud[510] = 0;
+		pud[511] = 0;
+
+		i = pud_index(text_base);
+		pgtable_flags = _KERNPG_TABLE_NOENC - __START_KERNEL_map + load_delta;
+		pud[i] = pgtable_flags + SYM_ABS_VAL(level2_kernel_pgt);
+		pud[i + 1] = pgtable_flags + SYM_ABS_VAL(level2_fixmap_pgt);
+	} else {
+		pud[510] += load_delta;
+		pud[511] += load_delta;
+	}
 
 	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
 	for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 41/43] x86/mm: Sort address_markers array when X86 PIE is enabled
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (39 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 40/43] x86/boot: Fill kernel image puds dynamically Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 42/43] x86/pie: Allow kernel image to be relocated in top 512G Hou Wenlong
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin

When X86 PIE is enabled, the kernel image is allowed to be relocated
within the top 512G, so the kernel image address could be below the EFI
range address. Sort the address_markers array to keep the entries in
ascending order.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/mm/dump_pagetables.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index df1a708a038a..81aa1c0b39cc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -17,6 +17,7 @@
 #include <linux/highmem.h>
 #include <linux/pci.h>
 #include <linux/ptdump.h>
+#include <linux/sort.h>
 
 #include <asm/e820/types.h>
 
@@ -436,6 +437,27 @@ void ptdump_walk_pgd_level_checkwx(void)
 	ptdump_walk_pgd_level_core(NULL, &init_mm, INIT_PGD, true, false);
 }
 
+#ifdef CONFIG_X86_PIE
+static int __init address_markers_sort_cmp(const void *pa, const void *pb)
+{
+	struct addr_marker *a = (struct addr_marker *)pa;
+	struct addr_marker *b = (struct addr_marker *)pb;
+
+	return (a->start_address > b->start_address) -
+	       (a->start_address < b->start_address);
+}
+
+static void __init address_markers_sort(void)
+{
+	sort(&address_markers[0], ARRAY_SIZE(address_markers), sizeof(address_markers[0]),
+	     address_markers_sort_cmp, NULL);
+}
+#else
+static void __init address_markers_sort(void)
+{
+}
+#endif
+
 static int __init pt_dump_init(void)
 {
 	/*
@@ -467,6 +489,8 @@ static int __init pt_dump_init(void)
 	address_markers[LDT_NR].start_address = LDT_BASE_ADDR;
 # endif
 #endif
+	address_markers_sort();
+
 	return 0;
 }
 __initcall(pt_dump_init);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 42/43] x86/pie: Allow kernel image to be relocated in top 512G
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (40 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 41/43] x86/mm: Sort address_markers array when X86 PIE is enabled Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28  9:51 ` [PATCH RFC 43/43] x86/boot: Extend relocate range for PIE kernel image Hou Wenlong
  2023-04-28 15:22 ` [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Peter Zijlstra
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin,
	Andrey Konovalov, Vincenzo Frascino, Ard Biesheuvel, Darren Hart,
	Andy Shevchenko, Andrew Morton, Mike Rapoport (IBM),
	Guo Ren, Stafford Horne, David Hildenbrand, Juergen Gross,
	Anshuman Khandual, Josh Poimboeuf, Pasha Tatashin,
	David Woodhouse, Brian Gerst, XueBing Chen, Yuntao Wang,
	Jonathan McDowell, Jason A. Donenfeld, Dan Williams, Jane Chu,
	Davidlohr Bueso, Sean Christopherson, kasan-dev, linux-efi,
	platform-driver-x86

A PIE kernel image can be relocated to any address. To keep things
simple, treat the 2G area containing the kernel image, modules area and
fixmap area as a whole and allow it to be relocated within the top
512G. After that, the relocated kernel address may be below
__START_KERNEL_map, so use a global variable to store the base of the
relocated kernel image, and adapt the VA/PA transformation of kernel
image addresses accordingly.
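
The VA/PA conversion keeps the existing unsigned-wraparound trick, only
with the run-time base; a commented sketch of __phys_addr_nodebug()
after the change (a simplification of the hunk below, not a replacement
for it):

static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
	unsigned long y = x - KERNEL_MAP_BASE;

	/*
	 * Kernel image address (x >= KERNEL_MAP_BASE): the subtraction
	 * does not wrap, y is the offset into the image, x > y holds,
	 * and phys_base is added.
	 *
	 * Direct map address (PAGE_OFFSET <= x < KERNEL_MAP_BASE): the
	 * subtraction wraps, x > y is false, and the result reduces to
	 * x - PAGE_OFFSET.
	 */
	x = y + ((x > y) ? phys_base : (KERNEL_MAP_BASE - PAGE_OFFSET));

	return x;
}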

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/kmsan.h            |  6 ++---
 arch/x86/include/asm/page_64.h          |  8 +++----
 arch/x86/include/asm/page_64_types.h    |  8 +++++++
 arch/x86/include/asm/pgtable_64_types.h | 10 ++++----
 arch/x86/kernel/head64.c                | 32 ++++++++++++++++++-------
 arch/x86/kernel/head_64.S               | 12 ++++++++++
 arch/x86/kernel/setup.c                 |  6 +++++
 arch/x86/mm/dump_pagetables.c           |  9 ++++---
 arch/x86/mm/init_64.c                   |  8 +++----
 arch/x86/mm/kasan_init_64.c             |  4 ++--
 arch/x86/mm/pat/set_memory.c            |  2 +-
 arch/x86/mm/physaddr.c                  | 14 +++++------
 arch/x86/platform/efi/efi_thunk_64.S    |  4 ++++
 13 files changed, 87 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h
index 8fa6ac0e2d76..a635d825342d 100644
--- a/arch/x86/include/asm/kmsan.h
+++ b/arch/x86/include/asm/kmsan.h
@@ -63,16 +63,16 @@ static inline bool kmsan_phys_addr_valid(unsigned long addr)
 static inline bool kmsan_virt_addr_valid(void *addr)
 {
 	unsigned long x = (unsigned long)addr;
-	unsigned long y = x - __START_KERNEL_map;
+	unsigned long y = x - KERNEL_MAP_BASE;
 
-	/* use the carry flag to determine if x was < __START_KERNEL_map */
+	/* use the carry flag to determine if x was < KERNEL_MAP_BASE */
 	if (unlikely(x > y)) {
 		x = y + phys_base;
 
 		if (y >= KERNEL_IMAGE_SIZE)
 			return false;
 	} else {
-		x = y + (__START_KERNEL_map - PAGE_OFFSET);
+		x = y + (KERNEL_MAP_BASE - PAGE_OFFSET);
 
 		/* carry flag will be set if starting x was >= PAGE_OFFSET */
 		if ((x > y) || !kmsan_phys_addr_valid(x))
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cc6b8e087192..b8692e6cc939 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -20,10 +20,10 @@ extern unsigned long vmemmap_base;
 
 static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
 {
-	unsigned long y = x - __START_KERNEL_map;
+	unsigned long y = x - KERNEL_MAP_BASE;
 
-	/* use the carry flag to determine if x was < __START_KERNEL_map */
-	x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
+	/* use the carry flag to determine if x was < KERNEL_MAP_BASE */
+	x = y + ((x > y) ? phys_base : (KERNEL_MAP_BASE - PAGE_OFFSET));
 
 	return x;
 }
@@ -34,7 +34,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #else
 #define __phys_addr(x)		__phys_addr_nodebug(x)
 #define __phys_addr_symbol(x) \
-	((unsigned long)(x) - __START_KERNEL_map + phys_base)
+	((unsigned long)(x) - KERNEL_MAP_BASE + phys_base)
 #endif
 
 #define __phys_reloc_hide(x)	(x)
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index e9e2c3ba5923..933d37845064 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -4,6 +4,8 @@
 
 #ifndef __ASSEMBLY__
 #include <asm/kaslr.h>
+
+extern unsigned long kernel_map_base;
 #endif
 
 #ifdef CONFIG_KASAN
@@ -49,6 +51,12 @@
 
 #define __START_KERNEL_map	_AC(0xffffffff80000000, UL)
 
+#ifdef CONFIG_X86_PIE
+#define KERNEL_MAP_BASE		kernel_map_base
+#else
+#define KERNEL_MAP_BASE		__START_KERNEL_map
+#endif /* CONFIG_X86_PIE */
+
 /* See Documentation/x86/x86_64/mm.rst for a description of the memory map. */
 
 #define __PHYSICAL_MASK_SHIFT	52
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38bf837e3554..3d6951128a07 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -187,14 +187,16 @@ extern unsigned int ptrs_per_p4d;
 #define KMSAN_MODULES_ORIGIN_START	(KMSAN_MODULES_SHADOW_START + MODULES_LEN)
 #endif /* CONFIG_KMSAN */
 
-#define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
+#define RAW_MODULES_VADDR	(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
+#define MODULES_VADDR		(KERNEL_MAP_BASE + KERNEL_IMAGE_SIZE)
 /* The module sections ends with the start of the fixmap */
 #ifndef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
-# define MODULES_END		_AC(0xffffffffff000000, UL)
+# define RAW_MODULES_END       _AC(0xffffffffff000000, UL)
 #else
-# define MODULES_END		_AC(0xfffffffffe000000, UL)
+# define RAW_MODULES_END       _AC(0xfffffffffe000000, UL)
 #endif
-#define MODULES_LEN		(MODULES_END - MODULES_VADDR)
+#define MODULES_LEN		(RAW_MODULES_END - RAW_MODULES_VADDR)
+#define MODULES_END		(MODULES_VADDR + MODULES_LEN)
 
 #define ESPFIX_PGD_ENTRY	_AC(-2, UL)
 #define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c5cd61aab8ae..234ac796863a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -66,6 +66,11 @@ unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
 EXPORT_SYMBOL(vmemmap_base);
 #endif
 
+#ifdef CONFIG_X86_PIE
+unsigned long kernel_map_base __ro_after_init = __START_KERNEL_map;
+EXPORT_SYMBOL(kernel_map_base);
+#endif
+
 /*
  * GDT used on the boot CPU before switching to virtual addresses.
  */
@@ -193,6 +198,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 {
 	unsigned long load_delta, *p;
 	unsigned long pgtable_flags;
+	unsigned long kernel_map_base_offset = 0;
 	pgdval_t *pgd;
 	p4dval_t *p4d;
 	pudval_t *pud;
@@ -252,6 +258,13 @@ unsigned long __head __startup_64(unsigned long physaddr,
 		pud[511] += load_delta;
 	}
 
+#ifdef CONFIG_X86_PIE
+	kernel_map_base_offset = text_base & PUD_MASK;
+	*fixup_long(&kernel_map_base, physaddr) = kernel_map_base_offset;
+	kernel_map_base_offset -= __START_KERNEL_map;
+	*fixup_long(&__FIXADDR_TOP, physaddr) += kernel_map_base_offset;
+#endif
+
 	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
 	for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
 		pmd[i] += load_delta;
@@ -328,7 +341,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	/* fixup pages that are part of the kernel image */
 	for (; i <= pmd_index(end_base); i++)
 		if (pmd[i] & _PAGE_PRESENT)
-			pmd[i] += load_delta;
+			pmd[i] += load_delta + kernel_map_base_offset;
 
 	/* invalidate pages after the kernel image */
 	for (; i < PTRS_PER_PMD; i++)
@@ -338,7 +351,8 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 * Fixup phys_base - remove the memory encryption mask to obtain
 	 * the true physical address.
 	 */
-	*fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
+	*fixup_long(&phys_base, physaddr) += load_delta + kernel_map_base_offset -
+					     sme_get_me_mask();
 
 	return sme_postprocess_startup(bp, pmd);
 }
@@ -376,7 +390,7 @@ bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 	if (!pgtable_l5_enabled())
 		p4d_p = pgd_p;
 	else if (pgd)
-		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + KERNEL_MAP_BASE - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -385,13 +399,13 @@ bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 
 		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
-		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*pgd_p = (pgdval_t)p4d_p - KERNEL_MAP_BASE + phys_base + _KERNPG_TABLE;
 	}
 	p4d_p += p4d_index(address);
 	p4d = *p4d_p;
 
 	if (p4d)
-		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + KERNEL_MAP_BASE - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -400,13 +414,13 @@ bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - KERNEL_MAP_BASE + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
 
 	if (pud)
-		pmd_p = (pmdval_t *)((pud & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+		pmd_p = (pmdval_t *)((pud & PTE_PFN_MASK) + KERNEL_MAP_BASE - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -415,7 +429,7 @@ bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 
 		pmd_p = (pmdval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pmd_p, 0, sizeof(*pmd_p) * PTRS_PER_PMD);
-		*pud_p = (pudval_t)pmd_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*pud_p = (pudval_t)pmd_p - KERNEL_MAP_BASE + phys_base + _KERNPG_TABLE;
 	}
 	pmd_p[pmd_index(address)] = pmd;
 
@@ -497,6 +511,7 @@ static void __init copy_bootdata(char *real_mode_data)
 
 asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode_data)
 {
+#ifndef CONFIG_X86_PIE
 	/*
 	 * Build-time sanity checks on the kernel image and module
 	 * area mappings. (these are purely build-time and produce no code)
@@ -509,6 +524,7 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
 	BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
 	MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
 				(__START_KERNEL & PGDIR_MASK)));
+#endif
 
 	cr4_init_shadow();
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 19cb2852238b..feb14304d1ed 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -130,7 +130,13 @@ SYM_CODE_START_NOALIGN(startup_64)
 	popq	%rsi
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
+#ifdef CONFIG_X86_PIE
+	movq	kernel_map_base(%rip), %rdi
+	movabs	$early_top_pgt, %rcx
+	subq	%rdi, %rcx
+#else
 	movabs  $(early_top_pgt - __START_KERNEL_map), %rcx
+#endif
 	addq    %rcx, %rax
 	jmp 1f
 SYM_CODE_END(startup_64)
@@ -179,7 +185,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 #endif
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
+#ifdef CONFIG_X86_PIE
+	movq	kernel_map_base(%rip), %rdi
+	movabs	$init_top_pgt, %rcx
+	subq	%rdi, %rcx
+#else
 	movabs	$(init_top_pgt - __START_KERNEL_map), %rcx
+#endif
 	addq    %rcx, %rax
 1:
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16babff771bd..e68ca78b829c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -808,11 +808,17 @@ static int
 dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
 {
 	if (kaslr_enabled()) {
+#ifdef CONFIG_X86_PIE
+		pr_emerg("Kernel Offset: 0x%lx from 0x%lx\n",
+			kaslr_offset(),
+			__START_KERNEL);
+#else
 		pr_emerg("Kernel Offset: 0x%lx from 0x%lx (relocation range: 0x%lx-0x%lx)\n",
 			 kaslr_offset(),
 			 __START_KERNEL,
 			 __START_KERNEL_map,
 			 MODULES_VADDR-1);
+#endif
 	} else {
 		pr_emerg("Kernel Offset: disabled\n");
 	}
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 81aa1c0b39cc..d5c6f61242aa 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -102,9 +102,9 @@ static struct addr_marker address_markers[] = {
 #ifdef CONFIG_EFI
 	[EFI_END_NR]		= { EFI_VA_END,		"EFI Runtime Services" },
 #endif
-	[HIGH_KERNEL_NR]	= { __START_KERNEL_map,	"High Kernel Mapping" },
-	[MODULES_VADDR_NR]	= { MODULES_VADDR,	"Modules" },
-	[MODULES_END_NR]	= { MODULES_END,	"End Modules" },
+	[HIGH_KERNEL_NR]	= { 0UL,		"High Kernel Mapping" },
+	[MODULES_VADDR_NR]	= { 0UL,		"Modules" },
+	[MODULES_END_NR]	= { 0UL,		"End Modules" },
 	[FIXADDR_START_NR]	= { 0UL,		"Fixmap Area" },
 	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
@@ -475,6 +475,9 @@ static int __init pt_dump_init(void)
 	address_markers[KASAN_SHADOW_START_NR].start_address = KASAN_SHADOW_START;
 	address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
 #endif
+	address_markers[HIGH_KERNEL_NR].start_address = KERNEL_MAP_BASE;
+	address_markers[MODULES_VADDR_NR].start_address = MODULES_VADDR;
+	address_markers[MODULES_END_NR].start_address = MODULES_END;
 	address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
 #endif
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b7fd05a1ba1d..54bcd46c229d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -413,7 +413,7 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
 /*
  * The head.S code sets up the kernel high mapping:
  *
- *   from __START_KERNEL_map to __START_KERNEL_map + size (== _end-_text)
+ *   from KERNEL_MAP_BASE to KERNEL_MAP_BASE + size (== _end-_text)
  *
  * phys_base holds the negative offset to the kernel, which is added
  * to the compile time generated pmds. This results in invalid pmds up
@@ -425,8 +425,8 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
  */
 void __init cleanup_highmap(void)
 {
-	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long vaddr_end = __START_KERNEL_map + KERNEL_IMAGE_SIZE;
+	unsigned long vaddr = KERNEL_MAP_BASE;
+	unsigned long vaddr_end = KERNEL_MAP_BASE + KERNEL_IMAGE_SIZE;
 	unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
 
@@ -436,7 +436,7 @@ void __init cleanup_highmap(void)
 	 *	arch/x86/xen/mmu.c:xen_setup_kernel_pagetable().
 	 */
 	if (max_pfn_mapped)
-		vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+		vaddr_end = KERNEL_MAP_BASE + (max_pfn_mapped << PAGE_SHIFT);
 
 	for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0302491d799d..0edc8fdfb419 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -197,7 +197,7 @@ static inline p4d_t *early_p4d_offset(pgd_t *pgd, unsigned long addr)
 		return (p4d_t *)pgd;
 
 	p4d = pgd_val(*pgd) & PTE_PFN_MASK;
-	p4d += __START_KERNEL_map - phys_base;
+	p4d += KERNEL_MAP_BASE - phys_base;
 	return (p4d_t *)p4d + p4d_index(addr);
 }
 
@@ -420,7 +420,7 @@ void __init kasan_init(void)
 			      shadow_cea_per_cpu_begin, 0);
 
 	kasan_populate_early_shadow((void *)shadow_cea_end,
-			kasan_mem_to_shadow((void *)__START_KERNEL_map));
+			kasan_mem_to_shadow((void *)KERNEL_MAP_BASE));
 
 	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext),
 			      (unsigned long)kasan_mem_to_shadow(_end),
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index c434aea9939c..2fb89be3a750 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1709,7 +1709,7 @@ static int cpa_process_alias(struct cpa_data *cpa)
 	if (!within(vaddr, (unsigned long)_text, _brk_end) &&
 	    __cpa_pfn_in_highmap(cpa->pfn)) {
 		unsigned long temp_cpa_vaddr = (cpa->pfn << PAGE_SHIFT) +
-					       __START_KERNEL_map - phys_base;
+					       KERNEL_MAP_BASE - phys_base;
 		alias_cpa = *cpa;
 		alias_cpa.vaddr = &temp_cpa_vaddr;
 		alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index fc3f3d3e2ef2..9cb6d898329c 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -14,15 +14,15 @@
 #ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
-	unsigned long y = x - __START_KERNEL_map;
+	unsigned long y = x - KERNEL_MAP_BASE;
 
-	/* use the carry flag to determine if x was < __START_KERNEL_map */
+	/* use the carry flag to determine if x was < KERNEL_MAP_BASE */
 	if (unlikely(x > y)) {
 		x = y + phys_base;
 
 		VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
 	} else {
-		x = y + (__START_KERNEL_map - PAGE_OFFSET);
+		x = y + (KERNEL_MAP_BASE - PAGE_OFFSET);
 
 		/* carry flag will be set if starting x was >= PAGE_OFFSET */
 		VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
@@ -34,7 +34,7 @@ EXPORT_SYMBOL(__phys_addr);
 
 unsigned long __phys_addr_symbol(unsigned long x)
 {
-	unsigned long y = x - __START_KERNEL_map;
+	unsigned long y = x - KERNEL_MAP_BASE;
 
 	/* only check upper bounds since lower bounds will trigger carry */
 	VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
@@ -46,16 +46,16 @@ EXPORT_SYMBOL(__phys_addr_symbol);
 
 bool __virt_addr_valid(unsigned long x)
 {
-	unsigned long y = x - __START_KERNEL_map;
+	unsigned long y = x - KERNEL_MAP_BASE;
 
-	/* use the carry flag to determine if x was < __START_KERNEL_map */
+	/* use the carry flag to determine if x was < KERNEL_MAP_BASE */
 	if (unlikely(x > y)) {
 		x = y + phys_base;
 
 		if (y >= KERNEL_IMAGE_SIZE)
 			return false;
 	} else {
-		x = y + (__START_KERNEL_map - PAGE_OFFSET);
+		x = y + (KERNEL_MAP_BASE - PAGE_OFFSET);
 
 		/* carry flag will be set if starting x was >= PAGE_OFFSET */
 		if ((x > y) || !phys_addr_valid(x))
diff --git a/arch/x86/platform/efi/efi_thunk_64.S b/arch/x86/platform/efi/efi_thunk_64.S
index c4b1144f99f6..0997363821e7 100644
--- a/arch/x86/platform/efi/efi_thunk_64.S
+++ b/arch/x86/platform/efi/efi_thunk_64.S
@@ -52,7 +52,11 @@ STACK_FRAME_NON_STANDARD __efi64_thunk
 	/*
 	 * Calculate the physical address of the kernel text.
 	 */
+#ifdef CONFIG_X86_PIE
+	movq	kernel_map_base(%rip), %rax
+#else
 	movq	$__START_KERNEL_map, %rax
+#endif
 	subq	phys_base(%rip), %rax
 
 	leaq	1f(%rip), %rbp
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 43/43] x86/boot: Extend relocate range for PIE kernel image
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (41 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 42/43] x86/pie: Allow kernel image to be relocated in top 512G Hou Wenlong
@ 2023-04-28  9:51 ` Hou Wenlong
  2023-04-28 15:22 ` [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Peter Zijlstra
  43 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-28  9:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Hou Wenlong,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Jonathan Corbet, Wang Yong, Masahiro Yamada,
	Jiapeng Chong, Alexander Lobakin, Mike Rapoport, Michael Roth,
	David Hildenbrand, Nikunj A Dadhania, Kirill A. Shutemov,
	linux-doc

Allow the PIE kernel image to be relocated into the unused holes in the
top 512G of the address space.
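
A short sketch of how the extra offset is consumed during decompression
(a simplification of the handle_relocations() change below; virt_addr is
the KASLR-chosen virtual address within the top 2G):

	/*
	 * pie_randomize() returns (chosen slot base - __START_KERNEL_map),
	 * a PUD-aligned value, so the whole 2G window (kernel image,
	 * modules and fixmap) is shifted to the chosen slot on top of
	 * the normal in-window KASLR offset:
	 */
	delta = virt_addr - LOAD_PHYSICAL_ADDR;	/* KASLR within the 2G window */
	delta += pie_randomize();		/* move the window itself */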

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Thomas Garnier <thgarnie@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
---
 Documentation/x86/x86_64/mm.rst  |  4 +++
 arch/x86/Kconfig                 | 11 +++++++
 arch/x86/boot/compressed/kaslr.c | 55 ++++++++++++++++++++++++++++++++
 arch/x86/boot/compressed/misc.c  |  4 ++-
 arch/x86/boot/compressed/misc.h  |  9 ++++++
 5 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
index 35e5e18c83d0..b456501a5b69 100644
--- a/Documentation/x86/x86_64/mm.rst
+++ b/Documentation/x86/x86_64/mm.rst
@@ -149,6 +149,10 @@ Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
 physical memory, vmalloc/ioremap space and virtual memory map are randomized.
 Their order is preserved but their base will be offset early at boot time.
 
+Note that if EXTENDED_RANDOMIZE_BASE is enabled, the kernel image area,
+including the kernel image, module area and fixmap area, is randomized as a
+whole in the top 512G of the address space.
+
 Be very careful vs. KASLR when changing anything here. The KASLR address
 range must not overlap with anything except the KASAN shadow area, which is
 correct as KASAN disables KASLR.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9f8020991184..6d18d4333389 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2266,6 +2266,17 @@ config RANDOMIZE_BASE
 
 	  If unsure, say Y.
 
+config EXTENDED_RANDOMIZE_BASE
+	bool "Randomize the address of the kernel image (PIE)"
+	default y
+	depends on X86_PIE && RANDOMIZE_BASE
+	help
+	  This packs the kernel image, module area and fixmap area together
+	  as a whole, and allows it to be randomized in the top 512G of the
+	  virtual address space when PIE is enabled.
+
+	  If unsure, say Y.
+
 # Relocation on x86 needs some additional build support
 config X86_NEED_RELOCS
 	def_bool y
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..e0e092fe7fe2 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -871,3 +871,58 @@ void choose_random_location(unsigned long input,
 		random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
 	*virt_addr = random_addr;
 }
+
+#ifdef CONFIG_EXTENDED_RANDOMIZE_BASE
+struct kernel_image_slot {
+	unsigned long start;
+	unsigned long end;
+	unsigned long pud_slots;
+};
+
+/*
+ * Currently, there are two unused holes in the top 512G, see
+ * Documentation/x86/x86_64/mm.rst; use a hole as the kernel image base.
+ */
+struct kernel_image_slot available_slots[] = {
+	{
+		.start = 0xffffff8000000000UL,
+		.end = 0xffffffeeffffffffUL,
+	},
+	{
+		.start = 0xffffffff00000000UL,
+		.end = 0xffffffffffffffffUL,
+	},
+};
+
+unsigned long pie_randomize(void)
+{
+	unsigned long total, slot;
+	int i;
+
+	if (cmdline_find_option_bool("nokaslr"))
+		return 0;
+
+	total = 0;
+	for (i = 0; i < ARRAY_SIZE(available_slots); i++) {
+		available_slots[i].pud_slots = (available_slots[i].end -
+						available_slots[i].start + 1UL) /
+						PUD_SIZE - 1UL;
+		total += available_slots[i].pud_slots;
+	}
+
+	slot = kaslr_get_random_long("PIE slot") % total;
+	for (i = 0; i < ARRAY_SIZE(available_slots); i++) {
+		if (slot < available_slots[i].pud_slots)
+			break;
+
+		slot -= available_slots[i].pud_slots;
+	}
+
+	if (i == ARRAY_SIZE(available_slots) || slot >= available_slots[i].pud_slots) {
+		warn("PIE randomize disabled: available slots are bad!");
+		return 0;
+	}
+
+	return (available_slots[i].start + slot * PUD_SIZE) - __START_KERNEL_map;
+}
+#endif
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b..e111b55edb8b 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -210,8 +210,10 @@ static void handle_relocations(void *output, unsigned long output_len,
 	 * needed if KASLR has chosen a different starting address offset
 	 * from __START_KERNEL_map.
 	 */
-	if (IS_ENABLED(CONFIG_X86_64))
+	if (IS_ENABLED(CONFIG_X86_64)) {
 		delta = virt_addr - LOAD_PHYSICAL_ADDR;
+		delta += pie_randomize();
+	}
 
 	if (!delta) {
 		debug_putstr("No relocation needed... ");
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 2f155a0e3041..f50717092902 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -113,6 +113,15 @@ static inline void choose_random_location(unsigned long input,
 }
 #endif
 
+#ifdef CONFIG_EXTENDED_RANDOMIZE_BASE
+unsigned long pie_randomize(void);
+#else
+static inline unsigned long pie_randomize(void)
+{
+	return 0;
+}
+#endif
+
 /* cpuflags.c */
 bool has_cpuflag(int flag);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support
  2023-04-28  9:51 ` [PATCH RFC 33/43] objtool: Add validation for x86 PIE support Hou Wenlong
@ 2023-04-28 10:28   ` Christophe Leroy
  2023-04-28 11:43     ` Peter Zijlstra
  2023-04-29  3:52     ` Hou Wenlong
  0 siblings, 2 replies; 80+ messages in thread
From: Christophe Leroy @ 2023-04-28 10:28 UTC (permalink / raw)
  To: Hou Wenlong, linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Masahiro Yamada, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Josh Poimboeuf, Peter Zijlstra,
	Sathvika Vasireddy, Thomas Weißschuh, linux-kbuild



On 28/04/2023 at 11:51, Hou Wenlong wrote:
> 
> For x86 PIE binary, only RIP-relative addressing is allowed, however,
> there are still a little absolute references of R_X86_64_64 relocation
> type for data section and a little absolute references of R_X86_64_32S
> relocation type in pvh_start_xen() function.
> 
> Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>   arch/x86/Kconfig                        |  1 +
>   scripts/Makefile.lib                    |  1 +
>   tools/objtool/builtin-check.c           |  4 +-
>   tools/objtool/check.c                   | 82 +++++++++++++++++++++++++
>   tools/objtool/include/objtool/builtin.h |  1 +
>   5 files changed, 88 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 715f0734d065..b753a54e5ea7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2224,6 +2224,7 @@ config RELOCATABLE
>   config X86_PIE
>          def_bool n
>          depends on X86_64
> +       select OBJTOOL if HAVE_OBJTOOL
> 
>   config RANDOMIZE_BASE
>          bool "Randomize the address of the kernel image (KASLR)"
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 100a386fcd71..e3c804fbc421 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE)              += --static-call
>   objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION)         += --uaccess
>   objtool-args-$(CONFIG_GCOV_KERNEL)                     += --no-unreachable
>   objtool-args-$(CONFIG_PREFIX_SYMBOLS)                  += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
> +objtool-args-$(CONFIG_X86_PIE)                         += --pie
> 
>   objtool-args = $(objtool-args-y)                                       \
>          $(if $(delay-objtool), --link)                                  \
> diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
> index 7c175198d09f..1cf1d00464e0 100644
> --- a/tools/objtool/builtin-check.c
> +++ b/tools/objtool/builtin-check.c
> @@ -81,6 +81,7 @@ static const struct option check_options[] = {
>          OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
>          OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
>          OPT_BOOLEAN(0  , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
> +       OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
>          OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),
> 
>          OPT_GROUP("Options:"),
> @@ -137,7 +138,8 @@ static bool opts_valid(void)
>              opts.sls                    ||
>              opts.stackval               ||
>              opts.static_call            ||
> -           opts.uaccess) {
> +           opts.uaccess                ||
> +           opts.pie) {
>                  if (opts.dump_orc) {
>                          ERROR("--dump can't be combined with other options");
>                          return false;
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 5b600bbf2389..d67b80251eec 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
>          for (insn = next_insn_same_sec(file, insn); insn;               \
>               insn = next_insn_same_sec(file, insn))
> 
> +static struct instruction *find_insn_containing(struct objtool_file *file,
> +                                               struct section *sec,
> +                                               unsigned long offset)
> +{
> +       struct instruction *insn;
> +
> +       insn = find_insn(file, sec, 0);
> +       if (!insn)
> +               return NULL;
> +
> +       sec_for_each_insn_from(file, insn) {
> +               if (insn->offset > offset)
> +                       return NULL;
> +               if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> +                       return insn;
> +       }
> +
> +       return NULL;
> +}
> +
> +
>   static inline struct symbol *insn_call_dest(struct instruction *insn)
>   {
>          if (insn->type == INSN_JUMP_DYNAMIC ||
> @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
>          return 0;
>   }
> 
> +static int is_in_pvh_code(struct instruction *insn)
> +{
> +       struct symbol *sym = insn->sym;
> +
> +       return sym && !strcmp(sym->name, "pvh_start_xen");
> +}
> +
> +static int validate_pie(struct objtool_file *file)
> +{
> +       struct section *sec;
> +       struct reloc *reloc;
> +       struct instruction *insn;
> +       int warnings = 0;
> +
> +       for_each_sec(file, sec) {
> +               if (!sec->reloc)
> +                       continue;
> +               if (!(sec->sh.sh_flags & SHF_ALLOC))
> +                       continue;
> +
> +               list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> +                       switch (reloc->type) {
> +                       case R_X86_64_NONE:
> +                       case R_X86_64_PC32:
> +                       case R_X86_64_PLT32:
> +                       case R_X86_64_64:
> +                       case R_X86_64_PC64:
> +                       case R_X86_64_GOTPCREL:
> +                               break;
> +                       case R_X86_64_32:
> +                       case R_X86_64_32S:

That looks very specific to X86, should it go in another place?

If it can work for any architecture, can you add generic macros, just 
like commit c1449735211d ("objtool: Use macros to define arch specific 
reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount 
specific implementation") ?
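
For illustration, an arch-specific hook might look roughly like the
sketch below (hypothetical name and placement, just following the
pattern of those commits; not an actual proposal):

/* e.g. in tools/objtool/arch/x86/ -- hypothetical helper */
bool arch_pie_reloc_disallowed(struct reloc *reloc)
{
	switch (reloc->type) {
	case R_X86_64_32:
	case R_X86_64_32S:
		return true;	/* absolute 32-bit relocs break PIE */
	default:
		return false;
	}
}

so that validate_pie() in check.c keeps only the generic logic and each
architecture supplies its own list of relocations that are incompatible
with PIE.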

> +                               insn = find_insn_containing(file, sec, reloc->offset);
> +                               if (!insn) {
> +                                       WARN("can't find relocate insn near %s+0x%lx",
> +                                            sec->name, reloc->offset);
> +                               } else {
> +                                       if (is_in_pvh_code(insn))
> +                                               break;
> +                                       WARN("insn at %s+0x%lx is not compatible with PIE",
> +                                            sec->name, insn->offset);
> +                               }
> +                               warnings++;
> +                               break;
> +                       default:
> +                               WARN("unexpected relocation type %d at %s+0x%lx",
> +                                    reloc->type, sec->name, reloc->offset);
> +                               warnings++;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       return warnings;
> +}
> +
>   int check(struct objtool_file *file)
>   {
>          int ret, warnings = 0;
> @@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
>                  warnings += ret;
>          }
> 
> +       if (opts.pie) {
> +               ret = validate_pie(file);
> +               if (ret < 0)
> +                       return ret;
> +               warnings += ret;
> +       }
> 
>          if (opts.stats) {
>                  printf("nr_insns_visited: %ld\n", nr_insns_visited);
> diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
> index 2a108e648b7a..1151211a5cea 100644
> --- a/tools/objtool/include/objtool/builtin.h
> +++ b/tools/objtool/include/objtool/builtin.h
> @@ -26,6 +26,7 @@ struct opts {
>          bool uaccess;
>          int prefix;
>          bool cfi;
> +       bool pie;
> 
>          /* options: */
>          bool backtrace;
> --
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 07/43] x86/acpi: Adapt assembly for PIE support
  2023-04-28  9:50 ` [PATCH RFC 07/43] x86/acpi: " Hou Wenlong
@ 2023-04-28 11:32   ` Rafael J. Wysocki
  0 siblings, 0 replies; 80+ messages in thread
From: Rafael J. Wysocki @ 2023-04-28 11:32 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-pm, linux-acpi

On Fri, Apr 28, 2023 at 11:52 AM Hou Wenlong
<houwenlong.hwl@antgroup.com> wrote:
>
> From: Thomas Garnier <thgarnie@chromium.org>
>
> From: Thomas Garnier <thgarnie@chromium.org>
>
> Change the assembly code to use only relative references of symbols for the
> kernel to be PIE compatible.
>
> Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/kernel/acpi/wakeup_64.S | 31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
> index d5d8a352eafa..fe688bd87d72 100644
> --- a/arch/x86/kernel/acpi/wakeup_64.S
> +++ b/arch/x86/kernel/acpi/wakeup_64.S
> @@ -17,7 +17,7 @@
>          * Hooray, we are in Long 64-bit mode (but still running in low memory)
>          */
>  SYM_FUNC_START(wakeup_long64)
> -       movq    saved_magic, %rax
> +       movq    saved_magic(%rip), %rax
>         movq    $0x123456789abcdef0, %rdx
>         cmpq    %rdx, %rax
>         je      2f
> @@ -33,14 +33,14 @@ SYM_FUNC_START(wakeup_long64)
>         movw    %ax, %es
>         movw    %ax, %fs
>         movw    %ax, %gs
> -       movq    saved_rsp, %rsp
> +       movq    saved_rsp(%rip), %rsp
>
> -       movq    saved_rbx, %rbx
> -       movq    saved_rdi, %rdi
> -       movq    saved_rsi, %rsi
> -       movq    saved_rbp, %rbp
> +       movq    saved_rbx(%rip), %rbx
> +       movq    saved_rdi(%rip), %rdi
> +       movq    saved_rsi(%rip), %rsi
> +       movq    saved_rbp(%rip), %rbp
>
> -       movq    saved_rip, %rax
> +       movq    saved_rip(%rip), %rax
>         ANNOTATE_RETPOLINE_SAFE
>         jmp     *%rax
>  SYM_FUNC_END(wakeup_long64)
> @@ -51,7 +51,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
>         xorl    %eax, %eax
>         call    save_processor_state
>
> -       movq    $saved_context, %rax
> +       leaq    saved_context(%rip), %rax
>         movq    %rsp, pt_regs_sp(%rax)
>         movq    %rbp, pt_regs_bp(%rax)
>         movq    %rsi, pt_regs_si(%rax)
> @@ -70,13 +70,14 @@ SYM_FUNC_START(do_suspend_lowlevel)
>         pushfq
>         popq    pt_regs_flags(%rax)
>
> -       movq    $.Lresume_point, saved_rip(%rip)
> +       leaq    .Lresume_point(%rip), %rax
> +       movq    %rax, saved_rip(%rip)
>
> -       movq    %rsp, saved_rsp
> -       movq    %rbp, saved_rbp
> -       movq    %rbx, saved_rbx
> -       movq    %rdi, saved_rdi
> -       movq    %rsi, saved_rsi
> +       movq    %rsp, saved_rsp(%rip)
> +       movq    %rbp, saved_rbp(%rip)
> +       movq    %rbx, saved_rbx(%rip)
> +       movq    %rdi, saved_rdi(%rip)
> +       movq    %rsi, saved_rsi(%rip)
>
>         addq    $8, %rsp
>         movl    $3, %edi
> @@ -88,7 +89,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
>         .align 4
>  .Lresume_point:
>         /* We don't restore %rax, it must be 0 anyway */
> -       movq    $saved_context, %rax
> +       leaq    saved_context(%rip), %rax
>         movq    saved_context_cr4(%rax), %rbx
>         movq    %rbx, %cr4
>         movq    saved_context_cr3(%rax), %rbx
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support
  2023-04-28 10:28   ` Christophe Leroy
@ 2023-04-28 11:43     ` Peter Zijlstra
  2023-04-29  4:04       ` Hou Wenlong
  2023-04-29  3:52     ` Hou Wenlong
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-04-28 11:43 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Hou Wenlong, linux-kernel, Thomas Garnier, Lai Jiangshan,
	Kees Cook, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Masahiro Yamada,
	Nathan Chancellor, Nick Desaulniers, Nicolas Schier,
	Josh Poimboeuf, Sathvika Vasireddy, Thomas Weißschuh,
	linux-kbuild

On Fri, Apr 28, 2023 at 10:28:19AM +0000, Christophe Leroy wrote:


> > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > index 5b600bbf2389..d67b80251eec 100644
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> >          for (insn = next_insn_same_sec(file, insn); insn;               \
> >               insn = next_insn_same_sec(file, insn))
> > 
> > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > +                                               struct section *sec,
> > +                                               unsigned long offset)
> > +{
> > +       struct instruction *insn;
> > +
> > +       insn = find_insn(file, sec, 0);
> > +       if (!insn)
> > +               return NULL;
> > +
> > +       sec_for_each_insn_from(file, insn) {
> > +               if (insn->offset > offset)
> > +                       return NULL;
> > +               if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > +                       return insn;
> > +       }
> > +
> > +       return NULL;
> > +}

Urgh, this is horrendous crap. Yes you're only using it in case of a
warning, but adding a function like this makes it appear like it's
actually sane to use.

A far better implementation -- but still not stellar -- would be
something like:

	sym = find_symbol_containing(sec, offset);
	if (!sym)
		// fail
	sym_for_each_insn(file, sym, insn) {
		...
	}

But given insn_hash uses sec_offset_hash() you can do something similar
to find_reloc_by_dest_range()

	start = offset - (INSN_MAX_SIZE - 1);
	for_offset_range(o, start, start + INSN_MAX_SIZE) {
		hash_for_each_possible(file->insn_hash, insn, hash, sec_offset_hash(sec, o)) {
			if (insn->sec != sec)
				continue;

			if (insn->offset <= offset &&
			    insn->offset + insn->len > offset)
				return insn;
		}
	}
	return NULL;

> > +
> > +
> >   static inline struct symbol *insn_call_dest(struct instruction *insn)
> >   {
> >          if (insn->type == INSN_JUMP_DYNAMIC ||
> > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> >          return 0;
> >   }
> > 
> > +static int is_in_pvh_code(struct instruction *insn)
> > +{
> > +       struct symbol *sym = insn->sym;
> > +
> > +       return sym && !strcmp(sym->name, "pvh_start_xen");
> > +}
> > +
> > +static int validate_pie(struct objtool_file *file)
> > +{
> > +       struct section *sec;
> > +       struct reloc *reloc;
> > +       struct instruction *insn;
> > +       int warnings = 0;
> > +
> > +       for_each_sec(file, sec) {
> > +               if (!sec->reloc)
> > +                       continue;
> > +               if (!(sec->sh.sh_flags & SHF_ALLOC))
> > +                       continue;
> > +
> > +               list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > +                       switch (reloc->type) {
> > +                       case R_X86_64_NONE:
> > +                       case R_X86_64_PC32:
> > +                       case R_X86_64_PLT32:
> > +                       case R_X86_64_64:
> > +                       case R_X86_64_PC64:
> > +                       case R_X86_64_GOTPCREL:
> > +                               break;
> > +                       case R_X86_64_32:
> > +                       case R_X86_64_32S:
> 
> That looks very specific to X86, should it go at another place ?
> 
> If it can work for any architecture, can you add generic macros, just 
> like commit c1449735211d ("objtool: Use macros to define arch specific 
> reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount 
> specific implementation") ?

Yes, this should be something like arch_PIE_reloc() or so. Similar to
arch_pc_relative_reloc().
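
A minimal sketch of such a hook, assuming it lives in the x86 objtool code
and simply mirrors the switch in validate_pie() above (the name
arch_pie_reloc_ok() is made up here, not existing objtool API):

	/* Return true if this relocation type is acceptable in a PIE image. */
	static bool arch_pie_reloc_ok(struct reloc *reloc)
	{
		switch (reloc->type) {
		case R_X86_64_NONE:
		case R_X86_64_PC32:
		case R_X86_64_PLT32:
		case R_X86_64_64:
		case R_X86_64_PC64:
		case R_X86_64_GOTPCREL:
			return true;
		default:
			return false;
		}
	}

validate_pie() itself could then stay arch independent and only warn when
the hook rejects a relocation.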

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support
  2023-04-28  9:51 ` [PATCH RFC 21/43] x86/ftrace: Adapt assembly " Hou Wenlong
@ 2023-04-28 13:37   ` Steven Rostedt
  2023-04-29  3:43     ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Steven Rostedt @ 2023-04-28 13:37 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-trace-kernel

On Fri, 28 Apr 2023 17:51:01 +0800
"Hou Wenlong" <houwenlong.hwl@antgroup.com> wrote:

> Change the assembly code to use only relative references of symbols for
> the kernel to be PIE compatible.
> 
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/kernel/ftrace_64.S | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
> index eddb4fabc16f..411fa4148e18 100644
> --- a/arch/x86/kernel/ftrace_64.S
> +++ b/arch/x86/kernel/ftrace_64.S
> @@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
>  SYM_FUNC_START(__fentry__)
>  	CALL_DEPTH_ACCOUNT
>  
> +#ifdef CONFIG_X86_PIE
> +	pushq %r8
> +	leaq ftrace_stub(%rip), %r8
> +	cmpq %r8, ftrace_trace_function(%rip)
> +	popq %r8
> +#else
>  	cmpq $ftrace_stub, ftrace_trace_function
> +#endif
>  	jnz trace
>  	RET
>  
> @@ -329,7 +336,7 @@ trace:
>  	 * ip and parent ip are used and the list function is called when
>  	 * function tracing is enabled.
>  	 */
> -	movq ftrace_trace_function, %r8
> +	movq ftrace_trace_function(%rip), %r8
>  	CALL_NOSPEC r8
>  	restore_mcount_regs
>  

I really don't want to add more updates to !DYNAMIC_FTRACE. This code only
exists to make sure I don't break it for other architectures.

How about

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 442eccc00960..ee4d0713139d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -37,7 +37,7 @@ config X86_64
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
-	depends on X86_32
+	depends on X86_32 || X86_PIE
 	depends on FUNCTION_TRACER
 	select DYNAMIC_FTRACE
 	help


?

-- Steve

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support
  2023-04-28  9:51 ` [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching " Hou Wenlong
@ 2023-04-28 13:44   ` Steven Rostedt
  2023-04-29  3:38     ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Steven Rostedt @ 2023-04-28 13:44 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Huacai Chen,
	Qing Zhang, linux-trace-kernel

On Fri, 28 Apr 2023 17:51:02 +0800
"Hou Wenlong" <houwenlong.hwl@antgroup.com> wrote:

> From: Thomas Garnier <thgarnie@chromium.org>
> 
> From: Thomas Garnier <thgarnie@chromium.org>
> 
> When using PIE with function tracing, the compiler generates a
> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> takes 6 bytes instead of the 5 bytes of a relative call. And the -mnop-mcount
> option is not implemented for -fPIE yet.
> 
> If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> so ftrace can handle the previous 5-bytes as before.

Wait! This won't work!

You can't just append another nop to fill in the blanks here. We must
either have a single 6 byte nop, or we need to refactor the entire logic to
something that other archs have.

The two nops mean that the CPU can treat them as two separate instructions.
There's nothing stopping the computer from preempting a task between the
two. If that happens, and you modify the 1-byte nop and 5-byte nop with a
single 6-byte instruction, when the task gets rescheduled, it will execute the
last 5 bytes of that 6-byte instruction and take a general protection fault, and
likely crash the machine.

NACK on this. It needs a better solution.

-- Steve


> 
> [Hou Wenlong: Adapt code change and fix wrong offset calculation in
> make_nop_x86()]
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for PIE support
  2023-04-28  9:51 ` [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for " Hou Wenlong
@ 2023-04-28 15:18   ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2023-04-28 15:18 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Josh Poimboeuf, Christophe Leroy,
	Sathvika Vasireddy

On Fri, Apr 28, 2023 at 05:51:14PM +0800, Hou Wenlong wrote:

> --- a/tools/objtool/arch/x86/decode.c
> +++ b/tools/objtool/arch/x86/decode.c
> @@ -747,15 +747,21 @@ void arch_initial_func_cfi_state(struct cfi_init_state *state)
>  
>  const char *arch_nop_insn(int len)
>  {
> -	static const char nops[5][5] = {
> +	static const char nops[6][6] = {
>  		{ BYTES_NOP1 },
>  		{ BYTES_NOP2 },
>  		{ BYTES_NOP3 },
>  		{ BYTES_NOP4 },
>  		{ BYTES_NOP5 },
> +		/*
> +		 * For PIE kernel, use a 5-byte nop
> +		 * and 1-byte nop to keep the frace
> +		 * hooking algorithm working correct.
> +		 */
> +		{ BYTES_NOP5, BYTES_NOP1 },
>  	};
> -	if (len < 1 || len > 5) {
> +	if (len < 1 || len > 6) {
>  		WARN("invalid NOP size: %d\n", len);
>  		return NULL;
>  	}

Like Steve already said, this is broken, we hard rely on these things
being single instructions, this must absolutely be BYTES_NOP6.

And yes, then you get to fix a whole lot more.
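
A minimal sketch of that direction, assuming objtool's copy of asm/nops.h
carries BYTES_NOP6 like the kernel header does:

	static const char nops[6][6] = {
		{ BYTES_NOP1 },
		{ BYTES_NOP2 },
		{ BYTES_NOP3 },
		{ BYTES_NOP4 },
		{ BYTES_NOP5 },
		{ BYTES_NOP6 },	/* one 6-byte instruction, patchable atomically */
	};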

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible
  2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
                   ` (42 preceding siblings ...)
  2023-04-28  9:51 ` [PATCH RFC 43/43] x86/boot: Extend relocate range for PIE kernel image Hou Wenlong
@ 2023-04-28 15:22 ` Peter Zijlstra
  2023-05-06  7:19   ` Hou Wenlong
  43 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-04-28 15:22 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Nathan Chancellor, Nick Desaulniers, Tom Rix, bpf, llvm


For some reason I didn't get 0/n but did get all of the others. Please
keep your Cc list consistent.

On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:

>   - It is not allowed to reference global variables in an alternative
>     section since RIP-relative addressing is not fixed in
>     apply_alternatives(). Fortunately, all disallowed relocations in the
>     alternative section can be captured by objtool. I believe that this
>     issue can also be fixed by using objtool.

https://lkml.kernel.org/r/Y9py2a5Xw0xbB8ou@hirez.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-04-28  9:51 ` [PATCH RFC 31/43] x86/modules: Adapt module loading " Hou Wenlong
@ 2023-04-28 19:29   ` Ard Biesheuvel
  2023-05-08  8:32     ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Ard Biesheuvel @ 2023-04-28 19:29 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> Adapt module loading to support PIE relocations. No GOT is generated for
> modules; every GOT reference in a module must already have a corresponding
> entry in the kernel GOT.  Currently, there is only one usable GOT reference,
> for __fentry__().
>

I don't think this is the right approach. We should permit GOTPCREL
relocations properly, which means making them point to a location in
memory that carries the absolute address of the symbol. There are
several ways to go about that, but perhaps the simplest way is to make
the symbol address in ksymtab a 64-bit absolute value (but retain the
PC32 references for the symbol name and the symbol namespace name).
That way, you can always resolve such GOTPCREL relocations by pointing
it to the ksymtab entry. Another option would be to take inspiration
from the PLT code we have on ARM and arm64 (and other architectures,
surely) and to count the GOT based relocations, allocate some extra
r/o module space for each, and allocate slots and populate them with
the right value as you fix up the relocations.
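
A rough sketch of the counting step for that second approach, loosely
modelled on the arm64 module PLT code (the helper name is hypothetical and
only plain R_X86_64_GOTPCREL is considered here):

	static unsigned int count_module_got_relocs(const Elf64_Rela *rela,
						    unsigned int num)
	{
		unsigned int i, n = 0;

		for (i = 0; i < num; i++)
			if (ELF64_R_TYPE(rela[i].r_info) == R_X86_64_GOTPCREL)
				n++;	/* reserve one r/o GOT slot for the module */

		return n;
	}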

Then, many such relocations can be relaxed at module load time if the
symbol is in range. IIUC, the module and kernel will still be inside
the same 2G window even after widening the KASLR range to 512G, so
most GOT loads can be converted into RIP relative LEA instructions.
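
For reference, the relaxation described here is the usual GOTPCREL-to-LEA
rewrite; for the common REX.W-prefixed form both encodings have the same
length (illustrative only):

	movq	foo@GOTPCREL(%rip), %rax	/* 48 8b 05 <rel32>: load &foo via the GOT */
	leaq	foo(%rip), %rax			/* 48 8d 05 <rel32>: direct, no GOT load   */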

Note that this will also permit you to do things like

#define PV_VCPU_PREEMPTED_ASM \
 "leaq __per_cpu_offset(%rip), %rax \n\t" \
 "movq (%rax,%rdi,8), %rax \n\t" \
 "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
 "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
 "setne %al\n\t"

or

+#ifdef CONFIG_X86_PIE
+ " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
+#else
" pushq $arch_rethook_trampoline\n"
+#endif

instead of having these kludgy push/pop sequences to free up temp registers.

(FYI I have looked into this PIE linking just a few weeks ago [0] so
this is all rather fresh in my memory)




[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie


> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/include/asm/sections.h |  5 +++++
>  arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> index a6e8373a5170..dc1c2b08ec48 100644
> --- a/arch/x86/include/asm/sections.h
> +++ b/arch/x86/include/asm/sections.h
> @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
>
>  #if defined(CONFIG_X86_64)
>  extern char __end_rodata_hpage_align[];
> +
> +#ifdef CONFIG_X86_PIE
> +extern char __start_got[], __end_got[];
> +#endif
> +
>  #endif
>
>  extern char __end_of_kernel_reserve[];
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 84ad0e61ba6e..051f88e6884e 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
>         return 0;
>  }
>  #else /*X86_64*/
> +#ifdef CONFIG_X86_PIE
> +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> +{
> +       u64 *pos;
> +
> +       for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> +               if (*pos == sym->st_value)
> +                       return (u64)pos + rela->r_addend;
> +       return 0;
> +}
> +#endif
> +
>  static int __write_relocate_add(Elf64_Shdr *sechdrs,
>                    const char *strtab,
>                    unsigned int symindex,
> @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
>                 case R_X86_64_64:
>                         size = 8;
>                         break;
> +#ifndef CONFIG_X86_PIE
>                 case R_X86_64_32:
>                         if (val != *(u32 *)&val)
>                                 goto overflow;
> @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
>                                 goto overflow;
>                         size = 4;
>                         break;
> +#else
> +               case R_X86_64_GOTPCREL:
> +                       val = find_got_kernel_entry(sym, rel);
> +                       if (!val)
> +                               goto unexpected_got_reference;
> +                       fallthrough;
> +#endif
>                 case R_X86_64_PC32:
>                 case R_X86_64_PLT32:
>                         val -= (u64)loc;
> @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
>         }
>         return 0;
>
> +#ifdef CONFIG_X86_PIE
> +unexpected_got_reference:
> +       pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> +       return -ENOEXEC;
> +#else
>  overflow:
>         pr_err("overflow in relocation type %d val %Lx\n",
>                (int)ELF64_R_TYPE(rel[i].r_info), val);
>         pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
>                me->name);
> +#endif
> +
>         return -ENOEXEC;
>  }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support
  2023-04-28 13:44   ` Steven Rostedt
@ 2023-04-29  3:38     ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-29  3:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Huacai Chen,
	Qing Zhang, linux-trace-kernel

On Fri, Apr 28, 2023 at 09:44:54PM +0800, Steven Rostedt wrote:
> On Fri, 28 Apr 2023 17:51:02 +0800
> "Hou Wenlong" <houwenlong.hwl@antgroup.com> wrote:
> 
> > From: Thomas Garnier <thgarnie@chromium.org>
> > 
> > From: Thomas Garnier <thgarnie@chromium.org>
> > 
> > When using PIE with function tracing, the compiler generates a
> > call through the GOT (call *__fentry__@GOTPCREL). This instruction
> > takes 6-bytes instead of 5-bytes with a relative call. And -mnop-mcount
> > option is not implemented for -fPIE now.
> > 
> > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> > so ftrace can handle the previous 5-bytes as before.
> 
> Wait! This won't work!
> 
> You can't just append another nop to fill in the blanks here. We must
> either have a single 6 byte nop, or we need to refactor the entire logic to
> something that other archs have.
> 
> The two nops mean that the CPU can treat them as two separate instructions.
> There's nothing stopping the computer from preempting a task between the
> two. If that happens, and you modify the 1-byte nop and 5-byte nop with a
> single 6-byte instruction, when the task gets rescheduled, it will execute the
> last 5 bytes of that 6-byte instruction and take a general protection fault, and
> likely crash the machine.
> 
> NACK on this. It needs a better solution.
> 
> -- Steve
> 
>
Hi Steve,

Sorry for not providing the original patch link:
https://lore.kernel.org/all/20190131192533.34130-22-thgarnie@chromium.org/

I dropped the Reviewed-by tag because of the change described in the
commit message.

This nop patching is only used for the initial conversion (addr = MCOUNT),
before SMP bring-up or before any module code runs. ftrace_make_call() is
not modified, so a 5-byte direct call still replaces the first 5-byte nop
when tracing is enabled, as before; that is still a single instruction. So
the runtime logic is the same as before: only the first 5 bytes are patched
when tracing is enabled or disabled, as shown in the sketch below.
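
Roughly, the intended sequence of states for a patched call site is (sketch
only, byte widths as described above):

	call	*__fentry__(%rip)	/* 6-byte GOT call emitted by the compiler          */
	/* boot-time nop patching: */
	NOP5; NOP1			/* 5-byte nop + 1-byte nop                          */
	/* tracing enabled later, only the first 5 bytes change: */
	call	<ftrace trampoline>	/* 5-byte direct call, 1-byte nop left untouched    */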

> > 
> > [Hou Wenlong: Adapt code change and fix wrong offset calculation in
> > make_nop_x86()]
> > 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support
  2023-04-28 13:37   ` Steven Rostedt
@ 2023-04-29  3:43     ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-29  3:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-trace-kernel

On Fri, Apr 28, 2023 at 09:37:19PM +0800, Steven Rostedt wrote:
> On Fri, 28 Apr 2023 17:51:01 +0800
> "Hou Wenlong" <houwenlong.hwl@antgroup.com> wrote:
> 
> > Change the assembly code to use only relative references of symbols for
> > the kernel to be PIE compatible.
> > 
> > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Cc: Thomas Garnier <thgarnie@chromium.org>
> > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> >  arch/x86/kernel/ftrace_64.S | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
> > index eddb4fabc16f..411fa4148e18 100644
> > --- a/arch/x86/kernel/ftrace_64.S
> > +++ b/arch/x86/kernel/ftrace_64.S
> > @@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
> >  SYM_FUNC_START(__fentry__)
> >  	CALL_DEPTH_ACCOUNT
> >  
> > +#ifdef CONFIG_X86_PIE
> > +	pushq %r8
> > +	leaq ftrace_stub(%rip), %r8
> > +	cmpq %r8, ftrace_trace_function(%rip)
> > +	popq %r8
> > +#else
> >  	cmpq $ftrace_stub, ftrace_trace_function
> > +#endif
> >  	jnz trace
> >  	RET
> >  
> > @@ -329,7 +336,7 @@ trace:
> >  	 * ip and parent ip are used and the list function is called when
> >  	 * function tracing is enabled.
> >  	 */
> > -	movq ftrace_trace_function, %r8
> > +	movq ftrace_trace_function(%rip), %r8
> >  	CALL_NOSPEC r8
> >  	restore_mcount_regs
> >  
> 
> I really don't want to add more updates to !DYNAMIC_FTRACE. This code only
> exists to make sure I don't break it for other architectures.
> 
> How about
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 442eccc00960..ee4d0713139d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -37,7 +37,7 @@ config X86_64
>  
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> -	depends on X86_32
> +	depends on X86_32 || X86_PIE
>  	depends on FUNCTION_TRACER
>  	select DYNAMIC_FTRACE
>  	help
> 
> 
> ?
>
OK, I'll drop it. Actually, I already select DYNAMIC_FTRACE when
CONFIG_RETPOLINE is enabled for PIE, due to the indirect call of
__fentry__() in patch 34.

Thanks.
> -- Steve

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support
  2023-04-28 10:28   ` Christophe Leroy
  2023-04-28 11:43     ` Peter Zijlstra
@ 2023-04-29  3:52     ` Hou Wenlong
  1 sibling, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-29  3:52 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Masahiro Yamada, Nathan Chancellor,
	Nick Desaulniers, Nicolas Schier, Josh Poimboeuf, Peter Zijlstra,
	Sathvika Vasireddy, Thomas Weißschuh, linux-kbuild

On Fri, Apr 28, 2023 at 06:28:19PM +0800, Christophe Leroy wrote:
> 
> 
> > On 28/04/2023 at 11:51, Hou Wenlong wrote:
> > 
> > For an x86 PIE binary, only RIP-relative addressing is allowed. However,
> > there are still a few absolute references of the R_X86_64_64 relocation
> > type in the data section and a few absolute references of the R_X86_64_32S
> > relocation type in the pvh_start_xen() function.
> > 
> > Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Cc: Thomas Garnier <thgarnie@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> >   arch/x86/Kconfig                        |  1 +
> >   scripts/Makefile.lib                    |  1 +
> >   tools/objtool/builtin-check.c           |  4 +-
> >   tools/objtool/check.c                   | 82 +++++++++++++++++++++++++
> >   tools/objtool/include/objtool/builtin.h |  1 +
> >   5 files changed, 88 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 715f0734d065..b753a54e5ea7 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2224,6 +2224,7 @@ config RELOCATABLE
> >   config X86_PIE
> >          def_bool n
> >          depends on X86_64
> > +       select OBJTOOL if HAVE_OBJTOOL
> > 
> >   config RANDOMIZE_BASE
> >          bool "Randomize the address of the kernel image (KASLR)"
> > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> > index 100a386fcd71..e3c804fbc421 100644
> > --- a/scripts/Makefile.lib
> > +++ b/scripts/Makefile.lib
> > @@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE)              += --static-call
> >   objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION)         += --uaccess
> >   objtool-args-$(CONFIG_GCOV_KERNEL)                     += --no-unreachable
> >   objtool-args-$(CONFIG_PREFIX_SYMBOLS)                  += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
> > +objtool-args-$(CONFIG_X86_PIE)                         += --pie
> > 
> >   objtool-args = $(objtool-args-y)                                       \
> >          $(if $(delay-objtool), --link)                                  \
> > diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
> > index 7c175198d09f..1cf1d00464e0 100644
> > --- a/tools/objtool/builtin-check.c
> > +++ b/tools/objtool/builtin-check.c
> > @@ -81,6 +81,7 @@ static const struct option check_options[] = {
> >          OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
> >          OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
> >          OPT_BOOLEAN(0  , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
> > +       OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
> >          OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),
> > 
> >          OPT_GROUP("Options:"),
> > @@ -137,7 +138,8 @@ static bool opts_valid(void)
> >              opts.sls                    ||
> >              opts.stackval               ||
> >              opts.static_call            ||
> > -           opts.uaccess) {
> > +           opts.uaccess                ||
> > +           opts.pie) {
> >                  if (opts.dump_orc) {
> >                          ERROR("--dump can't be combined with other options");
> >                          return false;
> > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > index 5b600bbf2389..d67b80251eec 100644
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> >          for (insn = next_insn_same_sec(file, insn); insn;               \
> >               insn = next_insn_same_sec(file, insn))
> > 
> > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > +                                               struct section *sec,
> > +                                               unsigned long offset)
> > +{
> > +       struct instruction *insn;
> > +
> > +       insn = find_insn(file, sec, 0);
> > +       if (!insn)
> > +               return NULL;
> > +
> > +       sec_for_each_insn_from(file, insn) {
> > +               if (insn->offset > offset)
> > +                       return NULL;
> > +               if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > +                       return insn;
> > +       }
> > +
> > +       return NULL;
> > +}
> > +
> > +
> >   static inline struct symbol *insn_call_dest(struct instruction *insn)
> >   {
> >          if (insn->type == INSN_JUMP_DYNAMIC ||
> > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> >          return 0;
> >   }
> > 
> > +static int is_in_pvh_code(struct instruction *insn)
> > +{
> > +       struct symbol *sym = insn->sym;
> > +
> > +       return sym && !strcmp(sym->name, "pvh_start_xen");
> > +}
> > +
> > +static int validate_pie(struct objtool_file *file)
> > +{
> > +       struct section *sec;
> > +       struct reloc *reloc;
> > +       struct instruction *insn;
> > +       int warnings = 0;
> > +
> > +       for_each_sec(file, sec) {
> > +               if (!sec->reloc)
> > +                       continue;
> > +               if (!(sec->sh.sh_flags & SHF_ALLOC))
> > +                       continue;
> > +
> > +               list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > +                       switch (reloc->type) {
> > +                       case R_X86_64_NONE:
> > +                       case R_X86_64_PC32:
> > +                       case R_X86_64_PLT32:
> > +                       case R_X86_64_64:
> > +                       case R_X86_64_PC64:
> > +                       case R_X86_64_GOTPCREL:
> > +                               break;
> > +                       case R_X86_64_32:
> > +                       case R_X86_64_32S:
> 
> That looks very specific to X86, should it go at another place ?
> 
> If it can work for any architecture, can you add generic macros, just 
> like commit c1449735211d ("objtool: Use macros to define arch specific 
> reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount 
> specific implementation") ?
>
Got it, I'll refactor it and move the code into the x86 directory.

Thanks. 
> > +                               insn = find_insn_containing(file, sec, reloc->offset);
> > +                               if (!insn) {
> > +                                       WARN("can't find relocate insn near %s+0x%lx",
> > +                                            sec->name, reloc->offset);
> > +                               } else {
> > +                                       if (is_in_pvh_code(insn))
> > +                                               break;
> > +                                       WARN("insn at %s+0x%lx is not compatible with PIE",
> > +                                            sec->name, insn->offset);
> > +                               }
> > +                               warnings++;
> > +                               break;
> > +                       default:
> > +                               WARN("unexpected relocation type %d at %s+0x%lx",
> > +                                    reloc->type, sec->name, reloc->offset);
> > +                               warnings++;
> > +                               break;
> > +                       }
> > +               }
> > +       }
> > +
> > +       return warnings;
> > +}
> > +
> >   int check(struct objtool_file *file)
> >   {
> >          int ret, warnings = 0;
> > @@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
> >                  warnings += ret;
> >          }
> > 
> > +       if (opts.pie) {
> > +               ret = validate_pie(file);
> > +               if (ret < 0)
> > +                       return ret;
> > +               warnings += ret;
> > +       }
> > 
> >          if (opts.stats) {
> >                  printf("nr_insns_visited: %ld\n", nr_insns_visited);
> > diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
> > index 2a108e648b7a..1151211a5cea 100644
> > --- a/tools/objtool/include/objtool/builtin.h
> > +++ b/tools/objtool/include/objtool/builtin.h
> > @@ -26,6 +26,7 @@ struct opts {
> >          bool uaccess;
> >          int prefix;
> >          bool cfi;
> > +       bool pie;
> > 
> >          /* options: */
> >          bool backtrace;
> > --
> > 2.31.1
> > 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support
  2023-04-28 11:43     ` Peter Zijlstra
@ 2023-04-29  4:04       ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-04-29  4:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christophe Leroy, linux-kernel, Thomas Garnier, Lai Jiangshan,
	Kees Cook, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Masahiro Yamada,
	Nathan Chancellor, Nick Desaulniers, Nicolas Schier,
	Josh Poimboeuf, Sathvika Vasireddy, Thomas Weißschuh,
	linux-kbuild

On Fri, Apr 28, 2023 at 07:43:38PM +0800, Peter Zijlstra wrote:
> On Fri, Apr 28, 2023 at 10:28:19AM +0000, Christophe Leroy wrote:
> 
> 
> > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > > index 5b600bbf2389..d67b80251eec 100644
> > > --- a/tools/objtool/check.c
> > > +++ b/tools/objtool/check.c
> > > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> > >          for (insn = next_insn_same_sec(file, insn); insn;               \
> > >               insn = next_insn_same_sec(file, insn))
> > > 
> > > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > > +                                               struct section *sec,
> > > +                                               unsigned long offset)
> > > +{
> > > +       struct instruction *insn;
> > > +
> > > +       insn = find_insn(file, sec, 0);
> > > +       if (!insn)
> > > +               return NULL;
> > > +
> > > +       sec_for_each_insn_from(file, insn) {
> > > +               if (insn->offset > offset)
> > > +                       return NULL;
> > > +               if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > > +                       return insn;
> > > +       }
> > > +
> > > +       return NULL;
> > > +}
> 
> Urgh, this is horrendous crap. Yes you're only using it in case of a
> warning, but adding a function like this makes it appear like it's
> actually sane to use.
> 
> A far better implementation -- but still not stellar -- would be
> something like:
> 
> 	sym = find_symbol_containing(sec, offset);
> 	if (!sym)
> 		// fail
> 	sym_for_each_insn(file, sym, insn) {
> 		...
> 	}
> 
> But given insn_hash uses sec_offset_hash() you can do something similar
> to find_reloc_by_dest_range()
> 
> 	start = offset - (INSN_MAX_SIZE - 1);
> 	for_offset_range(o, start, start + INSN_MAX_SIZE) {
> 		hash_for_each_possible(file->insn_hash, insn, hash, sec_offset_hash(sec, o)) {
> 			if (insn->sec != sec)
> 				continue;
> 
> 			if (insn->offset <= offset &&
> 			    insn->offset + inns->len > offset)
> 			    insn->offset + insn->len > offset)
> 		}
> 	}
> 	return NULL;
>
Thanks for your suggestion, I'll pick it up in the next version.
 
> > > +
> > > +
> > >   static inline struct symbol *insn_call_dest(struct instruction *insn)
> > >   {
> > >          if (insn->type == INSN_JUMP_DYNAMIC ||
> > > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> > >          return 0;
> > >   }
> > > 
> > > +static int is_in_pvh_code(struct instruction *insn)
> > > +{
> > > +       struct symbol *sym = insn->sym;
> > > +
> > > +       return sym && !strcmp(sym->name, "pvh_start_xen");
> > > +}
> > > +
> > > +static int validate_pie(struct objtool_file *file)
> > > +{
> > > +       struct section *sec;
> > > +       struct reloc *reloc;
> > > +       struct instruction *insn;
> > > +       int warnings = 0;
> > > +
> > > +       for_each_sec(file, sec) {
> > > +               if (!sec->reloc)
> > > +                       continue;
> > > +               if (!(sec->sh.sh_flags & SHF_ALLOC))
> > > +                       continue;
> > > +
> > > +               list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > > +                       switch (reloc->type) {
> > > +                       case R_X86_64_NONE:
> > > +                       case R_X86_64_PC32:
> > > +                       case R_X86_64_PLT32:
> > > +                       case R_X86_64_64:
> > > +                       case R_X86_64_PC64:
> > > +                       case R_X86_64_GOTPCREL:
> > > +                               break;
> > > +                       case R_X86_64_32:
> > > +                       case R_X86_64_32S:
> > 
> > That looks very specific to X86, should it go at another place ?
> > 
> > If it can work for any architecture, can you add generic macros, just 
> > like commit c1449735211d ("objtool: Use macros to define arch specific 
> > reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount 
> > specific implementation") ?
> 
> Yes, this should be something like arch_PIE_reloc() or so. Similar to
> arch_pc_relative_reloc().

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only
  2023-04-28  9:51 ` [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only Hou Wenlong
@ 2023-04-30 14:23   ` Ard Biesheuvel
  2023-05-08 11:40     ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Ard Biesheuvel @ 2023-04-30 14:23 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Arnd Bergmann, Peter Zijlstra, Josh Poimboeuf,
	Juergen Gross, Brian Gerst, linux-arch

On Fri, 28 Apr 2023 at 11:55, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> From: Thomas Garnier <thgarnie@chromium.org>
>
> The GOT is changed during early boot when relocations are applied. Make
> it read-only directly. This table exists only for a PIE binary. Since a weak
> symbol reference always becomes a GOT reference, there are 8 entries in the
> GOT, but only the entry for __fentry__() is in use.  The other GOT
> references have been optimized away by the linker.
>
> [Hou Wenlong: Change commit message and skip GOT size check]
>
> Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/kernel/vmlinux.lds.S     |  2 ++
>  include/asm-generic/vmlinux.lds.h | 12 ++++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index f02dcde9f8a8..fa4c6582663f 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -462,6 +462,7 @@ SECTIONS
>  #endif
>                "Unexpected GOT/PLT entries detected!")
>
> +#ifndef CONFIG_X86_PIE
>         /*
>          * Sections that should stay zero sized, which is safer to
>          * explicitly check instead of blindly discarding.
> @@ -470,6 +471,7 @@ SECTIONS
>                 *(.got) *(.igot.*)
>         }
>         ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")
> +#endif
>
>         .plt : {
>                 *(.plt) *(.plt.*) *(.iplt)
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index d1f57e4868ed..438ed8b39896 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -441,6 +441,17 @@
>         __end_ro_after_init = .;
>  #endif
>
> +#ifdef CONFIG_X86_PIE
> +#define RO_GOT_X86

Please don't put X86 specific stuff in generic code.

> +       .got        : AT(ADDR(.got) - LOAD_OFFSET) {                    \
> +               __start_got = .;                                        \
> +               *(.got) *(.igot.*);                                     \
> +               __end_got = .;                                          \
> +       }
> +#else
> +#define RO_GOT_X86
> +#endif
> +

I don't think it makes sense for this definition to be conditional.
You can include it conditionally from the x86 code, but even that
seems unnecessary, given that it will be empty otherwise.
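
In other words, something along these lines directly in
arch/x86/kernel/vmlinux.lds.S (sketch only; the exact placement within the
read-only data is assumed):

	.got : AT(ADDR(.got) - LOAD_OFFSET) {
		__start_got = .;
		*(.got) *(.igot.*);
		__end_got = .;
	}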

>  /*
>   * .kcfi_traps contains a list KCFI trap locations.
>   */
> @@ -486,6 +497,7 @@
>                 BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend_late, _pci_fixups_suspend_late, __start, __end) \
>         }                                                               \
>                                                                         \
> +       RO_GOT_X86                                                      \
>         FW_LOADER_BUILT_IN_DATA                                         \
>         TRACEDATA                                                       \
>                                                                         \
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-04-28  9:50 ` [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler Hou Wenlong
@ 2023-05-01 17:27   ` Nick Desaulniers
  2023-05-05  6:14     ` Hou Wenlong
  2023-05-04 10:31   ` Juergen Gross
  1 sibling, 1 reply; 80+ messages in thread
From: Nick Desaulniers @ 2023-05-01 17:27 UTC (permalink / raw)
  To: Hou Wenlong, Brian Gerst
  Cc: linux-kernel, Kees Cook, x86, Nathan Chancellor, llvm

On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> From: Brian Gerst <brgerst@gmail.com>
>
> If the compiler supports it, use a standard per-cpu variable for the
> stack protector instead of the old fixed location.  Keep the fixed
> location code for compatibility with older compilers.
>
> [Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> missing GS set up path in pvh_start_xen()]
>
> Signed-off-by: Brian Gerst <brgerst@gmail.com>
> Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/Kconfig                      | 12 ++++++++++++
>  arch/x86/Makefile                     | 21 ++++++++++++++-------
>  arch/x86/entry/entry_64.S             |  6 +++++-
>  arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
>  arch/x86/include/asm/stackprotector.h | 16 +++++++---------
>  arch/x86/kernel/asm-offsets_64.c      |  2 +-
>  arch/x86/kernel/cpu/common.c          | 15 +++++++--------
>  arch/x86/kernel/head_64.S             | 16 ++++++++++------
>  arch/x86/kernel/vmlinux.lds.S         |  4 +++-
>  arch/x86/platform/pvh/head.S          |  8 ++++++++
>  arch/x86/xen/xen-head.S               | 14 +++++++++-----
>  11 files changed, 88 insertions(+), 43 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 68e5da464b96..55cce8cdf9bd 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -410,6 +410,18 @@ config CC_HAS_SANE_STACKPROTECTOR
>           the compiler produces broken code or if it does not let us control
>           the segment on 32-bit kernels.
>
> +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> +       bool
> +       # Although clang supports -mstack-protector-guard-reg option, it
> +       # would generate GOT reference for __stack_chk_guard even with
> +       # -fno-PIE flag.
> +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))

Hi Hou,
I've filed this bug against LLVM and will work with LLVM folks at
Intel to resolve:
https://github.com/llvm/llvm-project/issues/62481
Can you please review that report and let me know here or there if I
missed anything? Would you also mind including a link to that in the
comments in the next version of this patch?

Less relevant issues I filed looking at some related codegen:
https://github.com/llvm/llvm-project/issues/62482
https://github.com/llvm/llvm-project/issues/62480

And we should probably look into:
https://github.com/llvm/llvm-project/issues/22476


-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-04-28  9:50 ` [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler Hou Wenlong
  2023-05-01 17:27   ` Nick Desaulniers
@ 2023-05-04 10:31   ` Juergen Gross
  2023-05-05  3:09     ` Hou Wenlong
  1 sibling, 1 reply; 80+ messages in thread
From: Juergen Gross @ 2023-05-04 10:31 UTC (permalink / raw)
  To: Hou Wenlong, linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook, Brian Gerst,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Boris Ostrovsky, Darren Hart,
	Andy Shevchenko, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Peter Zijlstra, Mike Rapoport (IBM),
	Ashok Raj, Rick Edgecombe, Catalin Marinas, Guo Ren,
	Greg Kroah-Hartman, Jason A. Donenfeld, Pawan Gupta,
	Kim Phillips, David Woodhouse, Josh Poimboeuf, xen-devel,
	platform-driver-x86, llvm



On 28.04.23 11:50, Hou Wenlong wrote:
> From: Brian Gerst <brgerst@gmail.com>
> 
> If the compiler supports it, use a standard per-cpu variable for the
> stack protector instead of the old fixed location.  Keep the fixed
> location code for compatibility with older compilers.
> 
> [Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> missing GS set up path in pvh_start_xen()]
> 
> Signed-off-by: Brian Gerst <brgerst@gmail.com>
> Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>   arch/x86/Kconfig                      | 12 ++++++++++++
>   arch/x86/Makefile                     | 21 ++++++++++++++-------
>   arch/x86/entry/entry_64.S             |  6 +++++-
>   arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
>   arch/x86/include/asm/stackprotector.h | 16 +++++++---------
>   arch/x86/kernel/asm-offsets_64.c      |  2 +-
>   arch/x86/kernel/cpu/common.c          | 15 +++++++--------
>   arch/x86/kernel/head_64.S             | 16 ++++++++++------
>   arch/x86/kernel/vmlinux.lds.S         |  4 +++-
>   arch/x86/platform/pvh/head.S          |  8 ++++++++
>   arch/x86/xen/xen-head.S               | 14 +++++++++-----
>   11 files changed, 88 insertions(+), 43 deletions(-)
> 

...

> diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
> index 643d02900fbb..09eaf59e8066 100644
> --- a/arch/x86/xen/xen-head.S
> +++ b/arch/x86/xen/xen-head.S
> @@ -51,15 +51,19 @@ SYM_CODE_START(startup_xen)
>   
>   	leaq	(__end_init_task - PTREGS_SIZE)(%rip), %rsp
>   
> -	/* Set up %gs.
> -	 *
> -	 * The base of %gs always points to fixed_percpu_data.  If the
> -	 * stack protector canary is enabled, it is located at %gs:40.
> +	/*
> +	 * Set up GS base.
>   	 * Note that, on SMP, the boot cpu uses init data section until
>   	 * the per cpu areas are set up.
>   	 */
>   	movl	$MSR_GS_BASE,%ecx
> -	movq	$INIT_PER_CPU_VAR(fixed_percpu_data),%rax
> +#if defined(CONFIG_STACKPROTECTOR_FIXED)
> +	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
> +#elif defined(CONFIG_SMP)
> +	movabs	$__per_cpu_load, %rdx

Shouldn't above 2 targets be %rax?

> +#else
> +	xorl	%eax, %eax
> +#endif
>   	cdq
>   	wrmsr
>   


Juergen


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-05-04 10:31   ` Juergen Gross
@ 2023-05-05  3:09     ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-05-05  3:09 UTC (permalink / raw)
  To: Juergen Gross
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Brian Gerst, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski,
	Boris Ostrovsky, Darren Hart, Andy Shevchenko, Nathan Chancellor,
	Nick Desaulniers, Tom Rix, Peter Zijlstra, Mike Rapoport (IBM),
	Ashok Raj, Rick Edgecombe, Catalin Marinas, Guo Ren,
	Greg Kroah-Hartman, Jason A. Donenfeld, Pawan Gupta,
	Kim Phillips, David Woodhouse, Josh Poimboeuf, xen-devel,
	platform-driver-x86, llvm

On Thu, May 04, 2023 at 12:31:59PM +0200, Juergen Gross wrote:
> On 28.04.23 11:50, Hou Wenlong wrote:
> >From: Brian Gerst <brgerst@gmail.com>
> >
> >If the compiler supports it, use a standard per-cpu variable for the
> >stack protector instead of the old fixed location.  Keep the fixed
> >location code for compatibility with older compilers.
> >
> >[Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> >missing GS set up path in pvh_start_xen()]
> >
> >Signed-off-by: Brian Gerst <brgerst@gmail.com>
> >Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> >Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> >Cc: Thomas Garnier <thgarnie@chromium.org>
> >Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >Cc: Kees Cook <keescook@chromium.org>
> >---
> >  arch/x86/Kconfig                      | 12 ++++++++++++
> >  arch/x86/Makefile                     | 21 ++++++++++++++-------
> >  arch/x86/entry/entry_64.S             |  6 +++++-
> >  arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
> >  arch/x86/include/asm/stackprotector.h | 16 +++++++---------
> >  arch/x86/kernel/asm-offsets_64.c      |  2 +-
> >  arch/x86/kernel/cpu/common.c          | 15 +++++++--------
> >  arch/x86/kernel/head_64.S             | 16 ++++++++++------
> >  arch/x86/kernel/vmlinux.lds.S         |  4 +++-
> >  arch/x86/platform/pvh/head.S          |  8 ++++++++
> >  arch/x86/xen/xen-head.S               | 14 +++++++++-----
> >  11 files changed, 88 insertions(+), 43 deletions(-)
> >
> 
> ...
> 
> >diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
> >index 643d02900fbb..09eaf59e8066 100644
> >--- a/arch/x86/xen/xen-head.S
> >+++ b/arch/x86/xen/xen-head.S
> >@@ -51,15 +51,19 @@ SYM_CODE_START(startup_xen)
> >  	leaq	(__end_init_task - PTREGS_SIZE)(%rip), %rsp
> >-	/* Set up %gs.
> >-	 *
> >-	 * The base of %gs always points to fixed_percpu_data.  If the
> >-	 * stack protector canary is enabled, it is located at %gs:40.
> >+	/*
> >+	 * Set up GS base.
> >  	 * Note that, on SMP, the boot cpu uses init data section until
> >  	 * the per cpu areas are set up.
> >  	 */
> >  	movl	$MSR_GS_BASE,%ecx
> >-	movq	$INIT_PER_CPU_VAR(fixed_percpu_data),%rax
> >+#if defined(CONFIG_STACKPROTECTOR_FIXED)
> >+	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
> >+#elif defined(CONFIG_SMP)
> >+	movabs	$__per_cpu_load, %rdx
> 
> Shouldn't above 2 targets be %rax?
>
Ah yes, my mistake. I didn't test it on a Xen guest, sorry.
I'll test on a Xen guest before the next submission.

Thanks.

> >+#else
> >+	xorl	%eax, %eax
> >+#endif
> >  	cdq
> >  	wrmsr
> 
> 
> Juergen






^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-05-01 17:27   ` Nick Desaulniers
@ 2023-05-05  6:14     ` Hou Wenlong
  2023-05-05 18:02       ` Nick Desaulniers
  0 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-05-05  6:14 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Brian Gerst, linux-kernel, Kees Cook, x86, Nathan Chancellor, llvm

On Tue, May 02, 2023 at 01:27:53AM +0800, Nick Desaulniers wrote:
> On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > From: Brian Gerst <brgerst@gmail.com>
> >
> > If the compiler supports it, use a standard per-cpu variable for the
> > stack protector instead of the old fixed location.  Keep the fixed
> > location code for compatibility with older compilers.
> >
> > [Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> > missing GS set up path in pvh_start_xen()]
> >
> > Signed-off-by: Brian Gerst <brgerst@gmail.com>
> > Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Cc: Thomas Garnier <thgarnie@chromium.org>
> > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> >  arch/x86/Kconfig                      | 12 ++++++++++++
> >  arch/x86/Makefile                     | 21 ++++++++++++++-------
> >  arch/x86/entry/entry_64.S             |  6 +++++-
> >  arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
> >  arch/x86/include/asm/stackprotector.h | 16 +++++++---------
> >  arch/x86/kernel/asm-offsets_64.c      |  2 +-
> >  arch/x86/kernel/cpu/common.c          | 15 +++++++--------
> >  arch/x86/kernel/head_64.S             | 16 ++++++++++------
> >  arch/x86/kernel/vmlinux.lds.S         |  4 +++-
> >  arch/x86/platform/pvh/head.S          |  8 ++++++++
> >  arch/x86/xen/xen-head.S               | 14 +++++++++-----
> >  11 files changed, 88 insertions(+), 43 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 68e5da464b96..55cce8cdf9bd 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -410,6 +410,18 @@ config CC_HAS_SANE_STACKPROTECTOR
> >           the compiler produces broken code or if it does not let us control
> >           the segment on 32-bit kernels.
> >
> > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > +       bool
> > +       # Although clang supports -mstack-protector-guard-reg option, it
> > +       # would generate GOT reference for __stack_chk_guard even with
> > +       # -fno-PIE flag.
> > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> 
> Hi Hou,
> I've filed this bug against LLVM and will work with LLVM folks at
> Intel to resolve:
> https://github.com/llvm/llvm-project/issues/62481
> Can you please review that report and let me know here or there if I
> missed anything? Would you also mind including a link to that in the
> comments in the next version of this patch?
> 
Hi Nick,

Thanks for your help, I'll include the link in the next version.
Actually, I had posted an issue on GitHub too when I tested the patch on
LLVM, but got no replies. :(
https://github.com/llvm/llvm-project/issues/60116

There is another problem I hit with this patch: some unexpected code
is generated:

do_one_initcall: (init/main.o)
......
movq    __stack_chk_guard(%rip), %rax
movq    %rax,0x2b0(%rsp)

The compiler generates the wrong instruction: no GOT reference and no gs
register. I only see it in the init/main.c file. I tried moving the
function into another file and compiling it with the same cflags, and the
right instruction is generated for the function in that other file.
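
For comparison, with -mstack-protector-guard-reg=gs and
-mstack-protector-guard-symbol=__stack_chk_guard the prologue load is
expected to be a %gs-relative access, roughly like the below (the exact
addressing form depends on the code model; illustrative only):

	movq	%gs:__stack_chk_guard, %rax
	movq	%rax, 0x2b0(%rsp)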

I built the LLVM toolchain myself:
clang version 15.0.7 (https://github.com/llvm/llvm-project.git
8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)

> Less relevant issues I filed looking at some related codegen:
> https://github.com/llvm/llvm-project/issues/62482
> https://github.com/llvm/llvm-project/issues/62480
> 
> And we should probably look into:
> https://github.com/llvm/llvm-project/issues/22476
> 
>

Apart from the per-cpu stack canary patch, there is another issue I posted
on GitHub: https://github.com/llvm/llvm-project/issues/60096

The related patch is:
https://lore.kernel.org/lkml/175116f75c38c15d8d73a03301eab805fea13a0a.1682673543.git.houwenlong.hwl@antgroup.com/

I couldn't find any related documentation about that; I hope you can help
me with it too.

One more problem that I didn't post is:
https://lore.kernel.org/lkml/8d6bbaf66b90cf1a8fd2c5da98f5e094b9ffcb27.1682673543.git.houwenlong.hwl@antgroup.com/

> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-05-05  6:14     ` Hou Wenlong
@ 2023-05-05 18:02       ` Nick Desaulniers
  2023-05-05 19:06         ` Fangrui Song
  2023-05-08  8:06         ` Hou Wenlong
  0 siblings, 2 replies; 80+ messages in thread
From: Nick Desaulniers @ 2023-05-05 18:02 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: Brian Gerst, linux-kernel, Kees Cook, x86, Nathan Chancellor,
	llvm, Fangrui Song

On Thu, May 4, 2023 at 11:14 PM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Tue, May 02, 2023 at 01:27:53AM +0800, Nick Desaulniers wrote:
> > On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > > +       bool
> > > +       # Although clang supports -mstack-protector-guard-reg option, it
> > > +       # would generate GOT reference for __stack_chk_guard even with
> > > +       # -fno-PIE flag.
> > > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> >
> > Hi Hou,
> > I've filed this bug against LLVM and will work with LLVM folks at
> > Intel to resolve:
> > https://github.com/llvm/llvm-project/issues/62481
> > Can you please review that report and let me know here or there if I
> > missed anything? Would you also mind including a link to that in the
> > comments in the next version of this patch?
> >
> Hi Nick,
>
> Thanks for your help, I'll include the link in the next version.
> Actually, I had post an issue on github too when I test the patch on
> LLVM. But no replies. :(.

Ah, sorry about that.  The issue tracker is pretty high volume and
stuff gets missed.  With many users comes many bug reports.  We could
be better about triage though.  If it's specific to the Linux kernel,
https://github.com/ClangBuiltLinux/linux/issues is a better issue
tracker to use; we can move bug reports upstream to
https://github.com/llvm/llvm-project/issues/ when necessary. It's
linked off of clangbuiltlinux.github.io if you lose it.

> https://github.com/llvm/llvm-project/issues/60116
>
> There is another problem I met for this patch, some unexpected code
> are generated:
>
> do_one_initcall: (init/main.o)
> ......
> movq    __stack_chk_guard(%rip), %rax
> movq    %rax,0x2b0(%rsp)
>
> The complier generates wrong instruction, no GOT reference and gs
> register. I only see it in init/main.c file. I have tried to move the
> function into another file and compiled it with same cflags. It could
> generate right instruction for the function in another file.

The wrong instruction or the wrong operand?  This is loading the
canary into the stack slot in the fn prolog.  Perhaps the expected
cflag is not getting set (or being removed) from init/main.c? You
should be able to do:

$ make LLVM=1 init/main.o V=1

to see how clang was invoked to see if the expected flag was there, or not.

>
> The LLVM chain toolsare built by myself:
> clang version 15.0.7 (https://github.com/llvm/llvm-project.git
> 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)

Perhaps worth rebuilding with top of tree, which is clang 17.

>
> > Less relevant issues I filed looking at some related codegen:
> > https://github.com/llvm/llvm-project/issues/62482
> > https://github.com/llvm/llvm-project/issues/62480
> >
> > And we should probably look into:
> > https://github.com/llvm/llvm-project/issues/22476
> >
> >
>
> Except for per-cpu stack canary patch, there is another issue I post on
> github: https://github.com/llvm/llvm-project/issues/60096

Thanks, I'll bring that up with Intel, too.

>
> The related patch is:
> https://lore.kernel.org/lkml/175116f75c38c15d8d73a03301eab805fea13a0a.1682673543.git.houwenlong.hwl@antgroup.com/
>
> I couldn't find the related documentation about that, hope you can help
> me too.
>
> One more problem that I didn't post is:
> https://lore.kernel.org/lkml/8d6bbaf66b90cf1a8fd2c5da98f5e094b9ffcb27.1682673543.git.houwenlong.hwl@antgroup.com/

Mind filing another bug for this in llvm's issue tracker? We can
discuss there if LLD needs to be doing something different.

Thanks for uncovering these and helping us get them fixed up!
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-05-05 18:02       ` Nick Desaulniers
@ 2023-05-05 19:06         ` Fangrui Song
  2023-05-08  8:06         ` Hou Wenlong
  1 sibling, 0 replies; 80+ messages in thread
From: Fangrui Song @ 2023-05-05 19:06 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: Brian Gerst, Nick Desaulniers, linux-kernel, Kees Cook, x86,
	Nathan Chancellor, llvm

On Fri, May 5, 2023 at 11:02 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Thu, May 4, 2023 at 11:14 PM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Tue, May 02, 2023 at 01:27:53AM +0800, Nick Desaulniers wrote:
> > > On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > > > +       bool
> > > > +       # Although clang supports -mstack-protector-guard-reg option, it
> > > > +       # would generate GOT reference for __stack_chk_guard even with
> > > > +       # -fno-PIE flag.
> > > > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> > >
> > > Hi Hou,
> > > I've filed this bug against LLVM and will work with LLVM folks at
> > > Intel to resolve:
> > > https://github.com/llvm/llvm-project/issues/62481
> > > Can you please review that report and let me know here or there if I
> > > missed anything? Would you also mind including a link to that in the
> > > comments in the next version of this patch?
> > >
> > Hi Nick,
> >
> > Thanks for your help, I'll include the link in the next version.
> > Actually, I had post an issue on github too when I test the patch on
> > LLVM. But no replies. :(.
>
> Ah, sorry about that.  The issue tracker is pretty high volume and
> stuff gets missed.  With many users comes many bug reports.  We could
> be better about triage though.  If it's specific to the Linux kernel,
> https://github.com/ClangBuiltLinux/linux/issues is a better issue
> tracker to use; we can move bug reports upstream to
> https://github.com/llvm/llvm-project/issues/ when necessary. It's
> linked off of clangbuiltlinux.github.io if you lose it.
>
> > https://github.com/llvm/llvm-project/issues/60116
> >
> > There is another problem I met for this patch, some unexpected code
> > are generated:
> >
> > do_one_initcall: (init/main.o)
> > ......
> > movq    __stack_chk_guard(%rip), %rax
> > movq    %rax,0x2b0(%rsp)
> >
> > The complier generates wrong instruction, no GOT reference and gs
> > register. I only see it in init/main.c file. I have tried to move the
> > function into another file and compiled it with same cflags. It could
> > generate right instruction for the function in another file.
>
> The wrong instruction or the wrong operand?  This is loading the
> canary into the stack slot in the fn prolog.  Perhaps the expected
> cflag is not getting set (or being removed) from init/main.c? You
> should be able to do:
>
> $ make LLVM=1 init/main.o V=1
>
> to see how clang was invoked to see if the expected flag was there, or not.
>
> >
> > The LLVM chain toolsare built by myself:
> > clang version 15.0.7 (https://github.com/llvm/llvm-project.git
> > 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
>
> Perhaps worth rebuilding with top of tree, which is clang 17.
>
> >
> > > Less relevant issues I filed looking at some related codegen:
> > > https://github.com/llvm/llvm-project/issues/62482
> > > https://github.com/llvm/llvm-project/issues/62480
> > >
> > > And we should probably look into:
> > > https://github.com/llvm/llvm-project/issues/22476
> > >
> > >
> >
> > Except for per-cpu stack canary patch, there is another issue I post on
> > github: https://github.com/llvm/llvm-project/issues/60096
>
> Thanks, I'll bring that up with Intel, too.
>
> >
> > The related patch is:
> > https://lore.kernel.org/lkml/175116f75c38c15d8d73a03301eab805fea13a0a.1682673543.git.houwenlong.hwl@antgroup.com/
> >
> > I couldn't find the related documentation about that, hope you can help
> > me too.
> >
> > One more problem that I didn't post is:
> > https://lore.kernel.org/lkml/8d6bbaf66b90cf1a8fd2c5da98f5e094b9ffcb27.1682673543.git.houwenlong.hwl@antgroup.com/
>
> Mind filing another bug for this in llvm's issue tracker? We can
> discuss there if LLD needs to be doing something different.
>
> Thanks for uncovering these and helping us get them fixed up!
> --
> Thanks,
> ~Nick Desaulniers

In my opinion, Clang's behavior is working as intended. The Linux
kernel should support R_X86_64_REX_GOTPCRELX, and the solution is
straightforward: treat R_X86_64_REX_GOTPCRELX the same way as
R_X86_64_PC32 (-shared -Bsymbolic), assuming that every symbol is
defined, which means that every symbol is non-preemptible.

Clang's `-fno-pic` option chooses `R_X86_64_REX_GOTPCRELX` which is
correct, although it differs from GCC's `-fno-pic` option.

The compiler doesn't know whether `__stack_chk_guard` will be provided
by the main executable (`libc.a`) or a shared object (`libc.so`,
available on some ports of glibc but not x86, on musl this is
available for all ports).
(Also see `__stack_chk_guard` on
https://maskray.me/blog/2022-12-18-control-flow-integrity)

If an `R_X86_64_32` relocation is used and `__stack_chk_guard` is
defined by a shared object, we will need an ELF hack called a [copy
relocation](https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected).

The instruction movq __stack_chk_guard@GOTPCREL(%rip), %rbx produces
an R_X86_64_REX_GOTPCRELX relocation.
If `__stack_chk_guard` is non-preemptible, linkers can [optimize the
access to be direct](https://maskray.me/blog/2021-08-29-all-about-global-offset-table#got-optimization).
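
As a concrete illustration (a minimal sketch; this is the standard
GOTPCRELX mov-to-lea relaxation, nothing specific to this patchset):

    # as emitted by the compiler: load the address from the GOT
    movq    __stack_chk_guard@GOTPCREL(%rip), %rbx  # R_X86_64_REX_GOTPCRELX
    # after relaxation, once the symbol is known to be non-preemptible
    leaq    __stack_chk_guard(%rip), %rbx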

Although we could technically use the
`-fno-direct-access-external-data` option to switch between
`R_X86_64_REX_GOTPCRELX` and `R_X86_64_32`, I think there is no
justification to complicate the compiler.



-- 
宋方睿

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible
  2023-04-28 15:22 ` [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Peter Zijlstra
@ 2023-05-06  7:19   ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-05-06  7:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Nathan Chancellor, Nick Desaulniers, Tom Rix, bpf, llvm

On Fri, Apr 28, 2023 at 11:22:06PM +0800, Peter Zijlstra wrote:
> 
> For some raison I didn't get 0/n but did get all of the others. Please
> keep your Cc list consistent.
>
Sorry, I'll pay attention next time.
 
> On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:
> 
> >   - It is not allowed to reference global variables in an alternative
> >     section since RIP-relative addressing is not fixed in
> >     apply_alternatives(). Fortunately, all disallowed relocations in the
> >     alternative section can be captured by objtool. I believe that this
> >     issue can also be fixed by using objtool.
> 
> https://lkml.kernel.org/r/Y9py2a5Xw0xbB8ou@hirez.programming.kicks-ass.net

Thank you for your patch. However, it's more complicated for the call
depth tracking case. Although the per-cpu variable in the alternative
section is relocated, the content of "skl_call_thunk_template" is
copied into another place, so the offset is still incorrect. I'm not
sure whether this case is common or not. Since the destination is
clear, I could do the relocation there as well, but it would make the
code more complicated.
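
To illustrate the problem (a sketch; the per-cpu symbol name is made
up, only the copying behaviour matters):

    # in skl_call_thunk_template, assembled at address A:
    incq    %gs:__pcpu_call_depth(%rip)   # RIP-relative displacement
    # the thunk patching code copies these bytes to address B, so the
    # same displacement now points (A - B) bytes away from the variable
    # and would need to be fixed up again for the new location.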

Thanks!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler
  2023-05-05 18:02       ` Nick Desaulniers
  2023-05-05 19:06         ` Fangrui Song
@ 2023-05-08  8:06         ` Hou Wenlong
  1 sibling, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-05-08  8:06 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Brian Gerst, linux-kernel, Kees Cook, x86, Nathan Chancellor,
	llvm, Fangrui Song

On Sat, May 06, 2023 at 02:02:25AM +0800, Nick Desaulniers wrote:
> On Thu, May 4, 2023 at 11:14 PM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Tue, May 02, 2023 at 01:27:53AM +0800, Nick Desaulniers wrote:
> > > On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > > > +       bool
> > > > +       # Although clang supports -mstack-protector-guard-reg option, it
> > > > +       # would generate GOT reference for __stack_chk_guard even with
> > > > +       # -fno-PIE flag.
> > > > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> > >
> > > Hi Hou,
> > > I've filed this bug against LLVM and will work with LLVM folks at
> > > Intel to resolve:
> > > https://github.com/llvm/llvm-project/issues/62481
> > > Can you please review that report and let me know here or there if I
> > > missed anything? Would you also mind including a link to that in the
> > > comments in the next version of this patch?
> > >
> > Hi Nick,
> >
> > Thanks for your help, I'll include the link in the next version.
> > Actually, I had post an issue on github too when I test the patch on
> > LLVM. But no replies. :(.
> 
> Ah, sorry about that.  The issue tracker is pretty high volume and
> stuff gets missed.  With many users comes many bug reports.  We could
> be better about triage though.  If it's specific to the Linux kernel,
> https://github.com/ClangBuiltLinux/linux/issues is a better issue
> tracker to use; we can move bug reports upstream to
> https://github.com/llvm/llvm-project/issues/ when necessary. It's
> linked off of clangbuiltlinux.github.io if you lose it.
> 
> > https://github.com/llvm/llvm-project/issues/60116
> >
> > There is another problem I met for this patch, some unexpected code
> > are generated:
> >
> > do_one_initcall: (init/main.o)
> > ......
> > movq    __stack_chk_guard(%rip), %rax
> > movq    %rax,0x2b0(%rsp)
> >
> > The complier generates wrong instruction, no GOT reference and gs
> > register. I only see it in init/main.c file. I have tried to move the
> > function into another file and compiled it with same cflags. It could
> > generate right instruction for the function in another file.
> 
> The wrong instruction or the wrong operand?  This is loading the
> canary into the stack slot in the fn prolog.  Perhaps the expected
> cflag is not getting set (or being removed) from init/main.c? You
> should be able to do:
> 
> $ make LLVM=1 init/main.o V=1
> 
> to see how clang was invoked to see if the expected flag was there, or not.
>
Hi Nick,
The output is:
  clang -Wp,-MMD,init/.main.o.d  -nostdinc -I./arch/x86/include
-I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi
-I./arch/x86/include/generated/uapi -I./include/uapi
-I./include/generated/uapi -include ./include/linux/compiler-version.h
-include ./include/linux/kconfig.h -include
./include/linux/compiler_types.h -D__KERNEL__ -Werror
-fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes
-Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE
-Werror=implicit-function-declaration -Werror=implicit-int
-Werror=return-type -Wno-format-security -funsigned-char -std=gnu11
--target=x86_64-linux-gnu -fintegrated-as -Werror=unknown-warning-option
-Werror=ignored-optimization-argument -Werror=option-ignored
-Werror=unused-command-line-argument -mno-sse -mno-mmx -mno-sse2
-mno-3dnow -mno-avx -fcf-protection=branch -fno-jump-tables -m64
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mstack-alignment=8
-mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel
-mstack-protector-guard-reg=gs
-mstack-protector-guard-symbol=__stack_chk_guard -Wno-sign-compare
-fno-asynchronous-unwind-tables -mretpoline-external-thunk
-mfunction-return=thunk-extern -fpatchable-function-entry=16,16
-fno-delete-null-pointer-checks -Wno-frame-address
-Wno-address-of-packed-member -O2 -Wframe-larger-than=2048
-fstack-protector-strong -Wno-gnu -Wno-unused-but-set-variable
-Wno-unused-const-variable -fomit-frame-pointer
-ftrivial-auto-var-init=zero
-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
-fno-stack-clash-protection -falign-functions=16
-Wdeclaration-after-statement -Wvla -Wno-pointer-sign
-Wcast-function-type -Wimplicit-fallthrough -fno-strict-overflow
-fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types
-Wno-initializer-overrides -Wno-format -Wformat-extra-args
-Wformat-invalid-specifier -Wformat-zero-length -Wnonnull
-Wformat-insufficient-args -Wno-sign-compare -Wno-pointer-to-enum-cast
-Wno-tautological-constant-out-of-range-compare -Wno-unaligned-access
-fno-function-sections -fno-data-sections
-DKBUILD_MODFILE='"init/main"' -DKBUILD_BASENAME='"main"'
-DKBUILD_MODNAME='"main"' -D__KBUILD_MODNAME=kmod_main -c -o init/main.o
init/main.c

I see the expected flags in the output, but the generated code is wrong:
00000000000006e0 <do_one_initcall>: (init/main.o)
     .......
     6ff: 48 8b 05 00 00 00 00          movq    (%rip), %rax # 0x706 <do_one_initcall+0x26>
                0000000000000702:  R_X86_64_PC32 __stack_chk_guard-0x4
     706: 48 89 84 24 e0 02 00 00       movq    %rax, 736(%rsp)

The expected generated code should be:
0000000000000010 <name_to_dev_t>: (init/do_mounts.o)
      ......
      2c: 4c 8b 25 00 00 00 00          movq    (%rip), %r12 # 0x33 <name_to_dev_t+0x23>
                000000000000002f:  R_X86_64_REX_GOTPCRELX __stack_chk_guard-0x4
      33: 65 49 8b 04 24                movq    %gs:(%r12), %rax
      38: 48 89 44 24 30                movq    %rax, 48(%rsp)

Actually, this is the main reason why I disabled the per-cpu stack
canary on LLVM. This patch could be picked up separately, if you have
time to help me find out the cause.

Thanks.
> >
> > The LLVM chain toolsare built by myself:
> > clang version 15.0.7 (https://github.com/llvm/llvm-project.git
> > 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
> 
> Perhaps worth rebuilding with top of tree, which is clang 17.
> 
> >
> > > Less relevant issues I filed looking at some related codegen:
> > > https://github.com/llvm/llvm-project/issues/62482
> > > https://github.com/llvm/llvm-project/issues/62480
> > >
> > > And we should probably look into:
> > > https://github.com/llvm/llvm-project/issues/22476
> > >
> > >
> >
> > Except for per-cpu stack canary patch, there is another issue I post on
> > github: https://github.com/llvm/llvm-project/issues/60096
> 
> Thanks, I'll bring that up with Intel, too.
> 
> >
> > The related patch is:
> > https://lore.kernel.org/lkml/175116f75c38c15d8d73a03301eab805fea13a0a.1682673543.git.houwenlong.hwl@antgroup.com/
> >
> > I couldn't find the related documentation about that, hope you can help
> > me too.
> >
> > One more problem that I didn't post is:
> > https://lore.kernel.org/lkml/8d6bbaf66b90cf1a8fd2c5da98f5e094b9ffcb27.1682673543.git.houwenlong.hwl@antgroup.com/
> 
> Mind filing another bug for this in llvm's issue tracker? We can
> discuss there if LLD needs to be doing something different.
> 
> Thanks for uncovering these and helping us get them fixed up!
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-04-28 19:29   ` Ard Biesheuvel
@ 2023-05-08  8:32     ` Hou Wenlong
  2023-05-08  9:16       ` Ard Biesheuvel
  0 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-05-08  8:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > Adapt module loading to support PIE relocations. No GOT is generared for
> > module, all the GOT entry of got references in module should exist in
> > kernel GOT.  Currently, there is only one usable got reference for
> > __fentry__().
> >
> 
> I don't think this is the right approach. We should permit GOTPCREL
> relocations properly, which means making them point to a location in
> memory that carries the absolute address of the symbol. There are
> several ways to go about that, but perhaps the simplest way is to make
> the symbol address in ksymtab a 64-bit absolute value (but retain the
> PC32 references for the symbol name and the symbol namespace name).
> That way, you can always resolve such GOTPCREL relocations by pointing
> it to the ksymtab entry. Another option would be to take inspiration
> from the PLT code we have on ARM and arm64 (and other architectures,
> surely) and to count the GOT based relocations, allocate some extra
> r/o module space for each, and allocate slots and populate them with
> the right value as you fix up the relocations.
> 
> Then, many such relocations can be relaxed at module load time if the
> symbol is in range. IIUC, the module and kernel will still be inside
> the same 2G window even after widening the KASLR range to 512G, so
> most GOT loads can be converted into RIP relative LEA instructions.
> 
> Note that this will also permit you to do things like
> 
> #define PV_VCPU_PREEMPTED_ASM \
>  "leaq __per_cpu_offset(%rip), %rax \n\t" \
>  "movq (%rax,%rdi,8), %rax \n\t" \
>  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
>  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
>  "setne %al\n\t"
> 
> or
> 
> +#ifdef CONFIG_X86_PIE
> + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> +#else
> " pushq $arch_rethook_trampoline\n"
> +#endif
> 
> instead of having these kludgy push/pop sequences to free up temp registers.
> 
> (FYI I have looked into this PIE linking just a few weeks ago [0] so
> this is all rather fresh in my memory)
> 
> 
> 
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> 
>
Hi Ard,
Thanks for providing the link; it has been very helpful for me as I am
new to the topic of compilers. One key difference I noticed is that you
linked the kernel with "-pie" instead of "--emit-relocs". I also noticed
that Thomas' initial patchset [0] used "-pie", but in RFC v3 [1] it
switched to "--emit-relocs" in order to reduce dynamic relocation space
in mapped memory.

Another issue is that it requires the addition of the
"-mrelax-relocations=no" option to support older compilers and linkers.
R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
in binutils 2.26 and later, but the minimum version required for the
kernel is 2.25. This option disables relocation relaxation, which makes
the GOT non-empty. I also noticed this option in
arch/x86/boot/compressed/Makefile, with the reason given in [2].
Without relocation relaxation, GOT references would increase the size
of the GOT. Therefore, I do not want to use GOT references in assembly
directly. However, I realized that the compiler could still generate
GOT references in some cases, such as "fentry" calls and stack canary
references.

Regarding module loading, I agree that we should support GOT
references for the module itself. I will refactor it according to your
suggestion.

Thanks.

[0] https://yhbt.net/lore/all/20170718223333.110371-20-thgarnie@google.com
[1] https://yhbt.net/lore/all/20171004212003.28296-1-thgarnie@google.com
[2] https://lore.kernel.org/all/20200903203053.3411268-2-samitolvanen@google.com/

> > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Cc: Thomas Garnier <thgarnie@chromium.org>
> > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> >  arch/x86/include/asm/sections.h |  5 +++++
> >  arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > index a6e8373a5170..dc1c2b08ec48 100644
> > --- a/arch/x86/include/asm/sections.h
> > +++ b/arch/x86/include/asm/sections.h
> > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> >
> >  #if defined(CONFIG_X86_64)
> >  extern char __end_rodata_hpage_align[];
> > +
> > +#ifdef CONFIG_X86_PIE
> > +extern char __start_got[], __end_got[];
> > +#endif
> > +
> >  #endif
> >
> >  extern char __end_of_kernel_reserve[];
> > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > index 84ad0e61ba6e..051f88e6884e 100644
> > --- a/arch/x86/kernel/module.c
> > +++ b/arch/x86/kernel/module.c
> > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> >         return 0;
> >  }
> >  #else /*X86_64*/
> > +#ifdef CONFIG_X86_PIE
> > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > +{
> > +       u64 *pos;
> > +
> > +       for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > +               if (*pos == sym->st_value)
> > +                       return (u64)pos + rela->r_addend;
> > +       return 0;
> > +}
> > +#endif
> > +
> >  static int __write_relocate_add(Elf64_Shdr *sechdrs,
> >                    const char *strtab,
> >                    unsigned int symindex,
> > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> >                 case R_X86_64_64:
> >                         size = 8;
> >                         break;
> > +#ifndef CONFIG_X86_PIE
> >                 case R_X86_64_32:
> >                         if (val != *(u32 *)&val)
> >                                 goto overflow;
> > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> >                                 goto overflow;
> >                         size = 4;
> >                         break;
> > +#else
> > +               case R_X86_64_GOTPCREL:
> > +                       val = find_got_kernel_entry(sym, rel);
> > +                       if (!val)
> > +                               goto unexpected_got_reference;
> > +                       fallthrough;
> > +#endif
> >                 case R_X86_64_PC32:
> >                 case R_X86_64_PLT32:
> >                         val -= (u64)loc;
> > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> >         }
> >         return 0;
> >
> > +#ifdef CONFIG_X86_PIE
> > +unexpected_got_reference:
> > +       pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > +       return -ENOEXEC;
> > +#else
> >  overflow:
> >         pr_err("overflow in relocation type %d val %Lx\n",
> >                (int)ELF64_R_TYPE(rel[i].r_info), val);
> >         pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> >                me->name);
> > +#endif
> > +
> >         return -ENOEXEC;
> >  }
> >
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-08  8:32     ` Hou Wenlong
@ 2023-05-08  9:16       ` Ard Biesheuvel
  2023-05-08 11:40         ` Hou Wenlong
  2023-05-10  7:09         ` Hou Wenlong
  0 siblings, 2 replies; 80+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  9:16 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > module, all the GOT entry of got references in module should exist in
> > > kernel GOT.  Currently, there is only one usable got reference for
> > > __fentry__().
> > >
> >
> > I don't think this is the right approach. We should permit GOTPCREL
> > relocations properly, which means making them point to a location in
> > memory that carries the absolute address of the symbol. There are
> > several ways to go about that, but perhaps the simplest way is to make
> > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > PC32 references for the symbol name and the symbol namespace name).
> > That way, you can always resolve such GOTPCREL relocations by pointing
> > it to the ksymtab entry. Another option would be to take inspiration
> > from the PLT code we have on ARM and arm64 (and other architectures,
> > surely) and to count the GOT based relocations, allocate some extra
> > r/o module space for each, and allocate slots and populate them with
> > the right value as you fix up the relocations.
> >
> > Then, many such relocations can be relaxed at module load time if the
> > symbol is in range. IIUC, the module and kernel will still be inside
> > the same 2G window even after widening the KASLR range to 512G, so
> > most GOT loads can be converted into RIP relative LEA instructions.
> >
> > Note that this will also permit you to do things like
> >
> > #define PV_VCPU_PREEMPTED_ASM \
> >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> >  "movq (%rax,%rdi,8), %rax \n\t" \
> >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> >  "setne %al\n\t"
> >
> > or
> >
> > +#ifdef CONFIG_X86_PIE
> > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > +#else
> > " pushq $arch_rethook_trampoline\n"
> > +#endif
> >
> > instead of having these kludgy push/pop sequences to free up temp registers.
> >
> > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > this is all rather fresh in my memory)
> >
> >
> >
> >
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> >
> >
> Hi Ard,
> Thanks for providing the link, it has been very helpful for me as I am
> new to the topic of compilers.

Happy to hear that.

> One key difference I noticed is that you
> linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> switched to "--emit-reloc" in order to reduce dynamic relocation space
> on mapped memory.
>

The problem with --emit-relocs is that the relocations emitted into
the binary may get out of sync with the actual code after the linker
has applied relocations.

$ cat /tmp/a.s
foo:movq foo@GOTPCREL(%rip), %rax

$ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o

/tmp/a.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
   0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
3: R_X86_64_REX_GOTPCRELX foo-0x4

$ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.o
0000000000000000 <foo>:
   0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
3: R_X86_64_REX_GOTPCRELX foo-0x4

$ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
-Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
0000000000401000 <foo>:
  401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
401003: R_X86_64_32S foo

$ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
-Wl,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
0000000000001000 <foo>:
    1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
1003: R_X86_64_PC32 foo-0x4

This all looks as expected. However, when using Clang, we end up with

$ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
-fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
00000000000012c0 <foo>:
    12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
12c3: R_X86_64_REX_GOTPCRELX foo-0x4

So in this case, what --emit-relocs gives us is not what is actually
in the binary. We cannot just ignore these either, given that they are
treated differently depending on whether the symbol is a per-CPU
symbol or not - in the former case, we need to perform a fixup if the
relaxed reference is RIP relative, and in the latter case, if the
relaxed reference is absolute.

On top of that, --emit-relocs does not cover the GOT, so we'd still
need to process that from the code explicitly.

In general, relying on --emit-relocs is kind of dodgy, and I think
combining PIE linking with --emit-relocs is a bad idea.

> The another issue is that it requires the addition of the
> "-mrelax-relocations=no" option to support older compilers and linkers.

Why? The decompressor is now linked in PIE mode so we should be able
to drop that. Or do you need to add it somewhere else?

> R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> in binutils 2.26 and later, but the mini version required for the kernel
> is 2.25. This option disables relocation relaxation, which makes GOT not
> empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> with the reason given in [2]. Without relocation relaxation, GOT
> references would increase the size of GOT. Therefore, I do not want to
> use GOT reference in assembly directly.  However, I realized that the
> compiler could still generate GOT references in some cases such as
> "fentry" calls and stack canary references.
>

The stack canary references are under discussion here [3]. I have also
sent a patch for kallsyms symbol references [4]. Beyond that, there
should be very few cases where GOT entries are emitted, so I don't
think this is fundamentally a problem.

I haven't run into the __fentry__ issue myself: do you think we should
fix this in the compiler?

> Regarding module loading, I agree that we should support GOT reference
> for the module itself. I will refactor it according to your suggestion.
>

Excellent, good luck with that.

However, you will still need to make a convincing case for why this is
all worth the trouble. Especially given that you disable the depth
tracking code, which I don't think should be mutually exclusive.

I am aware that this is rather tricky, and involves rewriting
RIP-relative per-CPU variable accesses, but it would be good to get a
discussion started on that topic, and figure out whether there is a
way forward there. Ignoring it is not going to help.
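
For reference, a minimal sketch of the kind of access that needs
rewriting (the variable name is only an example; the exact sequence
the compiler emits depends on the per-CPU layout chosen):

    # mcmodel=kernel, zero-based per-CPU segment: absolute
    # 32-bit displacement against %gs
    movq    %gs:my_percpu_var, %rax
    # PIE: the same access becomes RIP-relative, so the displacement
    # depends on where the code ends up relative to the per-CPU area
    # and may need a load-time fixup
    movq    %gs:my_percpu_var(%rip), %rax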


>
> [0] https://yhbt.net/lore/all/20170718223333.110371-20-thgarnie@google.com
> [1] https://yhbt.net/lore/all/20171004212003.28296-1-thgarnie@google.com
> [2] https://lore.kernel.org/all/20200903203053.3411268-2-samitolvanen@google.com/
>

[3] https://github.com/llvm/llvm-project/issues/60116
[4] 20230504174320.3930345-1-ardb@kernel.org

> > > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > > Cc: Thomas Garnier <thgarnie@chromium.org>
> > > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > > Cc: Kees Cook <keescook@chromium.org>
> > > ---
> > >  arch/x86/include/asm/sections.h |  5 +++++
> > >  arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
> > >  2 files changed, 32 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > index a6e8373a5170..dc1c2b08ec48 100644
> > > --- a/arch/x86/include/asm/sections.h
> > > +++ b/arch/x86/include/asm/sections.h
> > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > >
> > >  #if defined(CONFIG_X86_64)
> > >  extern char __end_rodata_hpage_align[];
> > > +
> > > +#ifdef CONFIG_X86_PIE
> > > +extern char __start_got[], __end_got[];
> > > +#endif
> > > +
> > >  #endif
> > >
> > >  extern char __end_of_kernel_reserve[];
> > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > index 84ad0e61ba6e..051f88e6884e 100644
> > > --- a/arch/x86/kernel/module.c
> > > +++ b/arch/x86/kernel/module.c
> > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > >         return 0;
> > >  }
> > >  #else /*X86_64*/
> > > +#ifdef CONFIG_X86_PIE
> > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > +{
> > > +       u64 *pos;
> > > +
> > > +       for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > +               if (*pos == sym->st_value)
> > > +                       return (u64)pos + rela->r_addend;
> > > +       return 0;
> > > +}
> > > +#endif
> > > +
> > >  static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > >                    const char *strtab,
> > >                    unsigned int symindex,
> > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > >                 case R_X86_64_64:
> > >                         size = 8;
> > >                         break;
> > > +#ifndef CONFIG_X86_PIE
> > >                 case R_X86_64_32:
> > >                         if (val != *(u32 *)&val)
> > >                                 goto overflow;
> > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > >                                 goto overflow;
> > >                         size = 4;
> > >                         break;
> > > +#else
> > > +               case R_X86_64_GOTPCREL:
> > > +                       val = find_got_kernel_entry(sym, rel);
> > > +                       if (!val)
> > > +                               goto unexpected_got_reference;
> > > +                       fallthrough;
> > > +#endif
> > >                 case R_X86_64_PC32:
> > >                 case R_X86_64_PLT32:
> > >                         val -= (u64)loc;
> > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > >         }
> > >         return 0;
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > +unexpected_got_reference:
> > > +       pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > +       return -ENOEXEC;
> > > +#else
> > >  overflow:
> > >         pr_err("overflow in relocation type %d val %Lx\n",
> > >                (int)ELF64_R_TYPE(rel[i].r_info), val);
> > >         pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > >                me->name);
> > > +#endif
> > > +
> > >         return -ENOEXEC;
> > >  }
> > >
> > > --
> > > 2.31.1
> > >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-08  9:16       ` Ard Biesheuvel
@ 2023-05-08 11:40         ` Hou Wenlong
  2023-05-08 17:47           ` Ard Biesheuvel
  2023-05-10  7:09         ` Hou Wenlong
  1 sibling, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-05-08 11:40 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > > module, all the GOT entry of got references in module should exist in
> > > > kernel GOT.  Currently, there is only one usable got reference for
> > > > __fentry__().
> > > >
> > >
> > > I don't think this is the right approach. We should permit GOTPCREL
> > > relocations properly, which means making them point to a location in
> > > memory that carries the absolute address of the symbol. There are
> > > several ways to go about that, but perhaps the simplest way is to make
> > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > PC32 references for the symbol name and the symbol namespace name).
> > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > it to the ksymtab entry. Another option would be to take inspiration
> > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > surely) and to count the GOT based relocations, allocate some extra
> > > r/o module space for each, and allocate slots and populate them with
> > > the right value as you fix up the relocations.
> > >
> > > Then, many such relocations can be relaxed at module load time if the
> > > symbol is in range. IIUC, the module and kernel will still be inside
> > > the same 2G window even after widening the KASLR range to 512G, so
> > > most GOT loads can be converted into RIP relative LEA instructions.
> > >
> > > Note that this will also permit you to do things like
> > >
> > > #define PV_VCPU_PREEMPTED_ASM \
> > >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > >  "movq (%rax,%rdi,8), %rax \n\t" \
> > >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > >  "setne %al\n\t"
> > >
> > > or
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > +#else
> > > " pushq $arch_rethook_trampoline\n"
> > > +#endif
> > >
> > > instead of having these kludgy push/pop sequences to free up temp registers.
> > >
> > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > this is all rather fresh in my memory)
> > >
> > >
> > >
> > >
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > >
> > >
> > Hi Ard,
> > Thanks for providing the link, it has been very helpful for me as I am
> > new to the topic of compilers.
> 
> Happy to hear that.
>
 
Thanks for your prompt reply.

> > One key difference I noticed is that you
> > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > on mapped memory.
> >
> 
> The problem with --emit-relocs is that the relocations emitted into
> the binary may get out of sync with the actual code after the linker
> has applied relocations.
> 
> $ cat /tmp/a.s
> foo:movq foo@GOTPCREL(%rip), %rax
> 
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 
> /tmp/a.o:     file format elf64-x86-64
> 
> 
> Disassembly of section .text:
> 
> 0000000000000000 <foo>:
>    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 0000000000000000 <foo>:
>    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000401000 <foo>:
>   401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
> 401003: R_X86_64_32S foo
> 
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000001000 <foo>:
>     1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
> 1003: R_X86_64_PC32 foo-0x4
> 
> This all looks as expected. However, when using Clang, we end up with
> 
> $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 00000000000012c0 <foo>:
>     12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
> 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> So in this case, what --emit-relocs gives us is not what is actually
> in the binary. We cannot just ignore these either, given that they are
> treated differently depending on whether the symbol is a per-CPU
> symbol or not - in the former case, we need to perform a fixup if the
> relaxed reference is RIP relative, and in the latter case, if the
> relaxed reference is absolute.
>
With symbols hidden and the compile-time address of the kernel image
kept in the top 2G, is it possible for the relaxed reference to be
absolute, even if I keep the percpu section zero-mapped for SMP? I
didn't see an absolute relaxed reference after dropping the
"-mrelax-relocations=no" option.

> On top of that, --emit-relocs does not cover the GOT, so we'd still
> need to process that from the code explicitly.
>
Yes, the relocs tool would process the GOT and generate
R_X86_64_GLOB_DAT relocations for the GOT entries in patch 27:
https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com

> In general, relying on --emit-relocs is kind of dodgy, and I think
> combining PIE linking with --emit-relocs is a bad idea.
> 
> > The another issue is that it requires the addition of the
> > "-mrelax-relocations=no" option to support older compilers and linkers.
> 
> Why? The decompressor is now linked in PIE mode so we should be able
> to drop that. Or do you need to add is somewhere else?
>
I tried binutils 2.25 (the minimum version); it couldn't recognize
R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.

> > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > in binutils 2.26 and later, but the mini version required for the kernel
> > is 2.25. This option disables relocation relaxation, which makes GOT not
> > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > with the reason given in [2]. Without relocation relaxation, GOT
> > references would increase the size of GOT. Therefore, I do not want to
> > use GOT reference in assembly directly.  However, I realized that the
> > compiler could still generate GOT references in some cases such as
> > "fentry" calls and stack canary references.
> >
> 
> The stack canary references are under discussion here [3]. I have also
> sent a patch for kallsyms symbol references [4]. Beyond that, there
> should be very few cases where GOT entries are emitted, so I don't
> think this is fundamentally a problem.
> 
> I haven't run into the __fentry__ issue myself: do you think we should
> fix this in the compiler?
>
The issue with __fentry__ is that the compiler generates a 6-byte
indirect call through the GOT with the "-fPIE" option. However, the
original ftrace nop patching assumes it is a 5-byte direct call. And
the "-mnop-mcount" option is not compatible with "-fPIE", so the
compiler wouldn't patch it as a nop.

So we should patch it with one 5-byte nop followed by one 1-byte nop.
This way, ftrace can handle the first 5 bytes as before. Also, I have
built a PIE kernel with relocation relaxation on GCC, and the linker
relaxes it as follows:
ffffffff810018f0 <do_one_initcall>:
ffffffff810018f0:       f3 0f 1e fa             endbr64
ffffffff810018f4:       67 e8 a6 d6 05 00       addr32 call ffffffff8105efa0 <__fentry__>
			ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4

It still requires different nop patching for ftrace. I noticed the
"Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which
suggests that the GOT indirect call can be relaxed as "call fentry;
nop" or "nop; call fentry"; it appears that the latter is chosen. If
the linker could generate the former, then no fixup would be necessary
for ftrace with PIE.
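
To make the byte patterns concrete (a sketch; the encodings are the
standard x86-64 ones, the displacements are left as xx):

    # -fPIE, unrelaxed: 6-byte indirect call through the GOT
    ff 15 xx xx xx xx    call   *__fentry__@GOTPCREL(%rip)
    # what ftrace expects today: 5-byte direct call
    e8 xx xx xx xx       call   __fentry__
    # proposed patching of the 6-byte form: 5-byte nop + 1-byte nop,
    # so the first 5 bytes can still be handled as before
    0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
    90                   nop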

> > Regarding module loading, I agree that we should support GOT reference
> > for the module itself. I will refactor it according to your suggestion.
> >
> 
> Excellent, good luck with that.
> 
> However, you will still need to make a convincing case for why this is
> all worth the trouble. Especially given that you disable the depth
> tracking code, which I don't think should be mutually exclusive.
>
Actually, I could do the relocation when applying the patching for the
depth tracking code. I'm not sure whether such a case is common or not.

> I am aware that this a rather tricky, and involves rewriting
> RIP-relative per-CPU variable accesses, but it would be good to get a
> discussion started on that topic, and figure out whether there is a
> way forward there. Ignoring it is not going to help.
> 
>
I see that your PIE linking chose to put the per-cpu section at a high
kernel image address, while I still keep it zero-mapped. However, both
are within the RIP-relative addressing range.

> >
> > [0] https://yhbt.net/lore/all/20170718223333.110371-20-thgarnie@google.com
> > [1] https://yhbt.net/lore/all/20171004212003.28296-1-thgarnie@google.com
> > [2] https://lore.kernel.org/all/20200903203053.3411268-2-samitolvanen@google.com/
> >
> 
> [3] https://github.com/llvm/llvm-project/issues/60116
> [4] 20230504174320.3930345-1-ardb@kernel.org
> 
> > > > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > > > Cc: Thomas Garnier <thgarnie@chromium.org>
> > > > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > > > Cc: Kees Cook <keescook@chromium.org>
> > > > ---
> > > >  arch/x86/include/asm/sections.h |  5 +++++
> > > >  arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
> > > >  2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > > index a6e8373a5170..dc1c2b08ec48 100644
> > > > --- a/arch/x86/include/asm/sections.h
> > > > +++ b/arch/x86/include/asm/sections.h
> > > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > > >
> > > >  #if defined(CONFIG_X86_64)
> > > >  extern char __end_rodata_hpage_align[];
> > > > +
> > > > +#ifdef CONFIG_X86_PIE
> > > > +extern char __start_got[], __end_got[];
> > > > +#endif
> > > > +
> > > >  #endif
> > > >
> > > >  extern char __end_of_kernel_reserve[];
> > > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > > index 84ad0e61ba6e..051f88e6884e 100644
> > > > --- a/arch/x86/kernel/module.c
> > > > +++ b/arch/x86/kernel/module.c
> > > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > > >         return 0;
> > > >  }
> > > >  #else /*X86_64*/
> > > > +#ifdef CONFIG_X86_PIE
> > > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > > +{
> > > > +       u64 *pos;
> > > > +
> > > > +       for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > > +               if (*pos == sym->st_value)
> > > > +                       return (u64)pos + rela->r_addend;
> > > > +       return 0;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                    const char *strtab,
> > > >                    unsigned int symindex,
> > > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                 case R_X86_64_64:
> > > >                         size = 8;
> > > >                         break;
> > > > +#ifndef CONFIG_X86_PIE
> > > >                 case R_X86_64_32:
> > > >                         if (val != *(u32 *)&val)
> > > >                                 goto overflow;
> > > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                                 goto overflow;
> > > >                         size = 4;
> > > >                         break;
> > > > +#else
> > > > +               case R_X86_64_GOTPCREL:
> > > > +                       val = find_got_kernel_entry(sym, rel);
> > > > +                       if (!val)
> > > > +                               goto unexpected_got_reference;
> > > > +                       fallthrough;
> > > > +#endif
> > > >                 case R_X86_64_PC32:
> > > >                 case R_X86_64_PLT32:
> > > >                         val -= (u64)loc;
> > > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >         }
> > > >         return 0;
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > +unexpected_got_reference:
> > > > +       pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > > +       return -ENOEXEC;
> > > > +#else
> > > >  overflow:
> > > >         pr_err("overflow in relocation type %d val %Lx\n",
> > > >                (int)ELF64_R_TYPE(rel[i].r_info), val);
> > > >         pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > > >                me->name);
> > > > +#endif
> > > > +
> > > >         return -ENOEXEC;
> > > >  }
> > > >
> > > > --
> > > > 2.31.1
> > > >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only
  2023-04-30 14:23   ` Ard Biesheuvel
@ 2023-05-08 11:40     ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-05-08 11:40 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Thomas Garnier, Lai Jiangshan, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Arnd Bergmann, Peter Zijlstra, Josh Poimboeuf,
	Juergen Gross, Brian Gerst, linux-arch

On Sun, Apr 30, 2023 at 10:23:56PM +0800, Ard Biesheuvel wrote:
> On Fri, 28 Apr 2023 at 11:55, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > From: Thomas Garnier <thgarnie@chromium.org>
> >
> > From: Thomas Garnier <thgarnie@chromium.org>
> >
> > The GOT is changed during early boot when relocations are applied. Make
> > it read-only directly. This table exists only for PIE binary. Since weak
> > symbol reference would always be GOT reference, there are 8 entries in
> > GOT, but only one entry for __fentry__() is in use.  Other GOT
> > references have been optimized by linker.
> >
> > [Hou Wenlong: Change commit message and skip GOT size check]
> >
> > Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
> > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > ---
> >  arch/x86/kernel/vmlinux.lds.S     |  2 ++
> >  include/asm-generic/vmlinux.lds.h | 12 ++++++++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> > index f02dcde9f8a8..fa4c6582663f 100644
> > --- a/arch/x86/kernel/vmlinux.lds.S
> > +++ b/arch/x86/kernel/vmlinux.lds.S
> > @@ -462,6 +462,7 @@ SECTIONS
> >  #endif
> >                "Unexpected GOT/PLT entries detected!")
> >
> > +#ifndef CONFIG_X86_PIE
> >         /*
> >          * Sections that should stay zero sized, which is safer to
> >          * explicitly check instead of blindly discarding.
> > @@ -470,6 +471,7 @@ SECTIONS
> >                 *(.got) *(.igot.*)
> >         }
> >         ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")
> > +#endif
> >
> >         .plt : {
> >                 *(.plt) *(.plt.*) *(.iplt)
> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index d1f57e4868ed..438ed8b39896 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -441,6 +441,17 @@
> >         __end_ro_after_init = .;
> >  #endif
> >
> > +#ifdef CONFIG_X86_PIE
> > +#define RO_GOT_X86
> 
> Please don't put X86 specific stuff in generic code.
> 
> > +       .got        : AT(ADDR(.got) - LOAD_OFFSET) {                    \
> > +               __start_got = .;                                        \
> > +               *(.got) *(.igot.*);                                     \
> > +               __end_got = .;                                          \
> > +       }
> > +#else
> > +#define RO_GOT_X86
> > +#endif
> > +
> 
> I don't think it makes sense for this definition to be conditional.
> You can include it conditionally from the x86 code, but even that
> seems unnecessary, given that it will be empty otherwise.
>
Hi Ard,
I know that X86-specific stuff should not be in generic code. I notice
that even when relocation relaxation is enabled, the GOT is not empty.
I want it to be included in the read-only data section (RODATA) between
the symbols __start_rodata and __end_rodata for safety. I noticed that
you placed it in the data section in your PIE linking. Should it be
marked as read-only if it is not empty?

Thanks.
> >  /*
> >   * .kcfi_traps contains a list KCFI trap locations.
> >   */
> > @@ -486,6 +497,7 @@
> >                 BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend_late, _pci_fixups_suspend_late, __start, __end) \
> >         }                                                               \
> >                                                                         \
> > +       RO_GOT_X86                                                      \
> >         FW_LOADER_BUILT_IN_DATA                                         \
> >         TRACEDATA                                                       \
> >                                                                         \
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-08 11:40         ` Hou Wenlong
@ 2023-05-08 17:47           ` Ard Biesheuvel
  2023-05-09  9:42             ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Ard Biesheuvel @ 2023-05-08 17:47 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Mon, 8 May 2023 at 13:45, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > > >
> > > > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > > > module, all the GOT entry of got references in module should exist in
> > > > > kernel GOT.  Currently, there is only one usable got reference for
> > > > > __fentry__().
> > > > >
> > > >
> > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > relocations properly, which means making them point to a location in
> > > > memory that carries the absolute address of the symbol. There are
> > > > several ways to go about that, but perhaps the simplest way is to make
> > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > PC32 references for the symbol name and the symbol namespace name).
> > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > surely) and to count the GOT based relocations, allocate some extra
> > > > r/o module space for each, and allocate slots and populate them with
> > > > the right value as you fix up the relocations.
> > > >
> > > > Then, many such relocations can be relaxed at module load time if the
> > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > >
> > > > Note that this will also permit you to do things like
> > > >
> > > > #define PV_VCPU_PREEMPTED_ASM \
> > > >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > >  "movq (%rax,%rdi,8), %rax \n\t" \
> > > >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > >  "setne %al\n\t"
> > > >
> > > > or
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > +#else
> > > > " pushq $arch_rethook_trampoline\n"
> > > > +#endif
> > > >
> > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > >
> > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > this is all rather fresh in my memory)
> > > >
> > > >
> > > >
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > >
> > > >
> > > Hi Ard,
> > > Thanks for providing the link, it has been very helpful for me as I am
> > > new to the topic of compilers.
> >
> > Happy to hear that.
> >
>
> Thanks for your prompt reply.
>
> > > One key difference I noticed is that you
> > > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > > on mapped memory.
> > >
> >
> > The problem with --emit-relocs is that the relocations emitted into
> > the binary may get out of sync with the actual code after the linker
> > has applied relocations.
> >
> > $ cat /tmp/a.s
> > foo:movq foo@GOTPCREL(%rip), %rax
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> >
> > /tmp/a.o:     file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > 0000000000000000 <foo>:
> >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000401000 <foo>:
> >   401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
> > 401003: R_X86_64_32S foo
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000001000 <foo>:
> >     1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
> > 1003: R_X86_64_PC32 foo-0x4
> >
> > This all looks as expected. However, when using Clang, we end up with
> >
> > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 00000000000012c0 <foo>:
> >     12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
> > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > So in this case, what --emit-relocs gives us is not what is actually
> > in the binary. We cannot just ignore these either, given that they are
> > treated differently depending on whether the symbol is a per-CPU
> > symbol or not - in the former case, we need to perform a fixup if the
> > relaxed reference is RIP relative, and in the latter case, if the
> > relaxed reference is absolute.
> >
> With symbols hidden and the compile-time address of the kernel image
> kept in the top 2G, is it possible for the relaxed reference to be
> absolute, even if I keep the percpu section zero-mapping for SMP?  I
> didn't see absoulte relaxed reference after dropping
> "-mrelax-relocations=no" option.
>

If you link in PIE mode, you should never see absolute references
after relaxation.

> > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > need to process that from the code explicitly.
> >
> Yes, so the relocs tool would process GOT, and generate
> R_X86_64_GLOB_DAT relocation for GOT entries in patch 27:
> https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com
>

Yes, something like that is needed. I'd lean towards generating the
reloc data directly instead of creating an artificial RELA section
with GLOB_DAT relocations, but that is a minor detail.

> > In general, relying on --emit-relocs is kind of dodgy, and I think
> > combining PIE linking with --emit-relocs is a bad idea.
> >
> > > The another issue is that it requires the addition of the
> > > "-mrelax-relocations=no" option to support older compilers and linkers.
> >
> > Why? The decompressor is now linked in PIE mode so we should be able
> > to drop that. Or do you need to add is somewhere else?
> >
> I tried to use binutils 2.25 (mini version), it couldn't recognize
> R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.
>

I'm not sure that matters. If the assembler accepts @GOTPCREL
notation, it should generate the relocations that the linker can
understand. If the toolchain is not internally consistent in this
regard, I don't think it is our problem.

This might mean that we end up with more residual GOT entries than
with a more recent toolchain, but I don't think that is a big deal.

> > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > in binutils 2.26 and later, but the mini version required for the kernel
> > > is 2.25. This option disables relocation relaxation, which makes GOT not
> > > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > > with the reason given in [2]. Without relocation relaxation, GOT
> > > references would increase the size of GOT. Therefore, I do not want to
> > > use GOT reference in assembly directly.  However, I realized that the
> > > compiler could still generate GOT references in some cases such as
> > > "fentry" calls and stack canary references.
> > >
> >
> > The stack canary references are under discussion here [3]. I have also
> > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > should be very few cases where GOT entries are emitted, so I don't
> > think this is fundamentally a problem.
> >
> > I haven't run into the __fentry__ issue myself: do you think we should
> > fix this in the compiler?
> >
> The issue about __fentry__ is that the compiler would generate 6-bytes
> indirect call through GOT with "-fPIE" option. However, the original
> ftrace nop patching assumes it is a 5-bytes direct call. And
> "-mnop-mcount" option is not compatiable with "-fPIE" option, so the
> complier woudn't patch it as nop.
>
> So we should patch it with one 5-bytes nop followed by one 1-byte nop,
> This way, ftrace can handle the previous 5-bytes as before. Also I have
> built PIE kernel with relocation relaxation on GCC, and the linker would
> relax it as following:
> ffffffff810018f0 <do_one_initcall>:
> ffffffff810018f0:       f3 0f 1e fa             endbr64
> ffffffff810018f4:       67 e8 a6 d6 05 00       addr32 call ffffffff8105efa0 <__fentry__>
>                         ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
>

But if fentry is a function symbol, I would not expect the codegen to
be different at all. Are you using -fno-plt?

> It still requires a different nop patching for ftrace. I notice
> "Optimize GOTPCRELX Relocations" chapter in x86-64 psABI, which suggests
> that the GOT indirect call can be relaxed as "call fentry nop" or "nop
> call fentry", it appears that the latter is chosen. If the linker could
> generate the former, then no fixup would be necessary for ftrace with
> PIE.
>

Right. I think this may be a result of __fentry__ not being subject to
the same rules wrt visibility etc, similar to __stack_chk_guard. These
are arguably compiler issues that could qualify as bugs, given that
these symbol references don't behave like ordinary symbol references.

> > > Regarding module loading, I agree that we should support GOT reference
> > > for the module itself. I will refactor it according to your suggestion.
> > >
> >
> > Excellent, good luck with that.
> >
> > However, you will still need to make a convincing case for why this is
> > all worth the trouble. Especially given that you disable the depth
> > tracking code, which I don't think should be mutually exclusive.
> >
> Actually, I could do relocation for it when apply patching for the
> depth tracking code. I'm not sure such case is common or not.
>

I think that alternatives patching in general would need to support
RIP relative references in the alternatives. The depth tracking
template is a bit different in this regard, and could be fixed more
easily, I think.
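
To be explicit about what that fixup amounts to (just a sketch of the
arithmetic, not the actual patching code): when an instruction with a
RIP-relative operand is copied from the alternative at address 'repl'
to the patch site at 'instr', its 32-bit displacement has to be rebased
so that it still resolves to the same absolute target:

	target = repl + insn_len + disp
	disp'  = target - (instr + insn_len)
	       = disp + (repl - instr)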

> > I am aware that this a rather tricky, and involves rewriting
> > RIP-relative per-CPU variable accesses, but it would be good to get a
> > discussion started on that topic, and figure out whether there is a
> > way forward there. Ignoring it is not going to help.
> >
> >
> I see that your PIE linking chose to put the per-cpu section in high
> kernel image address, I still keep it as zero-mapping. However, both are
> in the RIP-relative addressing range.
>

Pure PIE linking cannot support the zero mapping - it can only work
with --emit-relocs, which I was trying to avoid.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-08 17:47           ` Ard Biesheuvel
@ 2023-05-09  9:42             ` Hou Wenlong
  2023-05-09  9:52               ` Ard Biesheuvel
  0 siblings, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-05-09  9:42 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 13:45, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > > > >
> > > > > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > > > > module, all the GOT entry of got references in module should exist in
> > > > > > kernel GOT.  Currently, there is only one usable got reference for
> > > > > > __fentry__().
> > > > > >
> > > > >
> > > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > > relocations properly, which means making them point to a location in
> > > > > memory that carries the absolute address of the symbol. There are
> > > > > several ways to go about that, but perhaps the simplest way is to make
> > > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > > PC32 references for the symbol name and the symbol namespace name).
> > > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > > surely) and to count the GOT based relocations, allocate some extra
> > > > > r/o module space for each, and allocate slots and populate them with
> > > > > the right value as you fix up the relocations.
> > > > >
> > > > > Then, many such relocations can be relaxed at module load time if the
> > > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > > >
> > > > > Note that this will also permit you to do things like
> > > > >
> > > > > #define PV_VCPU_PREEMPTED_ASM \
> > > > >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > > >  "movq (%rax,%rdi,8), %rax \n\t" \
> > > > >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > > >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > > >  "setne %al\n\t"
> > > > >
> > > > > or
> > > > >
> > > > > +#ifdef CONFIG_X86_PIE
> > > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > > +#else
> > > > > " pushq $arch_rethook_trampoline\n"
> > > > > +#endif
> > > > >
> > > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > > >
> > > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > > this is all rather fresh in my memory)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > > >
> > > > >
> > > > Hi Ard,
> > > > Thanks for providing the link, it has been very helpful for me as I am
> > > > new to the topic of compilers.
> > >
> > > Happy to hear that.
> > >
> >
> > Thanks for your prompt reply.
> >
> > > > One key difference I noticed is that you
> > > > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > > > on mapped memory.
> > > >
> > >
> > > The problem with --emit-relocs is that the relocations emitted into
> > > the binary may get out of sync with the actual code after the linker
> > > has applied relocations.
> > >
> > > $ cat /tmp/a.s
> > > foo:movq foo@GOTPCREL(%rip), %rax
> > >
> > > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > >
> > > /tmp/a.o:     file format elf64-x86-64
> > >
> > >
> > > Disassembly of section .text:
> > >
> > > 0000000000000000 <foo>:
> > >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > > 0000000000000000 <foo>:
> > >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 0000000000401000 <foo>:
> > >   401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
> > > 401003: R_X86_64_32S foo
> > >
> > > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 0000000000001000 <foo>:
> > >     1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
> > > 1003: R_X86_64_PC32 foo-0x4
> > >
> > > This all looks as expected. However, when using Clang, we end up with
> > >
> > > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 00000000000012c0 <foo>:
> > >     12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
> > > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > So in this case, what --emit-relocs gives us is not what is actually
> > > in the binary. We cannot just ignore these either, given that they are
> > > treated differently depending on whether the symbol is a per-CPU
> > > symbol or not - in the former case, we need to perform a fixup if the
> > > relaxed reference is RIP relative, and in the latter case, if the
> > > relaxed reference is absolute.
> > >
> > With symbols hidden and the compile-time address of the kernel image
> > kept in the top 2G, is it possible for the relaxed reference to be
> > absolute, even if I keep the percpu section zero-mapping for SMP?  I
> > didn't see absoulte relaxed reference after dropping
> > "-mrelax-relocations=no" option.
> >
> 
> If you link in PIE mode, you should never see absolute references
> after relaxation.
> 
> > > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > > need to process that from the code explicitly.
> > >
> > Yes, so the relocs tool would process GOT, and generate
> > R_X86_64_GLOB_DAT relocation for GOT entries in patch 27:
> > https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com
> >
> 
> Yes, something like that is needed. I'd lean towards generating the
> reloc data directly instead of creating an artifiical RELA section
> with GLOB_DAT relocations, but that is a minor detail.
> 
> > > In general, relying on --emit-relocs is kind of dodgy, and I think
> > > combining PIE linking with --emit-relocs is a bad idea.
> > >
> > > > The another issue is that it requires the addition of the
> > > > "-mrelax-relocations=no" option to support older compilers and linkers.
> > >
> > > Why? The decompressor is now linked in PIE mode so we should be able
> > > to drop that. Or do you need to add is somewhere else?
> > >
> > I tried to use binutils 2.25 (mini version), it couldn't recognize
> > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.
> >
> 
> I'm not sure that matters. If the assembler accepts @GOTPCREL
> notation, it should generate the relocations that the linker can
> understand. If the toolchain is not internally consistent in this
> regard, I don't think it is our problem.
>
I get it. Thanks.
 
> This might mean that we end up with more residual GOT entries than
> with a more recent toolchain, but I don't think that is a big deal.
> 
> > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > > in binutils 2.26 and later, but the mini version required for the kernel
> > > > is 2.25. This option disables relocation relaxation, which makes GOT not
> > > > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > > > with the reason given in [2]. Without relocation relaxation, GOT
> > > > references would increase the size of GOT. Therefore, I do not want to
> > > > use GOT reference in assembly directly.  However, I realized that the
> > > > compiler could still generate GOT references in some cases such as
> > > > "fentry" calls and stack canary references.
> > > >
> > >
> > > The stack canary references are under discussion here [3]. I have also
> > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > should be very few cases where GOT entries are emitted, so I don't
> > > think this is fundamentally a problem.
> > >
> > > I haven't run into the __fentry__ issue myself: do you think we should
> > > fix this in the compiler?
> > >
> > The issue about __fentry__ is that the compiler would generate 6-bytes
> > indirect call through GOT with "-fPIE" option. However, the original
> > ftrace nop patching assumes it is a 5-bytes direct call. And
> > "-mnop-mcount" option is not compatiable with "-fPIE" option, so the
> > complier woudn't patch it as nop.
> >
> > So we should patch it with one 5-bytes nop followed by one 1-byte nop,
> > This way, ftrace can handle the previous 5-bytes as before. Also I have
> > built PIE kernel with relocation relaxation on GCC, and the linker would
> > relax it as following:
> > ffffffff810018f0 <do_one_initcall>:
> > ffffffff810018f0:       f3 0f 1e fa             endbr64
> > ffffffff810018f4:       67 e8 a6 d6 05 00       addr32 call ffffffff8105efa0 <__fentry__>
> >                         ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> >
> 
> But if fentry is a function symbol, I would not expect the codegen to
> be different at all. Are you using -fno-plt?
>
No, even with -fno-plt added, the compiler still generates a GOT
reference for fentry. Therefore, the problem may be visibility, as you
said.
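
For reference, a minimal reproduction of what I am seeing (the exact
output may differ with other compiler and binutils versions):

$ cat t.c
void f(void) {}
$ gcc -O2 -fPIE -pg -mfentry -fno-plt -c -o t.o t.c
$ objdump -dr t.o
0000000000000000 <f>:
   0:	ff 15 00 00 00 00    	call   *0x0(%rip)        # 6 <f+0x6>
			2: R_X86_64_GOTPCRELX	__fentry__-0x4
   6:	c3                   	ret

i.e. a 6-byte indirect call through the GOT where ftrace expects a
5-byte direct call.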

> > It still requires a different nop patching for ftrace. I notice
> > "Optimize GOTPCRELX Relocations" chapter in x86-64 psABI, which suggests
> > that the GOT indirect call can be relaxed as "call fentry nop" or "nop
> > call fentry", it appears that the latter is chosen. If the linker could
> > generate the former, then no fixup would be necessary for ftrace with
> > PIE.
> >
> 
> Right. I think this may be a result of __fentry__ not being subject to
> the same rules wrt visibility etc, similar to __stack_chk_guard. These
> are arguably compiler issues that could qualify as bugs, given that
> these symbol references don't behave like ordinary symbol references.
> 
> > > > Regarding module loading, I agree that we should support GOT reference
> > > > for the module itself. I will refactor it according to your suggestion.
> > > >
> > >
> > > Excellent, good luck with that.
> > >
> > > However, you will still need to make a convincing case for why this is
> > > all worth the trouble. Especially given that you disable the depth
> > > tracking code, which I don't think should be mutually exclusive.
> > >
> > Actually, I could do relocation for it when apply patching for the
> > depth tracking code. I'm not sure such case is common or not.
> >
> 
> I think that alternatives patching in general would need to support
> RIP relative references in the alternatives. The depth tracking
> template is a bit different in this regard, and could be fixed more
> easily, I think.
> 
> > > I am aware that this a rather tricky, and involves rewriting
> > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > discussion started on that topic, and figure out whether there is a
> > > way forward there. Ignoring it is not going to help.
> > >
> > >
> > I see that your PIE linking chose to put the per-cpu section in high
> > kernel image address, I still keep it as zero-mapping. However, both are
> > in the RIP-relative addressing range.
> >
> 
> Pure PIE linking cannot support the zero mapping - it can only work
> with --emit-relocs, which I was trying to avoid.
Sorry, why doesn't PIE linking support zero mapping? I noticed that the
commit message for your PIE linking states, "if we randomize the
kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
reference needs to be decreased by the same amount in order for the
produced offset to remain correct." Because of that, I decided to
decrease the GS base and not to relocate RIP-relative per-CPU references
in the relocs tool. Consequently, no RIP-relative reference, whether it
points at a per-CPU variable or not, requires relocation.
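
To spell out the arithmetic I rely on (my own notation, not taken from
the patches): for a per-CPU access of the form %gs:sym(%rip), the
displacement is fixed at link time as disp = sym - rip_link, so after
relocating the image by delta the computed address is

	gs_base + (rip_link + delta) + disp = gs_base + sym + delta

and loading gs_base - delta instead of gs_base gives back gs_base + sym,
without touching any of the RIP-relative references themselves.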

Furthermore, all symbols are hidden, which means per-CPU references do
not generate GOT references and are relaxed to absolute references due
to the zero mapping. However, __stack_chk_guard with Clang always
generates a GOT reference, and I didn't see it being relaxed to an
absolute reference with LLVM.

Thanks!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-09  9:42             ` Hou Wenlong
@ 2023-05-09  9:52               ` Ard Biesheuvel
  2023-05-09 12:35                 ` Hou Wenlong
  0 siblings, 1 reply; 80+ messages in thread
From: Ard Biesheuvel @ 2023-05-09  9:52 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Tue, 9 May 2023 at 11:42, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 13:45, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > > >
> > > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
...
> > > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > > > in binutils 2.26 and later, but the mini version required for the kernel
> > > > > is 2.25. This option disables relocation relaxation, which makes GOT not
> > > > > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > > > > with the reason given in [2]. Without relocation relaxation, GOT
> > > > > references would increase the size of GOT. Therefore, I do not want to
> > > > > use GOT reference in assembly directly.  However, I realized that the
> > > > > compiler could still generate GOT references in some cases such as
> > > > > "fentry" calls and stack canary references.
> > > > >
> > > >
> > > > The stack canary references are under discussion here [3]. I have also
> > > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > > should be very few cases where GOT entries are emitted, so I don't
> > > > think this is fundamentally a problem.
> > > >
> > > > I haven't run into the __fentry__ issue myself: do you think we should
> > > > fix this in the compiler?
> > > >
> > > The issue about __fentry__ is that the compiler would generate 6-bytes
> > > indirect call through GOT with "-fPIE" option. However, the original
> > > ftrace nop patching assumes it is a 5-bytes direct call. And
> > > "-mnop-mcount" option is not compatiable with "-fPIE" option, so the
> > > complier woudn't patch it as nop.
> > >
> > > So we should patch it with one 5-bytes nop followed by one 1-byte nop,
> > > This way, ftrace can handle the previous 5-bytes as before. Also I have
> > > built PIE kernel with relocation relaxation on GCC, and the linker would
> > > relax it as following:
> > > ffffffff810018f0 <do_one_initcall>:
> > > ffffffff810018f0:       f3 0f 1e fa             endbr64
> > > ffffffff810018f4:       67 e8 a6 d6 05 00       addr32 call ffffffff8105efa0 <__fentry__>
> > >                         ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> > >
> >
> > But if fentry is a function symbol, I would not expect the codegen to
> > be different at all. Are you using -fno-plt?
> >
> No, even with -fno-plt added, the compiler still generates a GOT
> reference for fentry. Therefore, the problem may be visibility, as you
> said.
>

Yeah, I spotted this issue in GCC - I just sent them a patch this morning.

> > > It still requires a different nop patching for ftrace. I notice
> > > "Optimize GOTPCRELX Relocations" chapter in x86-64 psABI, which suggests
> > > that the GOT indirect call can be relaxed as "call fentry nop" or "nop
> > > call fentry", it appears that the latter is chosen. If the linker could
> > > generate the former, then no fixup would be necessary for ftrace with
> > > PIE.
> > >
> >
> > Right. I think this may be a result of __fentry__ not being subject to
> > the same rules wrt visibility etc, similar to __stack_chk_guard. These
> > are arguably compiler issues that could qualify as bugs, given that
> > these symbol references don't behave like ordinary symbol references.
> >
> > > > > Regarding module loading, I agree that we should support GOT reference
> > > > > for the module itself. I will refactor it according to your suggestion.
> > > > >
> > > >
> > > > Excellent, good luck with that.
> > > >
> > > > However, you will still need to make a convincing case for why this is
> > > > all worth the trouble. Especially given that you disable the depth
> > > > tracking code, which I don't think should be mutually exclusive.
> > > >
> > > Actually, I could do relocation for it when apply patching for the
> > > depth tracking code. I'm not sure such case is common or not.
> > >
> >
> > I think that alternatives patching in general would need to support
> > RIP relative references in the alternatives. The depth tracking
> > template is a bit different in this regard, and could be fixed more
> > easily, I think.
> >
> > > > I am aware that this a rather tricky, and involves rewriting
> > > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > > discussion started on that topic, and figure out whether there is a
> > > > way forward there. Ignoring it is not going to help.
> > > >
> > > >
> > > I see that your PIE linking chose to put the per-cpu section in high
> > > kernel image address, I still keep it as zero-mapping. However, both are
> > > in the RIP-relative addressing range.
> > >
> >
> > Pure PIE linking cannot support the zero mapping - it can only work
> > with --emit-relocs, which I was trying to avoid.
> Sorry, why doesn't PIE linking support zero mapping? I noticed in the
> commit message for your PIE linking that it stated, "if we randomize the
> kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
> reference needs to be decreased by the same amount in order for the
> produced offset to remain correct." As a result, I decided to decrease
> the GS base and not relocate the RIP-relative per-CPU reference in the
> relocs. Consequently, all RIP-relative references, regardless of whether
> they are per-CPU variables or not, do not require relocation.
>

Interesting. Does that work as expected with dynamically allocated
per-CPU variables?

> Furthermore, all symbols are hidden, which implies that all per-CPU
> references will not generate a GOT reference and will be relaxed as
> absolute reference due to zero mapping. However, the __stack_chk_guard
> on CLANG always generates a GOT reference, but I didn't see it being
> relaxed as absolute reference on LLVM.
>

Yeah, we should fix that.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-09  9:52               ` Ard Biesheuvel
@ 2023-05-09 12:35                 ` Hou Wenlong
  0 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-05-09 12:35 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Tue, May 09, 2023 at 05:52:31PM +0800, Ard Biesheuvel wrote:
> On Tue, 9 May 2023 at 11:42, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> > > On Mon, 8 May 2023 at 13:45, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > > > >
> > > > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> ...
> > > > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > > > > in binutils 2.26 and later, but the mini version required for the kernel
> > > > > > is 2.25. This option disables relocation relaxation, which makes GOT not
> > > > > > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > > > > > with the reason given in [2]. Without relocation relaxation, GOT
> > > > > > references would increase the size of GOT. Therefore, I do not want to
> > > > > > use GOT reference in assembly directly.  However, I realized that the
> > > > > > compiler could still generate GOT references in some cases such as
> > > > > > "fentry" calls and stack canary references.
> > > > > >
> > > > >
> > > > > The stack canary references are under discussion here [3]. I have also
> > > > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > > > should be very few cases where GOT entries are emitted, so I don't
> > > > > think this is fundamentally a problem.
> > > > >
> > > > > I haven't run into the __fentry__ issue myself: do you think we should
> > > > > fix this in the compiler?
> > > > >
> > > > The issue about __fentry__ is that the compiler would generate 6-bytes
> > > > indirect call through GOT with "-fPIE" option. However, the original
> > > > ftrace nop patching assumes it is a 5-bytes direct call. And
> > > > "-mnop-mcount" option is not compatiable with "-fPIE" option, so the
> > > > complier woudn't patch it as nop.
> > > >
> > > > So we should patch it with one 5-bytes nop followed by one 1-byte nop,
> > > > This way, ftrace can handle the previous 5-bytes as before. Also I have
> > > > built PIE kernel with relocation relaxation on GCC, and the linker would
> > > > relax it as following:
> > > > ffffffff810018f0 <do_one_initcall>:
> > > > ffffffff810018f0:       f3 0f 1e fa             endbr64
> > > > ffffffff810018f4:       67 e8 a6 d6 05 00       addr32 call ffffffff8105efa0 <__fentry__>
> > > >                         ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> > > >
> > >
> > > But if fentry is a function symbol, I would not expect the codegen to
> > > be different at all. Are you using -fno-plt?
> > >
> > No, even with -fno-plt added, the compiler still generates a GOT
> > reference for fentry. Therefore, the problem may be visibility, as you
> > said.
> >
> 
> Yeah, I spotted this issue in GCC - I just sent them a patch this morning.
> 
> > > > It still requires a different nop patching for ftrace. I notice
> > > > "Optimize GOTPCRELX Relocations" chapter in x86-64 psABI, which suggests
> > > > that the GOT indirect call can be relaxed as "call fentry nop" or "nop
> > > > call fentry", it appears that the latter is chosen. If the linker could
> > > > generate the former, then no fixup would be necessary for ftrace with
> > > > PIE.
> > > >
> > >
> > > Right. I think this may be a result of __fentry__ not being subject to
> > > the same rules wrt visibility etc, similar to __stack_chk_guard. These
> > > are arguably compiler issues that could qualify as bugs, given that
> > > these symbol references don't behave like ordinary symbol references.
> > >
> > > > > > Regarding module loading, I agree that we should support GOT reference
> > > > > > for the module itself. I will refactor it according to your suggestion.
> > > > > >
> > > > >
> > > > > Excellent, good luck with that.
> > > > >
> > > > > However, you will still need to make a convincing case for why this is
> > > > > all worth the trouble. Especially given that you disable the depth
> > > > > tracking code, which I don't think should be mutually exclusive.
> > > > >
> > > > Actually, I could do relocation for it when apply patching for the
> > > > depth tracking code. I'm not sure such case is common or not.
> > > >
> > >
> > > I think that alternatives patching in general would need to support
> > > RIP relative references in the alternatives. The depth tracking
> > > template is a bit different in this regard, and could be fixed more
> > > easily, I think.
> > >
> > > > > I am aware that this a rather tricky, and involves rewriting
> > > > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > > > discussion started on that topic, and figure out whether there is a
> > > > > way forward there. Ignoring it is not going to help.
> > > > >
> > > > >
> > > > I see that your PIE linking chose to put the per-cpu section in high
> > > > kernel image address, I still keep it as zero-mapping. However, both are
> > > > in the RIP-relative addressing range.
> > > >
> > >
> > > Pure PIE linking cannot support the zero mapping - it can only work
> > > with --emit-relocs, which I was trying to avoid.
> > Sorry, why doesn't PIE linking support zero mapping? I noticed in the
> > commit message for your PIE linking that it stated, "if we randomize the
> > kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
> > reference needs to be decreased by the same amount in order for the
> > produced offset to remain correct." As a result, I decided to decrease
> > the GS base and not relocate the RIP-relative per-CPU reference in the
> > relocs. Consequently, all RIP-relative references, regardless of whether
> > they are per-CPU variables or not, do not require relocation.
> >
> 
> Interesting. Does that work as expected with dynamically allocated
> per-CPU variables?
>
I didn't encounter any issues with dynamically allocated per-CPU
variables. Since the per_cpu_ptr macro uses the __per_cpu_offset array
directly, it should work. In any case, I have tested loading the kvm
module, which uses dynamically allocated per-CPU variables, and
successfully booted a guest.
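
Roughly, my mental model is the following (simplified, the real macro
goes through SHIFT_PERCPU_PTR()):

	#define per_cpu_ptr(ptr, cpu) \
		((typeof(ptr))((unsigned long)(ptr) + __per_cpu_offset[cpu]))

so the pointer is formed from the __per_cpu_offset[] array rather than
from a RIP-relative reference to the variable itself, and relocating the
kernel image does not change the result.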

The related patch is:
https://lore.kernel.org/lkml/62d7e9e73467b711351a84ebce99372d3dccaa73.1682673543.git.houwenlong.hwl@antgroup.com
 
Thanks!
> > Furthermore, all symbols are hidden, which implies that all per-CPU
> > references will not generate a GOT reference and will be relaxed as
> > absolute reference due to zero mapping. However, the __stack_chk_guard
> > on CLANG always generates a GOT reference, but I didn't see it being
> > relaxed as absolute reference on LLVM.
> >
> 
> Yeah, we should fix that.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-08  9:16       ` Ard Biesheuvel
  2023-05-08 11:40         ` Hou Wenlong
@ 2023-05-10  7:09         ` Hou Wenlong
  2023-05-10  8:15           ` Ard Biesheuvel
  1 sibling, 1 reply; 80+ messages in thread
From: Hou Wenlong @ 2023-05-10  7:09 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> >
> > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > >
> > > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > > module, all the GOT entry of got references in module should exist in
> > > > kernel GOT.  Currently, there is only one usable got reference for
> > > > __fentry__().
> > > >
> > >
> > > I don't think this is the right approach. We should permit GOTPCREL
> > > relocations properly, which means making them point to a location in
> > > memory that carries the absolute address of the symbol. There are
> > > several ways to go about that, but perhaps the simplest way is to make
> > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > PC32 references for the symbol name and the symbol namespace name).
> > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > it to the ksymtab entry. Another option would be to take inspiration
> > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > surely) and to count the GOT based relocations, allocate some extra
> > > r/o module space for each, and allocate slots and populate them with
> > > the right value as you fix up the relocations.
> > >
> > > Then, many such relocations can be relaxed at module load time if the
> > > symbol is in range. IIUC, the module and kernel will still be inside
> > > the same 2G window even after widening the KASLR range to 512G, so
> > > most GOT loads can be converted into RIP relative LEA instructions.
> > >
> > > Note that this will also permit you to do things like
> > >
> > > #define PV_VCPU_PREEMPTED_ASM \
> > >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > >  "movq (%rax,%rdi,8), %rax \n\t" \
> > >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > >  "setne %al\n\t"
> > >
> > > or
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > +#else
> > > " pushq $arch_rethook_trampoline\n"
> > > +#endif
> > >
> > > instead of having these kludgy push/pop sequences to free up temp registers.
> > >
> > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > this is all rather fresh in my memory)
> > >
> > >
> > >
> > >
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > >
> > >
> > Hi Ard,
> > Thanks for providing the link, it has been very helpful for me as I am
> > new to the topic of compilers.
> 
> Happy to hear that.
> 
> > One key difference I noticed is that you
> > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > on mapped memory.
> >
> 
> The problem with --emit-relocs is that the relocations emitted into
> the binary may get out of sync with the actual code after the linker
> has applied relocations.
> 
> $ cat /tmp/a.s
> foo:movq foo@GOTPCREL(%rip), %rax
> 
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 
> /tmp/a.o:     file format elf64-x86-64
> 
> 
> Disassembly of section .text:
> 
> 0000000000000000 <foo>:
>    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 0000000000000000 <foo>:
>    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000401000 <foo>:
>   401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
> 401003: R_X86_64_32S foo
> 
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000001000 <foo>:
>     1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
> 1003: R_X86_64_PC32 foo-0x4
> 
> This all looks as expected. However, when using Clang, we end up with
> 
> $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 00000000000012c0 <foo>:
>     12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
> 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> 
> So in this case, what --emit-relocs gives us is not what is actually
> in the binary. We cannot just ignore these either, given that they are
> treated differently depending on whether the symbol is a per-CPU
> symbol or not - in the former case, we need to perform a fixup if the
> relaxed reference is RIP relative, and in the latter case, if the
> relaxed reference is absolute.
> 
> On top of that, --emit-relocs does not cover the GOT, so we'd still
> need to process that from the code explicitly.
> 
> In general, relying on --emit-relocs is kind of dodgy, and I think
> combining PIE linking with --emit-relocs is a bad idea.
> 
> > The another issue is that it requires the addition of the
> > "-mrelax-relocations=no" option to support older compilers and linkers.
> 
> Why? The decompressor is now linked in PIE mode so we should be able
> to drop that. Or do you need to add is somewhere else?
>
Hi Ard,

After removing the "-mrelax-relocations=no" option, I noticed that the
linker was relaxing GOT references into absolute references for mov
instructions, even when the symbol was at a high address, as long as I
kept the compile-time base address of the kernel image in the top 2G. I
consulted the "Optimize GOTPCRELX Relocations" chapter of the x86-64
psABI, which states that "When position-independent code is disabled and
foo is defined locally in the lower 32-bit address space, memory operand
in mov can be converted into immediate operand". However, it seems that
if the symbol is in the higher 32-bit address space, the memory operand
in mov is also converted into an immediate operand. If I decrease the
compile-time base address of the kernel image, it is relaxed to lea
instead. Therefore, I believe "-mrelax-relocations=no" is still
necessary when linking without the "-pie" option. Is there a way to
force the linker to relax it to lea without using the "-pie" option
when linking?
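
A minimal reproduction of the behaviour, in the style of your earlier
example (the link address is only for illustration):

$ cat /tmp/a.s
foo:	movq foo@GOTPCREL(%rip), %rax
$ gcc -o /tmp/a.elf -nostartfiles \
    -Wl,-no-pie,-q,--defsym,_start=0x0,--section-start=.text=0xffffffff81000000 /tmp/a.s
$ objdump -dr /tmp/a.elf
ffffffff81000000 <foo>:
ffffffff81000000:	48 c7 c0 00 00 00 81	mov    $0xffffffff81000000,%rax
			ffffffff81000003: R_X86_64_32S	foo

so the GOT load is still converted into an absolute mov immediate even
though the symbol is far above the lower 32-bit address space.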

Since not all GOT references can be omitted, perhaps I should try
linking the kernel with the "-pie" option.

Thanks!

> > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > in binutils 2.26 and later, but the mini version required for the kernel
> > is 2.25. This option disables relocation relaxation, which makes GOT not
> > empty. I also noticed this option in arch/x86/boot/compressed/Makefile
> > with the reason given in [2]. Without relocation relaxation, GOT
> > references would increase the size of GOT. Therefore, I do not want to
> > use GOT reference in assembly directly.  However, I realized that the
> > compiler could still generate GOT references in some cases such as
> > "fentry" calls and stack canary references.
> >
> 
> The stack canary references are under discussion here [3]. I have also
> sent a patch for kallsyms symbol references [4]. Beyond that, there
> should be very few cases where GOT entries are emitted, so I don't
> think this is fundamentally a problem.
> 
> I haven't run into the __fentry__ issue myself: do you think we should
> fix this in the compiler?
> 
> > Regarding module loading, I agree that we should support GOT reference
> > for the module itself. I will refactor it according to your suggestion.
> >
> 
> Excellent, good luck with that.
> 
> However, you will still need to make a convincing case for why this is
> all worth the trouble. Especially given that you disable the depth
> tracking code, which I don't think should be mutually exclusive.
> 
> I am aware that this a rather tricky, and involves rewriting
> RIP-relative per-CPU variable accesses, but it would be good to get a
> discussion started on that topic, and figure out whether there is a
> way forward there. Ignoring it is not going to help.
> 
> 
> >
> > [0] https://yhbt.net/lore/all/20170718223333.110371-20-thgarnie@google.com
> > [1] https://yhbt.net/lore/all/20171004212003.28296-1-thgarnie@google.com
> > [2] https://lore.kernel.org/all/20200903203053.3411268-2-samitolvanen@google.com/
> >
> 
> [3] https://github.com/llvm/llvm-project/issues/60116
> [4] 20230504174320.3930345-1-ardb@kernel.org
> 
> > > > Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> > > > Cc: Thomas Garnier <thgarnie@chromium.org>
> > > > Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > > > Cc: Kees Cook <keescook@chromium.org>
> > > > ---
> > > >  arch/x86/include/asm/sections.h |  5 +++++
> > > >  arch/x86/kernel/module.c        | 27 +++++++++++++++++++++++++++
> > > >  2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > > index a6e8373a5170..dc1c2b08ec48 100644
> > > > --- a/arch/x86/include/asm/sections.h
> > > > +++ b/arch/x86/include/asm/sections.h
> > > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > > >
> > > >  #if defined(CONFIG_X86_64)
> > > >  extern char __end_rodata_hpage_align[];
> > > > +
> > > > +#ifdef CONFIG_X86_PIE
> > > > +extern char __start_got[], __end_got[];
> > > > +#endif
> > > > +
> > > >  #endif
> > > >
> > > >  extern char __end_of_kernel_reserve[];
> > > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > > index 84ad0e61ba6e..051f88e6884e 100644
> > > > --- a/arch/x86/kernel/module.c
> > > > +++ b/arch/x86/kernel/module.c
> > > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > > >         return 0;
> > > >  }
> > > >  #else /*X86_64*/
> > > > +#ifdef CONFIG_X86_PIE
> > > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > > +{
> > > > +       u64 *pos;
> > > > +
> > > > +       for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > > +               if (*pos == sym->st_value)
> > > > +                       return (u64)pos + rela->r_addend;
> > > > +       return 0;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                    const char *strtab,
> > > >                    unsigned int symindex,
> > > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                 case R_X86_64_64:
> > > >                         size = 8;
> > > >                         break;
> > > > +#ifndef CONFIG_X86_PIE
> > > >                 case R_X86_64_32:
> > > >                         if (val != *(u32 *)&val)
> > > >                                 goto overflow;
> > > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >                                 goto overflow;
> > > >                         size = 4;
> > > >                         break;
> > > > +#else
> > > > +               case R_X86_64_GOTPCREL:
> > > > +                       val = find_got_kernel_entry(sym, rel);
> > > > +                       if (!val)
> > > > +                               goto unexpected_got_reference;
> > > > +                       fallthrough;
> > > > +#endif
> > > >                 case R_X86_64_PC32:
> > > >                 case R_X86_64_PLT32:
> > > >                         val -= (u64)loc;
> > > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > >         }
> > > >         return 0;
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > +unexpected_got_reference:
> > > > +       pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > > +       return -ENOEXEC;
> > > > +#else
> > > >  overflow:
> > > >         pr_err("overflow in relocation type %d val %Lx\n",
> > > >                (int)ELF64_R_TYPE(rel[i].r_info), val);
> > > >         pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > > >                me->name);
> > > > +#endif
> > > > +
> > > >         return -ENOEXEC;
> > > >  }
> > > >
> > > > --
> > > > 2.31.1
> > > >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
  2023-05-10  7:09         ` Hou Wenlong
@ 2023-05-10  8:15           ` Ard Biesheuvel
  0 siblings, 0 replies; 80+ messages in thread
From: Ard Biesheuvel @ 2023-05-10  8:15 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: linux-kernel, Lai Jiangshan, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
	Jason A. Donenfeld, Song Liu, Julian Pidancet

On Wed, 10 May 2023 at 09:15, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 10:38, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > > > >
> > > > > Adapt module loading to support PIE relocations. No GOT is generared for
> > > > > module, all the GOT entry of got references in module should exist in
> > > > > kernel GOT.  Currently, there is only one usable got reference for
> > > > > __fentry__().
> > > > >
> > > >
> > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > relocations properly, which means making them point to a location in
> > > > memory that carries the absolute address of the symbol. There are
> > > > several ways to go about that, but perhaps the simplest way is to make
> > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > PC32 references for the symbol name and the symbol namespace name).
> > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > surely) and to count the GOT based relocations, allocate some extra
> > > > r/o module space for each, and allocate slots and populate them with
> > > > the right value as you fix up the relocations.
> > > >
> > > > Then, many such relocations can be relaxed at module load time if the
> > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > >
> > > > Note that this will also permit you to do things like
> > > >
> > > > #define PV_VCPU_PREEMPTED_ASM \
> > > >  "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > >  "movq (%rax,%rdi,8), %rax \n\t" \
> > > >  "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > >  "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > >  "setne %al\n\t"
> > > >
> > > > or
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > +#else
> > > > " pushq $arch_rethook_trampoline\n"
> > > > +#endif
> > > >
> > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > >
> > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > this is all rather fresh in my memory)
> > > >
> > > >
> > > >
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > >
> > > >
> > > Hi Ard,
> > > Thanks for providing the link, it has been very helpful for me as I am
> > > new to the topic of compilers.
> >
> > Happy to hear that.
> >
> > > One key difference I noticed is that you
> > > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > > on mapped memory.
> > >
> >
> > The problem with --emit-relocs is that the relocations emitted into
> > the binary may get out of sync with the actual code after the linker
> > has applied relocations.
> >
> > $ cat /tmp/a.s
> > foo:movq foo@GOTPCREL(%rip), %rax
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> >
> > /tmp/a.o:     file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > 0000000000000000 <foo>:
> >    0: 48 8b 05 00 00 00 00 mov    0x0(%rip),%rax        # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000401000 <foo>:
> >   401000: 48 c7 c0 00 10 40 00 mov    $0x401000,%rax
> > 401003: R_X86_64_32S foo
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000001000 <foo>:
> >     1000: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 1000 <foo>
> > 1003: R_X86_64_PC32 foo-0x4
> >
> > This all looks as expected. However, when using Clang, we end up with
> >
> > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 00000000000012c0 <foo>:
> >     12c0: 48 8d 05 f9 ff ff ff lea    -0x7(%rip),%rax        # 12c0 <foo>
> > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > So in this case, what --emit-relocs gives us is not what is actually
> > in the binary. We cannot just ignore these either, given that they are
> > treated differently depending on whether the symbol is a per-CPU
> > symbol or not - in the former case, we need to perform a fixup if the
> > relaxed reference is RIP relative, and in the latter case, if the
> > relaxed reference is absolute.
> >
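
So the fixup code has to look at the opcode bytes to tell what the
linker actually emitted at a recorded GOTPCRELX site, e.g. (untested,
'loc' being the location the stale relocation entry points at):

/* Relaxed to a RIP-relative lea (48 8d /r disp32)? */
static bool relaxed_to_rip_relative(const u8 *loc)
{
        return loc[-2] == 0x8d;
}

/* Relaxed to an absolute immediate, i.e. "mov $foo, %reg" (48 c7 /0 imm32)? */
static bool relaxed_to_absolute(const u8 *loc)
{
        return loc[-2] == 0xc7;
}
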
> > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > need to process that from the code explicitly.
> >
> > In general, relying on --emit-relocs is kind of dodgy, and I think
> > combining PIE linking with --emit-relocs is a bad idea.
> >
> > > Another issue is that it requires adding the
> > > "-mrelax-relocations=no" option to support older compilers and linkers.
> >
> > Why? The decompressor is now linked in PIE mode so we should be able
> > to drop that. Or do you need to add it somewhere else?
> >
> Hi Ard,
>
> After removing the "-mrelax-relocations=no" option, I noticed that the
> linker was relaxing GOT references in mov instructions into absolute
> references, even if the symbol was at a high address, as long as I kept
> the compile-time base address of the kernel image in the top 2G. I
> consulted the "Optimize GOTPCRELX Relocations" chapter of the x86-64
> psABI, which states that "When position-independent code is disabled
> and foo is defined locally in the lower 32-bit address space, memory
> operand in mov can be converted into immediate operand". However, it
> seemed that the memory operand in mov was also converted into an
> immediate operand when the symbol was in the upper, sign-extended
> 32-bit address space. If I decreased the compile-time base address of
> the kernel image, it was relaxed into a lea instead. Therefore, I
> believe that "-mrelax-relocations=no" is necessary when linking without
> the "-pie" option.

Indeed. As you noted, the linker assumes that non-PIE linked binaries
will always appear at their link time address, and relaxations will
try to take advantage of that.

Currently, we use -pie linking only for the decompressor, and we
should be able to drop -mrelax-relocations=no from its LDFLAGS. But
position dependent linking should not use relaxations at all.

> Is there a way to force the linker to relax
> it into a lea without using the "-pie" option when linking?
>

Not that I am aware of.

> Since not all GOT references can be omitted, perhaps I should try linking
> the kernel with the "-pie" option.
>

That way, we will end up with two sets of relocations, the static ones
from --emit-relocs and the dynamic ones from -pie. This should be
manageable, given that the difference between those sets should
exactly cover the GOT.
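
Processing that dynamic set should then be straightforward: with every
symbol resolved at link time, it is essentially a walk over
R_X86_64_RELATIVE entries, roughly like this untested sketch (names
made up):

static void __init apply_pie_got_relocations(const Elf64_Rela *rela,
                                             unsigned long num,
                                             unsigned long load_offset)
{
        unsigned long i;

        for (i = 0; i < num; i++) {
                u64 *place = (u64 *)(rela[i].r_offset + load_offset);

                if (ELF64_R_TYPE(rela[i].r_info) == R_X86_64_RELATIVE)
                        *place = rela[i].r_addend + load_offset;
        }
}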

However, relying on --emit-relocs and -pie at the same time seems
clumsy to me. I'd prefer to only depend on -pie at /some/ point.

-- 
Ard.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
  2023-04-28  9:50 ` [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction Hou Wenlong
@ 2023-06-01  9:29     ` Juergen Gross via Virtualization
  0 siblings, 0 replies; 80+ messages in thread
From: Juergen Gross @ 2023-06-01  9:29 UTC (permalink / raw)
  To: Hou Wenlong, linux-kernel
  Cc: Thomas Garnier, Lai Jiangshan, Kees Cook,
	Srivatsa S. Bhat (VMware),
	Alexey Makhalov, VMware PV-Drivers Reviewers, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Song Liu, Nadav Amit, Arnd Bergmann,
	virtualization


[-- Attachment #1.1.1: Type: text/plain, Size: 731 bytes --]

On 28.04.23 11:50, Hou Wenlong wrote:
> Similar to the alternative patching, use relative reference for original
> instruction rather than absolute one, which saves 8 bytes for one entry
> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
> R_X86_64_64 relocation, which also reduces relocation metadata on
> relocatable builds. And the alignment could be hard coded to be 4 now.
> 
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>

Reviewed-by: Juergen Gross <jgross@suse.com>

I think this patch should be taken even without the series.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
@ 2023-06-01  9:29     ` Juergen Gross via Virtualization
  0 siblings, 0 replies; 80+ messages in thread
From: Juergen Gross via Virtualization @ 2023-06-01  9:29 UTC (permalink / raw)
  To: Hou Wenlong, linux-kernel
  Cc: x86, H. Peter Anvin, Kees Cook, Arnd Bergmann, Thomas Garnier,
	VMware PV-Drivers Reviewers, Dave Hansen, Lai Jiangshan,
	virtualization, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Alexey Makhalov, Nadav Amit, Thomas Gleixner, Song Liu


[-- Attachment #1.1.1.1: Type: text/plain, Size: 731 bytes --]

On 28.04.23 11:50, Hou Wenlong wrote:
> Similar to the alternative patching, use relative reference for original
> instruction rather than absolute one, which saves 8 bytes for one entry
> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
> R_X86_64_64 relocation, which also reduces relocation metadata on
> relocatable builds. And the alignment could be hard coded to be 4 now.
> 
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>

Reviewed-by: Juergen Gross <jgross@suse.com>

I think this patch should be taken even without the series.


Juergen

[-- Attachment #1.1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
  2023-06-01  9:29     ` Juergen Gross via Virtualization
@ 2023-06-05  6:40       ` Nadav Amit via Virtualization
  -1 siblings, 0 replies; 80+ messages in thread
From: Nadav Amit @ 2023-06-05  6:40 UTC (permalink / raw)
  To: Juergen Gross, Hou Wenlong
  Cc: kernel list, Thomas Garnier, Lai Jiangshan, Kees Cook, srivatsa,
	Alexey Makhalov, Pv-drivers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Peter Zijlstra, Song Liu, Arnd Bergmann, virtualization



> On Jun 1, 2023, at 2:29 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 28.04.23 11:50, Hou Wenlong wrote:
>> Similar to the alternative patching, use relative reference for original
>> instruction rather than absolute one, which saves 8 bytes for one entry
>> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
>> R_X86_64_64 relocation, which also reduces relocation metadata on
>> relocatable builds. And the alignment could be hard coded to be 4 now.
>> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
>> Cc: Thomas Garnier <thgarnie@chromium.org>
>> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>> Cc: Kees Cook <keescook@chromium.org>
> 
> Reviewed-by: Juergen Gross <jgross@suse.com>
> 
> I think this patch should be taken even without the series.

It looks good to me, I am just not sure why the alignment is needed
at all.

Why not make the struct __packed (like struct alt_instr) and get rid
of all the .align directives? Am I missing something?


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
@ 2023-06-05  6:40       ` Nadav Amit via Virtualization
  0 siblings, 0 replies; 80+ messages in thread
From: Nadav Amit via Virtualization @ 2023-06-05  6:40 UTC (permalink / raw)
  To: Juergen Gross, Hou Wenlong
  Cc: X86 ML, H. Peter Anvin, Kees Cook, Arnd Bergmann, Thomas Garnier,
	Pv-drivers, Dave Hansen, Lai Jiangshan, kernel list,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov, Alexey Makhalov,
	Thomas Gleixner, virtualization, Song Liu



> On Jun 1, 2023, at 2:29 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 28.04.23 11:50, Hou Wenlong wrote:
>> Similar to the alternative patching, use relative reference for original
>> instruction rather than absolute one, which saves 8 bytes for one entry
>> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
>> R_X86_64_64 relocation, which also reduces relocation metadata on
>> relocatable builds. And the alignment could be hard coded to be 4 now.
>> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
>> Cc: Thomas Garnier <thgarnie@chromium.org>
>> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>> Cc: Kees Cook <keescook@chromium.org>
> 
> Reviewed-by: Juergen Gross <jgross@suse.com>
> 
> I think this patch should be taken even without the series.

It looks good to me, I am just not sure why the alignment is needed
at all.

Why not make the struct __packed (like struct alt_instr) and get rid
of all the .align directives? Am I missing something?

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction
  2023-06-05  6:40       ` Nadav Amit via Virtualization
  (?)
@ 2023-06-06 11:35       ` Hou Wenlong
  -1 siblings, 0 replies; 80+ messages in thread
From: Hou Wenlong @ 2023-06-06 11:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Juergen Gross, kernel list, Thomas Garnier, Lai Jiangshan,
	Kees Cook, srivatsa, Alexey Makhalov, Pv-drivers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, Peter Zijlstra, Song Liu, Arnd Bergmann,
	virtualization

On Mon, Jun 05, 2023 at 02:40:54PM +0800, Nadav Amit wrote:
> 
> 
> > On Jun 1, 2023, at 2:29 AM, Juergen Gross <jgross@suse.com> wrote:
> > 
> > On 28.04.23 11:50, Hou Wenlong wrote:
> >> Similar to the alternative patching, use relative reference for original
> >> instruction rather than absolute one, which saves 8 bytes for one entry
> >> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
> >> R_X86_64_64 relocation, which also reduces relocation metadata on
> >> relocatable builds. And the alignment could be hard coded to be 4 now.
> >> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> >> Cc: Thomas Garnier <thgarnie@chromium.org>
> >> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >> Cc: Kees Cook <keescook@chromium.org>
> > 
> > Reviewed-by: Juergen Gross <jgross@suse.com>
> > 
> > I think this patch should be taken even without the series.
> 
> It looks good to me, I am just not sure why the alignment is needed
> at all.
> 
> Why not make the struct __packed (like struct alt_instr) and get rid
> of all the .align directives? Am I missing something?

Yes, making the struct __packed would save more space. If I understand
correctly, it could be done even without this patch, although it may
lead to misaligned memory accesses. However, that does not seem to
matter in practice, as I couldn't find any mention of such a problem in
the changelog that packed struct alt_instr. I can do the same here
if needed.
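
For reference, roughly what I have in mind is something like the
following (only a sketch; the field names are made up and may differ
from the actual patch):

struct paravirt_patch_site {
        s32 instr_offset;       /* relative reference to the original insn */
        u8 type;                /* type of this instruction */
        u8 len;                 /* length of original instruction */
} __packed;

/* The instruction address is recovered with offset_to_ptr(): */
static void *patch_site_instr(const struct paravirt_patch_site *p)
{
        return offset_to_ptr(&p->instr_offset);
}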

Thanks.

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2023-06-06 11:36 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-28  9:50 [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 01/43] x86/crypto: Adapt assembly for PIE support Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 02/43] x86: Add macro to get symbol address " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 03/43] x86: relocate_kernel - Adapt assembly " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 04/43] x86/entry/64: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 05/43] x86: pm-trace: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 06/43] x86/CPU: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 07/43] x86/acpi: " Hou Wenlong
2023-04-28 11:32   ` Rafael J. Wysocki
2023-04-28  9:50 ` [PATCH RFC 08/43] x86/boot/64: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 09/43] x86/power/64: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 10/43] x86/alternatives: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 11/43] x86/irq: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 12/43] x86,rethook: " Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction Hou Wenlong
2023-06-01  9:29   ` Juergen Gross
2023-06-01  9:29     ` Juergen Gross via Virtualization
2023-06-05  6:40     ` Nadav Amit
2023-06-05  6:40       ` Nadav Amit via Virtualization
2023-06-06 11:35       ` Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 14/43] x86/Kconfig: Introduce new Kconfig for PIE kernel building Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 15/43] x86/PVH: Use fixed_percpu_data to set up GS base Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 16/43] x86-64: Use per-cpu stack canary if supported by compiler Hou Wenlong
2023-05-01 17:27   ` Nick Desaulniers
2023-05-05  6:14     ` Hou Wenlong
2023-05-05 18:02       ` Nick Desaulniers
2023-05-05 19:06         ` Fangrui Song
2023-05-08  8:06         ` Hou Wenlong
2023-05-04 10:31   ` Juergen Gross
2023-05-05  3:09     ` Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 17/43] x86/pie: Enable stack protector only if per-cpu stack canary is supported Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 18/43] x86/percpu: Use PC-relative addressing for percpu variable references Hou Wenlong
2023-04-28  9:50 ` [PATCH RFC 19/43] x86/tools: Explicitly include autoconf.h for hostprogs Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 20/43] x86/percpu: Adapt percpu references relocation for PIE support Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 21/43] x86/ftrace: Adapt assembly " Hou Wenlong
2023-04-28 13:37   ` Steven Rostedt
2023-04-29  3:43     ` Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching " Hou Wenlong
2023-04-28 13:44   ` Steven Rostedt
2023-04-29  3:38     ` Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 23/43] x86/pie: Force hidden visibility for all symbol references Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 24/43] x86/boot/compressed: Adapt sed command to generate voffset.h when PIE is enabled Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 25/43] x86/mm: Make the x86 GOT read-only Hou Wenlong
2023-04-30 14:23   ` Ard Biesheuvel
2023-05-08 11:40     ` Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 26/43] x86/pie: Add .data.rel.* sections into link script Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 27/43] x86/relocs: Handle PIE relocations Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 28/43] KVM: x86: Adapt assembly for PIE support Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 29/43] x86/PVH: Adapt PVH booting " Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 30/43] x86/bpf: Adapt BPF_CALL JIT codegen " Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 31/43] x86/modules: Adapt module loading " Hou Wenlong
2023-04-28 19:29   ` Ard Biesheuvel
2023-05-08  8:32     ` Hou Wenlong
2023-05-08  9:16       ` Ard Biesheuvel
2023-05-08 11:40         ` Hou Wenlong
2023-05-08 17:47           ` Ard Biesheuvel
2023-05-09  9:42             ` Hou Wenlong
2023-05-09  9:52               ` Ard Biesheuvel
2023-05-09 12:35                 ` Hou Wenlong
2023-05-10  7:09         ` Hou Wenlong
2023-05-10  8:15           ` Ard Biesheuvel
2023-04-28  9:51 ` [PATCH RFC 32/43] x86/boot/64: Use data relocation to get absloute address when PIE is enabled Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 33/43] objtool: Add validation for x86 PIE support Hou Wenlong
2023-04-28 10:28   ` Christophe Leroy
2023-04-28 11:43     ` Peter Zijlstra
2023-04-29  4:04       ` Hou Wenlong
2023-04-29  3:52     ` Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for " Hou Wenlong
2023-04-28 15:18   ` Peter Zijlstra
2023-04-28  9:51 ` [PATCH RFC 35/43] x86/pie: Build the kernel as PIE Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 36/43] x86/vsyscall: Don't use set_fixmap() to map vsyscall page Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 37/43] x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap area Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 38/43] x86/fixmap: Move vsyscall page " Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 39/43] x86/fixmap: Unify FIXADDR_TOP Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 40/43] x86/boot: Fill kernel image puds dynamically Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 41/43] x86/mm: Sort address_markers array when X86 PIE is enabled Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 42/43] x86/pie: Allow kernel image to be relocated in top 512G Hou Wenlong
2023-04-28  9:51 ` [PATCH RFC 43/43] x86/boot: Extend relocate range for PIE kernel image Hou Wenlong
2023-04-28 15:22 ` [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible Peter Zijlstra
2023-05-06  7:19   ` Hou Wenlong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.