* [PATCH v5 0/3] arm64: hibernate: Support DEBUG_PAGEALLOC
@ 2016-08-22 17:35 James Morse
  2016-08-22 17:35 ` [PATCH v5 1/3] arm64: Create sections.h James Morse
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: James Morse @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

These patches improve arm64's hibernate support by fixing the hibernate
problem with DEBUG_PAGEALLOC reported by Will.
This also addresses the outstanding comment from Catalin [0] regarding
cleaning of the whole kernel to the PoC. Now we clean three new sections:
.mmuoff.{data.read,data.write,text}. The mmuoff data is split into two sections
depending on whether it is read or written with the MMU off, as the alignment
requirements differ. The two mmuoff.data sections share their start/end
markers, as the difference should only matter to the linker.

This series is based on v4.8-rc3, and can be retrieved from:
git://linux-arm.org/linux-jm.git -b hibernate/debug_pagealloc/v5

Changes since v4:
 * Moved alignment rules out of assembly into the linker script.
 * Split mmuoff.data into two sections: one written with the MMU off, the other
   read with the MMU off. CWG alignment only applies to mmuoff.data.write.
 * Fixed PE/COFF alignment breakage.

Changes since v3:
 * Pad mmuoff.data section to CWG.
 * Specified the .mmuoff.data section for secondary_holding_pen_release in C
 * Added irqentry_text start/end to sections.h
 * Updated patch 1 for kprobes idmap/hyp_idmap declarations.

Changes since v2:
 * Added a mmuoff.data section for secondary_holding_pen_release etc.

Changes since v1:
 * Added .mmuoff.text section and gathered functions together.
 * Put sections.h in alphabetical order.


[v1] http://www.spinics.net/lists/arm-kernel/msg507805.html
[v2] http://permalink.gmane.org/gmane.linux.power-management.general/77467
[v3] https://www.spinics.net/lists/arm-kernel/msg514644.html
[v4] https://www.spinics.net/lists/arm-kernel/msg477769.html

[0] http://www.spinics.net/lists/arm-kernel/msg499305.html

James Morse (3):
  arm64: Create sections.h
  arm64: vmlinux.ld: Add mmuoff text and data sections
  arm64: hibernate: Support DEBUG_PAGEALLOC

 arch/arm64/Kconfig                 |  1 -
 arch/arm64/include/asm/Kbuild      |  1 -
 arch/arm64/include/asm/sections.h  | 31 +++++++++++++++++++++++
 arch/arm64/include/asm/traps.h     |  6 +----
 arch/arm64/include/asm/virt.h      |  9 +------
 arch/arm64/kernel/alternative.c    |  7 +++---
 arch/arm64/kernel/head.S           | 24 ++++++++++++------
 arch/arm64/kernel/hibernate.c      | 51 +++++++++++++++++++++++++++-----------
 arch/arm64/kernel/probes/kprobes.c |  5 +---
 arch/arm64/kernel/sleep.S          |  2 ++
 arch/arm64/kernel/smp_spin_table.c |  3 ++-
 arch/arm64/kernel/vmlinux.lds.S    | 21 ++++++++++++++++
 arch/arm64/mm/pageattr.c           | 40 +++++++++++++++++++++++++++++-
 arch/arm64/mm/proc.S               |  4 +++
 14 files changed, 157 insertions(+), 48 deletions(-)
 create mode 100644 arch/arm64/include/asm/sections.h

-- 
2.8.0.rc3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 1/3] arm64: Create sections.h
  2016-08-22 17:35 [PATCH v5 0/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
@ 2016-08-22 17:35 ` James Morse
  2016-08-22 17:35 ` [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections James Morse
  2016-08-22 17:35 ` [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
  2 siblings, 0 replies; 11+ messages in thread
From: James Morse @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

Each time new section markers are added, kernel/vmlinux.lds.S is updated,
and new extern char __start_foo[] definitions are scattered through the
tree.

Create arch/arm64's asm/sections.h to collect these definitions (and include
the existing asm-generic version).
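
For illustration, a hypothetical consumer of the new header (the helper below
is made up; the section markers themselves are defined by the arm64 linker
script):

	#include <asm/sections.h>

	/* example only: does addr fall inside the hyp text? */
	static bool addr_in_hyp_text(unsigned long addr)
	{
		return addr >= (unsigned long)__hyp_text_start &&
		       addr < (unsigned long)__hyp_text_end;
	}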

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
---
 arch/arm64/include/asm/Kbuild      |  1 -
 arch/arm64/include/asm/sections.h  | 29 +++++++++++++++++++++++++++++
 arch/arm64/include/asm/traps.h     |  6 +-----
 arch/arm64/include/asm/virt.h      |  9 +--------
 arch/arm64/kernel/alternative.c    |  7 +++----
 arch/arm64/kernel/hibernate.c      |  6 ------
 arch/arm64/kernel/probes/kprobes.c |  5 +----
 7 files changed, 35 insertions(+), 28 deletions(-)
 create mode 100644 arch/arm64/include/asm/sections.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index f43d2c44c765..2b3d2d24acba 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -32,7 +32,6 @@ generic-y += poll.h
 generic-y += preempt.h
 generic-y += resource.h
 generic-y += rwsem.h
-generic-y += sections.h
 generic-y += segment.h
 generic-y += sembuf.h
 generic-y += serial.h
diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
new file mode 100644
index 000000000000..237fcdd13445
--- /dev/null
+++ b/arch/arm64/include/asm/sections.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (C) 2016 ARM Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_SECTIONS_H
+#define __ASM_SECTIONS_H
+
+#include <asm-generic/sections.h>
+
+extern char __alt_instructions[], __alt_instructions_end[];
+extern char __exception_text_start[], __exception_text_end[];
+extern char __hibernate_exit_text_start[], __hibernate_exit_text_end[];
+extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
+extern char __hyp_text_start[], __hyp_text_end[];
+extern char __idmap_text_start[], __idmap_text_end[];
+extern char __irqentry_text_start[], __irqentry_text_end[];
+
+#endif /* __ASM_SECTIONS_H */
diff --git a/arch/arm64/include/asm/traps.h b/arch/arm64/include/asm/traps.h
index 9cd03f3e812f..02e9035b0685 100644
--- a/arch/arm64/include/asm/traps.h
+++ b/arch/arm64/include/asm/traps.h
@@ -19,6 +19,7 @@
 #define __ASM_TRAP_H
 
 #include <linux/list.h>
+#include <asm/sections.h>
 
 struct pt_regs;
 
@@ -39,9 +40,6 @@ void arm64_notify_segfault(struct pt_regs *regs, unsigned long addr);
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 static inline int __in_irqentry_text(unsigned long ptr)
 {
-	extern char __irqentry_text_start[];
-	extern char __irqentry_text_end[];
-
 	return ptr >= (unsigned long)&__irqentry_text_start &&
 	       ptr < (unsigned long)&__irqentry_text_end;
 }
@@ -54,8 +52,6 @@ static inline int __in_irqentry_text(unsigned long ptr)
 
 static inline int in_exception_text(unsigned long ptr)
 {
-	extern char __exception_text_start[];
-	extern char __exception_text_end[];
 	int in;
 
 	in = ptr >= (unsigned long)&__exception_text_start &&
diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 1788545f25bc..db5739413677 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -45,6 +45,7 @@
 #ifndef __ASSEMBLY__
 
 #include <asm/ptrace.h>
+#include <asm/sections.h>
 
 /*
  * __boot_cpu_mode records what mode CPUs were booted in.
@@ -87,14 +88,6 @@ extern void verify_cpu_run_el(void);
 static inline void verify_cpu_run_el(void) {}
 #endif
 
-/* The section containing the hypervisor idmap text */
-extern char __hyp_idmap_text_start[];
-extern char __hyp_idmap_text_end[];
-
-/* The section containing the hypervisor text */
-extern char __hyp_text_start[];
-extern char __hyp_text_end[];
-
 #endif /* __ASSEMBLY__ */
 
 #endif /* ! __ASM__VIRT_H */
diff --git a/arch/arm64/kernel/alternative.c b/arch/arm64/kernel/alternative.c
index d2ee1b21a10d..4434dabde898 100644
--- a/arch/arm64/kernel/alternative.c
+++ b/arch/arm64/kernel/alternative.c
@@ -25,14 +25,13 @@
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
 #include <asm/insn.h>
+#include <asm/sections.h>
 #include <linux/stop_machine.h>
 
 #define __ALT_PTR(a,f)		(u32 *)((void *)&(a)->f + (a)->f)
 #define ALT_ORIG_PTR(a)		__ALT_PTR(a, orig_offset)
 #define ALT_REPL_PTR(a)		__ALT_PTR(a, alt_offset)
 
-extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
-
 struct alt_region {
 	struct alt_instr *begin;
 	struct alt_instr *end;
@@ -124,8 +123,8 @@ static int __apply_alternatives_multi_stop(void *unused)
 {
 	static int patched = 0;
 	struct alt_region region = {
-		.begin	= __alt_instructions,
-		.end	= __alt_instructions_end,
+		.begin	= (struct alt_instr *)__alt_instructions,
+		.end	= (struct alt_instr *)__alt_instructions_end,
 	};
 
 	/* We always have a CPU 0 at this point (__init) */
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 65d81f965e74..b4082017c4cb 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -54,12 +54,6 @@ extern int in_suspend;
 /* Do we need to reset el2? */
 #define el2_reset_needed() (is_hyp_mode_available() && !is_kernel_in_hyp_mode())
 
-/*
- * Start/end of the hibernate exit code, this must be copied to a 'safe'
- * location in memory, and executed from there.
- */
-extern char __hibernate_exit_text_start[], __hibernate_exit_text_end[];
-
 /* temporary el2 vectors in the __hibernate_exit_text section. */
 extern char hibernate_el2_vectors[];
 
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index c6b0f40620d8..2dc1b42afaaa 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -31,7 +31,7 @@
 #include <asm/insn.h>
 #include <asm/uaccess.h>
 #include <asm/irq.h>
-#include <asm-generic/sections.h>
+#include <asm/sections.h>
 
 #include "decode-insn.h"
 
@@ -543,9 +543,6 @@ int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	extern char __idmap_text_start[], __idmap_text_end[];
-	extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
-
 	if ((addr >= (unsigned long)__kprobes_text_start &&
 	    addr < (unsigned long)__kprobes_text_end) ||
 	    (addr >= (unsigned long)__entry_text_start &&
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections
  2016-08-22 17:35 [PATCH v5 0/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
  2016-08-22 17:35 ` [PATCH v5 1/3] arm64: Create sections.h James Morse
@ 2016-08-22 17:35 ` James Morse
  2016-08-22 17:56   ` Catalin Marinas
  2016-08-23 13:11   ` Ard Biesheuvel
  2016-08-22 17:35 ` [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
  2 siblings, 2 replies; 11+ messages in thread
From: James Morse @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

Resume from hibernate needs to clean any text executed by the kernel with
the MMU off to the PoC. Collect these functions together into a new
.mmuoff.text section.

Data is more complicated, secondary_holding_pen_release is written with
the MMU on, cleaned and invalidated, then read with the MMU off. In contrast
__boot_cpu_mode is written with the MMU off, the corresponding cache line
is invalidated, so when we read it with the MMU on we don't get stale data.
These cache maintenance operations conflict with each other if the values
are within a Cache Writeback Granule (CWG) of each other.
Collect the data into two sections .mmuoff.data.read and .mmuoff.data.write,
the linker script ensures the .mmuoff.data.write section is aligned to the
architectural maximum CWG of 2KB.
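
As a rough C sketch of the conflict (illustration only; the real writers are
write_pen_release() in smp_spin_table.c and the MMU-off assembly in head.S):

	#include <asm/cacheflush.h>

	extern volatile unsigned long secondary_holding_pen_release;

	static void publish_pen_release(unsigned long hwid)
	{
		/*
		 * Written with the MMU on, then cleaned+invalidated to PoC
		 * so the CPU sitting in the pen can see it with its MMU off.
		 */
		secondary_holding_pen_release = hwid;
		__flush_dcache_area((void *)&secondary_holding_pen_release,
				    sizeof(secondary_holding_pen_release));
	}

	/*
	 * __boot_cpu_mode, by contrast, is written from head.S with the MMU
	 * off, and the writer invalidates the line ('dc ivac') so MMU-on
	 * readers don't see stale data. That invalidate can discard up to a
	 * CWG (2KB maximum) around the variable, so if the two variables
	 * shared a CWG the clean above could be undone.
	 */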

Signed-off-by: James Morse <james.morse@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
---
Changes since v4:
 * Moved alignment rules out of assembly into the linker script.
 * Split mmuoff.data into two sections one written with the mmu off, the other
   read with the mmu off. CWG alignment only applies to mmuoff.data.write
 * Fixed PE/COFF alignment breakage.

 arch/arm64/include/asm/sections.h  |  2 ++
 arch/arm64/kernel/head.S           | 24 ++++++++++++++++--------
 arch/arm64/kernel/sleep.S          |  2 ++
 arch/arm64/kernel/smp_spin_table.c |  3 ++-
 arch/arm64/kernel/vmlinux.lds.S    | 21 +++++++++++++++++++++
 arch/arm64/mm/proc.S               |  4 ++++
 6 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
index 237fcdd13445..fb824a71fbb2 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -25,5 +25,7 @@ extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
 extern char __hyp_text_start[], __hyp_text_end[];
 extern char __idmap_text_start[], __idmap_text_end[];
 extern char __irqentry_text_start[], __irqentry_text_end[];
+extern char __mmuoff_data_start[], __mmuoff_data_end[];
+extern char __mmuoff_text_start[], __mmuoff_text_end[];
 
 #endif /* __ASM_SECTIONS_H */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index b77f58355da1..9bee5394d76b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -477,6 +477,7 @@ ENTRY(kimage_vaddr)
  * Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in x20 if
  * booted in EL1 or EL2 respectively.
  */
+	.pushsection ".mmuoff.text", "ax"
 ENTRY(el2_setup)
 	mrs	x0, CurrentEL
 	cmp	x0, #CurrentEL_EL2
@@ -621,17 +622,29 @@ set_cpu_boot_mode_flag:
 ENDPROC(set_cpu_boot_mode_flag)
 
 /*
+ * These values are written with the MMU off, but read with the MMU on.
+ * Writers will invalidate the corresponding address, discarding up to a
+ * 'Cache Writeback Granule' (CWG) worth of data. The linker script ensures
+ * sufficient alignment that the CWG doesn't overlap another section.
+ */
+	.pushsection ".mmuoff.data.write", "aw"
+/*
  * We need to find out the CPU boot mode long after boot, so we need to
  * store it in a writable variable.
  *
  * This is not in .bss, because we set it sufficiently early that the boot-time
  * zeroing of .bss would clobber it.
  */
-	.pushsection	.data..cacheline_aligned
-	.align	L1_CACHE_SHIFT
 ENTRY(__boot_cpu_mode)
 	.long	BOOT_CPU_MODE_EL2
 	.long	BOOT_CPU_MODE_EL1
+/*
+ * The booting CPU updates the failed status @__early_cpu_boot_status,
+ * with MMU turned off.
+ */
+ENTRY(__early_cpu_boot_status)
+	.long 	0
+
 	.popsection
 
 	/*
@@ -687,6 +700,7 @@ __secondary_switched:
 	mov	x29, #0
 	b	secondary_start_kernel
 ENDPROC(__secondary_switched)
+	.popsection
 
 /*
  * The booting CPU updates the failed status @__early_cpu_boot_status,
@@ -706,12 +720,6 @@ ENDPROC(__secondary_switched)
 	dc	ivac, \tmp1			// Invalidate potentially stale cache line
 	.endm
 
-	.pushsection	.data..cacheline_aligned
-	.align	L1_CACHE_SHIFT
-ENTRY(__early_cpu_boot_status)
-	.long 	0
-	.popsection
-
 /*
  * Enable the MMU.
  *
diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S
index ccf79d849e0a..8c0a6957c7e8 100644
--- a/arch/arm64/kernel/sleep.S
+++ b/arch/arm64/kernel/sleep.S
@@ -97,6 +97,7 @@ ENTRY(__cpu_suspend_enter)
 ENDPROC(__cpu_suspend_enter)
 	.ltorg
 
+	.pushsection ".mmuoff.text", "ax"
 ENTRY(cpu_resume)
 	bl	el2_setup		// if in EL2 drop to EL1 cleanly
 	/* enable the MMU early - so we can access sleep_save_stash by va */
@@ -106,6 +107,7 @@ ENTRY(cpu_resume)
 	adrp	x26, swapper_pg_dir
 	b	__cpu_setup
 ENDPROC(cpu_resume)
+	.popsection
 
 	.pushsection	".idmap.text", "ax"
 _resume_switched:
diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
index 18a71bcd26ee..9a00eee9acc8 100644
--- a/arch/arm64/kernel/smp_spin_table.c
+++ b/arch/arm64/kernel/smp_spin_table.c
@@ -29,7 +29,8 @@
 #include <asm/smp_plat.h>
 
 extern void secondary_holding_pen(void);
-volatile unsigned long secondary_holding_pen_release = INVALID_HWID;
+volatile unsigned long __section(".mmuoff.data.read")
+secondary_holding_pen_release = INVALID_HWID;
 
 static phys_addr_t cpu_release_addr[NR_CPUS];
 
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 659963d40bb4..97c3f36ce30b 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -120,6 +120,9 @@ SECTIONS
 			IRQENTRY_TEXT
 			SOFTIRQENTRY_TEXT
 			ENTRY_TEXT
+			__mmuoff_text_start = .;
+			*(.mmuoff.text)
+			__mmuoff_text_end = .;
 			TEXT_TEXT
 			SCHED_TEXT
 			LOCK_TEXT
@@ -185,6 +188,24 @@ SECTIONS
 	_data = .;
 	_sdata = .;
 	RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+
+	/*
+	 * Data written with the MMU off but read with the MMU on requires
+	 * cache lines to be invalidated, discarding up to a Cache Writeback
+	 * Granule (CWG) of data from the cache. Keep the section that
+	 * requires this type of maintenance to be in its own Cache Writeback
+	 * Granule (CWG) area so the cache maintenance operations don't
+	 * don't interfere with adjacent data.
+	 */
+	.mmuoff.data.write : ALIGN(SZ_2K) {
+		__mmuoff_data_start = .;
+		*(.mmuoff.data.write)
+	}
+	.mmuoff.data.read : ALIGN(SZ_2K) {
+		*(.mmuoff.data.read)
+		__mmuoff_data_end = .;
+	}
+
 	PECOFF_EDATA_PADDING
 	_edata = .;
 
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 5bb61de23201..a709e95d68ff 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -83,6 +83,7 @@ ENDPROC(cpu_do_suspend)
  *
  * x0: Address of context pointer
  */
+	.pushsection ".mmuoff.text", "ax"
 ENTRY(cpu_do_resume)
 	ldp	x2, x3, [x0]
 	ldp	x4, x5, [x0, #16]
@@ -111,6 +112,7 @@ ENTRY(cpu_do_resume)
 	isb
 	ret
 ENDPROC(cpu_do_resume)
+	.popsection
 #endif
 
 /*
@@ -172,6 +174,7 @@ ENDPROC(idmap_cpu_replace_ttbr1)
  *	Initialise the processor for turning the MMU on.  Return in x0 the
  *	value of the SCTLR_EL1 register.
  */
+	.pushsection ".mmuoff.text", "ax"
 ENTRY(__cpu_setup)
 	tlbi	vmalle1				// Invalidate local TLB
 	dsb	nsh
@@ -257,3 +260,4 @@ ENDPROC(__cpu_setup)
 crval:
 	.word	0xfcffffff			// clear
 	.word	0x34d5d91d			// set
+	.popsection
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC
  2016-08-22 17:35 [PATCH v5 0/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
  2016-08-22 17:35 ` [PATCH v5 1/3] arm64: Create sections.h James Morse
  2016-08-22 17:35 ` [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections James Morse
@ 2016-08-22 17:35 ` James Morse
  2016-08-22 18:51   ` Catalin Marinas
  2 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

DEBUG_PAGEALLOC removes the valid bit of page table entries to prevent
any access to unallocated memory. Hibernate uses this as a hint that those
pages don't need to be saved/restored. This patch adds the
kernel_page_present() function it uses.

hibernate.c copies the resume kernel's linear map for use during restore.
Add _copy_pte() to fill in the holes made by DEBUG_PAGEALLOC in the resume
kernel, so we can restore data the original kernel had at these addresses.

Finally, DEBUG_PAGEALLOC means the linear-map alias of KERNEL_START to
KERNEL_END may have holes in it, so we can't lazily clean this whole
area to the PoC. Only clean the new mmuoff regions, and the kernel/kvm
idmaps.
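
For context, the hibernate core consumes kernel_page_present() roughly as
follows (paraphrased and simplified from kernel/power/snapshot.c; not part of
this patch):

	static void safe_copy_page(void *dst, struct page *s_page)
	{
		if (kernel_page_present(s_page)) {
			do_copy_page(dst, page_address(s_page));
		} else {
			/*
			 * The page was unmapped by debug_pagealloc: map it
			 * back in, copy it, then unmap it again.
			 */
			kernel_map_pages(s_page, 1, 1);
			do_copy_page(dst, page_address(s_page));
			kernel_map_pages(s_page, 1, 0);
		}
	}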

This reverts commit da24eb1f3f9e2c7b75c5f8c40d8e48e2c4789596.

Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/arm64/Kconfig            |  1 -
 arch/arm64/kernel/hibernate.c | 45 ++++++++++++++++++++++++++++++++++---------
 arch/arm64/mm/pageattr.c      | 40 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index bc3f00f586f1..9be0c164df4e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -607,7 +607,6 @@ source kernel/Kconfig.preempt
 source kernel/Kconfig.hz
 
 config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	depends on !HIBERNATION
 	def_bool y
 
 config ARCH_HAS_HOLES_MEMORYMODEL
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index b4082017c4cb..da4470de1807 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -235,6 +235,7 @@ out:
 	return rc;
 }
 
+#define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
 
 int swsusp_arch_suspend(void)
 {
@@ -252,8 +253,14 @@ int swsusp_arch_suspend(void)
 	if (__cpu_suspend_enter(&state)) {
 		ret = swsusp_save();
 	} else {
-		/* Clean kernel to PoC for secondary core startup */
-		__flush_dcache_area(LMADDR(KERNEL_START), KERNEL_END - KERNEL_START);
+		/* Clean kernel core startup/idle code to PoC*/
+		dcache_clean_range(__mmuoff_text_start, __mmuoff_text_end);
+		dcache_clean_range(__mmuoff_data_start, __mmuoff_data_end);
+		dcache_clean_range(__idmap_text_start, __idmap_text_end);
+
+		/* Clean kvm setup code to PoC? */
+		if (el2_reset_needed())
+			dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);
 
 		/*
 		 * Tell the hibernation core that we've just restored
@@ -269,6 +276,32 @@ int swsusp_arch_suspend(void)
 	return ret;
 }
 
+static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
+{
+	unsigned long pfn = virt_to_pfn(addr);
+
+	if (pte_valid(*src_pte)) {
+		/*
+		 * Resume will overwrite areas that may be marked
+		 * read only (code, rodata). Clear the RDONLY bit from
+		 * the temporary mappings we use during restore.
+		 */
+		set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
+	} else if (debug_pagealloc_enabled()) {
+		/*
+		 * debug_pagealloc may have removed the PTE_VALID bit if
+		 * the page isn't in use by the resume kernel. It may have
+		 * been in use by the original kernel, in which case we need
+		 * to put it back in our copy to do the restore.
+		 *
+		 * Check for mappable memory that gives us a translation
+		 * like part of the linear map.
+		 */
+		if (pfn_valid(pfn) && pte_pfn(*src_pte) == pfn)
+			set_pte(dst_pte, __pte((pte_val(*src_pte) & ~PTE_RDONLY) | PTE_VALID));
+	}
+}
+
 static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
 		    unsigned long end)
 {
@@ -284,13 +317,7 @@ static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
 
 	src_pte = pte_offset_kernel(src_pmd, start);
 	do {
-		if (!pte_none(*src_pte))
-			/*
-			 * Resume will overwrite areas that may be marked
-			 * read only (code, rodata). Clear the RDONLY bit from
-			 * the temporary mappings we use during restore.
-			 */
-			set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
+		_copy_pte(dst_pte, src_pte, addr);
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index ca6d268e3313..b6c0da84258c 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -139,4 +139,42 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 					__pgprot(0),
 					__pgprot(PTE_VALID));
 }
-#endif
+#ifdef CONFIG_HIBERNATION
+/*
+ * When built with CONFIG_DEBUG_PAGEALLOC and CONFIG_HIBERNATION, this function
+ * is used to determine if a linear map page has been marked as not-present by
+ * CONFIG_DEBUG_PAGEALLOC. Walk the page table and check the PTE_VALID bit.
+ * This is based on kern_addr_valid(), which almost does what we need.
+ */
+bool kernel_page_present(struct page *page)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned long addr = (unsigned long)page_address(page);
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return false;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return false;
+	if (pud_sect(*pud))
+		return true;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return false;
+	if (pmd_sect(*pmd))
+		return true;
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte))
+		return false;
+
+	return pte_valid(*pte);
+}
+#endif /* CONFIG_HIBERNATION */
+#endif /* CONFIG_DEBUG_PAGEALLOC */
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections
  2016-08-22 17:35 ` [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections James Morse
@ 2016-08-22 17:56   ` Catalin Marinas
  2016-08-23 13:11   ` Ard Biesheuvel
  1 sibling, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2016-08-22 17:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Aug 22, 2016 at 06:35:18PM +0100, James Morse wrote:
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 659963d40bb4..97c3f36ce30b 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -120,6 +120,9 @@ SECTIONS
>  			IRQENTRY_TEXT
>  			SOFTIRQENTRY_TEXT
>  			ENTRY_TEXT
> +			__mmuoff_text_start = .;
> +			*(.mmuoff.text)
> +			__mmuoff_text_end = .;
>  			TEXT_TEXT
>  			SCHED_TEXT
>  			LOCK_TEXT
> @@ -185,6 +188,24 @@ SECTIONS
>  	_data = .;
>  	_sdata = .;
>  	RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
> +
> +	/*
> +	 * Data written with the MMU off but read with the MMU on requires
> +	 * cache lines to be invalidated, discarding up to a Cache Writeback
> +	 * Granule (CWG) of data from the cache. Keep the section that
> +	 * requires this type of maintenance to be in its own Cache Writeback
> +	 * Granule (CWG) area so the cache maintenance operations don't
> +	 * don't interfere with adjacent data.
> +	 */

Nitpick: you have "don't" twice in the comment above.

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC
  2016-08-22 17:35 ` [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
@ 2016-08-22 18:51   ` Catalin Marinas
  2016-08-23 13:33     ` James Morse
  0 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2016-08-22 18:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Aug 22, 2016 at 06:35:19PM +0100, James Morse wrote:
> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index b4082017c4cb..da4470de1807 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -235,6 +235,7 @@ out:
>  	return rc;
>  }
>  
> +#define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
>  
>  int swsusp_arch_suspend(void)
>  {
> @@ -252,8 +253,14 @@ int swsusp_arch_suspend(void)
>  	if (__cpu_suspend_enter(&state)) {
>  		ret = swsusp_save();
>  	} else {
> -		/* Clean kernel to PoC for secondary core startup */
> -		__flush_dcache_area(LMADDR(KERNEL_START), KERNEL_END - KERNEL_START);
> +		/* Clean kernel core startup/idle code to PoC*/
> +		dcache_clean_range(__mmuoff_text_start, __mmuoff_text_end);
> +		dcache_clean_range(__mmuoff_data_start, __mmuoff_data_end);
> +		dcache_clean_range(__idmap_text_start, __idmap_text_end);
> +
> +		/* Clean kvm setup code to PoC? */
> +		if (el2_reset_needed())
> +			dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);
>  
>  		/*
>  		 * Tell the hibernation core that we've just restored
> @@ -269,6 +276,32 @@ int swsusp_arch_suspend(void)
>  	return ret;
>  }
>  
> +static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
> +{
> +	unsigned long pfn = virt_to_pfn(addr);

I assume this is only called on the kernel linear mapping (according to
the copy_page_tables() use in swsusp_arch_resume). Otherwise
virt_to_pfn() would not work.

Something I missed in the original hibernation support but it may look
better if you have something like:

	pte_t pte = *src_pte;

> +
> +	if (pte_valid(*src_pte)) {
> +		/*
> +		 * Resume will overwrite areas that may be marked
> +		 * read only (code, rodata). Clear the RDONLY bit from
> +		 * the temporary mappings we use during restore.
> +		 */
> +		set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));

and here:

		set_pte(dst_pte, pte_mkwrite(pte));

> +	} else if (debug_pagealloc_enabled()) {
> +		/*
> +		 * debug_pagealloc may have removed the PTE_VALID bit if
> +		 * the page isn't in use by the resume kernel. It may have
> +		 * been in use by the original kernel, in which case we need
> +		 * to put it back in our copy to do the restore.
> +		 *
> +		 * Check for mappable memory that gives us a translation
> +		 * like part of the linear map.
> +		 */
> +		if (pfn_valid(pfn) && pte_pfn(*src_pte) == pfn)

Is there a case where this condition is false?

> +			set_pte(dst_pte, __pte((pte_val(*src_pte) & ~PTE_RDONLY) | PTE_VALID));

With some more macros:

			set_pte(dst_pte, pte_mkwrite(pte_mkpresent(pte)))

(pte_mkpresent() needs to be added)
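
A minimal sketch of such a helper, in the style of the existing pte_mk*()
accessors and assuming the set_pte_bit() helper in
arch/arm64/include/asm/pgtable.h:

	static inline pte_t pte_mkpresent(pte_t pte)
	{
		return set_pte_bit(pte, __pgprot(PTE_VALID));
	}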

> +	}
> +}
> +
>  static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
>  		    unsigned long end)
>  {
> @@ -284,13 +317,7 @@ static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
>  
>  	src_pte = pte_offset_kernel(src_pmd, start);
>  	do {
> -		if (!pte_none(*src_pte))

You seem to no longer check for pte_none(). Is this not needed or
covered by the pte_pfn() != pfn check above?

> -			/*
> -			 * Resume will overwrite areas that may be marked
> -			 * read only (code, rodata). Clear the RDONLY bit from
> -			 * the temporary mappings we use during restore.
> -			 */
> -			set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
> +		_copy_pte(dst_pte, src_pte, addr);
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>  
>  	return 0;
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index ca6d268e3313..b6c0da84258c 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -139,4 +139,42 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
>  					__pgprot(0),
>  					__pgprot(PTE_VALID));
>  }
> -#endif
> +#ifdef CONFIG_HIBERNATION
> +/*
> + * When built with CONFIG_DEBUG_PAGEALLOC and CONFIG_HIBERNATION, this function
> + * is used to determine if a linear map page has been marked as not-present by
> + * CONFIG_DEBUG_PAGEALLOC. Walk the page table and check the PTE_VALID bit.
> + * This is based on kern_addr_valid(), which almost does what we need.
> + */
> +bool kernel_page_present(struct page *page)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	unsigned long addr = (unsigned long)page_address(page);
> +
> +	pgd = pgd_offset_k(addr);
> +	if (pgd_none(*pgd))
> +		return false;
> +
> +	pud = pud_offset(pgd, addr);
> +	if (pud_none(*pud))
> +		return false;
> +	if (pud_sect(*pud))
> +		return true;

This wouldn't normally guarantee "present" but I don't think we ever
have a non-present section mapping for the kernel (we do for user
though). You may want to add a comment.

> +
> +	pmd = pmd_offset(pud, addr);
> +	if (pmd_none(*pmd))
> +		return false;
> +	if (pmd_sect(*pmd))
> +		return true;
> +
> +	pte = pte_offset_kernel(pmd, addr);
> +	if (pte_none(*pte))
> +		return false;
> +
> +	return pte_valid(*pte);

You can return pte_valid() directly without the pte_none() check since
pte_none() implies !pte_valid().

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections
  2016-08-22 17:35 ` [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections James Morse
  2016-08-22 17:56   ` Catalin Marinas
@ 2016-08-23 13:11   ` Ard Biesheuvel
  2016-08-24 15:49     ` James Morse
  1 sibling, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2016-08-23 13:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 22 August 2016 at 19:35, James Morse <james.morse@arm.com> wrote:
> Resume from hibernate needs to clean any text executed by the kernel with
> the MMU off to the PoC. Collect these functions together into a new
> .mmuoff.text section.
>

Wouldn't it be much simpler to move all of this into the .idmap.text
section? It is not that much code, and they intersect anyway (i.e.,
some .idmap.text code is only called with the MMU off)

I am sitting on some related/conflicting patches that clean up head.S
and sleep.S, and simplify the interactions between the MMU off code,
__enable_mmu() and the idmap-to-KVA trampolines that we needed to add
for relocatable kernels. For example, the resulting cpu_resume() looks
like this

    .pushsection ".idmap.text", "ax"
ENTRY(cpu_resume)
    bl    el2_setup                // if in EL2 drop to EL1 cleanly
    bl    __cpu_setup
    /* enable the MMU early - so we can access sleep_save_stash by va */
    bl    __enable_mmu
    ldr   x8, =_cpu_resume
    br    x8
ENDPROC(cpu_resume)
    .ltorg
    .popsection

with the separate trampoline removed. Likewise for primary and
secondary boot paths.

I was on the fence about whether to send out these patches, and even
more so now that I realize they will conflict with yours (unless you
merge the .mmuoff.text and .idmap.text sections :-))
For reference, branch is here
https://git.linaro.org/people/ard.biesheuvel/linux-arm.git/shortlog/refs/heads/arm64-headS-cleanup

Some other comments below

> Data is more complicated, secondary_holding_pen_release is written with
> the MMU on, cleaned and invalidated, then read with the MMU off. In contrast
> __boot_cpu_mode is written with the MMU off, the corresponding cache line
> is invalidated, so when we read it with the MMU on we don't get stale data.
> These cache maintenance operations conflict with each other if the values
> are within a Cache Writeback Granule (CWG) of each other.
> Collect the data into two sections .mmuoff.data.read and .mmuoff.data.write,
> the linker script ensures the .mmuoff.data.write section is aligned to the
> architectural maximum CWG of 2KB.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> ---
> Changes since v4:
>  * Moved alignment rules out of assembly into the linker script.
>  * Split mmuoff.data into two sections one written with the mmu off, the other
>    read with the mmu off. CWG alignment only applies to mmuoff.data.write
>  * Fixed PE/COFF alignment breakage.
>
>  arch/arm64/include/asm/sections.h  |  2 ++
>  arch/arm64/kernel/head.S           | 24 ++++++++++++++++--------
>  arch/arm64/kernel/sleep.S          |  2 ++
>  arch/arm64/kernel/smp_spin_table.c |  3 ++-
>  arch/arm64/kernel/vmlinux.lds.S    | 21 +++++++++++++++++++++
>  arch/arm64/mm/proc.S               |  4 ++++
>  6 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
> index 237fcdd13445..fb824a71fbb2 100644
> --- a/arch/arm64/include/asm/sections.h
> +++ b/arch/arm64/include/asm/sections.h
> @@ -25,5 +25,7 @@ extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
>  extern char __hyp_text_start[], __hyp_text_end[];
>  extern char __idmap_text_start[], __idmap_text_end[];
>  extern char __irqentry_text_start[], __irqentry_text_end[];
> +extern char __mmuoff_data_start[], __mmuoff_data_end[];
> +extern char __mmuoff_text_start[], __mmuoff_text_end[];
>
>  #endif /* __ASM_SECTIONS_H */
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index b77f58355da1..9bee5394d76b 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -477,6 +477,7 @@ ENTRY(kimage_vaddr)
>   * Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in x20 if
>   * booted in EL1 or EL2 respectively.
>   */
> +       .pushsection ".mmuoff.text", "ax"
>  ENTRY(el2_setup)
>         mrs     x0, CurrentEL
>         cmp     x0, #CurrentEL_EL2
> @@ -621,17 +622,29 @@ set_cpu_boot_mode_flag:
>  ENDPROC(set_cpu_boot_mode_flag)
>
>  /*
> + * These values are written with the MMU off, but read with the MMU on.
> + * Writers will invalidate the corresponding address, discarding up to a
> + * 'Cache Writeback Granule' (CWG) worth of data. The linker script ensures
> + * sufficient alignment that the CWG doesn't overlap another section.
> + */
> +       .pushsection ".mmuoff.data.write", "aw"
> +/*
>   * We need to find out the CPU boot mode long after boot, so we need to
>   * store it in a writable variable.
>   *
>   * This is not in .bss, because we set it sufficiently early that the boot-time
>   * zeroing of .bss would clobber it.
>   */
> -       .pushsection    .data..cacheline_aligned
> -       .align  L1_CACHE_SHIFT
>  ENTRY(__boot_cpu_mode)
>         .long   BOOT_CPU_MODE_EL2
>         .long   BOOT_CPU_MODE_EL1
> +/*
> + * The booting CPU updates the failed status @__early_cpu_boot_status,
> + * with MMU turned off.
> + */
> +ENTRY(__early_cpu_boot_status)
> +       .long   0
> +
>         .popsection
>
>         /*
> @@ -687,6 +700,7 @@ __secondary_switched:
>         mov     x29, #0
>         b       secondary_start_kernel
>  ENDPROC(__secondary_switched)
> +       .popsection
>
>  /*
>   * The booting CPU updates the failed status @__early_cpu_boot_status,
> @@ -706,12 +720,6 @@ ENDPROC(__secondary_switched)
>         dc      ivac, \tmp1                     // Invalidate potentially stale cache line
>         .endm
>
> -       .pushsection    .data..cacheline_aligned
> -       .align  L1_CACHE_SHIFT
> -ENTRY(__early_cpu_boot_status)
> -       .long   0
> -       .popsection
> -
>  /*
>   * Enable the MMU.
>   *
> diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S
> index ccf79d849e0a..8c0a6957c7e8 100644
> --- a/arch/arm64/kernel/sleep.S
> +++ b/arch/arm64/kernel/sleep.S
> @@ -97,6 +97,7 @@ ENTRY(__cpu_suspend_enter)
>  ENDPROC(__cpu_suspend_enter)
>         .ltorg
>
> +       .pushsection ".mmuoff.text", "ax"
>  ENTRY(cpu_resume)
>         bl      el2_setup               // if in EL2 drop to EL1 cleanly
>         /* enable the MMU early - so we can access sleep_save_stash by va */
> @@ -106,6 +107,7 @@ ENTRY(cpu_resume)
>         adrp    x26, swapper_pg_dir
>         b       __cpu_setup
>  ENDPROC(cpu_resume)
> +       .popsection
>
>         .pushsection    ".idmap.text", "ax"
>  _resume_switched:
> diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
> index 18a71bcd26ee..9a00eee9acc8 100644
> --- a/arch/arm64/kernel/smp_spin_table.c
> +++ b/arch/arm64/kernel/smp_spin_table.c
> @@ -29,7 +29,8 @@
>  #include <asm/smp_plat.h>
>
>  extern void secondary_holding_pen(void);
> -volatile unsigned long secondary_holding_pen_release = INVALID_HWID;
> +volatile unsigned long __section(".mmuoff.data.read")
> +secondary_holding_pen_release = INVALID_HWID;
>
>  static phys_addr_t cpu_release_addr[NR_CPUS];
>
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 659963d40bb4..97c3f36ce30b 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -120,6 +120,9 @@ SECTIONS
>                         IRQENTRY_TEXT
>                         SOFTIRQENTRY_TEXT
>                         ENTRY_TEXT
> +                       __mmuoff_text_start = .;
> +                       *(.mmuoff.text)
> +                       __mmuoff_text_end = .;

Could you add a define for this like we have for the other ones?

>                         TEXT_TEXT
>                         SCHED_TEXT
>                         LOCK_TEXT
> @@ -185,6 +188,24 @@ SECTIONS
>         _data = .;
>         _sdata = .;
>         RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
> +
> +       /*
> +        * Data written with the MMU off but read with the MMU on requires
> +        * cache lines to be invalidated, discarding up to a Cache Writeback
> +        * Granule (CWG) of data from the cache. Keep the section that
> +        * requires this type of maintenance to be in its own Cache Writeback
> +        * Granule (CWG) area so the cache maintenance operations don't
> +        * don't interfere with adjacent data.
> +        */
> +       .mmuoff.data.write : ALIGN(SZ_2K) {
> +               __mmuoff_data_start = .;
> +               *(.mmuoff.data.write)
> +       }
> +       .mmuoff.data.read : ALIGN(SZ_2K) {
> +               *(.mmuoff.data.read)
> +               __mmuoff_data_end = .;
> +       }
> +

This looks fine now, with the caveat that empty sections are dropped
completely by the linker (including their alignment), and so the start
and end markers may assume unexpected values.

For instance, if .mmuoff.data.read turns out empty in the build (and I
am aware this is theoretical), the section will be dropped, and
__mmuoff_data_end may assume a value that exceeds _edata. This is
actually fine, as long as you put a

. = ALIGN(SZ_2K);

before BSS_SECTION() to ensure that the cache invalidation does not
affect .bss in case .mmuoff.data.read does turn out empty (or
alternatively, put the alignment inside the BSS_SECTION() invocation,
as I did in my example before)

-- 
Ard.





>         _edata = .;
>
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 5bb61de23201..a709e95d68ff 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -83,6 +83,7 @@ ENDPROC(cpu_do_suspend)
>   *
>   * x0: Address of context pointer
>   */
> +       .pushsection ".mmuoff.text", "ax"
>  ENTRY(cpu_do_resume)
>         ldp     x2, x3, [x0]
>         ldp     x4, x5, [x0, #16]
> @@ -111,6 +112,7 @@ ENTRY(cpu_do_resume)
>         isb
>         ret
>  ENDPROC(cpu_do_resume)
> +       .popsection
>  #endif
>
>  /*
> @@ -172,6 +174,7 @@ ENDPROC(idmap_cpu_replace_ttbr1)
>   *     Initialise the processor for turning the MMU on.  Return in x0 the
>   *     value of the SCTLR_EL1 register.
>   */
> +       .pushsection ".mmuoff.text", "ax"
>  ENTRY(__cpu_setup)
>         tlbi    vmalle1                         // Invalidate local TLB
>         dsb     nsh
> @@ -257,3 +260,4 @@ ENDPROC(__cpu_setup)
>  crval:
>         .word   0xfcffffff                      // clear
>         .word   0x34d5d91d                      // set
> +       .popsection
> --
> 2.8.0.rc3
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC
  2016-08-22 18:51   ` Catalin Marinas
@ 2016-08-23 13:33     ` James Morse
  2016-08-23 17:06       ` Catalin Marinas
  0 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2016-08-23 13:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Catalin

On 22/08/16 19:51, Catalin Marinas wrote:
> On Mon, Aug 22, 2016 at 06:35:19PM +0100, James Morse wrote:
>> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
>> index b4082017c4cb..da4470de1807 100644
>> --- a/arch/arm64/kernel/hibernate.c
>> +++ b/arch/arm64/kernel/hibernate.c
>> @@ -235,6 +235,7 @@ out:
>>  	return rc;
>>  }
>>  
>> +#define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
>>  
>>  int swsusp_arch_suspend(void)
>>  {
>> @@ -252,8 +253,14 @@ int swsusp_arch_suspend(void)
>>  	if (__cpu_suspend_enter(&state)) {
>>  		ret = swsusp_save();
>>  	} else {
>> -		/* Clean kernel to PoC for secondary core startup */
>> -		__flush_dcache_area(LMADDR(KERNEL_START), KERNEL_END - KERNEL_START);
>> +		/* Clean kernel core startup/idle code to PoC*/
>> +		dcache_clean_range(__mmuoff_text_start, __mmuoff_text_end);
>> +		dcache_clean_range(__mmuoff_data_start, __mmuoff_data_end);
>> +		dcache_clean_range(__idmap_text_start, __idmap_text_end);
>> +
>> +		/* Clean kvm setup code to PoC? */
>> +		if (el2_reset_needed())
>> +			dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);
>>  
>>  		/*
>>  		 * Tell the hibernation core that we've just restored
>> @@ -269,6 +276,32 @@ int swsusp_arch_suspend(void)
>>  	return ret;
>>  }
>>  
>> +static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
>> +{
>> +	unsigned long pfn = virt_to_pfn(addr);
> 
> I assume this is only called on the kernel linear mapping (according to
> the copy_page_tables() use in swsusp_arch_resume). Otherwise
> virt_to_pfn() would not work.

Yes, hibernate only saves/restores the linear map, on the assumption that it
contains everything.


> Something I missed in the original hibernation support but it may look
> better if you have something like:
> 
> 	pte_t pte = *src_pte;

Sure,


>> +
>> +	if (pte_valid(*src_pte)) {
>> +		/*
>> +		 * Resume will overwrite areas that may be marked
>> +		 * read only (code, rodata). Clear the RDONLY bit from
>> +		 * the temporary mappings we use during restore.
>> +		 */
>> +		set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
> 
> and here:
> 
> 		set_pte(dst_pte, pte_mkwrite(pte));

pte_mkwrite() doesn't clear the PTE_RDONLY flag.
Should it be changed?


>> +	} else if (debug_pagealloc_enabled()) {
>> +		/*
>> +		 * debug_pagealloc may have removed the PTE_VALID bit if
>> +		 * the page isn't in use by the resume kernel. It may have
>> +		 * been in use by the original kernel, in which case we need
>> +		 * to put it back in our copy to do the restore.
>> +		 *
>> +		 * Check for mappable memory that gives us a translation
>> +		 * like part of the linear map.
>> +		 */
>> +		if (pfn_valid(pfn) && pte_pfn(*src_pte) == pfn)
> 
> Is there a case where this condition is false?

Hopefully not, but I tried to avoid marking whatever happens to be there as
valid. This is as paranoid as I can make it: checking the pfn should be mapped,
and that the output-address part of the record is correct.

If you're happy with the assumption that only valid records ever appear in the
linear map page tables, (and that anything marked not-valid is a result of
debug_pagealloc), then we can change this to !pte_none().


>> +			set_pte(dst_pte, __pte((pte_val(*src_pte) & ~PTE_RDONLY) | PTE_VALID));
> 
> With some more macros:
> 
> 			set_pte(dst_pte, pte_mkwrite(pte_mkpresent(pte)))
> 
> (pte_mkpresent() needs to be added)

>> +	}
>> +}
>> +
>>  static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
>>  		    unsigned long end)
>>  {
>> @@ -284,13 +317,7 @@ static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
>>  
>>  	src_pte = pte_offset_kernel(src_pmd, start);
>>  	do {
>> -		if (!pte_none(*src_pte))
> 
> You seem to no longer check for pte_none(). Is this not needed or
> covered by the pte_pfn() != pfn check above?

A bit of both:
Previously this copied over any values it found. _copy_pte() now copies valid
values, and if debug_pagealloc is turned on, tries to guess whether the
non-valid values should be copied and marked valid.


> 
>> -			/*
>> -			 * Resume will overwrite areas that may be marked
>> -			 * read only (code, rodata). Clear the RDONLY bit from
>> -			 * the temporary mappings we use during restore.
>> -			 */
>> -			set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
>> +		_copy_pte(dst_pte, src_pte, addr);
>>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>>  
>>  	return 0;
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index ca6d268e3313..b6c0da84258c 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -139,4 +139,42 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
>>  					__pgprot(0),
>>  					__pgprot(PTE_VALID));
>>  }
>> -#endif
>> +#ifdef CONFIG_HIBERNATION
>> +/*
>> + * When built with CONFIG_DEBUG_PAGEALLOC and CONFIG_HIBERNATION, this function
>> + * is used to determine if a linear map page has been marked as not-present by
>> + * CONFIG_DEBUG_PAGEALLOC. Walk the page table and check the PTE_VALID bit.
>> + * This is based on kern_addr_valid(), which almost does what we need.
>> + */
>> +bool kernel_page_present(struct page *page)
>> +{
>> +	pgd_t *pgd;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	pte_t *pte;
>> +	unsigned long addr = (unsigned long)page_address(page);
>> +
>> +	pgd = pgd_offset_k(addr);
>> +	if (pgd_none(*pgd))
>> +		return false;
>> +
>> +	pud = pud_offset(pgd, addr);
>> +	if (pud_none(*pud))
>> +		return false;
>> +	if (pud_sect(*pud))
>> +		return true;
> 
> This wouldn't normally guarantee "present" but I don't think we ever
> have a non-present section mapping for the kernel (we do for user
> though). You may want to add a comment.

Sure.

Just in case I've totally misunderstood:
> * Because this is only called on the kernel linear map we don't need to
> * use p?d_present() to check for PROT_NONE regions, as these don't occur
> * in the linear map.


>> +
>> +	pmd = pmd_offset(pud, addr);
>> +	if (pmd_none(*pmd))
>> +		return false;
>> +	if (pmd_sect(*pmd))
>> +		return true;
>> +
>> +	pte = pte_offset_kernel(pmd, addr);
>> +	if (pte_none(*pte))
>> +		return false;
>> +
>> +	return pte_valid(*pte);
> 
> You can return pte_valid() directly without the pte_none() check since
> pte_none() implies !pte_valid().

Oops, yes.


Thanks for the comments!


James

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC
  2016-08-23 13:33     ` James Morse
@ 2016-08-23 17:06       ` Catalin Marinas
  0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2016-08-23 17:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 23, 2016 at 02:33:04PM +0100, James Morse wrote:
> On 22/08/16 19:51, Catalin Marinas wrote:
> > On Mon, Aug 22, 2016 at 06:35:19PM +0100, James Morse wrote:
> >> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> >> index b4082017c4cb..da4470de1807 100644
> >> --- a/arch/arm64/kernel/hibernate.c
> >> +++ b/arch/arm64/kernel/hibernate.c
> >> @@ -235,6 +235,7 @@ out:
> >>  	return rc;
> >>  }
> >>  
> >> +#define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
> >>  
> >>  int swsusp_arch_suspend(void)
> >>  {
> >> @@ -252,8 +253,14 @@ int swsusp_arch_suspend(void)
> >>  	if (__cpu_suspend_enter(&state)) {
> >>  		ret = swsusp_save();
> >>  	} else {
> >> -		/* Clean kernel to PoC for secondary core startup */
> >> -		__flush_dcache_area(LMADDR(KERNEL_START), KERNEL_END - KERNEL_START);
> >> +		/* Clean kernel core startup/idle code to PoC*/
> >> +		dcache_clean_range(__mmuoff_text_start, __mmuoff_text_end);
> >> +		dcache_clean_range(__mmuoff_data_start, __mmuoff_data_end);
> >> +		dcache_clean_range(__idmap_text_start, __idmap_text_end);
> >> +
> >> +		/* Clean kvm setup code to PoC? */
> >> +		if (el2_reset_needed())
> >> +			dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);
> >>  
> >>  		/*
> >>  		 * Tell the hibernation core that we've just restored
> >> @@ -269,6 +276,32 @@ int swsusp_arch_suspend(void)
> >>  	return ret;
> >>  }
> >>  
> >> +static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
> >> +{
> >> +	unsigned long pfn = virt_to_pfn(addr);
[...]
> > Something I missed in the original hibernation support but it may look
> > better if you have something like:
> > 
> > 	pte_t pte = *src_pte;
> 
> Sure,
> 
> >> +
> >> +	if (pte_valid(*src_pte)) {
> >> +		/*
> >> +		 * Resume will overwrite areas that may be marked
> >> +		 * read only (code, rodata). Clear the RDONLY bit from
> >> +		 * the temporary mappings we use during restore.
> >> +		 */
> >> +		set_pte(dst_pte, __pte(pte_val(*src_pte) & ~PTE_RDONLY));
> > 
> > and here:
> > 
> > 		set_pte(dst_pte, pte_mkwrite(pte));
> 
> pte_mkwrite() doesn't clear the PTE_RDONLY flag.
> Should it be changed?

I missed this. The PTE_RDONLY flag is supposed to be cleared by
set_pte_at() but we don't call this function here. Well, I guess it's
better as you originally wrote it, so no need to change.

> >> +	} else if (debug_pagealloc_enabled()) {
> >> +		/*
> >> +		 * debug_pagealloc may have removed the PTE_VALID bit if
> >> +		 * the page isn't in use by the resume kernel. It may have
> >> +		 * been in use by the original kernel, in which case we need
> >> +		 * to put it back in our copy to do the restore.
> >> +		 *
> >> +		 * Check for mappable memory that gives us a translation
> >> +		 * like part of the linear map.
> >> +		 */
> >> +		if (pfn_valid(pfn) && pte_pfn(*src_pte) == pfn)
> > 
> > Is there a case where this condition is false?
> 
> Hopefully not, but I tried to avoid marking whatever happens to be there as
> valid. This is as paranoid as I can make it: checking the pfn should be mapped,
> and that the output-address part of the record is correct.
> 
> If you're happy with the assumption that only valid records ever appear in the
> linear map page tables, (and that anything marked not-valid is a result of
> debug_pagealloc), then we can change this to !pte_none().

I think we can go for pte_none() but with a
BUG_ON(!pfn_valid(pte_pfn(*src_pte))).
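
A rough sketch of what that would make _copy_pte()'s fallback path look like
(illustration only, assuming the local 'pte_t pte = *src_pte' suggested
earlier):

	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
		/*
		 * The page was unmapped by debug_pagealloc in the resume
		 * kernel, but the original kernel may have been using it:
		 * make our copy valid (and writable) for the restore.
		 */
		BUG_ON(!pfn_valid(pte_pfn(pte)));
		set_pte(dst_pte, __pte((pte_val(pte) & ~PTE_RDONLY) | PTE_VALID));
	}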

> >> +			set_pte(dst_pte, __pte((pte_val(*src_pte) & ~PTE_RDONLY) | PTE_VALID));
> > 
> > With some more macros:
> > 
> > 			set_pte(dst_pte, pte_mkwrite(pte_mkpresent(pte)))
> > 
> > (pte_mkpresent() needs to be added)
> 
> >> +	}
> >> +}
> >> +
> >>  static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
> >>  		    unsigned long end)
> >>  {
> >> @@ -284,13 +317,7 @@ static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
> >>  
> >>  	src_pte = pte_offset_kernel(src_pmd, start);
> >>  	do {
> >> -		if (!pte_none(*src_pte))
> > 
> > You seem to no longer check for pte_none(). Is this not needed or
> > covered by the pte_pfn() != pfn check above?
> 
> A bit of both:
> Previously this copied over any values it found. _copy_pte() now copies valid
> values, and if debug_pagealloc is turned on, tries to guess whether
> the non-valid values should be copied and marked valid.

I haven't looked in detail at debug_pagealloc but if it only ever clears
the valid bit, we shouldn't have an issue.

> >> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> >> index ca6d268e3313..b6c0da84258c 100644
> >> --- a/arch/arm64/mm/pageattr.c
> >> +++ b/arch/arm64/mm/pageattr.c
> >> @@ -139,4 +139,42 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
> >>  					__pgprot(0),
> >>  					__pgprot(PTE_VALID));
> >>  }
> >> -#endif
> >> +#ifdef CONFIG_HIBERNATION
> >> +/*
> >> + * When built with CONFIG_DEBUG_PAGEALLOC and CONFIG_HIBERNATION, this function
> >> + * is used to determine if a linear map page has been marked as not-present by
> >> + * CONFIG_DEBUG_PAGEALLOC. Walk the page table and check the PTE_VALID bit.
> >> + * This is based on kern_addr_valid(), which almost does what we need.
> >> + */
> >> +bool kernel_page_present(struct page *page)
> >> +{
> >> +	pgd_t *pgd;
> >> +	pud_t *pud;
> >> +	pmd_t *pmd;
> >> +	pte_t *pte;
> >> +	unsigned long addr = (unsigned long)page_address(page);
> >> +
> >> +	pgd = pgd_offset_k(addr);
> >> +	if (pgd_none(*pgd))
> >> +		return false;
> >> +
> >> +	pud = pud_offset(pgd, addr);
> >> +	if (pud_none(*pud))
> >> +		return false;
> >> +	if (pud_sect(*pud))
> >> +		return true;
> > 
> > This wouldn't normally guarantee "present" but I don't think we ever
> > have a non-present section mapping for the kernel (we do for user
> > though). You may want to add a comment.
> 
> Sure.
> 
> Just in case I've totally misunderstood:
> > * Because this is only called on the kernel linear map we don't need to
> > * use p?d_present() to check for PROT_NONE regions, as these don't occur
> > * in the linear map.

Something simpler not to confuse the reader with PTE_PROT_NONE:

	/*
	 * Because this is only called on the kernel linear map,
	 * p?d_sect() implies p?d_present(). When debug_pagealloc is
	 * enabled, sections mappings are disabled.
	 */

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections
  2016-08-23 13:11   ` Ard Biesheuvel
@ 2016-08-24 15:49     ` James Morse
  2016-08-24 15:59       ` Ard Biesheuvel
  0 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2016-08-24 15:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On 23/08/16 14:11, Ard Biesheuvel wrote:
> On 22 August 2016 at 19:35, James Morse <james.morse@arm.com> wrote:
>> Resume from hibernate needs to clean any text executed by the kernel with
>> the MMU off to the PoC. Collect these functions together into a new
>> .mmuoff.text section.
> 
> Wouldn't it be much simpler to move all of this into the .idmap.text
> section? It is not that much code, and they intersect anyway (i.e.,
> some .idmap.text code is only called with the MMU off)

That's an idea; it changes the meaning of the things in that section slightly,
but saves multiplying linker sections like this.


>> Data is more complicated, secondary_holding_pen_release is written with
>> the MMU on, cleaned and invalidated, then read with the MMU off. In contrast
>> __boot_cpu_mode is written with the MMU off, the corresponding cache line
>> is invalidated, so when we read it with the MMU on we don't get stale data.
>> These cache maintenance operations conflict with each other if the values
>> are within a Cache Writeback Granule (CWG) of each other.
>> Collect the data into two sections .mmuoff.data.read and .mmuoff.data.write,
>> the linker script ensures mmuoff.data.write section is aligned to the
>> architectural maximum CWG of 2KB.
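
(For context: placing data in one of these sections from C is just a section
attribute. A minimal sketch, using a made-up helper macro rather than whatever
the patch actually defines:

        #define __mmuoff_data_read \
                __attribute__((__section__(".mmuoff.data.read")))

        /* Read by secondaries with the MMU off, so keep it in .mmuoff.data.read */
        __mmuoff_data_read volatile unsigned long
        secondary_holding_pen_release = ~0UL;   /* stand-in initial value */

The .mmuoff.data.write case is the same with the other section name.)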

>> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>> index 659963d40bb4..97c3f36ce30b 100644
>> --- a/arch/arm64/kernel/vmlinux.lds.S
>> +++ b/arch/arm64/kernel/vmlinux.lds.S
>> @@ -120,6 +120,9 @@ SECTIONS
>>                         IRQENTRY_TEXT
>>                         SOFTIRQENTRY_TEXT
>>                         ENTRY_TEXT
>> +                       __mmuoff_text_start = .;
>> +                       *(.mmuoff.text)
>> +                       __mmuoff_text_end = .;
> 
> Could you add a define for this like we have for the other ones?

I should have thought of that. If it's merged with the idmap, this disappears
completely.


> 
>>                         TEXT_TEXT
>>                         SCHED_TEXT
>>                         LOCK_TEXT
>> @@ -185,6 +188,24 @@ SECTIONS
>>         _data = .;
>>         _sdata = .;
>>         RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
>> +
>> +       /*
>> +        * Data written with the MMU off but read with the MMU on requires
>> +        * cache lines to be invalidated, discarding up to a Cache Writeback
>> +        * Granule (CWG) of data from the cache. Keep the section that
>> +        * requires this type of maintenance in its own Cache Writeback
>> +        * Granule (CWG) area so the cache maintenance operations don't
>> +        * interfere with adjacent data.
>> +        */
>> +       .mmuoff.data.write : ALIGN(SZ_2K) {
>> +               __mmuoff_data_start = .;
>> +               *(.mmuoff.data.write)
>> +       }
>> +       .mmuoff.data.read : ALIGN(SZ_2K) {
>> +               *(.mmuoff.data.read)
>> +               __mmuoff_data_end = .;
>> +       }
>> +
> 
> This looks fine now, with the caveat that empty sections are dropped
> completely by the linker (including their alignment), and so the start
> and end markers may assume unexpected values.
> 
> For instance, if .mmuoff.data.read turns out empty in the build (and I
> am aware this is theoretical), the section will be dropped, and
> __mmuoff_data_end may assume a value that exceeds _edata.

Really?
/me tries it.
... okay that's weird ...


> This is
> actually fine, as long as you put a
> 
> . = ALIGN(SZ_2K);
> 
> before BSS_SECTION() to ensure that the cache invalidation does not
> affect .bss in case .mmuoff.data.read does turn out empty

This can happen because the ALIGN() is attached to a different section, which
could be removed. Is this sufficient?
>        .mmuoff.data.write : ALIGN(SZ_2K) {
>                __mmuoff_data_start = .;
>                *(.mmuoff.data.write)
>        }
>        . = ALIGN(SZ_2K);
>        .mmuoff.data.read : {
>                *(.mmuoff.data.read)
>                __mmuoff_data_end = .;
>        }


> (or
> alternatively, put the alignment inside the BSS_SECTION() invocation,
> as I did in my example before)

What worries me about that is the alignment has nothing to do with the BSS. It's
not even anything to do with the 'PECOFF_EDATA_PADDING/_edata' that appears
before it.


Thanks,

James

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections
  2016-08-24 15:49     ` James Morse
@ 2016-08-24 15:59       ` Ard Biesheuvel
  0 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2016-08-24 15:59 UTC (permalink / raw)
  To: linux-arm-kernel

On 24 August 2016 at 17:49, James Morse <james.morse@arm.com> wrote:
> Hi Ard,
[...]
> This can happen because the ALIGN() is attached to a different section, which
> could be removed. Is this sufficient?
>>        .mmuoff.data.write : ALIGN(SZ_2K) {
>>                __mmuoff_data_start = .;
>>                *(.mmuoff.data.write)
>>        }
>>        . = ALIGN(SZ_2K);
>>        .mmuoff.data.read : {
>>                *(.mmuoff.data.read)
>>                __mmuoff_data_end = .;
>>        }
>
>

I suppose that would work, the only difference being that the location
pointer is advanced even if .mmuoff.data.write turns out empty. Of
course, that does not matter at all given that that will never be the
case anyway.

>> (or
>> alternatively, put the alignment inside the BSS_SECTION() invocation,
>> as I did in my example before)
>
> What worries me about that is the alignment has nothing to do with the BSS. It's
> not even anything to do with the 'PECOFF_EDATA_PADDING/_edata' that appears
> before it.
>

The linker simply tries not to emit padding in case the paddee ends up
being dropped.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-24 15:59 UTC | newest]

Thread overview: 11+ messages
2016-08-22 17:35 [PATCH v5 0/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
2016-08-22 17:35 ` [PATCH v5 1/3] arm64: Create sections.h James Morse
2016-08-22 17:35 ` [PATCH v5 2/3] arm64: vmlinux.ld: Add mmuoff text and data sections James Morse
2016-08-22 17:56   ` Catalin Marinas
2016-08-23 13:11   ` Ard Biesheuvel
2016-08-24 15:49     ` James Morse
2016-08-24 15:59       ` Ard Biesheuvel
2016-08-22 17:35 ` [PATCH v5 3/3] arm64: hibernate: Support DEBUG_PAGEALLOC James Morse
2016-08-22 18:51   ` Catalin Marinas
2016-08-23 13:33     ` James Morse
2016-08-23 17:06       ` Catalin Marinas
