* [PATCH v4 0/2] arm64: Introduce IRQ stack
@ 2015-10-07 15:28 ` Jungseok Lee
  0 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-07 15:28 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel
  Cc: james.morse, takahiro.akashi, mark.rutland, barami97, linux-kernel

Hi All,

This is the fourth version of the series. The major change since v3 is
stack trace support for the IRQ stack.

Any feedback or comments are always welcome.

Thanks in advance!

Changes since v3:
- Expanded stack trace to support IRQ stack
- Added more comments

Changes since v2:
- Optimised the current_thread_info function by removing the masking
  operation and the volatile keyword, per James and Catalin
- Reworked the IRQ re-entrance check logic using a top-bit comparison of
  stacks, per James
- Added sp_el0 update in cpu_resume per James
- Selected HAVE_IRQ_EXIT_ON_IRQ_STACK to expose this feature explicitly
- Added a Tested-by tag from James
- Added comments on sp_el0 as a helper message

Changes since v1:
- Rebased on top of v4.3-rc1
- Removed Kconfig about IRQ stack, per James
- Used PERCPU for IRQ stack, per James
- Tried to allocate IRQ stack when CPU is about to start up, per James
- Moved sp_el0 update into kernel_entry macro, per James
- Dropped S_SP removal patch, per Mark and James

Jungseok Lee (2):
  arm64: Introduce IRQ stack
  arm64: Expand the stack trace feature to support IRQ stack

 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/irq.h         | 18 ++++++
 arch/arm64/include/asm/thread_info.h | 10 +++-
 arch/arm64/kernel/asm-offsets.c      |  5 ++
 arch/arm64/kernel/entry.S            | 49 ++++++++++++++--
 arch/arm64/kernel/head.S             |  5 ++
 arch/arm64/kernel/irq.c              | 21 +++++++
 arch/arm64/kernel/sleep.S            |  3 +
 arch/arm64/kernel/smp.c              | 13 +++-
 arch/arm64/kernel/stacktrace.c       | 22 ++++++-
 arch/arm64/kernel/traps.c            | 13 ++++
 11 files changed, 149 insertions(+), 11 deletions(-)

-- 
2.5.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v4 1/2] arm64: Introduce IRQ stack
  2015-10-07 15:28 ` Jungseok Lee
@ 2015-10-07 15:28   ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-07 15:28 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel
  Cc: james.morse, takahiro.akashi, mark.rutland, barami97, linux-kernel

Currently, kernel context and interrupts are handled using a single
kernel stack navigated by sp_el1. This forces a system to use a 16KB
stack, not an 8KB one. This restriction makes low-memory platforms
suffer from memory pressure accompanied by performance degradation.

This patch addresses the issue by introducing a separate percpu IRQ
stack to handle both hard and soft interrupts, with two ground rules:

  - Utilize sp_el0 in EL1 context, which is not used currently
  - Do not complicate current_thread_info calculation

The core concept is to retrieve struct thread_info directly from
sp_el0. This approach avoids a large increase in text section size,
since the masking operation using THREAD_SIZE can be removed from a
great number of places.

[Thanks to James Morse for his valuable feedback, which greatly helped
in figuring out a better implementation. - Jungseok]

Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Tested-by: James Morse <james.morse@arm.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/irq.h         |  6 +++
 arch/arm64/include/asm/thread_info.h | 10 +++-
 arch/arm64/kernel/asm-offsets.c      |  2 +
 arch/arm64/kernel/entry.S            | 41 ++++++++++++++--
 arch/arm64/kernel/head.S             |  5 ++
 arch/arm64/kernel/irq.c              | 21 ++++++++
 arch/arm64/kernel/sleep.S            |  3 ++
 arch/arm64/kernel/smp.c              | 13 ++++-
 9 files changed, 93 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 07d1811..9767bd9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,6 +68,7 @@ config ARM64
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS
+	select HAVE_IRQ_EXIT_ON_IRQ_STACK
 	select HAVE_MEMBLOCK
 	select HAVE_PATA_PLATFORM
 	select HAVE_PERF_EVENTS
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index bbb251b..6ea82e8 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -5,11 +5,17 @@
 
 #include <asm-generic/irq.h>
 
+struct irq_stack {
+	void *stack;
+};
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
 extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
 
+extern int alloc_irq_stack(unsigned int cpu);
+
 static inline void acpi_irq_init(void)
 {
 	/*
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..fa014df 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -71,10 +71,16 @@ register unsigned long current_stack_pointer asm ("sp");
  */
 static inline struct thread_info *current_thread_info(void) __attribute_const__;
 
+/*
+ * struct thread_info can be accessed directly via sp_el0.
+ */
 static inline struct thread_info *current_thread_info(void)
 {
-	return (struct thread_info *)
-		(current_stack_pointer & ~(THREAD_SIZE - 1));
+	unsigned long sp_el0;
+
+	asm ("mrs %0, sp_el0" : "=r" (sp_el0));
+
+	return (struct thread_info *)sp_el0;
 }
 
 #define thread_saved_pc(tsk)	\
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 8d89cf8..b16e3cf 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -41,6 +41,8 @@ int main(void)
   BLANK();
   DEFINE(THREAD_CPU_CONTEXT,	offsetof(struct task_struct, thread.cpu_context));
   BLANK();
+  DEFINE(IRQ_STACK,		offsetof(struct irq_stack, stack));
+  BLANK();
   DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
   DEFINE(S_X1,			offsetof(struct pt_regs, regs[1]));
   DEFINE(S_X2,			offsetof(struct pt_regs, regs[2]));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 4306c93..6d4e8c5 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -88,7 +88,8 @@
 
 	.if	\el == 0
 	mrs	x21, sp_el0
-	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
+	mov	tsk, sp
+	and	tsk, tsk, #~(THREAD_SIZE - 1)	// Ensure MDSCR_EL1.SS is clear,
 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
 	disable_step_tsk x19, x20		// exceptions when scheduling.
 	.else
@@ -108,6 +109,13 @@
 	.endif
 
 	/*
+	 * Set sp_el0 to current thread_info.
+	 */
+	.if	\el == 0
+	msr	sp_el0, tsk
+	.endif
+
+	/*
 	 * Registers that may be useful after this macro is invoked:
 	 *
 	 * x21 - aborted SP
@@ -164,8 +172,28 @@ alternative_endif
 	.endm
 
 	.macro	get_thread_info, rd
-	mov	\rd, sp
-	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
+	mrs	\rd, sp_el0
+	.endm
+
+	.macro	irq_stack_entry
+	adr_l	x19, irq_stacks
+	mrs	x20, tpidr_el1
+	add	x19, x19, x20
+	ldr	x24, [x19, #IRQ_STACK]
+	and	x20, x24, #~(THREAD_SIZE - 1)
+	mov	x23, sp
+	and	x23, x23, #~(THREAD_SIZE - 1)
+	cmp	x20, x23			// check irq re-enterance
+	mov	x19, sp
+	csel	x23, x19, x24, eq		// x24 = top of irq stack
+	mov	sp, x23
+	.endm
+
+	/*
+	 * x19 is preserved between irq_stack_entry and irq_stack_exit.
+	 */
+	.macro	irq_stack_exit
+	mov	sp, x19
 	.endm
 
 /*
@@ -183,10 +211,11 @@ tsk	.req	x28		// current thread_info
  * Interrupt handling.
  */
 	.macro	irq_handler
-	adrp	x1, handle_arch_irq
-	ldr	x1, [x1, #:lo12:handle_arch_irq]
+	ldr_l	x1, handle_arch_irq
 	mov	x0, sp
+	irq_stack_entry
 	blr	x1
+	irq_stack_exit
 	.endm
 
 	.text
@@ -597,6 +626,8 @@ ENTRY(cpu_switch_to)
 	ldp	x29, x9, [x8], #16
 	ldr	lr, [x8]
 	mov	sp, x9
+	and	x9, x9, #~(THREAD_SIZE - 1)
+	msr	sp_el0, x9
 	ret
 ENDPROC(cpu_switch_to)
 
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 90d09ed..dab089b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -441,6 +441,9 @@ __mmap_switched:
 	b	1b
 2:
 	adr_l	sp, initial_sp, x4
+	mov	x4, sp
+	and	x4, x4, #~(THREAD_SIZE - 1)
+	msr	sp_el0, x4			// Save thread_info
 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
 	mov	x29, #0
@@ -618,6 +621,8 @@ ENDPROC(secondary_startup)
 ENTRY(__secondary_switched)
 	ldr	x0, [x21]			// get secondary_data.stack
 	mov	sp, x0
+	and	x0, x0, #~(THREAD_SIZE - 1)
+	msr	sp_el0, x0			// save thread_info
 	mov	x29, #0
 	b	secondary_start_kernel
 ENDPROC(__secondary_switched)
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 11dc3fd..a6bdf4d 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -31,6 +31,8 @@
 
 unsigned long irq_err_count;
 
+DEFINE_PER_CPU(struct irq_stack, irq_stacks);
+
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
 	show_ipi_list(p, prec);
@@ -50,6 +52,9 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
 
 void __init init_IRQ(void)
 {
+	if (alloc_irq_stack(smp_processor_id()))
+		panic("Failed to allocate IRQ stack for a boot cpu");
+
 	irqchip_init();
 	if (!handle_arch_irq)
 		panic("No interrupt controller found.");
@@ -115,3 +120,19 @@ void migrate_irqs(void)
 	local_irq_restore(flags);
 }
 #endif /* CONFIG_HOTPLUG_CPU */
+
+int alloc_irq_stack(unsigned int cpu)
+{
+	void *stack;
+
+	if (per_cpu(irq_stacks, cpu).stack)
+		return 0;
+
+	stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
+	if (!stack)
+		return -ENOMEM;
+
+	per_cpu(irq_stacks, cpu).stack = stack + THREAD_START_SP;
+
+	return 0;
+}
diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S
index f586f7c..e33fe33 100644
--- a/arch/arm64/kernel/sleep.S
+++ b/arch/arm64/kernel/sleep.S
@@ -173,6 +173,9 @@ ENTRY(cpu_resume)
 	/* load physical address of identity map page table in x1 */
 	adrp	x1, idmap_pg_dir
 	mov	sp, x2
+	/* save thread_info */
+	and	x2, x2, #~(THREAD_SIZE - 1)
+	msr	sp_el0, x2
 	/*
 	 * cpu_do_resume expects x0 to contain context physical address
 	 * pointer and x1 to contain physical address of 1:1 page tables
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index dbdaacd..2b8e33d 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -91,13 +91,22 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 	int ret;
 
 	/*
-	 * We need to tell the secondary core where to find its stack and the
-	 * page tables.
+	 * We need to tell the secondary core where to find its process stack
+	 * and the page tables.
 	 */
 	secondary_data.stack = task_stack_page(idle) + THREAD_START_SP;
 	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
 
 	/*
+	 * Allocate IRQ stack to handle both hard and soft interrupts.
+	 */
+	ret = alloc_irq_stack(cpu);
+	if (ret) {
+		pr_crit("CPU%u: failed to allocate IRQ stack\n", cpu);
+		return ret;
+	}
+
+	/*
 	 * Now bring the CPU into our world.
 	 */
 	ret = boot_secondary(cpu, idle);
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-07 15:28 ` Jungseok Lee
@ 2015-10-07 15:28   ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-07 15:28 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel
  Cc: james.morse, takahiro.akashi, mark.rutland, barami97, linux-kernel

Currently, a call trace drops the process stack walk when a separate IRQ
stack is used. It makes the call trace information much less useful when
a system gets paniked in interrupt context.

This patch addresses the issue with the following schemes:

  - Store aborted stack frame data
  - Decide whether another stack walk is needed or not via current sp
  - Loosen the frame pointer upper bound condition

Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/include/asm/irq.h    | 12 +++++++++++
 arch/arm64/kernel/asm-offsets.c |  3 +++
 arch/arm64/kernel/entry.S       | 10 ++++++++--
 arch/arm64/kernel/stacktrace.c  | 22 ++++++++++++++++++++-
 arch/arm64/kernel/traps.c       | 13 ++++++++++++
 5 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 6ea82e8..e5904a1 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -2,13 +2,25 @@
 #define __ASM_IRQ_H
 
 #include <linux/irqchip/arm-gic-acpi.h>
+#include <asm/stacktrace.h>
 
 #include <asm-generic/irq.h>
 
 struct irq_stack {
 	void *stack;
+	struct stackframe frame;
 };
 
+DECLARE_PER_CPU(struct irq_stack, irq_stacks);
+
+static inline bool in_irq_stack(unsigned int cpu)
+{
+	unsigned long high = (unsigned long)per_cpu(irq_stacks, cpu).stack;
+
+	return (current_stack_pointer >= round_down(high, THREAD_SIZE)) &&
+		current_stack_pointer < high;
+}
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index b16e3cf..fbb52f2d 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -42,6 +42,9 @@ int main(void)
   DEFINE(THREAD_CPU_CONTEXT,	offsetof(struct task_struct, thread.cpu_context));
   BLANK();
   DEFINE(IRQ_STACK,		offsetof(struct irq_stack, stack));
+  DEFINE(IRQ_FRAME_FP,		offsetof(struct irq_stack, frame.fp));
+  DEFINE(IRQ_FRAME_SP,		offsetof(struct irq_stack, frame.sp));
+  DEFINE(IRQ_FRAME_PC,		offsetof(struct irq_stack, frame.pc));
   BLANK();
   DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
   DEFINE(S_X1,			offsetof(struct pt_regs, regs[1]));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d4e8c5..650cc05 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -121,7 +121,8 @@
 	 * x21 - aborted SP
 	 * x22 - aborted PC
 	 * x23 - aborted PSTATE
-	*/
+	 * x29 - aborted FP
+	 */
 	.endm
 
 	.macro	kernel_exit, el
@@ -184,7 +185,12 @@ alternative_endif
 	mov	x23, sp
 	and	x23, x23, #~(THREAD_SIZE - 1)
 	cmp	x20, x23			// check irq re-enterance
-	mov	x19, sp
+	beq	1f
+	str	x29, [x19, #IRQ_FRAME_FP]
+	str	x21, [x19, #IRQ_FRAME_SP]
+	str	x22, [x19, #IRQ_FRAME_PC]
+	mov	x29, x24
+1:	mov	x19, sp
 	csel	x23, x19, x24, eq		// x24 = top of irq stack
 	mov	sp, x23
 	.endm
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 407991b..5124649 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
 	low  = frame->sp;
 	high = ALIGN(low, THREAD_SIZE);
 
-	if (fp < low || fp > high - 0x18 || fp & 0xf)
+	/*
+	 * A frame pointer would reach an upper bound if a prologue of the
+	 * first function of call trace looks as follows:
+	 *
+	 *	stp     x29, x30, [sp,#-16]!
+	 *	mov     x29, sp
+	 *
+	 * Thus, the upper bound is (top of stack - 0x20) with consideration
+	 * of a 16-byte empty space in THREAD_START_SP.
+	 *
+	 * The value, 0x20, however, does not cover all cases as interrupts
+	 * are handled using a separate stack. That is, a call trace can start
+	 * from elx_irq exception vectors. The symbols could not be promoted
+	 * to candidates for a stack trace under the restriction, 0x20.
+	 *
+	 * The scenario is handled without complexity as 1) considering
+	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
+	 * content of which is 0, and 2) allowing the case, which changes
+	 * the value to 0x10 from 0x20.
+	 */
+	if (fp < low || fp > high - 0x10 || fp & 0xf)
 		return -EINVAL;
 
 	frame->sp = fp + 0x10;
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index f93aae5..44b2f828 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
 static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 {
 	struct stackframe frame;
+	unsigned int cpu = smp_processor_id();
+	bool in_irq = in_irq_stack(cpu);
 
 	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
 
@@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 	}
 
 	pr_emerg("Call trace:\n");
+repeat:
+	if (in_irq)
+		pr_emerg("<IRQ>\n");
+
 	while (1) {
 		unsigned long where = frame.pc;
 		int ret;
@@ -179,6 +185,13 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 			break;
 		dump_backtrace_entry(where, frame.sp);
 	}
+
+	if (in_irq) {
+		frame = per_cpu(irq_stacks, cpu).frame;
+		in_irq = false;
+		pr_emerg("<EOI>\n");
+		goto repeat;
+	}
 }
 
 void show_stack(struct task_struct *tsk, unsigned long *sp)
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/2] arm64: Introduce IRQ stack
  2015-10-07 15:28   ` Jungseok Lee
@ 2015-10-08 10:25     ` Pratyush Anand
  -1 siblings, 0 replies; 60+ messages in thread
From: Pratyush Anand @ 2015-10-08 10:25 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, james.morse,
	takahiro.akashi, mark.rutland, barami97, linux-kernel

Hi Jungseok,

On 07/10/2015:03:28:11 PM, Jungseok Lee wrote:
> Currently, kernel context and interrupts are handled using a single
> kernel stack navigated by sp_el1. This forces a system to use a 16KB
> stack, not an 8KB one. This restriction makes low-memory platforms
> suffer from memory pressure accompanied by performance degradation.

How will it behave on a 64K page system? There, it would take at least 64K per
cpu, right?

> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	void *stack;
> +
> +	if (per_cpu(irq_stacks, cpu).stack)
> +		return 0;
> +
> +	stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);

The above would not compile for 64K pages, as THREAD_SIZE_ORDER is only defined
for non-64K pages. This needs to be fixed.
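
One way to at least fix the build (a rough, untested idea; with 64K pages it
would still consume a whole page per cpu) could be to derive the order from
THREAD_SIZE directly, something like:

----8<----
	/* untested sketch: size the IRQ stack from THREAD_SIZE itself */
	stack = (void *)__get_free_pages(THREADINFO_GFP,
					 get_order(THREAD_SIZE));
----8<----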

~Pratyush

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/2] arm64: Introduce IRQ stack
  2015-10-08 10:25     ` Pratyush Anand
@ 2015-10-08 14:32       ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-08 14:32 UTC (permalink / raw)
  To: Pratyush Anand
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, james.morse,
	takahiro.akashi, mark.rutland, barami97, linux-kernel

On Oct 8, 2015, at 7:25 PM, Pratyush Anand wrote:
> Hi Jungseok,

Hi Pratyush,

> 
> On 07/10/2015:03:28:11 PM, Jungseok Lee wrote:
>> Currently, kernel context and interrupts are handled using a single
>> kernel stack navigated by sp_el1. This forces a system to use a 16KB
>> stack, not an 8KB one. This restriction makes low-memory platforms
>> suffer from memory pressure accompanied by performance degradation.
> 
> How will it behave on a 64K page system? There, it would take at least 64K per
> cpu, right?

It would take 16KB per cpu even on a 64KB page system.
The following code snippet from kernel/fork.c would be helpful.

----8<----
# if THREAD_SIZE >= PAGE_SIZE                                                                  
static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,                     
                                                  int node)                                    
{
        struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,                        
                                                  THREAD_SIZE_ORDER);                          

        return page ? page_address(page) : NULL;                                               
}

static inline void free_thread_info(struct thread_info *ti)                                    
{
        free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);                                 
}
# else                                                                                         
static struct kmem_cache *thread_info_cache;                                                   

static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,                     
                                                  int node)                                    
{
        return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);                 
}

static void free_thread_info(struct thread_info *ti)                                           
{
        kmem_cache_free(thread_info_cache, ti);                                                
}

void thread_info_cache_init(void)                                                              
{
        thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                              THREAD_SIZE, 0, NULL);
        BUG_ON(thread_info_cache == NULL);                                                     
}
# endif
----8<----

> 
>> +int alloc_irq_stack(unsigned int cpu)
>> +{
>> +	void *stack;
>> +
>> +	if (per_cpu(irq_stacks, cpu).stack)
>> +		return 0;
>> +
>> +	stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> 
> The above would not compile for 64K pages, as THREAD_SIZE_ORDER is only defined
> for non-64K pages. This needs to be fixed.

Thanks for pointing it out! I will update it.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/2] arm64: Introduce IRQ stack
  2015-10-08 14:32       ` Jungseok Lee
@ 2015-10-08 16:51         ` Pratyush Anand
  -1 siblings, 0 replies; 60+ messages in thread
From: Pratyush Anand @ 2015-10-08 16:51 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, james.morse,
	takahiro.akashi, mark.rutland, barami97, linux-kernel

Hi Jungseok,

On 08/10/2015:11:32:43 PM, Jungseok Lee wrote:
> On Oct 8, 2015, at 7:25 PM, Pratyush Anand wrote:
> > Hi Jungseok,
> 
> Hi Pratyush,
> 
> > 
> > On 07/10/2015:03:28:11 PM, Jungseok Lee wrote:
> >> Currently, kernel context and interrupts are handled using a single
> >> kernel stack navigated by sp_el1. This forces a system to use a 16KB
> >> stack, not an 8KB one. This restriction makes low-memory platforms
> >> suffer from memory pressure accompanied by performance degradation.
> > 
> > How will it behave on a 64K page system? There, it would take at least 64K
> > per cpu, right?
> 
> It would take 16KB per cpu even on a 64KB page system.
> The following code snippet from kernel/fork.c would be helpful.

Yes.. yes.. it's understood.
Thanks for pointing to the code.

~Pratyush

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-07 15:28   ` Jungseok Lee
@ 2015-10-09 14:24     ` James Morse
  -1 siblings, 0 replies; 60+ messages in thread
From: James Morse @ 2015-10-09 14:24 UTC (permalink / raw)
  To: Jungseok Lee, takahiro.akashi
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, mark.rutland,
	barami97, linux-kernel

Hi Jungseok,

On 07/10/15 16:28, Jungseok Lee wrote:
> Currently, a call trace drops the process stack walk when a separate IRQ
> stack is used. It makes the call trace information much less useful when
> a system gets paniked in interrupt context.

panicked

> This patch addresses the issue with the following schemes:
> 
>   - Store aborted stack frame data
>   - Decide whether another stack walk is needed or not via current sp
>   - Loosen the frame pointer upper bound condition

It may be worth merging this patch with its predecessor - anyone trying to
bisect a problem could land between these two patches, and spend time
debugging the truncated call traces.


> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index 6ea82e8..e5904a1 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,13 +2,25 @@
>  #define __ASM_IRQ_H
>  
>  #include <linux/irqchip/arm-gic-acpi.h>
> +#include <asm/stacktrace.h>
>  
>  #include <asm-generic/irq.h>
>  
>  struct irq_stack {
>  	void *stack;
> +	struct stackframe frame;
>  };
>  
> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);

Good idea, storing this in the per-cpu data makes it immune to stack
corruption.


> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991b..5124649 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>  	low  = frame->sp;
>  	high = ALIGN(low, THREAD_SIZE);
>  
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	/*
> +	 * A frame pointer would reach an upper bound if a prologue of the
> +	 * first function of call trace looks as follows:
> +	 *
> +	 *	stp     x29, x30, [sp,#-16]!
> +	 *	mov     x29, sp
> +	 *
> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration

The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
to be the highest address, which is used first, making it the bottom of the
stack.

I would try to use the terms low/est and high/est, in keeping with the
variable names in use here.


> +	 * of a 16-byte empty space in THREAD_START_SP.
> +	 *
> +	 * The value, 0x20, however, does not cover all cases as interrupts
> +	 * are handled using a separate stack. That is, a call trace can start
> +	 * from elx_irq exception vectors. The symbols could not be promoted
> +	 * to candidates for a stack trace under the restriction, 0x20.
> +	 *
> +	 * The scenario is handled without complexity as 1) considering
> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
> +	 * content of which is 0, and 2) allowing the case, which changes
> +	 * the value to 0x10 from 0x20.

Where has 0x20 come from? The old value was 0x18.

My understanding is the highest part of the stack looks like this:
high        [ off-stack ]
high - 0x08 [ left free by THREAD_START_SP ]
high - 0x10 [ left free by THREAD_START_SP ]
high - 0x18 [#1 x30 ]
high - 0x20 [#1 x29 ]

So the condition 'fp > high - 0x18' prevents returning either 'left free'
address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
allows the first half of that reserved area to be a valid stack frame.

This change is breaking perf using incantations [0] and [1]:

Before, with just patch 1/2:
                  ---__do_softirq
                     |
                     |--92.95%-- __handle_domain_irq
                     |          __irqentry_text_start
                     |          el1_irq
                     |

After, with both patches:
                 ---__do_softirq
                    |
                    |--83.83%-- __handle_domain_irq
                    |          __irqentry_text_start
                    |          el1_irq
                    |          |
                    |          |--99.39%-- 0x400008040d00000c
                    |           --0.61%-- [...]
                    |

Changing the condition to 'fp >= high - 0x10' fixes this.

I agree it needs documenting, it is quite fiddly - I think Akashi Takahiro
is the expert.


I think unwind_frame() needs to walk the irq stack too. [2] is an example
of perf tracing back to userspace, (and there are patches on the list to
do/fix this), so we need to walk back to the start of the first stack for
the perf accounting to be correct.
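
Something along these lines (a completely untested sketch on top of this
patch, reusing the per-cpu frame saved by irq_stack_entry) might be a
starting point in unwind_frame():

----8<----
	struct irq_stack *is = this_cpu_ptr(&irq_stacks);

	/*
	 * Rough idea, untested: when the walk reaches the dummy frame at
	 * the highest address of this cpu's IRQ stack, continue from the
	 * interrupted context's frame recorded by irq_stack_entry.
	 */
	if (frame->fp == (unsigned long)is->stack) {
		*frame = is->frame;
		return 0;
	}
----8<----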


> +	 */
> +	if (fp < low || fp > high - 0x10 || fp & 0xf)
>  		return -EINVAL;
>  
>  	frame->sp = fp + 0x10;
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index f93aae5..44b2f828 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
>  static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  {
>  	struct stackframe frame;
> +	unsigned int cpu = smp_processor_id();

I wonder if there is any case where dump_backtrace() is called on another cpu?

Setting the cpu value from task_thread_info(tsk)->cpu would protect against
this.
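
For example (untested, and assuming tsk can still be NULL at this point):

	unsigned int cpu = tsk ? task_thread_info(tsk)->cpu :
				 smp_processor_id();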


> +	bool in_irq = in_irq_stack(cpu);
>  
>  	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>  
> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  	}
>  
>  	pr_emerg("Call trace:\n");
> +repeat:
> +	if (in_irq)
> +		pr_emerg("<IRQ>\n");

Do we need these? 'el1_irq()' in the trace is a giveaway...


> +
>  	while (1) {
>  		unsigned long where = frame.pc;
>  		int ret;
> @@ -179,6 +185,13 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  			break;
>  		dump_backtrace_entry(where, frame.sp);
>  	}
> +
> +	if (in_irq) {
> +		frame = per_cpu(irq_stacks, cpu).frame;
> +		in_irq = false;
> +		pr_emerg("<EOI>\n");
> +		goto repeat;
> +	}
>  }
>  
>  void show_stack(struct task_struct *tsk, unsigned long *sp)


Thanks!

James


[0] sudo ./perf record -e mem:<address of __do_softirq()>:x -ag -- sleep 10
[1] sudo ./perf report --call-graph --stdio
[2] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-09 14:24     ` James Morse
@ 2015-10-12 14:53       ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-12 14:53 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 9, 2015, at 11:24 PM, James Morse wrote:
> Hi Jungseok,

Hi James,

> On 07/10/15 16:28, Jungseok Lee wrote:
>> Currently, a call trace drops a process stack walk when a separate IRQ
>> stack is used. It makes a call trace information much less useful when
>> a system gets paniked in interrupt context.
> 
> panicked

I will fix the typo.

>> This patch addresses the issue with the following schemes:
>> 
>>  - Store aborted stack frame data
>>  - Decide whether another stack walk is needed or not via current sp
>>  - Loosen the frame pointer upper bound condition
> 
> It may be worth merging this patch with its predecessor - anyone trying to
> bisect a problem could land between these two patches, and spend time
> debugging the truncated call traces.

My original intention was to direct them to this patch, not the [1/2] one.
This separation would help anyone touching the call trace feature, including
me, focus on these changes apart from stack allocation, the IRQ recursion
check and thread_info management.

In addition, I would like to add a clear and sufficient explanation of the
frame pointer condition.

>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>> index 6ea82e8..e5904a1 100644
>> --- a/arch/arm64/include/asm/irq.h
>> +++ b/arch/arm64/include/asm/irq.h
>> @@ -2,13 +2,25 @@
>> #define __ASM_IRQ_H
>> 
>> #include <linux/irqchip/arm-gic-acpi.h>
>> +#include <asm/stacktrace.h>
>> 
>> #include <asm-generic/irq.h>
>> 
>> struct irq_stack {
>> 	void *stack;
>> +	struct stackframe frame;
>> };
>> 
>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
> 
> Good idea, storing this in the per-cpu data makes it immune to stack
> corruption.
> 
> 
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 407991b..5124649 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>> 	low  = frame->sp;
>> 	high = ALIGN(low, THREAD_SIZE);
>> 
>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>> +	/*
>> +	 * A frame pointer would reach an upper bound if a prologue of the
>> +	 * first function of call trace looks as follows:
>> +	 *
>> +	 *	stp     x29, x30, [sp,#-16]!
>> +	 *	mov     x29, sp
>> +	 *
>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
> 
> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
> to be the highest address, which is used first, making it the bottom of the
> stack.
> 
> I would try to use the terms low/est and high/est, in keeping with the
> variable names in use here.

Good idea. I'm in favor of those terms.

>> +	 * of a 16-byte empty space in THREAD_START_SP.
>> +	 *
>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>> +	 * are handled using a separate stack. That is, a call trace can start
>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>> +	 * to candidates for a stack trace under the restriction, 0x20.
>> +	 *
>> +	 * The scenario is handled without complexity as 1) considering
>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>> +	 * content of which is 0, and 2) allowing the case, which changes
>> +	 * the value to 0x10 from 0x20.
> 
> Where has 0x20 come from? The old value was 0x18.

What I meant is that (high - 0x20) is the highest valid frame pointer. The
comment should have been written more clearly.

> My understanding is the highest part of the stack looks like this:
> high        [ off-stack ]
> high - 0x08 [ left free by THREAD_START_SP ]
> high - 0x10 [ left free by THREAD_START_SP ]
> high - 0x18 [#1 x30 ]
> high - 0x20 [#1 x29 ]

A clearer description than mine!

> So the condition 'fp > high - 0x18' prevents returning either 'left free'
> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
> allows the first half of that reserved area to be a valid stack frame.

I believe my understanding is aligned with yours.

Under the current condition, 'fp > high - 0x18', it is impossible to catch the
'el1_irq' symbol. This is why I set x29 to high - 0x10 and changed the frame
pointer condition, but the changes fail to cover perf according to your data.

> This change is breaking perf using incantations [0] and [1]:

I'm reviewing how perf stack trace works..

> Before, with just patch 1/2:
>                  ---__do_softirq
>                     |
>                     |--92.95%-- __handle_domain_irq
>                     |          __irqentry_text_start
>                     |          el1_irq
>                     |
> 
> After, with both patches:
>                 ---__do_softirq
>                    |
>                    |--83.83%-- __handle_domain_irq
>                    |          __irqentry_text_start
>                    |          el1_irq
>                    |          |
>                    |          |--99.39%-- 0x400008040d00000c
>                    |           --0.61%-- [...]
>                    |
> 
> Changing the condition to 'fp >= high - 0x10' fixes this.

'fp >= high - 0x10' drops 'el1_irq' when dump_stack() or panic() is called.

> I agree it needs documenting, it is quite fiddly - I think Akashi Takahiro
> is the expert.

If possible, it would be greatly helpful.

> I think unwind_frame() needs to walk the irq stack too. [2] is an example
> of perf tracing back to userspace, (and there are patches on the list to
> do/fix this), so we need to walk back to the start of the first stack for
> the perf accounting to be correct.

Frankly, I missed the case where perf does backtrace to userspace.

IMO, this statement supports why the stack trace feature commit should be
written independently. The [1/2] patch would be pretty stable once 64KB pages
are supported. The separation might help us concentrate on the stack trace
feature from a generic dump stack, perf, and ftrace point of view.

>> +	 */
>> +	if (fp < low || fp > high - 0x10 || fp & 0xf)
>> 		return -EINVAL;
>> 
>> 	frame->sp = fp + 0x10;
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index f93aae5..44b2f828 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
>> static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>> {
>> 	struct stackframe frame;
>> +	unsigned int cpu = smp_processor_id();
> 
> I wonder if there is any case where dump_backtrace() is called on another cpu?
> 
> Setting the cpu value from task_thread_info(tsk)->cpu would protect against
> this.

IMO, no, but your suggestion makes sense. I will update it.

>> +	bool in_irq = in_irq_stack(cpu);
>> 
>> 	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>> 
>> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>> 	}
>> 
>> 	pr_emerg("Call trace:\n");
>> +repeat:
>> +	if (in_irq)
>> +		pr_emerg("<IRQ>\n");
> 
> Do we need these? 'el1_irq()' in the trace is a giveaway…

I borrowed this idea from the x86 implementation in order to show a separate
stack explicitly. There is no issue with removing these tags, <IRQ> and <EOI>.

Great thanks!

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-12 14:53       ` Jungseok Lee
@ 2015-10-12 16:34         ` James Morse
  -1 siblings, 0 replies; 60+ messages in thread
From: James Morse @ 2015-10-12 16:34 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

Hi Jungseok,

On 12/10/15 15:53, Jungseok Lee wrote:
> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>> of perf tracing back to userspace, (and there are patches on the list to
>> do/fix this), so we need to walk back to the start of the first stack for
>> the perf accounting to be correct.
> 
> Frankly, I missed the case where perf does backtrace to userspace.
> 
> IMO, this statement supports why the stack trace feature commit should be
> written independently. The [1/2] patch would be pretty stable if 64KB page
> is supported.

If this hasn't been started yet, here is a build-test-only first-pass at
the 64K page support - based on the code in kernel/fork.c:

==================%<==================
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index a6bdf4d3a57c..deb057a735ad 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -27,8 +27,22 @@
 #include <linux/init.h>
 #include <linux/irqchip.h>
 #include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/topology.h>
 #include <linux/ratelimit.h>

+#if THREAD_SIZE >= PAGE_SIZE
+#define __alloc_irq_stack(x) (void *)__get_free_pages(THREADINFO_GFP,  \
+                                                     THREAD_SIZE_ORDER)
+
+extern struct kmem_cache *irq_stack_cache;     /* dummy declaration */
+#else
+#define __alloc_irq_stack(cpu) (void *)kmem_cache_alloc_node(irq_stack_cache, \
+                                       THREADINFO_GFP, cpu_to_node(cpu))
+
+static struct kmem_cache *irq_stack_cache;
+#endif /* THREAD_SIZE >= PAGE_SIZE */

 unsigned long irq_err_count;

 DEFINE_PER_CPU(struct irq_stack, irq_stacks);
@@ -128,7 +142,17 @@ int alloc_irq_stack(unsigned int cpu)
        if (per_cpu(irq_stacks, cpu).stack)
                return 0;

-       stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
+       if (THREAD_SIZE < PAGE_SIZE) {
+               if (!irq_stack_cache) {
+                       irq_stack_cache = kmem_cache_create("irq_stack",
+                                                           THREAD_SIZE,
+                                                           THREAD_SIZE, 0,
+                                                           NULL);
+                       BUG_ON(!irq_stack_cache);
+               }
+       }
+
+       stack = __alloc_irq_stack(cpu);
        if (!stack)
                return -ENOMEM;

==================%<==================
(my mail client will almost certainly mangle that)

Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
(especially for systems with few cpus)...

The alternative is to define CONFIG_ARCH_THREAD_INFO_ALLOCATOR and
allocate all stack memory from arch code. (Largely copied code, prevents
irq stacks being a different size, and nothing uses that define today!)
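
For the record, roughly what the arch side would look like (a sketch only,
not build-tested; the cache name is made up, and the lazy kmem_cache_create()
is just for brevity - with the define present, kernel/fork.c stops providing
these two helpers):

	/* arm64-provided thread_info allocator, shared by task and irq stacks */
	static struct kmem_cache *arm64_stack_cache;

	struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
						   int node)
	{
		if (!arm64_stack_cache)
			arm64_stack_cache = kmem_cache_create("arm64_stack",
							      THREAD_SIZE,
							      THREAD_SIZE,
							      0, NULL);

		return kmem_cache_alloc_node(arm64_stack_cache,
					     THREADINFO_GFP, node);
	}

	void free_thread_info(struct thread_info *ti)
	{
		kmem_cache_free(arm64_stack_cache, ti);
	}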


Thoughts?


> 
>>> +	 */
>>> +	if (fp < low || fp > high - 0x10 || fp & 0xf)
>>> 		return -EINVAL;
>>>
>>> 	frame->sp = fp + 0x10;
>>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>>> index f93aae5..44b2f828 100644
>>> --- a/arch/arm64/kernel/traps.c
>>> +++ b/arch/arm64/kernel/traps.c
>>> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
>>> static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>> {
>>> 	struct stackframe frame;
>>> +	unsigned int cpu = smp_processor_id();
>>
>> I wonder if there is any case where dump_backtrace() is called on another cpu?
>>
>> Setting the cpu value from task_thread_info(tsk)->cpu would protect against
>> this.
> 
> IMO, no, but your suggestion makes sense. I will update it.
> 
>>> +	bool in_irq = in_irq_stack(cpu);
>>>
>>> 	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>>>
>>> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>> 	}
>>>
>>> 	pr_emerg("Call trace:\n");
>>> +repeat:
>>> +	if (in_irq)
>>> +		pr_emerg("<IRQ>\n");
>>
>> Do we need these? 'el1_irq()' in the trace is a giveaway…
> 
> I borrow this idea from x86 implementation in order to show a separate stack
> explicitly. There is no issue to remove these tags, <IRQ> and <EOI>.

Ah okay - if it's done elsewhere, it's better to be consistent.


Thanks,


James


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-12 16:34         ` James Morse
@ 2015-10-12 22:13           ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-12 22:13 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 13, 2015, at 1:34 AM, James Morse wrote:
> Hi Jungseok,

Hi James,

> On 12/10/15 15:53, Jungseok Lee wrote:
>> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>>> of perf tracing back to userspace, (and there are patches on the list to
>>> do/fix this), so we need to walk back to the start of the first stack for
>>> the perf accounting to be correct.
>> 
>> Frankly, I missed the case where perf does backtrace to userspace.
>> 
>> IMO, this statement supports why the stack trace feature commit should be
>> written independently. The [1/2] patch would be pretty stable if 64KB page
>> is supported.
> 
> If this hasn't been started yet, here is a build-test-only first-pass at
> the 64K page support - based on the code in kernel/fork.c:
> 
> ==================%<==================
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index a6bdf4d3a57c..deb057a735ad 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -27,8 +27,22 @@
> #include <linux/init.h>
> #include <linux/irqchip.h>
> #include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/topology.h>
> #include <linux/ratelimit.h>
> 
> +#if THREAD_SIZE >= PAGE_SIZE
> +#define __alloc_irq_stack(x) (void *)__get_free_pages(THREADINFO_GFP,  \
> +                                                     THREAD_SIZE_ORDER)
> +
> +extern struct kmem_cache *irq_stack_cache;     /* dummy declaration */
> +#else
> +#define __alloc_irq_stack(cpu) (void *)kmem_cache_alloc_node(irq_stack_cache, \
> +                                       THREADINFO_GFP, cpu_to_node(cpu))
> +
> +static struct kmem_cache *irq_stack_cache;
> +#endif /* THREAD_SIZE >= PAGE_SIZE */
> 
> unsigned long irq_err_count;
> 
> DEFINE_PER_CPU(struct irq_stack, irq_stacks);
> @@ -128,7 +142,17 @@ int alloc_irq_stack(unsigned int cpu)
>        if (per_cpu(irq_stacks, cpu).stack)
>                return 0;
> 
> -       stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +       if (THREAD_SIZE < PAGE_SIZE) {
> +               if (!irq_stack_cache) {
> +                       irq_stack_cache = kmem_cache_create("irq_stack",
> +                                                           THREAD_SIZE,
> +                                                           THREAD_SIZE, 0,
> +                                                           NULL);
> +                       BUG_ON(!irq_stack_cache);
> +               }
> +       }
> +
> +       stack = __alloc_irq_stack(cpu);
>        if (!stack)
>                return -ENOMEM;
> 
> ==================%<==================
> (my mail client will almost certainly mangle that)
> 
> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
> (especially for systems with few cpus)…

That is the one concern. To address this issue, I dropped the 'static'
keyword from thread_info_cache. Please refer to the hunk below.

> The alternative is to defining CONFIG_ARCH_THREAD_INFO_ALLOCATOR and
> allocate all stack memory from arch code. (Largely copied code, prevents
> irq stacks being a different size, and nothing uses that define today!)
> 
> 
> Thoughts?

It's almost the same story as the one I've been testing.

I'm aligned with you regarding CONFIG_ARCH_THREAD_INFO_ALLOCATOR.

Another approach I've tried is the following data structure, but it's not
a good fit for this case because __per_cpu_offset is only page-size aligned,
not thread-size aligned.

struct irq_stack {
	char stack[THREAD_SIZE];
	char *highest;
} __aligned(THREAD_SIZE);

DEFINE_PER_CPU(struct irq_stack, irq_stacks);
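
(The way I convinced myself: each CPU's copy ends up at the variable's
address plus __per_cpu_offset[cpu], and that offset is only guaranteed to be
page aligned, so with 4K pages the __aligned(THREAD_SIZE) above is not
preserved for secondary CPUs. A check along these lines - illustrative only -
can trigger there:)

	/* sketch: warn if a cpu's irq stack is not THREAD_SIZE aligned */
	static void check_irq_stack_alignment(unsigned int cpu)
	{
		struct irq_stack *s = per_cpu_ptr(&irq_stacks, cpu);

		WARN_ON((unsigned long)s & (THREAD_SIZE - 1));
	}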

----8<-----
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 6ea82e8..d3619b3 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -1,7 +1,9 @@
 #ifndef __ASM_IRQ_H
 #define __ASM_IRQ_H
 
+#include <linux/gfp.h>
 #include <linux/irqchip/arm-gic-acpi.h>
+#include <linux/slab.h>
 
 #include <asm-generic/irq.h>
 
@@ -9,6 +11,21 @@ struct irq_stack {
        void *stack;
 };
 
+#if THREAD_SIZE >= PAGE_SIZE
+static inline void *__alloc_irq_stack(void)
+{
+       return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
+                                       THREAD_SIZE_ORDER);
+}
+#else
+extern struct kmem_cache *thread_info_cache;
+
+static inline void *__alloc_irq_stack(void)
+{
+       return kmem_cache_alloc(thread_info_cache, THREADINFO_GFP | __GFP_ZERO);
+}
+#endif
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index a6bdf4d..4e13bdd 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -50,10 +50,13 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
        handle_arch_irq = handle_irq;
 }
 
+static char boot_irq_stack[THREAD_SIZE] __aligned(THREAD_SIZE);
+
 void __init init_IRQ(void)
 {
-       if (alloc_irq_stack(smp_processor_id()))
-               panic("Failed to allocate IRQ stack for a boot cpu");
+       unsigned int cpu = smp_processor_id();
+
+       per_cpu(irq_stacks, cpu).stack = boot_irq_stack + THREAD_START_SP;
 
        irqchip_init();
        if (!handle_arch_irq)
@@ -128,7 +131,7 @@ int alloc_irq_stack(unsigned int cpu)
        if (per_cpu(irq_stacks, cpu).stack)
                return 0;
 
-       stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
+       stack = __alloc_irq_stack();
        if (!stack)
                return -ENOMEM;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2845623..9c55f86 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -172,7 +172,7 @@ static inline void free_thread_info(struct thread_info *ti)
        free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
-static struct kmem_cache *thread_info_cache;
+struct kmem_cache *thread_info_cache;
 
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                  int node)
----8<-----

Best Regards
Jungseok Lee

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-12 22:13           ` Jungseok Lee
@ 2015-10-13 11:00             ` James Morse
  -1 siblings, 0 replies; 60+ messages in thread
From: James Morse @ 2015-10-13 11:00 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

Hi Jungseok,

On 12/10/15 23:13, Jungseok Lee wrote:
> On Oct 13, 2015, at 1:34 AM, James Morse wrote:
>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
>> (especially for systems with few cpus)…
> 
> This would be a single concern. To address this issue, I drop the 'static'
> keyword in thread_info_cache. Please refer to the below hunk.

It's only a problem on systems with 64K pages, which don't have a multiple
of 4 cpus. I suspect if you turn on 64K pages, you have many cores with
plenty of memory...
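
To spell out the arithmetic behind the 'multiple of 4' remark (16K
THREAD_SIZE with 64K pages, so a THREAD_SIZE-aligned cache packs four stacks
into each 64K slab page):

	8 cpus -> 2 slab pages, nothing left over
	6 cpus -> 2 slab pages, 2 x 16K = 32K unused in the second page

i.e. the second cache only really costs anything when the cpu count isn't a
multiple of four.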


>> The alternative is to defining CONFIG_ARCH_THREAD_INFO_ALLOCATOR and
>> allocate all stack memory from arch code. (Largely copied code, prevents
>> irq stacks being a different size, and nothing uses that define today!)
>>
>>
>> Thoughts?
> 
> Almost same story I've been testing.
> 
> I'm aligned with yours Regarding CONFIG_ARCH_THREAD_INFO_ALLOCATOR.
> 
> Another approach I've tried is the following data structure, but it's not
> a good fit for this case due to __per_cpu_offset which is page-size aligned,
> not thread-size.
> 
> struct irq_stack {
> 	char stack[THREAD_SIZE];
> 	char *highest;
> } __aligned(THREAD_SIZE);
> 
> DEFINE_PER_CPU(struct irq_stack, irq_stacks);

Yes, x86 does this - but it increases the Image size by 16K, as that space
could have some initialisation values. This isn't a problem on x86 as
no-one uses the uncompressed image.

I would avoid this approach due to the bloat!

> 
> ----8<-----
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index 6ea82e8..d3619b3 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -1,7 +1,9 @@
>  #ifndef __ASM_IRQ_H
>  #define __ASM_IRQ_H
>  
> +#include <linux/gfp.h>
>  #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/slab.h>
>  
>  #include <asm-generic/irq.h>
>  
> @@ -9,6 +11,21 @@ struct irq_stack {
>         void *stack;
>  };
>  
> +#if THREAD_SIZE >= PAGE_SIZE
> +static inline void *__alloc_irq_stack(void)
> +{
> +       return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
> +                                       THREAD_SIZE_ORDER);
> +}
> +#else
> +extern struct kmem_cache *thread_info_cache;

If this has been made a published symbol, it should go in a header file.

> +
> +static inline void *__alloc_irq_stack(void)
> +{
> +       return kmem_cache_alloc(thread_info_cache, THREADINFO_GFP | __GFP_ZERO);
> +}
> +#endif
> +
>  struct pt_regs;
>  
>  extern void migrate_irqs(void);
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index a6bdf4d..4e13bdd 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -50,10 +50,13 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>         handle_arch_irq = handle_irq;
>  }
>  
> +static char boot_irq_stack[THREAD_SIZE] __aligned(THREAD_SIZE);
> +
>  void __init init_IRQ(void)
>  {
> -       if (alloc_irq_stack(smp_processor_id()))
> -               panic("Failed to allocate IRQ stack for a boot cpu");
> +       unsigned int cpu = smp_processor_id();
> +
> +       per_cpu(irq_stacks, cpu).stack = boot_irq_stack + THREAD_START_SP;
>  
>         irqchip_init();
>         if (!handle_arch_irq)
> @@ -128,7 +131,7 @@ int alloc_irq_stack(unsigned int cpu)
>         if (per_cpu(irq_stacks, cpu).stack)
>                 return 0;
>  
> -       stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +       stack = __alloc_irq_stack();
>         if (!stack)
>                 return -ENOMEM;
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2845623..9c55f86 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -172,7 +172,7 @@ static inline void free_thread_info(struct thread_info *ti)
>         free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
> -static struct kmem_cache *thread_info_cache;
> +struct kmem_cache *thread_info_cache;
>  
>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>                                                   int node)
> ----8<-----


Looks good!


Thanks,

James





^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-13 11:00             ` James Morse
@ 2015-10-13 15:00               ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-13 15:00 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 13, 2015, at 8:00 PM, James Morse wrote:
> Hi Jungseok,

Hi James,

> On 12/10/15 23:13, Jungseok Lee wrote:
>> On Oct 13, 2015, at 1:34 AM, James Morse wrote:
>>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
>>> (especially for systems with few cpus)…
>> 
>> This would be a single concern. To address this issue, I drop the 'static'
>> keyword in thread_info_cache. Please refer to the below hunk.
> 
> Its only a problem on systems with 64K pages, which don't have a multiple
> of 4 cpus. I suspect if you turn on 64K pages, you have many cores with
> plenty of memory…

Yes, the 'two kmem_caches' problem only comes from 64K page systems.

I don't get the statement 'which don't have a multiple of 4 cpus'.
Could you point out what I am missing?

Since I don't have platforms which have many cores and huge memory,
I cannot play with this series on them.

>>> The alternative is to defining CONFIG_ARCH_THREAD_INFO_ALLOCATOR and
>>> allocate all stack memory from arch code. (Largely copied code, prevents
>>> irq stacks being a different size, and nothing uses that define today!)
>>> 
>>> 
>>> Thoughts?
>> 
>> Almost same story I've been testing.
>> 
>> I'm aligned with yours Regarding CONFIG_ARCH_THREAD_INFO_ALLOCATOR.
>> 
>> Another approach I've tried is the following data structure, but it's not
>> a good fit for this case due to __per_cpu_offset which is page-size aligned,
>> not thread-size.
>> 
>> struct irq_stack {
>> 	char stack[THREAD_SIZE];
>> 	char *highest;
>> } __aligned(THREAD_SIZE);
>> 
>> DEFINE_PER_CPU(struct irq_stack, irq_stacks);
> 
> Yes, x86 does this - but it increases the Image size by 16K, as that space
> could have some initialisation values. This isn't a problem on x86 as
> no-one uses the uncompressed image.
> 
> I would avoid this approach due to the bloat!
> 
>> 
>> ----8<-----
>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>> index 6ea82e8..d3619b3 100644
>> --- a/arch/arm64/include/asm/irq.h
>> +++ b/arch/arm64/include/asm/irq.h
>> @@ -1,7 +1,9 @@
>> #ifndef __ASM_IRQ_H
>> #define __ASM_IRQ_H
>> 
>> +#include <linux/gfp.h>
>> #include <linux/irqchip/arm-gic-acpi.h>
>> +#include <linux/slab.h>
>> 
>> #include <asm-generic/irq.h>
>> 
>> @@ -9,6 +11,21 @@ struct irq_stack {
>>        void *stack;
>> };
>> 
>> +#if THREAD_SIZE >= PAGE_SIZE
>> +static inline void *__alloc_irq_stack(void)
>> +{
>> +       return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
>> +                                       THREAD_SIZE_ORDER);
>> +}
>> +#else
>> +extern struct kmem_cache *thread_info_cache;
> 
> If this has been made a published symbol, it should go in a header file.

Sure.

>> +
>> +static inline void *__alloc_irq_stack(void)
>> +{
>> +       return kmem_cache_alloc(thread_info_cache, THREADINFO_GFP | __GFP_ZERO);
>> +}
>> +#endif
>> +
>> struct pt_regs;
>> 
>> extern void migrate_irqs(void);
>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>> index a6bdf4d..4e13bdd 100644
>> --- a/arch/arm64/kernel/irq.c
>> +++ b/arch/arm64/kernel/irq.c
>> @@ -50,10 +50,13 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>>        handle_arch_irq = handle_irq;
>> }
>> 
>> +static char boot_irq_stack[THREAD_SIZE] __aligned(THREAD_SIZE);
>> +
>> void __init init_IRQ(void)
>> {
>> -       if (alloc_irq_stack(smp_processor_id()))
>> -               panic("Failed to allocate IRQ stack for a boot cpu");
>> +       unsigned int cpu = smp_processor_id();
>> +
>> +       per_cpu(irq_stacks, cpu).stack = boot_irq_stack + THREAD_START_SP;
>> 
>>        irqchip_init();
>>        if (!handle_arch_irq)
>> @@ -128,7 +131,7 @@ int alloc_irq_stack(unsigned int cpu)
>>        if (per_cpu(irq_stacks, cpu).stack)
>>                return 0;
>> 
>> -       stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>> +       stack = __alloc_irq_stack();
>>        if (!stack)
>>                return -ENOMEM;
>> 
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 2845623..9c55f86 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -172,7 +172,7 @@ static inline void free_thread_info(struct thread_info *ti)
>>        free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>> }
>> # else
>> -static struct kmem_cache *thread_info_cache;
>> +struct kmem_cache *thread_info_cache;
>> 
>> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>>                                                  int node)
>> ----8<-----
> 
> 
> Looks good!

Thanks for reviewing the code!

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-09 14:24     ` James Morse
@ 2015-10-14  7:13       ` AKASHI Takahiro
  -1 siblings, 0 replies; 60+ messages in thread
From: AKASHI Takahiro @ 2015-10-14  7:13 UTC (permalink / raw)
  To: James Morse, Jungseok Lee
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, mark.rutland,
	barami97, linux-kernel

On 10/09/2015 11:24 PM, James Morse wrote:
> Hi Jungseok,
>
> On 07/10/15 16:28, Jungseok Lee wrote:
>> Currently, a call trace drops a process stack walk when a separate IRQ
>> stack is used. It makes a call trace information much less useful when
>> a system gets paniked in interrupt context.
>
> panicked
>
>> This patch addresses the issue with the following schemes:
>>
>>    - Store aborted stack frame data
>>    - Decide whether another stack walk is needed or not via current sp
>>    - Loosen the frame pointer upper bound condition
>
> It may be worth merging this patch with its predecessor - anyone trying to
> bisect a problem could land between these two patches, and spend time
> debugging the truncated call traces.
>
>
>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>> index 6ea82e8..e5904a1 100644
>> --- a/arch/arm64/include/asm/irq.h
>> +++ b/arch/arm64/include/asm/irq.h
>> @@ -2,13 +2,25 @@
>>   #define __ASM_IRQ_H
>>
>>   #include <linux/irqchip/arm-gic-acpi.h>
>> +#include <asm/stacktrace.h>
>>
>>   #include <asm-generic/irq.h>
>>
>>   struct irq_stack {
>>   	void *stack;
>> +	struct stackframe frame;
>>   };
>>
>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
>
> Good idea, storing this in the per-cpu data makes it immune to stack
> corruption.

Is this the only reason that you have a dummy stack frame in per-cpu data?
By placing this frame in an interrupt stack, I think, we will be able to eliminate
the changes in dump_stack().

>
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 407991b..5124649 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>>   	low  = frame->sp;
>>   	high = ALIGN(low, THREAD_SIZE);
>>
>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>> +	/*
>> +	 * A frame pointer would reach an upper bound if a prologue of the
>> +	 * first function of call trace looks as follows:
>> +	 *
>> +	 *	stp     x29, x30, [sp,#-16]!
>> +	 *	mov     x29, sp
>> +	 *
>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
>
> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
> to be the highest address, which is used first, making it the bottom of the
> stack.
>
> I would try to use the terms low/est and high/est, in keeping with the
> variable names in use here.
>
>
>> +	 * of a 16-byte empty space in THREAD_START_SP.
>> +	 *
>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>> +	 * are handled using a separate stack. That is, a call trace can start
>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>> +	 * to candidates for a stack trace under the restriction, 0x20.
>> +	 *
>> +	 * The scenario is handled without complexity as 1) considering
>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>> +	 * content of which is 0, and 2) allowing the case, which changes
>> +	 * the value to 0x10 from 0x20.
>
> Where has 0x20 come from? The old value was 0x18.
>
> My understanding is the highest part of the stack looks like this:
> high        [ off-stack ]
> high - 0x08 [ left free by THREAD_START_SP ]
> high - 0x10 [ left free by THREAD_START_SP ]
> high - 0x18 [#1 x30 ]
> high - 0x20 [#1 x29 ]
>
> So the condition 'fp > high - 0x18' prevents returning either 'left free'
> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
> allows the first half of that reserved area to be a valid stack frame.
>
> This change is breaking perf using incantations [0] and [1]:
>
> Before, with just patch 1/2:
>                    ---__do_softirq
>                       |
>                       |--92.95%-- __handle_domain_irq
>                       |          __irqentry_text_start
>                       |          el1_irq
>                       |
>
> After, with both patches:
>                   ---__do_softirq
>                      |
>                      |--83.83%-- __handle_domain_irq
>                      |          __irqentry_text_start
>                      |          el1_irq
>                      |          |
>                      |          |--99.39%-- 0x400008040d00000c
>                      |           --0.61%-- [...]
>                      |

This also shows that walk_stackframe() doesn't walk through a process stack.
Now I'm trying the following hack on top of Jungseok's patch.
(It doesn't traverse from an irq stack to a process stack yet. I need to modify
unwind_frame().)

Thanks,
-Takahiro AKASHI
----8<----
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 650cc05..5fbd1ea 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -185,14 +185,12 @@ alternative_endif
  	mov	x23, sp
  	and	x23, x23, #~(THREAD_SIZE - 1)
  	cmp	x20, x23			// check irq re-enterance
+	mov	x19, sp
  	beq	1f
-	str	x29, [x19, #IRQ_FRAME_FP]
-	str	x21, [x19, #IRQ_FRAME_SP]
-	str	x22, [x19, #IRQ_FRAME_PC]
-	mov	x29, x24
-1:	mov	x19, sp
-	csel	x23, x19, x24, eq		// x24 = top of irq stack
-	mov	sp, x23
+	mov	sp, x24				// x24 = top of irq stack
+	stp	x29, x22, [sp, #-32]!
+	mov	x29, sp
+1:
  	.endm

  	/*
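
A rough C sketch of the unwind_frame() change hinted at above (not the posted
code; on_irq_stack() is a hypothetical helper that tests whether an address
lies on this CPU's IRQ stack):

	int notrace unwind_frame(struct stackframe *frame)
	{
		unsigned long fp = frame->fp;
		unsigned long low  = frame->sp;
		unsigned long high = ALIGN(low, THREAD_SIZE);

		if (fp < low || fp > high - 0x18 || fp & 0xf)
			return -EINVAL;

		frame->sp = fp + 0x10;
		frame->fp = *(unsigned long *)(fp);
		frame->pc = *(unsigned long *)(fp + 8);

		/*
		 * If the new fp has left the IRQ stack, the walk is crossing
		 * over to the interrupted task's stack via the dummy frame
		 * pushed in the macro above.  Restart the window from the new
		 * fp (x29 == sp right after a function prologue), so the next
		 * iteration computes low/high on the process stack.
		 */
		if (on_irq_stack(low) && !on_irq_stack(frame->fp))
			frame->sp = frame->fp;

		return 0;
	}
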

^ permalink raw reply related	[flat|nested] 60+ messages in thread


* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-13 15:00               ` Jungseok Lee
@ 2015-10-14 12:12                 ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-14 12:12 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 14, 2015, at 12:00 AM, Jungseok Lee wrote:
> On Oct 13, 2015, at 8:00 PM, James Morse wrote:
>> Hi Jungseok,
> 
> Hi James,
> 
>> On 12/10/15 23:13, Jungseok Lee wrote:
>>> On Oct 13, 2015, at 1:34 AM, James Morse wrote:
>>>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
>>>> (especially for systems with few cpus)…
>>> 
>>> This would be a single concern. To address this issue, I drop the 'static'
>>> keyword in thread_info_cache. Please refer to the below hunk.
>> 
>> It's only a problem on systems with 64K pages, which don't have a multiple
>> of 4 cpus. I suspect if you turn on 64K pages, you have many cores with
>> plenty of memory…
> 
> Yes, the problem 'two kmem_caches' comes from only 64K page system.
> 
> I don't get the statement 'which don't have a multiple of 4 cpus'.
> Could you point out what I am missing?

You're talking about sl{a|u}b allocator behavior. If so, I got what you meant.
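
For concreteness (assuming 16K stacks on 64K pages, as discussed): one slab
page holds 64K / 16K = 4 IRQ stacks, so a second, dedicated kmem_cache wastes
at most 3 * 16K = 48K when the online CPU count is not a multiple of 4;
noticeable on a tiny system, negligible on the many-core machines that
typically enable 64K pages.
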

> Since I don't have platforms which have many cores and huge memory,
> I cannot play with this series on them.
> 
>>>> The alternative is to defining CONFIG_ARCH_THREAD_INFO_ALLOCATOR and
>>>> allocate all stack memory from arch code. (Largely copied code, prevents
>>>> irq stacks being a different size, and nothing uses that define today!)
>>>> 
>>>> 
>>>> Thoughts?
>>> 
>>> Almost same story I've been testing.
>>> 
>>> I'm aligned with yours Regarding CONFIG_ARCH_THREAD_INFO_ALLOCATOR.
>>> 
>>> Another approach I've tried is the following data structure, but it's not
>>> a good fit for this case due to __per_cpu_offset which is page-size aligned,
>>> not thread-size.
>>> 
>>> struct irq_stack {
>>> 	char stack[THREAD_SIZE];
>>> 	char *highest;
>>> } __aligned(THREAD_SIZE);
>>> 
>>> DEFINE_PER_CPU(struct irq_stack, irq_stacks);
>> 
>> Yes, x86 does this - but it increases the Image size by 16K, as that space
>> could have some initialisation values. This isn't a problem on x86 as
>> no-one uses the uncompressed image.
>> 
>> I would avoid this approach due to the bloat!
>> 
>>> 
>>> ----8<-----
>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>>> index 6ea82e8..d3619b3 100644
>>> --- a/arch/arm64/include/asm/irq.h
>>> +++ b/arch/arm64/include/asm/irq.h
>>> @@ -1,7 +1,9 @@
>>> #ifndef __ASM_IRQ_H
>>> #define __ASM_IRQ_H
>>> 
>>> +#include <linux/gfp.h>
>>> #include <linux/irqchip/arm-gic-acpi.h>
>>> +#include <linux/slab.h>
>>> 
>>> #include <asm-generic/irq.h>
>>> 
>>> @@ -9,6 +11,21 @@ struct irq_stack {
>>>       void *stack;
>>> };
>>> 
>>> +#if THREAD_SIZE >= PAGE_SIZE
>>> +static inline void *__alloc_irq_stack(void)
>>> +{
>>> +       return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
>>> +                                       THREAD_SIZE_ORDER);
>>> +}
>>> +#else
>>> +extern struct kmem_cache *thread_info_cache;
>> 
>> If this has been made a published symbol, it should go in a header file.
> 
> Sure.

I had the wrong impression that there was room under include/linux/*.

IMO, this is an architectural option: whether an arch relies on thread_info_cache or not.
In other words, it would be clearer to put this extern under arch/*/include/asm/*.
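
As a minimal sketch of that option (the exact header and guard are
assumptions, nothing settled in this thread), the declaration could sit next
to the helpers added earlier:

	/* arch/arm64/include/asm/irq.h -- hypothetical placement */
	#if THREAD_SIZE < PAGE_SIZE
	extern struct kmem_cache *thread_info_cache;	/* defined in kernel/fork.c */
	#endif
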

Thoughts?

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread


* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-14  7:13       ` AKASHI Takahiro
@ 2015-10-14 12:24         ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-14 12:24 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 14, 2015, at 4:13 PM, AKASHI Takahiro wrote:
> On 10/09/2015 11:24 PM, James Morse wrote:
>> Hi Jungseok,
>> 
>> On 07/10/15 16:28, Jungseok Lee wrote:
>>> Currently, a call trace drops a process stack walk when a separate IRQ
>>> stack is used. It makes a call trace information much less useful when
>>> a system gets paniked in interrupt context.
>> 
>> panicked
>> 
>>> This patch addresses the issue with the following schemes:
>>> 
>>>   - Store aborted stack frame data
>>>   - Decide whether another stack walk is needed or not via current sp
>>>   - Loosen the frame pointer upper bound condition
>> 
>> It may be worth merging this patch with its predecessor - anyone trying to
>> bisect a problem could land between these two patches, and spend time
>> debugging the truncated call traces.
>> 
>> 
>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>>> index 6ea82e8..e5904a1 100644
>>> --- a/arch/arm64/include/asm/irq.h
>>> +++ b/arch/arm64/include/asm/irq.h
>>> @@ -2,13 +2,25 @@
>>>  #define __ASM_IRQ_H
>>> 
>>>  #include <linux/irqchip/arm-gic-acpi.h>
>>> +#include <asm/stacktrace.h>
>>> 
>>>  #include <asm-generic/irq.h>
>>> 
>>>  struct irq_stack {
>>>  	void *stack;
>>> +	struct stackframe frame;
>>>  };
>>> 
>>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
>> 
>> Good idea, storing this in the per-cpu data makes it immune to stack
>> corruption.
> 
> Is this the only reason that you have a dummy stack frame in per-cpu data?
> By placing this frame in an interrupt stack, I think, we will be able to eliminate
> changes in dump_stace(). and
> 
>> 
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index 407991b..5124649 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>>>  	low  = frame->sp;
>>>  	high = ALIGN(low, THREAD_SIZE);
>>> 
>>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>>> +	/*
>>> +	 * A frame pointer would reach an upper bound if a prologue of the
>>> +	 * first function of call trace looks as follows:
>>> +	 *
>>> +	 *	stp     x29, x30, [sp,#-16]!
>>> +	 *	mov     x29, sp
>>> +	 *
>>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
>> 
>> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
>> to be the highest address, which is used first, making it the bottom of the
>> stack.
>> 
>> I would try to use the terms low/est and high/est, in keeping with the
>> variable names in use here.
>> 
>> 
>>> +	 * of a 16-byte empty space in THREAD_START_SP.
>>> +	 *
>>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>>> +	 * are handled using a separate stack. That is, a call trace can start
>>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>>> +	 * to candidates for a stack trace under the restriction, 0x20.
>>> +	 *
>>> +	 * The scenario is handled without complexity as 1) considering
>>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>>> +	 * content of which is 0, and 2) allowing the case, which changes
>>> +	 * the value to 0x10 from 0x20.
>> 
>> Where has 0x20 come from? The old value was 0x18.
>> 
>> My understanding is the highest part of the stack looks like this:
>> high        [ off-stack ]
>> high - 0x08 [ left free by THREAD_START_SP ]
>> high - 0x10 [ left free by THREAD_START_SP ]
>> high - 0x18 [#1 x30 ]
>> high - 0x20 [#1 x29 ]
>> 
>> So the condition 'fp > high - 0x18' prevents returning either 'left free'
>> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
>> allows the first half of that reserved area to be a valid stack frame.
>> 
>> This change is breaking perf using incantations [0] and [1]:
>> 
>> Before, with just patch 1/2:
>>                   ---__do_softirq
>>                      |
>>                      |--92.95%-- __handle_domain_irq
>>                      |          __irqentry_text_start
>>                      |          el1_irq
>>                      |
>> 
>> After, with both patches:
>>                  ---__do_softirq
>>                     |
>>                     |--83.83%-- __handle_domain_irq
>>                     |          __irqentry_text_start
>>                     |          el1_irq
>>                     |          |
>>                     |          |--99.39%-- 0x400008040d00000c
>>                     |           --0.61%-- [...]
>>                     |
> 
> This also shows that walk_stackframe() doesn't walk through a process stack.
> Now I'm trying the following hack on top of Jungseok's patch.
> (It doesn't traverse from an irq stack to an process stack yet. I need modify
> unwind_frame().)

I've found a difference between perf and dump_backtrace() while reviewing the perf
call chain operation. Perf relies on walk_stackframe(), but dump_backtrace() does not.
That is, a symbol is printed out *before* the unwind_frame() call in the case of perf.
By contrast, dump_backtrace() records a symbol *after* unwind_frame(). I think the
perf behavior is correct since frame.pc is retrieved from a valid stack frame.

So, the following diff is a prerequisite. It looks reasonable to remove the dump_mem()
call since frame.sp is calculated incorrectly now. If accepted, dump_backtrace()
could utilize walk_stackframe(), which simplifies the code.

----8<----
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index f93aae5..e18be43 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
        set_fs(fs);
 }
 
-static void dump_backtrace_entry(unsigned long where, unsigned long stack)
+static void dump_backtrace_entry(unsigned long where)
 {
+       /*
+        * PC has a physical address when MMU is disabled.
+        */
+       if (!kernel_text_address(where))
+               where = (unsigned long)phys_to_virt(where);
+
        print_ip_sym(where);
-       if (in_exception_text(where))
-               dump_mem("", "Exception stack", stack,
-                        stack + sizeof(struct pt_regs), false);
 }
 
 static void dump_instr(const char *lvl, struct pt_regs *regs)
@@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
        pr_emerg("Call trace:\n");
        while (1) {
                unsigned long where = frame.pc;
+               unsigned long stack;
                int ret;
 
+               dump_backtrace_entry(where);
                ret = unwind_frame(&frame);
                if (ret < 0)
                        break;
-               dump_backtrace_entry(where, frame.sp);
+               stack = frame.sp;
+               if (in_exception_text(where))
+                       dump_mem("", "Exception stack", stack,
+                                stack + sizeof(struct pt_regs), false);
        }
 }
----8<----
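
For illustration, once that is accepted, dump_backtrace() could be reduced to
a walk_stackframe() callback along these lines (a sketch only, on top of the
diff above; the exception-stack dump is omitted):

	static int dump_backtrace_frame(struct stackframe *frame, void *data)
	{
		dump_backtrace_entry(frame->pc);
		return 0;	/* keep walking */
	}

	/* ... and in dump_backtrace(), once 'frame' is initialised: */
	walk_stackframe(&frame, dump_backtrace_frame, NULL);
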

> Thanks,
> -Takahiro AKASHI
> ----8<----
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 650cc05..5fbd1ea 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -185,14 +185,12 @@ alternative_endif
> 	mov	x23, sp
> 	and	x23, x23, #~(THREAD_SIZE - 1)
> 	cmp	x20, x23			// check irq re-enterance
> +	mov	x19, sp
> 	beq	1f
> -	str	x29, [x19, #IRQ_FRAME_FP]
> -	str	x21, [x19, #IRQ_FRAME_SP]
> -	str	x22, [x19, #IRQ_FRAME_PC]
> -	mov	x29, x24
> -1:	mov	x19, sp
> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
> -	mov	sp, x23
> +	mov	sp, x24				// x24 = top of irq stack
> +	stp	x29, x22, [sp, #-32]!
> +	mov	x29, sp
> +1:
> 	.endm
> 
> 	/*

Is it possible to decide which stack is used without the aborted SP information?
In addition, I'm curious about the origin of #-32.

Thanks!

Best Regards
Jungseok Lee

^ permalink raw reply related	[flat|nested] 60+ messages in thread


* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-14 12:24         ` Jungseok Lee
@ 2015-10-14 12:55           ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-14 12:55 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 14, 2015, at 9:24 PM, Jungseok Lee wrote:
> On Oct 14, 2015, at 4:13 PM, AKASHI Takahiro wrote:
>> On 10/09/2015 11:24 PM, James Morse wrote:
>>> Hi Jungseok,
>>> 
>>> On 07/10/15 16:28, Jungseok Lee wrote:
>>>> Currently, a call trace drops a process stack walk when a separate IRQ
>>>> stack is used. It makes a call trace information much less useful when
>>>> a system gets paniked in interrupt context.
>>> 
>>> panicked
>>> 
>>>> This patch addresses the issue with the following schemes:
>>>> 
>>>>  - Store aborted stack frame data
>>>>  - Decide whether another stack walk is needed or not via current sp
>>>>  - Loosen the frame pointer upper bound condition
>>> 
>>> It may be worth merging this patch with its predecessor - anyone trying to
>>> bisect a problem could land between these two patches, and spend time
>>> debugging the truncated call traces.
>>> 
>>> 
>>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>>>> index 6ea82e8..e5904a1 100644
>>>> --- a/arch/arm64/include/asm/irq.h
>>>> +++ b/arch/arm64/include/asm/irq.h
>>>> @@ -2,13 +2,25 @@
>>>> #define __ASM_IRQ_H
>>>> 
>>>> #include <linux/irqchip/arm-gic-acpi.h>
>>>> +#include <asm/stacktrace.h>
>>>> 
>>>> #include <asm-generic/irq.h>
>>>> 
>>>> struct irq_stack {
>>>> 	void *stack;
>>>> +	struct stackframe frame;
>>>> };
>>>> 
>>>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
>>> 
>>> Good idea, storing this in the per-cpu data makes it immune to stack
>>> corruption.
>> 
>> Is this the only reason that you have a dummy stack frame in per-cpu data?
>> By placing this frame in an interrupt stack, I think, we will be able to eliminate
>> changes in dump_stace(). and
>> 
>>> 
>>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>>> index 407991b..5124649 100644
>>>> --- a/arch/arm64/kernel/stacktrace.c
>>>> +++ b/arch/arm64/kernel/stacktrace.c
>>>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>>>> 	low  = frame->sp;
>>>> 	high = ALIGN(low, THREAD_SIZE);
>>>> 
>>>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>>>> +	/*
>>>> +	 * A frame pointer would reach an upper bound if a prologue of the
>>>> +	 * first function of call trace looks as follows:
>>>> +	 *
>>>> +	 *	stp     x29, x30, [sp,#-16]!
>>>> +	 *	mov     x29, sp
>>>> +	 *
>>>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
>>> 
>>> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
>>> to be the highest address, which is used first, making it the bottom of the
>>> stack.
>>> 
>>> I would try to use the terms low/est and high/est, in keeping with the
>>> variable names in use here.
>>> 
>>> 
>>>> +	 * of a 16-byte empty space in THREAD_START_SP.
>>>> +	 *
>>>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>>>> +	 * are handled using a separate stack. That is, a call trace can start
>>>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>>>> +	 * to candidates for a stack trace under the restriction, 0x20.
>>>> +	 *
>>>> +	 * The scenario is handled without complexity as 1) considering
>>>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>>>> +	 * content of which is 0, and 2) allowing the case, which changes
>>>> +	 * the value to 0x10 from 0x20.
>>> 
>>> Where has 0x20 come from? The old value was 0x18.
>>> 
>>> My understanding is the highest part of the stack looks like this:
>>> high        [ off-stack ]
>>> high - 0x08 [ left free by THREAD_START_SP ]
>>> high - 0x10 [ left free by THREAD_START_SP ]
>>> high - 0x18 [#1 x30 ]
>>> high - 0x20 [#1 x29 ]
>>> 
>>> So the condition 'fp > high - 0x18' prevents returning either 'left free'
>>> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
>>> allows the first half of that reserved area to be a valid stack frame.
>>> 
>>> This change is breaking perf using incantations [0] and [1]:
>>> 
>>> Before, with just patch 1/2:
>>>                  ---__do_softirq
>>>                     |
>>>                     |--92.95%-- __handle_domain_irq
>>>                     |          __irqentry_text_start
>>>                     |          el1_irq
>>>                     |
>>> 
>>> After, with both patches:
>>>                 ---__do_softirq
>>>                    |
>>>                    |--83.83%-- __handle_domain_irq
>>>                    |          __irqentry_text_start
>>>                    |          el1_irq
>>>                    |          |
>>>                    |          |--99.39%-- 0x400008040d00000c
>>>                    |           --0.61%-- [...]
>>>                    |
>> 
>> This also shows that walk_stackframe() doesn't walk through a process stack.
>> Now I'm trying the following hack on top of Jungseok's patch.
>> (It doesn't traverse from an irq stack to an process stack yet. I need modify
>> unwind_frame().)
> 
> I've got a difference between perf and dump_backtrace() as reviewing perf call
> chain operation. Perf relies on walk_stackframe(), but dump_backtrace() does not.
> That is, a symbol is printed out *before* unwind_frame() call in case of perf.
> By contrast, dump_backtrace() records a symbol *after* unwind_frame(). I think
> perf behavior is correct since frame.pc is retrieved from a valid stack frame.
> 
> So, the following diff is a prerequisite. It looks reasonable to remove dump_mem()
> call since frame.sp is calculated incorrectly now. If accepted, dump_backtrace()
> could utilize walk_stackframe(), which simplifies the code.
> 
> ----8<----
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index f93aae5..e18be43 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
>        set_fs(fs);
> }
> 
> -static void dump_backtrace_entry(unsigned long where, unsigned long stack)
> +static void dump_backtrace_entry(unsigned long where)
> {
> +       /*
> +        * PC has a physical address when MMU is disabled.
> +        */
> +       if (!kernel_text_address(where))
> +               where = (unsigned long)phys_to_virt(where);
> +
>        print_ip_sym(where);
> -       if (in_exception_text(where))
> -               dump_mem("", "Exception stack", stack,
> -                        stack + sizeof(struct pt_regs), false);
> }
> 
> static void dump_instr(const char *lvl, struct pt_regs *regs)
> @@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>        pr_emerg("Call trace:\n");
>        while (1) {
>                unsigned long where = frame.pc;
> +               unsigned long stack;
>                int ret;
> 
> +               dump_backtrace_entry(where);
>                ret = unwind_frame(&frame);
>                if (ret < 0)
>                        break;
> -               dump_backtrace_entry(where, frame.sp);
> +               stack = frame.sp;
> +               if (in_exception_text(where))
> +                       dump_mem("", "Exception stack", stack,
> +                                stack + sizeof(struct pt_regs), false);
>        }
> }
> ----8<----
> 
>> Thanks,
>> -Takahiro AKASHI
>> ----8<----
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index 650cc05..5fbd1ea 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -185,14 +185,12 @@ alternative_endif
>> 	mov	x23, sp
>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>> 	cmp	x20, x23			// check irq re-enterance
>> +	mov	x19, sp
>> 	beq	1f
>> -	str	x29, [x19, #IRQ_FRAME_FP]
>> -	str	x21, [x19, #IRQ_FRAME_SP]
>> -	str	x22, [x19, #IRQ_FRAME_PC]
>> -	mov	x29, x24
>> -1:	mov	x19, sp
>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>> -	mov	sp, x23
>> +	mov	sp, x24				// x24 = top of irq stack
>> +	stp	x29, x22, [sp, #-32]!
>> +	mov	x29, sp
>> +1:
>> 	.endm
>> 
>> 	/*
> 
> Is it possible to decide which stack is used without aborted SP information?

We could know which stack is used via current SP, but how could we decide
a variable 'low' in unwind_frame() when walking a process stack?

Sorry for confusion.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread


* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-14 12:55           ` Jungseok Lee
@ 2015-10-15  4:19             ` AKASHI Takahiro
  -1 siblings, 0 replies; 60+ messages in thread
From: AKASHI Takahiro @ 2015-10-15  4:19 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

Jungseok,

On 10/14/2015 09:55 PM, Jungseok Lee wrote:
> On Oct 14, 2015, at 9:24 PM, Jungseok Lee wrote:
>> On Oct 14, 2015, at 4:13 PM, AKASHI Takahiro wrote:
>>> On 10/09/2015 11:24 PM, James Morse wrote:
>>>> Hi Jungseok,
>>>>
>>>> On 07/10/15 16:28, Jungseok Lee wrote:
>>>>> Currently, a call trace drops a process stack walk when a separate IRQ
>>>>> stack is used. It makes a call trace information much less useful when
>>>>> a system gets paniked in interrupt context.
>>>>
>>>> panicked
>>>>
>>>>> This patch addresses the issue with the following schemes:
>>>>>
>>>>>   - Store aborted stack frame data
>>>>>   - Decide whether another stack walk is needed or not via current sp
>>>>>   - Loosen the frame pointer upper bound condition
>>>>
>>>> It may be worth merging this patch with its predecessor - anyone trying to
>>>> bisect a problem could land between these two patches, and spend time
>>>> debugging the truncated call traces.
>>>>
>>>>
>>>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>>>>> index 6ea82e8..e5904a1 100644
>>>>> --- a/arch/arm64/include/asm/irq.h
>>>>> +++ b/arch/arm64/include/asm/irq.h
>>>>> @@ -2,13 +2,25 @@
>>>>> #define __ASM_IRQ_H
>>>>>
>>>>> #include <linux/irqchip/arm-gic-acpi.h>
>>>>> +#include <asm/stacktrace.h>
>>>>>
>>>>> #include <asm-generic/irq.h>
>>>>>
>>>>> struct irq_stack {
>>>>> 	void *stack;
>>>>> +	struct stackframe frame;
>>>>> };
>>>>>
>>>>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
>>>>
>>>> Good idea, storing this in the per-cpu data makes it immune to stack
>>>> corruption.
>>>
>>> Is this the only reason that you have a dummy stack frame in per-cpu data?
>>> By placing this frame in an interrupt stack, I think, we will be able to eliminate
>>> changes in dump_stace(). and
>>>
>>>>
>>>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>>>> index 407991b..5124649 100644
>>>>> --- a/arch/arm64/kernel/stacktrace.c
>>>>> +++ b/arch/arm64/kernel/stacktrace.c
>>>>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>>>>> 	low  = frame->sp;
>>>>> 	high = ALIGN(low, THREAD_SIZE);
>>>>>
>>>>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>>>>> +	/*
>>>>> +	 * A frame pointer would reach an upper bound if a prologue of the
>>>>> +	 * first function of call trace looks as follows:
>>>>> +	 *
>>>>> +	 *	stp     x29, x30, [sp,#-16]!
>>>>> +	 *	mov     x29, sp
>>>>> +	 *
>>>>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
>>>>
>>>> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
>>>> to be the highest address, which is used first, making it the bottom of the
>>>> stack.
>>>>
>>>> I would try to use the terms low/est and high/est, in keeping with the
>>>> variable names in use here.
>>>>
>>>>
>>>>> +	 * of a 16-byte empty space in THREAD_START_SP.
>>>>> +	 *
>>>>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>>>>> +	 * are handled using a separate stack. That is, a call trace can start
>>>>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>>>>> +	 * to candidates for a stack trace under the restriction, 0x20.
>>>>> +	 *
>>>>> +	 * The scenario is handled without complexity as 1) considering
>>>>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>>>>> +	 * content of which is 0, and 2) allowing the case, which changes
>>>>> +	 * the value to 0x10 from 0x20.
>>>>
>>>> Where has 0x20 come from? The old value was 0x18.
>>>>
>>>> My understanding is the highest part of the stack looks like this:
>>>> high        [ off-stack ]
>>>> high - 0x08 [ left free by THREAD_START_SP ]
>>>> high - 0x10 [ left free by THREAD_START_SP ]
>>>> high - 0x18 [#1 x30 ]
>>>> high - 0x20 [#1 x29 ]
>>>>
>>>> So the condition 'fp > high - 0x18' prevents returning either 'left free'
>>>> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
>>>> allows the first half of that reserved area to be a valid stack frame.
>>>>
>>>> This change is breaking perf using incantations [0] and [1]:
>>>>
>>>> Before, with just patch 1/2:
>>>>                   ---__do_softirq
>>>>                      |
>>>>                      |--92.95%-- __handle_domain_irq
>>>>                      |          __irqentry_text_start
>>>>                      |          el1_irq
>>>>                      |
>>>>
>>>> After, with both patches:
>>>>                  ---__do_softirq
>>>>                     |
>>>>                     |--83.83%-- __handle_domain_irq
>>>>                     |          __irqentry_text_start
>>>>                     |          el1_irq
>>>>                     |          |
>>>>                     |          |--99.39%-- 0x400008040d00000c
>>>>                     |           --0.61%-- [...]
>>>>                     |
>>>
>>> This also shows that walk_stackframe() doesn't walk through a process stack.
>>> Now I'm trying the following hack on top of Jungseok's patch.
>>> (It doesn't traverse from an irq stack to an process stack yet. I need modify
>>> unwind_frame().)
>>
>> I've got a difference between perf and dump_backtrace() as reviewing perf call
>> chain operation. Perf relies on walk_stackframe(), but dump_backtrace() does not.
>> That is, a symbol is printed out *before* unwind_frame() call in case of perf.
>> By contrast, dump_backtrace() records a symbol *after* unwind_frame(). I think
>> perf behavior is correct since frame.pc is retrieved from a valid stack frame.
>>
>> So, the following diff is a prerequisite. It looks reasonable to remove dump_mem()
>> call since frame.sp is calculated incorrectly now. If accepted, dump_backtrace()
>> could utilize walk_stackframe(), which simplifies the code.
>>
>> ----8<----
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index f93aae5..e18be43 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
>>         set_fs(fs);
>> }
>>
>> -static void dump_backtrace_entry(unsigned long where, unsigned long stack)
>> +static void dump_backtrace_entry(unsigned long where)
>> {
>> +       /*
>> +        * PC has a physical address when MMU is disabled.
>> +        */
>> +       if (!kernel_text_address(where))
>> +               where = (unsigned long)phys_to_virt(where);
>> +
>>         print_ip_sym(where);
>> -       if (in_exception_text(where))
>> -               dump_mem("", "Exception stack", stack,
>> -                        stack + sizeof(struct pt_regs), false);
>> }
>>
>> static void dump_instr(const char *lvl, struct pt_regs *regs)
>> @@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>         pr_emerg("Call trace:\n");
>>         while (1) {
>>                 unsigned long where = frame.pc;
>> +               unsigned long stack;
>>                 int ret;
>>
>> +               dump_backtrace_entry(where);
>>                 ret = unwind_frame(&frame);
>>                 if (ret < 0)
>>                         break;
>> -               dump_backtrace_entry(where, frame.sp);
>> +               stack = frame.sp;
>> +               if (in_exception_text(where))
>> +                       dump_mem("", "Exception stack", stack,
>> +                                stack + sizeof(struct pt_regs), false);
>>         }
>> }
>> ----8<----
>>
>>> Thanks,
>>> -Takahiro AKASHI
>>> ----8<----
>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>> index 650cc05..5fbd1ea 100644
>>> --- a/arch/arm64/kernel/entry.S
>>> +++ b/arch/arm64/kernel/entry.S
>>> @@ -185,14 +185,12 @@ alternative_endif
>>> 	mov	x23, sp
>>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>>> 	cmp	x20, x23			// check irq re-enterance
>>> +	mov	x19, sp
>>> 	beq	1f
>>> -	str	x29, [x19, #IRQ_FRAME_FP]
>>> -	str	x21, [x19, #IRQ_FRAME_SP]
>>> -	str	x22, [x19, #IRQ_FRAME_PC]
>>> -	mov	x29, x24
>>> -1:	mov	x19, sp
>>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>>> -	mov	sp, x23
>>> +	mov	sp, x24				// x24 = top of irq stack
>>> +	stp	x29, x22, [sp, #-32]!
>>> +	mov	x29, sp
>>> +1:
>>> 	.endm
>>>
>>> 	/*
>>
>> Is it possible to decide which stack is used without aborted SP information?
>
> We could know which stack is used via current SP, but how could we decide
> a variable 'low' in unwind_frame() when walking a process stack?

The following patch, replacing your [PATCH 2/2], seems to work nicely,
traversing from the interrupt stack to the process stack. I tried James'
method as well as "echo c > /proc/sysrq-trigger".

The only issue I have now is that dump_backtrace() does not show the
correct "pt_regs" data on the process stack (it actually dumps the
interrupt stack):

CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D         4.3.0-rc5+ #24
Hardware name: ARM Arm Versatile Express/Arm Versatile Express, BIOS 11:37:19 Jul 16 2015
Call trace:
[<ffffffc00008a7b0>] dump_backtrace+0x0/0x19c
[<ffffffc00008a968>] show_stack+0x1c/0x28
[<ffffffc0003936d0>] dump_stack+0x88/0xc8
[<ffffffc00008fdf8>] handle_IPI+0x258/0x268
[<ffffffc000082530>] gic_handle_irq+0x88/0xa4
Exception stack(0xffffffc87b1bffa0 to 0xffffffc87b1c00c0) <== HERE
ffa0: ffffffc87b18fe30 ffffffc87b1bc000 ffffffc87b18ff50 ffffffc000086ac8
ffc0: ffffffc87b18c000 afafafafafafafaf ffffffc87b18ff50 ffffffc000086ac8
ffe0: ffffffc87b18ff50 ffffffc87b18ff50 afafafafafafafaf afafafafafafafaf
0000: 0000000000000000 ffffffffffffffff ffffffc87b195c00 0000000200000002
0020: 0000000057ac6e9d afafafafafafafaf afafafafafafafaf afafafafafafafaf
0040: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
0060: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
0080: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
00a0: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
[<ffffffc0000855e0>] el1_irq+0xa0/0x114
[<ffffffc000086ac4>] arch_cpu_idle+0x14/0x20
[<ffffffc0000fc110>] default_idle_call+0x1c/0x34
[<ffffffc0000fc464>] cpu_startup_entry+0x2cc/0x30c
[<ffffffc00008f7c4>] secondary_start_kernel+0x120/0x148
[<ffffffc0000827a8>] secondary_startup+0x8/0x20

Thanks,
-Takahiro AKASHI

----8<----
 From 1aa8d4e533d44099f69ff761acfa3c1045a00796 Mon Sep 17 00:00:00 2001
From: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date: Thu, 15 Oct 2015 09:04:10 +0900
Subject: [PATCH] arm64: revamp unwind_frame for interrupt stack

This patch allows unwind_frame() to traverse correctly from the interrupt
stack to the process stack by having irq_handler create a dummy stack frame
in its prologue.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
---
  arch/arm64/kernel/entry.S      |   22 ++++++++++++++++++++--
  arch/arm64/kernel/stacktrace.c |   14 +++++++++++++-
  2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d4e8c5..25cabd9 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -185,8 +185,26 @@ alternative_endif
  	and	x23, x23, #~(THREAD_SIZE - 1)
  	cmp	x20, x23			// check irq re-enterance
  	mov	x19, sp
-	csel	x23, x19, x24, eq		// x24 = top of irq stack
-	mov	sp, x23
+	beq	1f
+	mov	sp, x24				// x24 = top of irq stack
+	stp	x29, x21, [sp, #-16]!		// for sanity check
+	stp	x29, x22, [sp, #-16]!		// dummy stack frame
+	mov	x29, sp
+1:
+	/*
+	 * Layout of interrupt stack after this macro is invoked:
+	 *
+	 *     |                |
+	 *-0x20+----------------+ <= dummy stack frame
+	 *     |      fp        |    : fp on process stack
+	 *-0x18+----------------+
+	 *     |      lr        |    : return address
+	 *-0x10+----------------+
+	 *     |    fp (copy)   |    : for sanity check
+	 * -0x8+----------------+
+	 *     |      sp        |    : sp on process stack
+	 *  0x0+----------------+
+	 */
  	.endm

  	/*
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 407991b..03611a1 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -43,12 +43,24 @@ int notrace unwind_frame(struct stackframe *frame)
  	low  = frame->sp;
  	high = ALIGN(low, THREAD_SIZE);

-	if (fp < low || fp > high - 0x18 || fp & 0xf)
+	if (fp < low || fp > high - 0x20 || fp & 0xf)
  		return -EINVAL;

  	frame->sp = fp + 0x10;
  	frame->fp = *(unsigned long *)(fp);
  	/*
+	 * Check whether we are about to walk through from the interrupt
+	 * stack to the process stack.
+	 * If the previous frame is the initial (dummy) stack frame on the
+	 * interrupt stack, frame->sp now points to just below that frame
+	 * (dummy frame + 0x10).
+	 * See entry.S.
+	 */
+#define STACK_LOW(addr) round_down((addr), THREAD_SIZE)
+	if ((STACK_LOW(frame->sp) != STACK_LOW(frame->fp)) &&
+			(frame->fp == *(unsigned long *)frame->sp))
+		frame->sp = *((unsigned long *)(frame->sp + 8));
+	/*
  	 * -4 here because we care about the PC at time of bl,
  	 * not where the return will go.
  	 */
-- 
1.7.9.5
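For clarity, here is a rough, illustration-only C rendering of the resulting
unwind step with the new hop written out; it is not part of the patch, and
unwind_frame_sketch() is just a made-up name. The offsets follow the
dummy-frame layout documented in entry.S above.

#define STACK_LOW(addr)	round_down((addr), THREAD_SIZE)

static int unwind_frame_sketch(struct stackframe *frame)
{
	unsigned long fp = frame->fp;
	unsigned long low  = frame->sp;
	unsigned long high = ALIGN(low, THREAD_SIZE);

	if (fp < low || fp > high - 0x20 || fp & 0xf)
		return -EINVAL;

	/* A frame record is the pair [x29, x30] stored at fp. */
	frame->sp = fp + 0x10;
	frame->fp = *(unsigned long *)(fp);

	/*
	 * Does the next frame pointer live on a different stack from the
	 * current sp, and does the word at sp match it (the fp copy that
	 * irq_handler stored above its dummy frame)? If so, this was the
	 * dummy frame: resume with the process-stack sp saved next to it.
	 */
	if (STACK_LOW(frame->sp) != STACK_LOW(frame->fp) &&
	    frame->fp == *(unsigned long *)frame->sp)
		frame->sp = *(unsigned long *)(frame->sp + 8);

	/* -4: we want the PC of the bl, not the return address. */
	frame->pc = *(unsigned long *)(fp + 8) - 4;

	return 0;
}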



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-15  4:19             ` AKASHI Takahiro
@ 2015-10-15 13:39               ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-15 13:39 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 15, 2015, at 1:19 PM, AKASHI Takahiro wrote:
> Jungseok,

Hi Akashi,

> On 10/14/2015 09:55 PM, Jungseok Lee wrote:
>> On Oct 14, 2015, at 9:24 PM, Jungseok Lee wrote:
>>> On Oct 14, 2015, at 4:13 PM, AKASHI Takahiro wrote:
>>>> On 10/09/2015 11:24 PM, James Morse wrote:
>>>>> Hi Jungseok,
>>>>> 
>>>>> On 07/10/15 16:28, Jungseok Lee wrote:
>>>>>> Currently, a call trace drops a process stack walk when a separate IRQ
>>>>>> stack is used. It makes a call trace information much less useful when
>>>>>> a system gets paniked in interrupt context.
>>>>> 
>>>>> panicked
>>>>> 
>>>>>> This patch addresses the issue with the following schemes:
>>>>>> 
>>>>>>  - Store aborted stack frame data
>>>>>>  - Decide whether another stack walk is needed or not via current sp
>>>>>>  - Loosen the frame pointer upper bound condition
>>>>> 
>>>>> It may be worth merging this patch with its predecessor - anyone trying to
>>>>> bisect a problem could land between these two patches, and spend time
>>>>> debugging the truncated call traces.
>>>>> 
>>>>> 
>>>>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>>>>>> index 6ea82e8..e5904a1 100644
>>>>>> --- a/arch/arm64/include/asm/irq.h
>>>>>> +++ b/arch/arm64/include/asm/irq.h
>>>>>> @@ -2,13 +2,25 @@
>>>>>> #define __ASM_IRQ_H
>>>>>> 
>>>>>> #include <linux/irqchip/arm-gic-acpi.h>
>>>>>> +#include <asm/stacktrace.h>
>>>>>> 
>>>>>> #include <asm-generic/irq.h>
>>>>>> 
>>>>>> struct irq_stack {
>>>>>> 	void *stack;
>>>>>> +	struct stackframe frame;
>>>>>> };
>>>>>> 
>>>>>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
>>>>> 
>>>>> Good idea, storing this in the per-cpu data makes it immune to stack
>>>>> corruption.
>>>> 
>>>> Is this the only reason that you have a dummy stack frame in per-cpu data?
>>>> By placing this frame in an interrupt stack, I think, we will be able to eliminate
>>>> changes in dump_stace(). and
>>>> 
>>>>> 
>>>>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>>>>> index 407991b..5124649 100644
>>>>>> --- a/arch/arm64/kernel/stacktrace.c
>>>>>> +++ b/arch/arm64/kernel/stacktrace.c
>>>>>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>>>>>> 	low  = frame->sp;
>>>>>> 	high = ALIGN(low, THREAD_SIZE);
>>>>>> 
>>>>>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>>>>>> +	/*
>>>>>> +	 * A frame pointer would reach an upper bound if a prologue of the
>>>>>> +	 * first function of call trace looks as follows:
>>>>>> +	 *
>>>>>> +	 *	stp     x29, x30, [sp,#-16]!
>>>>>> +	 *	mov     x29, sp
>>>>>> +	 *
>>>>>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
>>>>> 
>>>>> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
>>>>> to be the highest address, which is used first, making it the bottom of the
>>>>> stack.
>>>>> 
>>>>> I would try to use the terms low/est and high/est, in keeping with the
>>>>> variable names in use here.
>>>>> 
>>>>> 
>>>>>> +	 * of a 16-byte empty space in THREAD_START_SP.
>>>>>> +	 *
>>>>>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>>>>>> +	 * are handled using a separate stack. That is, a call trace can start
>>>>>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>>>>>> +	 * to candidates for a stack trace under the restriction, 0x20.
>>>>>> +	 *
>>>>>> +	 * The scenario is handled without complexity as 1) considering
>>>>>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>>>>>> +	 * content of which is 0, and 2) allowing the case, which changes
>>>>>> +	 * the value to 0x10 from 0x20.
>>>>> 
>>>>> Where has 0x20 come from? The old value was 0x18.
>>>>> 
>>>>> My understanding is the highest part of the stack looks like this:
>>>>> high        [ off-stack ]
>>>>> high - 0x08 [ left free by THREAD_START_SP ]
>>>>> high - 0x10 [ left free by THREAD_START_SP ]
>>>>> high - 0x18 [#1 x30 ]
>>>>> high - 0x20 [#1 x29 ]
>>>>> 
>>>>> So the condition 'fp > high - 0x18' prevents returning either 'left free'
>>>>> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
>>>>> allows the first half of that reserved area to be a valid stack frame.
>>>>> 
>>>>> This change is breaking perf using incantations [0] and [1]:
>>>>> 
>>>>> Before, with just patch 1/2:
>>>>>                  ---__do_softirq
>>>>>                     |
>>>>>                     |--92.95%-- __handle_domain_irq
>>>>>                     |          __irqentry_text_start
>>>>>                     |          el1_irq
>>>>>                     |
>>>>> 
>>>>> After, with both patches:
>>>>>                 ---__do_softirq
>>>>>                    |
>>>>>                    |--83.83%-- __handle_domain_irq
>>>>>                    |          __irqentry_text_start
>>>>>                    |          el1_irq
>>>>>                    |          |
>>>>>                    |          |--99.39%-- 0x400008040d00000c
>>>>>                    |           --0.61%-- [...]
>>>>>                    |
>>>> 
>>>> This also shows that walk_stackframe() doesn't walk through a process stack.
>>>> Now I'm trying the following hack on top of Jungseok's patch.
>>>> (It doesn't traverse from an irq stack to an process stack yet. I need modify
>>>> unwind_frame().)
>>> 
>>> I've got a difference between perf and dump_backtrace() as reviewing perf call
>>> chain operation. Perf relies on walk_stackframe(), but dump_backtrace() does not.
>>> That is, a symbol is printed out *before* unwind_frame() call in case of perf.
>>> By contrast, dump_backtrace() records a symbol *after* unwind_frame(). I think
>>> perf behavior is correct since frame.pc is retrieved from a valid stack frame.
>>> 
>>> So, the following diff is a prerequisite. It looks reasonable to remove dump_mem()
>>> call since frame.sp is calculated incorrectly now. If accepted, dump_backtrace()
>>> could utilize walk_stackframe(), which simplifies the code.
>>> 
>>> ----8<----
>>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>>> index f93aae5..e18be43 100644
>>> --- a/arch/arm64/kernel/traps.c
>>> +++ b/arch/arm64/kernel/traps.c
>>> @@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
>>>        set_fs(fs);
>>> }
>>> 
>>> -static void dump_backtrace_entry(unsigned long where, unsigned long stack)
>>> +static void dump_backtrace_entry(unsigned long where)
>>> {
>>> +       /*
>>> +        * PC has a physical address when MMU is disabled.
>>> +        */
>>> +       if (!kernel_text_address(where))
>>> +               where = (unsigned long)phys_to_virt(where);
>>> +
>>>        print_ip_sym(where);
>>> -       if (in_exception_text(where))
>>> -               dump_mem("", "Exception stack", stack,
>>> -                        stack + sizeof(struct pt_regs), false);
>>> }
>>> 
>>> static void dump_instr(const char *lvl, struct pt_regs *regs)
>>> @@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>>        pr_emerg("Call trace:\n");
>>>        while (1) {
>>>                unsigned long where = frame.pc;
>>> +               unsigned long stack;
>>>                int ret;
>>> 
>>> +               dump_backtrace_entry(where);
>>>                ret = unwind_frame(&frame);
>>>                if (ret < 0)
>>>                        break;
>>> -               dump_backtrace_entry(where, frame.sp);
>>> +               stack = frame.sp;
>>> +               if (in_exception_text(where))
>>> +                       dump_mem("", "Exception stack", stack,
>>> +                                stack + sizeof(struct pt_regs), false);
>>>        }
>>> }
>>> ----8<----
>>> 
>>>> Thanks,
>>>> -Takahiro AKASHI
>>>> ----8<----
>>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>>> index 650cc05..5fbd1ea 100644
>>>> --- a/arch/arm64/kernel/entry.S
>>>> +++ b/arch/arm64/kernel/entry.S
>>>> @@ -185,14 +185,12 @@ alternative_endif
>>>> 	mov	x23, sp
>>>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>>>> 	cmp	x20, x23			// check irq re-enterance
>>>> +	mov	x19, sp
>>>> 	beq	1f
>>>> -	str	x29, [x19, #IRQ_FRAME_FP]
>>>> -	str	x21, [x19, #IRQ_FRAME_SP]
>>>> -	str	x22, [x19, #IRQ_FRAME_PC]
>>>> -	mov	x29, x24
>>>> -1:	mov	x19, sp
>>>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>>>> -	mov	sp, x23
>>>> +	mov	sp, x24				// x24 = top of irq stack
>>>> +	stp	x29, x22, [sp, #-32]!
>>>> +	mov	x29, sp
>>>> +1:
>>>> 	.endm
>>>> 
>>>> 	/*
>>> 
>>> Is it possible to decide which stack is used without aborted SP information?
>> 
>> We could know which stack is used via current SP, but how could we decide
>> a variable 'low' in unwind_frame() when walking a process stack?
> 
> The following patch, replacing your [PATCH 2/2], seems to work nicely,
> traversing from interrupt stack to process stack. I tried James' method as well
> as "echo c > /proc/sysrq-trigger."

Great thanks!

Since I'm in favor of your approach, I've played with this patch instead of mine.
A kernel panic is observed when using 'perf record' with the -g option and sysrq.
I guess some other changes are in your tree.

Please refer to my analysis.

> The only issue that I have now is that dump_backtrace() does not show
> correct "pt_regs" data on process stack (actually it dumps interrupt stack):
> 
> CPU1: stopping
> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D         4.3.0-rc5+ #24
> Hardware name: ARM Arm Versatile Express/Arm Versatile Express, BIOS 11:37:19 Jul 16 2015
> Call trace:
> [<ffffffc00008a7b0>] dump_backtrace+0x0/0x19c
> [<ffffffc00008a968>] show_stack+0x1c/0x28
> [<ffffffc0003936d0>] dump_stack+0x88/0xc8
> [<ffffffc00008fdf8>] handle_IPI+0x258/0x268
> [<ffffffc000082530>] gic_handle_irq+0x88/0xa4
> Exception stack(0xffffffc87b1bffa0 to 0xffffffc87b1c00c0) <== HERE
> ffa0: ffffffc87b18fe30 ffffffc87b1bc000 ffffffc87b18ff50 ffffffc000086ac8
> ffc0: ffffffc87b18c000 afafafafafafafaf ffffffc87b18ff50 ffffffc000086ac8
> ffe0: ffffffc87b18ff50 ffffffc87b18ff50 afafafafafafafaf afafafafafafafaf
> 0000: 0000000000000000 ffffffffffffffff ffffffc87b195c00 0000000200000002
> 0020: 0000000057ac6e9d afafafafafafafaf afafafafafafafaf afafafafafafafaf
> 0040: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
> 0060: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
> 0080: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
> 00a0: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
> [<ffffffc0000855e0>] el1_irq+0xa0/0x114
> [<ffffffc000086ac4>] arch_cpu_idle+0x14/0x20
> [<ffffffc0000fc110>] default_idle_call+0x1c/0x34
> [<ffffffc0000fc464>] cpu_startup_entry+0x2cc/0x30c
> [<ffffffc00008f7c4>] secondary_start_kernel+0x120/0x148
> [<ffffffc0000827a8>] secondary_startup+0x8/0x20

My 'dump_backtrace() rework' patch is in your working tree. Right?

> 
> Thanks,
> -Takahiro AKASHI
> 
> ----8<----
> From 1aa8d4e533d44099f69ff761acfa3c1045a00796 Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Date: Thu, 15 Oct 2015 09:04:10 +0900
> Subject: [PATCH] arm64: revamp unwind_frame for interrupt stack
> 
> This patch allows unwind_frame() to traverse from interrupt stack
> to process stack correctly by having a dummy stack frame for irq_handler
> created at its prologue.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
> arch/arm64/kernel/entry.S      |   22 ++++++++++++++++++++--
> arch/arm64/kernel/stacktrace.c |   14 +++++++++++++-
> 2 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 6d4e8c5..25cabd9 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -185,8 +185,26 @@ alternative_endif
> 	and	x23, x23, #~(THREAD_SIZE - 1)
> 	cmp	x20, x23			// check irq re-enterance
> 	mov	x19, sp
> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
> -	mov	sp, x23
> +	beq	1f
> +	mov	sp, x24				// x24 = top of irq stack
> +	stp	x29, x21, [sp, #-16]!		// for sanity check
> +	stp	x29, x22, [sp, #-16]!		// dummy stack frame
> +	mov	x29, sp
> +1:
> +	/*
> +	 * Layout of interrupt stack after this macro is invoked:
> +	 *
> +	 *     |                |
> +	 *-0x20+----------------+ <= dummy stack frame
> +	 *     |      fp        |    : fp on process stack
> +	 *-0x18+----------------+
> +	 *     |      lr        |    : return address
> +	 *-0x10+----------------+
> +	 *     |    fp (copy)   |    : for sanity check
> +	 * -0x8+----------------+
> +	 *     |      sp        |    : sp on process stack
> +	 *  0x0+----------------+
> +	 */
> 	.endm
> 
> 	/*
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991b..03611a1 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -43,12 +43,24 @@ int notrace unwind_frame(struct stackframe *frame)
> 	low  = frame->sp;
> 	high = ALIGN(low, THREAD_SIZE);
> 
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	if (fp < low || fp > high - 0x20 || fp & 0xf)
> 		return -EINVAL;

IMO, this condition should be changed as follows.

	if (fp < low || fp > high - 0x20 || fp & 0xf || !fp)

Please refer to the below for details.

> 
> 	frame->sp = fp + 0x10;
> 	frame->fp = *(unsigned long *)(fp);
> 	/*
> +	 * Check whether we are about to walk through from the interrupt
> +	 * stack to the process stack.
> +	 * If the previous frame is the initial (dummy) stack frame on the
> +	 * interrupt stack, frame->sp now points to just below that frame
> +	 * (dummy frame + 0x10).
> +	 * See entry.S.
> +	 */
> +#define STACK_LOW(addr) round_down((addr), THREAD_SIZE)
> +	if ((STACK_LOW(frame->sp) != STACK_LOW(frame->fp)) &&
> +			(frame->fp == *(unsigned long *)frame->sp))
> +		frame->sp = *((unsigned long *)(frame->sp + 8));

The original intention seems to be to catch a stack change from the IRQ stack to
the process one. Unfortunately, this condition also hits when the last stack frame
of swapper is retrieved. This leads to a NULL pointer access because of the
following code snippet.

ENTRY(__secondary_switched)
        ldr     x0, [x21]                       // get secondary_data.stack
        mov     sp, x0
        mov     x29, #0
        b       secondary_start_kernel
ENDPROC(__secondary_switched)

This is why x29 should be checked.
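As a minimal sketch of the combined check (illustration only; fp_is_valid() is
a made-up name, not something proposed for the patch):

static bool fp_is_valid(unsigned long fp, unsigned long low, unsigned long high)
{
	/* Reject the terminal frame planted by __secondary_switched (x29 == 0). */
	if (!fp)
		return false;

	/* Existing range and alignment checks, unchanged. */
	return fp >= low && fp <= high - 0x20 && !(fp & 0xf);
}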

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-09 14:24     ` James Morse
@ 2015-10-15 14:24       ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-15 14:24 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 9, 2015, at 11:24 PM, James Morse wrote:

Hi James,

[ ... ]

> I think unwind_frame() needs to walk the irq stack too. [2] is an example
> of perf tracing back to userspace, (and there are patches on the list to
> do/fix this), so we need to walk back to the start of the first stack for
> the perf accounting to be correct.

I plan to re-spin this series without [PATCH 2/2], since 1) Akashi's
approach looks better than mine, and 2) you have the perf patches for [2].
This would help us move forward.

Thoughts?

[ ... ]

> [0] sudo ./perf record -e mem:<address of __do_softirq()>:x -ag -- sleep 10
> [1] sudo ./perf report --call-graph --stdio
> [2] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-14 12:12                 ` Jungseok Lee
@ 2015-10-15 15:59                   ` James Morse
  -1 siblings, 0 replies; 60+ messages in thread
From: James Morse @ 2015-10-15 15:59 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On 14/10/15 13:12, Jungseok Lee wrote:
> On Oct 14, 2015, at 12:00 AM, Jungseok Lee wrote:
>> On Oct 13, 2015, at 8:00 PM, James Morse wrote:
>>> On 12/10/15 23:13, Jungseok Lee wrote:
>>>> On Oct 13, 2015, at 1:34 AM, James Morse wrote:
>>>>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
>>>>> (especially for systems with few cpus)…
>>>>
>>>> This would be a single concern. To address this issue, I drop the 'static'
>>>> keyword in thread_info_cache. Please refer to the below hunk.
>>>
>>> Its only a problem on systems with 64K pages, which don't have a multiple
>>> of 4 cpus. I suspect if you turn on 64K pages, you have many cores with
>>> plenty of memory…
>>
>> Yes, the problem 'two kmem_caches' comes from only 64K page system.
>>
>> I don't get the statement 'which don't have a multiple of 4 cpus'.
>> Could you point out what I am missing?
> 
> You're talking about sl{a|u}b allocator behavior. If so, I got what you meant.

Yes,
With Nx4 cpus, the (currently) 16K irq stacks take up Nx64K - a nice
multiple of pages, so no wasted memory.
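
A rough back-of-the-envelope illustration (this models the cache as simple
packing of 16K objects into 64K pages, which is an assumption, not slab
internals):

#include <stdio.h>

int main(void)
{
	const unsigned long page  = 64 * 1024;	/* 64K pages */
	const unsigned long stack = 16 * 1024;	/* 16K IRQ stacks */

	for (unsigned int cpus = 1; cpus <= 8; cpus++) {
		unsigned long per_page = page / stack;			/* 4 stacks per page */
		unsigned long pages    = (cpus + per_page - 1) / per_page;
		unsigned long unused   = pages * page - cpus * stack;

		printf("%u cpus: %lu page(s), %lu KB unused\n",
		       cpus, pages, unused / 1024);
	}
	return 0;
}

With a multiple of 4 CPUs the unused column is zero; anything else leaves part
of the last 64K page idle.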


>>> If this has been made a published symbol, it should go in a header file.
>>
>> Sure.
> 
> I had the wrong impression that there is a room under include/linux/*.

Yes, I see there isn't anywhere obvious to put it...


> IMO, this is architectural option whether arch relies on thread_info_cache or not.
> In other words, it would be clear to put this extern under arch/*/include/asm/*.

It's up to the arch whether or not to define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR. In the case where it hasn't defined it,
and THREAD_SIZE >= PAGE_SIZE, your change exposes thread_info_cache on
all architectures, so it ought to go in a header file accessible to all
architectures.

Something like this, (only build-tested!):
=========%<=========
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -10,6 +10,8 @@
 #include <linux/types.h>
 #include <linux/bug.h>

+#include <asm/page.h>
+
 struct timespec;
 struct compat_timespec;

@@ -145,6 +147,12 @@ static inline bool test_and_clear_restore_sigmask(void)
 #error "no set_restore_sigmask() provided and default one won't work"
 #endif

+#ifndef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
+#if THREAD_SIZE >= PAGE_SIZE
+extern struct kmem_cache *thread_info_cache;
+#endif /* THREAD_SIZE >= PAGE_SIZE */
+#endif /* CONFIG_ARCH_THREAD_INFO_ALLOCATOR */
+
 #endif /* __KERNEL__ */

 #endif /* _LINUX_THREAD_INFO_H */
=========%<=========
Quite ugly!

My concern is that there could be push-back from the maintainer of
kernel/fork.c, saying "define CONFIG_ARCH_THREAD_INFO_ALLOCATOR if the
generic code isn't what you need", and push-back from the arm64 maintainers
about copy-pasting that chunk into arch/arm64. Both objections would be fair,
hence my initial version created a second kmem_cache.
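
For comparison, a very rough sketch of the CONFIG_ARCH_THREAD_INFO_ALLOCATOR
route (the hook and init names are my assumption of what the generic code
expects, and the cache setup is illustrative only, not a proposal):

static struct kmem_cache *arm64_thread_info_cache;

struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node)
{
	return kmem_cache_alloc_node(arm64_thread_info_cache, GFP_KERNEL, node);
}

void free_thread_info(struct thread_info *ti)
{
	kmem_cache_free(arm64_thread_info_cache, ti);
}

void __init thread_info_cache_init(void)
{
	arm64_thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
						    THREAD_SIZE, 0, NULL);
	BUG_ON(!arm64_thread_info_cache);
}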


Thanks,

James


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-15 14:24       ` Jungseok Lee
@ 2015-10-15 16:01         ` James Morse
  -1 siblings, 0 replies; 60+ messages in thread
From: James Morse @ 2015-10-15 16:01 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On 15/10/15 15:24, Jungseok Lee wrote:
> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>> of perf tracing back to userspace, (and there are patches on the list to
>> do/fix this), so we need to walk back to the start of the first stack for
>> the perf accounting to be correct.
> 
> I plan to do re-spin this series without [PATCH 2/2] since 1) Akashi's
> approach looks better than mine and 2) you have the perf patches for [2].

They aren't my patches - the ones I saw on the list were for arm:
https://lkml.org/lkml/2015/10/1/769 - it's evidently something perf
supports, so we shouldn't make it worse...


Thanks!

James

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-15 15:59                   ` James Morse
@ 2015-10-16 13:01                     ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-16 13:01 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 16, 2015, at 12:59 AM, James Morse wrote:

Hi James,

> On 14/10/15 13:12, Jungseok Lee wrote:
>> On Oct 14, 2015, at 12:00 AM, Jungseok Lee wrote:
>>> On Oct 13, 2015, at 8:00 PM, James Morse wrote:
>>>> On 12/10/15 23:13, Jungseok Lee wrote:
>>>>> On Oct 13, 2015, at 1:34 AM, James Morse wrote:
>>>>>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful
>>>>>> (especially for systems with few cpus)…
>>>>> 
>>>>> This would be a single concern. To address this issue, I drop the 'static'
>>>>> keyword in thread_info_cache. Please refer to the below hunk.
>>>> 
>>>> Its only a problem on systems with 64K pages, which don't have a multiple
>>>> of 4 cpus. I suspect if you turn on 64K pages, you have many cores with
>>>> plenty of memory…
>>> 
>>> Yes, the problem 'two kmem_caches' comes from only 64K page system.
>>> 
>>> I don't get the statement 'which don't have a multiple of 4 cpus'.
>>> Could you point out what I am missing?
>> 
>> You're talking about sl{a|u}b allocator behavior. If so, I got what you meant.
> 
> Yes,
> With Nx4 cpus, the (currently) 16K irq stacks take up Nx64K - a nice
> multiple of pages, so no wasted memory.
> 
> 
>>>> If this has been made a published symbol, it should go in a header file.
>>> 
>>> Sure.
>> 
>> I had the wrong impression that there is a room under include/linux/*.
> 
> Yes, I see there isn't anywhere obvious to put it...
> 
> 
>> IMO, this is architectural option whether arch relies on thread_info_cache or not.
>> In other words, it would be clear to put this extern under arch/*/include/asm/*.
> 
> Its up to the arch whether or not to define
> CONFIG_ARCH_THREAD_INFO_ALLOCATOR. In the case where it hasn't defined it,
> and THREAD_SIZE >= PAGE_SIZE, your change is exposing thread_info_cache on
> all architectures, so it ought go in a header file accessible to all
> architectures.
> 
> Something like this, (only build-tested!):
> =========%<=========
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -10,6 +10,8 @@
> #include <linux/types.h>
> #include <linux/bug.h>
> 
> +#include <asm/page.h>
> +

Reviewing the arch code, it seems this does not cover all architectures...

> struct timespec;
> struct compat_timespec;
> 
> @@ -145,6 +147,12 @@ static inline bool test_and_clear_restore_sigmask(void)
> #error "no set_restore_sigmask() provided and default one won't work"
> #endif
> 
> +#ifndef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +#if THREAD_SIZE >= PAGE_SIZE
> +extern struct kmem_cache *thread_info_cache;
> +#endif /* THREAD_SIZE >= PAGE_SIZE */
> +#endif /* CONFIG_ARCH_THREAD_INFO_ALLOCATOR */
> +
> #endif /* __KERNEL__ */
> 
> #endif /* _LINUX_THREAD_INFO_H */
> =========%<=========
> Quite ugly!
> 
> My concern is there could be push-back from the maintainer of
> kernel/fork.c, saying "define CONFIG_ARCH_THREAD_INFO_ALLOCATOR if the
> generic code isn't what you need", and push-back from the arm64 maintainers
> about copy-pasting that chunk into arch/arm64.... both of which are fair,
> hence my initial version created a second kmem_cache.

Same concern. I believe now is the time to get feedback from the maintainers.
It will help us decide the next step.

I will re-spin soon!

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-15 16:01         ` James Morse
@ 2015-10-16 13:02           ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-16 13:02 UTC (permalink / raw)
  To: James Morse
  Cc: takahiro.akashi, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 16, 2015, at 1:01 AM, James Morse wrote:
> On 15/10/15 15:24, Jungseok Lee wrote:
>> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>>> of perf tracing back to userspace, (and there are patches on the list to
>>> do/fix this), so we need to walk back to the start of the first stack for
>>> the perf accounting to be correct.
>> 
>> I plan to do re-spin this series without [PATCH 2/2] since 1) Akashi's
>> approach looks better than mine and 2) you have the perf patches for [2].
> 
> They aren't my patches - the ones I saw on the list were for arm:
> https://lkml.org/lkml/2015/10/1/769 - its evidently something perf
> supports, so we shouldn't make it worse…

Aha, thanks for the information!

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-16 13:01                     ` Jungseok Lee
@ 2015-10-16 16:06                       ` Catalin Marinas
  -1 siblings, 0 replies; 60+ messages in thread
From: Catalin Marinas @ 2015-10-16 16:06 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: James Morse, mark.rutland, barami97, will.deacon, linux-kernel,
	takahiro.akashi, linux-arm-kernel

On Fri, Oct 16, 2015 at 10:01:20PM +0900, Jungseok Lee wrote:
> On Oct 16, 2015, at 12:59 AM, James Morse wrote:
> > My concern is there could be push-back from the maintainer of
> > kernel/fork.c, saying "define CONFIG_ARCH_THREAD_INFO_ALLOCATOR if the
> > generic code isn't what you need", and push-back from the arm64 maintainers
> > about copy-pasting that chunk into arch/arm64.... both of which are fair,
> > hence my initial version created a second kmem_cache.
> 
> Same concern. I believe now is the time to get feedbacks from maintainers.
> It will help us to decide the next step.

I'll push back now to avoid further doubts in changing kernel/fork.c ;).

A reason to define a kmem_cache is performance for repeated allocations.
But here you only do it once during boot. So you could simply use
kmalloc() when THREAD_SIZE < PAGE_SIZE. BTW, the IRQ stack size doesn't
even need to be the same as THREAD_SIZE, though we could initially keep
them the same. But it's worth defining an IRQ_STACK_SIZE macro if we
ever need to change it.

BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
save us from another stack address reading on the IRQ entry path. I'm
not sure exactly where the 16K image increase comes from but at least it
doesn't grow with NR_CPUS, so we can probably live with this.
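
For concreteness, the two options above could look roughly like the sketch
below (illustrative only; apart from IRQ_STACK_SIZE, THREAD_SIZE, PAGE_SIZE
and the standard allocator calls, the names are invented and not taken from
any posted patch):

#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/slab.h>

#define IRQ_STACK_SIZE	THREAD_SIZE	/* keep them equal initially */

/* option 1: one-off boot-time allocation, plain kmalloc()/page allocator */
static void *alloc_irq_stack_area(void)
{
	if (IRQ_STACK_SIZE >= PAGE_SIZE)
		return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
						get_order(IRQ_STACK_SIZE));
	return kzalloc(IRQ_STACK_SIZE, GFP_KERNEL);
}

/* option 2: static per-cpu stacks, no extra pointer read on the entry path */
static DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack_area);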

-- 
Catalin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-16 16:06                       ` Catalin Marinas
@ 2015-10-17 13:38                         ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-17 13:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: James Morse, mark.rutland, barami97, will.deacon, linux-kernel,
	takahiro.akashi, linux-arm-kernel

On Oct 17, 2015, at 1:06 AM, Catalin Marinas wrote:

Hi Catalin,

> On Fri, Oct 16, 2015 at 10:01:20PM +0900, Jungseok Lee wrote:
>> On Oct 16, 2015, at 12:59 AM, James Morse wrote:
>>> My concern is there could be push-back from the maintainer of
>>> kernel/fork.c, saying "define CONFIG_ARCH_THREAD_INFO_ALLOCATOR if the
>>> generic code isn't what you need", and push-back from the arm64 maintainers
>>> about copy-pasting that chunk into arch/arm64.... both of which are fair,
>>> hence my initial version created a second kmem_cache.
>> 
>> Same concern. I believe now is the time to get feedbacks from maintainers.
>> It will help us to decide the next step.
> 
> I'll push back now to avoid further doubts in changing kernel/fork.c ;).

Thanks a lot!

> A reason to define a kmem_cache is performance for repeated allocations.
> But here you only do it once during boot. So you could simply use
> kmalloc() when THREAD_SIZE < PAGE_SIZE. BTW, the IRQ stack size doesn't
> even need to be the same as THREAD_SIZE, though we could initially keep
> them the same. But it's worth defining an IRQ_STACK_SIZE macro if we
> ever need to change it.

I will update the series to use the IRQ_* macros.

> BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
> save us from another stack address reading on the IRQ entry path. I'm
> not sure exactly where the 16K image increase comes from but at least it
> doesn't grow with NR_CPUS, so we can probably live with this.

I've tried the approach, a static allocation using DEFINE_PER_CPU, but
it does not work with the top-bit comparison method (for the IRQ
re-entrance check). The top-bit idea is based on the assumption that the
IRQ stack is aligned to THREAD_SIZE. But tpidr_el1 is only PAGE_SIZE
aligned, which leads to IRQ re-entrance detection failure on a 4KB page
system.
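
In C terms, the top-bit check amounts to something like this (a simplified
sketch of the entry.S logic, not a literal translation; THREAD_SIZE comes
from the usual arch headers):

#include <linux/types.h>
#include <linux/thread_info.h>

/* true only if sp lies within a THREAD_SIZE-aligned irq stack */
static inline bool on_irq_stack(unsigned long sp, unsigned long irq_stack_base)
{
	return (sp & ~(THREAD_SIZE - 1)) == (irq_stack_base & ~(THREAD_SIZE - 1));
}

With only PAGE_SIZE alignment from DEFINE_PER_CPU, a 16KB irq stack can
straddle a THREAD_SIZE boundary on a 4KB page kernel, so the masked values
no longer identify the stack reliably.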

IMHO, it is hard to avoid the 16KB size increase for 64KB page support.
Secondary cores can rely on slab.h, but the boot core cannot. So, the IRQ
stack for at least the boot cpu should be allocated statically.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-15 13:39               ` Jungseok Lee
@ 2015-10-19  6:47                 ` AKASHI Takahiro
  -1 siblings, 0 replies; 60+ messages in thread
From: AKASHI Takahiro @ 2015-10-19  6:47 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

Jungseok,

On 10/15/2015 10:39 PM, Jungseok Lee wrote:
> On Oct 15, 2015, at 1:19 PM, AKASHI Takahiro wrote:
>> Jungseok,
>
>>>> ----8<----
>>>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>>>> index f93aae5..e18be43 100644
>>>> --- a/arch/arm64/kernel/traps.c
>>>> +++ b/arch/arm64/kernel/traps.c
>>>> @@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
>>>>         set_fs(fs);
>>>> }
>>>>
>>>> -static void dump_backtrace_entry(unsigned long where, unsigned long stack)
>>>> +static void dump_backtrace_entry(unsigned long where)
>>>> {
>>>> +       /*
>>>> +        * PC has a physical address when MMU is disabled.
>>>> +        */
>>>> +       if (!kernel_text_address(where))
>>>> +               where = (unsigned long)phys_to_virt(where);
>>>> +
>>>>         print_ip_sym(where);
>>>> -       if (in_exception_text(where))
>>>> -               dump_mem("", "Exception stack", stack,
>>>> -                        stack + sizeof(struct pt_regs), false);
>>>> }
>>>>
>>>> static void dump_instr(const char *lvl, struct pt_regs *regs)
>>>> @@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>>>         pr_emerg("Call trace:\n");
>>>>         while (1) {
>>>>                 unsigned long where = frame.pc;
>>>> +               unsigned long stack;
>>>>                 int ret;
>>>>
>>>> +               dump_backtrace_entry(where);
>>>>                 ret = unwind_frame(&frame);
>>>>                 if (ret < 0)
>>>>                         break;
>>>> -               dump_backtrace_entry(where, frame.sp);
>>>> +               stack = frame.sp;
>>>> +               if (in_exception_text(where))
>>>> +                       dump_mem("", "Exception stack", stack,
>>>> +                                stack + sizeof(struct pt_regs), false);
>>>>         }
>>>> }
>>>> ----8<----
>>>>
>>>>> Thanks,
>>>>> -Takahiro AKASHI
>>>>> ----8<----
>>>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>>>> index 650cc05..5fbd1ea 100644
>>>>> --- a/arch/arm64/kernel/entry.S
>>>>> +++ b/arch/arm64/kernel/entry.S
>>>>> @@ -185,14 +185,12 @@ alternative_endif
>>>>> 	mov	x23, sp
>>>>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>>>>> 	cmp	x20, x23			// check irq re-enterance
>>>>> +	mov	x19, sp
>>>>> 	beq	1f
>>>>> -	str	x29, [x19, #IRQ_FRAME_FP]
>>>>> -	str	x21, [x19, #IRQ_FRAME_SP]
>>>>> -	str	x22, [x19, #IRQ_FRAME_PC]
>>>>> -	mov	x29, x24
>>>>> -1:	mov	x19, sp
>>>>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>>>>> -	mov	sp, x23
>>>>> +	mov	sp, x24				// x24 = top of irq stack
>>>>> +	stp	x29, x22, [sp, #-32]!
>>>>> +	mov	x29, sp
>>>>> +1:
>>>>> 	.endm
>>>>>
>>>>> 	/*
>>>>
>>>> Is it possible to decide which stack is used without aborted SP information?
>>>
>>> We could know which stack is used via current SP, but how could we decide
>>> a variable 'low' in unwind_frame() when walking a process stack?
>>
>> The following patch, replacing your [PATCH 2/2], seems to work nicely,
>> traversing from interrupt stack to process stack. I tried James' method as well
>> as "echo c > /proc/sysrq-trigger."
>
> Great thanks!
>
> Since I'm favor of your approach, I've played with this patch instead of my one.
> A kernel panic is observed when using 'perf record with -g option' and sysrq.
> I guess some other changes are on your tree..
>
> Please refer to my analysis.
>
>> The only issue that I have now is that dump_backtrace() does not show
>> correct "pt_regs" data on process stack (actually it dumps interrupt stack):
>>
>> CPU1: stopping
>> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D         4.3.0-rc5+ #24
>> Hardware name: ARM Arm Versatile Express/Arm Versatile Express, BIOS 11:37:19 Jul 16 2015
>> Call trace:
>> [<ffffffc00008a7b0>] dump_backtrace+0x0/0x19c
>> [<ffffffc00008a968>] show_stack+0x1c/0x28
>> [<ffffffc0003936d0>] dump_stack+0x88/0xc8
>> [<ffffffc00008fdf8>] handle_IPI+0x258/0x268
>> [<ffffffc000082530>] gic_handle_irq+0x88/0xa4
>> Exception stack(0xffffffc87b1bffa0 to 0xffffffc87b1c00c0) <== HERE
>> ffa0: ffffffc87b18fe30 ffffffc87b1bc000 ffffffc87b18ff50 ffffffc000086ac8
>> ffc0: ffffffc87b18c000 afafafafafafafaf ffffffc87b18ff50 ffffffc000086ac8
>> ffe0: ffffffc87b18ff50 ffffffc87b18ff50 afafafafafafafaf afafafafafafafaf
>> 0000: 0000000000000000 ffffffffffffffff ffffffc87b195c00 0000000200000002
>> 0020: 0000000057ac6e9d afafafafafafafaf afafafafafafafaf afafafafafafafaf
>> 0040: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>> 0060: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>> 0080: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>> 00a0: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>> [<ffffffc0000855e0>] el1_irq+0xa0/0x114
>> [<ffffffc000086ac4>] arch_cpu_idle+0x14/0x20
>> [<ffffffc0000fc110>] default_idle_call+0x1c/0x34
>> [<ffffffc0000fc464>] cpu_startup_entry+0x2cc/0x30c
>> [<ffffffc00008f7c4>] secondary_start_kernel+0x120/0x148
>> [<ffffffc0000827a8>] secondary_startup+0x8/0x20
>
> My 'dump_backtrace() rework' patch is in your working tree. Right?

Yeah. I applied your irq stack v5 and "Synchronise dump_backtrace()..." v3,
and tried to reproduce your problem, but could not.

>>
>> Thanks,
>> -Takahiro AKASHI
>>
>> ----8<----
>>  From 1aa8d4e533d44099f69ff761acfa3c1045a00796 Mon Sep 17 00:00:00 2001
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> Date: Thu, 15 Oct 2015 09:04:10 +0900
>> Subject: [PATCH] arm64: revamp unwind_frame for interrupt stack
>>
>> This patch allows unwind_frame() to traverse from interrupt stack
>> to process stack correctly by having a dummy stack frame for irq_handler
>> created at its prologue.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>> arch/arm64/kernel/entry.S      |   22 ++++++++++++++++++++--
>> arch/arm64/kernel/stacktrace.c |   14 +++++++++++++-
>> 2 files changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index 6d4e8c5..25cabd9 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -185,8 +185,26 @@ alternative_endif
>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>> 	cmp	x20, x23			// check irq re-enterance
>> 	mov	x19, sp
>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>> -	mov	sp, x23
>> +	beq	1f
>> +	mov	sp, x24				// x24 = top of irq stack
>> +	stp	x29, x21, [sp, #-16]!		// for sanity check
>> +	stp	x29, x22, [sp, #-16]!		// dummy stack frame
>> +	mov	x29, sp
>> +1:
>> +	/*
>> +	 * Layout of interrupt stack after this macro is invoked:
>> +	 *
>> +	 *     |                |
>> +	 *-0x20+----------------+ <= dummy stack frame
>> +	 *     |      fp        |    : fp on process stack
>> +	 *-0x18+----------------+
>> +	 *     |      lr        |    : return address
>> +	 *-0x10+----------------+
>> +	 *     |    fp (copy)   |    : for sanity check
>> +	 * -0x8+----------------+
>> +	 *     |      sp        |    : sp on process stack
>> +	 *  0x0+----------------+
>> +	 */
>> 	.endm
>>
>> 	/*
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 407991b..03611a1 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -43,12 +43,24 @@ int notrace unwind_frame(struct stackframe *frame)
>> 	low  = frame->sp;
>> 	high = ALIGN(low, THREAD_SIZE);
>>
>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>> +	if (fp < low || fp > high - 0x20 || fp & 0xf)
>> 		return -EINVAL;
>
> IMO, this condition should be changes as follows.
>
> 	if (fp < low || fp > high - 0x20 || fp & 0xf || !fp)

If fp is NULL, (fp < low) should also be true.

-Takahiro AKASHI


> Please refer to the below for details.
>
>>
>> 	frame->sp = fp + 0x10;
>> 	frame->fp = *(unsigned long *)(fp);
>> 	/*
>> +	 * check whether we are going to walk trough from interrupt stack
>> +	 * to process stack
>> +	 * If the previous frame is the initial (dummy) stack frame on
>> +	 * interrupt stack, frame->sp now points to just below the frame
>> +	 * (dummy frame + 0x10).
>> +	 * See entry.S
>> +	 */
>> +#define STACK_LOW(addr) round_down((addr), THREAD_SIZE)
>> +	if ((STACK_LOW(frame->sp) != STACK_LOW(frame->fp)) &&
>> +			(frame->fp == *(unsigned long *)frame->sp))
>> +		frame->sp = *((unsigned long *)(frame->sp + 8));
>
> An original intention seems to catch a stack change from IRQ stack to process one.
> Unfortunately, this condition hits when the last of stack frame of swapper is
> retrieved. This leads to NULL pointer access due to the following code snippet.
>
> ENTRY(__secondary_switched)
>          ldr     x0, [x21]                       // get secondary_data.stack
>          mov     sp, x0
>          mov     x29, #0
>          b       secondary_start_kernel
> ENDPROC(__secondary_switched)
>
> This is why x29 should be checked.
>
> Best Regards
> Jungseok Lee
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-17 13:38                         ` Jungseok Lee
@ 2015-10-19 16:18                           ` Catalin Marinas
  -1 siblings, 0 replies; 60+ messages in thread
From: Catalin Marinas @ 2015-10-19 16:18 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: mark.rutland, barami97, will.deacon, linux-kernel,
	takahiro.akashi, James Morse, linux-arm-kernel

On Sat, Oct 17, 2015 at 10:38:16PM +0900, Jungseok Lee wrote:
> On Oct 17, 2015, at 1:06 AM, Catalin Marinas wrote:
> > BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
> > save us from another stack address reading on the IRQ entry path. I'm
> > not sure exactly where the 16K image increase comes from but at least it
> > doesn't grow with NR_CPUS, so we can probably live with this.
> 
> I've tried the approach, a static allocation using DEFINE_PER_CPU, but
> it dose not work on a top-bit comparison method (for IRQ re-entrance
> check). The top-bit idea is based on the assumption that IRQ stack is
> aligned with THREAD_SIZE. But, tpidr_el1 is PAGE_SIZE aligned. It leads
> to IRQ re-entrance failure in case of 4KB page system.
> 
> IMHO, it is hard to avoid 16KB size increase for 64KB page support.
> Secondary cores can rely on slab.h, but a boot core cannot. So, IRQ
> stack for at least a boot cpu should be allocated statically.

Ah, I forgot about the alignment check. The problem we have with your v5
patch is that kmalloc() doesn't guarantee this either (see commit
2a0b5c0d1929, "arm64: Align less than PAGE_SIZE pgds naturally", where
we had to fix this for pgd_alloc).
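
For reference, the pgd fix above guarantees alignment by creating a cache
whose alignment equals the object size; doing the same for irq stacks would
look roughly like this (just a sketch of that technique, not a patch anyone
has posted in this thread):

#include <linux/errno.h>
#include <linux/slab.h>

static struct kmem_cache *irq_stack_cache;

static int __init irq_stack_cache_init(void)
{
	/* align each object to its own size, as the pgd_cache fix does */
	irq_stack_cache = kmem_cache_create("irq_stack", IRQ_STACK_SIZE,
					    IRQ_STACK_SIZE, 0, NULL);
	return irq_stack_cache ? 0 : -ENOMEM;
}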

I'm leaning more and more towards the x86 approach as I mentioned in the
two messages below:

http://article.gmane.org/gmane.linux.kernel/2041877
http://article.gmane.org/gmane.linux.kernel/2043002

With a per-cpu stack you can avoid another pointer read, replacing it
with a single check for the re-entrance. But note that the update only
happens during do_softirq_own_stack() and *not* for every IRQ taken.
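
One way to picture that single check is a per-cpu 'busy' flag that is set
only while the irq stack is in use (a loose sketch of the idea; this is not
the actual x86 code and all names below are invented):

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack_area);
static DEFINE_PER_CPU(unsigned int, irq_stack_busy);

/* called on the irq/softirq entry path with interrupts masked */
static bool enter_irq_stack(void)
{
	if (this_cpu_read(irq_stack_busy))
		return false;	/* already on the irq stack: stay on it */
	this_cpu_write(irq_stack_busy, 1);
	return true;		/* caller switches sp to the per-cpu stack */
}

static void leave_irq_stack(void)
{
	this_cpu_write(irq_stack_busy, 0);
}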

-- 
Catalin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-19 16:18                           ` Catalin Marinas
@ 2015-10-20 13:08                             ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-20 13:08 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, barami97, will.deacon, linux-kernel,
	takahiro.akashi, James Morse, linux-arm-kernel

On Oct 20, 2015, at 1:18 AM, Catalin Marinas wrote:

Hi Catalin,

> On Sat, Oct 17, 2015 at 10:38:16PM +0900, Jungseok Lee wrote:
>> On Oct 17, 2015, at 1:06 AM, Catalin Marinas wrote:
>>> BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
>>> save us from another stack address reading on the IRQ entry path. I'm
>>> not sure exactly where the 16K image increase comes from but at least it
>>> doesn't grow with NR_CPUS, so we can probably live with this.
>> 
>> I've tried the approach, a static allocation using DEFINE_PER_CPU, but
>> it dose not work on a top-bit comparison method (for IRQ re-entrance
>> check). The top-bit idea is based on the assumption that IRQ stack is
>> aligned with THREAD_SIZE. But, tpidr_el1 is PAGE_SIZE aligned. It leads
>> to IRQ re-entrance failure in case of 4KB page system.
>> 
>> IMHO, it is hard to avoid 16KB size increase for 64KB page support.
>> Secondary cores can rely on slab.h, but a boot core cannot. So, IRQ
>> stack for at least a boot cpu should be allocated statically.
> 
> Ah, I forgot about the alignment check. The problem we have with your v5
> patch is that kmalloc() doesn't guarantee this either (see commit
> 2a0b5c0d1929, "arm64: Align less than PAGE_SIZE pgds naturally", where
> we had to fix this for pgd_alloc).

The alignment would be guaranteed with the following additional diff. It is
possible to remove one pointer read in irq_stack_entry on 64KB pages, but it
leads to code divergence. Am I missing something?

----8<----

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 2755b2f..c480613 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -17,15 +17,17 @@
 #include <asm-generic/irq.h>
 
 #if IRQ_STACK_SIZE >= PAGE_SIZE
-static inline void *__alloc_irq_stack(void)
+static inline void *__alloc_irq_stack(unsigned int cpu)
 {
        return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
                                        IRQ_STACK_SIZE_ORDER);
 }
 #else
-static inline void *__alloc_irq_stack(void)
+DECLARE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
+
+static inline void *__alloc_irq_stack(unsigned int cpu)
 {
-       return kmalloc(IRQ_STACK_SIZE, THREADINFO_GFP | __GFP_ZERO);
+       return (void *)per_cpu(irq_stack, cpu);
 }
 #endif
 
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c8e0bcf..f1303c5 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -177,7 +177,7 @@ alternative_endif
        .endm
 
        .macro  irq_stack_entry
-       adr_l   x19, irq_stacks
+       adr_l   x19, irq_stack_ptr
        mrs     x20, tpidr_el1
        add     x19, x19, x20
        ldr     x24, [x19]
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 13fe8f4..acb9a14 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -30,7 +30,10 @@
 
 unsigned long irq_err_count;
 
-DEFINE_PER_CPU(void *, irq_stacks);
+DEFINE_PER_CPU(void *, irq_stack_ptr);
+#if IRQ_STACK_SIZE < PAGE_SIZE
+DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
+#endif
 
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
@@ -49,13 +52,10 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
        handle_arch_irq = handle_irq;
 }
 
-static char boot_irq_stack[IRQ_STACK_SIZE] __aligned(IRQ_STACK_SIZE);
-
 void __init init_IRQ(void)
 {
-       unsigned int cpu = smp_processor_id();
-
-       per_cpu(irq_stacks, cpu) = boot_irq_stack + IRQ_STACK_START_SP;
+       if (alloc_irq_stack(smp_processor_id()))
+               panic("Failed to allocate IRQ stack for a boot cpu");
 
        irqchip_init();
        if (!handle_arch_irq)
@@ -66,14 +66,14 @@ int alloc_irq_stack(unsigned int cpu)
 {
        void *stack;
 
-       if (per_cpu(irq_stacks, cpu))
+       if (per_cpu(irq_stack_ptr, cpu))
                return 0;
 
-       stack = __alloc_irq_stack();
+       stack = __alloc_irq_stack(cpu);
        if (!stack)
                return -ENOMEM;
 
-       per_cpu(irq_stacks, cpu) = stack + IRQ_STACK_START_SP;
+       per_cpu(irq_stack_ptr, cpu) = stack + IRQ_STACK_START_SP;
 
        return 0;
 }

----8<----


> 
> I'm leaning more and more towards the x86 approach as I mentioned in the
> two messages below:
> 
> http://article.gmane.org/gmane.linux.kernel/2041877
> http://article.gmane.org/gmane.linux.kernel/2043002
> 
> With a per-cpu stack you can avoid another pointer read, replacing it
> with a single check for the re-entrance. But note that the update only
> happens during do_softirq_own_stack() and *not* for every IRQ taken.

I've carefully reviewed the approach you mentioned about a month ago.
From my observation of the max stack depth, the scenario is as follows:

 (1) process context
 (2) hard IRQ raised
 (3) soft IRQ raised in irq_exit()
 (4) another process context
 (5) another hard IRQ raised 

The below is a stack description under the scenario.

 --- ------- <- High address of stack
     |     |
     |     |
 (a) |     | Process context (1)
     |     |
     |     |
 --- ------- <- Hard IRQ raised (2)
 (b) |     |
 --- ------- <- Soft IRQ raised in irq_exit() (3)
 (c) |     |
 --- ------- <- Max stack depth by (2)
     |     |
 (d) |     | Another process context (4)
     |     |
 --- ------- <- Another hard IRQ raised (5)
 (e) |     |
 --- ------- <- Low address of stack

The following is the max stack depth calculation: the first argument of max()
is handled by the process stack, the second one by the IRQ stack.

 - current status  : max_stack_depth = max((a)+(b)+(c)+(d)+(e), 0)
 - current patch   : max_stack_depth = max((a), (b)+(c)+(d)+(e))
 - do_softirq_own_ : max_stack_depth = max((a)+(b)+(c), (c)+(d)+(e))

A principal objective is to build up an infrastructure targeted at reducing
the process stack size, THREAD_SIZE. Frankly, I'm not sure the inequality
(a)+(b)+(c) <= 8KB holds. If it does not, this feature, IRQ stack support,
would be questionable. That is, it might not be enough to maintain a single
out-of-tree patch which adjusts both IRQ_STACK_SIZE and THREAD_SIZE.

However, if the inequality is guaranteed, the do_softirq_own_stack() approach
looks better than the current one from an operational-overhead perspective.
BTW, is there a way to simplify the top-bit comparison logic?

Many thanks for the valuable feedback, from which I've learned a lot.

Best Regards
Jungseok Lee

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
@ 2015-10-20 13:08                             ` Jungseok Lee
  0 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-20 13:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Oct 20, 2015, at 1:18 AM, Catalin Marinas wrote:

Hi Catalin,

> On Sat, Oct 17, 2015 at 10:38:16PM +0900, Jungseok Lee wrote:
>> On Oct 17, 2015, at 1:06 AM, Catalin Marinas wrote:
>>> BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
>>> save us from another stack address reading on the IRQ entry path. I'm
>>> not sure exactly where the 16K image increase comes from but at least it
>>> doesn't grow with NR_CPUS, so we can probably live with this.
>> 
>> I've tried the approach, a static allocation using DEFINE_PER_CPU, but
>> it dose not work on a top-bit comparison method (for IRQ re-entrance
>> check). The top-bit idea is based on the assumption that IRQ stack is
>> aligned with THREAD_SIZE. But, tpidr_el1 is PAGE_SIZE aligned. It leads
>> to IRQ re-entrance failure in case of 4KB page system.
>> 
>> IMHO, it is hard to avoid 16KB size increase for 64KB page support.
>> Secondary cores can rely on slab.h, but a boot core cannot. So, IRQ
>> stack for at least a boot cpu should be allocated statically.
> 
> Ah, I forgot about the alignment check. The problem we have with your v5
> patch is that kmalloc() doesn't guarantee this either (see commit
> 2a0b5c0d1929, "arm64: Align less than PAGE_SIZE pgds naturally", where
> we had to fix this for pgd_alloc).

The alignment would be guaranteed under the following additional diff. It is
possible to remove one pointer read in irq_stack_entry on 64KB page, but it
leads to code divergence. Am I missing something?

----8<----

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 2755b2f..c480613 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -17,15 +17,17 @@
 #include <asm-generic/irq.h>
 
 #if IRQ_STACK_SIZE >= PAGE_SIZE
-static inline void *__alloc_irq_stack(void)
+static inline void *__alloc_irq_stack(unsigned int cpu)
 {
        return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
                                        IRQ_STACK_SIZE_ORDER);
 }
 #else
-static inline void *__alloc_irq_stack(void)
+DECLARE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
+
+static inline void *__alloc_irq_stack(unsigned int cpu)
 {
-       return kmalloc(IRQ_STACK_SIZE, THREADINFO_GFP | __GFP_ZERO);
+       return (void *)per_cpu(irq_stack, cpu);
 }
 #endif
 
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c8e0bcf..f1303c5 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -177,7 +177,7 @@ alternative_endif
        .endm
 
        .macro  irq_stack_entry
-       adr_l   x19, irq_stacks
+       adr_l   x19, irq_stack_ptr
        mrs     x20, tpidr_el1
        add     x19, x19, x20
        ldr     x24, [x19]
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 13fe8f4..acb9a14 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -30,7 +30,10 @@
 
 unsigned long irq_err_count;
 
-DEFINE_PER_CPU(void *, irq_stacks);
+DEFINE_PER_CPU(void *, irq_stack_ptr);
+#if IRQ_STACK_SIZE < PAGE_SIZE
+DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
+#endif
 
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
@@ -49,13 +52,10 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
        handle_arch_irq = handle_irq;
 }
 
-static char boot_irq_stack[IRQ_STACK_SIZE] __aligned(IRQ_STACK_SIZE);
-
 void __init init_IRQ(void)
 {
-       unsigned int cpu = smp_processor_id();
-
-       per_cpu(irq_stacks, cpu) = boot_irq_stack + IRQ_STACK_START_SP;
+       if (alloc_irq_stack(smp_processor_id()))
+               panic("Failed to allocate IRQ stack for a boot cpu");
 
        irqchip_init();
        if (!handle_arch_irq)
@@ -66,14 +66,14 @@ int alloc_irq_stack(unsigned int cpu)
 {
        void *stack;
 
-       if (per_cpu(irq_stacks, cpu))
+       if (per_cpu(irq_stack_ptr, cpu))
                return 0;
 
-       stack = __alloc_irq_stack();
+       stack = __alloc_irq_stack(cpu);
        if (!stack)
                return -ENOMEM;
 
-       per_cpu(irq_stacks, cpu) = stack + IRQ_STACK_START_SP;
+       per_cpu(irq_stack_ptr, cpu) = stack + IRQ_STACK_START_SP;
 
        return 0;
 }

----8<----


> 
> I'm leaning more and more towards the x86 approach as I mentioned in the
> two messages below:
> 
> http://article.gmane.org/gmane.linux.kernel/2041877
> http://article.gmane.org/gmane.linux.kernel/2043002
> 
> With a per-cpu stack you can avoid another pointer read, replacing it
> with a single check for the re-entrance. But note that the update only
> happens during do_softirq_own_stack() and *not* for every IRQ taken.

I've carefully reviewed the approach you mentioned about a month ago.
Based on my observation of max stack depth, the scenario is as follows:

 (1) process context
 (2) hard IRQ raised
 (3) soft IRQ raised in irq_exit()
 (4) another process context
 (5) another hard IRQ raised 

Below is a stack description under this scenario.

 --- ------- <- High address of stack
     |     |
     |     |
 (a) |     | Process context (1)
     |     |
     |     |
 --- ------- <- Hard IRQ raised (2)
 (b) |     |
 --- ------- <- Soft IRQ raised in irq_exit() (3)
 (c) |     |
 --- ------- <- Max stack depth by (2)
     |     |
 (d) |     | Another process context (4)
     |     |
 --- ------- <- Another hard IRQ raised (5)
 (e) |     |
 --- ------- <- Low address of stack

The following is the max stack depth calculation: the first argument of max() is
handled by the process stack, the second one by the IRQ stack.

 - current status  : max_stack_depth = max((a)+(b)+(c)+(d)+(e), 0)
 - current patch   : max_stack_depth = max((a), (b)+(c)+(d)+(e))
 - do_softirq_own_ : max_stack_depth = max((a)+(b)+(c), (c)+(d)+(e))

It is a principal objective to build up an infrastructure targeted at reducing
the process stack size, THREAD_SIZE. Frankly, I'm not sure the inequality
(a)+(b)+(c) <= 8KB holds. If that condition is not satisfied, this feature, IRQ
stack support, would be questionable. That is, it might not be enough to maintain
a single out-of-tree patch which adjusts both IRQ_STACK_SIZE and THREAD_SIZE.

However, if the inequality is guaranteed, the do_softirq_own_stack() approach looks
better than the current one from an operation-overhead perspective. BTW, is there a
way to simplify the top-bit comparison logic?

Great thanks for the valuable feedback, from which I've learned a lot.

Best Regards
Jungseok Lee

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-19  6:47                 ` AKASHI Takahiro
@ 2015-10-20 13:19                   ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-20 13:19 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: James Morse, catalin.marinas, will.deacon, linux-arm-kernel,
	mark.rutland, barami97, linux-kernel

On Oct 19, 2015, at 3:47 PM, AKASHI Takahiro wrote:
> Jungseok,
> 
> On 10/15/2015 10:39 PM, Jungseok Lee wrote:
>> On Oct 15, 2015, at 1:19 PM, AKASHI Takahiro wrote:
>>> Jungseok,
>> 
>>>>> ----8<----
>>>>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>>>>> index f93aae5..e18be43 100644
>>>>> --- a/arch/arm64/kernel/traps.c
>>>>> +++ b/arch/arm64/kernel/traps.c
>>>>> @@ -103,12 +103,15 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom,
>>>>>        set_fs(fs);
>>>>> }
>>>>> 
>>>>> -static void dump_backtrace_entry(unsigned long where, unsigned long stack)
>>>>> +static void dump_backtrace_entry(unsigned long where)
>>>>> {
>>>>> +       /*
>>>>> +        * PC has a physical address when MMU is disabled.
>>>>> +        */
>>>>> +       if (!kernel_text_address(where))
>>>>> +               where = (unsigned long)phys_to_virt(where);
>>>>> +
>>>>>        print_ip_sym(where);
>>>>> -       if (in_exception_text(where))
>>>>> -               dump_mem("", "Exception stack", stack,
>>>>> -                        stack + sizeof(struct pt_regs), false);
>>>>> }
>>>>> 
>>>>> static void dump_instr(const char *lvl, struct pt_regs *regs)
>>>>> @@ -172,12 +175,17 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>>>>>        pr_emerg("Call trace:\n");
>>>>>        while (1) {
>>>>>                unsigned long where = frame.pc;
>>>>> +               unsigned long stack;
>>>>>                int ret;
>>>>> 
>>>>> +               dump_backtrace_entry(where);
>>>>>                ret = unwind_frame(&frame);
>>>>>                if (ret < 0)
>>>>>                        break;
>>>>> -               dump_backtrace_entry(where, frame.sp);
>>>>> +               stack = frame.sp;
>>>>> +               if (in_exception_text(where))
>>>>> +                       dump_mem("", "Exception stack", stack,
>>>>> +                                stack + sizeof(struct pt_regs), false);
>>>>>        }
>>>>> }
>>>>> ----8<----
>>>>> 
>>>>>> Thanks,
>>>>>> -Takahiro AKASHI
>>>>>> ----8<----
>>>>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>>>>> index 650cc05..5fbd1ea 100644
>>>>>> --- a/arch/arm64/kernel/entry.S
>>>>>> +++ b/arch/arm64/kernel/entry.S
>>>>>> @@ -185,14 +185,12 @@ alternative_endif
>>>>>> 	mov	x23, sp
>>>>>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>>>>>> 	cmp	x20, x23			// check irq re-enterance
>>>>>> +	mov	x19, sp
>>>>>> 	beq	1f
>>>>>> -	str	x29, [x19, #IRQ_FRAME_FP]
>>>>>> -	str	x21, [x19, #IRQ_FRAME_SP]
>>>>>> -	str	x22, [x19, #IRQ_FRAME_PC]
>>>>>> -	mov	x29, x24
>>>>>> -1:	mov	x19, sp
>>>>>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>>>>>> -	mov	sp, x23
>>>>>> +	mov	sp, x24				// x24 = top of irq stack
>>>>>> +	stp	x29, x22, [sp, #-32]!
>>>>>> +	mov	x29, sp
>>>>>> +1:
>>>>>> 	.endm
>>>>>> 
>>>>>> 	/*
>>>>> 
>>>>> Is it possible to decide which stack is used without aborted SP information?
>>>> 
>>>> We could know which stack is used via current SP, but how could we decide
>>>> a variable 'low' in unwind_frame() when walking a process stack?
>>> 
>>> The following patch, replacing your [PATCH 2/2], seems to work nicely,
>>> traversing from interrupt stack to process stack. I tried James' method as well
>>> as "echo c > /proc/sysrq-trigger."
>> 
>> Great thanks!
>> 
>> Since I'm in favor of your approach, I've played with this patch instead of mine.
>> A kernel panic is observed when using 'perf record with -g option' and sysrq.
>> I guess some other changes are on your tree..
>> 
>> Please refer to my analysis.
>> 
>>> The only issue that I have now is that dump_backtrace() does not show
>>> correct "pt_regs" data on process stack (actually it dumps interrupt stack):
>>> 
>>> CPU1: stopping
>>> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D         4.3.0-rc5+ #24
>>> Hardware name: ARM Arm Versatile Express/Arm Versatile Express, BIOS 11:37:19 Jul 16 2015
>>> Call trace:
>>> [<ffffffc00008a7b0>] dump_backtrace+0x0/0x19c
>>> [<ffffffc00008a968>] show_stack+0x1c/0x28
>>> [<ffffffc0003936d0>] dump_stack+0x88/0xc8
>>> [<ffffffc00008fdf8>] handle_IPI+0x258/0x268
>>> [<ffffffc000082530>] gic_handle_irq+0x88/0xa4
>>> Exception stack(0xffffffc87b1bffa0 to 0xffffffc87b1c00c0) <== HERE
>>> ffa0: ffffffc87b18fe30 ffffffc87b1bc000 ffffffc87b18ff50 ffffffc000086ac8
>>> ffc0: ffffffc87b18c000 afafafafafafafaf ffffffc87b18ff50 ffffffc000086ac8
>>> ffe0: ffffffc87b18ff50 ffffffc87b18ff50 afafafafafafafaf afafafafafafafaf
>>> 0000: 0000000000000000 ffffffffffffffff ffffffc87b195c00 0000000200000002
>>> 0020: 0000000057ac6e9d afafafafafafafaf afafafafafafafaf afafafafafafafaf
>>> 0040: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>>> 0060: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>>> 0080: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>>> 00a0: afafafafafafafaf afafafafafafafaf afafafafafafafaf afafafafafafafaf
>>> [<ffffffc0000855e0>] el1_irq+0xa0/0x114
>>> [<ffffffc000086ac4>] arch_cpu_idle+0x14/0x20
>>> [<ffffffc0000fc110>] default_idle_call+0x1c/0x34
>>> [<ffffffc0000fc464>] cpu_startup_entry+0x2cc/0x30c
>>> [<ffffffc00008f7c4>] secondary_start_kernel+0x120/0x148
>>> [<ffffffc0000827a8>] secondary_startup+0x8/0x20
>> 
>> My 'dump_backtrace() rework' patch is in your working tree. Right?
> 
> Yeah. I applied your irq stack v5 and "Synchronise dump_backtrace()..." v3,
> and tried to reproduce your problem, but didn't.

I have not seen this problem yet with my patches and your v2.

>>> 
>>> Thanks,
>>> -Takahiro AKASHI
>>> 
>>> ----8<----
>>> From 1aa8d4e533d44099f69ff761acfa3c1045a00796 Mon Sep 17 00:00:00 2001
>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>> Date: Thu, 15 Oct 2015 09:04:10 +0900
>>> Subject: [PATCH] arm64: revamp unwind_frame for interrupt stack
>>> 
>>> This patch allows unwind_frame() to traverse from interrupt stack
>>> to process stack correctly by having a dummy stack frame for irq_handler
>>> created at its prologue.
>>> 
>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>> ---
>>> arch/arm64/kernel/entry.S      |   22 ++++++++++++++++++++--
>>> arch/arm64/kernel/stacktrace.c |   14 +++++++++++++-
>>> 2 files changed, 33 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>> index 6d4e8c5..25cabd9 100644
>>> --- a/arch/arm64/kernel/entry.S
>>> +++ b/arch/arm64/kernel/entry.S
>>> @@ -185,8 +185,26 @@ alternative_endif
>>> 	and	x23, x23, #~(THREAD_SIZE - 1)
>>> 	cmp	x20, x23			// check irq re-enterance
>>> 	mov	x19, sp
>>> -	csel	x23, x19, x24, eq		// x24 = top of irq stack
>>> -	mov	sp, x23
>>> +	beq	1f
>>> +	mov	sp, x24				// x24 = top of irq stack
>>> +	stp	x29, x21, [sp, #-16]!		// for sanity check
>>> +	stp	x29, x22, [sp, #-16]!		// dummy stack frame
>>> +	mov	x29, sp
>>> +1:
>>> +	/*
>>> +	 * Layout of interrupt stack after this macro is invoked:
>>> +	 *
>>> +	 *     |                |
>>> +	 *-0x20+----------------+ <= dummy stack frame
>>> +	 *     |      fp        |    : fp on process stack
>>> +	 *-0x18+----------------+
>>> +	 *     |      lr        |    : return address
>>> +	 *-0x10+----------------+
>>> +	 *     |    fp (copy)   |    : for sanity check
>>> +	 * -0x8+----------------+
>>> +	 *     |      sp        |    : sp on process stack
>>> +	 *  0x0+----------------+
>>> +	 */
>>> 	.endm
>>> 
>>> 	/*
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index 407991b..03611a1 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -43,12 +43,24 @@ int notrace unwind_frame(struct stackframe *frame)
>>> 	low  = frame->sp;
>>> 	high = ALIGN(low, THREAD_SIZE);
>>> 
>>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>>> +	if (fp < low || fp > high - 0x20 || fp & 0xf)
>>> 		return -EINVAL;
>> 
>> IMO, this condition should be changed as follows.
>> 
>> 	if (fp < low || fp > high - 0x20 || fp & 0xf || !fp)
> 
> If fp is NULL, (fp < low) should also be true.

I will report the values of low, high, and fp if it is detected.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
  2015-10-20 13:08                             ` Jungseok Lee
@ 2015-10-21 15:14                               ` Jungseok Lee
  -1 siblings, 0 replies; 60+ messages in thread
From: Jungseok Lee @ 2015-10-21 15:14 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, barami97, will.deacon, linux-kernel,
	takahiro.akashi, James Morse, linux-arm-kernel

On Oct 20, 2015, at 10:08 PM, Jungseok Lee wrote:
> On Oct 20, 2015, at 1:18 AM, Catalin Marinas wrote:
> 
> Hi Catalin,
> 
>> On Sat, Oct 17, 2015 at 10:38:16PM +0900, Jungseok Lee wrote:
>>> On Oct 17, 2015, at 1:06 AM, Catalin Marinas wrote:
>>>> BTW, a static allocation (DEFINE_PER_CPU for the whole irq stack) would
>>>> save us from another stack address reading on the IRQ entry path. I'm
>>>> not sure exactly where the 16K image increase comes from but at least it
>>>> doesn't grow with NR_CPUS, so we can probably live with this.
>>> 
>>> I've tried the approach, a static allocation using DEFINE_PER_CPU, but
>>> it does not work on a top-bit comparison method (for IRQ re-entrance
>>> check). The top-bit idea is based on the assumption that IRQ stack is
>>> aligned with THREAD_SIZE. But, tpidr_el1 is PAGE_SIZE aligned. It leads
>>> to IRQ re-entrance failure in case of 4KB page system.
>>> 
>>> IMHO, it is hard to avoid 16KB size increase for 64KB page support.
>>> Secondary cores can rely on slab.h, but a boot core cannot. So, IRQ
>>> stack for at least a boot cpu should be allocated statically.
>> 
>> Ah, I forgot about the alignment check. The problem we have with your v5
>> patch is that kmalloc() doesn't guarantee this either (see commit
>> 2a0b5c0d1929, "arm64: Align less than PAGE_SIZE pgds naturally", where
>> we had to fix this for pgd_alloc).
> 
> The alignment would be guaranteed under the following additional diff. It is
> possible to remove one pointer read in irq_stack_entry on 64KB page, but it
> leads to code divergence. Am I missing something?
> 
> ----8<----
> 
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index 2755b2f..c480613 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -17,15 +17,17 @@
> #include <asm-generic/irq.h>
> 
> #if IRQ_STACK_SIZE >= PAGE_SIZE
> -static inline void *__alloc_irq_stack(void)
> +static inline void *__alloc_irq_stack(unsigned int cpu)
> {
>        return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO,
>                                        IRQ_STACK_SIZE_ORDER);
> }
> #else
> -static inline void *__alloc_irq_stack(void)
> +DECLARE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
> +
> +static inline void *__alloc_irq_stack(unsigned int cpu)
> {
> -       return kmalloc(IRQ_STACK_SIZE, THREADINFO_GFP | __GFP_ZERO);
> +       return (void *)per_cpu(irq_stack, cpu);
> }
> #endif
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index c8e0bcf..f1303c5 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -177,7 +177,7 @@ alternative_endif
>        .endm
> 
>        .macro  irq_stack_entry
> -       adr_l   x19, irq_stacks
> +       adr_l   x19, irq_stack_ptr
>        mrs     x20, tpidr_el1
>        add     x19, x19, x20
>        ldr     x24, [x19]
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 13fe8f4..acb9a14 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -30,7 +30,10 @@
> 
> unsigned long irq_err_count;
> 
> -DEFINE_PER_CPU(void *, irq_stacks);
> +DEFINE_PER_CPU(void *, irq_stack_ptr);
> +#if IRQ_STACK_SIZE < PAGE_SIZE
> +DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);
> +#endif
> 
> int arch_show_interrupts(struct seq_file *p, int prec)
> {
> @@ -49,13 +52,10 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>        handle_arch_irq = handle_irq;
> }
> 
> -static char boot_irq_stack[IRQ_STACK_SIZE] __aligned(IRQ_STACK_SIZE);
> -
> void __init init_IRQ(void)
> {
> -       unsigned int cpu = smp_processor_id();
> -
> -       per_cpu(irq_stacks, cpu) = boot_irq_stack + IRQ_STACK_START_SP;
> +       if (alloc_irq_stack(smp_processor_id()))
> +               panic("Failed to allocate IRQ stack for a boot cpu");
> 
>        irqchip_init();
>        if (!handle_arch_irq)
> @@ -66,14 +66,14 @@ int alloc_irq_stack(unsigned int cpu)
> {
>        void *stack;
> 
> -       if (per_cpu(irq_stacks, cpu))
> +       if (per_cpu(irq_stack_ptr, cpu))
>                return 0;
> 
> -       stack = __alloc_irq_stack();
> +       stack = __alloc_irq_stack(cpu);
>        if (!stack)
>                return -ENOMEM;
> 
> -       per_cpu(irq_stacks, cpu) = stack + IRQ_STACK_START_SP;
> +       per_cpu(irq_stack_ptr, cpu) = stack + IRQ_STACK_START_SP;
> 
>        return 0;
> }
> 
> ----8<----
> 
> 
>> 
>> I'm leaning more and more towards the x86 approach as I mentioned in the
>> two messages below:
>> 
>> http://article.gmane.org/gmane.linux.kernel/2041877
>> http://article.gmane.org/gmane.linux.kernel/2043002
>> 
>> With a per-cpu stack you can avoid another pointer read, replacing it
>> with a single check for the re-entrance. But note that the update only
>> happens during do_softirq_own_stack() and *not* for every IRQ taken.
> 
> I've reviewed carefully the approach you mentioned about a month ago.
> According to my observation on max stack depth, its context is as follows:
> 
> (1) process context
> (2) hard IRQ raised
> (3) soft IRQ raised in irq_exit()
> (4) another process context
> (5) another hard IRQ raised 
> 
> The below is a stack description under the scenario.
> 
> --- ------- <- High address of stack
>     |     |
>     |     |
> (a) |     | Process context (1)
>     |     |
>     |     |
> --- ------- <- Hard IRQ raised (2)
> (b) |     |
> --- ------- <- Soft IRQ raised in irq_exit() (3)
> (c) |     |
> --- ------- <- Max stack depth by (2)
>     |     |
> (d) |     | Another process context (4)
>     |     |
> --- ------- <- Another hard IRQ raised (5)
> (e) |     |
> --- ------- <- Low address of stack
> 
> The following is max stack depth calculation: The first argument of max() is
> handled by process stack, the second one is handled by IRQ stack. 
> 
> - current status  : max_stack_depth = max((a)+(b)+(c)+(d)+(e), 0)
> - current patch   : max_stack_depth = max((a), (b)+(c)+(d)+(e))
> - do_softirq_own_ : max_stack_depth = max((a)+(b)+(c), (c)+(d)+(e))
> 
> It is a principal objective to build up an infrastructure targeted at reduction
> of process stack size, THREAD_SIZE. Frankly I'm not sure about the inequality,
> (a)+(b)+(c) <= 8KB. If the condition is not satisfied, this feature, IRQ stack
> support, would be questionable. That is, it might be insufficient to manage a
> single out-of-tree patch which adjusts both IRQ_STACK_SIZE and THREAD_SIZE.
> 
> However, if the inequality is guaranteed, do_softirq_own_ approach looks better
> than the current one in operation overhead perspective. BTW, is there a way to
> simplify a top-bit comparison logic?
> 
> Great thanks for valuable feedbacks from which I've learned a lot.


1) Another pointer read

My interpretation of your comment is as follows.

    DEFINE_PER_CPU(char [IRQ_STACK_SIZE], irq_stack) __aligned(IRQ_STACK_SIZE);

    .macro irq_stack_entry
    adr_l x19, irq_stack           // address of the irq_stack symbol
    mrs   x20, tpidr_el1           // this CPU's per-cpu offset
    add   x19, x19, x20            // x19 = this CPU's irq_stack
    mov   x21, #IRQ_STACK_START_SP
    add   x21, x19, x21            // x21 = top of irq_stack
    .endm

If static allocation is used, we could avoid the *ldr* operation in irq_stack_entry.
I think this is the *another pointer read* advantage of static allocation. Right?

2) Static allocation

Since ARM64 relies on the generic setup_per_cpu_areas(), tpidr_el1 is only PAGE_SIZE
aligned. Looking at the architectures which provide their own setup_per_cpu_areas(),
being able to choose their own 'atom_size' parameter is one of the reasons they do so.
If mm/percpu.c gave architectures a chance to override atom_size, static allocation
would also be available on 4KB page systems. If this does not sound unreasonable, I
will try to get feedback via linux-mm. For reference, the implementation might look
as below.

----8<----

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index caebf2a..ab9a1f2 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -9,6 +9,7 @@
 #include <linux/pfn.h>
 #include <linux/init.h>
 
+#include <asm/page.h>
 #include <asm/percpu.h>
 
 /* enough to cover all DEFINE_PER_CPUs in modules */
@@ -18,6 +19,10 @@
 #define PERCPU_MODULE_RESERVE          0
 #endif
 
+#ifndef PERCPU_ATOM_SIZE
+#define PERCPU_ATOM_SIZE               PAGE_SIZE
+#endif
+
 #ifndef PERCPU_ENOUGH_ROOM
 #define PERCPU_ENOUGH_ROOM                                             \
        (ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES) +      \
diff --git a/mm/percpu.c b/mm/percpu.c
index a63b4d8..cd1e0ec 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2201,8 +2201,8 @@ void __init setup_per_cpu_areas(void)
         * what the legacy allocator did.
         */
        rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-                                   PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
-                                   pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
+                                   PERCPU_DYNAMIC_RESERVE, PERCPU_ATOM_SIZE,
+                                   NULL, pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
        if (rc < 0)
                panic("Failed to initialize percpu areas.");
 
@@ -2231,7 +2231,7 @@ void __init setup_per_cpu_areas(void)
 
        ai = pcpu_alloc_alloc_info(1, 1);
        fc = memblock_virt_alloc_from_nopanic(unit_size,
-                                             PAGE_SIZE,
+                                             PERCPU_ATOM_SIZE,
                                              __pa(MAX_DMA_ADDRESS));
        if (!ai || !fc)
                panic("Failed to allocate memory for percpu areas.");

----8<----

By overriding PERCPU_ATOM_SIZE on the architecture side, an aligned stack could be
allocated statically, which gets rid of the extra pointer read.
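
As a sketch of how an architecture might use such a hook (purely illustrative; the
header location and the reuse of IRQ_STACK_SIZE are my assumptions, not part of the
diff above):

    /*
     * Hypothetical arm64 override, e.g. in arch/arm64/include/asm/percpu.h
     * (illustrative only): make the first-chunk atom size IRQ_STACK_SIZE so
     * that a DEFINE_PER_CPU'd IRQ stack keeps its natural alignment.
     */
    #define PERCPU_ATOM_SIZE	IRQ_STACK_SIZE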

Any feedback is welcome.

Best Regards
Jungseok Lee

^ permalink raw reply related	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2015-10-21 15:14 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-07 15:28 [PATCH v4 0/2] arm64: Introduce IRQ stack Jungseok Lee
2015-10-07 15:28 ` Jungseok Lee
2015-10-07 15:28 ` [PATCH v4 1/2] " Jungseok Lee
2015-10-07 15:28   ` Jungseok Lee
2015-10-08 10:25   ` Pratyush Anand
2015-10-08 10:25     ` Pratyush Anand
2015-10-08 14:32     ` Jungseok Lee
2015-10-08 14:32       ` Jungseok Lee
2015-10-08 16:51       ` Pratyush Anand
2015-10-08 16:51         ` Pratyush Anand
2015-10-07 15:28 ` [PATCH v4 2/2] arm64: Expand the stack trace feature to support " Jungseok Lee
2015-10-07 15:28   ` Jungseok Lee
2015-10-09 14:24   ` James Morse
2015-10-09 14:24     ` James Morse
2015-10-12 14:53     ` Jungseok Lee
2015-10-12 14:53       ` Jungseok Lee
2015-10-12 16:34       ` James Morse
2015-10-12 16:34         ` James Morse
2015-10-12 22:13         ` Jungseok Lee
2015-10-12 22:13           ` Jungseok Lee
2015-10-13 11:00           ` James Morse
2015-10-13 11:00             ` James Morse
2015-10-13 15:00             ` Jungseok Lee
2015-10-13 15:00               ` Jungseok Lee
2015-10-14 12:12               ` Jungseok Lee
2015-10-14 12:12                 ` Jungseok Lee
2015-10-15 15:59                 ` James Morse
2015-10-15 15:59                   ` James Morse
2015-10-16 13:01                   ` Jungseok Lee
2015-10-16 13:01                     ` Jungseok Lee
2015-10-16 16:06                     ` Catalin Marinas
2015-10-16 16:06                       ` Catalin Marinas
2015-10-17 13:38                       ` Jungseok Lee
2015-10-17 13:38                         ` Jungseok Lee
2015-10-19 16:18                         ` Catalin Marinas
2015-10-19 16:18                           ` Catalin Marinas
2015-10-20 13:08                           ` Jungseok Lee
2015-10-20 13:08                             ` Jungseok Lee
2015-10-21 15:14                             ` Jungseok Lee
2015-10-21 15:14                               ` Jungseok Lee
2015-10-14  7:13     ` AKASHI Takahiro
2015-10-14  7:13       ` AKASHI Takahiro
2015-10-14 12:24       ` Jungseok Lee
2015-10-14 12:24         ` Jungseok Lee
2015-10-14 12:55         ` Jungseok Lee
2015-10-14 12:55           ` Jungseok Lee
2015-10-15  4:19           ` AKASHI Takahiro
2015-10-15  4:19             ` AKASHI Takahiro
2015-10-15 13:39             ` Jungseok Lee
2015-10-15 13:39               ` Jungseok Lee
2015-10-19  6:47               ` AKASHI Takahiro
2015-10-19  6:47                 ` AKASHI Takahiro
2015-10-20 13:19                 ` Jungseok Lee
2015-10-20 13:19                   ` Jungseok Lee
2015-10-15 14:24     ` Jungseok Lee
2015-10-15 14:24       ` Jungseok Lee
2015-10-15 16:01       ` James Morse
2015-10-15 16:01         ` James Morse
2015-10-16 13:02         ` Jungseok Lee
2015-10-16 13:02           ` Jungseok Lee
