All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Implement IRQ stack on ARM64
@ 2015-09-04 14:23 ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel; +Cc: linux-kernel

ARM64 kernel allocates 16KB kernel stack when creating a process. In case
of low memory platforms with tough workloads on userland, this order-2
allocation request reaches to memory pressure and performance degradation
simultaenously since VM page allocator falls into slowpath frequently,
which triggers page reclaim and compaction.

I believe that one of the best solutions is to reduce kernel stack size.
According to the following data from stack tracer with some fixes, [1],
a separate IRQ stack would greatly help to decrease a kernel stack depth.

	        Depth    Size   Location    (51 entries)
	        -----    ----   --------
	  0)     5352      96   _raw_spin_unlock_irqrestore+0x1c/0x60
	  1)     5256      48   gic_raise_softirq+0xa0/0xbc
	  2)     5208      80   smp_cross_call+0x40/0xbc
	  3)     5128      48   smp_send_reschedule+0x38/0x48
	  4)     5080      32   trigger_load_balance+0x184/0x29c
	  5)     5048     112   scheduler_tick+0xac/0x104
	  6)     4936      64   update_process_times+0x5c/0x74
	  7)     4872      32   tick_sched_handle.isra.15+0x38/0x7c
	  8)     4840      48   tick_sched_timer+0x48/0x90
	  9)     4792      48   __run_hrtimer+0x60/0x258
	 10)     4744      64   hrtimer_interrupt+0xe8/0x260
	 11)     4680     128   arch_timer_handler_virt+0x38/0x48
	 12)     4552      32   handle_percpu_devid_irq+0x84/0x188
	 13)     4520      64   generic_handle_irq+0x38/0x54
	 14)     4456      32   __handle_domain_irq+0x68/0xbc
	 15)     4424      64   gic_handle_irq+0x38/0x88
	 16)     4360     280   el1_irq+0x64/0xd8
	 17)     4080     168   ftrace_ops_no_ops+0xb4/0x16c
	 18)     3912      32   ftrace_call+0x0/0x4
	 19)     3880     144   __alloc_skb+0x48/0x180
	 20)     3736      96   alloc_skb_with_frags+0x74/0x234
	 21)     3640     112   sock_alloc_send_pskb+0x1d0/0x294
	 22)     3528     160   sock_alloc_send_skb+0x44/0x54
	 23)     3368      64   __ip_append_data.isra.40+0x78c/0xb48
	 24)     3304     224   ip_append_data.part.42+0x98/0xe8
	 25)     3080     112   ip_append_data+0x68/0x7c
	 26)     2968      96   icmp_push_reply+0x7c/0x144
	 27)     2872      96   icmp_send+0x3c0/0x3c8
	 28)     2776     192   __udp4_lib_rcv+0x5b8/0x684
	 29)     2584      96   udp_rcv+0x2c/0x3c
	 30)     2488      32   ip_local_deliver+0xa0/0x224
	 31)     2456      48   ip_rcv+0x360/0x57c
	 32)     2408      64   __netif_receive_skb_core+0x4d0/0x80c
	 33)     2344     128   __netif_receive_skb+0x24/0x84
	 34)     2216      32   process_backlog+0x9c/0x15c
	 35)     2184      80   net_rx_action+0x1ec/0x32c
	 36)     2104     160   __do_softirq+0x114/0x2f0
	 37)     1944     128   do_softirq+0x60/0x68
	 38)     1816      32   __local_bh_enable_ip+0xb0/0xd4
	 39)     1784      32   ip_finish_output+0x1f4/0xabc
	 40)     1752      96   ip_output+0xf0/0x120
	 41)     1656      64   ip_local_out_sk+0x44/0x54
	 42)     1592      32   ip_send_skb+0x24/0xbc
	 43)     1560      48   udp_send_skb+0x1b4/0x2f4
	 44)     1512      80   udp_sendmsg+0x2a8/0x7a0
	 45)     1432     272   inet_sendmsg+0xa0/0xd0
	 46)     1160      48   sock_sendmsg+0x30/0x78
	 47)     1112      32   ___sys_sendmsg+0x15c/0x26c
	 48)     1080     400   __sys_sendmmsg+0x94/0x180
	 49)      680     320   SyS_sendmmsg+0x38/0x54
	 50)      360     360   el0_svc_naked+0x20/0x28

So, this patch set implements a separate percpu IRQ stack. 

AFAIK, a stack tracer on ftrace does not work well. Thus, this is a single
todo list at this moment.

This series is written on top of 4.2-rc5 with drangon410c board, and it has
been validated with two different tracks: 4.2-rc5 + Linaro Ubuntu 15.04 and
3.10 + Android.

After this merge window, I will rebase this series and resend it.

Any comments or feedbacks are always welcome.

Thanks in advance!

[1]: https://lkml.org/lkml/2015/7/13/29  

Jungseok Lee (3):
  arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  arm64: Introduce IRQ stack
  arm64: Reduce kernel stack size when using IRQ stack

 arch/arm64/Kconfig.debug             | 10 ++
 arch/arm64/include/asm/irq.h         |  8 ++
 arch/arm64/include/asm/thread_info.h | 19 ++++
 arch/arm64/kernel/asm-offsets.c      |  8 ++
 arch/arm64/kernel/entry.S            | 85 +++++++++++++++-
 arch/arm64/kernel/head.S             |  7 ++
 arch/arm64/kernel/irq.c              | 18 ++++
 7 files changed, 150 insertions(+), 5 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 0/3] Implement IRQ stack on ARM64
@ 2015-09-04 14:23 ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: linux-arm-kernel

ARM64 kernel allocates 16KB kernel stack when creating a process. In case
of low memory platforms with tough workloads on userland, this order-2
allocation request reaches to memory pressure and performance degradation
simultaenously since VM page allocator falls into slowpath frequently,
which triggers page reclaim and compaction.

I believe that one of the best solutions is to reduce kernel stack size.
According to the following data from stack tracer with some fixes, [1],
a separate IRQ stack would greatly help to decrease a kernel stack depth.

	        Depth    Size   Location    (51 entries)
	        -----    ----   --------
	  0)     5352      96   _raw_spin_unlock_irqrestore+0x1c/0x60
	  1)     5256      48   gic_raise_softirq+0xa0/0xbc
	  2)     5208      80   smp_cross_call+0x40/0xbc
	  3)     5128      48   smp_send_reschedule+0x38/0x48
	  4)     5080      32   trigger_load_balance+0x184/0x29c
	  5)     5048     112   scheduler_tick+0xac/0x104
	  6)     4936      64   update_process_times+0x5c/0x74
	  7)     4872      32   tick_sched_handle.isra.15+0x38/0x7c
	  8)     4840      48   tick_sched_timer+0x48/0x90
	  9)     4792      48   __run_hrtimer+0x60/0x258
	 10)     4744      64   hrtimer_interrupt+0xe8/0x260
	 11)     4680     128   arch_timer_handler_virt+0x38/0x48
	 12)     4552      32   handle_percpu_devid_irq+0x84/0x188
	 13)     4520      64   generic_handle_irq+0x38/0x54
	 14)     4456      32   __handle_domain_irq+0x68/0xbc
	 15)     4424      64   gic_handle_irq+0x38/0x88
	 16)     4360     280   el1_irq+0x64/0xd8
	 17)     4080     168   ftrace_ops_no_ops+0xb4/0x16c
	 18)     3912      32   ftrace_call+0x0/0x4
	 19)     3880     144   __alloc_skb+0x48/0x180
	 20)     3736      96   alloc_skb_with_frags+0x74/0x234
	 21)     3640     112   sock_alloc_send_pskb+0x1d0/0x294
	 22)     3528     160   sock_alloc_send_skb+0x44/0x54
	 23)     3368      64   __ip_append_data.isra.40+0x78c/0xb48
	 24)     3304     224   ip_append_data.part.42+0x98/0xe8
	 25)     3080     112   ip_append_data+0x68/0x7c
	 26)     2968      96   icmp_push_reply+0x7c/0x144
	 27)     2872      96   icmp_send+0x3c0/0x3c8
	 28)     2776     192   __udp4_lib_rcv+0x5b8/0x684
	 29)     2584      96   udp_rcv+0x2c/0x3c
	 30)     2488      32   ip_local_deliver+0xa0/0x224
	 31)     2456      48   ip_rcv+0x360/0x57c
	 32)     2408      64   __netif_receive_skb_core+0x4d0/0x80c
	 33)     2344     128   __netif_receive_skb+0x24/0x84
	 34)     2216      32   process_backlog+0x9c/0x15c
	 35)     2184      80   net_rx_action+0x1ec/0x32c
	 36)     2104     160   __do_softirq+0x114/0x2f0
	 37)     1944     128   do_softirq+0x60/0x68
	 38)     1816      32   __local_bh_enable_ip+0xb0/0xd4
	 39)     1784      32   ip_finish_output+0x1f4/0xabc
	 40)     1752      96   ip_output+0xf0/0x120
	 41)     1656      64   ip_local_out_sk+0x44/0x54
	 42)     1592      32   ip_send_skb+0x24/0xbc
	 43)     1560      48   udp_send_skb+0x1b4/0x2f4
	 44)     1512      80   udp_sendmsg+0x2a8/0x7a0
	 45)     1432     272   inet_sendmsg+0xa0/0xd0
	 46)     1160      48   sock_sendmsg+0x30/0x78
	 47)     1112      32   ___sys_sendmsg+0x15c/0x26c
	 48)     1080     400   __sys_sendmmsg+0x94/0x180
	 49)      680     320   SyS_sendmmsg+0x38/0x54
	 50)      360     360   el0_svc_naked+0x20/0x28

So, this patch set implements a separate percpu IRQ stack. 

AFAIK, a stack tracer on ftrace does not work well. Thus, this is a single
todo list at this moment.

This series is written on top of 4.2-rc5 with drangon410c board, and it has
been validated with two different tracks: 4.2-rc5 + Linaro Ubuntu 15.04 and
3.10 + Android.

After this merge window, I will rebase this series and resend it.

Any comments or feedbacks are always welcome.

Thanks in advance!

[1]: https://lkml.org/lkml/2015/7/13/29  

Jungseok Lee (3):
  arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  arm64: Introduce IRQ stack
  arm64: Reduce kernel stack size when using IRQ stack

 arch/arm64/Kconfig.debug             | 10 ++
 arch/arm64/include/asm/irq.h         |  8 ++
 arch/arm64/include/asm/thread_info.h | 19 ++++
 arch/arm64/kernel/asm-offsets.c      |  8 ++
 arch/arm64/kernel/entry.S            | 85 +++++++++++++++-
 arch/arm64/kernel/head.S             |  7 ++
 arch/arm64/kernel/irq.c              | 18 ++++
 7 files changed, 150 insertions(+), 5 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  2015-09-04 14:23 ` Jungseok Lee
@ 2015-09-04 14:23   ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel; +Cc: linux-kernel

Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
is not needed in kernel_entry. Currently, S_SP information is vaild only
when sp_el0 is used.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/kernel/entry.S | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e163518..d23ca0d 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -91,8 +91,6 @@
 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
 	disable_step_tsk x19, x20		// exceptions when scheduling.
-	.else
-	add	x21, sp, #S_FRAME_SIZE
 	.endif
 	mrs	x22, elr_el1
 	mrs	x23, spsr_el1
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
@ 2015-09-04 14:23   ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: linux-arm-kernel

Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
is not needed in kernel_entry. Currently, S_SP information is vaild only
when sp_el0 is used.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/kernel/entry.S | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e163518..d23ca0d 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -91,8 +91,6 @@
 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
 	disable_step_tsk x19, x20		// exceptions when scheduling.
-	.else
-	add	x21, sp, #S_FRAME_SIZE
 	.endif
 	mrs	x22, elr_el1
 	mrs	x23, spsr_el1
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
  2015-09-04 14:23 ` Jungseok Lee
@ 2015-09-04 14:23   ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel; +Cc: linux-kernel

Currently, kernel context and interrupts are handled using a single
kernel stack navigated by sp_el1. This forces many systems to use
16KB stack, not 8KB one. Low memory platforms naturally suffer from
both memory pressure and performance degradation simultaneously as
VM page allocator falls into slowpath frequently.

This patch, thus, solves the problem as introducing a separate percpu
IRQ stack to handle both hard and soft interrupts with two ground rules:

  - Utilize sp_el0 in EL1 context, which is not used currently
  - Do *not* complicate current_thread_info calculation

struct thread_info can be tracked easily using sp_el0, not sp_el1 when
this feature is enabled.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/Kconfig.debug             | 10 ++
 arch/arm64/include/asm/irq.h         |  8 ++
 arch/arm64/include/asm/thread_info.h | 11 ++
 arch/arm64/kernel/asm-offsets.c      |  8 ++
 arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
 arch/arm64/kernel/head.S             |  7 ++
 arch/arm64/kernel/irq.c              | 18 ++++
 7 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index d6285ef..e16d91f 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -18,6 +18,16 @@ config ARM64_PTDUMP
 	  kernel.
 	  If in doubt, say "N"
 
+config IRQ_STACK
+	bool "Use separate kernel stack when handling interrupts"
+	depends on ARM64_4K_PAGES
+	help
+	  Say Y here if you want to use separate kernel stack to handle both
+	  hard and soft interrupts. As reduceing memory footprint regarding
+	  kernel stack, it benefits low memory platforms.
+
+	  If in doubt, say N.
+
 config STRICT_DEVMEM
 	bool "Filter access to /dev/mem"
 	depends on MMU
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index bbb251b..3ec1fa2 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -5,6 +5,14 @@
 
 #include <asm-generic/irq.h>
 
+#ifdef CONFIG_IRQ_STACK
+struct irq_stack {
+	void *stack;
+	unsigned long thread_sp;
+	unsigned int count;
+};
+#endif
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..5345a67 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
  */
 static inline struct thread_info *current_thread_info(void) __attribute_const__;
 
+#ifndef CONFIG_IRQ_STACK
 static inline struct thread_info *current_thread_info(void)
 {
 	return (struct thread_info *)
 		(current_stack_pointer & ~(THREAD_SIZE - 1));
 }
+#else
+static inline struct thread_info *current_thread_info(void)
+{
+	unsigned long sp_el0;
+
+	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
+
+	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
+}
+#endif
 
 #define thread_saved_pc(tsk)	\
 	((unsigned long)(tsk->thread.cpu_context.pc))
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index c99701a..6feee0e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
 #include <linux/kvm_host.h>
+#include <asm/irq.h>
 #include <asm/thread_info.h>
 #include <asm/memory.h>
 #include <asm/smp_plat.h>
@@ -41,6 +42,13 @@ int main(void)
   BLANK();
   DEFINE(THREAD_CPU_CONTEXT,	offsetof(struct task_struct, thread.cpu_context));
   BLANK();
+#ifdef CONFIG_IRQ_STACK
+  DEFINE(IRQ_STACK,		offsetof(struct irq_stack, stack));
+  DEFINE(IRQ_THREAD_SP,		offsetof(struct irq_stack, thread_sp));
+  DEFINE(IRQ_COUNT,		offsetof(struct irq_stack, count));
+  DEFINE(IRQ_STACK_SIZE,	sizeof(struct irq_stack));
+  BLANK();
+#endif
   DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
   DEFINE(S_X1,			offsetof(struct pt_regs, regs[1]));
   DEFINE(S_X2,			offsetof(struct pt_regs, regs[2]));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index d23ca0d..f1fdfa9 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -88,7 +88,11 @@
 
 	.if	\el == 0
 	mrs	x21, sp_el0
+#ifndef CONFIG_IRQ_STACK
 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
+#else
+	get_thread_info \el, tsk
+#endif
 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
 	disable_step_tsk x19, x20		// exceptions when scheduling.
 	.endif
@@ -168,11 +172,56 @@
 	eret					// return to kernel
 	.endm
 
+#ifndef CONFIG_IRQ_STACK
 	.macro	get_thread_info, rd
 	mov	\rd, sp
-	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
+	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
+	.endm
+#else
+	.macro	get_thread_info, el, rd
+	.if	\el == 0
+	mov	\rd, sp
+	.else
+	mrs	\rd, sp_el0
+	.endif
+	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
+	.endm
+
+	.macro	get_irq_stack
+	get_thread_info 1, tsk
+	ldr	w22, [tsk, #TI_CPU]
+	adr_l	x21, irq_stacks
+	mov	x23, #IRQ_STACK_SIZE
+	madd	x21, x22, x23, x21
 	.endm
 
+	.macro	irq_stack_entry
+	get_irq_stack
+	ldr	w23, [x21, #IRQ_COUNT]
+	cbnz	w23, 1f
+	mov	x23, sp
+	str	x23, [x21, #IRQ_THREAD_SP]
+	ldr	x23, [x21, #IRQ_STACK]
+	mov	sp, x23
+	mov	x23, xzr
+1:	add	w23, w23, #1
+	str	w23, [x21, #IRQ_COUNT]
+	.endm
+
+	.macro	irq_stack_exit
+	get_irq_stack
+	ldr	w23, [x21, #IRQ_COUNT]
+	sub	w23, w23, #1
+	cbnz	w23, 1f
+	mov	x23, sp
+	str	x23, [x21, #IRQ_STACK]
+	ldr	x23, [x21, #IRQ_THREAD_SP]
+	mov	sp, x23
+	mov	x23, xzr
+1:	str	w23, [x21, #IRQ_COUNT]
+	.endm
+#endif
+
 /*
  * These are the registers used in the syscall handler, and allow us to
  * have in theory up to 7 arguments to a function - x0 to x6.
@@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
  * Interrupt handling.
  */
 	.macro	irq_handler
-	adrp	x1, handle_arch_irq
-	ldr	x1, [x1, #:lo12:handle_arch_irq]
+	ldr_l	x1, handle_arch_irq
 	mov	x0, sp
+#ifdef CONFIG_IRQ_STACK
+	irq_stack_entry
+#endif
 	blr	x1
+#ifdef CONFIG_IRQ_STACK
+	irq_stack_exit
+#endif
 	.endm
 
 	.text
@@ -366,7 +420,11 @@ el1_irq:
 	irq_handler
 
 #ifdef CONFIG_PREEMPT
+#ifndef CONFIG_IRQ_STACK
 	get_thread_info tsk
+#else
+	get_thread_info 1, tsk
+#endif
 	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
 	cbnz	w24, 1f				// preempt count != 0
 	ldr	x0, [tsk, #TI_FLAGS]		// get flags
@@ -395,6 +453,10 @@ el1_preempt:
 	.align	6
 el0_sync:
 	kernel_entry 0
+#ifdef CONFIG_IRQ_STACK
+	mov	x25, sp
+	msr	sp_el0, x25
+#endif
 	mrs	x25, esr_el1			// read the syndrome register
 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
 	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
@@ -423,6 +485,10 @@ el0_sync:
 	.align	6
 el0_sync_compat:
 	kernel_entry 0, 32
+#ifdef CONFIG_IRQ_STACK
+	mov	x25, sp
+	msr	sp_el0, x25
+#endif
 	mrs	x25, esr_el1			// read the syndrome register
 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
 	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
@@ -560,6 +626,10 @@ ENDPROC(el0_sync)
 el0_irq:
 	kernel_entry 0
 el0_irq_naked:
+#ifdef CONFIG_IRQ_STACK
+	mov	x22, sp
+	msr	sp_el0, x22
+#endif
 	enable_dbg
 #ifdef CONFIG_TRACE_IRQFLAGS
 	bl	trace_hardirqs_off
@@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
 	ldp	x29, x9, [x8], #16
 	ldr	lr, [x8]
 	mov	sp, x9
+#ifdef CONFIG_IRQ_STACK
+	msr	sp_el0, x9
+#endif
 	ret
 ENDPROC(cpu_switch_to)
 
@@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
 	cbz	x19, 1f				// not a kernel thread
 	mov	x0, x20
 	blr	x19
+#ifndef CONFIG_IRQ_STACK
 1:	get_thread_info tsk
+#else
+1:	get_thread_info 1, tsk
+#endif
 	b	ret_to_user
 ENDPROC(ret_from_fork)
 
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index c0ff3ce..3142766 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -446,6 +446,10 @@ __mmap_switched:
 	b	1b
 2:
 	adr_l	sp, initial_sp, x4
+#ifdef CONFIG_IRQ_STACK
+	mov	x4, sp
+	msr	sp_el0, x4
+#endif
 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
 	mov	x29, #0
@@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
 ENTRY(__secondary_switched)
 	ldr	x0, [x21]			// get secondary_data.stack
 	mov	sp, x0
+#ifdef CONFIG_IRQ_STACK
+	msr	sp_el0, x0
+#endif
 	mov	x29, #0
 	b	secondary_start_kernel
 ENDPROC(__secondary_switched)
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 463fa2e..fb0f522 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -31,6 +31,10 @@
 
 unsigned long irq_err_count;
 
+#ifdef CONFIG_IRQ_STACK
+struct irq_stack irq_stacks[NR_CPUS];
+#endif
+
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
 #ifdef CONFIG_SMP
@@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
 
 void __init init_IRQ(void)
 {
+#ifdef CONFIG_IRQ_STACK
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
+							THREAD_SIZE_ORDER);
+		if (!stack)
+			panic("CPU%u: No IRQ stack", cpu);
+
+		irq_stacks[cpu].stack = stack + THREAD_START_SP;
+	}
+	pr_info("IRQ: Allocated IRQ stack successfully\n");
+#endif
+
 	irqchip_init();
 	if (!handle_arch_irq)
 		panic("No interrupt controller found.");
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
@ 2015-09-04 14:23   ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: linux-arm-kernel

Currently, kernel context and interrupts are handled using a single
kernel stack navigated by sp_el1. This forces many systems to use
16KB stack, not 8KB one. Low memory platforms naturally suffer from
both memory pressure and performance degradation simultaneously as
VM page allocator falls into slowpath frequently.

This patch, thus, solves the problem as introducing a separate percpu
IRQ stack to handle both hard and soft interrupts with two ground rules:

  - Utilize sp_el0 in EL1 context, which is not used currently
  - Do *not* complicate current_thread_info calculation

struct thread_info can be tracked easily using sp_el0, not sp_el1 when
this feature is enabled.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/Kconfig.debug             | 10 ++
 arch/arm64/include/asm/irq.h         |  8 ++
 arch/arm64/include/asm/thread_info.h | 11 ++
 arch/arm64/kernel/asm-offsets.c      |  8 ++
 arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
 arch/arm64/kernel/head.S             |  7 ++
 arch/arm64/kernel/irq.c              | 18 ++++
 7 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index d6285ef..e16d91f 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -18,6 +18,16 @@ config ARM64_PTDUMP
 	  kernel.
 	  If in doubt, say "N"
 
+config IRQ_STACK
+	bool "Use separate kernel stack when handling interrupts"
+	depends on ARM64_4K_PAGES
+	help
+	  Say Y here if you want to use separate kernel stack to handle both
+	  hard and soft interrupts. As reduceing memory footprint regarding
+	  kernel stack, it benefits low memory platforms.
+
+	  If in doubt, say N.
+
 config STRICT_DEVMEM
 	bool "Filter access to /dev/mem"
 	depends on MMU
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index bbb251b..3ec1fa2 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -5,6 +5,14 @@
 
 #include <asm-generic/irq.h>
 
+#ifdef CONFIG_IRQ_STACK
+struct irq_stack {
+	void *stack;
+	unsigned long thread_sp;
+	unsigned int count;
+};
+#endif
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..5345a67 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
  */
 static inline struct thread_info *current_thread_info(void) __attribute_const__;
 
+#ifndef CONFIG_IRQ_STACK
 static inline struct thread_info *current_thread_info(void)
 {
 	return (struct thread_info *)
 		(current_stack_pointer & ~(THREAD_SIZE - 1));
 }
+#else
+static inline struct thread_info *current_thread_info(void)
+{
+	unsigned long sp_el0;
+
+	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
+
+	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
+}
+#endif
 
 #define thread_saved_pc(tsk)	\
 	((unsigned long)(tsk->thread.cpu_context.pc))
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index c99701a..6feee0e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
 #include <linux/kvm_host.h>
+#include <asm/irq.h>
 #include <asm/thread_info.h>
 #include <asm/memory.h>
 #include <asm/smp_plat.h>
@@ -41,6 +42,13 @@ int main(void)
   BLANK();
   DEFINE(THREAD_CPU_CONTEXT,	offsetof(struct task_struct, thread.cpu_context));
   BLANK();
+#ifdef CONFIG_IRQ_STACK
+  DEFINE(IRQ_STACK,		offsetof(struct irq_stack, stack));
+  DEFINE(IRQ_THREAD_SP,		offsetof(struct irq_stack, thread_sp));
+  DEFINE(IRQ_COUNT,		offsetof(struct irq_stack, count));
+  DEFINE(IRQ_STACK_SIZE,	sizeof(struct irq_stack));
+  BLANK();
+#endif
   DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
   DEFINE(S_X1,			offsetof(struct pt_regs, regs[1]));
   DEFINE(S_X2,			offsetof(struct pt_regs, regs[2]));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index d23ca0d..f1fdfa9 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -88,7 +88,11 @@
 
 	.if	\el == 0
 	mrs	x21, sp_el0
+#ifndef CONFIG_IRQ_STACK
 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
+#else
+	get_thread_info \el, tsk
+#endif
 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
 	disable_step_tsk x19, x20		// exceptions when scheduling.
 	.endif
@@ -168,11 +172,56 @@
 	eret					// return to kernel
 	.endm
 
+#ifndef CONFIG_IRQ_STACK
 	.macro	get_thread_info, rd
 	mov	\rd, sp
-	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
+	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
+	.endm
+#else
+	.macro	get_thread_info, el, rd
+	.if	\el == 0
+	mov	\rd, sp
+	.else
+	mrs	\rd, sp_el0
+	.endif
+	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
+	.endm
+
+	.macro	get_irq_stack
+	get_thread_info 1, tsk
+	ldr	w22, [tsk, #TI_CPU]
+	adr_l	x21, irq_stacks
+	mov	x23, #IRQ_STACK_SIZE
+	madd	x21, x22, x23, x21
 	.endm
 
+	.macro	irq_stack_entry
+	get_irq_stack
+	ldr	w23, [x21, #IRQ_COUNT]
+	cbnz	w23, 1f
+	mov	x23, sp
+	str	x23, [x21, #IRQ_THREAD_SP]
+	ldr	x23, [x21, #IRQ_STACK]
+	mov	sp, x23
+	mov	x23, xzr
+1:	add	w23, w23, #1
+	str	w23, [x21, #IRQ_COUNT]
+	.endm
+
+	.macro	irq_stack_exit
+	get_irq_stack
+	ldr	w23, [x21, #IRQ_COUNT]
+	sub	w23, w23, #1
+	cbnz	w23, 1f
+	mov	x23, sp
+	str	x23, [x21, #IRQ_STACK]
+	ldr	x23, [x21, #IRQ_THREAD_SP]
+	mov	sp, x23
+	mov	x23, xzr
+1:	str	w23, [x21, #IRQ_COUNT]
+	.endm
+#endif
+
 /*
  * These are the registers used in the syscall handler, and allow us to
  * have in theory up to 7 arguments to a function - x0 to x6.
@@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
  * Interrupt handling.
  */
 	.macro	irq_handler
-	adrp	x1, handle_arch_irq
-	ldr	x1, [x1, #:lo12:handle_arch_irq]
+	ldr_l	x1, handle_arch_irq
 	mov	x0, sp
+#ifdef CONFIG_IRQ_STACK
+	irq_stack_entry
+#endif
 	blr	x1
+#ifdef CONFIG_IRQ_STACK
+	irq_stack_exit
+#endif
 	.endm
 
 	.text
@@ -366,7 +420,11 @@ el1_irq:
 	irq_handler
 
 #ifdef CONFIG_PREEMPT
+#ifndef CONFIG_IRQ_STACK
 	get_thread_info tsk
+#else
+	get_thread_info 1, tsk
+#endif
 	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
 	cbnz	w24, 1f				// preempt count != 0
 	ldr	x0, [tsk, #TI_FLAGS]		// get flags
@@ -395,6 +453,10 @@ el1_preempt:
 	.align	6
 el0_sync:
 	kernel_entry 0
+#ifdef CONFIG_IRQ_STACK
+	mov	x25, sp
+	msr	sp_el0, x25
+#endif
 	mrs	x25, esr_el1			// read the syndrome register
 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
 	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
@@ -423,6 +485,10 @@ el0_sync:
 	.align	6
 el0_sync_compat:
 	kernel_entry 0, 32
+#ifdef CONFIG_IRQ_STACK
+	mov	x25, sp
+	msr	sp_el0, x25
+#endif
 	mrs	x25, esr_el1			// read the syndrome register
 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
 	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
@@ -560,6 +626,10 @@ ENDPROC(el0_sync)
 el0_irq:
 	kernel_entry 0
 el0_irq_naked:
+#ifdef CONFIG_IRQ_STACK
+	mov	x22, sp
+	msr	sp_el0, x22
+#endif
 	enable_dbg
 #ifdef CONFIG_TRACE_IRQFLAGS
 	bl	trace_hardirqs_off
@@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
 	ldp	x29, x9, [x8], #16
 	ldr	lr, [x8]
 	mov	sp, x9
+#ifdef CONFIG_IRQ_STACK
+	msr	sp_el0, x9
+#endif
 	ret
 ENDPROC(cpu_switch_to)
 
@@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
 	cbz	x19, 1f				// not a kernel thread
 	mov	x0, x20
 	blr	x19
+#ifndef CONFIG_IRQ_STACK
 1:	get_thread_info tsk
+#else
+1:	get_thread_info 1, tsk
+#endif
 	b	ret_to_user
 ENDPROC(ret_from_fork)
 
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index c0ff3ce..3142766 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -446,6 +446,10 @@ __mmap_switched:
 	b	1b
 2:
 	adr_l	sp, initial_sp, x4
+#ifdef CONFIG_IRQ_STACK
+	mov	x4, sp
+	msr	sp_el0, x4
+#endif
 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
 	mov	x29, #0
@@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
 ENTRY(__secondary_switched)
 	ldr	x0, [x21]			// get secondary_data.stack
 	mov	sp, x0
+#ifdef CONFIG_IRQ_STACK
+	msr	sp_el0, x0
+#endif
 	mov	x29, #0
 	b	secondary_start_kernel
 ENDPROC(__secondary_switched)
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 463fa2e..fb0f522 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -31,6 +31,10 @@
 
 unsigned long irq_err_count;
 
+#ifdef CONFIG_IRQ_STACK
+struct irq_stack irq_stacks[NR_CPUS];
+#endif
+
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
 #ifdef CONFIG_SMP
@@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
 
 void __init init_IRQ(void)
 {
+#ifdef CONFIG_IRQ_STACK
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
+							THREAD_SIZE_ORDER);
+		if (!stack)
+			panic("CPU%u: No IRQ stack", cpu);
+
+		irq_stacks[cpu].stack = stack + THREAD_START_SP;
+	}
+	pr_info("IRQ: Allocated IRQ stack successfully\n");
+#endif
+
 	irqchip_init();
 	if (!handle_arch_irq)
 		panic("No interrupt controller found.");
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 3/3] arm64: Reduce kernel stack size when using IRQ stack
  2015-09-04 14:23 ` Jungseok Lee
@ 2015-09-04 14:23   ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, linux-arm-kernel; +Cc: linux-kernel

It is a principal objective of IRQ stack feature to reduce kernel
stack size. Therefore, the size is set to 8KB when a separate IRQ
stack is active.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/include/asm/thread_info.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 5345a67..e79210d 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -24,10 +24,18 @@
 #include <linux/compiler.h>
 
 #ifndef CONFIG_ARM64_64K_PAGES
+#ifdef CONFIG_IRQ_STACK
+#define THREAD_SIZE_ORDER	1
+#else
 #define THREAD_SIZE_ORDER	2
 #endif
+#endif
 
+#ifdef CONFIG_IRQ_STACK
+#define THREAD_SIZE		8192
+#else
 #define THREAD_SIZE		16384
+#endif
 #define THREAD_START_SP		(THREAD_SIZE - 16)
 
 #ifndef __ASSEMBLY__
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 3/3] arm64: Reduce kernel stack size when using IRQ stack
@ 2015-09-04 14:23   ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-04 14:23 UTC (permalink / raw)
  To: linux-arm-kernel

It is a principal objective of IRQ stack feature to reduce kernel
stack size. Therefore, the size is set to 8KB when a separate IRQ
stack is active.

Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/include/asm/thread_info.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 5345a67..e79210d 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -24,10 +24,18 @@
 #include <linux/compiler.h>
 
 #ifndef CONFIG_ARM64_64K_PAGES
+#ifdef CONFIG_IRQ_STACK
+#define THREAD_SIZE_ORDER	1
+#else
 #define THREAD_SIZE_ORDER	2
 #endif
+#endif
 
+#ifdef CONFIG_IRQ_STACK
+#define THREAD_SIZE		8192
+#else
 #define THREAD_SIZE		16384
+#endif
 #define THREAD_START_SP		(THREAD_SIZE - 16)
 
 #ifndef __ASSEMBLY__
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] arm64: Introduce IRQ stack
  2015-09-04 14:23   ` Jungseok Lee
@ 2015-09-04 17:12     ` Alexnader Kuleshov
  -1 siblings, 0 replies; 52+ messages in thread
From: Alexnader Kuleshov @ 2015-09-04 17:12 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: catalin.marinas, will.deacon, linux-arm-kernel, linux-kernel

Hello Jungseok,

On 09-04-15, Jungseok Lee wrote:
> +config IRQ_STACK
> +	bool "Use separate kernel stack when handling interrupts"
> +	depends on ARM64_4K_PAGES
> +	help
> +	  Say Y here if you want to use separate kernel stack to handle both
> +	  hard and soft interrupts. As reduceing memory footprint regarding
> +	  kernel stack, it benefits low memory platforms.

s/reduceing/reducing

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
@ 2015-09-04 17:12     ` Alexnader Kuleshov
  0 siblings, 0 replies; 52+ messages in thread
From: Alexnader Kuleshov @ 2015-09-04 17:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Jungseok,

On 09-04-15, Jungseok Lee wrote:
> +config IRQ_STACK
> +	bool "Use separate kernel stack when handling interrupts"
> +	depends on ARM64_4K_PAGES
> +	help
> +	  Say Y here if you want to use separate kernel stack to handle both
> +	  hard and soft interrupts. As reduceing memory footprint regarding
> +	  kernel stack, it benefits low memory platforms.

s/reduceing/reducing

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] arm64: Introduce IRQ stack
  2015-09-04 17:12     ` Alexnader Kuleshov
@ 2015-09-07 14:08       ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 14:08 UTC (permalink / raw)
  To: Alexnader Kuleshov
  Cc: catalin.marinas, will.deacon, linux-arm-kernel, linux-kernel

On Sep 5, 2015, at 2:12 AM, Alexnader Kuleshov wrote:

> Hello Jungseok,

Hello Alexnader,

> On 09-04-15, Jungseok Lee wrote:
>> +config IRQ_STACK
>> +	bool "Use separate kernel stack when handling interrupts"
>> +	depends on ARM64_4K_PAGES
>> +	help
>> +	  Say Y here if you want to use separate kernel stack to handle both
>> +	  hard and soft interrupts. As reduceing memory footprint regarding
>> +	  kernel stack, it benefits low memory platforms.
> 
> s/reduceing/reducing

Thanks for pointing out the typo.

I will update the help message in the next version.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
@ 2015-09-07 14:08       ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 14:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 5, 2015, at 2:12 AM, Alexnader Kuleshov wrote:

> Hello Jungseok,

Hello Alexnader,

> On 09-04-15, Jungseok Lee wrote:
>> +config IRQ_STACK
>> +	bool "Use separate kernel stack when handling interrupts"
>> +	depends on ARM64_4K_PAGES
>> +	help
>> +	  Say Y here if you want to use separate kernel stack to handle both
>> +	  hard and soft interrupts. As reduceing memory footprint regarding
>> +	  kernel stack, it benefits low memory platforms.
> 
> s/reduceing/reducing

Thanks for pointing out the typo.

I will update the help message in the next version.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 0/3] Implement IRQ stack on ARM64
  2015-09-04 14:23 ` Jungseok Lee
@ 2015-09-07 14:33   ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:33 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> ARM64 kernel allocates 16KB kernel stack when creating a process. In case
> of low memory platforms with tough workloads on userland, this order-2
> allocation request reaches to memory pressure and performance degradation
> simultaenously since VM page allocator falls into slowpath frequently,
> which triggers page reclaim and compaction.
> 
> I believe that one of the best solutions is to reduce kernel stack size.
> According to the following data from stack tracer with some fixes, [1],
> a separate IRQ stack would greatly help to decrease a kernel stack depth.
>

Hi Jungseok Lee,

I was working on a similar patch for irq stack, (patch as a follow up email).

I suggest we work together on a single implementation. I think the only
major difference is that you're using sp_el0 as a temporary register to
store a copy of the stack-pointer to find struct thread_info, whereas I was
copying it between stacks (ends up as 2x ldp/stps), which keeps the change
restricted to irq_stack setup code.

We should get some feedback as to which approach is preferred.


Thanks,

James Morse



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 0/3] Implement IRQ stack on ARM64
@ 2015-09-07 14:33   ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:33 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> ARM64 kernel allocates 16KB kernel stack when creating a process. In case
> of low memory platforms with tough workloads on userland, this order-2
> allocation request reaches to memory pressure and performance degradation
> simultaenously since VM page allocator falls into slowpath frequently,
> which triggers page reclaim and compaction.
> 
> I believe that one of the best solutions is to reduce kernel stack size.
> According to the following data from stack tracer with some fixes, [1],
> a separate IRQ stack would greatly help to decrease a kernel stack depth.
>

Hi Jungseok Lee,

I was working on a similar patch for irq stack, (patch as a follow up email).

I suggest we work together on a single implementation. I think the only
major difference is that you're using sp_el0 as a temporary register to
store a copy of the stack-pointer to find struct thread_info, whereas I was
copying it between stacks (ends up as 2x ldp/stps), which keeps the change
restricted to irq_stack setup code.

We should get some feedback as to which approach is preferred.


Thanks,

James Morse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 14:33   ` James Morse
@ 2015-09-07 14:36     ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:36 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: James Morse, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel

Having to handle interrupts on top of an existing kernel stack means the
kernel stack must be large enough to accomodate both the maximum kernel
usage, and the maximum irq handler usage. Switching to a different stack
when processing irqs allows us to make the stack size smaller.

Maximum kernel stack usage (running ltp and generating usb+ethernet
interrupts) was 7256 bytes. With this patch, the same workload gives
a maximum stack usage of 5816 bytes.

Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/arm64/include/asm/irq.h         | 12 +++++++++
 arch/arm64/include/asm/thread_info.h |  8 ++++--
 arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
 arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/smp.c              |  4 +++
 arch/arm64/kernel/stacktrace.c       |  4 ++-
 6 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index bbb251b14746..050d4196c736 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -2,14 +2,20 @@
 #define __ASM_IRQ_H
 
 #include <linux/irqchip/arm-gic-acpi.h>
+#include <linux/percpu.h>
 
 #include <asm-generic/irq.h>
+#include <asm/thread_info.h>
+
+DECLARE_PER_CPU(unsigned long, irq_sp);
 
 struct pt_regs;
 
 extern void migrate_irqs(void);
 extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
 
+extern int alloc_irq_stack(unsigned int cpu);
+
 static inline void acpi_irq_init(void)
 {
 	/*
@@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
 }
 #define acpi_irq_init acpi_irq_init
 
+static inline bool is_irq_stack(unsigned long sp)
+{
+	struct thread_info *ti = get_thread_info(sp);
+	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
+}
+
 #endif
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d18a42a..b906254fc400 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
 /*
  * how to get the thread information struct from C
  */
+static inline struct thread_info *get_thread_info(unsigned long sp)
+{
+	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
+}
+
 static inline struct thread_info *current_thread_info(void) __attribute_const__;
 
 static inline struct thread_info *current_thread_info(void)
 {
-	return (struct thread_info *)
-		(current_stack_pointer & ~(THREAD_SIZE - 1));
+	return get_thread_info(current_stack_pointer);
 }
 
 #define thread_saved_pc(tsk)	\
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e16351819fed..d42371f3f5a1 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
  * Interrupt handling.
  */
 	.macro	irq_handler
-	adrp	x1, handle_arch_irq
-	ldr	x1, [x1, #:lo12:handle_arch_irq]
-	mov	x0, sp
+	mrs	x21, tpidr_el1
+	adr_l	x20, irq_sp
+	add	x20, x20, x21
+
+	ldr	x21, [x20]
+	mov	x20, sp
+
+	mov	x0, x21
+	mov	x1, x20
+	bl	irq_copy_thread_info
+
+	/* test for recursive use of irq_sp */
+	cbz	w0, 1f
+	mrs	x30, elr_el1
+	mov	sp, x21
+
+	/*
+	 * Create a fake stack frame to bump unwind_frame() onto the original
+	 * stack. This relies on x29 not being clobbered by kernel_entry().
+	 */
+	push	x29, x30
+
+1:	ldr_l	x1, handle_arch_irq
+	mov     x0, x20
 	blr	x1
+
+	mov	x0, x20
+	mov	x1, x21
+	bl	irq_copy_thread_info
+	mov	sp, x20
+
 	.endm
 
 	.text
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 463fa2e7e34c..10b57a006da8 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -26,11 +26,14 @@
 #include <linux/smp.h>
 #include <linux/init.h>
 #include <linux/irqchip.h>
+#include <linux/percpu.h>
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
 
 unsigned long irq_err_count;
 
+DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
+
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
 #ifdef CONFIG_SMP
@@ -55,6 +58,10 @@ void __init init_IRQ(void)
 	irqchip_init();
 	if (!handle_arch_irq)
 		panic("No interrupt controller found.");
+
+	/* Allocate an irq stack for the boot cpu */
+	if (alloc_irq_stack(smp_processor_id()))
+		panic("Failed to allocate irq stack for boot cpu.");
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -117,3 +124,48 @@ void migrate_irqs(void)
 	local_irq_restore(flags);
 }
 #endif /* CONFIG_HOTPLUG_CPU */
+
+/* Allocate an irq_stack for a cpu that is about to be brought up. */
+int alloc_irq_stack(unsigned int cpu)
+{
+	struct page *irq_stack_page;
+	union thread_union *irq_stack;
+
+	/* reuse stack allocated previously */
+	if (per_cpu(irq_sp, cpu))
+		return 0;
+
+	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
+	if (!irq_stack_page) {
+		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
+			smp_processor_id(), cpu);
+		return -ENOMEM;
+	}
+	irq_stack = page_address(irq_stack_page);
+
+	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
+			       + THREAD_START_SP;
+
+	return 0;
+}
+
+/*
+ * Copy struct thread_info between two stacks, and update current->stack.
+ * This is used when moving to the irq exception stack.
+ * Changing current->stack is necessary so that non-arch thread_info writers
+ * don't use the new thread_info->task->stack to find the old thread_info when
+ * setting flags like TIF_NEED_RESCHED...
+ */
+asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
+{
+	struct thread_info *src = get_thread_info(src_sp);
+	struct thread_info *dst = get_thread_info(dst_sp);
+
+	if (dst == src)
+		return 0;
+
+	*dst = *src;
+	current->stack = dst;
+
+	return 1;
+}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 50fb4696654e..5283dc5629e4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 {
 	int ret;
 
+	ret = alloc_irq_stack(cpu);
+	if (ret)
+		return ret;
+
 	/*
 	 * We need to tell the secondary core where to find its stack and the
 	 * page tables.
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 407991bf79f5..3d6d5b45aa4b 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/stacktrace.h>
 
+#include <asm/irq.h>
 #include <asm/stacktrace.h>
 
 /*
@@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
 	low  = frame->sp;
 	high = ALIGN(low, THREAD_SIZE);
 
-	if (fp < low || fp > high - 0x18 || fp & 0xf)
+	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
+	    !is_irq_stack(frame->sp))
 		return -EINVAL;
 
 	frame->sp = fp + 0x10;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-07 14:36     ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:36 UTC (permalink / raw)
  To: linux-arm-kernel

Having to handle interrupts on top of an existing kernel stack means the
kernel stack must be large enough to accomodate both the maximum kernel
usage, and the maximum irq handler usage. Switching to a different stack
when processing irqs allows us to make the stack size smaller.

Maximum kernel stack usage (running ltp and generating usb+ethernet
interrupts) was 7256 bytes. With this patch, the same workload gives
a maximum stack usage of 5816 bytes.

Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/arm64/include/asm/irq.h         | 12 +++++++++
 arch/arm64/include/asm/thread_info.h |  8 ++++--
 arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
 arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/smp.c              |  4 +++
 arch/arm64/kernel/stacktrace.c       |  4 ++-
 6 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index bbb251b14746..050d4196c736 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -2,14 +2,20 @@
 #define __ASM_IRQ_H
 
 #include <linux/irqchip/arm-gic-acpi.h>
+#include <linux/percpu.h>
 
 #include <asm-generic/irq.h>
+#include <asm/thread_info.h>
+
+DECLARE_PER_CPU(unsigned long, irq_sp);
 
 struct pt_regs;
 
 extern void migrate_irqs(void);
 extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
 
+extern int alloc_irq_stack(unsigned int cpu);
+
 static inline void acpi_irq_init(void)
 {
 	/*
@@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
 }
 #define acpi_irq_init acpi_irq_init
 
+static inline bool is_irq_stack(unsigned long sp)
+{
+	struct thread_info *ti = get_thread_info(sp);
+	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
+}
+
 #endif
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d18a42a..b906254fc400 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
 /*
  * how to get the thread information struct from C
  */
+static inline struct thread_info *get_thread_info(unsigned long sp)
+{
+	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
+}
+
 static inline struct thread_info *current_thread_info(void) __attribute_const__;
 
 static inline struct thread_info *current_thread_info(void)
 {
-	return (struct thread_info *)
-		(current_stack_pointer & ~(THREAD_SIZE - 1));
+	return get_thread_info(current_stack_pointer);
 }
 
 #define thread_saved_pc(tsk)	\
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e16351819fed..d42371f3f5a1 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
  * Interrupt handling.
  */
 	.macro	irq_handler
-	adrp	x1, handle_arch_irq
-	ldr	x1, [x1, #:lo12:handle_arch_irq]
-	mov	x0, sp
+	mrs	x21, tpidr_el1
+	adr_l	x20, irq_sp
+	add	x20, x20, x21
+
+	ldr	x21, [x20]
+	mov	x20, sp
+
+	mov	x0, x21
+	mov	x1, x20
+	bl	irq_copy_thread_info
+
+	/* test for recursive use of irq_sp */
+	cbz	w0, 1f
+	mrs	x30, elr_el1
+	mov	sp, x21
+
+	/*
+	 * Create a fake stack frame to bump unwind_frame() onto the original
+	 * stack. This relies on x29 not being clobbered by kernel_entry().
+	 */
+	push	x29, x30
+
+1:	ldr_l	x1, handle_arch_irq
+	mov     x0, x20
 	blr	x1
+
+	mov	x0, x20
+	mov	x1, x21
+	bl	irq_copy_thread_info
+	mov	sp, x20
+
 	.endm
 
 	.text
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 463fa2e7e34c..10b57a006da8 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -26,11 +26,14 @@
 #include <linux/smp.h>
 #include <linux/init.h>
 #include <linux/irqchip.h>
+#include <linux/percpu.h>
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
 
 unsigned long irq_err_count;
 
+DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
+
 int arch_show_interrupts(struct seq_file *p, int prec)
 {
 #ifdef CONFIG_SMP
@@ -55,6 +58,10 @@ void __init init_IRQ(void)
 	irqchip_init();
 	if (!handle_arch_irq)
 		panic("No interrupt controller found.");
+
+	/* Allocate an irq stack for the boot cpu */
+	if (alloc_irq_stack(smp_processor_id()))
+		panic("Failed to allocate irq stack for boot cpu.");
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -117,3 +124,48 @@ void migrate_irqs(void)
 	local_irq_restore(flags);
 }
 #endif /* CONFIG_HOTPLUG_CPU */
+
+/* Allocate an irq_stack for a cpu that is about to be brought up. */
+int alloc_irq_stack(unsigned int cpu)
+{
+	struct page *irq_stack_page;
+	union thread_union *irq_stack;
+
+	/* reuse stack allocated previously */
+	if (per_cpu(irq_sp, cpu))
+		return 0;
+
+	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
+	if (!irq_stack_page) {
+		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
+			smp_processor_id(), cpu);
+		return -ENOMEM;
+	}
+	irq_stack = page_address(irq_stack_page);
+
+	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
+			       + THREAD_START_SP;
+
+	return 0;
+}
+
+/*
+ * Copy struct thread_info between two stacks, and update current->stack.
+ * This is used when moving to the irq exception stack.
+ * Changing current->stack is necessary so that non-arch thread_info writers
+ * don't use the new thread_info->task->stack to find the old thread_info when
+ * setting flags like TIF_NEED_RESCHED...
+ */
+asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
+{
+	struct thread_info *src = get_thread_info(src_sp);
+	struct thread_info *dst = get_thread_info(dst_sp);
+
+	if (dst == src)
+		return 0;
+
+	*dst = *src;
+	current->stack = dst;
+
+	return 1;
+}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 50fb4696654e..5283dc5629e4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 {
 	int ret;
 
+	ret = alloc_irq_stack(cpu);
+	if (ret)
+		return ret;
+
 	/*
 	 * We need to tell the secondary core where to find its stack and the
 	 * page tables.
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 407991bf79f5..3d6d5b45aa4b 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/stacktrace.h>
 
+#include <asm/irq.h>
 #include <asm/stacktrace.h>
 
 /*
@@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
 	low  = frame->sp;
 	high = ALIGN(low, THREAD_SIZE);
 
-	if (fp < low || fp > high - 0x18 || fp & 0xf)
+	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
+	    !is_irq_stack(frame->sp))
 		return -EINVAL;
 
 	frame->sp = fp + 0x10;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] arm64: Introduce IRQ stack
  2015-09-04 14:23   ` Jungseok Lee
@ 2015-09-07 14:48     ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:48 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> Currently, kernel context and interrupts are handled using a single
> kernel stack navigated by sp_el1. This forces many systems to use
> 16KB stack, not 8KB one. Low memory platforms naturally suffer from
> both memory pressure and performance degradation simultaneously as
> VM page allocator falls into slowpath frequently.
> 
> This patch, thus, solves the problem as introducing a separate percpu
> IRQ stack to handle both hard and soft interrupts with two ground rules:
> 
>   - Utilize sp_el0 in EL1 context, which is not used currently
>   - Do *not* complicate current_thread_info calculation
> 
> struct thread_info can be tracked easily using sp_el0, not sp_el1 when
> this feature is enabled.
> 
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> ---
>  arch/arm64/Kconfig.debug             | 10 ++
>  arch/arm64/include/asm/irq.h         |  8 ++
>  arch/arm64/include/asm/thread_info.h | 11 ++
>  arch/arm64/kernel/asm-offsets.c      |  8 ++
>  arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
>  arch/arm64/kernel/head.S             |  7 ++
>  arch/arm64/kernel/irq.c              | 18 ++++
>  7 files changed, 142 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
> index d6285ef..e16d91f 100644
> --- a/arch/arm64/Kconfig.debug
> +++ b/arch/arm64/Kconfig.debug
> @@ -18,6 +18,16 @@ config ARM64_PTDUMP
>  	  kernel.
>  	  If in doubt, say "N"
>  
> +config IRQ_STACK
> +	bool "Use separate kernel stack when handling interrupts"
> +	depends on ARM64_4K_PAGES
> +	help
> +	  Say Y here if you want to use separate kernel stack to handle both
> +	  hard and soft interrupts. As reduceing memory footprint regarding
> +	  kernel stack, it benefits low memory platforms.
> +
> +	  If in doubt, say N.
> +

I don't think it is necessary to have a debug-only Kconfig option for this.
Reducing memory use is good for everyone!

This would let you get rid of all the #ifdefs


>  config STRICT_DEVMEM
>  	bool "Filter access to /dev/mem"
>  	depends on MMU
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..5345a67 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
>   */
>  static inline struct thread_info *current_thread_info(void) __attribute_const__;
>  
> +#ifndef CONFIG_IRQ_STACK
>  static inline struct thread_info *current_thread_info(void)
>  {
>  	return (struct thread_info *)
>  		(current_stack_pointer & ~(THREAD_SIZE - 1));
>  }
> +#else
> +static inline struct thread_info *current_thread_info(void)
> +{
> +	unsigned long sp_el0;
> +
> +	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
> +
> +	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
> +}
> +#endif

Because sp_el0 is only used as a stack value to find struct thread_info,
you could just store the struct thread_info pointer in sp_el0, and save the
masking on each read of the value.


>  
>  #define thread_saved_pc(tsk)	\
>  	((unsigned long)(tsk->thread.cpu_context.pc))
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index d23ca0d..f1fdfa9 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -88,7 +88,11 @@
>  
>  	.if	\el == 0
>  	mrs	x21, sp_el0
> +#ifndef CONFIG_IRQ_STACK
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
> +#else
> +	get_thread_info \el, tsk
> +#endif
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
>  	.endif
> @@ -168,11 +172,56 @@
>  	eret					// return to kernel
>  	.endm
>  
> +#ifndef CONFIG_IRQ_STACK
>  	.macro	get_thread_info, rd
>  	mov	\rd, sp
> -	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
> +	.endm
> +#else
> +	.macro	get_thread_info, el, rd
> +	.if	\el == 0
> +	mov	\rd, sp
> +	.else
> +	mrs	\rd, sp_el0
> +	.endif
> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
> +	.endm
> +
> +	.macro	get_irq_stack
> +	get_thread_info 1, tsk
> +	ldr	w22, [tsk, #TI_CPU]
> +	adr_l	x21, irq_stacks
> +	mov	x23, #IRQ_STACK_SIZE
> +	madd	x21, x22, x23, x21
>  	.endm

Using per_cpu variables would save the multiply here.
You then wouldn't need IRQ_STACK_SIZE.


>  
> +	.macro	irq_stack_entry
> +	get_irq_stack
> +	ldr	w23, [x21, #IRQ_COUNT]
> +	cbnz	w23, 1f
> +	mov	x23, sp
> +	str	x23, [x21, #IRQ_THREAD_SP]
> +	ldr	x23, [x21, #IRQ_STACK]
> +	mov	sp, x23
> +	mov	x23, xzr
> +1:	add	w23, w23, #1
> +	str	w23, [x21, #IRQ_COUNT]

Why do you need to count the number of irqs?
struct thread_info:preempt_count already does something similar (but its
not usable this early...)

An alternative way would be to compare the top bits of the two stack
pointers - all we care about is irq recursion right?

This would save storing 'int count' for each cpu, and the logic to
inc/decrement it.


> +	.endm
> +
> +	.macro	irq_stack_exit
> +	get_irq_stack
> +	ldr	w23, [x21, #IRQ_COUNT]
> +	sub	w23, w23, #1
> +	cbnz	w23, 1f
> +	mov	x23, sp
> +	str	x23, [x21, #IRQ_STACK]
> +	ldr	x23, [x21, #IRQ_THREAD_SP]
> +	mov	sp, x23
> +	mov	x23, xzr
> +1:	str	w23, [x21, #IRQ_COUNT]
> +	.endm
> +#endif
> +
>  /*
>   * These are the registers used in the syscall handler, and allow us to
>   * have in theory up to 7 arguments to a function - x0 to x6.
> @@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
>   * Interrupt handling.
>   */
>  	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> +	ldr_l	x1, handle_arch_irq
>  	mov	x0, sp
> +#ifdef CONFIG_IRQ_STACK
> +	irq_stack_entry
> +#endif
>  	blr	x1
> +#ifdef CONFIG_IRQ_STACK
> +	irq_stack_exit
> +#endif
>  	.endm
>  
>  	.text
> @@ -366,7 +420,11 @@ el1_irq:
>  	irq_handler
>  
>  #ifdef CONFIG_PREEMPT
> +#ifndef CONFIG_IRQ_STACK
>  	get_thread_info tsk
> +#else
> +	get_thread_info 1, tsk
> +#endif
>  	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
>  	cbnz	w24, 1f				// preempt count != 0
>  	ldr	x0, [tsk, #TI_FLAGS]		// get flags
> @@ -395,6 +453,10 @@ el1_preempt:
>  	.align	6
>  el0_sync:
>  	kernel_entry 0
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x25, sp
> +	msr	sp_el0, x25
> +#endif

Could you do this in kernel_entry?


>  	mrs	x25, esr_el1			// read the syndrome register
>  	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>  	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
> @@ -423,6 +485,10 @@ el0_sync:
>  	.align	6
>  el0_sync_compat:
>  	kernel_entry 0, 32
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x25, sp
> +	msr	sp_el0, x25
> +#endif
>  	mrs	x25, esr_el1			// read the syndrome register
>  	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>  	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
> @@ -560,6 +626,10 @@ ENDPROC(el0_sync)
>  el0_irq:
>  	kernel_entry 0
>  el0_irq_naked:
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x22, sp
> +	msr	sp_el0, x22
> +#endif
>  	enable_dbg
>  #ifdef CONFIG_TRACE_IRQFLAGS
>  	bl	trace_hardirqs_off
> @@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
>  	ldp	x29, x9, [x8], #16
>  	ldr	lr, [x8]
>  	mov	sp, x9
> +#ifdef CONFIG_IRQ_STACK
> +	msr	sp_el0, x9
> +#endif
>  	ret
>  ENDPROC(cpu_switch_to)
>  
> @@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
>  	cbz	x19, 1f				// not a kernel thread
>  	mov	x0, x20
>  	blr	x19
> +#ifndef CONFIG_IRQ_STACK
>  1:	get_thread_info tsk
> +#else
> +1:	get_thread_info 1, tsk
> +#endif
>  	b	ret_to_user
>  ENDPROC(ret_from_fork)
>  
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index c0ff3ce..3142766 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -446,6 +446,10 @@ __mmap_switched:
>  	b	1b
>  2:
>  	adr_l	sp, initial_sp, x4
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x4, sp
> +	msr	sp_el0, x4
> +#endif
>  	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
>  	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
>  	mov	x29, #0
> @@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
>  ENTRY(__secondary_switched)
>  	ldr	x0, [x21]			// get secondary_data.stack
>  	mov	sp, x0
> +#ifdef CONFIG_IRQ_STACK
> +	msr	sp_el0, x0
> +#endif
>  	mov	x29, #0
>  	b	secondary_start_kernel
>  ENDPROC(__secondary_switched)
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e..fb0f522 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -31,6 +31,10 @@
>  
>  unsigned long irq_err_count;
>  
> +#ifdef CONFIG_IRQ_STACK
> +struct irq_stack irq_stacks[NR_CPUS];
> +#endif
> +

DEFINE_PER_CPU()?


>  int arch_show_interrupts(struct seq_file *p, int prec)
>  {
>  #ifdef CONFIG_SMP
> @@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>  
>  void __init init_IRQ(void)
>  {
> +#ifdef CONFIG_IRQ_STACK
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
> +							THREAD_SIZE_ORDER);
> +		if (!stack)
> +			panic("CPU%u: No IRQ stack", cpu);
> +
> +		irq_stacks[cpu].stack = stack + THREAD_START_SP;
> +	}
> +	pr_info("IRQ: Allocated IRQ stack successfully\n");
> +#endif
> +

Wouldn't it be better to only allocate the stack when the CPU is about to
be brought up? (put the alloc code in __cpu_up()).


Thanks,

James

>  	irqchip_init();
>  	if (!handle_arch_irq)
>  		panic("No interrupt controller found.");
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
@ 2015-09-07 14:48     ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> Currently, kernel context and interrupts are handled using a single
> kernel stack navigated by sp_el1. This forces many systems to use
> 16KB stack, not 8KB one. Low memory platforms naturally suffer from
> both memory pressure and performance degradation simultaneously as
> VM page allocator falls into slowpath frequently.
> 
> This patch, thus, solves the problem as introducing a separate percpu
> IRQ stack to handle both hard and soft interrupts with two ground rules:
> 
>   - Utilize sp_el0 in EL1 context, which is not used currently
>   - Do *not* complicate current_thread_info calculation
> 
> struct thread_info can be tracked easily using sp_el0, not sp_el1 when
> this feature is enabled.
> 
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> ---
>  arch/arm64/Kconfig.debug             | 10 ++
>  arch/arm64/include/asm/irq.h         |  8 ++
>  arch/arm64/include/asm/thread_info.h | 11 ++
>  arch/arm64/kernel/asm-offsets.c      |  8 ++
>  arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
>  arch/arm64/kernel/head.S             |  7 ++
>  arch/arm64/kernel/irq.c              | 18 ++++
>  7 files changed, 142 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
> index d6285ef..e16d91f 100644
> --- a/arch/arm64/Kconfig.debug
> +++ b/arch/arm64/Kconfig.debug
> @@ -18,6 +18,16 @@ config ARM64_PTDUMP
>  	  kernel.
>  	  If in doubt, say "N"
>  
> +config IRQ_STACK
> +	bool "Use separate kernel stack when handling interrupts"
> +	depends on ARM64_4K_PAGES
> +	help
> +	  Say Y here if you want to use separate kernel stack to handle both
> +	  hard and soft interrupts. As reduceing memory footprint regarding
> +	  kernel stack, it benefits low memory platforms.
> +
> +	  If in doubt, say N.
> +

I don't think it is necessary to have a debug-only Kconfig option for this.
Reducing memory use is good for everyone!

This would let you get rid of all the #ifdefs


>  config STRICT_DEVMEM
>  	bool "Filter access to /dev/mem"
>  	depends on MMU
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..5345a67 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
>   */
>  static inline struct thread_info *current_thread_info(void) __attribute_const__;
>  
> +#ifndef CONFIG_IRQ_STACK
>  static inline struct thread_info *current_thread_info(void)
>  {
>  	return (struct thread_info *)
>  		(current_stack_pointer & ~(THREAD_SIZE - 1));
>  }
> +#else
> +static inline struct thread_info *current_thread_info(void)
> +{
> +	unsigned long sp_el0;
> +
> +	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
> +
> +	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
> +}
> +#endif

Because sp_el0 is only used as a stack value to find struct thread_info,
you could just store the struct thread_info pointer in sp_el0, and save the
masking on each read of the value.


>  
>  #define thread_saved_pc(tsk)	\
>  	((unsigned long)(tsk->thread.cpu_context.pc))
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index d23ca0d..f1fdfa9 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -88,7 +88,11 @@
>  
>  	.if	\el == 0
>  	mrs	x21, sp_el0
> +#ifndef CONFIG_IRQ_STACK
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
> +#else
> +	get_thread_info \el, tsk
> +#endif
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
>  	.endif
> @@ -168,11 +172,56 @@
>  	eret					// return to kernel
>  	.endm
>  
> +#ifndef CONFIG_IRQ_STACK
>  	.macro	get_thread_info, rd
>  	mov	\rd, sp
> -	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
> +	.endm
> +#else
> +	.macro	get_thread_info, el, rd
> +	.if	\el == 0
> +	mov	\rd, sp
> +	.else
> +	mrs	\rd, sp_el0
> +	.endif
> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
> +	.endm
> +
> +	.macro	get_irq_stack
> +	get_thread_info 1, tsk
> +	ldr	w22, [tsk, #TI_CPU]
> +	adr_l	x21, irq_stacks
> +	mov	x23, #IRQ_STACK_SIZE
> +	madd	x21, x22, x23, x21
>  	.endm

Using per_cpu variables would save the multiply here.
You then wouldn't need IRQ_STACK_SIZE.


>  
> +	.macro	irq_stack_entry
> +	get_irq_stack
> +	ldr	w23, [x21, #IRQ_COUNT]
> +	cbnz	w23, 1f
> +	mov	x23, sp
> +	str	x23, [x21, #IRQ_THREAD_SP]
> +	ldr	x23, [x21, #IRQ_STACK]
> +	mov	sp, x23
> +	mov	x23, xzr
> +1:	add	w23, w23, #1
> +	str	w23, [x21, #IRQ_COUNT]

Why do you need to count the number of irqs?
struct thread_info:preempt_count already does something similar (but its
not usable this early...)

An alternative way would be to compare the top bits of the two stack
pointers - all we care about is irq recursion right?

This would save storing 'int count' for each cpu, and the logic to
inc/decrement it.


> +	.endm
> +
> +	.macro	irq_stack_exit
> +	get_irq_stack
> +	ldr	w23, [x21, #IRQ_COUNT]
> +	sub	w23, w23, #1
> +	cbnz	w23, 1f
> +	mov	x23, sp
> +	str	x23, [x21, #IRQ_STACK]
> +	ldr	x23, [x21, #IRQ_THREAD_SP]
> +	mov	sp, x23
> +	mov	x23, xzr
> +1:	str	w23, [x21, #IRQ_COUNT]
> +	.endm
> +#endif
> +
>  /*
>   * These are the registers used in the syscall handler, and allow us to
>   * have in theory up to 7 arguments to a function - x0 to x6.
> @@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
>   * Interrupt handling.
>   */
>  	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> +	ldr_l	x1, handle_arch_irq
>  	mov	x0, sp
> +#ifdef CONFIG_IRQ_STACK
> +	irq_stack_entry
> +#endif
>  	blr	x1
> +#ifdef CONFIG_IRQ_STACK
> +	irq_stack_exit
> +#endif
>  	.endm
>  
>  	.text
> @@ -366,7 +420,11 @@ el1_irq:
>  	irq_handler
>  
>  #ifdef CONFIG_PREEMPT
> +#ifndef CONFIG_IRQ_STACK
>  	get_thread_info tsk
> +#else
> +	get_thread_info 1, tsk
> +#endif
>  	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
>  	cbnz	w24, 1f				// preempt count != 0
>  	ldr	x0, [tsk, #TI_FLAGS]		// get flags
> @@ -395,6 +453,10 @@ el1_preempt:
>  	.align	6
>  el0_sync:
>  	kernel_entry 0
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x25, sp
> +	msr	sp_el0, x25
> +#endif

Could you do this in kernel_entry?


>  	mrs	x25, esr_el1			// read the syndrome register
>  	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>  	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
> @@ -423,6 +485,10 @@ el0_sync:
>  	.align	6
>  el0_sync_compat:
>  	kernel_entry 0, 32
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x25, sp
> +	msr	sp_el0, x25
> +#endif
>  	mrs	x25, esr_el1			// read the syndrome register
>  	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>  	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
> @@ -560,6 +626,10 @@ ENDPROC(el0_sync)
>  el0_irq:
>  	kernel_entry 0
>  el0_irq_naked:
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x22, sp
> +	msr	sp_el0, x22
> +#endif
>  	enable_dbg
>  #ifdef CONFIG_TRACE_IRQFLAGS
>  	bl	trace_hardirqs_off
> @@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
>  	ldp	x29, x9, [x8], #16
>  	ldr	lr, [x8]
>  	mov	sp, x9
> +#ifdef CONFIG_IRQ_STACK
> +	msr	sp_el0, x9
> +#endif
>  	ret
>  ENDPROC(cpu_switch_to)
>  
> @@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
>  	cbz	x19, 1f				// not a kernel thread
>  	mov	x0, x20
>  	blr	x19
> +#ifndef CONFIG_IRQ_STACK
>  1:	get_thread_info tsk
> +#else
> +1:	get_thread_info 1, tsk
> +#endif
>  	b	ret_to_user
>  ENDPROC(ret_from_fork)
>  
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index c0ff3ce..3142766 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -446,6 +446,10 @@ __mmap_switched:
>  	b	1b
>  2:
>  	adr_l	sp, initial_sp, x4
> +#ifdef CONFIG_IRQ_STACK
> +	mov	x4, sp
> +	msr	sp_el0, x4
> +#endif
>  	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
>  	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
>  	mov	x29, #0
> @@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
>  ENTRY(__secondary_switched)
>  	ldr	x0, [x21]			// get secondary_data.stack
>  	mov	sp, x0
> +#ifdef CONFIG_IRQ_STACK
> +	msr	sp_el0, x0
> +#endif
>  	mov	x29, #0
>  	b	secondary_start_kernel
>  ENDPROC(__secondary_switched)
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e..fb0f522 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -31,6 +31,10 @@
>  
>  unsigned long irq_err_count;
>  
> +#ifdef CONFIG_IRQ_STACK
> +struct irq_stack irq_stacks[NR_CPUS];
> +#endif
> +

DEFINE_PER_CPU()?


>  int arch_show_interrupts(struct seq_file *p, int prec)
>  {
>  #ifdef CONFIG_SMP
> @@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>  
>  void __init init_IRQ(void)
>  {
> +#ifdef CONFIG_IRQ_STACK
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
> +							THREAD_SIZE_ORDER);
> +		if (!stack)
> +			panic("CPU%u: No IRQ stack", cpu);
> +
> +		irq_stacks[cpu].stack = stack + THREAD_START_SP;
> +	}
> +	pr_info("IRQ: Allocated IRQ stack successfully\n");
> +#endif
> +

Wouldn't it be better to only allocate the stack when the CPU is about to
be brought up? (put the alloc code in __cpu_up()).


Thanks,

James

>  	irqchip_init();
>  	if (!handle_arch_irq)
>  		panic("No interrupt controller found.");
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  2015-09-04 14:23   ` Jungseok Lee
@ 2015-09-07 14:48     ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:48 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
> is not needed in kernel_entry. Currently, S_SP information is vaild only
> when sp_el0 is used.
> 
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e163518..d23ca0d 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -91,8 +91,6 @@
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
> -	.else
> -	add	x21, sp, #S_FRAME_SIZE
>  	.endif
>  	mrs	x22, elr_el1
>  	mrs	x23, spsr_el1
> 

This sp value gets written to the struct pt_regs that is built on the
stack, and passed to the fault handlers, see 'el1_sp_pc' in kernel/entry.S,
which goes on to call do_sp_pc_abort() which prints this value out. (Other
fault handlers may make decisions based on this value).
It should be present and correct.


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
@ 2015-09-07 14:48     ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/09/15 15:23, Jungseok Lee wrote:
> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
> is not needed in kernel_entry. Currently, S_SP information is vaild only
> when sp_el0 is used.
> 
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e163518..d23ca0d 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -91,8 +91,6 @@
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
> -	.else
> -	add	x21, sp, #S_FRAME_SIZE
>  	.endif
>  	mrs	x22, elr_el1
>  	mrs	x23, spsr_el1
> 

This sp value gets written to the struct pt_regs that is built on the
stack, and passed to the fault handlers, see 'el1_sp_pc' in kernel/entry.S,
which goes on to call do_sp_pc_abort() which prints this value out. (Other
fault handlers may make decisions based on this value).
It should be present and correct.


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  2015-09-04 14:23   ` Jungseok Lee
@ 2015-09-07 14:56     ` Mark Rutland
  -1 siblings, 0 replies; 52+ messages in thread
From: Mark Rutland @ 2015-09-07 14:56 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	Dave P Martin, James Morse

On Fri, Sep 04, 2015 at 03:23:05PM +0100, Jungseok Lee wrote:
> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
> is not needed in kernel_entry. Currently, S_SP information is vaild only
> when sp_el0 is used.

I don't think this is true. The generic BUG implementation will grab the
saved SP from the pt_regs, and with this change we'll report whatever
happened to be in x21 instead.

> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> ---
>  arch/arm64/kernel/entry.S | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e163518..d23ca0d 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -91,8 +91,6 @@
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
> -	.else
> -	add	x21, sp, #S_FRAME_SIZE
>  	.endif
>  	mrs	x22, elr_el1
>  	mrs	x23, spsr_el1

Immediately after this we do:

	stp     lr, x21, [sp, #S_LR]

To store the LR and SP to the pt_regs which bug_handler would use.

Am I missing smoething?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
@ 2015-09-07 14:56     ` Mark Rutland
  0 siblings, 0 replies; 52+ messages in thread
From: Mark Rutland @ 2015-09-07 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 04, 2015 at 03:23:05PM +0100, Jungseok Lee wrote:
> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
> is not needed in kernel_entry. Currently, S_SP information is vaild only
> when sp_el0 is used.

I don't think this is true. The generic BUG implementation will grab the
saved SP from the pt_regs, and with this change we'll report whatever
happened to be in x21 instead.

> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> ---
>  arch/arm64/kernel/entry.S | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e163518..d23ca0d 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -91,8 +91,6 @@
>  	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>  	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>  	disable_step_tsk x19, x20		// exceptions when scheduling.
> -	.else
> -	add	x21, sp, #S_FRAME_SIZE
>  	.endif
>  	mrs	x22, elr_el1
>  	mrs	x23, spsr_el1

Immediately after this we do:

	stp     lr, x21, [sp, #S_LR]

To store the LR and SP to the pt_regs which bug_handler would use.

Am I missing smoething?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 0/3] Implement IRQ stack on ARM64
  2015-09-07 14:33   ` James Morse
@ 2015-09-07 15:42     ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:42 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 7, 2015, at 11:33 PM, James Morse wrote:
> On 04/09/15 15:23, Jungseok Lee wrote:
>> ARM64 kernel allocates 16KB kernel stack when creating a process. In case
>> of low memory platforms with tough workloads on userland, this order-2
>> allocation request reaches to memory pressure and performance degradation
>> simultaenously since VM page allocator falls into slowpath frequently,
>> which triggers page reclaim and compaction.
>> 
>> I believe that one of the best solutions is to reduce kernel stack size.
>> According to the following data from stack tracer with some fixes, [1],
>> a separate IRQ stack would greatly help to decrease a kernel stack depth.
>> 
> 
> Hi Jungseok Lee,

Hi James Morse,

> I was working on a similar patch for irq stack, (patch as a follow up email).
> 
> I suggest we work together on a single implementation. I think the only
> major difference is that you're using sp_el0 as a temporary register to
> store a copy of the stack-pointer to find struct thread_info, whereas I was
> copying it between stacks (ends up as 2x ldp/stps), which keeps the change
> restricted to irq_stack setup code.
> 
> We should get some feedback as to which approach is preferred.

Great idea!
I'd really like to figure out the most ideal implementation of this feature.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 0/3] Implement IRQ stack on ARM64
@ 2015-09-07 15:42     ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 7, 2015, at 11:33 PM, James Morse wrote:
> On 04/09/15 15:23, Jungseok Lee wrote:
>> ARM64 kernel allocates 16KB kernel stack when creating a process. In case
>> of low memory platforms with tough workloads on userland, this order-2
>> allocation request reaches to memory pressure and performance degradation
>> simultaenously since VM page allocator falls into slowpath frequently,
>> which triggers page reclaim and compaction.
>> 
>> I believe that one of the best solutions is to reduce kernel stack size.
>> According to the following data from stack tracer with some fixes, [1],
>> a separate IRQ stack would greatly help to decrease a kernel stack depth.
>> 
> 
> Hi Jungseok Lee,

Hi James Morse,

> I was working on a similar patch for irq stack, (patch as a follow up email).
> 
> I suggest we work together on a single implementation. I think the only
> major difference is that you're using sp_el0 as a temporary register to
> store a copy of the stack-pointer to find struct thread_info, whereas I was
> copying it between stacks (ends up as 2x ldp/stps), which keeps the change
> restricted to irq_stack setup code.
> 
> We should get some feedback as to which approach is preferred.

Great idea!
I'd really like to figure out the most ideal implementation of this feature.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 14:36     ` James Morse
@ 2015-09-07 15:48       ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:48 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 7, 2015, at 11:36 PM, James Morse wrote:

Hi James,

> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
> 
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.

I'd like to know how to measure the max stack depth.
AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
region and find or track down an untouched region? 

I will leave comments after reading and playing with this change carefully.

Best Regards
Jungseok Lee

> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> arch/arm64/include/asm/irq.h         | 12 +++++++++
> arch/arm64/include/asm/thread_info.h |  8 ++++--
> arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
> arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
> arch/arm64/kernel/smp.c              |  4 +++
> arch/arm64/kernel/stacktrace.c       |  4 ++-
> 6 files changed, 107 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
> #define __ASM_IRQ_H
> 
> #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
> 
> #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
> 
> struct pt_regs;
> 
> extern void migrate_irqs(void);
> extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
> 
> +extern int alloc_irq_stack(unsigned int cpu);
> +
> static inline void acpi_irq_init(void)
> {
> 	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
> }
> #define acpi_irq_init acpi_irq_init
> 
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
> #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
> /*
>  * how to get the thread information struct from C
>  */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
> static inline struct thread_info *current_thread_info(void) __attribute_const__;
> 
> static inline struct thread_info *current_thread_info(void)
> {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
> }
> 
> #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>  * Interrupt handling.
>  */
> 	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30
> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
> 	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
> 	.endm
> 
> 	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
> #include <linux/smp.h>
> #include <linux/init.h>
> #include <linux/irqchip.h>
> +#include <linux/percpu.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> 
> unsigned long irq_err_count;
> 
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
> int arch_show_interrupts(struct seq_file *p, int prec)
> {
> #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
> 	irqchip_init();
> 	if (!handle_arch_irq)
> 		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
> }
> 
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
> 	local_irq_restore(flags);
> }
> #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;
> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 50fb4696654e..5283dc5629e4 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
> {
> 	int ret;
> 
> +	ret = alloc_irq_stack(cpu);
> +	if (ret)
> +		return ret;
> +
> 	/*
> 	 * We need to tell the secondary core where to find its stack and the
> 	 * page tables.
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991bf79f5..3d6d5b45aa4b 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -20,6 +20,7 @@
> #include <linux/sched.h>
> #include <linux/stacktrace.h>
> 
> +#include <asm/irq.h>
> #include <asm/stacktrace.h>
> 
> /*
> @@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
> 	low  = frame->sp;
> 	high = ALIGN(low, THREAD_SIZE);
> 
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
> +	    !is_irq_stack(frame->sp))
> 		return -EINVAL;
> 
> 	frame->sp = fp + 0x10;
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-07 15:48       ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 7, 2015, at 11:36 PM, James Morse wrote:

Hi James,

> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
> 
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.

I'd like to know how to measure the max stack depth.
AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
region and find or track down an untouched region? 

I will leave comments after reading and playing with this change carefully.

Best Regards
Jungseok Lee

> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> arch/arm64/include/asm/irq.h         | 12 +++++++++
> arch/arm64/include/asm/thread_info.h |  8 ++++--
> arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
> arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
> arch/arm64/kernel/smp.c              |  4 +++
> arch/arm64/kernel/stacktrace.c       |  4 ++-
> 6 files changed, 107 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
> #define __ASM_IRQ_H
> 
> #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
> 
> #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
> 
> struct pt_regs;
> 
> extern void migrate_irqs(void);
> extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
> 
> +extern int alloc_irq_stack(unsigned int cpu);
> +
> static inline void acpi_irq_init(void)
> {
> 	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
> }
> #define acpi_irq_init acpi_irq_init
> 
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
> #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
> /*
>  * how to get the thread information struct from C
>  */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
> static inline struct thread_info *current_thread_info(void) __attribute_const__;
> 
> static inline struct thread_info *current_thread_info(void)
> {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
> }
> 
> #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>  * Interrupt handling.
>  */
> 	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30
> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
> 	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
> 	.endm
> 
> 	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
> #include <linux/smp.h>
> #include <linux/init.h>
> #include <linux/irqchip.h>
> +#include <linux/percpu.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> 
> unsigned long irq_err_count;
> 
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
> int arch_show_interrupts(struct seq_file *p, int prec)
> {
> #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
> 	irqchip_init();
> 	if (!handle_arch_irq)
> 		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
> }
> 
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
> 	local_irq_restore(flags);
> }
> #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;
> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 50fb4696654e..5283dc5629e4 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
> {
> 	int ret;
> 
> +	ret = alloc_irq_stack(cpu);
> +	if (ret)
> +		return ret;
> +
> 	/*
> 	 * We need to tell the secondary core where to find its stack and the
> 	 * page tables.
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991bf79f5..3d6d5b45aa4b 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -20,6 +20,7 @@
> #include <linux/sched.h>
> #include <linux/stacktrace.h>
> 
> +#include <asm/irq.h>
> #include <asm/stacktrace.h>
> 
> /*
> @@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
> 	low  = frame->sp;
> 	high = ALIGN(low, THREAD_SIZE);
> 
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
> +	    !is_irq_stack(frame->sp))
> 		return -EINVAL;
> 
> 	frame->sp = fp + 0x10;
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
  2015-09-07 14:56     ` Mark Rutland
@ 2015-09-07 15:51       ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:51 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	Dave P Martin, James Morse

On Sep 7, 2015, at 11:56 PM, Mark Rutland wrote:

Hi Mark,

> On Fri, Sep 04, 2015 at 03:23:05PM +0100, Jungseok Lee wrote:
>> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
>> is not needed in kernel_entry. Currently, S_SP information is vaild only
>> when sp_el0 is used.
> 
> I don't think this is true. The generic BUG implementation will grab the
> saved SP from the pt_regs, and with this change we'll report whatever
> happened to be in x21 instead.
> 
>> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
>> ---
>> arch/arm64/kernel/entry.S | 2 --
>> 1 file changed, 2 deletions(-)
>> 
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index e163518..d23ca0d 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -91,8 +91,6 @@
>> 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>> 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>> 	disable_step_tsk x19, x20		// exceptions when scheduling.
>> -	.else
>> -	add	x21, sp, #S_FRAME_SIZE
>> 	.endif
>> 	mrs	x22, elr_el1
>> 	mrs	x23, spsr_el1
> 
> Immediately after this we do:
> 
> 	stp     lr, x21, [sp, #S_LR]
> 
> To store the LR and SP to the pt_regs which bug_handler would use.
> 
> Am I missing smoething?

No, You're right. As James mentioned, x21 is used in do_sp_pc_abort.

Thanks for the comment.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
@ 2015-09-07 15:51       ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 15:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 7, 2015, at 11:56 PM, Mark Rutland wrote:

Hi Mark,

> On Fri, Sep 04, 2015 at 03:23:05PM +0100, Jungseok Lee wrote:
>> Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation
>> is not needed in kernel_entry. Currently, S_SP information is vaild only
>> when sp_el0 is used.
> 
> I don't think this is true. The generic BUG implementation will grab the
> saved SP from the pt_regs, and with this change we'll report whatever
> happened to be in x21 instead.
> 
>> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
>> ---
>> arch/arm64/kernel/entry.S | 2 --
>> 1 file changed, 2 deletions(-)
>> 
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index e163518..d23ca0d 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -91,8 +91,6 @@
>> 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>> 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>> 	disable_step_tsk x19, x20		// exceptions when scheduling.
>> -	.else
>> -	add	x21, sp, #S_FRAME_SIZE
>> 	.endif
>> 	mrs	x22, elr_el1
>> 	mrs	x23, spsr_el1
> 
> Immediately after this we do:
> 
> 	stp     lr, x21, [sp, #S_LR]
> 
> To store the LR and SP to the pt_regs which bug_handler would use.
> 
> Am I missing smoething?

No, You're right. As James mentioned, x21 is used in do_sp_pc_abort.

Thanks for the comment.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 15:48       ` Jungseok Lee
@ 2015-09-07 16:06         ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 16:06 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: linux-arm-kernel, linux-kernel

On 07/09/15 16:48, Jungseok Lee wrote:
> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
> 
> Hi James,
> 
>> Having to handle interrupts on top of an existing kernel stack means the
>> kernel stack must be large enough to accomodate both the maximum kernel
>> usage, and the maximum irq handler usage. Switching to a different stack
>> when processing irqs allows us to make the stack size smaller.
>>
>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>> interrupts) was 7256 bytes. With this patch, the same workload gives
>> a maximum stack usage of 5816 bytes.
> 
> I'd like to know how to measure the max stack depth.
> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
> region and find or track down an untouched region? 

I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
'Tracers', then looked in debugfs:/tracing/stack_max_size.

What problems did you encounter?
(I may be missing something...)


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-07 16:06         ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-07 16:06 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/09/15 16:48, Jungseok Lee wrote:
> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
> 
> Hi James,
> 
>> Having to handle interrupts on top of an existing kernel stack means the
>> kernel stack must be large enough to accomodate both the maximum kernel
>> usage, and the maximum irq handler usage. Switching to a different stack
>> when processing irqs allows us to make the stack size smaller.
>>
>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>> interrupts) was 7256 bytes. With this patch, the same workload gives
>> a maximum stack usage of 5816 bytes.
> 
> I'd like to know how to measure the max stack depth.
> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
> region and find or track down an untouched region? 

I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
'Tracers', then looked in debugfs:/tracing/stack_max_size.

What problems did you encounter?
(I may be missing something...)


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 16:06         ` James Morse
@ 2015-09-07 16:34           ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 16:34 UTC (permalink / raw)
  To: James Morse
  Cc: linux-arm-kernel, linux-kernel@vger.kernel.org Mailing List,
	AKASHI Takahiro

On Sep 8, 2015, at 1:06 AM, James Morse wrote:
> On 07/09/15 16:48, Jungseok Lee wrote:
>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>> 
>> Hi James,
>> 
>>> Having to handle interrupts on top of an existing kernel stack means the
>>> kernel stack must be large enough to accomodate both the maximum kernel
>>> usage, and the maximum irq handler usage. Switching to a different stack
>>> when processing irqs allows us to make the stack size smaller.
>>> 
>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>> a maximum stack usage of 5816 bytes.
>> 
>> I'd like to know how to measure the max stack depth.
>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>> region and find or track down an untouched region? 
> 
> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
> 
> What problems did you encounter?
> (I may be missing something…)

When I enabled the feature, all entries had *0* size except the last entry.
It can be reproduced easily as looking in debugs:/tracing/stack_trace.

You can track down my report and Akashi's changes with the following links:
- http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
- https://lkml.org/lkml/2015/7/13/29

Although it is impossible to measure an exact depth at this moment, the feature
could be utilized to check improvement.

Cc'ing Akashi for additional comments if needed.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-07 16:34           ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-07 16:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 8, 2015, at 1:06 AM, James Morse wrote:
> On 07/09/15 16:48, Jungseok Lee wrote:
>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>> 
>> Hi James,
>> 
>>> Having to handle interrupts on top of an existing kernel stack means the
>>> kernel stack must be large enough to accomodate both the maximum kernel
>>> usage, and the maximum irq handler usage. Switching to a different stack
>>> when processing irqs allows us to make the stack size smaller.
>>> 
>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>> a maximum stack usage of 5816 bytes.
>> 
>> I'd like to know how to measure the max stack depth.
>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>> region and find or track down an untouched region? 
> 
> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
> 
> What problems did you encounter?
> (I may be missing something?)

When I enabled the feature, all entries had *0* size except the last entry.
It can be reproduced easily as looking in debugs:/tracing/stack_trace.

You can track down my report and Akashi's changes with the following links:
- http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
- https://lkml.org/lkml/2015/7/13/29

Although it is impossible to measure an exact depth at this moment, the feature
could be utilized to check improvement.

Cc'ing Akashi for additional comments if needed.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 16:34           ` Jungseok Lee
@ 2015-09-08  1:45             ` AKASHI Takahiro
  -1 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  1:45 UTC (permalink / raw)
  To: Jungseok Lee, James Morse
  Cc: linux-arm-kernel, linux-kernel@vger.kernel.org Mailing List

Jungseok,

On 09/08/2015 01:34 AM, Jungseok Lee wrote:
> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>> On 07/09/15 16:48, Jungseok Lee wrote:
>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>
>>> Hi James,
>>>
>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>> when processing irqs allows us to make the stack size smaller.
>>>>
>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>> a maximum stack usage of 5816 bytes.
>>>
>>> I'd like to know how to measure the max stack depth.
>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>> region and find or track down an untouched region?
>>
>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>
>> What problems did you encounter?
>> (I may be missing something…)
>
> When I enabled the feature, all entries had *0* size except the last entry.
> It can be reproduced easily as looking in debugs:/tracing/stack_trace.

I'm afraid that you have not applied one of patches in my RFC:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html

I have not looked into James' patch in details, but hope that it will help
fix one of issues that are annoying me: Stack tracer (actually save_stack_trace())
will miss a function (and its parent function in some case) that is being executed
when an interrupt is taken.


-Takahiro AKASHI

> You can track down my report and Akashi's changes with the following links:
> - http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
> - https://lkml.org/lkml/2015/7/13/29
>
> Although it is impossible to measure an exact depth at this moment, the feature
> could be utilized to check improvement.
>
> Cc'ing Akashi for additional comments if needed.
>
> Best Regards
> Jungseok Lee
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08  1:45             ` AKASHI Takahiro
  0 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  1:45 UTC (permalink / raw)
  To: linux-arm-kernel

Jungseok,

On 09/08/2015 01:34 AM, Jungseok Lee wrote:
> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>> On 07/09/15 16:48, Jungseok Lee wrote:
>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>
>>> Hi James,
>>>
>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>> when processing irqs allows us to make the stack size smaller.
>>>>
>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>> a maximum stack usage of 5816 bytes.
>>>
>>> I'd like to know how to measure the max stack depth.
>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>> region and find or track down an untouched region?
>>
>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>
>> What problems did you encounter?
>> (I may be missing something?)
>
> When I enabled the feature, all entries had *0* size except the last entry.
> It can be reproduced easily as looking in debugs:/tracing/stack_trace.

I'm afraid that you have not applied one of patches in my RFC:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html

I have not looked into James' patch in details, but hope that it will help
fix one of issues that are annoying me: Stack tracer (actually save_stack_trace())
will miss a function (and its parent function in some case) that is being executed
when an interrupt is taken.


-Takahiro AKASHI

> You can track down my report and Akashi's changes with the following links:
> - http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
> - https://lkml.org/lkml/2015/7/13/29
>
> Although it is impossible to measure an exact depth at this moment, the feature
> could be utilized to check improvement.
>
> Cc'ing Akashi for additional comments if needed.
>
> Best Regards
> Jungseok Lee
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-08  1:45             ` AKASHI Takahiro
@ 2015-09-08  6:44               ` AKASHI Takahiro
  -1 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  6:44 UTC (permalink / raw)
  To: Jungseok Lee, James Morse
  Cc: linux-arm-kernel, linux-kernel@vger.kernel.org Mailing List

On 09/08/2015 10:45 AM, AKASHI Takahiro wrote:
> Jungseok,
>
> On 09/08/2015 01:34 AM, Jungseok Lee wrote:
>> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>>> On 07/09/15 16:48, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>>
>>>> Hi James,
>>>>
>>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>>> when processing irqs allows us to make the stack size smaller.
>>>>>
>>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>>> a maximum stack usage of 5816 bytes.
>>>>
>>>> I'd like to know how to measure the max stack depth.
>>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>>> region and find or track down an untouched region?
>>>
>>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>>
>>> What problems did you encounter?
>>> (I may be missing something…)
>>
>> When I enabled the feature, all entries had *0* size except the last entry.
>> It can be reproduced easily as looking in debugs:/tracing/stack_trace.
>
> I'm afraid that you have not applied one of patches in my RFC:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html
>
> I have not looked into James' patch in details, but hope that it will help
> fix one of issues that are annoying me: Stack tracer (actually save_stack_trace())
> will miss a function (and its parent function in some case) that is being executed
> when an interrupt is taken.

Well, it didn't fix the issue:
# cat /sys/kernel/debug/tracing/stack_trace
         Depth    Size   Location    (54 entries)
         -----    ----   --------
   0)     5096      16   irq_copy_thread_info+0x18/0x70
   1)     5080     336   el1_irq+0x78/0x11c

                                  <<<= _raw_spin_unlock_irqrestore

   2)     4744      48   __skb_recv_datagram+0x148/0x49c
   3)     4696     208   skb_recv_datagram+0x50/0x60
   4)     4488      64   xs_udp_data_ready+0x48/0x170
   5)     4424      96   sock_queue_rcv_skb+0x1fc/0x270
  ...
  53)      344     344   el0_svc_naked+0x20/0x28

__skb_recv_datagram+0x148 is "bl _raw_spin_unlock_irqrestore."

And the frames, #0 and #1, appear here because this patch replaces stack pointers
*after* kernel_entry.

-Takahiro AKASHI

>
> -Takahiro AKASHI
>
>> You can track down my report and Akashi's changes with the following links:
>> - http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
>> - https://lkml.org/lkml/2015/7/13/29
>>
>> Although it is impossible to measure an exact depth at this moment, the feature
>> could be utilized to check improvement.
>>
>> Cc'ing Akashi for additional comments if needed.
>>
>> Best Regards
>> Jungseok Lee
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08  6:44               ` AKASHI Takahiro
  0 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  6:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 09/08/2015 10:45 AM, AKASHI Takahiro wrote:
> Jungseok,
>
> On 09/08/2015 01:34 AM, Jungseok Lee wrote:
>> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>>> On 07/09/15 16:48, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>>
>>>> Hi James,
>>>>
>>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>>> when processing irqs allows us to make the stack size smaller.
>>>>>
>>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>>> a maximum stack usage of 5816 bytes.
>>>>
>>>> I'd like to know how to measure the max stack depth.
>>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>>> region and find or track down an untouched region?
>>>
>>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>>
>>> What problems did you encounter?
>>> (I may be missing something?)
>>
>> When I enabled the feature, all entries had *0* size except the last entry.
>> It can be reproduced easily as looking in debugs:/tracing/stack_trace.
>
> I'm afraid that you have not applied one of patches in my RFC:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html
>
> I have not looked into James' patch in details, but hope that it will help
> fix one of issues that are annoying me: Stack tracer (actually save_stack_trace())
> will miss a function (and its parent function in some case) that is being executed
> when an interrupt is taken.

Well, it didn't fix the issue:
# cat /sys/kernel/debug/tracing/stack_trace
         Depth    Size   Location    (54 entries)
         -----    ----   --------
   0)     5096      16   irq_copy_thread_info+0x18/0x70
   1)     5080     336   el1_irq+0x78/0x11c

                                  <<<= _raw_spin_unlock_irqrestore

   2)     4744      48   __skb_recv_datagram+0x148/0x49c
   3)     4696     208   skb_recv_datagram+0x50/0x60
   4)     4488      64   xs_udp_data_ready+0x48/0x170
   5)     4424      96   sock_queue_rcv_skb+0x1fc/0x270
  ...
  53)      344     344   el0_svc_naked+0x20/0x28

__skb_recv_datagram+0x148 is "bl _raw_spin_unlock_irqrestore."

And the frames, #0 and #1, appear here because this patch replaces stack pointers
*after* kernel_entry.

-Takahiro AKASHI

>
> -Takahiro AKASHI
>
>> You can track down my report and Akashi's changes with the following links:
>> - http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
>> - https://lkml.org/lkml/2015/7/13/29
>>
>> Although it is impossible to measure an exact depth at this moment, the feature
>> could be utilized to check improvement.
>>
>> Cc'ing Akashi for additional comments if needed.
>>
>> Best Regards
>> Jungseok Lee
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 14:36     ` James Morse
@ 2015-09-08  7:51       ` AKASHI Takahiro
  -1 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  7:51 UTC (permalink / raw)
  To: James Morse, Jungseok Lee
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 09/07/2015 11:36 PM, James Morse wrote:
> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
>
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.

With stack tracer on, the kernel with this patch ran into BUG() in check_stack():
     if (task_stack_end_corrupted(current))
         BUG();

This is because a check for an interrupt stack with "object_is_on_stack(stack)"
has passed, while your irq stack doesn't have a STACK_END_MAGIC.

> +-+
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>   arch/arm64/include/asm/irq.h         | 12 +++++++++
>   arch/arm64/include/asm/thread_info.h |  8 ++++--
>   arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
>   arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
>   arch/arm64/kernel/smp.c              |  4 +++
>   arch/arm64/kernel/stacktrace.c       |  4 ++-
>   6 files changed, 107 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
>   #define __ASM_IRQ_H
>
>   #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
>
>   #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
>
>   struct pt_regs;
>
>   extern void migrate_irqs(void);
>   extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
>
> +extern int alloc_irq_stack(unsigned int cpu);
> +
>   static inline void acpi_irq_init(void)
>   {
>   	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
>   }
>   #define acpi_irq_init acpi_irq_init
>
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
>   #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
>   /*
>    * how to get the thread information struct from C
>    */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
>   static inline struct thread_info *current_thread_info(void) __attribute_const__;
>
>   static inline struct thread_info *current_thread_info(void)
>   {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
>   }
>
>   #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>    * Interrupt handling.
>    */
>   	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30

I don't think that it works. Since unwind_frame() doesn't care about sp, but
only does fp, a pushed stack frame will never be dereferenced.

-Takahiro AKASHI


> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
>   	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
>   	.endm
>
>   	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
>   #include <linux/smp.h>
>   #include <linux/init.h>
>   #include <linux/irqchip.h>
> +#include <linux/percpu.h>
>   #include <linux/seq_file.h>
>   #include <linux/ratelimit.h>
>
>   unsigned long irq_err_count;
>
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
>   int arch_show_interrupts(struct seq_file *p, int prec)
>   {
>   #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>   	irqchip_init();
>   	if (!handle_arch_irq)
>   		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
>   }
>
>   #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>   	local_irq_restore(flags);
>   }
>   #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;
> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 50fb4696654e..5283dc5629e4 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>   {
>   	int ret;
>
> +	ret = alloc_irq_stack(cpu);
> +	if (ret)
> +		return ret;
> +
>   	/*
>   	 * We need to tell the secondary core where to find its stack and the
>   	 * page tables.
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991bf79f5..3d6d5b45aa4b 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -20,6 +20,7 @@
>   #include <linux/sched.h>
>   #include <linux/stacktrace.h>
>
> +#include <asm/irq.h>
>   #include <asm/stacktrace.h>
>
>   /*
> @@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
>   	low  = frame->sp;
>   	high = ALIGN(low, THREAD_SIZE);
>
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
> +	    !is_irq_stack(frame->sp))
>   		return -EINVAL;
>
>   	frame->sp = fp + 0x10;
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08  7:51       ` AKASHI Takahiro
  0 siblings, 0 replies; 52+ messages in thread
From: AKASHI Takahiro @ 2015-09-08  7:51 UTC (permalink / raw)
  To: linux-arm-kernel

On 09/07/2015 11:36 PM, James Morse wrote:
> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
>
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.

With stack tracer on, the kernel with this patch ran into BUG() in check_stack():
     if (task_stack_end_corrupted(current))
         BUG();

This is because a check for an interrupt stack with "object_is_on_stack(stack)"
has passed, while your irq stack doesn't have a STACK_END_MAGIC.

> +-+
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>   arch/arm64/include/asm/irq.h         | 12 +++++++++
>   arch/arm64/include/asm/thread_info.h |  8 ++++--
>   arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
>   arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
>   arch/arm64/kernel/smp.c              |  4 +++
>   arch/arm64/kernel/stacktrace.c       |  4 ++-
>   6 files changed, 107 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
>   #define __ASM_IRQ_H
>
>   #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
>
>   #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
>
>   struct pt_regs;
>
>   extern void migrate_irqs(void);
>   extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
>
> +extern int alloc_irq_stack(unsigned int cpu);
> +
>   static inline void acpi_irq_init(void)
>   {
>   	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
>   }
>   #define acpi_irq_init acpi_irq_init
>
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
>   #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
>   /*
>    * how to get the thread information struct from C
>    */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
>   static inline struct thread_info *current_thread_info(void) __attribute_const__;
>
>   static inline struct thread_info *current_thread_info(void)
>   {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
>   }
>
>   #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>    * Interrupt handling.
>    */
>   	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30

I don't think that it works. Since unwind_frame() doesn't care about sp, but
only does fp, a pushed stack frame will never be dereferenced.

-Takahiro AKASHI


> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
>   	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
>   	.endm
>
>   	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
>   #include <linux/smp.h>
>   #include <linux/init.h>
>   #include <linux/irqchip.h>
> +#include <linux/percpu.h>
>   #include <linux/seq_file.h>
>   #include <linux/ratelimit.h>
>
>   unsigned long irq_err_count;
>
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
>   int arch_show_interrupts(struct seq_file *p, int prec)
>   {
>   #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>   	irqchip_init();
>   	if (!handle_arch_irq)
>   		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
>   }
>
>   #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>   	local_irq_restore(flags);
>   }
>   #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;
> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 50fb4696654e..5283dc5629e4 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -89,6 +89,10 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>   {
>   	int ret;
>
> +	ret = alloc_irq_stack(cpu);
> +	if (ret)
> +		return ret;
> +
>   	/*
>   	 * We need to tell the secondary core where to find its stack and the
>   	 * page tables.
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991bf79f5..3d6d5b45aa4b 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -20,6 +20,7 @@
>   #include <linux/sched.h>
>   #include <linux/stacktrace.h>
>
> +#include <asm/irq.h>
>   #include <asm/stacktrace.h>
>
>   /*
> @@ -43,7 +44,8 @@ int notrace unwind_frame(struct stackframe *frame)
>   	low  = frame->sp;
>   	high = ALIGN(low, THREAD_SIZE);
>
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	if ((fp < low || fp > high - 0x18 || fp & 0xf) &&
> +	    !is_irq_stack(frame->sp))
>   		return -EINVAL;
>
>   	frame->sp = fp + 0x10;
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] arm64: Introduce IRQ stack
  2015-09-07 14:48     ` James Morse
@ 2015-09-08 14:28       ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:28 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 7, 2015, at 11:48 PM, James Morse wrote:

Hi James,

> On 04/09/15 15:23, Jungseok Lee wrote:
>> Currently, kernel context and interrupts are handled using a single
>> kernel stack navigated by sp_el1. This forces many systems to use
>> 16KB stack, not 8KB one. Low memory platforms naturally suffer from
>> both memory pressure and performance degradation simultaneously as
>> VM page allocator falls into slowpath frequently.
>> 
>> This patch, thus, solves the problem as introducing a separate percpu
>> IRQ stack to handle both hard and soft interrupts with two ground rules:
>> 
>>  - Utilize sp_el0 in EL1 context, which is not used currently
>>  - Do *not* complicate current_thread_info calculation
>> 
>> struct thread_info can be tracked easily using sp_el0, not sp_el1 when
>> this feature is enabled.
>> 
>> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
>> ---
>> arch/arm64/Kconfig.debug             | 10 ++
>> arch/arm64/include/asm/irq.h         |  8 ++
>> arch/arm64/include/asm/thread_info.h | 11 ++
>> arch/arm64/kernel/asm-offsets.c      |  8 ++
>> arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
>> arch/arm64/kernel/head.S             |  7 ++
>> arch/arm64/kernel/irq.c              | 18 ++++
>> 7 files changed, 142 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
>> index d6285ef..e16d91f 100644
>> --- a/arch/arm64/Kconfig.debug
>> +++ b/arch/arm64/Kconfig.debug
>> @@ -18,6 +18,16 @@ config ARM64_PTDUMP
>> 	  kernel.
>> 	  If in doubt, say "N"
>> 
>> +config IRQ_STACK
>> +	bool "Use separate kernel stack when handling interrupts"
>> +	depends on ARM64_4K_PAGES
>> +	help
>> +	  Say Y here if you want to use separate kernel stack to handle both
>> +	  hard and soft interrupts. As reduceing memory footprint regarding
>> +	  kernel stack, it benefits low memory platforms.
>> +
>> +	  If in doubt, say N.
>> +
> 
> I don't think it is necessary to have a debug-only Kconfig option for this.
> Reducing memory use is good for everyone!
> 
> This would let you get rid of all the #ifdefs

Hmm.. A single concern is stability. However, I agree that this is not an optional
feature. Especially, it definitely benefits low memory platforms. I will remove
this snippet in the next version. 

>> config STRICT_DEVMEM
>> 	bool "Filter access to /dev/mem"
>> 	depends on MMU
>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>> index dcd06d1..5345a67 100644
>> --- a/arch/arm64/include/asm/thread_info.h
>> +++ b/arch/arm64/include/asm/thread_info.h
>> @@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
>>  */
>> static inline struct thread_info *current_thread_info(void) __attribute_const__;
>> 
>> +#ifndef CONFIG_IRQ_STACK
>> static inline struct thread_info *current_thread_info(void)
>> {
>> 	return (struct thread_info *)
>> 		(current_stack_pointer & ~(THREAD_SIZE - 1));
>> }
>> +#else
>> +static inline struct thread_info *current_thread_info(void)
>> +{
>> +	unsigned long sp_el0;
>> +
>> +	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
>> +
>> +	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
>> +}
>> +#endif
> 
> Because sp_el0 is only used as a stack value to find struct thread_info,
> you could just store the struct thread_info pointer in sp_el0, and save the
> masking on each read of the value.

IMHO, this logic, masking SP with ~(THREAD_SIZE - 1), is a well-known idiom.
So, I don't want to change the expression. In addition, the same process is
needed before storing the address of struct thread_info. 

>> 
>> #define thread_saved_pc(tsk)	\
>> 	((unsigned long)(tsk->thread.cpu_context.pc))
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index d23ca0d..f1fdfa9 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -88,7 +88,11 @@
>> 
>> 	.if	\el == 0
>> 	mrs	x21, sp_el0
>> +#ifndef CONFIG_IRQ_STACK
>> 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>> +#else
>> +	get_thread_info \el, tsk
>> +#endif
>> 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>> 	disable_step_tsk x19, x20		// exceptions when scheduling.
>> 	.endif
>> @@ -168,11 +172,56 @@
>> 	eret					// return to kernel
>> 	.endm
>> 
>> +#ifndef CONFIG_IRQ_STACK
>> 	.macro	get_thread_info, rd
>> 	mov	\rd, sp
>> -	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
>> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
>> +	.endm
>> +#else
>> +	.macro	get_thread_info, el, rd
>> +	.if	\el == 0
>> +	mov	\rd, sp
>> +	.else
>> +	mrs	\rd, sp_el0
>> +	.endif
>> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
>> +	.endm
>> +
>> +	.macro	get_irq_stack
>> +	get_thread_info 1, tsk
>> +	ldr	w22, [tsk, #TI_CPU]
>> +	adr_l	x21, irq_stacks
>> +	mov	x23, #IRQ_STACK_SIZE
>> +	madd	x21, x22, x23, x21
>> 	.endm
> 
> Using per_cpu variables would save the multiply here.
> You then wouldn't need IRQ_STACK_SIZE.

Agree.

>> 
>> +	.macro	irq_stack_entry
>> +	get_irq_stack
>> +	ldr	w23, [x21, #IRQ_COUNT]
>> +	cbnz	w23, 1f
>> +	mov	x23, sp
>> +	str	x23, [x21, #IRQ_THREAD_SP]
>> +	ldr	x23, [x21, #IRQ_STACK]
>> +	mov	sp, x23
>> +	mov	x23, xzr
>> +1:	add	w23, w23, #1
>> +	str	w23, [x21, #IRQ_COUNT]
> 
> Why do you need to count the number of irqs?
> struct thread_info:preempt_count already does something similar (but its
> not usable this early...)
> 
> An alternative way would be to compare the top bits of the two stack
> pointers - all we care about is irq recursion right?

Exactly. A point is a recursion check, not an actual number in this context.
I'd like to address the recursion issue with minimal load and store operation.
The suggestion sounds good. I will figure out a better and simpler way and
update it with comments in the code.

> This would save storing 'int count' for each cpu, and the logic to
> inc/decrement it.
> 
> 
>> +	.endm
>> +
>> +	.macro	irq_stack_exit
>> +	get_irq_stack
>> +	ldr	w23, [x21, #IRQ_COUNT]
>> +	sub	w23, w23, #1
>> +	cbnz	w23, 1f
>> +	mov	x23, sp
>> +	str	x23, [x21, #IRQ_STACK]
>> +	ldr	x23, [x21, #IRQ_THREAD_SP]
>> +	mov	sp, x23
>> +	mov	x23, xzr
>> +1:	str	w23, [x21, #IRQ_COUNT]
>> +	.endm
>> +#endif
>> +
>> /*
>>  * These are the registers used in the syscall handler, and allow us to
>>  * have in theory up to 7 arguments to a function - x0 to x6.
>> @@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
>>  * Interrupt handling.
>>  */
>> 	.macro	irq_handler
>> -	adrp	x1, handle_arch_irq
>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>> +	ldr_l	x1, handle_arch_irq
>> 	mov	x0, sp
>> +#ifdef CONFIG_IRQ_STACK
>> +	irq_stack_entry
>> +#endif
>> 	blr	x1
>> +#ifdef CONFIG_IRQ_STACK
>> +	irq_stack_exit
>> +#endif
>> 	.endm
>> 
>> 	.text
>> @@ -366,7 +420,11 @@ el1_irq:
>> 	irq_handler
>> 
>> #ifdef CONFIG_PREEMPT
>> +#ifndef CONFIG_IRQ_STACK
>> 	get_thread_info tsk
>> +#else
>> +	get_thread_info 1, tsk
>> +#endif
>> 	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
>> 	cbnz	w24, 1f				// preempt count != 0
>> 	ldr	x0, [tsk, #TI_FLAGS]		// get flags
>> @@ -395,6 +453,10 @@ el1_preempt:
>> 	.align	6
>> el0_sync:
>> 	kernel_entry 0
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x25, sp
>> +	msr	sp_el0, x25
>> +#endif
> 
> Could you do this in kernel_entry?

Yes, I could. It would make the code neater.

>> 	mrs	x25, esr_el1			// read the syndrome register
>> 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>> 	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
>> @@ -423,6 +485,10 @@ el0_sync:
>> 	.align	6
>> el0_sync_compat:
>> 	kernel_entry 0, 32
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x25, sp
>> +	msr	sp_el0, x25
>> +#endif
>> 	mrs	x25, esr_el1			// read the syndrome register
>> 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>> 	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
>> @@ -560,6 +626,10 @@ ENDPROC(el0_sync)
>> el0_irq:
>> 	kernel_entry 0
>> el0_irq_naked:
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x22, sp
>> +	msr	sp_el0, x22
>> +#endif
>> 	enable_dbg
>> #ifdef CONFIG_TRACE_IRQFLAGS
>> 	bl	trace_hardirqs_off
>> @@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
>> 	ldp	x29, x9, [x8], #16
>> 	ldr	lr, [x8]
>> 	mov	sp, x9
>> +#ifdef CONFIG_IRQ_STACK
>> +	msr	sp_el0, x9
>> +#endif
>> 	ret
>> ENDPROC(cpu_switch_to)
>> 
>> @@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
>> 	cbz	x19, 1f				// not a kernel thread
>> 	mov	x0, x20
>> 	blr	x19
>> +#ifndef CONFIG_IRQ_STACK
>> 1:	get_thread_info tsk
>> +#else
>> +1:	get_thread_info 1, tsk
>> +#endif
>> 	b	ret_to_user
>> ENDPROC(ret_from_fork)
>> 
>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>> index c0ff3ce..3142766 100644
>> --- a/arch/arm64/kernel/head.S
>> +++ b/arch/arm64/kernel/head.S
>> @@ -446,6 +446,10 @@ __mmap_switched:
>> 	b	1b
>> 2:
>> 	adr_l	sp, initial_sp, x4
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x4, sp
>> +	msr	sp_el0, x4
>> +#endif
>> 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
>> 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
>> 	mov	x29, #0
>> @@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
>> ENTRY(__secondary_switched)
>> 	ldr	x0, [x21]			// get secondary_data.stack
>> 	mov	sp, x0
>> +#ifdef CONFIG_IRQ_STACK
>> +	msr	sp_el0, x0
>> +#endif
>> 	mov	x29, #0
>> 	b	secondary_start_kernel
>> ENDPROC(__secondary_switched)
>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>> index 463fa2e..fb0f522 100644
>> --- a/arch/arm64/kernel/irq.c
>> +++ b/arch/arm64/kernel/irq.c
>> @@ -31,6 +31,10 @@
>> 
>> unsigned long irq_err_count;
>> 
>> +#ifdef CONFIG_IRQ_STACK
>> +struct irq_stack irq_stacks[NR_CPUS];
>> +#endif
>> +
> 
> DEFINE_PER_CPU()?

Yeah, I will update it with per_cpu in the next version.

>> int arch_show_interrupts(struct seq_file *p, int prec)
>> {
>> #ifdef CONFIG_SMP
>> @@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>> 
>> void __init init_IRQ(void)
>> {
>> +#ifdef CONFIG_IRQ_STACK
>> +	unsigned int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
>> +							THREAD_SIZE_ORDER);
>> +		if (!stack)
>> +			panic("CPU%u: No IRQ stack", cpu);
>> +
>> +		irq_stacks[cpu].stack = stack + THREAD_START_SP;
>> +	}
>> +	pr_info("IRQ: Allocated IRQ stack successfully\n");
>> +#endif
>> +
> 
> Wouldn't it be better to only allocate the stack when the CPU is about to
> be brought up? (put the alloc code in __cpu_up()).

__cpu_up could be called very frequently if a power management scheme based
on CPU hotplug is actively used. So, I'd like to avoid even a simple logic
which checks the previously allocated IRQ stack. 

Thanks for the valuable feedbacks.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] arm64: Introduce IRQ stack
@ 2015-09-08 14:28       ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 7, 2015, at 11:48 PM, James Morse wrote:

Hi James,

> On 04/09/15 15:23, Jungseok Lee wrote:
>> Currently, kernel context and interrupts are handled using a single
>> kernel stack navigated by sp_el1. This forces many systems to use
>> 16KB stack, not 8KB one. Low memory platforms naturally suffer from
>> both memory pressure and performance degradation simultaneously as
>> VM page allocator falls into slowpath frequently.
>> 
>> This patch, thus, solves the problem as introducing a separate percpu
>> IRQ stack to handle both hard and soft interrupts with two ground rules:
>> 
>>  - Utilize sp_el0 in EL1 context, which is not used currently
>>  - Do *not* complicate current_thread_info calculation
>> 
>> struct thread_info can be tracked easily using sp_el0, not sp_el1 when
>> this feature is enabled.
>> 
>> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
>> ---
>> arch/arm64/Kconfig.debug             | 10 ++
>> arch/arm64/include/asm/irq.h         |  8 ++
>> arch/arm64/include/asm/thread_info.h | 11 ++
>> arch/arm64/kernel/asm-offsets.c      |  8 ++
>> arch/arm64/kernel/entry.S            | 83 +++++++++++++++-
>> arch/arm64/kernel/head.S             |  7 ++
>> arch/arm64/kernel/irq.c              | 18 ++++
>> 7 files changed, 142 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
>> index d6285ef..e16d91f 100644
>> --- a/arch/arm64/Kconfig.debug
>> +++ b/arch/arm64/Kconfig.debug
>> @@ -18,6 +18,16 @@ config ARM64_PTDUMP
>> 	  kernel.
>> 	  If in doubt, say "N"
>> 
>> +config IRQ_STACK
>> +	bool "Use separate kernel stack when handling interrupts"
>> +	depends on ARM64_4K_PAGES
>> +	help
>> +	  Say Y here if you want to use separate kernel stack to handle both
>> +	  hard and soft interrupts. As reduceing memory footprint regarding
>> +	  kernel stack, it benefits low memory platforms.
>> +
>> +	  If in doubt, say N.
>> +
> 
> I don't think it is necessary to have a debug-only Kconfig option for this.
> Reducing memory use is good for everyone!
> 
> This would let you get rid of all the #ifdefs

Hmm.. A single concern is stability. However, I agree that this is not an optional
feature. Especially, it definitely benefits low memory platforms. I will remove
this snippet in the next version. 

>> config STRICT_DEVMEM
>> 	bool "Filter access to /dev/mem"
>> 	depends on MMU
>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>> index dcd06d1..5345a67 100644
>> --- a/arch/arm64/include/asm/thread_info.h
>> +++ b/arch/arm64/include/asm/thread_info.h
>> @@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp");
>>  */
>> static inline struct thread_info *current_thread_info(void) __attribute_const__;
>> 
>> +#ifndef CONFIG_IRQ_STACK
>> static inline struct thread_info *current_thread_info(void)
>> {
>> 	return (struct thread_info *)
>> 		(current_stack_pointer & ~(THREAD_SIZE - 1));
>> }
>> +#else
>> +static inline struct thread_info *current_thread_info(void)
>> +{
>> +	unsigned long sp_el0;
>> +
>> +	asm volatile("mrs %0, sp_el0" : "=r" (sp_el0));
>> +
>> +	return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1));
>> +}
>> +#endif
> 
> Because sp_el0 is only used as a stack value to find struct thread_info,
> you could just store the struct thread_info pointer in sp_el0, and save the
> masking on each read of the value.

IMHO, this logic, masking SP with ~(THREAD_SIZE - 1), is a well-known idiom.
So, I don't want to change the expression. In addition, the same process is
needed before storing the address of struct thread_info. 

>> 
>> #define thread_saved_pc(tsk)	\
>> 	((unsigned long)(tsk->thread.cpu_context.pc))
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index d23ca0d..f1fdfa9 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -88,7 +88,11 @@
>> 
>> 	.if	\el == 0
>> 	mrs	x21, sp_el0
>> +#ifndef CONFIG_IRQ_STACK
>> 	get_thread_info tsk			// Ensure MDSCR_EL1.SS is clear,
>> +#else
>> +	get_thread_info \el, tsk
>> +#endif
>> 	ldr	x19, [tsk, #TI_FLAGS]		// since we can unmask debug
>> 	disable_step_tsk x19, x20		// exceptions when scheduling.
>> 	.endif
>> @@ -168,11 +172,56 @@
>> 	eret					// return to kernel
>> 	.endm
>> 
>> +#ifndef CONFIG_IRQ_STACK
>> 	.macro	get_thread_info, rd
>> 	mov	\rd, sp
>> -	and	\rd, \rd, #~(THREAD_SIZE - 1)	// top of stack
>> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of stack
>> +	.endm
>> +#else
>> +	.macro	get_thread_info, el, rd
>> +	.if	\el == 0
>> +	mov	\rd, sp
>> +	.else
>> +	mrs	\rd, sp_el0
>> +	.endif
>> +	and	\rd, \rd, #~(THREAD_SIZE - 1)	// bottom of thread stack
>> +	.endm
>> +
>> +	.macro	get_irq_stack
>> +	get_thread_info 1, tsk
>> +	ldr	w22, [tsk, #TI_CPU]
>> +	adr_l	x21, irq_stacks
>> +	mov	x23, #IRQ_STACK_SIZE
>> +	madd	x21, x22, x23, x21
>> 	.endm
> 
> Using per_cpu variables would save the multiply here.
> You then wouldn't need IRQ_STACK_SIZE.

Agree.

>> 
>> +	.macro	irq_stack_entry
>> +	get_irq_stack
>> +	ldr	w23, [x21, #IRQ_COUNT]
>> +	cbnz	w23, 1f
>> +	mov	x23, sp
>> +	str	x23, [x21, #IRQ_THREAD_SP]
>> +	ldr	x23, [x21, #IRQ_STACK]
>> +	mov	sp, x23
>> +	mov	x23, xzr
>> +1:	add	w23, w23, #1
>> +	str	w23, [x21, #IRQ_COUNT]
> 
> Why do you need to count the number of irqs?
> struct thread_info:preempt_count already does something similar (but its
> not usable this early...)
> 
> An alternative way would be to compare the top bits of the two stack
> pointers - all we care about is irq recursion right?

Exactly. A point is a recursion check, not an actual number in this context.
I'd like to address the recursion issue with minimal load and store operation.
The suggestion sounds good. I will figure out a better and simpler way and
update it with comments in the code.

> This would save storing 'int count' for each cpu, and the logic to
> inc/decrement it.
> 
> 
>> +	.endm
>> +
>> +	.macro	irq_stack_exit
>> +	get_irq_stack
>> +	ldr	w23, [x21, #IRQ_COUNT]
>> +	sub	w23, w23, #1
>> +	cbnz	w23, 1f
>> +	mov	x23, sp
>> +	str	x23, [x21, #IRQ_STACK]
>> +	ldr	x23, [x21, #IRQ_THREAD_SP]
>> +	mov	sp, x23
>> +	mov	x23, xzr
>> +1:	str	w23, [x21, #IRQ_COUNT]
>> +	.endm
>> +#endif
>> +
>> /*
>>  * These are the registers used in the syscall handler, and allow us to
>>  * have in theory up to 7 arguments to a function - x0 to x6.
>> @@ -188,10 +237,15 @@ tsk	.req	x28		// current thread_info
>>  * Interrupt handling.
>>  */
>> 	.macro	irq_handler
>> -	adrp	x1, handle_arch_irq
>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>> +	ldr_l	x1, handle_arch_irq
>> 	mov	x0, sp
>> +#ifdef CONFIG_IRQ_STACK
>> +	irq_stack_entry
>> +#endif
>> 	blr	x1
>> +#ifdef CONFIG_IRQ_STACK
>> +	irq_stack_exit
>> +#endif
>> 	.endm
>> 
>> 	.text
>> @@ -366,7 +420,11 @@ el1_irq:
>> 	irq_handler
>> 
>> #ifdef CONFIG_PREEMPT
>> +#ifndef CONFIG_IRQ_STACK
>> 	get_thread_info tsk
>> +#else
>> +	get_thread_info 1, tsk
>> +#endif
>> 	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
>> 	cbnz	w24, 1f				// preempt count != 0
>> 	ldr	x0, [tsk, #TI_FLAGS]		// get flags
>> @@ -395,6 +453,10 @@ el1_preempt:
>> 	.align	6
>> el0_sync:
>> 	kernel_entry 0
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x25, sp
>> +	msr	sp_el0, x25
>> +#endif
> 
> Could you do this in kernel_entry?

Yes, I could. It would make the code neater.

>> 	mrs	x25, esr_el1			// read the syndrome register
>> 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>> 	cmp	x24, #ESR_ELx_EC_SVC64		// SVC in 64-bit state
>> @@ -423,6 +485,10 @@ el0_sync:
>> 	.align	6
>> el0_sync_compat:
>> 	kernel_entry 0, 32
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x25, sp
>> +	msr	sp_el0, x25
>> +#endif
>> 	mrs	x25, esr_el1			// read the syndrome register
>> 	lsr	x24, x25, #ESR_ELx_EC_SHIFT	// exception class
>> 	cmp	x24, #ESR_ELx_EC_SVC32		// SVC in 32-bit state
>> @@ -560,6 +626,10 @@ ENDPROC(el0_sync)
>> el0_irq:
>> 	kernel_entry 0
>> el0_irq_naked:
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x22, sp
>> +	msr	sp_el0, x22
>> +#endif
>> 	enable_dbg
>> #ifdef CONFIG_TRACE_IRQFLAGS
>> 	bl	trace_hardirqs_off
>> @@ -602,6 +672,9 @@ ENTRY(cpu_switch_to)
>> 	ldp	x29, x9, [x8], #16
>> 	ldr	lr, [x8]
>> 	mov	sp, x9
>> +#ifdef CONFIG_IRQ_STACK
>> +	msr	sp_el0, x9
>> +#endif
>> 	ret
>> ENDPROC(cpu_switch_to)
>> 
>> @@ -661,7 +734,11 @@ ENTRY(ret_from_fork)
>> 	cbz	x19, 1f				// not a kernel thread
>> 	mov	x0, x20
>> 	blr	x19
>> +#ifndef CONFIG_IRQ_STACK
>> 1:	get_thread_info tsk
>> +#else
>> +1:	get_thread_info 1, tsk
>> +#endif
>> 	b	ret_to_user
>> ENDPROC(ret_from_fork)
>> 
>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>> index c0ff3ce..3142766 100644
>> --- a/arch/arm64/kernel/head.S
>> +++ b/arch/arm64/kernel/head.S
>> @@ -446,6 +446,10 @@ __mmap_switched:
>> 	b	1b
>> 2:
>> 	adr_l	sp, initial_sp, x4
>> +#ifdef CONFIG_IRQ_STACK
>> +	mov	x4, sp
>> +	msr	sp_el0, x4
>> +#endif
>> 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
>> 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
>> 	mov	x29, #0
>> @@ -619,6 +623,9 @@ ENDPROC(secondary_startup)
>> ENTRY(__secondary_switched)
>> 	ldr	x0, [x21]			// get secondary_data.stack
>> 	mov	sp, x0
>> +#ifdef CONFIG_IRQ_STACK
>> +	msr	sp_el0, x0
>> +#endif
>> 	mov	x29, #0
>> 	b	secondary_start_kernel
>> ENDPROC(__secondary_switched)
>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>> index 463fa2e..fb0f522 100644
>> --- a/arch/arm64/kernel/irq.c
>> +++ b/arch/arm64/kernel/irq.c
>> @@ -31,6 +31,10 @@
>> 
>> unsigned long irq_err_count;
>> 
>> +#ifdef CONFIG_IRQ_STACK
>> +struct irq_stack irq_stacks[NR_CPUS];
>> +#endif
>> +
> 
> DEFINE_PER_CPU()?

Yeah, I will update it with per_cpu in the next version.

>> int arch_show_interrupts(struct seq_file *p, int prec)
>> {
>> #ifdef CONFIG_SMP
>> @@ -52,6 +56,20 @@ void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
>> 
>> void __init init_IRQ(void)
>> {
>> +#ifdef CONFIG_IRQ_STACK
>> +	unsigned int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		void *stack = (void *)__get_free_pages(THREADINFO_GFP,
>> +							THREAD_SIZE_ORDER);
>> +		if (!stack)
>> +			panic("CPU%u: No IRQ stack", cpu);
>> +
>> +		irq_stacks[cpu].stack = stack + THREAD_START_SP;
>> +	}
>> +	pr_info("IRQ: Allocated IRQ stack successfully\n");
>> +#endif
>> +
> 
> Wouldn't it be better to only allocate the stack when the CPU is about to
> be brought up? (put the alloc code in __cpu_up()).

__cpu_up could be called very frequently if a power management scheme based
on CPU hotplug is actively used. So, I'd like to avoid even a simple logic
which checks the previously allocated IRQ stack. 

Thanks for the valuable feedbacks.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-07 14:36     ` James Morse
@ 2015-09-08 14:54       ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:54 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 7, 2015, at 11:36 PM, James Morse wrote:

Hi James,

> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
> 
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> arch/arm64/include/asm/irq.h         | 12 +++++++++
> arch/arm64/include/asm/thread_info.h |  8 ++++--
> arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
> arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
> arch/arm64/kernel/smp.c              |  4 +++
> arch/arm64/kernel/stacktrace.c       |  4 ++-
> 6 files changed, 107 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
> #define __ASM_IRQ_H
> 
> #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
> 
> #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
> 
> struct pt_regs;
> 
> extern void migrate_irqs(void);
> extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
> 
> +extern int alloc_irq_stack(unsigned int cpu);
> +
> static inline void acpi_irq_init(void)
> {
> 	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
> }
> #define acpi_irq_init acpi_irq_init
> 
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
> #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
> /*
>  * how to get the thread information struct from C
>  */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
> static inline struct thread_info *current_thread_info(void) __attribute_const__;
> 
> static inline struct thread_info *current_thread_info(void)
> {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
> }
> 
> #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>  * Interrupt handling.
>  */
> 	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30

It is the most challenging item to handle a frame pointer in this context.

As I mentioned previously, a stack tracer on ftrace is not supported yet.
How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?

> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
> 	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
> 	.endm
> 
> 	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
> #include <linux/smp.h>
> #include <linux/init.h>
> #include <linux/irqchip.h>
> +#include <linux/percpu.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> 
> unsigned long irq_err_count;
> 
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
> int arch_show_interrupts(struct seq_file *p, int prec)
> {
> #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
> 	irqchip_init();
> 	if (!handle_arch_irq)
> 		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
> }
> 
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
> 	local_irq_restore(flags);
> }
> #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;

I'd like to avoid even this simple check since CPU hotplug could be heavily
used for power management.

> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}

Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
twice to handle a single interrupt. Isn's it too expensive?

This is a major difference between my approach and this patch. I think we should get
some feedbacks on struct thread_info information management for v2 preparation.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08 14:54       ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 7, 2015, at 11:36 PM, James Morse wrote:

Hi James,

> Having to handle interrupts on top of an existing kernel stack means the
> kernel stack must be large enough to accomodate both the maximum kernel
> usage, and the maximum irq handler usage. Switching to a different stack
> when processing irqs allows us to make the stack size smaller.
> 
> Maximum kernel stack usage (running ltp and generating usb+ethernet
> interrupts) was 7256 bytes. With this patch, the same workload gives
> a maximum stack usage of 5816 bytes.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> arch/arm64/include/asm/irq.h         | 12 +++++++++
> arch/arm64/include/asm/thread_info.h |  8 ++++--
> arch/arm64/kernel/entry.S            | 33 ++++++++++++++++++++---
> arch/arm64/kernel/irq.c              | 52 ++++++++++++++++++++++++++++++++++++
> arch/arm64/kernel/smp.c              |  4 +++
> arch/arm64/kernel/stacktrace.c       |  4 ++-
> 6 files changed, 107 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index bbb251b14746..050d4196c736 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,14 +2,20 @@
> #define __ASM_IRQ_H
> 
> #include <linux/irqchip/arm-gic-acpi.h>
> +#include <linux/percpu.h>
> 
> #include <asm-generic/irq.h>
> +#include <asm/thread_info.h>
> +
> +DECLARE_PER_CPU(unsigned long, irq_sp);
> 
> struct pt_regs;
> 
> extern void migrate_irqs(void);
> extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
> 
> +extern int alloc_irq_stack(unsigned int cpu);
> +
> static inline void acpi_irq_init(void)
> {
> 	/*
> @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void)
> }
> #define acpi_irq_init acpi_irq_init
> 
> +static inline bool is_irq_stack(unsigned long sp)
> +{
> +	struct thread_info *ti = get_thread_info(sp);
> +	return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti);
> +}
> +
> #endif
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d18a42a..b906254fc400 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp");
> /*
>  * how to get the thread information struct from C
>  */
> +static inline struct thread_info *get_thread_info(unsigned long sp)
> +{
> +	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
> +}
> +
> static inline struct thread_info *current_thread_info(void) __attribute_const__;
> 
> static inline struct thread_info *current_thread_info(void)
> {
> -	return (struct thread_info *)
> -		(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	return get_thread_info(current_stack_pointer);
> }
> 
> #define thread_saved_pc(tsk)	\
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e16351819fed..d42371f3f5a1 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>  * Interrupt handling.
>  */
> 	.macro	irq_handler
> -	adrp	x1, handle_arch_irq
> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
> -	mov	x0, sp
> +	mrs	x21, tpidr_el1
> +	adr_l	x20, irq_sp
> +	add	x20, x20, x21
> +
> +	ldr	x21, [x20]
> +	mov	x20, sp
> +
> +	mov	x0, x21
> +	mov	x1, x20
> +	bl	irq_copy_thread_info
> +
> +	/* test for recursive use of irq_sp */
> +	cbz	w0, 1f
> +	mrs	x30, elr_el1
> +	mov	sp, x21
> +
> +	/*
> +	 * Create a fake stack frame to bump unwind_frame() onto the original
> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
> +	 */
> +	push	x29, x30

It is the most challenging item to handle a frame pointer in this context.

As I mentioned previously, a stack tracer on ftrace is not supported yet.
How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?

> +
> +1:	ldr_l	x1, handle_arch_irq
> +	mov     x0, x20
> 	blr	x1
> +
> +	mov	x0, x20
> +	mov	x1, x21
> +	bl	irq_copy_thread_info
> +	mov	sp, x20
> +
> 	.endm
> 
> 	.text
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index 463fa2e7e34c..10b57a006da8 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -26,11 +26,14 @@
> #include <linux/smp.h>
> #include <linux/init.h>
> #include <linux/irqchip.h>
> +#include <linux/percpu.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> 
> unsigned long irq_err_count;
> 
> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
> +
> int arch_show_interrupts(struct seq_file *p, int prec)
> {
> #ifdef CONFIG_SMP
> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
> 	irqchip_init();
> 	if (!handle_arch_irq)
> 		panic("No interrupt controller found.");
> +
> +	/* Allocate an irq stack for the boot cpu */
> +	if (alloc_irq_stack(smp_processor_id()))
> +		panic("Failed to allocate irq stack for boot cpu.");
> }
> 
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -117,3 +124,48 @@ void migrate_irqs(void)
> 	local_irq_restore(flags);
> }
> #endif /* CONFIG_HOTPLUG_CPU */
> +
> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
> +int alloc_irq_stack(unsigned int cpu)
> +{
> +	struct page *irq_stack_page;
> +	union thread_union *irq_stack;
> +
> +	/* reuse stack allocated previously */
> +	if (per_cpu(irq_sp, cpu))
> +		return 0;

I'd like to avoid even this simple check since CPU hotplug could be heavily
used for power management.

> +
> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
> +	if (!irq_stack_page) {
> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
> +			smp_processor_id(), cpu);
> +		return -ENOMEM;
> +	}
> +	irq_stack = page_address(irq_stack_page);
> +
> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
> +			       + THREAD_START_SP;
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy struct thread_info between two stacks, and update current->stack.
> + * This is used when moving to the irq exception stack.
> + * Changing current->stack is necessary so that non-arch thread_info writers
> + * don't use the new thread_info->task->stack to find the old thread_info when
> + * setting flags like TIF_NEED_RESCHED...
> + */
> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
> +{
> +	struct thread_info *src = get_thread_info(src_sp);
> +	struct thread_info *dst = get_thread_info(dst_sp);
> +
> +	if (dst == src)
> +		return 0;
> +
> +	*dst = *src;
> +	current->stack = dst;
> +
> +	return 1;
> +}

Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
twice to handle a single interrupt. Isn's it too expensive?

This is a major difference between my approach and this patch. I think we should get
some feedbacks on struct thread_info information management for v2 preparation.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-08  1:45             ` AKASHI Takahiro
@ 2015-09-08 14:59               ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:59 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: James Morse, linux-arm-kernel, linux-kernel@vger.kernel.org Mailing List

On Sep 8, 2015, at 10:45 AM, AKASHI Takahiro wrote:
> Jungseok,

Hi Akashi,

> On 09/08/2015 01:34 AM, Jungseok Lee wrote:
>> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>>> On 07/09/15 16:48, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>> 
>>>> Hi James,
>>>> 
>>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>>> when processing irqs allows us to make the stack size smaller.
>>>>> 
>>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>>> a maximum stack usage of 5816 bytes.
>>>> 
>>>> I'd like to know how to measure the max stack depth.
>>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>>> region and find or track down an untouched region?
>>> 
>>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>> 
>>> What problems did you encounter?
>>> (I may be missing something…)
>> 
>> When I enabled the feature, all entries had *0* size except the last entry.
>> It can be reproduced easily as looking in debugs:/tracing/stack_trace.
> 
> I'm afraid that you have not applied one of patches in my RFC:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html

My description is not clear. Whenever I use a stack tracer, I always put your
RFC and Steve's stack tracer fix on top of my tree.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08 14:59               ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-08 14:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 8, 2015, at 10:45 AM, AKASHI Takahiro wrote:
> Jungseok,

Hi Akashi,

> On 09/08/2015 01:34 AM, Jungseok Lee wrote:
>> On Sep 8, 2015, at 1:06 AM, James Morse wrote:
>>> On 07/09/15 16:48, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>> 
>>>> Hi James,
>>>> 
>>>>> Having to handle interrupts on top of an existing kernel stack means the
>>>>> kernel stack must be large enough to accomodate both the maximum kernel
>>>>> usage, and the maximum irq handler usage. Switching to a different stack
>>>>> when processing irqs allows us to make the stack size smaller.
>>>>> 
>>>>> Maximum kernel stack usage (running ltp and generating usb+ethernet
>>>>> interrupts) was 7256 bytes. With this patch, the same workload gives
>>>>> a maximum stack usage of 5816 bytes.
>>>> 
>>>> I'd like to know how to measure the max stack depth.
>>>> AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack
>>>> region and find or track down an untouched region?
>>> 
>>> I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' ->
>>> 'Tracers', then looked in debugfs:/tracing/stack_max_size.
>>> 
>>> What problems did you encounter?
>>> (I may be missing something?)
>> 
>> When I enabled the feature, all entries had *0* size except the last entry.
>> It can be reproduced easily as looking in debugs:/tracing/stack_trace.
> 
> I'm afraid that you have not applied one of patches in my RFC:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355919.html

My description is not clear. Whenever I use a stack tracer, I always put your
RFC and Steve's stack tracer fix on top of my tree.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-08 14:54       ` Jungseok Lee
@ 2015-09-08 16:47         ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-08 16:47 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 08/09/15 15:54, Jungseok Lee wrote:
> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
> 
> Hi James,
> 
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index e16351819fed..d42371f3f5a1 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>>  * Interrupt handling.
>>  */
>> 	.macro	irq_handler
>> -	adrp	x1, handle_arch_irq
>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>> -	mov	x0, sp
>> +	mrs	x21, tpidr_el1
>> +	adr_l	x20, irq_sp
>> +	add	x20, x20, x21
>> +
>> +	ldr	x21, [x20]
>> +	mov	x20, sp
>> +
>> +	mov	x0, x21
>> +	mov	x1, x20
>> +	bl	irq_copy_thread_info
>> +
>> +	/* test for recursive use of irq_sp */
>> +	cbz	w0, 1f
>> +	mrs	x30, elr_el1
>> +	mov	sp, x21
>> +
>> +	/*
>> +	 * Create a fake stack frame to bump unwind_frame() onto the original
>> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
>> +	 */
>> +	push	x29, x30
> 
> It is the most challenging item to handle a frame pointer in this context.
> 
> As I mentioned previously, a stack tracer on ftrace is not supported yet.
> How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?

Yes - I discovered today this was more complicated than I thought. I will
need to do some more reading.


>> +
>> +1:	ldr_l	x1, handle_arch_irq
>> +	mov     x0, x20
>> 	blr	x1
>> +
>> +	mov	x0, x20
>> +	mov	x1, x21
>> +	bl	irq_copy_thread_info
>> +	mov	sp, x20
>> +
>> 	.endm
>>
>> 	.text
>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>> index 463fa2e7e34c..10b57a006da8 100644
>> --- a/arch/arm64/kernel/irq.c
>> +++ b/arch/arm64/kernel/irq.c
>> @@ -26,11 +26,14 @@
>> #include <linux/smp.h>
>> #include <linux/init.h>
>> #include <linux/irqchip.h>
>> +#include <linux/percpu.h>
>> #include <linux/seq_file.h>
>> #include <linux/ratelimit.h>
>>
>> unsigned long irq_err_count;
>>
>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>> +
>> int arch_show_interrupts(struct seq_file *p, int prec)
>> {
>> #ifdef CONFIG_SMP
>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>> 	irqchip_init();
>> 	if (!handle_arch_irq)
>> 		panic("No interrupt controller found.");
>> +
>> +	/* Allocate an irq stack for the boot cpu */
>> +	if (alloc_irq_stack(smp_processor_id()))
>> +		panic("Failed to allocate irq stack for boot cpu.");
>> }
>>
>> #ifdef CONFIG_HOTPLUG_CPU
>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>> 	local_irq_restore(flags);
>> }
>> #endif /* CONFIG_HOTPLUG_CPU */
>> +
>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>> +int alloc_irq_stack(unsigned int cpu)
>> +{
>> +	struct page *irq_stack_page;
>> +	union thread_union *irq_stack;
>> +
>> +	/* reuse stack allocated previously */
>> +	if (per_cpu(irq_sp, cpu))
>> +		return 0;
> 
> I'd like to avoid even this simple check since CPU hotplug could be heavily
> used for power management.

I don't think its a problem:
__cpu_up() contains a call to wait_for_completion_timeout() (which could
eventually end up in the scheduler), so I don't think it could ever be on a
'really hot' path.

For really frequent hotplug-like power management, cpu_suspend() makes use
of firmware support to power-off cores - from what I can see it doesn't use
__cpu_up().


>> +
>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>> +	if (!irq_stack_page) {
>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>> +			smp_processor_id(), cpu);
>> +		return -ENOMEM;
>> +	}
>> +	irq_stack = page_address(irq_stack_page);
>> +
>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>> +			       + THREAD_START_SP;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Copy struct thread_info between two stacks, and update current->stack.
>> + * This is used when moving to the irq exception stack.
>> + * Changing current->stack is necessary so that non-arch thread_info writers
>> + * don't use the new thread_info->task->stack to find the old thread_info when
>> + * setting flags like TIF_NEED_RESCHED...
>> + */
>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>> +{
>> +	struct thread_info *src = get_thread_info(src_sp);
>> +	struct thread_info *dst = get_thread_info(dst_sp);
>> +
>> +	if (dst == src)
>> +		return 0;
>> +
>> +	*dst = *src;
>> +	current->stack = dst;
>> +
>> +	return 1;
>> +}
> 
> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
> twice to handle a single interrupt. Isn's it too expensive?
> 
> This is a major difference between my approach and this patch. I think we should get
> some feedbacks on struct thread_info information management for v2 preparation.

Agreed.

The other difference, as Akashi Takahiro pointed out, is the behaviour of
object_is_on_stack(). What should this return for an object on an irq stack?


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-08 16:47         ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-08 16:47 UTC (permalink / raw)
  To: linux-arm-kernel

On 08/09/15 15:54, Jungseok Lee wrote:
> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
> 
> Hi James,
> 
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index e16351819fed..d42371f3f5a1 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>>  * Interrupt handling.
>>  */
>> 	.macro	irq_handler
>> -	adrp	x1, handle_arch_irq
>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>> -	mov	x0, sp
>> +	mrs	x21, tpidr_el1
>> +	adr_l	x20, irq_sp
>> +	add	x20, x20, x21
>> +
>> +	ldr	x21, [x20]
>> +	mov	x20, sp
>> +
>> +	mov	x0, x21
>> +	mov	x1, x20
>> +	bl	irq_copy_thread_info
>> +
>> +	/* test for recursive use of irq_sp */
>> +	cbz	w0, 1f
>> +	mrs	x30, elr_el1
>> +	mov	sp, x21
>> +
>> +	/*
>> +	 * Create a fake stack frame to bump unwind_frame() onto the original
>> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
>> +	 */
>> +	push	x29, x30
> 
> It is the most challenging item to handle a frame pointer in this context.
> 
> As I mentioned previously, a stack tracer on ftrace is not supported yet.
> How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?

Yes - I discovered today this was more complicated than I thought. I will
need to do some more reading.


>> +
>> +1:	ldr_l	x1, handle_arch_irq
>> +	mov     x0, x20
>> 	blr	x1
>> +
>> +	mov	x0, x20
>> +	mov	x1, x21
>> +	bl	irq_copy_thread_info
>> +	mov	sp, x20
>> +
>> 	.endm
>>
>> 	.text
>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>> index 463fa2e7e34c..10b57a006da8 100644
>> --- a/arch/arm64/kernel/irq.c
>> +++ b/arch/arm64/kernel/irq.c
>> @@ -26,11 +26,14 @@
>> #include <linux/smp.h>
>> #include <linux/init.h>
>> #include <linux/irqchip.h>
>> +#include <linux/percpu.h>
>> #include <linux/seq_file.h>
>> #include <linux/ratelimit.h>
>>
>> unsigned long irq_err_count;
>>
>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>> +
>> int arch_show_interrupts(struct seq_file *p, int prec)
>> {
>> #ifdef CONFIG_SMP
>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>> 	irqchip_init();
>> 	if (!handle_arch_irq)
>> 		panic("No interrupt controller found.");
>> +
>> +	/* Allocate an irq stack for the boot cpu */
>> +	if (alloc_irq_stack(smp_processor_id()))
>> +		panic("Failed to allocate irq stack for boot cpu.");
>> }
>>
>> #ifdef CONFIG_HOTPLUG_CPU
>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>> 	local_irq_restore(flags);
>> }
>> #endif /* CONFIG_HOTPLUG_CPU */
>> +
>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>> +int alloc_irq_stack(unsigned int cpu)
>> +{
>> +	struct page *irq_stack_page;
>> +	union thread_union *irq_stack;
>> +
>> +	/* reuse stack allocated previously */
>> +	if (per_cpu(irq_sp, cpu))
>> +		return 0;
> 
> I'd like to avoid even this simple check since CPU hotplug could be heavily
> used for power management.

I don't think its a problem:
__cpu_up() contains a call to wait_for_completion_timeout() (which could
eventually end up in the scheduler), so I don't think it could ever be on a
'really hot' path.

For really frequent hotplug-like power management, cpu_suspend() makes use
of firmware support to power-off cores - from what I can see it doesn't use
__cpu_up().


>> +
>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>> +	if (!irq_stack_page) {
>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>> +			smp_processor_id(), cpu);
>> +		return -ENOMEM;
>> +	}
>> +	irq_stack = page_address(irq_stack_page);
>> +
>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>> +			       + THREAD_START_SP;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Copy struct thread_info between two stacks, and update current->stack.
>> + * This is used when moving to the irq exception stack.
>> + * Changing current->stack is necessary so that non-arch thread_info writers
>> + * don't use the new thread_info->task->stack to find the old thread_info when
>> + * setting flags like TIF_NEED_RESCHED...
>> + */
>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>> +{
>> +	struct thread_info *src = get_thread_info(src_sp);
>> +	struct thread_info *dst = get_thread_info(dst_sp);
>> +
>> +	if (dst == src)
>> +		return 0;
>> +
>> +	*dst = *src;
>> +	current->stack = dst;
>> +
>> +	return 1;
>> +}
> 
> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
> twice to handle a single interrupt. Isn's it too expensive?
> 
> This is a major difference between my approach and this patch. I think we should get
> some feedbacks on struct thread_info information management for v2 preparation.

Agreed.

The other difference, as Akashi Takahiro pointed out, is the behaviour of
object_is_on_stack(). What should this return for an object on an irq stack?


Thanks,

James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-08 16:47         ` James Morse
@ 2015-09-09 13:22           ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-09 13:22 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 9, 2015, at 1:47 AM, James Morse wrote:
> On 08/09/15 15:54, Jungseok Lee wrote:
>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>> 
>> Hi James,
>> 
>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>> index e16351819fed..d42371f3f5a1 100644
>>> --- a/arch/arm64/kernel/entry.S
>>> +++ b/arch/arm64/kernel/entry.S
>>> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>>> * Interrupt handling.
>>> */
>>> 	.macro	irq_handler
>>> -	adrp	x1, handle_arch_irq
>>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>>> -	mov	x0, sp
>>> +	mrs	x21, tpidr_el1
>>> +	adr_l	x20, irq_sp
>>> +	add	x20, x20, x21
>>> +
>>> +	ldr	x21, [x20]
>>> +	mov	x20, sp
>>> +
>>> +	mov	x0, x21
>>> +	mov	x1, x20
>>> +	bl	irq_copy_thread_info
>>> +
>>> +	/* test for recursive use of irq_sp */
>>> +	cbz	w0, 1f
>>> +	mrs	x30, elr_el1
>>> +	mov	sp, x21
>>> +
>>> +	/*
>>> +	 * Create a fake stack frame to bump unwind_frame() onto the original
>>> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
>>> +	 */
>>> +	push	x29, x30
>> 
>> It is the most challenging item to handle a frame pointer in this context.
>> 
>> As I mentioned previously, a stack tracer on ftrace is not supported yet.
>> How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?
> 
> Yes - I discovered today this was more complicated than I thought. I will
> need to do some more reading.
> 
> 
>>> +
>>> +1:	ldr_l	x1, handle_arch_irq
>>> +	mov     x0, x20
>>> 	blr	x1
>>> +
>>> +	mov	x0, x20
>>> +	mov	x1, x21
>>> +	bl	irq_copy_thread_info
>>> +	mov	sp, x20
>>> +
>>> 	.endm
>>> 
>>> 	.text
>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>> index 463fa2e7e34c..10b57a006da8 100644
>>> --- a/arch/arm64/kernel/irq.c
>>> +++ b/arch/arm64/kernel/irq.c
>>> @@ -26,11 +26,14 @@
>>> #include <linux/smp.h>
>>> #include <linux/init.h>
>>> #include <linux/irqchip.h>
>>> +#include <linux/percpu.h>
>>> #include <linux/seq_file.h>
>>> #include <linux/ratelimit.h>
>>> 
>>> unsigned long irq_err_count;
>>> 
>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>> +
>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>> {
>>> #ifdef CONFIG_SMP
>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>> 	irqchip_init();
>>> 	if (!handle_arch_irq)
>>> 		panic("No interrupt controller found.");
>>> +
>>> +	/* Allocate an irq stack for the boot cpu */
>>> +	if (alloc_irq_stack(smp_processor_id()))
>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>> }
>>> 
>>> #ifdef CONFIG_HOTPLUG_CPU
>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>> 	local_irq_restore(flags);
>>> }
>>> #endif /* CONFIG_HOTPLUG_CPU */
>>> +
>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>> +int alloc_irq_stack(unsigned int cpu)
>>> +{
>>> +	struct page *irq_stack_page;
>>> +	union thread_union *irq_stack;
>>> +
>>> +	/* reuse stack allocated previously */
>>> +	if (per_cpu(irq_sp, cpu))
>>> +		return 0;
>> 
>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>> used for power management.
> 
> I don't think its a problem:
> __cpu_up() contains a call to wait_for_completion_timeout() (which could
> eventually end up in the scheduler), so I don't think it could ever be on a
> 'really hot' path.
> 
> For really frequent hotplug-like power management, cpu_suspend() makes use
> of firmware support to power-off cores - from what I can see it doesn't use
> __cpu_up().

In case of some platforms, CPU hotplug is triggered via sysfs for power management
based on user data. What is advantage of putting stack allocation into this path?

IRQ stack allocation is an critical one-shot operation. So, there would be no issue
to give this work to a booting core. 

>>> +
>>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>>> +	if (!irq_stack_page) {
>>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>>> +			smp_processor_id(), cpu);
>>> +		return -ENOMEM;
>>> +	}
>>> +	irq_stack = page_address(irq_stack_page);

I forgot leaving a comment here. How about using __get_free_pages? It returns an
address instead of page. 

>>> +
>>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>>> +			       + THREAD_START_SP;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * Copy struct thread_info between two stacks, and update current->stack.
>>> + * This is used when moving to the irq exception stack.
>>> + * Changing current->stack is necessary so that non-arch thread_info writers
>>> + * don't use the new thread_info->task->stack to find the old thread_info when
>>> + * setting flags like TIF_NEED_RESCHED...
>>> + */
>>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>>> +{
>>> +	struct thread_info *src = get_thread_info(src_sp);
>>> +	struct thread_info *dst = get_thread_info(dst_sp);
>>> +
>>> +	if (dst == src)
>>> +		return 0;
>>> +
>>> +	*dst = *src;
>>> +	current->stack = dst;
>>> +
>>> +	return 1;
>>> +}
>> 
>> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
>> twice to handle a single interrupt. Isn's it too expensive?
>> 
>> This is a major difference between my approach and this patch. I think we should get
>> some feedbacks on struct thread_info information management for v2 preparation.
> 
> Agreed.
> 
> The other difference, as Akashi Takahiro pointed out, is the behaviour of
> object_is_on_stack(). What should this return for an object on an irq stack?

0 should be returned in that case to align with x86 behavior according to check_stack()
context and the commit, 81520a1b0649d0.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-09 13:22           ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-09 13:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 9, 2015, at 1:47 AM, James Morse wrote:
> On 08/09/15 15:54, Jungseok Lee wrote:
>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>> 
>> Hi James,
>> 
>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>> index e16351819fed..d42371f3f5a1 100644
>>> --- a/arch/arm64/kernel/entry.S
>>> +++ b/arch/arm64/kernel/entry.S
>>> @@ -190,10 +190,37 @@ tsk	.req	x28		// current thread_info
>>> * Interrupt handling.
>>> */
>>> 	.macro	irq_handler
>>> -	adrp	x1, handle_arch_irq
>>> -	ldr	x1, [x1, #:lo12:handle_arch_irq]
>>> -	mov	x0, sp
>>> +	mrs	x21, tpidr_el1
>>> +	adr_l	x20, irq_sp
>>> +	add	x20, x20, x21
>>> +
>>> +	ldr	x21, [x20]
>>> +	mov	x20, sp
>>> +
>>> +	mov	x0, x21
>>> +	mov	x1, x20
>>> +	bl	irq_copy_thread_info
>>> +
>>> +	/* test for recursive use of irq_sp */
>>> +	cbz	w0, 1f
>>> +	mrs	x30, elr_el1
>>> +	mov	sp, x21
>>> +
>>> +	/*
>>> +	 * Create a fake stack frame to bump unwind_frame() onto the original
>>> +	 * stack. This relies on x29 not being clobbered by kernel_entry().
>>> +	 */
>>> +	push	x29, x30
>> 
>> It is the most challenging item to handle a frame pointer in this context.
>> 
>> As I mentioned previously, a stack tracer on ftrace is not supported yet.
>> How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"?
> 
> Yes - I discovered today this was more complicated than I thought. I will
> need to do some more reading.
> 
> 
>>> +
>>> +1:	ldr_l	x1, handle_arch_irq
>>> +	mov     x0, x20
>>> 	blr	x1
>>> +
>>> +	mov	x0, x20
>>> +	mov	x1, x21
>>> +	bl	irq_copy_thread_info
>>> +	mov	sp, x20
>>> +
>>> 	.endm
>>> 
>>> 	.text
>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>> index 463fa2e7e34c..10b57a006da8 100644
>>> --- a/arch/arm64/kernel/irq.c
>>> +++ b/arch/arm64/kernel/irq.c
>>> @@ -26,11 +26,14 @@
>>> #include <linux/smp.h>
>>> #include <linux/init.h>
>>> #include <linux/irqchip.h>
>>> +#include <linux/percpu.h>
>>> #include <linux/seq_file.h>
>>> #include <linux/ratelimit.h>
>>> 
>>> unsigned long irq_err_count;
>>> 
>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>> +
>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>> {
>>> #ifdef CONFIG_SMP
>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>> 	irqchip_init();
>>> 	if (!handle_arch_irq)
>>> 		panic("No interrupt controller found.");
>>> +
>>> +	/* Allocate an irq stack for the boot cpu */
>>> +	if (alloc_irq_stack(smp_processor_id()))
>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>> }
>>> 
>>> #ifdef CONFIG_HOTPLUG_CPU
>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>> 	local_irq_restore(flags);
>>> }
>>> #endif /* CONFIG_HOTPLUG_CPU */
>>> +
>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>> +int alloc_irq_stack(unsigned int cpu)
>>> +{
>>> +	struct page *irq_stack_page;
>>> +	union thread_union *irq_stack;
>>> +
>>> +	/* reuse stack allocated previously */
>>> +	if (per_cpu(irq_sp, cpu))
>>> +		return 0;
>> 
>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>> used for power management.
> 
> I don't think its a problem:
> __cpu_up() contains a call to wait_for_completion_timeout() (which could
> eventually end up in the scheduler), so I don't think it could ever be on a
> 'really hot' path.
> 
> For really frequent hotplug-like power management, cpu_suspend() makes use
> of firmware support to power-off cores - from what I can see it doesn't use
> __cpu_up().

In case of some platforms, CPU hotplug is triggered via sysfs for power management
based on user data. What is advantage of putting stack allocation into this path?

IRQ stack allocation is an critical one-shot operation. So, there would be no issue
to give this work to a booting core. 

>>> +
>>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>>> +	if (!irq_stack_page) {
>>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>>> +			smp_processor_id(), cpu);
>>> +		return -ENOMEM;
>>> +	}
>>> +	irq_stack = page_address(irq_stack_page);

I forgot leaving a comment here. How about using __get_free_pages? It returns an
address instead of page. 

>>> +
>>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>>> +			       + THREAD_START_SP;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * Copy struct thread_info between two stacks, and update current->stack.
>>> + * This is used when moving to the irq exception stack.
>>> + * Changing current->stack is necessary so that non-arch thread_info writers
>>> + * don't use the new thread_info->task->stack to find the old thread_info when
>>> + * setting flags like TIF_NEED_RESCHED...
>>> + */
>>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>>> +{
>>> +	struct thread_info *src = get_thread_info(src_sp);
>>> +	struct thread_info *dst = get_thread_info(dst_sp);
>>> +
>>> +	if (dst == src)
>>> +		return 0;
>>> +
>>> +	*dst = *src;
>>> +	current->stack = dst;
>>> +
>>> +	return 1;
>>> +}
>> 
>> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
>> twice to handle a single interrupt. Isn's it too expensive?
>> 
>> This is a major difference between my approach and this patch. I think we should get
>> some feedbacks on struct thread_info information management for v2 preparation.
> 
> Agreed.
> 
> The other difference, as Akashi Takahiro pointed out, is the behaviour of
> object_is_on_stack(). What should this return for an object on an irq stack?

0 should be returned in that case to align with x86 behavior according to check_stack()
context and the commit, 81520a1b0649d0.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-09 13:22           ` Jungseok Lee
@ 2015-09-09 18:13             ` James Morse
  -1 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-09 18:13 UTC (permalink / raw)
  To: Jungseok Lee; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On 09/09/15 14:22, Jungseok Lee wrote:
> On Sep 9, 2015, at 1:47 AM, James Morse wrote:
>> On 08/09/15 15:54, Jungseok Lee wrote:
>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>>> index 463fa2e7e34c..10b57a006da8 100644
>>>> --- a/arch/arm64/kernel/irq.c
>>>> +++ b/arch/arm64/kernel/irq.c
>>>> @@ -26,11 +26,14 @@
>>>> #include <linux/smp.h>
>>>> #include <linux/init.h>
>>>> #include <linux/irqchip.h>
>>>> +#include <linux/percpu.h>
>>>> #include <linux/seq_file.h>
>>>> #include <linux/ratelimit.h>
>>>>
>>>> unsigned long irq_err_count;
>>>>
>>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>>> +
>>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>>> {
>>>> #ifdef CONFIG_SMP
>>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>>> 	irqchip_init();
>>>> 	if (!handle_arch_irq)
>>>> 		panic("No interrupt controller found.");
>>>> +
>>>> +	/* Allocate an irq stack for the boot cpu */
>>>> +	if (alloc_irq_stack(smp_processor_id()))
>>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>>> }
>>>>
>>>> #ifdef CONFIG_HOTPLUG_CPU
>>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>>> 	local_irq_restore(flags);
>>>> }
>>>> #endif /* CONFIG_HOTPLUG_CPU */
>>>> +
>>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>>> +int alloc_irq_stack(unsigned int cpu)
>>>> +{
>>>> +	struct page *irq_stack_page;
>>>> +	union thread_union *irq_stack;
>>>> +
>>>> +	/* reuse stack allocated previously */
>>>> +	if (per_cpu(irq_sp, cpu))
>>>> +		return 0;
>>>
>>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>>> used for power management.
>>
>> I don't think its a problem:
>> __cpu_up() contains a call to wait_for_completion_timeout() (which could
>> eventually end up in the scheduler), so I don't think it could ever be on a
>> 'really hot' path.
>>
>> For really frequent hotplug-like power management, cpu_suspend() makes use
>> of firmware support to power-off cores - from what I can see it doesn't use
>> __cpu_up().
> 
> In case of some platforms, CPU hotplug is triggered via sysfs for power management
> based on user data. What is advantage of putting stack allocation into this path?

It will only happen for CPUs that are brought up.


> IRQ stack allocation is an critical one-shot operation. So, there would be no issue
> to give this work to a booting core. 

I agree, but:

>From include/linux/cpumask.h:
> *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> *  ACPI reports present at boot.

(This doesn't seem to happen with DT - but might with ACPI.)

NR_CPUs could be much bigger than the number of cpus the system ever has.
Allocating a stack for every possible cpu would waste memory. It is better
to do it just-in-time, when we know the memory will be used.

This already happens for the per-cpu idle task, (please check I traced
these through correctly!)
_cpu_up()
  idle_thread_get()
    init_idle()
      fork_idle()
        copy_process()
          dup_task_struct()
            alloc_task_struct_node()
            alloc_thread_info_node()
            arch_dup_task_struct()

So plenty of memory-allocation occurs during _cpu_up(), idle_init() checks
whether the idle task has already been created.


>>>> +
>>>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>>>> +	if (!irq_stack_page) {
>>>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>>>> +			smp_processor_id(), cpu);
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +	irq_stack = page_address(irq_stack_page);
> 
> I forgot leaving a comment here. How about using __get_free_pages? It returns an
> address instead of page. 

Good idea, I was paranoid about the stack not being correctly aligned (as
current_thread_info() relies on this). I lifted that alloc_kmem_pages line
from kernel/fork.c.


>>>> +
>>>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>>>> +			       + THREAD_START_SP;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Copy struct thread_info between two stacks, and update current->stack.
>>>> + * This is used when moving to the irq exception stack.
>>>> + * Changing current->stack is necessary so that non-arch thread_info writers
>>>> + * don't use the new thread_info->task->stack to find the old thread_info when
>>>> + * setting flags like TIF_NEED_RESCHED...
>>>> + */
>>>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>>>> +{
>>>> +	struct thread_info *src = get_thread_info(src_sp);
>>>> +	struct thread_info *dst = get_thread_info(dst_sp);
>>>> +
>>>> +	if (dst == src)
>>>> +		return 0;
>>>> +
>>>> +	*dst = *src;
>>>> +	current->stack = dst;
>>>> +
>>>> +	return 1;
>>>> +}
>>>
>>> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
>>> twice to handle a single interrupt. Isn's it too expensive?
>>>
>>> This is a major difference between my approach and this patch. I think we should get
>>> some feedbacks on struct thread_info information management for v2 preparation.
>>
>> Agreed.
>>
>> The other difference, as Akashi Takahiro pointed out, is the behaviour of
>> object_is_on_stack(). What should this return for an object on an irq stack?
> 
> 0 should be returned in that case to align with x86 behavior according to check_stack()
> context and the commit, 81520a1b0649d0.

Unless we can update ftrace on x86 too! :)


Thanks for your comments,


James



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-09 18:13             ` James Morse
  0 siblings, 0 replies; 52+ messages in thread
From: James Morse @ 2015-09-09 18:13 UTC (permalink / raw)
  To: linux-arm-kernel

On 09/09/15 14:22, Jungseok Lee wrote:
> On Sep 9, 2015, at 1:47 AM, James Morse wrote:
>> On 08/09/15 15:54, Jungseok Lee wrote:
>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>>> index 463fa2e7e34c..10b57a006da8 100644
>>>> --- a/arch/arm64/kernel/irq.c
>>>> +++ b/arch/arm64/kernel/irq.c
>>>> @@ -26,11 +26,14 @@
>>>> #include <linux/smp.h>
>>>> #include <linux/init.h>
>>>> #include <linux/irqchip.h>
>>>> +#include <linux/percpu.h>
>>>> #include <linux/seq_file.h>
>>>> #include <linux/ratelimit.h>
>>>>
>>>> unsigned long irq_err_count;
>>>>
>>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>>> +
>>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>>> {
>>>> #ifdef CONFIG_SMP
>>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>>> 	irqchip_init();
>>>> 	if (!handle_arch_irq)
>>>> 		panic("No interrupt controller found.");
>>>> +
>>>> +	/* Allocate an irq stack for the boot cpu */
>>>> +	if (alloc_irq_stack(smp_processor_id()))
>>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>>> }
>>>>
>>>> #ifdef CONFIG_HOTPLUG_CPU
>>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>>> 	local_irq_restore(flags);
>>>> }
>>>> #endif /* CONFIG_HOTPLUG_CPU */
>>>> +
>>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>>> +int alloc_irq_stack(unsigned int cpu)
>>>> +{
>>>> +	struct page *irq_stack_page;
>>>> +	union thread_union *irq_stack;
>>>> +
>>>> +	/* reuse stack allocated previously */
>>>> +	if (per_cpu(irq_sp, cpu))
>>>> +		return 0;
>>>
>>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>>> used for power management.
>>
>> I don't think its a problem:
>> __cpu_up() contains a call to wait_for_completion_timeout() (which could
>> eventually end up in the scheduler), so I don't think it could ever be on a
>> 'really hot' path.
>>
>> For really frequent hotplug-like power management, cpu_suspend() makes use
>> of firmware support to power-off cores - from what I can see it doesn't use
>> __cpu_up().
> 
> In case of some platforms, CPU hotplug is triggered via sysfs for power management
> based on user data. What is advantage of putting stack allocation into this path?

It will only happen for CPUs that are brought up.


> IRQ stack allocation is an critical one-shot operation. So, there would be no issue
> to give this work to a booting core. 

I agree, but:

>From include/linux/cpumask.h:
> *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> *  ACPI reports present at boot.

(This doesn't seem to happen with DT - but might with ACPI.)

NR_CPUs could be much bigger than the number of cpus the system ever has.
Allocating a stack for every possible cpu would waste memory. It is better
to do it just-in-time, when we know the memory will be used.

This already happens for the per-cpu idle task, (please check I traced
these through correctly!)
_cpu_up()
  idle_thread_get()
    init_idle()
      fork_idle()
        copy_process()
          dup_task_struct()
            alloc_task_struct_node()
            alloc_thread_info_node()
            arch_dup_task_struct()

So plenty of memory-allocation occurs during _cpu_up(), idle_init() checks
whether the idle task has already been created.


>>>> +
>>>> +	irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);
>>>> +	if (!irq_stack_page) {
>>>> +		pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n",
>>>> +			smp_processor_id(), cpu);
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +	irq_stack = page_address(irq_stack_page);
> 
> I forgot leaving a comment here. How about using __get_free_pages? It returns an
> address instead of page. 

Good idea, I was paranoid about the stack not being correctly aligned (as
current_thread_info() relies on this). I lifted that alloc_kmem_pages line
from kernel/fork.c.


>>>> +
>>>> +	per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack
>>>> +			       + THREAD_START_SP;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Copy struct thread_info between two stacks, and update current->stack.
>>>> + * This is used when moving to the irq exception stack.
>>>> + * Changing current->stack is necessary so that non-arch thread_info writers
>>>> + * don't use the new thread_info->task->stack to find the old thread_info when
>>>> + * setting flags like TIF_NEED_RESCHED...
>>>> + */
>>>> +asmlinkage int irq_copy_thread_info(unsigned long dst_sp, unsigned long src_sp)
>>>> +{
>>>> +	struct thread_info *src = get_thread_info(src_sp);
>>>> +	struct thread_info *dst = get_thread_info(dst_sp);
>>>> +
>>>> +	if (dst == src)
>>>> +		return 0;
>>>> +
>>>> +	*dst = *src;
>>>> +	current->stack = dst;
>>>> +
>>>> +	return 1;
>>>> +}
>>>
>>> Although this is 32-byte memcpy, sizeof(struct thread_info), this function is called
>>> twice to handle a single interrupt. Isn's it too expensive?
>>>
>>> This is a major difference between my approach and this patch. I think we should get
>>> some feedbacks on struct thread_info information management for v2 preparation.
>>
>> Agreed.
>>
>> The other difference, as Akashi Takahiro pointed out, is the behaviour of
>> object_is_on_stack(). What should this return for an object on an irq stack?
> 
> 0 should be returned in that case to align with x86 behavior according to check_stack()
> context and the commit, 81520a1b0649d0.

Unless we can update ftrace on x86 too! :)


Thanks for your comments,


James

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
  2015-09-09 18:13             ` James Morse
@ 2015-09-10 23:30               ` Jungseok Lee
  -1 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-10 23:30 UTC (permalink / raw)
  To: James Morse; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

On Sep 10, 2015, at 3:13 AM, James Morse wrote:
> On 09/09/15 14:22, Jungseok Lee wrote:
>> On Sep 9, 2015, at 1:47 AM, James Morse wrote:
>>> On 08/09/15 15:54, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>>>> index 463fa2e7e34c..10b57a006da8 100644
>>>>> --- a/arch/arm64/kernel/irq.c
>>>>> +++ b/arch/arm64/kernel/irq.c
>>>>> @@ -26,11 +26,14 @@
>>>>> #include <linux/smp.h>
>>>>> #include <linux/init.h>
>>>>> #include <linux/irqchip.h>
>>>>> +#include <linux/percpu.h>
>>>>> #include <linux/seq_file.h>
>>>>> #include <linux/ratelimit.h>
>>>>> 
>>>>> unsigned long irq_err_count;
>>>>> 
>>>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>>>> +
>>>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>>>> {
>>>>> #ifdef CONFIG_SMP
>>>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>>>> 	irqchip_init();
>>>>> 	if (!handle_arch_irq)
>>>>> 		panic("No interrupt controller found.");
>>>>> +
>>>>> +	/* Allocate an irq stack for the boot cpu */
>>>>> +	if (alloc_irq_stack(smp_processor_id()))
>>>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>>>> }
>>>>> 
>>>>> #ifdef CONFIG_HOTPLUG_CPU
>>>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>>>> 	local_irq_restore(flags);
>>>>> }
>>>>> #endif /* CONFIG_HOTPLUG_CPU */
>>>>> +
>>>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>>>> +int alloc_irq_stack(unsigned int cpu)
>>>>> +{
>>>>> +	struct page *irq_stack_page;
>>>>> +	union thread_union *irq_stack;
>>>>> +
>>>>> +	/* reuse stack allocated previously */
>>>>> +	if (per_cpu(irq_sp, cpu))
>>>>> +		return 0;
>>>> 
>>>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>>>> used for power management.
>>> 
>>> I don't think its a problem:
>>> __cpu_up() contains a call to wait_for_completion_timeout() (which could
>>> eventually end up in the scheduler), so I don't think it could ever be on a
>>> 'really hot' path.
>>> 
>>> For really frequent hotplug-like power management, cpu_suspend() makes use
>>> of firmware support to power-off cores - from what I can see it doesn't use
>>> __cpu_up().
>> 
>> In case of some platforms, CPU hotplug is triggered via sysfs for power management
>> based on user data. What is advantage of putting stack allocation into this path?
> 
> It will only happen for CPUs that are brought up.
> 
> 
>> IRQ stack allocation is an critical one-shot operation. So, there would be no issue
>> to give this work to a booting core. 
> 
> I agree, but:
> 
> From include/linux/cpumask.h:
>> *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>> *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>> *  ACPI reports present at boot.
> 
> (This doesn't seem to happen with DT - but might with ACPI.)
> 
> NR_CPUs could be much bigger than the number of cpus the system ever has.
> Allocating a stack for every possible cpu would waste memory. It is better
> to do it just-in-time, when we know the memory will be used.

Frankly I've not considered that kind of system, but this feature should be
supported smoothly for that system. I will move the allocation logic in v2.

> This already happens for the per-cpu idle task, (please check I traced
> these through correctly!)
> _cpu_up()
>  idle_thread_get()
>    init_idle()
>      fork_idle()
>        copy_process()
>          dup_task_struct()
>            alloc_task_struct_node()
>            alloc_thread_info_node()
>            arch_dup_task_struct()
> 
> So plenty of memory-allocation occurs during _cpu_up(), idle_init() checks
> whether the idle task has already been created.

Got it.

Thanks for the feedbacks.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
@ 2015-09-10 23:30               ` Jungseok Lee
  0 siblings, 0 replies; 52+ messages in thread
From: Jungseok Lee @ 2015-09-10 23:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 10, 2015, at 3:13 AM, James Morse wrote:
> On 09/09/15 14:22, Jungseok Lee wrote:
>> On Sep 9, 2015, at 1:47 AM, James Morse wrote:
>>> On 08/09/15 15:54, Jungseok Lee wrote:
>>>> On Sep 7, 2015, at 11:36 PM, James Morse wrote:
>>>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
>>>>> index 463fa2e7e34c..10b57a006da8 100644
>>>>> --- a/arch/arm64/kernel/irq.c
>>>>> +++ b/arch/arm64/kernel/irq.c
>>>>> @@ -26,11 +26,14 @@
>>>>> #include <linux/smp.h>
>>>>> #include <linux/init.h>
>>>>> #include <linux/irqchip.h>
>>>>> +#include <linux/percpu.h>
>>>>> #include <linux/seq_file.h>
>>>>> #include <linux/ratelimit.h>
>>>>> 
>>>>> unsigned long irq_err_count;
>>>>> 
>>>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0;
>>>>> +
>>>>> int arch_show_interrupts(struct seq_file *p, int prec)
>>>>> {
>>>>> #ifdef CONFIG_SMP
>>>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void)
>>>>> 	irqchip_init();
>>>>> 	if (!handle_arch_irq)
>>>>> 		panic("No interrupt controller found.");
>>>>> +
>>>>> +	/* Allocate an irq stack for the boot cpu */
>>>>> +	if (alloc_irq_stack(smp_processor_id()))
>>>>> +		panic("Failed to allocate irq stack for boot cpu.");
>>>>> }
>>>>> 
>>>>> #ifdef CONFIG_HOTPLUG_CPU
>>>>> @@ -117,3 +124,48 @@ void migrate_irqs(void)
>>>>> 	local_irq_restore(flags);
>>>>> }
>>>>> #endif /* CONFIG_HOTPLUG_CPU */
>>>>> +
>>>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */
>>>>> +int alloc_irq_stack(unsigned int cpu)
>>>>> +{
>>>>> +	struct page *irq_stack_page;
>>>>> +	union thread_union *irq_stack;
>>>>> +
>>>>> +	/* reuse stack allocated previously */
>>>>> +	if (per_cpu(irq_sp, cpu))
>>>>> +		return 0;
>>>> 
>>>> I'd like to avoid even this simple check since CPU hotplug could be heavily
>>>> used for power management.
>>> 
>>> I don't think its a problem:
>>> __cpu_up() contains a call to wait_for_completion_timeout() (which could
>>> eventually end up in the scheduler), so I don't think it could ever be on a
>>> 'really hot' path.
>>> 
>>> For really frequent hotplug-like power management, cpu_suspend() makes use
>>> of firmware support to power-off cores - from what I can see it doesn't use
>>> __cpu_up().
>> 
>> In case of some platforms, CPU hotplug is triggered via sysfs for power management
>> based on user data. What is advantage of putting stack allocation into this path?
> 
> It will only happen for CPUs that are brought up.
> 
> 
>> IRQ stack allocation is an critical one-shot operation. So, there would be no issue
>> to give this work to a booting core. 
> 
> I agree, but:
> 
> From include/linux/cpumask.h:
>> *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>> *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>> *  ACPI reports present at boot.
> 
> (This doesn't seem to happen with DT - but might with ACPI.)
> 
> NR_CPUs could be much bigger than the number of cpus the system ever has.
> Allocating a stack for every possible cpu would waste memory. It is better
> to do it just-in-time, when we know the memory will be used.

Frankly I've not considered that kind of system, but this feature should be
supported smoothly for that system. I will move the allocation logic in v2.

> This already happens for the per-cpu idle task, (please check I traced
> these through correctly!)
> _cpu_up()
>  idle_thread_get()
>    init_idle()
>      fork_idle()
>        copy_process()
>          dup_task_struct()
>            alloc_task_struct_node()
>            alloc_thread_info_node()
>            arch_dup_task_struct()
> 
> So plenty of memory-allocation occurs during _cpu_up(), idle_init() checks
> whether the idle task has already been created.

Got it.

Thanks for the feedbacks.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2015-09-10 23:30 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-04 14:23 [RFC PATCH 0/3] Implement IRQ stack on ARM64 Jungseok Lee
2015-09-04 14:23 ` Jungseok Lee
2015-09-04 14:23 ` [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h Jungseok Lee
2015-09-04 14:23   ` Jungseok Lee
2015-09-07 14:48   ` James Morse
2015-09-07 14:48     ` James Morse
2015-09-07 14:56   ` Mark Rutland
2015-09-07 14:56     ` Mark Rutland
2015-09-07 15:51     ` Jungseok Lee
2015-09-07 15:51       ` Jungseok Lee
2015-09-04 14:23 ` [RFC PATCH 2/3] arm64: Introduce IRQ stack Jungseok Lee
2015-09-04 14:23   ` Jungseok Lee
2015-09-04 17:12   ` Alexnader Kuleshov
2015-09-04 17:12     ` Alexnader Kuleshov
2015-09-07 14:08     ` Jungseok Lee
2015-09-07 14:08       ` Jungseok Lee
2015-09-07 14:48   ` James Morse
2015-09-07 14:48     ` James Morse
2015-09-08 14:28     ` Jungseok Lee
2015-09-08 14:28       ` Jungseok Lee
2015-09-04 14:23 ` [RFC PATCH 3/3] arm64: Reduce kernel stack size when using " Jungseok Lee
2015-09-04 14:23   ` Jungseok Lee
2015-09-07 14:33 ` [RFC PATCH 0/3] Implement IRQ stack on ARM64 James Morse
2015-09-07 14:33   ` James Morse
2015-09-07 14:36   ` [PATCH] arm64: kernel: Use a separate stack for irq interrupts James Morse
2015-09-07 14:36     ` James Morse
2015-09-07 15:48     ` Jungseok Lee
2015-09-07 15:48       ` Jungseok Lee
2015-09-07 16:06       ` James Morse
2015-09-07 16:06         ` James Morse
2015-09-07 16:34         ` Jungseok Lee
2015-09-07 16:34           ` Jungseok Lee
2015-09-08  1:45           ` AKASHI Takahiro
2015-09-08  1:45             ` AKASHI Takahiro
2015-09-08  6:44             ` AKASHI Takahiro
2015-09-08  6:44               ` AKASHI Takahiro
2015-09-08 14:59             ` Jungseok Lee
2015-09-08 14:59               ` Jungseok Lee
2015-09-08  7:51     ` AKASHI Takahiro
2015-09-08  7:51       ` AKASHI Takahiro
2015-09-08 14:54     ` Jungseok Lee
2015-09-08 14:54       ` Jungseok Lee
2015-09-08 16:47       ` James Morse
2015-09-08 16:47         ` James Morse
2015-09-09 13:22         ` Jungseok Lee
2015-09-09 13:22           ` Jungseok Lee
2015-09-09 18:13           ` James Morse
2015-09-09 18:13             ` James Morse
2015-09-10 23:30             ` Jungseok Lee
2015-09-10 23:30               ` Jungseok Lee
2015-09-07 15:42   ` [RFC PATCH 0/3] Implement IRQ stack on ARM64 Jungseok Lee
2015-09-07 15:42     ` Jungseok Lee

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.