* [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path.
@ 2022-01-25 15:26 Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 1/8] kernel/fork: Redo ifdefs around task's handling Sebastian Andrzej Siewior
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

[ This is a repost of https://lkml.kernel.org/r/20211118143452.136421-1-bigeasy@linutronix.de ]

This is a follow-up on the patch
    sched: Delay task stack freeing on RT 
    https://lkml.kernel.org/r/20210928122411.593486363@linutronix.de

It addresses the review feedback:
- Decouple stack accounting from its free invocation. The accounting
  happens in do_exit(), the final free call happens later.

- Add put_task_stack_sched() to finish_task_switch(). Here the VMAP
  stack is cached only. If it fails, or in the !VMAP case then the final
  free happens in delayed_put_task_struct(). This is also an opportunity
  to cache the stack.
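
Condensed, the intended flow looks like this (a sketch based on patch
7/8, not the literal code):

	/* finish_task_switch(), with preemption disabled: */
	put_task_stack_sched(prev);	/* only caches a VMAP stack; if the
					 * cache is full, or in the !VMAP
					 * case, the stack is merely marked
					 * for a delayed free */

	/* later, from the delayed_put_task_struct() RCU callback: */
	task_stack_cleanup(tsk);	/* final free; in the VMAP case this
					 * is another chance to cache */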

From testing I observe the following:

|            bash-1715    [006] .....   124.901510: copy_process: allocC ffffc90007e70000
|      sh-cmds.sh-1746    [007] .....   124.907389: copy_process: allocC ffffc90007dc4000
|          <idle>-0       [019] ...1.   124.918126: free_thread_stack: cache ffffc90007dc4000
|      sh-cmds.sh-1746    [007] .....   124.918279: copy_process: allocC ffffc90007de8000
|          <idle>-0       [004] ...1.   124.920121: free_thread_stack: delay ffffc90007de8001
|          <idle>-0       [007] ...1.   124.920299: free_thread_stack: cache ffffc90007e70000
|          <idle>-0       [007] ..s1.   124.945433: free_thread_stack: cache ffffc90007de8000

TS 124.901510, bash started sh-cmds.sh, which obtained its stack from
the cache.
TS 124.907389, the script invokes its first command, which obtained its
stack from the cache. As you can see, bash was running on CPU6 but its
child was moved to CPU7.
TS 124.918126, the first command is done, its stack is cached on CPU19.
TS 124.918279, the script's second command obtains its stack from the
cache.
TS 124.920121, the command is done. The stack cache on CPU4 is full, so
the free is delayed (note the lowest bit set in the logged address, the
delayed-free marker from patch 7/8).
TS 124.920299, the script is done, its stack is cached on CPU7.
TS 124.945433, the RCU callback of the last command is now happening, on
CPU7, which is where the command was invoked (but not where it ran).
Instead of freeing the stack, it was cached since CPU7 had an empty
slot.

If I pin the script to CPU5 and run it with multiple commands then it
works as expected:

|            bash-1799    [005] .....   993.608131: copy_process: allocC ffffc90007fa0000
|      sh-cmds.sh-1827    [005] .....   993.608888: copy_process: allocC ffffc90007fa8000
|      sh-cmds.sh-1827    [005] .....   993.610734: copy_process: allocV ffffc90007ff4000
|      sh-cmds.sh-1829    [005] ...1.   993.610757: free_thread_stack: cache ffffc90007fa8000
|      sh-cmds.sh-1827    [005] .....   993.612401: copy_process: allocC ffffc90007fa8000
|           <...>-1830    [005] ...1.   993.612416: free_thread_stack: cache ffffc90007ff4000
|      sh-cmds.sh-1827    [005] .....   993.613707: copy_process: allocC ffffc90007ff4000
|      sh-cmds.sh-1831    [005] ...1.   993.613723: free_thread_stack: cache ffffc90007fa8000
|      sh-cmds.sh-1827    [005] .....   993.615024: copy_process: allocC ffffc90007fa8000
|           <...>-1832    [005] ...1.   993.615040: free_thread_stack: cache ffffc90007ff4000
|      sh-cmds.sh-1827    [005] .....   993.616380: copy_process: allocC ffffc90007ff4000
|           <...>-1833    [005] ...1.   993.616397: free_thread_stack: cache ffffc90007fa8000
|            bash-1799    [005] ...1.   993.617759: free_thread_stack: cache ffffc90007fa0000
|          <idle>-0       [005] ...1.   993.617871: free_thread_stack: delay ffffc90007ff4001
|          <idle>-0       [005] ..s1.   993.638311: free_thread_stack: free ffffc90007ff4000

and no new stack is allocated during its runtime; a cached stack is used instead.

Sebastian




* [PATCH 1/8] kernel/fork: Redo ifdefs around task's handling.
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

The use of #ifdef CONFIG_VMAP_STACK is confusing in terms of what is
actually happening and what can happen.
For instance, from reading free_thread_stack() it appears that in the
CONFIG_VMAP_STACK case we may receive a non-NULL vm pointer, but it may
also be NULL, in which case __free_pages() is used to free the stack.
This is however not the case, because in the VMAP case a non-NULL
pointer is always returned here.
Since it looks as if this might happen, the compiler generates the dead
code for the __free_pages() invocation and everything around it. Twice.

Add spaces between the #ifdef and the identifier to make the current
ifdef nesting level recognizable.
Add the current identifier as a comment behind #else and #endif.
Move the code within free_thread_stack() and alloc_thread_stack_node()
into the relevant ifdef block.
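
The resulting nesting then looks like this (condensed from the diff
below):

#ifndef CONFIG_ARCH_THREAD_STACK_ALLOCATOR
# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
#  ifdef CONFIG_VMAP_STACK
	/* vmalloc()-based stacks with a per-CPU cache */
#  else /* !CONFIG_VMAP_STACK */
	/* page-based stacks via alloc_pages_node() */
#  endif /* CONFIG_VMAP_STACK */
# else /* !(THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)) */
	/* kmem_cache-based stacks */
# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
#endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */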

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 74 +++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 35 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b219..f63c0af6002da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -185,7 +185,7 @@ static inline void free_task_struct(struct task_struct *tsk)
  */
 # if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 
-#ifdef CONFIG_VMAP_STACK
+#  ifdef CONFIG_VMAP_STACK
 /*
  * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
  * flush.  Try to minimize the number of calls by caching stacks.
@@ -210,11 +210,9 @@ static int free_vm_stack_cache(unsigned int cpu)
 
 	return 0;
 }
-#endif
 
 static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
-#ifdef CONFIG_VMAP_STACK
 	void *stack;
 	int i;
 
@@ -258,7 +256,34 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 		tsk->stack = stack;
 	}
 	return stack;
-#else
+}
+
+static void free_thread_stack(struct task_struct *tsk)
+{
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+	int i;
+
+	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		memcg_kmem_uncharge_page(vm->pages[i], 0);
+
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		if (this_cpu_cmpxchg(cached_stacks[i], NULL,
+				     tsk->stack_vm_area) != NULL)
+			continue;
+
+		tsk->stack = NULL;
+		tsk->stack_vm_area = NULL;
+		return;
+	}
+	vfree_atomic(tsk->stack);
+	tsk->stack = NULL;
+	tsk->stack_vm_area = NULL;
+}
+
+#  else /* !CONFIG_VMAP_STACK */
+
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+{
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
@@ -267,36 +292,17 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 		return tsk->stack;
 	}
 	return NULL;
-#endif
 }
 
-static inline void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk)
 {
-#ifdef CONFIG_VMAP_STACK
-	struct vm_struct *vm = task_stack_vm_area(tsk);
-
-	if (vm) {
-		int i;
-
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
-			memcg_kmem_uncharge_page(vm->pages[i], 0);
-
-		for (i = 0; i < NR_CACHED_STACKS; i++) {
-			if (this_cpu_cmpxchg(cached_stacks[i],
-					NULL, tsk->stack_vm_area) != NULL)
-				continue;
-
-			return;
-		}
-
-		vfree_atomic(tsk->stack);
-		return;
-	}
-#endif
-
 	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
+	tsk->stack = NULL;
 }
-# else
+
+#  endif /* CONFIG_VMAP_STACK */
+# else /* !(THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)) */
+
 static struct kmem_cache *thread_stack_cache;
 
 static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
@@ -312,6 +318,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
 static void free_thread_stack(struct task_struct *tsk)
 {
 	kmem_cache_free(thread_stack_cache, tsk->stack);
+	tsk->stack = NULL;
 }
 
 void thread_stack_cache_init(void)
@@ -321,8 +328,9 @@ void thread_stack_cache_init(void)
 					THREAD_SIZE, NULL);
 	BUG_ON(thread_stack_cache == NULL);
 }
-# endif
-#endif
+
+# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
 
 /* SLAB cache for signal_struct structures (tsk->signal) */
 static struct kmem_cache *signal_cachep;
@@ -432,10 +440,6 @@ static void release_task_stack(struct task_struct *tsk)
 
 	account_kernel_stack(tsk, -1);
 	free_thread_stack(tsk);
-	tsk->stack = NULL;
-#ifdef CONFIG_VMAP_STACK
-	tsk->stack_vm_area = NULL;
-#endif
 }
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
-- 
2.34.1



* [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation.
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 1/8] kernel/fork: Redo ifdefs around task's handling Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-02-11 23:42   ` Andy Lutomirski
  2022-01-25 15:26 ` [PATCH 3/8] kernel/fork, IA64: Provide an alloc_thread_stack_node() for IA64 Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

alloc_thread_stack_node() already populates the task_struct::stack
member except on IA64. The stack pointer is saved and populated again
because IA64 needs it and arch_dup_task_struct() overwrites it.

Allocate thread's stack after task_struct has been duplicated as a
preparation.
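
Condensed, the resulting order in dup_task_struct() is (taken from the
diff below):

	tsk = alloc_task_struct_node(node);
	if (!tsk)
		return NULL;

	err = arch_dup_task_struct(tsk, orig);	/* may clobber ->stack */
	if (err)
		goto free_tsk;

	stack = alloc_thread_stack_node(tsk, node);	/* assigns tsk->stack */
	if (!stack)
		goto free_tsk;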

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f63c0af6002da..c47dcba5d66d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -888,6 +888,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!tsk)
 		return NULL;
 
+	err = arch_dup_task_struct(tsk, orig);
+	if (err)
+		goto free_tsk;
+
 	stack = alloc_thread_stack_node(tsk, node);
 	if (!stack)
 		goto free_tsk;
@@ -897,8 +901,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 	stack_vm_area = task_stack_vm_area(tsk);
 
-	err = arch_dup_task_struct(tsk, orig);
-
 	/*
 	 * arch_dup_task_struct() clobbers the stack-related fields.  Make
 	 * sure they're properly initialized before using any stack-related
@@ -912,9 +914,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
 
-	if (err)
-		goto free_stack;
-
 	err = scs_prepare(tsk, node);
 	if (err)
 		goto free_stack;
-- 
2.34.1



* [PATCH 3/8] kernel/fork, IA64: Provide an alloc_thread_stack_node() for IA64.
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 1/8] kernel/fork: Redo ifdefs around task's handling Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-02-14 18:00   ` Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 4/8] kernel/fork: Don't assign the stack pointer in dup_task_struct() Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

Provide a generic alloc_thread_stack_node() for
IA64/CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns the stack pointer
and sets task_struct::stack so it behaves exactly like the other
implementations.

Rename IA64's alloc_thread_stack_node() and add the generic version to
the fork code so it is in one place _and_ to drastically lower the
chances of fat-fingering the IA64 code.
Do the same for free_thread_stack().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 arch/ia64/include/asm/thread_info.h |  6 +++---
 kernel/fork.c                       | 16 ++++++++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 51d20cb377062..1684716f08201 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -55,15 +55,15 @@ struct thread_info {
 #ifndef ASM_OFFSETS_C
 /* how to get the thread information struct from C */
 #define current_thread_info()	((struct thread_info *) ((char *) current + IA64_TASK_SIZE))
-#define alloc_thread_stack_node(tsk, node)	\
+#define arch_alloc_thread_stack_node(tsk, node)	\
 		((unsigned long *) ((char *) (tsk) + IA64_TASK_SIZE))
 #define task_thread_info(tsk)	((struct thread_info *) ((char *) (tsk) + IA64_TASK_SIZE))
 #else
 #define current_thread_info()	((struct thread_info *) 0)
-#define alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
+#define arch_alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_stack(tsk)	/* nothing */
+#define arch_free_thread_stack(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/kernel/fork.c b/kernel/fork.c
index c47dcba5d66d2..a0d58ae6fac76 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -330,6 +330,22 @@ void thread_stack_cache_init(void)
 }
 
 # endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#else /* CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
+
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+{
+	unsigned long *stack;
+
+	stack = arch_alloc_thread_stack_node(tsk, node);
+	tsk->stack = stack;
+	return stack;
+}
+
+static void free_thread_stack(struct task_struct *tsk, bool cache_only)
+{
+	arch_free_thread_stack(tsk);
+}
+
 #endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
 
 /* SLAB cache for signal_struct structures (tsk->signal) */
-- 
2.34.1



* [PATCH 4/8] kernel/fork: Don't assign the stack pointer in dup_task_struct().
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 3/8] kernel/fork, IA64: Provide an alloc_thread_stack_node() for IA64 Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 5/8] kernel/fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

All four versions of alloc_thread_stack_node() now assign
task_struct::stack in case the allocation was successful.

Let alloc_thread_stack_node() return an error code instead of the stack
pointer and remove the stack assignment in dup_task_struct().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 47 ++++++++++++++++-------------------------------
 1 file changed, 16 insertions(+), 31 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index a0d58ae6fac76..e6337b3c34ff7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -211,7 +211,7 @@ static int free_vm_stack_cache(unsigned int cpu)
 	return 0;
 }
 
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	void *stack;
 	int i;
@@ -232,7 +232,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 		tsk->stack_vm_area = s;
 		tsk->stack = s->addr;
-		return s->addr;
+		return 0;
 	}
 
 	/*
@@ -245,17 +245,16 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 				     THREADINFO_GFP & ~__GFP_ACCOUNT,
 				     PAGE_KERNEL,
 				     0, node, __builtin_return_address(0));
-
+	if (!stack)
+		return -ENOMEM;
 	/*
 	 * We can't call find_vm_area() in interrupt context, and
 	 * free_thread_stack() can be called in interrupt context,
 	 * so cache the vm_struct.
 	 */
-	if (stack) {
-		tsk->stack_vm_area = find_vm_area(stack);
-		tsk->stack = stack;
-	}
-	return stack;
+	tsk->stack_vm_area = find_vm_area(stack);
+	tsk->stack = stack;
+	return 0;
 }
 
 static void free_thread_stack(struct task_struct *tsk)
@@ -282,16 +281,16 @@ static void free_thread_stack(struct task_struct *tsk)
 
 #  else /* !CONFIG_VMAP_STACK */
 
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	if (likely(page)) {
 		tsk->stack = kasan_reset_tag(page_address(page));
-		return tsk->stack;
+		return 0;
 	}
-	return NULL;
+	return -ENOMEM;
 }
 
 static void free_thread_stack(struct task_struct *tsk)
@@ -305,14 +304,13 @@ static void free_thread_stack(struct task_struct *tsk)
 
 static struct kmem_cache *thread_stack_cache;
 
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
-						  int node)
+static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	unsigned long *stack;
 	stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
 	stack = kasan_reset_tag(stack);
 	tsk->stack = stack;
-	return stack;
+	return stack ? 0 : -ENOMEM;
 }
 
 static void free_thread_stack(struct task_struct *tsk)
@@ -332,13 +330,13 @@ void thread_stack_cache_init(void)
 # endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
 #else /* CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
 
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	unsigned long *stack;
 
 	stack = arch_alloc_thread_stack_node(tsk, node);
 	tsk->stack = stack;
-	return stack;
+	return stack ? 0 : -ENOMEM;
 }
 
 static void free_thread_stack(struct task_struct *tsk, bool cache_only)
@@ -894,8 +892,6 @@ void set_task_stack_end_magic(struct task_struct *tsk)
 static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
-	unsigned long *stack;
-	struct vm_struct *stack_vm_area __maybe_unused;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -908,24 +904,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_tsk;
 
-	stack = alloc_thread_stack_node(tsk, node);
-	if (!stack)
+	err = alloc_thread_stack_node(tsk, node);
+	if (err)
 		goto free_tsk;
 
 	if (memcg_charge_kernel_stack(tsk))
 		goto free_stack;
 
-	stack_vm_area = task_stack_vm_area(tsk);
-
-	/*
-	 * arch_dup_task_struct() clobbers the stack-related fields.  Make
-	 * sure they're properly initialized before using any stack-related
-	 * functions again.
-	 */
-	tsk->stack = stack;
-#ifdef CONFIG_VMAP_STACK
-	tsk->stack_vm_area = stack_vm_area;
-#endif
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
-- 
2.34.1



* [PATCH 5/8] kernel/fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK.
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 4/8] kernel/fork: Don't assign the stack pointer in dup_task_struct() Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-01-25 15:26 ` [PATCH 6/8] kernel/fork: Move task stack account to do_exit() Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

memcg_charge_kernel_stack() is only used in the CONFIG_VMAP_STACK case.

Move memcg_charge_kernel_stack() into the CONFIG_VMAP_STACK block and
invoke it from within alloc_thread_stack_node().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 69 +++++++++++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 33 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index e6337b3c34ff7..73f644482e932 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -211,6 +211,32 @@ static int free_vm_stack_cache(unsigned int cpu)
 	return 0;
 }
 
+static int memcg_charge_kernel_stack(struct task_struct *tsk)
+{
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+	int i;
+	int ret;
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+	BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+
+	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+		ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
+		if (ret)
+			goto err;
+	}
+	return 0;
+err:
+	/*
+	 * If memcg_kmem_charge_page() fails, page's memory cgroup pointer is
+	 * NULL, and memcg_kmem_uncharge_page() in free_thread_stack() will
+	 * ignore this page.
+	 */
+	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		memcg_kmem_uncharge_page(vm->pages[i], 0);
+	return ret;
+}
+
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	void *stack;
@@ -230,6 +256,11 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		/* Clear stale pointers from reused stack. */
 		memset(s->addr, 0, THREAD_SIZE);
 
+		if (memcg_charge_kernel_stack(tsk)) {
+			vfree(s->addr);
+			return -ENOMEM;
+		}
+
 		tsk->stack_vm_area = s;
 		tsk->stack = s->addr;
 		return 0;
@@ -247,6 +278,11 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 				     0, node, __builtin_return_address(0));
 	if (!stack)
 		return -ENOMEM;
+
+	if (memcg_charge_kernel_stack(tsk)) {
+		vfree(stack);
+		return -ENOMEM;
+	}
 	/*
 	 * We can't call find_vm_area() in interrupt context, and
 	 * free_thread_stack() can be called in interrupt context,
@@ -417,36 +453,6 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 	}
 }
 
-static int memcg_charge_kernel_stack(struct task_struct *tsk)
-{
-#ifdef CONFIG_VMAP_STACK
-	struct vm_struct *vm = task_stack_vm_area(tsk);
-	int ret;
-
-	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
-
-	if (vm) {
-		int i;
-
-		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
-			/*
-			 * If memcg_kmem_charge_page() fails, page's
-			 * memory cgroup pointer is NULL, and
-			 * memcg_kmem_uncharge_page() in free_thread_stack()
-			 * will ignore this page.
-			 */
-			ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL,
-						     0);
-			if (ret)
-				return ret;
-		}
-	}
-#endif
-	return 0;
-}
-
 static void release_task_stack(struct task_struct *tsk)
 {
 	if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
@@ -908,9 +914,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_tsk;
 
-	if (memcg_charge_kernel_stack(tsk))
-		goto free_stack;
-
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
-- 
2.34.1



* [PATCH 6/8] kernel/fork: Move task stack account to do_exit().
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 5/8] kernel/fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-02-11 23:43   ` Andy Lutomirski
  2022-01-25 15:26 ` [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch() Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

There is no need to perform the stack accounting of the outgoing task in
its final schedule() invocation which happens with disabled preemption.
The task is leaving, the resources will be freed and the accounting can
happen in do_exit() before the actual schedule invocation which
frees the stack memory.

Move the accounting of the stack memory from release_task_stack() to
exit_task_stack_account() which then can be invoked from do_exit().
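
In do_exit() this then sits right before the final part of the exit
path (condensed from the diff below):

	validate_creds_for_do_exit(tsk);
	exit_task_stack_account(tsk);	/* accounting + memcg uncharge */

	check_stack_usage();
	preempt_disable();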

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched/task_stack.h |  2 ++
 kernel/exit.c                    |  1 +
 kernel/fork.c                    | 35 +++++++++++++++++++++-----------
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index d10150587d819..892562ebbd3aa 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -79,6 +79,8 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
 static inline void put_task_stack(struct task_struct *tsk) {}
 #endif
 
+void exit_task_stack_account(struct task_struct *tsk);
+
 #define task_stack_end_corrupted(task) \
 		(*(end_of_stack(task)) != STACK_END_MAGIC)
 
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab93..c303cffe7fdb4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -845,6 +845,7 @@ void __noreturn do_exit(long code)
 		put_page(tsk->task_frag.page);
 
 	validate_creds_for_do_exit(tsk);
+	exit_task_stack_account(tsk);
 
 	check_stack_usage();
 	preempt_disable();
diff --git a/kernel/fork.c b/kernel/fork.c
index 73f644482e932..5f4e659a922e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -211,9 +211,8 @@ static int free_vm_stack_cache(unsigned int cpu)
 	return 0;
 }
 
-static int memcg_charge_kernel_stack(struct task_struct *tsk)
+static int memcg_charge_kernel_stack(struct vm_struct *vm)
 {
-	struct vm_struct *vm = task_stack_vm_area(tsk);
 	int i;
 	int ret;
 
@@ -239,6 +238,7 @@ static int memcg_charge_kernel_stack(struct task_struct *tsk)
 
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
+	struct vm_struct *vm;
 	void *stack;
 	int i;
 
@@ -256,7 +256,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		/* Clear stale pointers from reused stack. */
 		memset(s->addr, 0, THREAD_SIZE);
 
-		if (memcg_charge_kernel_stack(tsk)) {
+		if (memcg_charge_kernel_stack(s)) {
 			vfree(s->addr);
 			return -ENOMEM;
 		}
@@ -279,7 +279,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	if (!stack)
 		return -ENOMEM;
 
-	if (memcg_charge_kernel_stack(tsk)) {
+	vm = find_vm_area(stack);
+	if (memcg_charge_kernel_stack(vm)) {
 		vfree(stack);
 		return -ENOMEM;
 	}
@@ -288,19 +289,15 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	 * free_thread_stack() can be called in interrupt context,
 	 * so cache the vm_struct.
 	 */
-	tsk->stack_vm_area = find_vm_area(stack);
+	tsk->stack_vm_area = vm;
 	tsk->stack = stack;
 	return 0;
 }
 
 static void free_thread_stack(struct task_struct *tsk)
 {
-	struct vm_struct *vm = task_stack_vm_area(tsk);
 	int i;
 
-	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
-		memcg_kmem_uncharge_page(vm->pages[i], 0);
-
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
 		if (this_cpu_cmpxchg(cached_stacks[i], NULL,
 				     tsk->stack_vm_area) != NULL)
@@ -453,12 +450,25 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 	}
 }
 
+void exit_task_stack_account(struct task_struct *tsk)
+{
+	account_kernel_stack(tsk, -1);
+
+	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+		struct vm_struct *vm;
+		int i;
+
+		vm = task_stack_vm_area(tsk);
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+			memcg_kmem_uncharge_page(vm->pages[i], 0);
+	}
+}
+
 static void release_task_stack(struct task_struct *tsk)
 {
 	if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
 		return;  /* Better to leak the stack than to free prematurely */
 
-	account_kernel_stack(tsk, -1);
 	free_thread_stack(tsk);
 }
 
@@ -917,6 +927,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
+	account_kernel_stack(tsk, 1);
 
 	err = scs_prepare(tsk, node);
 	if (err)
@@ -960,8 +971,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
-	account_kernel_stack(tsk, 1);
-
 	kcov_task_init(tsk);
 	kmap_local_fork(tsk);
 
@@ -980,6 +989,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	return tsk;
 
 free_stack:
+	exit_task_stack_account(tsk);
 	free_thread_stack(tsk);
 free_tsk:
 	free_task_struct(tsk);
@@ -2448,6 +2458,7 @@ static __latent_entropy struct task_struct *copy_process(
 	exit_creds(p);
 bad_fork_free:
 	WRITE_ONCE(p->__state, TASK_DEAD);
+	exit_task_stack_account(p);
 	put_task_stack(p);
 	delayed_free_task(p);
 fork_out:
-- 
2.34.1



* [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 6/8] kernel/fork: Move task stack account to do_exit() Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-02-11 23:55   ` Andy Lutomirski
  2022-01-25 15:26 ` [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack() Sebastian Andrzej Siewior
  2022-02-08 17:10 ` [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
  8 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

The task stack could be deallocated later in delayed_put_task_struct().
For fork()/exec() kind of workloads (say a shell script executing
several commands) it is important that the stack is released in
finish_task_switch() so that in VMAP_STACK case it can be cached and
reused in the new task.
If the free/caching is RCU-delayed then a new stack has to be allocated
because the cache is filled in batches of which only two stacks, out of
many, are recycled.

For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could
be avoided in the scheduling path. Far worse are the other
free_thread_stack() implementations which invoke __free_pages()/
kmem_cache_free() with disabled preemption.

Introduce put_task_stack_sched() which is invoked from the
finish_task_switch() and only caches the VMAP stack. If the cache is
full or !CONFIG_VMAP_STACK is used then the stack is freed from
delayed_put_task_struct(). In the VMAP case this is another opportunity
to fill the cache.

The stack is finally released in delayed_put_task_struct() which means
that a valid stack reference can be held during its invocation. As such,
no assumption can be made about whether the task_struct::stack pointer
can be freed if non-NULL.
Set the lowest bit of task_struct::stack if the stack was released via
put_task_stack_sched() and needs a final free in
delayed_put_task_struct(). If the bit is missing then a reference is
held and put_task_stack() will release it.
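
Condensed from the diff below, the marking and the final cleanup are:

#define THREAD_STACK_DELAYED_FREE	1UL

static void thread_stack_mark_delayed_free(struct task_struct *tsk)
{
	unsigned long val = (unsigned long)tsk->stack;

	val |= THREAD_STACK_DELAYED_FREE;
	WRITE_ONCE(tsk->stack, (void *)val);
}

void task_stack_cleanup(struct task_struct *tsk)
{
	unsigned long val = (unsigned long)tsk->stack;

	if (!(val & THREAD_STACK_DELAYED_FREE))
		return;

	WRITE_ONCE(tsk->stack, (void *)(val & ~THREAD_STACK_DELAYED_FREE));
	free_thread_stack(tsk, false);
}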

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched/task_stack.h |  8 +++++
 kernel/exit.c                    |  1 +
 kernel/fork.c                    | 60 ++++++++++++++++++++++++++------
 kernel/sched/core.c              |  7 ++--
 4 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 892562ebbd3aa..ccd1336aa7f42 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -70,6 +70,7 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
 }
 
 extern void put_task_stack(struct task_struct *tsk);
+extern void put_task_stack_sched(struct task_struct *tsk);
 #else
 static inline void *try_get_task_stack(struct task_struct *tsk)
 {
@@ -77,6 +78,13 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
 }
 
 static inline void put_task_stack(struct task_struct *tsk) {}
+static inline void put_task_stack_sched(struct task_struct *tsk) {}
+#endif
+
+#ifdef CONFIG_ARCH_THREAD_STACK_ALLOCATOR
+static inline void task_stack_cleanup(struct task_struct *tsk) {}
+#else
+extern void task_stack_cleanup(struct task_struct *tsk);
 #endif
 
 void exit_task_stack_account(struct task_struct *tsk);
diff --git a/kernel/exit.c b/kernel/exit.c
index c303cffe7fdb4..293b280d23192 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -171,6 +171,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
+	task_stack_cleanup(tsk);
 	put_task_struct(tsk);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f4e659a922e1..f48f666582b09 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -179,6 +179,16 @@ static inline void free_task_struct(struct task_struct *tsk)
 
 #ifndef CONFIG_ARCH_THREAD_STACK_ALLOCATOR
 
+#define THREAD_STACK_DELAYED_FREE	1UL
+
+static void thread_stack_mark_delayed_free(struct task_struct *tsk)
+{
+	unsigned long val = (unsigned long)tsk->stack;
+
+	val |= THREAD_STACK_DELAYED_FREE;
+	WRITE_ONCE(tsk->stack, (void *)val);
+}
+
 /*
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
@@ -294,7 +304,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return 0;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool cache_only)
 {
 	int i;
 
@@ -307,7 +317,12 @@ static void free_thread_stack(struct task_struct *tsk)
 		tsk->stack_vm_area = NULL;
 		return;
 	}
-	vfree_atomic(tsk->stack);
+	if (cache_only) {
+		thread_stack_mark_delayed_free(tsk);
+		return;
+	}
+
+	vfree(tsk->stack);
 	tsk->stack = NULL;
 	tsk->stack_vm_area = NULL;
 }
@@ -326,8 +341,12 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return -ENOMEM;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool cache_only)
 {
+	if (cache_only) {
+		thread_stack_mark_delayed_free(tsk);
+		return;
+	}
 	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 	tsk->stack = NULL;
 }
@@ -346,8 +365,12 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return stack ? 0 : -ENOMEM;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool cache_only)
 {
+	if (cache_only) {
+		thread_stack_mark_delayed_free(tsk);
+		return;
+	}
 	kmem_cache_free(thread_stack_cache, tsk->stack);
 	tsk->stack = NULL;
 }
@@ -361,8 +384,19 @@ void thread_stack_cache_init(void)
 }
 
 # endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
-#else /* CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
 
+void task_stack_cleanup(struct task_struct *tsk)
+{
+	unsigned long val = (unsigned long)tsk->stack;
+
+	if (!(val & THREAD_STACK_DELAYED_FREE))
+		return;
+
+	WRITE_ONCE(tsk->stack, (void *)(val & ~THREAD_STACK_DELAYED_FREE));
+	free_thread_stack(tsk, false);
+}
+
+#else /* CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	unsigned long *stack;
@@ -464,19 +498,25 @@ void exit_task_stack_account(struct task_struct *tsk)
 	}
 }
 
-static void release_task_stack(struct task_struct *tsk)
+static void release_task_stack(struct task_struct *tsk, bool cache_only)
 {
 	if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
 		return;  /* Better to leak the stack than to free prematurely */
 
-	free_thread_stack(tsk);
+	free_thread_stack(tsk, cache_only);
 }
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 void put_task_stack(struct task_struct *tsk)
 {
 	if (refcount_dec_and_test(&tsk->stack_refcount))
-		release_task_stack(tsk);
+		release_task_stack(tsk, false);
+}
+
+void put_task_stack_sched(struct task_struct *tsk)
+{
+	if (refcount_dec_and_test(&tsk->stack_refcount))
+		release_task_stack(tsk, true);
 }
 #endif
 
@@ -490,7 +530,7 @@ void free_task(struct task_struct *tsk)
 	 * The task is finally done with both the stack and thread_info,
 	 * so free both.
 	 */
-	release_task_stack(tsk);
+	release_task_stack(tsk, false);
 #else
 	/*
 	 * If the task had a separate stack allocation, it should be gone
@@ -990,7 +1030,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 free_stack:
 	exit_task_stack_account(tsk);
-	free_thread_stack(tsk);
+	free_thread_stack(tsk, false);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00e52d14..bfcb45c3e59dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4894,8 +4894,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/* Task is done with its stack. */
-		put_task_stack(prev);
+		/*
+		 * Cache only the VMAP stack. The final deallocation is in
+		 * delayed_put_task_struct.
+		 */
+		put_task_stack_sched(prev);
 
 		put_task_struct_rcu_user(prev);
 	}
-- 
2.34.1



* [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack().
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (6 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch() Sebastian Andrzej Siewior
@ 2022-01-25 15:26 ` Sebastian Andrzej Siewior
  2022-02-08 17:10 ` [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
  8 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-25 15:26 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot,
	Sebastian Andrzej Siewior

Not strictly needed, but checking CONFIG_VMAP_STACK instead of
task_stack_vm_area()'s result allows the compiler to remove the else
path in the CONFIG_VMAP_STACK case, where the pointer can't be NULL.

Check for CONFIG_VMAP_STACK in order to use the proper path.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f48f666582b09..92cafad00c653 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -468,16 +468,16 @@ void vm_area_free(struct vm_area_struct *vma)
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	void *stack = task_stack_page(tsk);
-	struct vm_struct *vm = task_stack_vm_area(tsk);
-
-	if (vm) {
+	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+		struct vm_struct *vm = task_stack_vm_area(tsk);
 		int i;
 
 		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
 			mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
 	} else {
+		void *stack = task_stack_page(tsk);
+
 		/* All stack pages are in the same node. */
 		mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
 				      account * (THREAD_SIZE / 1024));
-- 
2.34.1



* Re: [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path.
  2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
                   ` (7 preceding siblings ...)
  2022-01-25 15:26 ` [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack() Sebastian Andrzej Siewior
@ 2022-02-08 17:10 ` Sebastian Andrzej Siewior
  8 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-08 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-01-25 16:26:44 [+0100], To linux-kernel@vger.kernel.org wrote:
> [ This is a repost of https://lkml.kernel.org/r/20211118143452.136421-1-bigeasy@linutronix.de ]
> 
> This is a follow-up on the patch
>     sched: Delay task stack freeing on RT 
>     https://lkml.kernel.org/r/20210928122411.593486363@linutronix.de
> 
> It addresses the review feedback:
> - Decouple stack accounting from its free invocation. The accounting
>   happens in do_exit(), the final free call happens later.
> 
> - Add put_task_stack_sched() to finish_task_switch(). Here the VMAP
>   stack is cached only. If it fails, or in the !VMAP case then the final
>   free happens in delayed_put_task_struct(). This is also an opportunity
>   to cache the stack.
> 
> From testing I observe the following:
> 
> |            bash-1715    [006] .....   124.901510: copy_process: allocC ffffc90007e70000
> |      sh-cmds.sh-1746    [007] .....   124.907389: copy_process: allocC ffffc90007dc4000
> |          <idle>-0       [019] ...1.   124.918126: free_thread_stack: cache ffffc90007dc4000
> |      sh-cmds.sh-1746    [007] .....   124.918279: copy_process: allocC ffffc90007de8000
> |          <idle>-0       [004] ...1.   124.920121: free_thread_stack: delay ffffc90007de8001
> |          <idle>-0       [007] ...1.   124.920299: free_thread_stack: cache ffffc90007e70000
> |          <idle>-0       [007] ..s1.   124.945433: free_thread_stack: cache ffffc90007de8000
> 
> TS 124.901510, bash started sh-cmds.sh, which obtained its stack from
> the cache.
> TS 124.907389, the script invokes its first command, which obtained its
> stack from the cache. As you can see, bash was running on CPU6 but its
> child was moved to CPU7.
> TS 124.918126, the first command is done, its stack is cached on CPU19.
> TS 124.918279, the script's second command obtains its stack from the
> cache.
> TS 124.920121, the command is done. The stack cache on CPU4 is full, so
> the free is delayed (note the lowest bit set in the logged address, the
> delayed-free marker from patch 7/8).
> TS 124.920299, the script is done, its stack is cached on CPU7.
> TS 124.945433, the RCU callback of the last command is now happening, on
> CPU7, which is where the command was invoked (but not where it ran).
> Instead of freeing the stack, it was cached since CPU7 had an empty
> slot.
> 
> If I pin the script to CPU5 and run it with multiple commands then it
> works as expected:
> 
> |            bash-1799    [005] .....   993.608131: copy_process: allocC ffffc90007fa0000
> |      sh-cmds.sh-1827    [005] .....   993.608888: copy_process: allocC ffffc90007fa8000
> |      sh-cmds.sh-1827    [005] .....   993.610734: copy_process: allocV ffffc90007ff4000
> |      sh-cmds.sh-1829    [005] ...1.   993.610757: free_thread_stack: cache ffffc90007fa8000
> |      sh-cmds.sh-1827    [005] .....   993.612401: copy_process: allocC ffffc90007fa8000
> |           <...>-1830    [005] ...1.   993.612416: free_thread_stack: cache ffffc90007ff4000
> |      sh-cmds.sh-1827    [005] .....   993.613707: copy_process: allocC ffffc90007ff4000
> |      sh-cmds.sh-1831    [005] ...1.   993.613723: free_thread_stack: cache ffffc90007fa8000
> |      sh-cmds.sh-1827    [005] .....   993.615024: copy_process: allocC ffffc90007fa8000
> |           <...>-1832    [005] ...1.   993.615040: free_thread_stack: cache ffffc90007ff4000
> |      sh-cmds.sh-1827    [005] .....   993.616380: copy_process: allocC ffffc90007ff4000
> |           <...>-1833    [005] ...1.   993.616397: free_thread_stack: cache ffffc90007fa8000
> |            bash-1799    [005] ...1.   993.617759: free_thread_stack: cache ffffc90007fa0000
> |          <idle>-0       [005] ...1.   993.617871: free_thread_stack: delay ffffc90007ff4001
> |          <idle>-0       [005] ..s1.   993.638311: free_thread_stack: free ffffc90007ff4000
> 
> and no new stack is allocated during its runtime; a cached stack is used instead.

ping

Sebastian


* Re: [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation.
  2022-01-25 15:26 ` [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation Sebastian Andrzej Siewior
@ 2022-02-11 23:42   ` Andy Lutomirski
  2022-02-14 11:39     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2022-02-11 23:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, linux-ia64
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

On 1/25/22 07:26, Sebastian Andrzej Siewior wrote:
> alloc_thread_stack_node() already populates the task_struct::stack
> member except on IA64. The stack pointer is saved and populated again
> because IA64 needs it and arch_dup_task_struct() overwrites it.

I understand the problem, I think.

> 
> Allocate thread's stack after task_struct has been duplicated as a
> preparation.
> 

But I don't understand this.  How does this patch relate to the problem?

Also, you appear to be missing a change to the free_stack and free_tsk 
code at the end of dup_task_struct().


* Re: [PATCH 6/8] kernel/fork: Move task stack account to do_exit().
  2022-01-25 15:26 ` [PATCH 6/8] kernel/fork: Move task stack account to do_exit() Sebastian Andrzej Siewior
@ 2022-02-11 23:43   ` Andy Lutomirski
  0 siblings, 0 replies; 21+ messages in thread
From: Andy Lutomirski @ 2022-02-11 23:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, linux-ia64
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

On 1/25/22 07:26, Sebastian Andrzej Siewior wrote:
> There is no need to perform the stack accounting of the outgoing task in
> its final schedule() invocation which happens with disabled preemption.
> The task is leaving, the resources will be freed and the accounting can
> happen in do_exit() before the actual schedule invocation which
> frees the stack memory.
> 
> Move the accounting of the stack memory from release_task_stack() to
> exit_task_stack_account() which then can be invoked from do_exit().

Seems reasonable.


* Re: [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-01-25 15:26 ` [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch() Sebastian Andrzej Siewior
@ 2022-02-11 23:55   ` Andy Lutomirski
  2022-02-14 12:10     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2022-02-11 23:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, linux-ia64
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Vincent Guittot

On 1/25/22 07:26, Sebastian Andrzej Siewior wrote:
> The task stack could be deallocated later in delayed_put_task_struct().
> For fork()/exec() kind of workloads (say a shell script executing
> several commands) it is important that the stack is released in
> finish_task_switch() so that in VMAP_STACK case it can be cached and
> reused in the new task.
> If the free/caching is RCU-delayed then a new stack has to be allocated
> because the cache is filled in batches of which only two stacks, out of
> many, are recycled.
> 
> For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could
> be avoided in the scheduling path. Far worse are the other
> free_thread_stack() implementations which invoke __free_pages()/
> kmem_cache_free() with disabled preemption.
> 
> Introduce put_task_stack_sched() which is invoked from the
> finish_task_switch() and only caches the VMAP stack. If the cache is
> full or !CONFIG_VMAP_STACK is used then the stack is freed from
> delayed_put_task_struct(). In the VMAP case this is another opportunity
> to fill the cache.
> 
> The stack is finally released in delayed_put_task_struct() which means
> that a valid stack reference can be held during its invocation. As such,
> no assumption can be made about whether the task_struct::stack pointer
> can be freed if non-NULL.
> Set the lowest bit of task_struct::stack if the stack was released via
> put_task_stack_sched() and needs a final free in
> delayed_put_task_struct(). If the bit is missing then a reference is
> held and put_task_stack() will release it.

I don't understand what this bit is for or why the logic needs to be 
this complicated.  Can you set ->stack to NULL if and only if you freed 
it early?

> +static void free_thread_stack(struct task_struct *tsk, bool cache_only)

This is messy.  Please clean it up for real:

static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
{
   for (...) try to put it in this slot;
}

And the callers can do things like:

if (try_release_thread_stack_to_cache(...))
   return;

/* need to free for real */
free it or delayed-free it.

--Andy


* Re: [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation.
  2022-02-11 23:42   ` Andy Lutomirski
@ 2022-02-14 11:39     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 11:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-02-11 15:42:38 [-0800], Andy Lutomirski wrote:
> On 1/25/22 07:26, Sebastian Andrzej Siewior wrote:
> > alloc_thread_stack_node() already populates the task_struct::stack
> > member except on IA64. The stack pointer is saved and populated again
> > because IA64 needs it and arch_dup_task_struct() overwrites it.
> 
> I understand the problem, I think.
> 
> > 
> > Allocate thread's stack after task_struct has been duplicated as a
> > preparation.
> > 
> 
> But I don't understand this.  How does this patch relate to the problem?

So I duplicate the task_struct first and then assign the stack pointer
in alloc_thread_stack_node(), with no need to update the stack pointer
later. Otherwise arch_dup_task_struct() would reset the pointer.

> Also, you appear to be missing a change to the free_stack and free_tsk code
> at the end of dup_task_struct().

It looks right. What am I missing?

Sebastian


* Re: [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-02-11 23:55   ` Andy Lutomirski
@ 2022-02-14 12:10     ` Sebastian Andrzej Siewior
  2022-02-14 12:24       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 12:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-02-11 15:55:01 [-0800], Andy Lutomirski wrote:
> > Set the lowest bit of task_struct::stack if the stack was released via
> > put_task_stack_sched() and needs a final free in
> > delayed_put_task_struct(). If the bit is missing then a reference is
> > held and put_task_stack() will release it.
> 
> I don't understand what this bit is for or why the logic needs to be this
> complicated.  Can you set ->stack to NULL if and only if you freed it early?

What do I do if put_task_stack() is invoked from finish_task_switch()
and I can't free but have to do something?

> > +static void free_thread_stack(struct task_struct *tsk, bool cache_only)
> 
> This is messy.  Please clean it up for real:
> 
> static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
> {
>   for (...) try to put it in this slot;
> }
> 
> And the callers can do things like:
> 
> if (try_release_thread_stack_to_cache(...))
>   return;
> 
> /* need to free for real */
> free it or delayed-free it.

I think I could use the first few bytes of the stack as an RCU head. Let
me try that.

> --Andy

Sebastian


* Re: [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-02-14 12:10     ` Sebastian Andrzej Siewior
@ 2022-02-14 12:24       ` Sebastian Andrzej Siewior
  2022-02-14 16:54         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 12:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-02-14 13:10:06 [+0100], To Andy Lutomirski wrote:
> 
> I think I could use the first few bytes of the stack as an RCU head. Let
> me try that.

task::stack_vm_area and ::stack. Now I remember why I went for that bit.
But I do have (hopefully) a better idea now.

> > --Andy

Sebastian


* Re: [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-02-14 12:24       ` Sebastian Andrzej Siewior
@ 2022-02-14 16:54         ` Sebastian Andrzej Siewior
  2022-02-14 17:48           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 16:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-02-14 13:24:05 [+0100], To Andy Lutomirski wrote:
> task::stack_vm_area and ::stack. Now I remember why I went for that bit.
> But I do have (hopefully) a better idea now.

Need to update the patch description but that should work then:

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 892562ebbd3aa..12b3f472b1358 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -70,6 +70,7 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
 }
 
 extern void put_task_stack(struct task_struct *tsk);
+extern void put_task_stack_sched(struct task_struct *tsk);
 #else
 static inline void *try_get_task_stack(struct task_struct *tsk)
 {
@@ -77,6 +78,7 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
 }
 
 static inline void put_task_stack(struct task_struct *tsk) {}
+static inline void put_task_stack_sched(struct task_struct *tsk) {}
 #endif
 
 void exit_task_stack_account(struct task_struct *tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f4e659a922e1..d7e118c86f9e6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -193,6 +193,44 @@ static inline void free_task_struct(struct task_struct *tsk)
 #define NR_CACHED_STACKS 2
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 
+struct vm_stack {
+	struct rcu_head rcu;
+	struct vm_struct *stack_vm_area;
+};
+
+static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
+{
+	unsigned int i;
+
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+
+	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
+		return;
+
+	vfree(vm_stack);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct vm_stack *vm_stack = tsk->stack;
+
+	vm_stack->stack_vm_area = tsk->stack_vm_area;
+	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+
+	tsk->stack = NULL;
+	tsk->stack_vm_area = NULL;
+}
+
 static int free_vm_stack_cache(unsigned int cpu)
 {
 	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
@@ -294,26 +332,39 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return 0;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool delayed_free)
 {
-	int i;
-
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL,
-				     tsk->stack_vm_area) != NULL)
-			continue;
-
+	if (try_release_thread_stack_to_cache(tsk->stack_vm_area)) {
 		tsk->stack = NULL;
 		tsk->stack_vm_area = NULL;
 		return;
 	}
-	vfree_atomic(tsk->stack);
+
+	if (delayed_free) {
+		thread_stack_delayed_free(tsk);
+		return;
+	}
+
+	vfree(tsk->stack);
 	tsk->stack = NULL;
 	tsk->stack_vm_area = NULL;
 }
 
 #  else /* !CONFIG_VMAP_STACK */
 
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	__free_pages(virt_to_page(rh), THREAD_SIZE_ORDER);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct rcu_head *rh = tsk->stack;
+
+	call_rcu(rh, thread_stack_free_rcu);
+	tsk->stack = NULL;
+}
+
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
@@ -326,8 +377,12 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return -ENOMEM;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool delayed_free)
 {
+	if (delayed_free) {
+		thread_stack_delayed_free(tsk);
+		return;
+	}
 	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 	tsk->stack = NULL;
 }
@@ -337,6 +392,19 @@ static void free_thread_stack(struct task_struct *tsk)
 
 static struct kmem_cache *thread_stack_cache;
 
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	kmem_cache_free(thread_stack_cache, rh);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct rcu_head *rh = tsk->stack;
+
+	call_rcu(rh, thread_stack_free_rcu);
+	tsk->stack = NULL;
+}
+
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	unsigned long *stack;
@@ -346,8 +414,12 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return stack ? 0 : -ENOMEM;
 }
 
-static void free_thread_stack(struct task_struct *tsk)
+static void free_thread_stack(struct task_struct *tsk, bool delayed_free)
 {
+	if (delayed_free) {
+		thread_stack_delayed_free(tsk);
+		return;
+	}
 	kmem_cache_free(thread_stack_cache, tsk->stack);
 	tsk->stack = NULL;
 }
@@ -372,7 +444,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	return stack ? 0 : -ENOMEM;
 }
 
-static void free_thread_stack(struct task_struct *tsk, bool cache_only)
+static void free_thread_stack(struct task_struct *tsk, bool delayed_free)
 {
 	arch_free_thread_stack(tsk);
 }
@@ -464,19 +536,25 @@ void exit_task_stack_account(struct task_struct *tsk)
 	}
 }
 
-static void release_task_stack(struct task_struct *tsk)
+static void release_task_stack(struct task_struct *tsk, bool delayed_free)
 {
 	if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
 		return;  /* Better to leak the stack than to free prematurely */
 
-	free_thread_stack(tsk);
+	free_thread_stack(tsk, delayed_free);
 }
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 void put_task_stack(struct task_struct *tsk)
 {
 	if (refcount_dec_and_test(&tsk->stack_refcount))
-		release_task_stack(tsk);
+		release_task_stack(tsk, false);
+}
+
+void put_task_stack_sched(struct task_struct *tsk)
+{
+	if (refcount_dec_and_test(&tsk->stack_refcount))
+		release_task_stack(tsk, true);
 }
 #endif
 
@@ -490,7 +568,7 @@ void free_task(struct task_struct *tsk)
 	 * The task is finally done with both the stack and thread_info,
 	 * so free both.
 	 */
-	release_task_stack(tsk);
+	release_task_stack(tsk, false);
 #else
 	/*
 	 * If the task had a separate stack allocation, it should be gone
@@ -990,7 +1068,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 free_stack:
 	exit_task_stack_account(tsk);
-	free_thread_stack(tsk);
+	free_thread_stack(tsk, false);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcf0c180617c2..defe31036930a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4895,8 +4895,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
-		/* Task is done with its stack. */
-		put_task_stack(prev);
+		/*
+		 * Task is done with its stack. Try to cache the VMAP stack
+		 * and delay-free it otherwise.
+		 */
+		put_task_stack_sched(prev);
 
 		put_task_struct_rcu_user(prev);
 	}
-- 
2.34.1


Sebastian


* Re: [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-02-14 16:54         ` Sebastian Andrzej Siewior
@ 2022-02-14 17:48           ` Sebastian Andrzej Siewior
  2022-02-14 18:15             ` [PATCH v2 " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 17:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-02-14 17:54:48 [+0100], To Andy Lutomirski wrote:
> index fcf0c180617c2..defe31036930a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4895,8 +4895,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
>  
> -		/* Task is done with its stack. */
> -		put_task_stack(prev);
> +		/*
> +		 * Task is done with its stack. Try to cache the VMAP stack
> +		 * and delay-free it otherwise.
> +		 */
> +		put_task_stack_sched(prev);

Now that I'm writing the commit message: there is probably nothing
wrong with unconditionally delaying the free via RCU if caching failed.
Then I don't have to explain that one function is for the atomic
context and the other for the non-atomic one.
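
Concretely, the VMAP_STACK free path would then collapse to a single
variant, which is what the v2 patch below ends up doing:

static void free_thread_stack(struct task_struct *tsk)
{
	/* Cache the stack if possible, otherwise always defer to RCU. */
	if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
		thread_stack_delayed_free(tsk);

	tsk->stack = NULL;
	tsk->stack_vm_area = NULL;
}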

>  		put_task_struct_rcu_user(prev);
>  	}
> -- 
Sebastian


* Re: [PATCH 3/8] kernel/fork, IA64: Provide a alloc_thread_stack_node() for IA64.
  2022-01-25 15:26 ` [PATCH 3/8] kernel/fork, IA64: Provide a alloc_thread_stack_node() for IA64 Sebastian Andrzej Siewior
@ 2022-02-14 18:00   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 18:00 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Andy Lutomirski, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

On 2022-01-25 16:26:47 [+0100], To linux-kernel@vger.kernel.org wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c47dcba5d66d2..a0d58ae6fac76 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -330,6 +330,22 @@ void thread_stack_cache_init(void)
>  
> +static void free_thread_stack(struct task_struct *tsk, bool cache_only)

This cache_only parameter shouldn't be here… (see the corrected sketch
below the hunk).

> +{
> +	arch_free_thread_stack(tsk);
> +}
> +
>  #endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
>  
>  /* SLAB cache for signal_struct structures (tsk->signal) */
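
The corrected version should presumably keep the original signature,
i.e. (a sketch):

/* The ARCH_THREAD_STACK_ALLOCATOR variant has nothing to cache; it
 * just forwards to the arch hook. */
static void free_thread_stack(struct task_struct *tsk)
{
	arch_free_thread_stack(tsk);
}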

Sebastian


* [PATCH v2 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch().
  2022-02-14 17:48           ` Sebastian Andrzej Siewior
@ 2022-02-14 18:15             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-14 18:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ia64, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Vincent Guittot

The task stack can be deallocated later. For fork()/exec() kind of
workloads (say a shell script executing several commands) it is
important that the stack is released in finish_task_switch() so that in
the VMAP_STACK case it can be cached and reused by the next task.

For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could
be avoided in the scheduling path. Far worse are the other
free_thread_stack() implementations which invoke __free_pages()/
kmem_cache_free() with preemption disabled.

Cache the stack in free_thread_stack() in the VMAP_STACK case and
RCU-delay the free path otherwise. Free the stack in the RCU callback.
In the VMAP_STACK case this is another opportunity to fill the cache.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

If that works and there are no other objections then I'm going to
repost the complete series.

 kernel/fork.c | 76 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 63 insertions(+), 13 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 984f69d6f211f..aa17ed2a2afc7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -193,6 +193,41 @@ static inline void free_task_struct(struct task_struct *tsk)
 #define NR_CACHED_STACKS 2
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 
+struct vm_stack {
+	struct rcu_head rcu;
+	struct vm_struct *stack_vm_area;
+};
+
+static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
+{
+	unsigned int i;
+
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+
+	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
+		return;
+
+	vfree(vm_stack);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct vm_stack *vm_stack = tsk->stack;
+
+	vm_stack->stack_vm_area = tsk->stack_vm_area;
+	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+}
+
 static int free_vm_stack_cache(unsigned int cpu)
 {
 	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
@@ -296,24 +331,27 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static void free_thread_stack(struct task_struct *tsk)
 {
-	int i;
+	if (!try_release_thread_stack_to_cache(tsk->stack_vm_area))
+		thread_stack_delayed_free(tsk);
 
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL,
-				     tsk->stack_vm_area) != NULL)
-			continue;
-
-		tsk->stack = NULL;
-		tsk->stack_vm_area = NULL;
-		return;
-	}
-	vfree_atomic(tsk->stack);
 	tsk->stack = NULL;
 	tsk->stack_vm_area = NULL;
 }
 
 #  else /* !CONFIG_VMAP_STACK */
 
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	__free_pages(virt_to_page(rh), THREAD_SIZE_ORDER);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct rcu_head *rh = tsk->stack;
+
+	call_rcu(rh, thread_stack_free_rcu);
+}
+
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
@@ -328,7 +366,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static void free_thread_stack(struct task_struct *tsk)
 {
-	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
+	thread_stack_delayed_free(tsk);
 	tsk->stack = NULL;
 }
 
@@ -337,6 +375,18 @@ static void free_thread_stack(struct task_struct *tsk)
 
 static struct kmem_cache *thread_stack_cache;
 
+static void thread_stack_free_rcu(struct rcu_head *rh)
+{
+	kmem_cache_free(thread_stack_cache, rh);
+}
+
+static void thread_stack_delayed_free(struct task_struct *tsk)
+{
+	struct rcu_head *rh = tsk->stack;
+
+	call_rcu(rh, thread_stack_free_rcu);
+}
+
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	unsigned long *stack;
@@ -348,7 +398,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static void free_thread_stack(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_stack_cache, tsk->stack);
+	thread_stack_delayed_free(tsk);
 	tsk->stack = NULL;
 }
 
-- 
2.34.1



* [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack().
  2021-11-18 14:34 [PATCH " Sebastian Andrzej Siewior
@ 2021-11-18 14:34 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-11-18 14:34 UTC (permalink / raw)
  To: linux-kernel, linux-ia64
  Cc: Ingo Molnar, Peter Zijlstra, Andy Lutomirski, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Sebastian Andrzej Siewior

Not strictly needed, but checking CONFIG_VMAP_STACK instead of
task_stack_vm_area()'s result allows the compiler to remove the else
path in the CONFIG_VMAP_STACK case, where the pointer can't be NULL.

Check for CONFIG_VMAP_STACK in order to use the proper path.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0aa079e56e3b4..15286e469cf2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,16 +465,16 @@ void vm_area_free(struct vm_area_struct *vma)
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	void *stack = task_stack_page(tsk);
-	struct vm_struct *vm = task_stack_vm_area(tsk);
-
-	if (vm) {
+	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+		struct vm_struct *vm = task_stack_vm_area(tsk);
 		int i;
 
 		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
 			mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
 	} else {
+		void *stack = task_stack_page(tsk);
+
 		/* All stack pages are in the same node. */
 		mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
 				      account * (THREAD_SIZE / 1024));
-- 
2.33.1



Thread overview: 21+ messages
2022-01-25 15:26 [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 1/8] kernel/fork: Redo ifdefs around task's handling Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 2/8] kernel/fork: Duplicate task_struct before stack allocation Sebastian Andrzej Siewior
2022-02-11 23:42   ` Andy Lutomirski
2022-02-14 11:39     ` Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 3/8] kernel/fork, IA64: Provide a alloc_thread_stack_node() for IA64 Sebastian Andrzej Siewior
2022-02-14 18:00   ` Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 4/8] kernel/fork: Don't assign the stack pointer in dup_task_struct() Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 5/8] kernel/fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 6/8] kernel/fork: Move task stack account to do_exit() Sebastian Andrzej Siewior
2022-02-11 23:43   ` Andy Lutomirski
2022-01-25 15:26 ` [PATCH 7/8] kernel/fork: Only cache the VMAP stack in finish_task_switch() Sebastian Andrzej Siewior
2022-02-11 23:55   ` Andy Lutomirski
2022-02-14 12:10     ` Sebastian Andrzej Siewior
2022-02-14 12:24       ` Sebastian Andrzej Siewior
2022-02-14 16:54         ` Sebastian Andrzej Siewior
2022-02-14 17:48           ` Sebastian Andrzej Siewior
2022-02-14 18:15             ` [PATCH v2 " Sebastian Andrzej Siewior
2022-01-25 15:26 ` [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack() Sebastian Andrzej Siewior
2022-02-08 17:10 ` [PATCH REPOST 0/8] kernel/fork: Move thread stack free out of the scheduler path Sebastian Andrzej Siewior
  -- strict thread matches above, loose matches on Subject: below --
2021-11-18 14:34 [PATCH " Sebastian Andrzej Siewior
2021-11-18 14:34 ` [PATCH 8/8] kernel/fork: Use IS_ENABLED() in account_kernel_stack() Sebastian Andrzej Siewior
