From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1764393AbcIPJUe (ORCPT ); Fri, 16 Sep 2016 05:20:34 -0400
Received: from terminus.zytor.com ([198.137.202.10]:37258 "EHLO
	terminus.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756469AbcIPJU0 (ORCPT );
	Fri, 16 Sep 2016 05:20:26 -0400
Date: Fri, 16 Sep 2016 02:19:31 -0700
From: tip-bot for Andy Lutomirski
Message-ID:
Cc: jpoimboe@redhat.com, bp@alien8.de, torvalds@linux-foundation.org,
	linux-kernel@vger.kernel.org, brgerst@gmail.com, mingo@kernel.org,
	dvlasenk@redhat.com, peterz@infradead.org, luto@kernel.org,
	tglx@linutronix.de, jann@thejh.net, hpa@zytor.com
Reply-To: luto@kernel.org, jann@thejh.net, hpa@zytor.com,
	tglx@linutronix.de, jpoimboe@redhat.com, peterz@infradead.org,
	dvlasenk@redhat.com, mingo@kernel.org,
	linux-kernel@vger.kernel.org, brgerst@gmail.com,
	torvalds@linux-foundation.org, bp@alien8.de
In-Reply-To: <94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org>
References: <94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org>
To: linux-tip-commits@vger.kernel.org
Subject: [tip:x86/asm] fork: Optimize task creation by caching two thread
	stacks per CPU if CONFIG_VMAP_STACK=y
Git-Commit-ID: ac496bf48d97f2503eaa353996a4dd5e4383eaf0
X-Mailer: tip-git-log-daemon
Robot-ID:
Robot-Unsubscribe: Contact to get blacklisted from these emails
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Commit-ID:  ac496bf48d97f2503eaa353996a4dd5e4383eaf0
Gitweb:     http://git.kernel.org/tip/ac496bf48d97f2503eaa353996a4dd5e4383eaf0
Author:     Andy Lutomirski
AuthorDate: Thu, 15 Sep 2016 22:45:49 -0700
Committer:  Ingo Molnar
CommitDate: Fri, 16 Sep 2016 09:18:54 +0200

fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y

vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually
force a global TLB flush.

To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread
stacks per CPU.  This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell script
style).

On my silly pthread_create() benchmark, it saves about 2 µs per
pthread_create()+join() with CONFIG_VMAP_STACK=y.

Signed-off-by: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Jann Horn
Cc: Josh Poimboeuf
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar
---
 kernel/fork.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 53 insertions(+), 9 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 5dd0a51..c060c7e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -159,15 +159,41 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * kmemcache based allocator.
  */
 # if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+
+#ifdef CONFIG_VMAP_STACK
+/*
+ * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
+ * flush.  Try to minimize the number of calls by caching stacks.
+ */
+#define NR_CACHED_STACKS 2
+static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
+#endif
+
 static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
-	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
-					   VMALLOC_START, VMALLOC_END,
-					   THREADINFO_GFP | __GFP_HIGHMEM,
-					   PAGE_KERNEL,
-					   0, node,
-					   __builtin_return_address(0));
+	void *stack;
+	int i;
+
+	local_irq_disable();
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		struct vm_struct *s = this_cpu_read(cached_stacks[i]);
+
+		if (!s)
+			continue;
+		this_cpu_write(cached_stacks[i], NULL);
+
+		tsk->stack_vm_area = s;
+		local_irq_enable();
+		return s->addr;
+	}
+	local_irq_enable();
+
+	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
+				     VMALLOC_START, VMALLOC_END,
+				     THREADINFO_GFP | __GFP_HIGHMEM,
+				     PAGE_KERNEL,
+				     0, node, __builtin_return_address(0));
 
 	/*
 	 * We can't call find_vm_area() in interrupt context, and
@@ -187,10 +213,28 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static inline void free_thread_stack(struct task_struct *tsk)
 {
-	if (task_stack_vm_area(tsk))
+#ifdef CONFIG_VMAP_STACK
+	if (task_stack_vm_area(tsk)) {
+		unsigned long flags;
+		int i;
+
+		local_irq_save(flags);
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			if (this_cpu_read(cached_stacks[i]))
+				continue;
+
+			this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
+			local_irq_restore(flags);
+			return;
+		}
+		local_irq_restore(flags);
+
 		vfree(tsk->stack);
-	else
-		__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
+		return;
+	}
+#endif
+
+	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
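
Illustrative note (not part of the commit): the caching scheme in the
patch can be tried outside the kernel with a small userspace model. The
sketch below is only an approximation under stated assumptions. It uses
a thread-local array in place of the per-CPU cached_stacks[], plain
malloc()/free() in place of __vmalloc_node_range()/vfree(), and an
arbitrary STACK_SIZE in place of THREAD_SIZE. The names stack_alloc(),
stack_free(), slow_allocs and slow_frees are hypothetical and exist only
for this example. There is no counterpart to the local_irq_*() calls,
because nothing else touches the thread-local data here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same slot count as the patch; STACK_SIZE is an assumed placeholder. */
#define NR_CACHED_STACKS 2
#define STACK_SIZE (16 * 1024)

/* Thread-local array standing in for the kernel's per-CPU cache. */
static _Thread_local void *cached_stacks[NR_CACHED_STACKS];

/* Demonstration-only counters for calls that reach the slow allocator. */
static _Thread_local unsigned long slow_allocs;
static _Thread_local unsigned long slow_frees;

/* Take a cached stack if any slot is occupied, else allocate a new one. */
void *stack_alloc(void)
{
	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		void *s = cached_stacks[i];

		if (!s)
			continue;
		cached_stacks[i] = NULL;
		return s;
	}

	slow_allocs++;
	return malloc(STACK_SIZE);
}

/* Park the stack in the first empty slot, else really free it. */
void stack_free(void *stack)
{
	assert(stack != NULL);

	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		if (cached_stacks[i])
			continue;
		cached_stacks[i] = stack;
		return;
	}

	slow_frees++;
	free(stack);
}
```

In this model, a loop that repeatedly allocates and frees one stack
reaches malloc() only on its first iteration; every later iteration is
served from slot 0. Keeping the cache at two slots bounds how much
memory sits idle per thread (per CPU in the kernel) while still
absorbing short create/exit bursts.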