linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
@ 2015-05-24 16:02 Jungseok Lee
  2015-05-24 17:49 ` Arnd Bergmann
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-24 16:02 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: barami97, Catalin Marinas, Will Deacon, linux-kernel, linux-mm

Fork-routine sometimes fails to get a physically contiguous region for
thread_info on 4KB page system although free memory is enough. That is,
a physically contiguous region, which is currently 16KB, is not available
since system memory is fragmented.

This patch tries to solve the problem as allocating thread_info memory
from vmalloc space, not 1:1 mapping one. The downside is one additional
page allocation in case of vmalloc. However, vmalloc space is large enough,
around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
not a big tradeoff for fork-routine service.

Suggested-by: Sungjinn Chung <barami97@gmail.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 arch/arm64/Kconfig                   | 12 ++++++++++++
 arch/arm64/include/asm/thread_info.h |  9 +++++++++
 arch/arm64/kernel/process.c          |  7 +++++++
 3 files changed, 28 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 99930cf..93c236a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
 
+config ARCH_THREAD_INFO_ALLOCATOR
+	bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
+	depends on ARM64_4K_PAGES
+	default n
+	help
+	  This feature enables vmalloc based thread_info allocator. It
+	  prevents fork-routine from begin failed to obtain physically
+	  contiguour region due to memory fragmentation on low system
+	  memory platforms.
+
+	  If unsure, say N
+
 config HW_PERF_EVENTS
 	bool "Enable hardware performance counter support for perf events"
 	depends on PERF_EVENTS
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..e753e59 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -61,6 +61,15 @@ struct thread_info {
 #define init_thread_info	(init_thread_union.thread_info)
 #define init_stack		(init_thread_union.stack)
 
+#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
+#define alloc_thread_info_node(tsk, node)				\
+({									\
+	__vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START,	\
+			VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0,	\
+			NUMA_NO_NODE, __builtin_return_address(0));	\
+})
+#define free_thread_info(ti)	vfree(ti)
+#endif
 /*
  * how to get the current stack pointer from C
  */
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index c506bee..c4b6aae 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -238,6 +238,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
+struct page *arch_thread_info_to_page(struct thread_info *ti)
+{
+	return vmalloc_to_page(ti);
+}
+#endif
+
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-24 16:02 [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator Jungseok Lee
@ 2015-05-24 17:49 ` Arnd Bergmann
  2015-05-25 10:01   ` Jungseok Lee
  2015-05-25 14:40 ` Minchan Kim
  2015-05-26  2:52 ` yalin wang
  2 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2015-05-24 17:49 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Jungseok Lee, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm

On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.
> 
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,
> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.

vmalloc has a rather large runtime cost. I'd argue that failing to allocate
thread_info structures means something has gone very wrong.

Can you describe the scenario that leads to fragmentation this bad?

Could the stack size be reduced to 8KB perhaps?

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-24 17:49 ` Arnd Bergmann
@ 2015-05-25 10:01   ` Jungseok Lee
  2015-05-25 14:58     ` Minchan Kim
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-25 10:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm

On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
>> 
>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>> not a big tradeoff for fork-routine service.
> 
> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
> thread_info structures means something has gone very wrong.

That is why the feature is marked "N" by default.
I focused on fork-routine stability rather than performance.

Could you give me an idea how to evaluate performance degradation?
Running some benchmarks would be helpful, but I would like to try to
gather data based on meaningful methodology.

> Can you describe the scenario that leads to fragmentation this bad?

Android, but I could not describe an exact reproduction procedure step
by step since it's behaved and reproduced randomly. As reading the following
thread from mm mailing list, a similar symptom is observed on other systems. 

https://lkml.org/lkml/2015/4/28/59

Although I do not know the details of a system mentioned in the thread,
even order-2 page allocation is not smoothly operated due to fragmentation on
low memory system.

I think the point is *low memory system*. 64-bit kernel is usually a feasible
option when system memory is enough, but 64-bit kernel and low memory system
combo is not unusual in case of ARM64.

> Could the stack size be reduced to 8KB perhaps?

I guess probably not.

A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
The stack size is 16KB on x86_64. I am not sure whether all applications,
which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-24 16:02 [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator Jungseok Lee
  2015-05-24 17:49 ` Arnd Bergmann
@ 2015-05-25 14:40 ` Minchan Kim
  2015-05-26 11:29   ` Jungseok Lee
  2015-05-26  2:52 ` yalin wang
  2 siblings, 1 reply; 22+ messages in thread
From: Minchan Kim @ 2015-05-25 14:40 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	linux-kernel, linux-mm

Hello Jungseok,

On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.

Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
mm implementation. If you saw the order-2,3 high-order allocation fail
maybe your application received SIGKILL by someone. LMK?

> 
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,

The size you want to allocate is 16KB in here but additional 4K?
It increases 25% memory footprint, which is huge downside.

> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.
> 
> Suggested-by: Sungjinn Chung <barami97@gmail.com>
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  arch/arm64/Kconfig                   | 12 ++++++++++++
>  arch/arm64/include/asm/thread_info.h |  9 +++++++++
>  arch/arm64/kernel/process.c          |  7 +++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 99930cf..93c236a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
>  config HAVE_ARCH_PFN_VALID
>  	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>  
> +config ARCH_THREAD_INFO_ALLOCATOR
> +	bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
> +	depends on ARM64_4K_PAGES
> +	default n
> +	help
> +	  This feature enables vmalloc based thread_info allocator. It
> +	  prevents fork-routine from begin failed to obtain physically
> +	  contiguour region due to memory fragmentation on low system
> +	  memory platforms.
> +
> +	  If unsure, say N
> +
>  config HW_PERF_EVENTS
>  	bool "Enable hardware performance counter support for perf events"
>  	depends on PERF_EVENTS
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..e753e59 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -61,6 +61,15 @@ struct thread_info {
>  #define init_thread_info	(init_thread_union.thread_info)
>  #define init_stack		(init_thread_union.stack)
>  
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +#define alloc_thread_info_node(tsk, node)				\
> +({									\
> +	__vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START,	\
> +			VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0,	\
> +			NUMA_NO_NODE, __builtin_return_address(0));	\
> +})
> +#define free_thread_info(ti)	vfree(ti)
> +#endif
>  /*
>   * how to get the current stack pointer from C
>   */
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index c506bee..c4b6aae 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -238,6 +238,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +struct page *arch_thread_info_to_page(struct thread_info *ti)
> +{
> +	return vmalloc_to_page(ti);
> +}
> +#endif
> +
>  asmlinkage void ret_from_fork(void) asm("ret_from_fork");
>  
>  int copy_thread(unsigned long clone_flags, unsigned long stack_start,
> -- 
> 1.9.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 10:01   ` Jungseok Lee
@ 2015-05-25 14:58     ` Minchan Kim
  2015-05-26 12:10       ` Jungseok Lee
  2015-05-25 16:47     ` Catalin Marinas
  2015-05-25 21:20     ` Arnd Bergmann
  2 siblings, 1 reply; 22+ messages in thread
From: Minchan Kim @ 2015-05-25 14:58 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: Arnd Bergmann, linux-arm-kernel, Catalin Marinas, barami97,
	Will Deacon, linux-kernel, linux-mm

On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
> > On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
> >> Fork-routine sometimes fails to get a physically contiguous region for
> >> thread_info on 4KB page system although free memory is enough. That is,
> >> a physically contiguous region, which is currently 16KB, is not available
> >> since system memory is fragmented.
> >> 
> >> This patch tries to solve the problem as allocating thread_info memory
> >> from vmalloc space, not 1:1 mapping one. The downside is one additional
> >> page allocation in case of vmalloc. However, vmalloc space is large enough,
> >> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> >> not a big tradeoff for fork-routine service.
> > 
> > vmalloc has a rather large runtime cost. I'd argue that failing to allocate
> > thread_info structures means something has gone very wrong.
> 
> That is why the feature is marked "N" by default.
> I focused on fork-routine stability rather than performance.

If VM has trouble with order-2 allocation, your system would be
trouble soon although fork at the moment manages to be successful
because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
allocation is common in the kernel so VM should handle it smoothly.
If VM didn't, it means we should fix VM itself, not a specific
allocation site. Fork is one of victim by that.

> 
> Could you give me an idea how to evaluate performance degradation?
> Running some benchmarks would be helpful, but I would like to try to
> gather data based on meaningful methodology.
> 
> > Can you describe the scenario that leads to fragmentation this bad?
> 
> Android, but I could not describe an exact reproduction procedure step
> by step since it's behaved and reproduced randomly. As reading the following
> thread from mm mailing list, a similar symptom is observed on other systems. 
> 
> https://lkml.org/lkml/2015/4/28/59
> 
> Although I do not know the details of a system mentioned in the thread,
> even order-2 page allocation is not smoothly operated due to fragmentation on
> low memory system.

What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
which is more right approach to handle small high-order allocation problem.

> 
> I think the point is *low memory system*. 64-bit kernel is usually a feasible
> option when system memory is enough, but 64-bit kernel and low memory system
> combo is not unusual in case of ARM64.
> 
> > Could the stack size be reduced to 8KB perhaps?
> 
> I guess probably not.
> 
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> The stack size is 16KB on x86_64. I am not sure whether all applications,
> which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.
> 
> Best Regards
> Jungseok Lee
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 10:01   ` Jungseok Lee
  2015-05-25 14:58     ` Minchan Kim
@ 2015-05-25 16:47     ` Catalin Marinas
  2015-05-25 20:29       ` Arnd Bergmann
  2015-05-26 13:02       ` Jungseok Lee
  2015-05-25 21:20     ` Arnd Bergmann
  2 siblings, 2 replies; 22+ messages in thread
From: Catalin Marinas @ 2015-05-25 16:47 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: Arnd Bergmann, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, linux-arm-kernel

On 25 May 2015, at 13:01, Jungseok Lee <jungseoklee85@gmail.com> wrote:

>> Could the stack size be reduced to 8KB perhaps?
> 
> I guess probably not.
> 
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.

We could go back to 8KB stacks if we implement support for separate IRQ 
stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads 
and SP1 for IRQ handlers.

Catalin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 16:47     ` Catalin Marinas
@ 2015-05-25 20:29       ` Arnd Bergmann
  2015-05-25 22:36         ` Catalin Marinas
  2015-05-26 13:02       ` Jungseok Lee
  1 sibling, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2015-05-25 20:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jungseok Lee, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, linux-arm-kernel

On Monday 25 May 2015 19:47:15 Catalin Marinas wrote:
> On 25 May 2015, at 13:01, Jungseok Lee <jungseoklee85@gmail.com> wrote:
> 
> >> Could the stack size be reduced to 8KB perhaps?
> > 
> > I guess probably not.
> > 
> > A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> 
> We could go back to 8KB stacks if we implement support for separate IRQ 
> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads 
> and SP1 for IRQ handlers.

I think most architectures that see a lot of benchmarks have moved to
irqstacks at some point, that definitely sounds like a useful idea,
even if the implementation turns out to be a bit more tricky than
what you describe.

There are a lot of workloads that would benefit from having lower
per-thread memory cost.

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 10:01   ` Jungseok Lee
  2015-05-25 14:58     ` Minchan Kim
  2015-05-25 16:47     ` Catalin Marinas
@ 2015-05-25 21:20     ` Arnd Bergmann
  2 siblings, 0 replies; 22+ messages in thread
From: Arnd Bergmann @ 2015-05-25 21:20 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: linux-arm-kernel, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, Minchan Kim

On Monday 25 May 2015 19:01:33 Jungseok Lee wrote:
> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:

> > Could the stack size be reduced to 8KB perhaps?
> 
> I guess probably not.
> 
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> The stack size is 16KB on x86_64. I am not sure whether all applications,
> which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.

A more interesting explanation is probably the one that led to x86
adopting 16kb pages, see https://lwn.net/Articles/600644 , 
https://lwn.net/Articles/600649/ and http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6538b8ea88

Based on that, I have to agree that lowering the size back to 8kb seems
unrealistic at the moment, but at the same time, we need to actively
keep working on reducing the stack size. IRQ stacks are one important
point here, but in the trace that Minchan Kim has in the commit for x86-64
there is not even an interrupt.

I've looked at the code that is generated by aarch64-linux-gcc-4.8 now
and found that it does not try very hard to reduce the stack size,
the first function I fed it end up up using 48 bytes of stack where
half of that would suffice, so there is probably room for optimization
within the boundaries of the ABI, possibly more if we allow kernel
specific calling conventions as some other architectures have implemented
in the past.

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 20:29       ` Arnd Bergmann
@ 2015-05-25 22:36         ` Catalin Marinas
  2015-05-26  9:51           ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2015-05-25 22:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jungseok Lee, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, linux-arm-kernel

On 25 May 2015, at 23:29, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 25 May 2015 19:47:15 Catalin Marinas wrote:
>> On 25 May 2015, at 13:01, Jungseok Lee <jungseoklee85@gmail.com> wrote:
>>>> Could the stack size be reduced to 8KB perhaps?
>>> 
>>> I guess probably not.
>>> 
>>> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
>> 
>> We could go back to 8KB stacks if we implement support for separate IRQ 
>> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads 
>> and SP1 for IRQ handlers.
> 
> I think most architectures that see a lot of benchmarks have moved to
> irqstacks at some point, that definitely sounds like a useful idea,
> even if the implementation turns out to be a bit more tricky than
> what you describe.

Of course, it's more complicated than just setting up two stacks (but I'm away for a 
week and writing from a phone). We would need to deal with the initial per-CPU setup, 
rescheduling following an IRQ, CPU on following power management and maybe 
other issues. However, the architecture helps us a bit by allowing both SP0 and SP1 to be 
used at EL1. 

> There are a lot of workloads that would benefit from having lower
> per-thread memory cost.

If we keep the 16KB stack, is there any advantage in a separate IRQ one (assuming 
that we won't overflow 16KB)?

Catalin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-24 16:02 [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator Jungseok Lee
  2015-05-24 17:49 ` Arnd Bergmann
  2015-05-25 14:40 ` Minchan Kim
@ 2015-05-26  2:52 ` yalin wang
  2015-05-26 12:21   ` Jungseok Lee
  2 siblings, 1 reply; 22+ messages in thread
From: yalin wang @ 2015-05-26  2:52 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	Linux Kernel Mailing List, linux-mm

2015-05-25 0:02 GMT+08:00 Jungseok Lee <jungseoklee85@gmail.com>:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.
>
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,
> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.
>
> Suggested-by: Sungjinn Chung <barami97@gmail.com>
> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  arch/arm64/Kconfig                   | 12 ++++++++++++
>  arch/arm64/include/asm/thread_info.h |  9 +++++++++
>  arch/arm64/kernel/process.c          |  7 +++++++
>  3 files changed, 28 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 99930cf..93c236a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
>  config HAVE_ARCH_PFN_VALID
>         def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>
> +config ARCH_THREAD_INFO_ALLOCATOR
> +       bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
> +       depends on ARM64_4K_PAGES
> +       default n
> +       help
> +         This feature enables vmalloc based thread_info allocator. It
> +         prevents fork-routine from begin failed to obtain physically
> +         contiguour region due to memory fragmentation on low system
> +         memory platforms.
> +
> +         If unsure, say N
> +
>  config HW_PERF_EVENTS
>         bool "Enable hardware performance counter support for perf events"
>         depends on PERF_EVENTS
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..e753e59 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -61,6 +61,15 @@ struct thread_info {
>  #define init_thread_info       (init_thread_union.thread_info)
>  #define init_stack             (init_thread_union.stack)
>
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +#define alloc_thread_info_node(tsk, node)                              \
> +({                                                                     \
> +       __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START,   \
> +                       VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0,        \
> +                       NUMA_NO_NODE, __builtin_return_address(0));     \
> +})
why not add __GFP_HIGHMEM, if you decided to use vmalloc() alloc stack pages?

Thanks

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 22:36         ` Catalin Marinas
@ 2015-05-26  9:51           ` Arnd Bergmann
  0 siblings, 0 replies; 22+ messages in thread
From: Arnd Bergmann @ 2015-05-26  9:51 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jungseok Lee, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, linux-arm-kernel

On Tuesday 26 May 2015 01:36:29 Catalin Marinas wrote:
> 
> > There are a lot of workloads that would benefit from having lower
> > per-thread memory cost.
> 
> If we keep the 16KB stack, is there any advantage in a separate IRQ one (assuming 
> that we won't overflow 16KB)?

It makes possible errors more reproducible: we already know that we need over
8kb for normal stacks based on Minchan's findings, and the chance that an interrupt
happens at a time when the stack is the highest is very low, but that just makes
the bug much harder to find if you ever run into it.

If we overflow the stack (independent of its size) with a process stack by itself,
it will always happen in the same call chain, not a combination of a call chain
an a particularly bad interrupt.

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 14:40 ` Minchan Kim
@ 2015-05-26 11:29   ` Jungseok Lee
  2015-05-27  4:10     ` Minchan Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Jungseok Lee @ 2015-05-26 11:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	linux-kernel, linux-mm

On May 25, 2015, at 11:40 PM, Minchan Kim wrote:
> Hello Jungseok,

Hi, Minchan,

> On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
> 
> Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
> mm implementation. If you saw the order-2,3 high-order allocation fail
> maybe your application received SIGKILL by someone. LMK?

Exactly right. The allocation is failed via the following path.

if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
	goto nopage;

IMHO, a reclaim operation would be not needed in this context if memory is
allocated from vmalloc space. It means there is no need to traverse shrinker list. 

>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
> 
> The size you want to allocate is 16KB in here but additional 4K?
> It increases 25% memory footprint, which is huge downside.

I agree with the point, and most people who try to use vmalloc might know the number.
However, an interoperation on the number depends on a point of view.

Vmalloc is large enough and not fully utilized in case of ARM64.
With the considerations, there is a room to do math as follows.

4KB / 240GB = 1.5e-8 (4KB page + 3 level combo)

It would be not a huge downside if fork-routine is not damaged due to fragmentation.

However, this is one of reasons to add "RFC" prefix in the patch set. How is the
additional 4KB interpreted and considered?

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 14:58     ` Minchan Kim
@ 2015-05-26 12:10       ` Jungseok Lee
  2015-05-27  4:24         ` Minchan Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Jungseok Lee @ 2015-05-26 12:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Arnd Bergmann, linux-arm-kernel, Catalin Marinas, barami97,
	Will Deacon, linux-kernel, linux-mm

On May 25, 2015, at 11:58 PM, Minchan Kim wrote:
> On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
>> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
>>> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
>>>> Fork-routine sometimes fails to get a physically contiguous region for
>>>> thread_info on 4KB page system although free memory is enough. That is,
>>>> a physically contiguous region, which is currently 16KB, is not available
>>>> since system memory is fragmented.
>>>> 
>>>> This patch tries to solve the problem as allocating thread_info memory
>>>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>>>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>>>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>>>> not a big tradeoff for fork-routine service.
>>> 
>>> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
>>> thread_info structures means something has gone very wrong.
>> 
>> That is why the feature is marked "N" by default.
>> I focused on fork-routine stability rather than performance.
> 
> If VM has trouble with order-2 allocation, your system would be
> trouble soon although fork at the moment manages to be successful
> because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
> allocation is common in the kernel so VM should handle it smoothly.
> If VM didn't, it means we should fix VM itself, not a specific
> allocation site. Fork is one of victim by that.

A problem I observed is an user space, not a kernel side. As user applications
fail to create threads in order to distribute their jobs properly, they are getting
in trouble slowly and then gone.

Yes, fork is one of victim, but damages user applications seriously.
At this snapshot, free memory is enough.

>> Could you give me an idea how to evaluate performance degradation?
>> Running some benchmarks would be helpful, but I would like to try to
>> gather data based on meaningful methodology.
>> 
>>> Can you describe the scenario that leads to fragmentation this bad?
>> 
>> Android, but I could not describe an exact reproduction procedure step
>> by step since it's behaved and reproduced randomly. As reading the following
>> thread from mm mailing list, a similar symptom is observed on other systems. 
>> 
>> https://lkml.org/lkml/2015/4/28/59
>> 
>> Although I do not know the details of a system mentioned in the thread,
>> even order-2 page allocation is not smoothly operated due to fragmentation on
>> low memory system.
> 
> What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
> which is more right approach to handle small high-order allocation problem.

I totally agree with that point. One of the best ways is to figure out a generic
anti-fragmentation with VM system improvement. Reducing the stack size to 8KB is also
a really great approach. My intention is not to overlook them or figure out a workaround.

IMHO, vmalloc would be a different option in case of ARM64 on low memory systems since
*fork failure from fragmentation* is a nontrivial issue.

Do you think the patch set doesn't need to be considered?

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-26  2:52 ` yalin wang
@ 2015-05-26 12:21   ` Jungseok Lee
  0 siblings, 0 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-26 12:21 UTC (permalink / raw)
  To: yalin wang
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	Linux Kernel Mailing List, linux-mm

On May 26, 2015, at 11:52 AM, yalin wang wrote:
> 2015-05-25 0:02 GMT+08:00 Jungseok Lee <jungseoklee85@gmail.com>:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
>> 
>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>> not a big tradeoff for fork-routine service.
>> 
>> Suggested-by: Sungjinn Chung <barami97@gmail.com>
>> Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> ---
>> arch/arm64/Kconfig                   | 12 ++++++++++++
>> arch/arm64/include/asm/thread_info.h |  9 +++++++++
>> arch/arm64/kernel/process.c          |  7 +++++++
>> 3 files changed, 28 insertions(+)
>> 
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 99930cf..93c236a 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
>> config HAVE_ARCH_PFN_VALID
>>        def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>> 
>> +config ARCH_THREAD_INFO_ALLOCATOR
>> +       bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
>> +       depends on ARM64_4K_PAGES
>> +       default n
>> +       help
>> +         This feature enables vmalloc based thread_info allocator. It
>> +         prevents fork-routine from begin failed to obtain physically
>> +         contiguour region due to memory fragmentation on low system
>> +         memory platforms.
>> +
>> +         If unsure, say N
>> +
>> config HW_PERF_EVENTS
>>        bool "Enable hardware performance counter support for perf events"
>>        depends on PERF_EVENTS
>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>> index dcd06d1..e753e59 100644
>> --- a/arch/arm64/include/asm/thread_info.h
>> +++ b/arch/arm64/include/asm/thread_info.h
>> @@ -61,6 +61,15 @@ struct thread_info {
>> #define init_thread_info       (init_thread_union.thread_info)
>> #define init_stack             (init_thread_union.stack)
>> 
>> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
>> +#define alloc_thread_info_node(tsk, node)                              \
>> +({                                                                     \
>> +       __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START,   \
>> +                       VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0,        \
>> +                       NUMA_NO_NODE, __builtin_return_address(0));     \
>> +})
> why not add __GFP_HIGHMEM, if you decided to use vmalloc() alloc stack pages?

I do not add the flag since there is no high memory on a current ARM64 kernel. 

It would be helpful to review include/linux/gfp.h and the following code snippet
from arch/arm64/kernel/module.c.

void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE, __builtin_return_address(0));
}

Best Regards
Jungseok Lee


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-25 16:47     ` Catalin Marinas
  2015-05-25 20:29       ` Arnd Bergmann
@ 2015-05-26 13:02       ` Jungseok Lee
  1 sibling, 0 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-26 13:02 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Arnd Bergmann, Catalin Marinas, barami97, Will Deacon,
	linux-kernel, linux-mm, linux-arm-kernel

On May 26, 2015, at 1:47 AM, Catalin Marinas wrote:
> On 25 May 2015, at 13:01, Jungseok Lee <jungseoklee85@gmail.com> wrote:
> 
>>> Could the stack size be reduced to 8KB perhaps?
>> 
>> I guess probably not.
>> 
>> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> 
> We could go back to 8KB stacks if we implement support for separate IRQ 
> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads 
> and SP1 for IRQ handlers.

Definitely interesting.

It looks like there are two options based on discussion.
1) Reduce the stack size with separate IRQ stack scheme
2) Figure out a generic anti-fragmentation solution

Do I miss anything?

I am still not sure about the first scheme as reviewing Minchan's findings repeatedly,
but I agree that the item should be worked actively.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-26 11:29   ` Jungseok Lee
@ 2015-05-27  4:10     ` Minchan Kim
  2015-05-27  6:22       ` Sergey Senozhatsky
  2015-05-27 16:08       ` Jungseok Lee
  0 siblings, 2 replies; 22+ messages in thread
From: Minchan Kim @ 2015-05-27  4:10 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	linux-kernel, linux-mm

Hello Jungseok,

On Tue, May 26, 2015 at 08:29:59PM +0900, Jungseok Lee wrote:
> On May 25, 2015, at 11:40 PM, Minchan Kim wrote:
> > Hello Jungseok,
> 
> Hi, Minchan,
> 
> > On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
> >> Fork-routine sometimes fails to get a physically contiguous region for
> >> thread_info on 4KB page system although free memory is enough. That is,
> >> a physically contiguous region, which is currently 16KB, is not available
> >> since system memory is fragmented.
> > 
> > Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
> > mm implementation. If you saw the order-2,3 high-order allocation fail
> > maybe your application received SIGKILL by someone. LMK?
> 
> Exactly right. The allocation is failed via the following path.
> 
> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> 	goto nopage;
> 
> IMHO, a reclaim operation would be not needed in this context if memory is
> allocated from vmalloc space. It means there is no need to traverse shrinker list. 

For making fork successful with using vmalloc, it's bandaid.

> 
> >> This patch tries to solve the problem as allocating thread_info memory
> >> from vmalloc space, not 1:1 mapping one. The downside is one additional
> >> page allocation in case of vmalloc. However, vmalloc space is large enough,
> > 
> > The size you want to allocate is 16KB in here but additional 4K?
> > It increases 25% memory footprint, which is huge downside.
> 
> I agree with the point, and most people who try to use vmalloc might know the number.
> However, an interoperation on the number depends on a point of view.
> 
> Vmalloc is large enough and not fully utilized in case of ARM64.
> With the considerations, there is a room to do math as follows.
> 
> 4KB / 240GB = 1.5e-8 (4KB page + 3 level combo)
> 
> It would be not a huge downside if fork-routine is not damaged due to fragmentation.

Okay, address size point of view, it wouldn't be significant problem.
Then, let's see it performance as point of view.

If we use vmalloc, it needs additional data structure for vmalloc
management, several additional allocation request, page table hanlding
and TLB flush.

Normally, forking is very frequent operation so we shouldn't do
make it slow and memory consumption bigger if there isn't big reason.

> 
> However, this is one of reasons to add "RFC" prefix in the patch set. How is the
> additional 4KB interpreted and considered?
> 
> Best Regards
> Jungseok Lee

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-26 12:10       ` Jungseok Lee
@ 2015-05-27  4:24         ` Minchan Kim
  2015-05-27 16:00           ` Jungseok Lee
  0 siblings, 1 reply; 22+ messages in thread
From: Minchan Kim @ 2015-05-27  4:24 UTC (permalink / raw)
  To: Jungseok Lee
  Cc: Arnd Bergmann, linux-arm-kernel, Catalin Marinas, barami97,
	Will Deacon, linux-kernel, linux-mm

On Tue, May 26, 2015 at 09:10:11PM +0900, Jungseok Lee wrote:
> On May 25, 2015, at 11:58 PM, Minchan Kim wrote:
> > On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
> >> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
> >>> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
> >>>> Fork-routine sometimes fails to get a physically contiguous region for
> >>>> thread_info on 4KB page system although free memory is enough. That is,
> >>>> a physically contiguous region, which is currently 16KB, is not available
> >>>> since system memory is fragmented.
> >>>> 
> >>>> This patch tries to solve the problem as allocating thread_info memory
> >>>> from vmalloc space, not 1:1 mapping one. The downside is one additional
> >>>> page allocation in case of vmalloc. However, vmalloc space is large enough,
> >>>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> >>>> not a big tradeoff for fork-routine service.
> >>> 
> >>> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
> >>> thread_info structures means something has gone very wrong.
> >> 
> >> That is why the feature is marked "N" by default.
> >> I focused on fork-routine stability rather than performance.
> > 
> > If VM has trouble with order-2 allocation, your system would be
> > trouble soon although fork at the moment manages to be successful
> > because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
> > allocation is common in the kernel so VM should handle it smoothly.
> > If VM didn't, it means we should fix VM itself, not a specific
> > allocation site. Fork is one of victim by that.
> 
> A problem I observed is an user space, not a kernel side. As user applications
> fail to create threads in order to distribute their jobs properly, they are getting
> in trouble slowly and then gone.
> 
> Yes, fork is one of victim, but damages user applications seriously.
> At this snapshot, free memory is enough.

Yes, it's the one you found.

        *Free memory is enough but why forking was failed*

You should find the exact reason for it rather than papering over by
hiding forking fail.

1. Investigate how many of movable/unmovable page ratio at the moment
2. Investigate why compaction doesn't work
3. Investigate why reclaim couldn't make order-2 page


> 
> >> Could you give me an idea how to evaluate performance degradation?
> >> Running some benchmarks would be helpful, but I would like to try to
> >> gather data based on meaningful methodology.
> >> 
> >>> Can you describe the scenario that leads to fragmentation this bad?
> >> 
> >> Android, but I could not describe an exact reproduction procedure step
> >> by step since it's behaved and reproduced randomly. As reading the following
> >> thread from mm mailing list, a similar symptom is observed on other systems. 
> >> 
> >> https://lkml.org/lkml/2015/4/28/59
> >> 
> >> Although I do not know the details of a system mentioned in the thread,
> >> even order-2 page allocation is not smoothly operated due to fragmentation on
> >> low memory system.
> > 
> > What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
> > which is more right approach to handle small high-order allocation problem.
> 
> I totally agree with that point. One of the best ways is to figure out a generic
> anti-fragmentation with VM system improvement. Reducing the stack size to 8KB is also
> a really great approach. My intention is not to overlook them or figure out a workaround.
> 
> IMHO, vmalloc would be a different option in case of ARM64 on low memory systems since
> *fork failure from fragmentation* is a nontrivial issue.
> 
> Do you think the patch set doesn't need to be considered?

I don't know because the changelog doesn't have full description
about your problem. You just wrote "forking was failed so we want
to avoid that by vmalloc because forking is important".
It seems to me it is just bandaid.

What you should provide for description is

" Forking was failed although there were lots of free pages
  so I investigated it and found root causes in somewhere
  so this patch fixes the problem"

Thanks.


> 
> Best Regards
> Jungseok Lee

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-27  4:10     ` Minchan Kim
@ 2015-05-27  6:22       ` Sergey Senozhatsky
  2015-05-27  7:31         ` Arnd Bergmann
  2015-05-27 16:08       ` Jungseok Lee
  1 sibling, 1 reply; 22+ messages in thread
From: Sergey Senozhatsky @ 2015-05-27  6:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jungseok Lee, linux-arm-kernel, barami97, Catalin Marinas,
	Will Deacon, linux-kernel, linux-mm

On (05/27/15 13:10), Minchan Kim wrote:
> On Tue, May 26, 2015 at 08:29:59PM +0900, Jungseok Lee wrote:
> > On May 25, 2015, at 11:40 PM, Minchan Kim wrote:
> > > Hello Jungseok,
> > 
> > Hi, Minchan,
> > 
> > > On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
> > >> Fork-routine sometimes fails to get a physically contiguous region for
> > >> thread_info on 4KB page system although free memory is enough. That is,
> > >> a physically contiguous region, which is currently 16KB, is not available
> > >> since system memory is fragmented.
> > > 
> > > Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
> > > mm implementation. If you saw the order-2,3 high-order allocation fail
> > > maybe your application received SIGKILL by someone. LMK?
> > 
> > Exactly right. The allocation is failed via the following path.
> > 
> > if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > 	goto nopage;
> > 
> > IMHO, a reclaim operation would be not needed in this context if memory is
> > allocated from vmalloc space. It means there is no need to traverse shrinker list. 
> 
> For making fork successful with using vmalloc, it's bandaid.
> 
> > 
> > >> This patch tries to solve the problem as allocating thread_info memory
> > >> from vmalloc space, not 1:1 mapping one. The downside is one additional
> > >> page allocation in case of vmalloc. However, vmalloc space is large enough,
> > > 
> > > The size you want to allocate is 16KB in here but additional 4K?
> > > It increases 25% memory footprint, which is huge downside.
> > 
> > I agree with the point, and most people who try to use vmalloc might know the number.
> > However, an interoperation on the number depends on a point of view.
> > 
> > Vmalloc is large enough and not fully utilized in case of ARM64.
> > With the considerations, there is a room to do math as follows.
> > 
> > 4KB / 240GB = 1.5e-8 (4KB page + 3 level combo)
> > 
> > It would be not a huge downside if fork-routine is not damaged due to fragmentation.
> 
> Okay, address size point of view, it wouldn't be significant problem.
> Then, let's see it performance as point of view.
> 
> If we use vmalloc, it needs additional data structure for vmalloc
> management, several additional allocation request, page table hanlding
> and TLB flush.

plus a guard page. I don't see VM_NO_GUARD being passed.

	-ss

> 
> Normally, forking is very frequent operation so we shouldn't do
> make it slow and memory consumption bigger if there isn't big reason.
> 
> > 
> > However, this is one of reasons to add "RFC" prefix in the patch set. How is the
> > additional 4KB interpreted and considered?
> > 
> > Best Regards
> > Jungseok Lee
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-27  6:22       ` Sergey Senozhatsky
@ 2015-05-27  7:31         ` Arnd Bergmann
  2015-05-27 16:05           ` Jungseok Lee
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2015-05-27  7:31 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Sergey Senozhatsky, Minchan Kim, Jungseok Lee, Catalin Marinas,
	barami97, Will Deacon, linux-kernel, linux-mm

On Wednesday 27 May 2015 15:22:50 Sergey Senozhatsky wrote:
> On (05/27/15 13:10), Minchan Kim wrote:
> > On Tue, May 26, 2015 at 08:29:59PM +0900, Jungseok Lee wrote:
> > > 
> > > if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > >     goto nopage;
> > > 
> > > IMHO, a reclaim operation would be not needed in this context if memory is
> > > allocated from vmalloc space. It means there is no need to traverse shrinker list. 
> > 
> > For making fork successful with using vmalloc, it's bandaid.

Right.

> > > >> This patch tries to solve the problem as allocating thread_info memory
> > > >> from vmalloc space, not 1:1 mapping one. The downside is one additional
> > > >> page allocation in case of vmalloc. However, vmalloc space is large enough,
> > > > 
> > > > The size you want to allocate is 16KB in here but additional 4K?
> > > > It increases 25% memory footprint, which is huge downside.
> > > 
> > > I agree with the point, and most people who try to use vmalloc might know the number.
> > > However, an interoperation on the number depends on a point of view.
> > > 
> > > Vmalloc is large enough and not fully utilized in case of ARM64.
> > > With the considerations, there is a room to do math as follows.
> > > 
> > > 4KB / 240GB = 1.5e-8 (4KB page + 3 level combo)
> > > 
> > > It would be not a huge downside if fork-routine is not damaged due to fragmentation.
> > 
> > Okay, address size point of view, it wouldn't be significant problem.
> > Then, let's see it performance as point of view.
> > 
> > If we use vmalloc, it needs additional data structure for vmalloc
> > management, several additional allocation request, page table hanlding
> > and TLB flush.

One upside of it is that we could in theory make THREAD_SIZE 12KB or
20KB instead of 16KB if we wanted to, as vmalloc does not have the
power-of-two requirement.

The downsides of vmalloc that you list are probably much stronger.

Another one is that /proc/vmallocinfo would become completely unreadable
on systems with lots of threads.

Finally, accessing data in vmalloc memory requires 4KB TLBs, while the
linear mapping usually uses hugepages, so we get extra page table walks
in the kernel for accessing the kernel stack, or for any kernel code
that looks at the thread_info of another thread.

> plus a guard page. I don't see VM_NO_GUARD being passed.

That's only a virtual page, which is virtually free here, it does not
consume any real memory.

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-27  4:24         ` Minchan Kim
@ 2015-05-27 16:00           ` Jungseok Lee
  0 siblings, 0 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-27 16:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Arnd Bergmann, linux-arm-kernel, Catalin Marinas, barami97,
	Will Deacon, linux-kernel, linux-mm

On May 27, 2015, at 1:24 PM, Minchan Kim wrote:

Hi, Minchan,

> On Tue, May 26, 2015 at 09:10:11PM +0900, Jungseok Lee wrote:
>> On May 25, 2015, at 11:58 PM, Minchan Kim wrote:
>>> On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
>>>> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
>>>>> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
>>>>>> Fork-routine sometimes fails to get a physically contiguous region for
>>>>>> thread_info on 4KB page system although free memory is enough. That is,
>>>>>> a physically contiguous region, which is currently 16KB, is not available
>>>>>> since system memory is fragmented.
>>>>>> 
>>>>>> This patch tries to solve the problem as allocating thread_info memory
>>>>>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>>>>>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>>>>>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>>>>>> not a big tradeoff for fork-routine service.
>>>>> 
>>>>> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
>>>>> thread_info structures means something has gone very wrong.
>>>> 
>>>> That is why the feature is marked "N" by default.
>>>> I focused on fork-routine stability rather than performance.
>>> 
>>> If VM has trouble with order-2 allocation, your system would be
>>> trouble soon although fork at the moment manages to be successful
>>> because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
>>> allocation is common in the kernel so VM should handle it smoothly.
>>> If VM didn't, it means we should fix VM itself, not a specific
>>> allocation site. Fork is one of victim by that.
>> 
>> A problem I observed is an user space, not a kernel side. As user applications
>> fail to create threads in order to distribute their jobs properly, they are getting
>> in trouble slowly and then gone.
>> 
>> Yes, fork is one of victim, but damages user applications seriously.
>> At this snapshot, free memory is enough.
> 
> Yes, it's the one you found.
> 
>        *Free memory is enough but why forking was failed*
> 
> You should find the exact reason for it rather than papering over by
> hiding forking fail.
> 
> 1. Investigate how many of movable/unmovable page ratio at the moment
> 2. Investigate why compaction doesn't work
> 3. Investigate why reclaim couldn't make order-2 page
> 
> 
>> 
>>>> Could you give me an idea how to evaluate performance degradation?
>>>> Running some benchmarks would be helpful, but I would like to try to
>>>> gather data based on meaningful methodology.
>>>> 
>>>>> Can you describe the scenario that leads to fragmentation this bad?
>>>> 
>>>> Android, but I could not describe an exact reproduction procedure step
>>>> by step since it's behaved and reproduced randomly. As reading the following
>>>> thread from mm mailing list, a similar symptom is observed on other systems. 
>>>> 
>>>> https://lkml.org/lkml/2015/4/28/59
>>>> 
>>>> Although I do not know the details of a system mentioned in the thread,
>>>> even order-2 page allocation is not smoothly operated due to fragmentation on
>>>> low memory system.
>>> 
>>> What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
>>> which is more right approach to handle small high-order allocation problem.
>> 
>> I totally agree with that point. One of the best ways is to figure out a generic
>> anti-fragmentation with VM system improvement. Reducing the stack size to 8KB is also
>> a really great approach. My intention is not to overlook them or figure out a workaround.
>> 
>> IMHO, vmalloc would be a different option in case of ARM64 on low memory systems since
>> *fork failure from fragmentation* is a nontrivial issue.
>> 
>> Do you think the patch set doesn't need to be considered?
> 
> I don't know because the changelog doesn't have full description
> about your problem. You just wrote "forking was failed so we want
> to avoid that by vmalloc because forking is important".

A technical feedback is always welcome.
I really thank everyone who leaves comments in this thread.

However, it is pretty disappointing that my commit log is distorted like that.

[Fork-routine sometimes fails to get a physically contiguous region for
thread_info on 4KB page system although free memory is enough. That is,
a physically contiguous region, which is currently 16KB, is not available
since system memory is fragmented.

This patch tries to solve the problem as allocating thread_info memory
from vmalloc space, not 1:1 mapping one. The downside is one additional
page allocation in case of vmalloc. However, vmalloc space is large enough,
around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
not a big tradeoff for fork-routine service.]

Is "forking was failed so we want to avoid that by vmalloc because forking is
important" your paraphrase of the above paragraphs?

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-27  7:31         ` Arnd Bergmann
@ 2015-05-27 16:05           ` Jungseok Lee
  0 siblings, 0 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-27 16:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Sergey Senozhatsky, Minchan Kim,
	Catalin Marinas, barami97, Will Deacon, linux-kernel, linux-mm

On May 27, 2015, at 4:31 PM, Arnd Bergmann wrote:
> On Wednesday 27 May 2015 15:22:50 Sergey Senozhatsky wrote:
>> On (05/27/15 13:10), Minchan Kim wrote:
>>> On Tue, May 26, 2015 at 08:29:59PM +0900, Jungseok Lee wrote:
>>>> 
>>>> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>>>>    goto nopage;
>>>> 
>>>> IMHO, a reclaim operation would be not needed in this context if memory is
>>>> allocated from vmalloc space. It means there is no need to traverse shrinker list. 
>>> 
>>> For making fork successful with using vmalloc, it's bandaid.
> 
> Right.

Thanks for a clear feedback!

It sounds like Catalin's idea should be considered seriously in ARM64 perspective.

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator
  2015-05-27  4:10     ` Minchan Kim
  2015-05-27  6:22       ` Sergey Senozhatsky
@ 2015-05-27 16:08       ` Jungseok Lee
  1 sibling, 0 replies; 22+ messages in thread
From: Jungseok Lee @ 2015-05-27 16:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-arm-kernel, barami97, Catalin Marinas, Will Deacon,
	linux-kernel, linux-mm

On May 27, 2015, at 1:10 PM, Minchan Kim wrote:
> Hello Jungseok,

Hi, Minchan,

> On Tue, May 26, 2015 at 08:29:59PM +0900, Jungseok Lee wrote:
>> On May 25, 2015, at 11:40 PM, Minchan Kim wrote:
>>> Hello Jungseok,
>> 
>> Hi, Minchan,
>> 
>>> On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
>>>> Fork-routine sometimes fails to get a physically contiguous region for
>>>> thread_info on 4KB page system although free memory is enough. That is,
>>>> a physically contiguous region, which is currently 16KB, is not available
>>>> since system memory is fragmented.
>>> 
>>> Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
>>> mm implementation. If you saw the order-2,3 high-order allocation fail
>>> maybe your application received SIGKILL by someone. LMK?
>> 
>> Exactly right. The allocation is failed via the following path.
>> 
>> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>> 	goto nopage;
>> 
>> IMHO, a reclaim operation would be not needed in this context if memory is
>> allocated from vmalloc space. It means there is no need to traverse shrinker list. 
> 
> For making fork successful with using vmalloc, it's bandaid.

Thanks for clarification!

Best Regards
Jungseok Lee

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-05-27 16:08 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-24 16:02 [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator Jungseok Lee
2015-05-24 17:49 ` Arnd Bergmann
2015-05-25 10:01   ` Jungseok Lee
2015-05-25 14:58     ` Minchan Kim
2015-05-26 12:10       ` Jungseok Lee
2015-05-27  4:24         ` Minchan Kim
2015-05-27 16:00           ` Jungseok Lee
2015-05-25 16:47     ` Catalin Marinas
2015-05-25 20:29       ` Arnd Bergmann
2015-05-25 22:36         ` Catalin Marinas
2015-05-26  9:51           ` Arnd Bergmann
2015-05-26 13:02       ` Jungseok Lee
2015-05-25 21:20     ` Arnd Bergmann
2015-05-25 14:40 ` Minchan Kim
2015-05-26 11:29   ` Jungseok Lee
2015-05-27  4:10     ` Minchan Kim
2015-05-27  6:22       ` Sergey Senozhatsky
2015-05-27  7:31         ` Arnd Bergmann
2015-05-27 16:05           ` Jungseok Lee
2015-05-27 16:08       ` Jungseok Lee
2015-05-26  2:52 ` yalin wang
2015-05-26 12:21   ` Jungseok Lee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).