linux-mm.kvack.org archive mirror
* 4.12-rc ppc64 4k-page needs costly allocations
@ 2017-05-30 19:43 Hugh Dickins
  2017-05-31  6:46 ` Michael Ellerman
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Hugh Dickins @ 2017-05-30 19:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michael Ellerman, Christoph Lameter, linuxppc-dev, linux-mm

Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
I find that swapping loads on ppc64 on G5 with 4k pages are failing:

SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
  cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
  pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
  node 0: slabs: 209, objs: 209, free: 8
gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
Call Trace:
[c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
[c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
[c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
[c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
[c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
[c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
[c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
[c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
[c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
[c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
[c00000000090be30] [c00000000000a118] system_call+0x38/0xfc

I did try booting with slub_debug=O as the message suggested, but that
made no difference: it still hoped for but failed on order:4 allocations.

I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
it seemed to be a hard requirement for something, but I didn't find what.

I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
the expected order:3, which then results in OOM-killing rather than direct
allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
makes no real difference to the outcome: swapping loads still abort early.

Relying on order:3 or order:4 allocations is just too optimistic: ppc64
with 4k pages would do better not to expect to support a 128TB userspace.
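
(For the record, the arithmetic behind those orders, assuming 8-byte pgd
entries: H_PGD_INDEX_SIZE of 12 gives a level-1 table of
sizeof(pgd_t) << 12 = 32768 bytes, i.e. order:3 with 4k pages; the
order:4 above comes from the 65536-byte buffer SLUB chose for that 32KB
object.)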

I tried the obvious partial revert below, but it's not good enough:
the system did not boot beyond

Starting init: /sbin/init exists but couldn't execute it (error -7)
Starting init: /bin/sh exists but couldn't execute it (error -7)
Kernel panic - not syncing: No working init found. ...

--- 4.12-rc2/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ linux/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
--- 4.12-rc2/arch/powerpc/include/asm/processor.h
+++ linux/arch/powerpc/include/asm/processor.h
@@ -110,7 +110,7 @@ void release_thread(struct task_struct *
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-30 19:43 4.12-rc ppc64 4k-page needs costly allocations Hugh Dickins
@ 2017-05-31  6:46 ` Michael Ellerman
  2017-05-31 14:09   ` Christoph Lameter
  2017-05-31 14:06 ` Christoph Lameter
  2017-06-01  4:19 ` Aneesh Kumar K.V
  2 siblings, 1 reply; 18+ messages in thread
From: Michael Ellerman @ 2017-05-31  6:46 UTC (permalink / raw)
  To: Hugh Dickins, Aneesh Kumar K.V; +Cc: Christoph Lameter, linuxppc-dev, linux-mm

Hugh Dickins <hughd@google.com> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Ouch, sorry.

I boot test a G5 with 4K pages, but I don't stress test it much so I
didn't notice this.

I think making 128TB depend on 64K pages makes sense; Aneesh is going to
try and do a patch for that.

cheers


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-30 19:43 4.12-rc ppc64 4k-page needs costly allocations Hugh Dickins
  2017-05-31  6:46 ` Michael Ellerman
@ 2017-05-31 14:06 ` Christoph Lameter
  2017-06-01  4:19 ` Aneesh Kumar K.V
  2 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2017-05-31 14:06 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Aneesh Kumar K.V, Michael Ellerman, linuxppc-dev, linux-mm

On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-31  6:46 ` Michael Ellerman
@ 2017-05-31 14:09   ` Christoph Lameter
  2017-05-31 18:44     ` Hugh Dickins
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2017-05-31 14:09 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Hugh Dickins, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh, OK: debugging increased the object size to an order-4 allocation.
This should be order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-31 14:09   ` Christoph Lameter
@ 2017-05-31 18:44     ` Hugh Dickins
  2017-05-31 19:02       ` Mathieu Malaterre
  2017-06-01 15:31       ` Christoph Lameter
  0 siblings, 2 replies; 18+ messages in thread
From: Hugh Dickins @ 2017-05-31 18:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michael Ellerman, Hugh Dickins, Aneesh Kumar K.V, linuxppc-dev, linux-mm

[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
> 
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
> 
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail.  I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented.  What more output are you looking for?

> 
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
> 
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> 
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
> 
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
> 
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too?  If so, then we ought to dig into it further.

> 
> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.
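
As a sketch of the pattern only (simplified and with illustrative names,
not the actual powerpc code):

	static struct kmem_cache *pgd_cache;

	/* ctor runs as new slabs are populated; users hand objects back
	 * zeroed, so a cache object is always a valid empty table */
	static void pgd_ctor(void *addr)
	{
		memset(addr, 0, PGD_TABLE_SIZE);
	}

	/* one cache per table geometry, whichever base page size is in
	 * use; align == size keeps each table naturally aligned */
	pgd_cache = kmem_cache_create("pgtable-2^12", PGD_TABLE_SIZE,
				      PGD_TABLE_SIZE, 0, pgd_ctor);

	pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);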

> 
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
> 
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-31 18:44     ` Hugh Dickins
@ 2017-05-31 19:02       ` Mathieu Malaterre
  2017-06-01 15:31       ` Christoph Lameter
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Malaterre @ 2017-05-31 19:02 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, linuxppc-dev, Aneesh Kumar K.V, linux-mm

On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins <hughd@google.com> wrote:
> [ Merging two mails into one response ]
>
> On Wed, 31 May 2017, Christoph Lameter wrote:
>> On Tue, 30 May 2017, Hugh Dickins wrote:
>> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>>
>> > I did try booting with slub_debug=O as the message suggested, but that
>> > made no difference: it still hoped for but failed on order:4 allocations.
>>
>> I am curious as to what is going on there. Do you have the output from
>> these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?
>
>>
>> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
>> > it seemed to be a hard requirement for something, but I didn't find what.
>>
>> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
>> be able to enable it at runtime.
>
> Yes, I thought so.
>
>>
>> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
>> > the expected order:3, which then results in OOM-killing rather than direct
>> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
>> > makes no real difference to the outcome: swapping loads still abort early.
>>
>> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>>
>> Ahh. Ok debugging increased the object size to an order 4. This should be
>> order 3 without debugging.
>
> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.
>
>>
>> Why are the slab allocators used to create slab caches for large object
>> sizes?
>
> There may be more optimal ways to allocate, but I expect that when
> the ppc guys are writing the code to handle both 4k and 64k page sizes,
> kmem caches offer the best span of possibility without complication.
>
>>
>> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
>> > with 4k pages would do better not to expect to support a 128TB userspace.
>>
>> I thought you had these huge 64k page sizes?
>
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:

https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-30 19:43 4.12-rc ppc64 4k-page needs costly allocations Hugh Dickins
  2017-05-31  6:46 ` Michael Ellerman
  2017-05-31 14:06 ` Christoph Lameter
@ 2017-06-01  4:19 ` Aneesh Kumar K.V
  2017-06-01 16:57   ` Hugh Dickins
  2 siblings, 1 reply; 18+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-01  4:19 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Michael Ellerman, Christoph Lameter, linuxppc-dev, linux-mm

Hugh Dickins <hughd@google.com> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...
>

Can you try this patch.

commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Thu Jun 1 08:06:40 2017 +0530

    powerpc/mm/4k: Limit 4k page size to 64TB
    
    Supporting 512TB requires us to do an order-3 allocation for the level-1
    page table (pgd). Limit 4k page size to 64TB for now.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b671ca..0c4e470571ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f291ab0..5de3271026f1 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
-#define TASK_SIZE_USER64	TASK_SIZE_512TB
+#define TASK_SIZE_USER64		TASK_SIZE_512TB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
 #else
-#define TASK_SIZE_USER64	TASK_SIZE_64TB
+#define TASK_SIZE_USER64		TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
 #endif
 
 /*
@@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
-#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
+#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
 
 #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
 		TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
@@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
  * with 128TB and conditionally enable upto 512TB
  */
 #ifdef CONFIG_PPC_BOOK3S_64
-#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ? \
-				 TASK_SIZE_USER32 : TASK_SIZE_128TB)
+#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ?			\
+				 TASK_SIZE_USER32 : DEFAULT_MAP_WINDOW_USER64)
 #else
 #define DEFAULT_MAP_WINDOW	TASK_SIZE
 #endif
@@ -153,7 +155,7 @@ void release_thread(struct task_struct *);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /* Limit stack to 128TB */
-#define STACK_TOP_USER64 TASK_SIZE_128TB
+#define STACK_TOP_USER64 DEFAULT_MAP_WINDOW_USER64
 #else
 #define STACK_TOP_USER64 TASK_SIZE_USER64
 #endif
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 8389ff5ac002..77062461c469 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -921,7 +921,7 @@ void __init setup_arch(char **cmdline_p)
 
 #ifdef CONFIG_PPC_MM_SLICES
 #ifdef CONFIG_PPC64
-	init_mm.context.addr_limit = TASK_SIZE_128TB;
+	init_mm.context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
 #else
 #error	"context.addr_limit not initialized."
 #endif
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2ae78ef..a3edf813d455 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -99,7 +99,7 @@ static int hash__init_new_context(struct mm_struct *mm)
 	 * mm->context.addr_limit. Default to max task size so that we copy the
 	 * default values to paca which will help us to handle slb miss early.
 	 */
-	mm->context.addr_limit = TASK_SIZE_128TB;
+	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
 
 	/*
 	 * The old code would re-promote on fork, we don't do that when using
 


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-05-31 18:44     ` Hugh Dickins
  2017-05-31 19:02       ` Mathieu Malaterre
@ 2017-06-01 15:31       ` Christoph Lameter
  2017-06-01 17:22         ` Hugh Dickins
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2017-06-01 15:31 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm



> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?

The output for the failing allocations when you disable debugging. For
that I would think that you need to remove(!) the slub_debug statement from
the kernel command line. You can verify that debug is off by inspecting the
values in /sys/kernel/slab/<yourcache>/<debug option>

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging
(frankly I am not sure what it does). Please do not specify any debug options.


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01  4:19 ` Aneesh Kumar K.V
@ 2017-06-01 16:57   ` Hugh Dickins
  0 siblings, 0 replies; 18+ messages in thread
From: Hugh Dickins @ 2017-06-01 16:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Hugh Dickins, Michael Ellerman, Christoph Lameter, linuxppc-dev,
	linux-mm

On Thu, 1 Jun 2017, Aneesh Kumar K.V wrote:
> Hugh Dickins <hughd@google.com> writes:
> 
> > Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> > I find that swapping loads on ppc64 on G5 with 4k pages are failing:
> >
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
> >   node 0: slabs: 209, objs: 209, free: 8
> > gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> > CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> > Call Trace:
> > [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> > [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> > [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> > [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> > [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> > [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> > [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> > [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> > [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> > [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> > [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
> >
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
> >
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
> >
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
> >
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
> >
> > I tried the obvious partial revert below, but it's not good enough:
> > the system did not boot beyond
> >
> > Starting init: /sbin/init exists but couldn't execute it (error -7)
> > Starting init: /bin/sh exists but couldn't execute it (error -7)
> > Kernel panic - not syncing: No working init found. ...
> >
> 
> Can you try this patch.

Thanks!  By the time I got to try it, you'd sent another later in the
day.  Fractionally different, and I didn't spend any time working out
whether the difference was significant or cosmetic: I just tried that
second one instead.  No problems with it so far: it hasn't been running
long, but long enough to say that it definitely fixes the problems
I was getting - thank you.

Hugh

> 
> commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Thu Jun 1 08:06:40 2017 +0530
> 
>     powerpc/mm/4k: Limit 4k page size to 64TB
>     
>     Supporting 512TB requires us to do a order 3 allocation for level 1 page
>     table(pgd). Limit 4k to 64TB for now.
>     
>     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b671ca..0c4e470571ca 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -8,7 +8,7 @@
>  #define H_PTE_INDEX_SIZE  9
>  #define H_PMD_INDEX_SIZE  7
>  #define H_PUD_INDEX_SIZE  9
> -#define H_PGD_INDEX_SIZE  12
> +#define H_PGD_INDEX_SIZE  9
>  
>  #ifndef __ASSEMBLY__
>  #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index a2123f291ab0..5de3271026f1 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
>  #define TASK_SIZE_128TB (0x0000800000000000UL)
>  #define TASK_SIZE_512TB (0x0002000000000000UL)
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
>  /*
>   * Max value currently used:
>   */
> -#define TASK_SIZE_USER64	TASK_SIZE_512TB
> +#define TASK_SIZE_USER64		TASK_SIZE_512TB
> +#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
>  #else
> -#define TASK_SIZE_USER64	TASK_SIZE_64TB
> +#define TASK_SIZE_USER64		TASK_SIZE_64TB
> +#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
>  #endif
>  
>  /*
> @@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
>   * space during mmap's.
>   */
>  #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
> -#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
> +#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
>  
>  #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
>  		TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
> @@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
>   * with 128TB and conditionally enable upto 512TB
>   */
>  #ifdef CONFIG_PPC_BOOK3S_64
> -#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ? \
> -				 TASK_SIZE_USER32 : TASK_SIZE_128TB)
> +#define DEFAULT_MAP_WINDOW	((is_32bit_task()) ?			\
> +				 TASK_SIZE_USER32 : DEFAULT_MAP_WINDOW_USER64)
>  #else
>  #define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #endif
> @@ -153,7 +155,7 @@ void release_thread(struct task_struct *);
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
>  /* Limit stack to 128TB */
> -#define STACK_TOP_USER64 TASK_SIZE_128TB
> +#define STACK_TOP_USER64 DEFAULT_MAP_WINDOW_USER64
>  #else
>  #define STACK_TOP_USER64 TASK_SIZE_USER64
>  #endif
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index 8389ff5ac002..77062461c469 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -921,7 +921,7 @@ void __init setup_arch(char **cmdline_p)
>  
>  #ifdef CONFIG_PPC_MM_SLICES
>  #ifdef CONFIG_PPC64
> -	init_mm.context.addr_limit = TASK_SIZE_128TB;
> +	init_mm.context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
>  #else
>  #error	"context.addr_limit not initialized."
>  #endif
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index c6dca2ae78ef..a3edf813d455 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -99,7 +99,7 @@ static int hash__init_new_context(struct mm_struct *mm)
>  	 * mm->context.addr_limit. Default to max task size so that we copy the
>  	 * default values to paca which will help us to handle slb miss early.
>  	 */
> -	mm->context.addr_limit = TASK_SIZE_128TB;
> +	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
>  
>  	/*
>  	 * The old code would re-promote on fork, we don't do that when using
>  
> 
> 


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01 15:31       ` Christoph Lameter
@ 2017-06-01 17:22         ` Hugh Dickins
  2017-06-01 18:16           ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2017-06-01 17:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Thu, 1 Jun 2017, Christoph Lameter wrote:
> 
> > > I am curious as to what is going on there. Do you have the output from
> > > these failed allocations?
> >
> > I thought the relevant output was in my mail.  I did skip the Mem-Info
> > dump, since that just seemed noise in this case: we know memory can get
> > fragmented.  What more output are you looking for?
> 
> The output for the failing allocations when you disabling debugging. For
> that I would think that you need remove(!) the slub_debug statement on the kernel
> command line. You can verify that debug is off by inspecting the values in
> /sys/kernel/slab/<yourcache>/<debug option>

The output was with debugging disabled.  Except when I tried adding that
slub_debug=O on the kernel command line, as the warning suggested, I did
not have any slub_debug statement on the command line; and did not have
CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are

CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLABINFO=y
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_SLUB_STATS=y

> 
> > But it was still order 4 when booted with slub_debug=O, which surprised me.
> > And that surprises you too?  If so, then we ought to dig into it further.
> 
> No it does no longer. I dont think slub_debug=O does disable debugging
> (frankly I am not sure what it does). Please do not specify any debug options.

But I think you are now surprised, when I say no slub_debug options
were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
(before I tried the new kernel with Aneesh's fix patch)
in case they tell you anything...

pgtable-2^12/aliases:0
pgtable-2^12/align:32768
grep: pgtable-2^12/alloc_calls: Function not implemented
pgtable-2^12/alloc_fastpath:5847 C0=1587 C1=1449 C2=1392 C3=1419
pgtable-2^12/alloc_from_partial:12637 C0=3292 C1=3020 C2=3051 C3=3274
pgtable-2^12/alloc_node_mismatch:0
pgtable-2^12/alloc_refill:41038 C0=10600 C1=10025 C2=10191 C3=10222
pgtable-2^12/alloc_slab:517 C0=148 C1=110 C2=105 C3=154
pgtable-2^12/alloc_slowpath:54203 C0=14041 C1=13157 C2=13349 C3=13656
pgtable-2^12/cache_dma:0
pgtable-2^12/cmpxchg_double_cpu_fail:0
pgtable-2^12/cmpxchg_double_fail:0
pgtable-2^12/cpu_partial:2
pgtable-2^12/cpu_partial_alloc:25894 C0=6719 C1=6334 C2=6288 C3=6553
pgtable-2^12/cpu_partial_drain:8441 C0=2035 C1=2211 C2=2268 C3=1927
pgtable-2^12/cpu_partial_free:38987 C0=9642 C1=10042 C2=10132 C3=9171
pgtable-2^12/cpu_partial_node:12237 C0=3183 C1=2928 C2=2961 C3=3165
pgtable-2^12/cpu_slabs:11
pgtable-2^12/cpuslab_flush:17 C0=5 C2=4 C3=8
pgtable-2^12/ctor:pgd_ctor+0x0/0x18
pgtable-2^12/deactivate_bypass:39027 C0=10153 C1=9463 C2=9439 C3=9972
pgtable-2^12/deactivate_empty:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/deactivate_full:16 C0=5 C2=3 C3=8
pgtable-2^12/deactivate_remote_frees:0
pgtable-2^12/deactivate_to_head:1 C2=1
pgtable-2^12/deactivate_to_tail:0
pgtable-2^12/destroy_by_rcu:0
pgtable-2^12/free_add_partial:24877 C0=6007 C1=6515 C2=6681 C3=5674
grep: pgtable-2^12/free_calls: Function not implemented
pgtable-2^12/free_fastpath:5849 C0=1587 C1=1449 C2=1394 C3=1419
pgtable-2^12/free_frozen:15145 C0=3989 C1=3701 C2=3683 C3=3772
pgtable-2^12/free_remove_partial:0
pgtable-2^12/free_slab:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/free_slowpath:54132 C0=13631 C1=13743 C2=13815 C3=12943
pgtable-2^12/hwcache_align:0
pgtable-2^12/min_partial:8
pgtable-2^12/object_size:32768
pgtable-2^12/objects:67
pgtable-2^12/objects_partial:0
pgtable-2^12/objs_per_slab:1
pgtable-2^12/order:4
pgtable-2^12/order_fallback:13 C0=2 C1=1 C2=5 C3=5
pgtable-2^12/partial:4
pgtable-2^12/poison:0
pgtable-2^12/reclaim_account:0
pgtable-2^12/red_zone:0
pgtable-2^12/reserved:0
pgtable-2^12/sanity_checks:0
pgtable-2^12/slab_size:65536
pgtable-2^12/slabs:71
pgtable-2^12/slabs_cpu_partial:7(7) C0=1(1) C1=3(3) C2=1(1) C3=2(2)
pgtable-2^12/store_user:0
pgtable-2^12/total_objects:71
pgtable-2^12/trace:0

Hugh


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01 17:22         ` Hugh Dickins
@ 2017-06-01 18:16           ` Christoph Lameter
  2017-06-01 18:37             ` Hugh Dickins
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2017-06-01 18:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok so debugging was off but the slab cache has a ctor callback which
mandates that the free pointer cannot use the free object space when
the object is not in use. Thus the size of the object must be increased to
accommodate the free pointer.
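
Roughly, with the numbers from your sysfs dump above: object_size and
align are both 32768, the free pointer has to sit after the object, and
32768 + sizeof(void *) rounded up to the 32768 alignment gives the
slab_size of 65536 - hence min order 4 with 4k pages, debugging or not.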


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01 18:16           ` Christoph Lameter
@ 2017-06-01 18:37             ` Hugh Dickins
  2017-06-02  3:09               ` Michael Ellerman
  2017-06-02 14:32               ` Christoph Lameter
  0 siblings, 2 replies; 18+ messages in thread
From: Hugh Dickins @ 2017-06-01 18:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Thu, 1 Jun 2017, Christoph Lameter wrote:
> 
> Ok so debugging was off but the slab cache has a ctor callback which
> mandates that the free pointer cannot use the free object space when
> the object is not in use. Thus the size of the object must be increased to
> accomodate the freepointer.

Thanks a lot for working that out.  Makes sense, fully understood now,
nothing to worry about (though makes one wonder whether it's efficient
to use ctors on high-alignment caches; or whether an internal "zero-me"
ctor would be useful).

Hugh


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01 18:37             ` Hugh Dickins
@ 2017-06-02  3:09               ` Michael Ellerman
  2017-06-02  4:00                 ` Hugh Dickins
  2017-06-02 14:32               ` Christoph Lameter
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Ellerman @ 2017-06-02  3:09 UTC (permalink / raw)
  To: Hugh Dickins, Christoph Lameter; +Cc: Aneesh Kumar K.V, linuxppc-dev, linux-mm

Hugh Dickins <hughd@google.com> writes:

> On Thu, 1 Jun 2017, Christoph Lameter wrote:
>> 
>> Ok so debugging was off but the slab cache has a ctor callback which
>> mandates that the free pointer cannot use the free object space when
>> the object is not in use. Thus the size of the object must be increased to
>> accomodate the freepointer.
>
> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Or should we just be using kmem_cache_zalloc() when we allocate from
those slabs?

Given all the ctors do is memset to 0.

cheers


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-02  3:09               ` Michael Ellerman
@ 2017-06-02  4:00                 ` Hugh Dickins
  2017-06-02 14:33                   ` Christoph Lameter
  2017-06-08  5:44                   ` Michael Ellerman
  0 siblings, 2 replies; 18+ messages in thread
From: Hugh Dickins @ 2017-06-02  4:00 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Hugh Dickins, Christoph Lameter, Aneesh Kumar K.V, linuxppc-dev,
	linux-mm

On Fri, 2 Jun 2017, Michael Ellerman wrote:
> Hugh Dickins <hughd@google.com> writes:
> > On Thu, 1 Jun 2017, Christoph Lameter wrote:
> >> 
> >> Ok so debugging was off but the slab cache has a ctor callback which
> >> mandates that the free pointer cannot use the free object space when
> >> the object is not in use. Thus the size of the object must be increased to
> >> accomodate the freepointer.
> >
> > Thanks a lot for working that out.  Makes sense, fully understood now,
> > nothing to worry about (though makes one wonder whether it's efficient
> > to use ctors on high-alignment caches; or whether an internal "zero-me"
> > ctor would be useful).
> 
> Or should we just be using kmem_cache_zalloc() when we allocate from
> those slabs?
> 
> Given all the ctor's do is memset to 0.

I'm not sure.  From a memory-utilization point of view, with SLUB,
using kmem_cache_zalloc() there would certainly be better.

But you may be forgetting that the constructor is applied only when a
new slab of objects is allocated, not each time an object is allocated
from that slab (and the user of those objects agrees to free objects
back to the cache in a reusable state: zeroed in this case).

So from a cpu-utilization point of view, it's better to use the ctor:
it's saving you lots of redundant memsets.
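
Roughly the two contracts being compared, as a sketch (illustrative
names, not the actual code):

	/* ctor style: the object was zeroed when its slab was populated,
	 * and callers free it back zeroed, so allocation is just a pop -
	 * but SLUB must keep its free pointer outside the object */
	pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);

	/* zalloc style: no ctor, so SLUB may keep the free pointer inside
	 * the unused object (smaller buffer), but every allocation pays
	 * for a full memset */
	pgd = kmem_cache_zalloc(pgd_cache, GFP_KERNEL);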

SLUB versus SLAB, cpu versus memory?  Since someone has taken the
trouble to write it with ctors in the past, I didn't feel on firm
enough ground to recommend such a change.  But it may be obvious
to someone else that your suggestion would be better (or worse).

Hugh


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-01 18:37             ` Hugh Dickins
  2017-06-02  3:09               ` Michael Ellerman
@ 2017-06-02 14:32               ` Christoph Lameter
  2017-06-08  5:52                 ` Michael Ellerman
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2017-06-02 14:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it. And here is another example of using slab
allocations for page frames. Why not use the page allocator for this? The
page allocator is there for allocating page frames; the slab allocator's
main purpose is to allocate small objects....



* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-02  4:00                 ` Hugh Dickins
@ 2017-06-02 14:33                   ` Christoph Lameter
  2017-06-08  5:44                   ` Michael Ellerman
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2017-06-02 14:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Michael Ellerman, Aneesh Kumar K.V, linuxppc-dev, linux-mm

On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory?  Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change.  But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm, how about using alloc_pages() for page frames?


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-02  4:00                 ` Hugh Dickins
  2017-06-02 14:33                   ` Christoph Lameter
@ 2017-06-08  5:44                   ` Michael Ellerman
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Ellerman @ 2017-06-08  5:44 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, Aneesh Kumar K.V, linuxppc-dev, linux-mm

Hugh Dickins <hughd@google.com> writes:
> On Fri, 2 Jun 2017, Michael Ellerman wrote:
>> Hugh Dickins <hughd@google.com> writes:
>> > On Thu, 1 Jun 2017, Christoph Lameter wrote:
>> >> 
>> >> Ok so debugging was off but the slab cache has a ctor callback which
>> >> mandates that the free pointer cannot use the free object space when
>> >> the object is not in use. Thus the size of the object must be increased to
>> >> accomodate the freepointer.
>> >
>> > Thanks a lot for working that out.  Makes sense, fully understood now,
>> > nothing to worry about (though makes one wonder whether it's efficient
>> > to use ctors on high-alignment caches; or whether an internal "zero-me"
>> > ctor would be useful).
>> 
>> Or should we just be using kmem_cache_zalloc() when we allocate from
>> those slabs?
>> 
>> Given all the ctor's do is memset to 0.
>
> I'm not sure.  From a memory-utilization point of view, with SLUB,
> using kmem_cache_zalloc() there would certainly be better.
>
> But you may be forgetting that the constructor is applied only when a
> new slab of objects is allocated, not each time an object is allocated
> from that slab (and the user of those objects agrees to free objects
> back to the cache in a reusable state: zeroed in this case).

Ah yes, I was "forgetting" that :) - i.e. I didn't know it.

> So from a cpu-utilization point of view, it's better to use the ctor:
> it's saving you lots of redundant memsets.

OK. Presumably we guarantee (somewhere) that the page tables are zeroed
before we free them, which is a natural result of tearing down all
mappings?

But then I see other arches (x86, arm64 at least), which don't use a
constructor, and use __GFP_ZERO (via PGALLOC_GFP) at allocation time.

eg. arm64:

	pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
				      SLAB_PANIC, NULL);
        ...
	return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);


So that's a bit puzzling.

cheers


* Re: 4.12-rc ppc64 4k-page needs costly allocations
  2017-06-02 14:32               ` Christoph Lameter
@ 2017-06-08  5:52                 ` Michael Ellerman
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Ellerman @ 2017-06-08  5:52 UTC (permalink / raw)
  To: Christoph Lameter, Hugh Dickins; +Cc: Aneesh Kumar K.V, linuxppc-dev, linux-mm

Christoph Lameter <cl@linux.com> writes:

> On Thu, 1 Jun 2017, Hugh Dickins wrote:
>
>> Thanks a lot for working that out.  Makes sense, fully understood now,
>> nothing to worry about (though makes one wonder whether it's efficient
>> to use ctors on high-alignment caches; or whether an internal "zero-me"
>> ctor would be useful).
>
> Use kzalloc to zero it.

But that's changing a per-slab-creation memset into a per-object-allocation
memset, isn't it?

> And here is another example of using slab allocations for page frames.
> Use the page allocator for this? The page allocator is there for
> allocating page frames. The slab allocator main purpose is to allocate
> small objects....

Well usually they are small (< PAGE_SIZE), because we have 64K pages.

But we could rework the code to use the page allocator on 4K configs.
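
Something like this, perhaps - just a sketch with illustrative names,
assuming the 32KB level-1 table discussed earlier in the thread:

	/* 4K base pages: the pgd is 8 << 12 = 32KB, so take an order-3
	 * block straight from the page allocator, zeroed up front */
	pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
					       get_order(PGD_TABLE_SIZE));
	...
	free_pages((unsigned long)pgd, get_order(PGD_TABLE_SIZE));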

cheers


end of thread

Thread overview: 18+ messages
2017-05-30 19:43 4.12-rc ppc64 4k-page needs costly allocations Hugh Dickins
2017-05-31  6:46 ` Michael Ellerman
2017-05-31 14:09   ` Christoph Lameter
2017-05-31 18:44     ` Hugh Dickins
2017-05-31 19:02       ` Mathieu Malaterre
2017-06-01 15:31       ` Christoph Lameter
2017-06-01 17:22         ` Hugh Dickins
2017-06-01 18:16           ` Christoph Lameter
2017-06-01 18:37             ` Hugh Dickins
2017-06-02  3:09               ` Michael Ellerman
2017-06-02  4:00                 ` Hugh Dickins
2017-06-02 14:33                   ` Christoph Lameter
2017-06-08  5:44                   ` Michael Ellerman
2017-06-02 14:32               ` Christoph Lameter
2017-06-08  5:52                 ` Michael Ellerman
2017-05-31 14:06 ` Christoph Lameter
2017-06-01  4:19 ` Aneesh Kumar K.V
2017-06-01 16:57   ` Hugh Dickins
