* Bug in reclaim logic with exhausted nodes?
@ 2014-03-11 21:06 ` Nishanth Aravamudan
  0 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-11 21:06 UTC (permalink / raw)
  To: linux-mm; +Cc: anton, linuxppc-dev, mgorman, cl, rientjes

We have seen the following situation on a test system:

2-node system, each node has 32GB of memory.

2 gigantic (16GB) pages reserved at boot-time, both of which are
allocated from node 1.

SLUB notices this:

[    0.000000] SLUB: Unable to allocate memory from node 1
[    0.000000] SLUB: Allocating a useless per node structure in order to
be able to continue

After boot, the user then did:

echo 24 > /proc/sys/vm/nr_hugepages

And tasks are stuck:

[<c0000000010980b8>] kexec_stack+0xb8/0x8000
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f9334b0>] 0xc00000004f9334b0
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f91f440>] 0xc00000004f91f440
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
[<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

kswapd1 is also pegged at this point at 100% cpu.

If we instead go in and manually do:

echo 24 >
/sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages

targeting node 0 directly rather than relying on the interleaving
allocator behind the sysctl, then the allocation succeeds (and the echo
returns immediately).

I think we are hitting the following:

mm/hugetlb.c::alloc_fresh_huge_page_node():

        page = alloc_pages_exact_node(nid,
                htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
                                                __GFP_REPEAT|__GFP_NOWARN,
                huge_page_order(h));

include/linux/gfp.h:

#define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

and mm/page_alloc.c::__alloc_pages_slowpath():

        /*
         * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
         * __GFP_NOWARN set) should not cause reclaim since the subsystem
         * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
         * using a larger set of nodes after it has established that the
         * allowed per node queues are empty and that nodes are
         * over allocated.
         */
        if (IS_ENABLED(CONFIG_NUMA) &&
                        (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;

Because the hugetlb path sets __GFP_REPEAT rather than __GFP_NORETRY,
its mask does not match GFP_THISNODE exactly, the "goto nopage" shortcut
above is not taken, and so we *do* reclaim in this callpath. Under my
reading, since node 1 is exhausted, no matter how much work kswapd1
does, it will never reclaim memory from node 1 to satisfy a 16MB
hugepage allocation request (or any other, for that matter).
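
Spelling the mask comparison out (this just restates the definitions
quoted above):

        /* the mask passed in by alloc_fresh_huge_page_node(), as quoted: */
        gfp_t mask = htlb_alloc_mask(h) | __GFP_COMP | __GFP_THISNODE |
                        __GFP_REPEAT | __GFP_NOWARN;

        /*
         * GFP_THISNODE is __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY,
         * and the hugetlb mask never sets __GFP_NORETRY, so
         * (mask & GFP_THISNODE) != GFP_THISNODE and the "goto nopage"
         * above is not taken.
         */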

I see the following possible changes/fixes, but am unsure
a) whether my analysis is right, and
b) which option is best.

1) Since we do notice early in boot that (in this case) node 1 is
exhausted, perhaps we should mark it as such there somehow, so that when
a __GFP_THISNODE allocation request comes through for such a node, we
immediately fall through to nopage?
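
For (1), I am picturing something like the following early in
__alloc_pages_slowpath() (node_is_exhausted() is made up here; it would
be a per-node flag set wherever we detect the condition at boot):

        /*
         * Hypothetical sketch only: node_is_exhausted() does not exist
         * today.  The idea is that a node known to have no free (and
         * never-to-be-freed) memory immediately fails __GFP_THISNODE
         * requests instead of looping in reclaim.
         */
        if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) &&
            node_is_exhausted(zone_to_nid(preferred_zone)))
                goto nopage;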

2) There is the following check in mm/page_alloc.c::should_alloc_retry():
        /*
         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
         * specified, then we retry until we no longer reclaim any pages
         * (above), or we've reclaimed an order of pages at least as
         * large as the allocation's order. In both cases, if the
         * allocation still fails, we stop retrying.
         */
        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
                return 1;

I wonder if we should also check, when __GFP_THISNODE is set, that the
pages we are reclaiming actually come from the requested node, i.e.
roughly:

        if (gfp_mask & __GFP_THISNODE &&
            /* ...and the progress we have made is on the requested node? */)

3) did_some_progress could be updated to track where the progress is
occurring, and if we are in a __GFP_THISNODE allocation request and we
didn't make any progress on the requested node, we fail the allocation?
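
Very roughly, and assuming we had a way to report per-node reclaim
progress back up (the progress_on_this_node value below is
hypothetical), the retry check for (2)/(3) might become something like:

        /*
         * Hypothetical sketch: progress_on_this_node would have to be
         * plumbed out of try_to_free_pages() somehow; it does not exist
         * today.
         */
        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order)) {
                /* only keep retrying a __GFP_THISNODE request if reclaim
                 * actually freed pages on the requested node */
                if (!(gfp_mask & __GFP_THISNODE) || progress_on_this_node)
                        return 1;
        }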

I think this situation can be reproduced (and I am working on it) by
exhausting a NUMA node with 16MB hugepages and then using the generic
round-robin allocator to ask for more. Other node-exhaustion cases
probably exist, but since we can't swap the hugepages, this seems like
the most straightforward way to try to reproduce it.
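
Concretely, I am thinking of something like the following (the counts
are illustrative for a 32GB node and would need tuning so that the
per-node write itself still succeeds):

# fill node 1 with as many 16MB hugepages as it will reasonably hold,
# leaving it with essentially no free memory
echo 1900 > /sys/devices/system/node/node1/hugepages/hugepages-16384kB/nr_hugepages

# then grow the global pool via the interleaving allocator; once it
# wraps around to the exhausted node 1, it should hit the stuck-reclaim
# path described above (if the analysis is right)
echo 2000 > /proc/sys/vm/nr_hugepages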

Any thoughts on this? Am I way off base?

Thanks,
Nish


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-11 21:06 ` Nishanth Aravamudan
@ 2014-03-13 17:01   ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-13 17:01 UTC (permalink / raw)
  To: linux-mm; +Cc: cl, rientjes, linuxppc-dev, anton, mgorman

There might have been an error in my original mail, so resending...

On 11.03.2014 [14:06:14 -0700], Nishanth Aravamudan wrote:
> [...]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-13 17:01   ` Nishanth Aravamudan
@ 2014-03-24 23:05     ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-24 23:05 UTC (permalink / raw)
  To: linux-mm; +Cc: cl, rientjes, linuxppc-dev, anton, mgorman

Anyone have any ideas here?

On 13.03.2014 [10:01:27 -0700], Nishanth Aravamudan wrote:
> There might have been an error in my original mail, so resending...
> 
> On 11.03.2014 [14:06:14 -0700], Nishanth Aravamudan wrote:
> > [...]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-24 23:05     ` Nishanth Aravamudan
@ 2014-03-25 16:17       ` Christoph Lameter
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2014-03-25 16:17 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:

> Anyone have any ideas here?

Dont do that? Check on boot to not allow exhausting a node with huge
pages?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 16:17       ` Christoph Lameter
@ 2014-03-25 16:23         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-25 16:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On 25.03.2014 [11:17:57 -0500], Christoph Lameter wrote:
> On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:
> 
> > Anyone have any ideas here?
> 
> Dont do that? Check on boot to not allow exhausting a node with huge
> pages?

Gigantic hugepages are allocated by the hypervisor (not the Linux VM),
and we don't control where the allocation occurs. Yes, ideally, they
would be interleaved to avoid this situation, but I can also see reasons
for having them all come from one node, so that tasks can be affinitized
to it and are guaranteed the benefit of the 16GB page size.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 16:23         ` Nishanth Aravamudan
@ 2014-03-25 16:53           ` Christoph Lameter
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2014-03-25 16:53 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

> On 25.03.2014 [11:17:57 -0500], Christoph Lameter wrote:
> > On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:
> >
> > > Anyone have any ideas here?
> >
> > Dont do that? Check on boot to not allow exhausting a node with huge
> > pages?
>
> Gigantic hugepages are allocated by the hypervisor (not the Linux VM),

Ok so the kernel starts booting up and then suddenly the hypervisor takes
the 2 16G pages before even the slab allocator is working?

Not sure if I understand that correctly.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 16:53           ` Christoph Lameter
@ 2014-03-25 18:10             ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-25 18:10 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On 25.03.2014 [11:53:48 -0500], Christoph Lameter wrote:
> On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:
> 
> > On 25.03.2014 [11:17:57 -0500], Christoph Lameter wrote:
> > > On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:
> > >
> > > > Anyone have any ideas here?
> > >
> > > Dont do that? Check on boot to not allow exhausting a node with huge
> > > pages?
> >
> > Gigantic hugepages are allocated by the hypervisor (not the Linux VM),
> 
> Ok so the kernel starts booting up and then suddenly the hypervisor takes
> the 2 16G pages before even the slab allocator is working?

There is nothing "sudden" about it.

On power, very early, we find the 16G pages (gpages in the powerpc arch
code) in the device-tree:

early_setup ->
	early_init_mmu ->
		htab_initialize ->
			htab_init_page_sizes ->
				htab_dt_scan_hugepage_blocks ->
					memblock_reserve
						which marks the memory
						as reserved
					add_gpage
						which saves the address
						off so future calls for
						alloc_bootmem_huge_page()

hugetlb_init ->
		hugetlb_init_hstates ->
			hugetlb_hstate_alloc_pages ->
				alloc_bootmem_huge_page

> Not sure if I understand that correctly.

Basically this is present memory that is "reserved" for the 16GB usage
per the LPAR configuration. We honor that configuration in Linux based
upon the contents of the device-tree. It just so happens in the
configuration from my original e-mail that a consequence of this is that
a NUMA node has memory (topologically), but none of that memory is free,
nor will it ever be free.

Perhaps, in this case, we could just remove that node from the N_MEMORY
mask? Memory allocations will never succeed from the node, and we can
never free these 16GB pages. It is really not any different than a
memoryless node *except* when you are using the 16GB pages.
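
Something along these lines is what I am picturing, completely
untested, with the detection of the "all of this node's memory is in
gigantic pages" condition hand-waved away behind a made-up helper:

        /*
         * Hypothetical sketch: if the boot-time gigantic page reservations
         * left a node with no free memory at all, stop advertising it as a
         * node with memory.  node_all_mem_in_gpages() is invented for
         * illustration; node_clear_state() is the existing nodemask helper.
         */
        if (node_all_mem_in_gpages(nid))
                node_clear_state(nid, N_MEMORY);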

Thanks,
Nish


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 18:10             ` Nishanth Aravamudan
@ 2014-03-25 18:25               ` Christoph Lameter
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2014-03-25 18:25 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

> On power, very early, we find the 16G pages (gpages in the powerpc arch
> code) in the device-tree:
>
> early_setup ->
> 	early_init_mmu ->
> 		htab_initialize ->
> 			htab_init_page_sizes ->
> 				htab_dt_scan_hugepage_blocks ->
> 					memblock_reserve
> 						which marks the memory
> 						as reserved
> 					add_gpage
> 						which saves the address
> 						off so future calls for
> 						alloc_bootmem_huge_page()
>
> hugetlb_init ->
> 		hugetlb_init_hstates ->
> 			hugetlb_hstate_alloc_pages ->
> 				alloc_bootmem_huge_page
>
> > Not sure if I understand that correctly.
>
> Basically this is present memory that is "reserved" for the 16GB usage
> per the LPAR configuration. We honor that configuration in Linux based
> upon the contents of the device-tree. It just so happens in the
> configuration from my original e-mail that a consequence of this is that
> a NUMA node has memory (topologically), but none of that memory is free,
> nor will it ever be free.

Well dont do that

> Perhaps, in this case, we could just remove that node from the N_MEMORY
> mask? Memory allocations will never succeed from the node, and we can
> never free these 16GB pages. It is really not any different than a
> memoryless node *except* when you are using the 16GB pages.

That looks to be the correct way to handle things. Maybe mark the node as
offline or somehow not present so that the kernel ignores it.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 18:25               ` Christoph Lameter
@ 2014-03-25 18:37                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-25 18:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On 25.03.2014 [13:25:30 -0500], Christoph Lameter wrote:
> On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:
> 
> > On power, very early, we find the 16G pages (gpages in the powerpc arch
> > code) in the device-tree:
> >
> > early_setup ->
> > 	early_init_mmu ->
> > 		htab_initialize ->
> > 			htab_init_page_sizes ->
> > 				htab_dt_scan_hugepage_blocks ->
> > 					memblock_reserve
> > 						which marks the memory
> > 						as reserved
> > 					add_gpage
> > 						which saves the address
> > 						off so future calls for
> > 						alloc_bootmem_huge_page()
> >
> > hugetlb_init ->
> > 		hugetlb_init_hstates ->
> > 			hugetlb_hstate_alloc_pages ->
> > 				alloc_bootmem_huge_page
> >
> > > Not sure if I understand that correctly.
> >
> > Basically this is present memory that is "reserved" for the 16GB usage
> > per the LPAR configuration. We honor that configuration in Linux based
> > upon the contents of the device-tree. It just so happens in the
> > configuration from my original e-mail that a consequence of this is that
> > a NUMA node has memory (topologically), but none of that memory is free,
> > nor will it ever be free.
> 
> Well dont do that

I appreciate the help you're offering, but that's really not an option.
The customer/user has configured the system in such a way that they can
leverage the gigantic pages. And *most* everything seems to work fine
except for the case I mentioned in my original e-mail. I guess we could
allocate fewer 16GB pages when the full count would exhaust a NUMA node,
but ... I think the underlying mapping would still be a 16GB one, so it
will not be accurate from a performance perspective (although it should
perform better).

> > Perhaps, in this case, we could just remove that node from the N_MEMORY
> > mask? Memory allocations will never succeed from the node, and we can
> > never free these 16GB pages. It is really not any different than a
> > memoryless node *except* when you are using the 16GB pages.
> 
> That looks to be the correct way to handle things. Maybe mark the node as
> offline or somehow not present so that the kernel ignores it.

Ok, I'll consider these options. Thanks!

-Nish


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-25 18:25               ` Christoph Lameter
@ 2014-03-27 20:33                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-03-27 20:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

Hi Christoph,

On 25.03.2014 [13:25:30 -0500], Christoph Lameter wrote:
> On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:
> 
> > On power, very early, we find the 16G pages (gpages in the powerpc arch
> > code) in the device-tree:
> >
> > early_setup ->
> > 	early_init_mmu ->
> > 		htab_initialize ->
> > 			htab_init_page_sizes ->
> > 				htab_dt_scan_hugepage_blocks ->
> > 					memblock_reserve
> > 						which marks the memory
> > 						as reserved
> > 					add_gpage
> > 						which saves the address
> > 						off so future calls for
> > 						alloc_bootmem_huge_page()
> >
> > hugetlb_init ->
> > 		hugetlb_init_hstates ->
> > 			hugetlb_hstate_alloc_pages ->
> > 				alloc_bootmem_huge_page
> >
> > > Not sure if I understand that correctly.
> >
> > Basically this is present memory that is "reserved" for the 16GB usage
> > per the LPAR configuration. We honor that configuration in Linux based
> > upon the contents of the device-tree. It just so happens in the
> > configuration from my original e-mail that a consequence of this is that
> > a NUMA node has memory (topologically), but none of that memory is free,
> > nor will it ever be free.
> 
> Well dont do that
> 
> > Perhaps, in this case, we could just remove that node from the N_MEMORY
> > mask? Memory allocations will never succeed from the node, and we can
> > never free these 16GB pages. It is really not any different than a
> > memoryless node *except* when you are using the 16GB pages.
> 
> That looks to be the correct way to handle things. Maybe mark the node as
> offline or somehow not present so that the kernel ignores it.

This is a SLUB condition:

mm/slub.c::early_kmem_cache_node_alloc():
...
        page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
...
        if (page_to_nid(page) != node) {
                printk(KERN_ERR "SLUB: Unable to allocate memory from "
                                "node %d\n", node);
                printk(KERN_ERR "SLUB: Allocating a useless per node structure "
                                "in order to be able to continue\n");
        }
...

Since this is quite early and we have not set up the nodemasks yet,
does it make sense to perhaps have a temporary init-time nodemask that
we set bits in here, and then "fix up" those nodes when we set up the
nodemasks?
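
Something along these lines is what I had in mind (sketch only,
untested; "slub_exhausted_nodes" is just a name I made up):

        /* Sketch: remember which nodes SLUB could not allocate from. */
        static nodemask_t slub_exhausted_nodes __initdata;

        /* in early_kmem_cache_node_alloc(), next to the printks above: */
        if (page_to_nid(page) != node)
                node_set(node, slub_exhausted_nodes);

        /* later, once the node states are being set up: */
        int nid;

        for_each_node_mask(nid, slub_exhausted_nodes)
                node_clear_state(nid, N_MEMORY);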

Thanks,
Nish

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-27 20:33                 ` Nishanth Aravamudan
@ 2014-03-29  5:40                   ` Christoph Lameter
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2014-03-29  5:40 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:

> > That looks to be the correct way to handle things. Maybe mark the node as
> > offline or somehow not present so that the kernel ignores it.
>
> This is a SLUB condition:
>
> mm/slub.c::early_kmem_cache_node_alloc():
> ...
>         page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> ...

So the page allocation from the node failed. We have a strange boot
condition where the OS is aware of a node but allocations on that node
fail.

>         if (page_to_nid(page) != node) {
>                 printk(KERN_ERR "SLUB: Unable to allocate memory from "
>                                 "node %d\n", node);
>                 printk(KERN_ERR "SLUB: Allocating a useless per node structure "
>                                 "in order to be able to continue\n");
>         }
> ...
>
> Since this is quite early, and we have not set up the nodemasks yet,
> does it make sense to perhaps have a temporary init-time nodemask that
> we set bits in here, and "fix-up" those nodes when we setup the
> nodemasks?

Please take care of this earlier than that. The page allocator in general
should allow allocations from all nodes with memory during boot.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-03-29  5:40                   ` Christoph Lameter
@ 2014-04-01  1:33                     ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-04-01  1:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On 29.03.2014 [00:40:41 -0500], Christoph Lameter wrote:
> On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:
> 
> > > That looks to be the correct way to handle things. Maybe mark the node as
> > > offline or somehow not present so that the kernel ignores it.
> >
> > This is a SLUB condition:
> >
> > mm/slub.c::early_kmem_cache_node_alloc():
> > ...
> >         page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> > ...
> 
> So the page allocation from the node failed. We have a strange boot
> condition where the OS is aware of anode but allocations on that node
> fail.

Yep. The node exists, it's just fully exhausted at boot (due to the
presence of 16GB pages reserved at boot-time).

> >         if (page_to_nid(page) != node) {
> >                 printk(KERN_ERR "SLUB: Unable to allocate memory from "
> >                                 "node %d\n", node);
> >                 printk(KERN_ERR "SLUB: Allocating a useless per node structure "
> >                                 "in order to be able to continue\n");
> >         }
> > ...
> >
> > Since this is quite early, and we have not set up the nodemasks yet,
> > does it make sense to perhaps have a temporary init-time nodemask that
> > we set bits in here, and "fix-up" those nodes when we setup the
> > nodemasks?
> 
> Please take care of this earlier than this. The page allocator in
> general should allow allocations from all nodes with memory during
> boot,

I'd appreciate a bit more guidance. I'm suggesting that in this case the
node functionally has no memory, so the page allocator should not allow
allocations from it -- except, possibly, for userspace accessing the
16GB pages on that node (I still need to investigate this), but that, I
believe, doesn't go through the page allocator at all; it's all via the
hugetlb interfaces. It seems to me there is a bug in SLUB: we note that
we have a useless per-node structure for a given nid, but we don't
actually prevent requests to that node, or the reclaim triggered by
those requests.

The page allocator itself is actually fine here, afaict. We've pulled
the memory out of this node, so even though the node is present, none of
its memory is free; all of that is working as expected. The problems
start when we "force" (by way of the round-robin allocation done for
/proc/sys/vm/nr_hugepages) a THISNODE allocation to come from the
exhausted node. That node has no memory free, so we enter reclaim,
reclaim makes progress on other nodes, and the allocation failure is
never alleviated (and never can be).

I think there is a logical bug (even if it only shows up in this
particular corner case): when reclaim makes progress on behalf of a
THISNODE allocation, we don't check *where* that progress was made, and
so we may falsely report progress when the allocation that triggered
reclaim cannot possibly benefit from it.
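
To make that concrete, the kind of check I'm imagining looks roughly
like this (untested sketch against a 3.14-era mm/page_alloc.c;
"preferred_zone" is whatever zone the THISNODE request is targeting):

        /*
         * Sketch only: for a __GFP_THISNODE allocation, only believe
         * "did some progress" if the target zone actually has free pages
         * now; progress made purely on other nodes cannot help this
         * allocation.
         */
        static bool thisnode_progress(gfp_t gfp_mask, struct zone *preferred_zone,
                                      unsigned long did_some_progress)
        {
                if (!did_some_progress)
                        return false;
                if (!(gfp_mask & __GFP_THISNODE))
                        return true;
                return zone_page_state(preferred_zone, NR_FREE_PAGES) != 0;
        }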

Thanks,
Nish

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-04-01  1:33                     ` Nishanth Aravamudan
@ 2014-04-03 16:41                       ` Christoph Lameter
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2014-04-03 16:41 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

On Mon, 31 Mar 2014, Nishanth Aravamudan wrote:

> Yep. The node exists, it's just fully exhausted at boot (due to the
> presence of 16GB pages reserved at boot-time).

Well if you want us to support that then I guess you need to propose
patches to address this issue.

> I'd appreciate a bit more guidance? I'm suggesting that in this case the
> node functionally has no memory. So the page allocator should not allow
> allocations from it -- except (I need to investigate this still)
> userspace accessing the 16GB pages on that node, but that, I believe,
> doesn't go through the page allocator at all, it's all from hugetlb
> interfaces. It seems to me there is a bug in SLUB that we are noting
> that we have a useless per-node structure for a given nid, but not
> actually preventing requests to that node or reclaim because of those
> allocations.

Well, if you can address that without impacting the fastpath, then we could
do this. Otherwise we would need a fake structure here to avoid adding
checks to the fastpath.

> I think there is a logical bug (even if it only occurs in this
> particular corner case) where if reclaim progresses for a THISNODE
> allocation, we don't check *where* the reclaim is progressing, and thus
> may falsely be indicating that we have done some progress when in fact
> the allocation that is causing reclaim will not possibly make any more
> progress.

Ok maybe we could address this corner case. How would you do this?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bug in reclaim logic with exhausted nodes?
  2014-04-03 16:41                       ` Christoph Lameter
@ 2014-05-12 18:46                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 28+ messages in thread
From: Nishanth Aravamudan @ 2014-05-12 18:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, rientjes, linuxppc-dev, anton, mgorman

Hi Christoph,

Sorry for the delay in my response!

On 03.04.2014 [11:41:37 -0500], Christoph Lameter wrote:
> On Mon, 31 Mar 2014, Nishanth Aravamudan wrote:
> 
> > Yep. The node exists, it's just fully exhausted at boot (due to the
> > presence of 16GB pages reserved at boot-time).
> 
> Well if you want us to support that then I guess you need to propose
> patches to address this issue.

Yep, that's my plan; I was hoping to get input from developers/experts
such as yourself first. Obviously, code speaks louder, though...

> > I'd appreciate a bit more guidance? I'm suggesting that in this case
> > the node functionally has no memory. So the page allocator should
> > not allow allocations from it -- except (I need to investigate this
> > still) userspace accessing the 16GB pages on that node, but that, I
> > believe, doesn't go through the page allocator at all, it's all from
> > hugetlb interfaces. It seems to me there is a bug in SLUB that we
> > are noting that we have a useless per-node structure for a given
> > nid, but not actually preventing requests to that node or reclaim
> > because of those allocations.
> 
> Well if you can address that without impacting the fastpath then we
> could do this. Otherwise we would need a fake structure here to avoid
> adding checks to the fastpath

Ok, I'll keep thinking about what makes the most sense.

> > I think there is a logical bug (even if it only occurs in this
> > particular corner case) where if reclaim progresses for a THISNODE
> > allocation, we don't check *where* the reclaim is progressing, and thus
> > may falsely be indicating that we have done some progress when in fact
> > the allocation that is causing reclaim will not possibly make any more
> > progress.
> 
> Ok maybe we could address this corner case. How would you do this?

This is where I started to get stumped. did_some_progress only records
that *some* progress was made; it would be more expensive in the reclaim
path to track which nodes we made progress on and to verify that it was
the intended one (when we are reclaiming for a THISNODE allocation). I
will keep looking at this case specifically; I apologize that it's
taking me quite a bit of time to get up to speed on the code and design.
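
One alternative I've been toying with, to avoid tracking per-node
progress at all, is to constrain the reclaim itself to the target node
for THISNODE allocations, so that any progress reported is by
construction on the right node. A rough sketch (untested;
try_to_free_pages() does take a nodemask in the tree I'm looking at,
and the surrounding variable names are approximate):

        /*
         * Sketch: for __GFP_THISNODE, only reclaim on the node the
         * allocation must come from, rather than the whole zonelist.
         */
        unsigned long progress;

        if (gfp_mask & __GFP_THISNODE) {
                nodemask_t thisnode = nodemask_of_node(zone_to_nid(preferred_zone));

                progress = try_to_free_pages(zonelist, order, gfp_mask,
                                             &thisnode);
        } else {
                progress = try_to_free_pages(zonelist, order, gfp_mask,
                                             nodemask);
        }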

Thanks,
Nish

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-05-12 18:47 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-11 21:06 Bug in reclaim logic with exhausted nodes? Nishanth Aravamudan
2014-03-11 21:06 ` Nishanth Aravamudan
2014-03-13 17:01 ` Nishanth Aravamudan
2014-03-13 17:01   ` Nishanth Aravamudan
2014-03-24 23:05   ` Nishanth Aravamudan
2014-03-24 23:05     ` Nishanth Aravamudan
2014-03-25 16:17     ` Christoph Lameter
2014-03-25 16:17       ` Christoph Lameter
2014-03-25 16:23       ` Nishanth Aravamudan
2014-03-25 16:23         ` Nishanth Aravamudan
2014-03-25 16:53         ` Christoph Lameter
2014-03-25 16:53           ` Christoph Lameter
2014-03-25 18:10           ` Nishanth Aravamudan
2014-03-25 18:10             ` Nishanth Aravamudan
2014-03-25 18:25             ` Christoph Lameter
2014-03-25 18:25               ` Christoph Lameter
2014-03-25 18:37               ` Nishanth Aravamudan
2014-03-25 18:37                 ` Nishanth Aravamudan
2014-03-27 20:33               ` Nishanth Aravamudan
2014-03-27 20:33                 ` Nishanth Aravamudan
2014-03-29  5:40                 ` Christoph Lameter
2014-03-29  5:40                   ` Christoph Lameter
2014-04-01  1:33                   ` Nishanth Aravamudan
2014-04-01  1:33                     ` Nishanth Aravamudan
2014-04-03 16:41                     ` Christoph Lameter
2014-04-03 16:41                       ` Christoph Lameter
2014-05-12 18:46                       ` Nishanth Aravamudan
2014-05-12 18:46                         ` Nishanth Aravamudan
