linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
@ 2014-09-01 12:55 Mel Gorman
  2014-09-02  1:12 ` Kamezawa Hiroyuki
  2014-09-02 13:51 ` Johannes Weiner
  0 siblings, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2014-09-01 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

Zones are allocated by the page allocator in either node or zone order.
Node ordering is preferred in terms of locality and is applied automatically
in one of three cases.

  1. If a node has only low memory

  2. If DMA/DMA32 is a high percentage of memory

  3. If low memory on a single node is greater than 70% of the node size

Otherwise zone ordering is used to preserve low memory. Unfortunately
a consequence of this is that a machine with balanced NUMA nodes will
experience different performance characteristics depending on which node
they happen to start from.

The point of zone ordering is to protect lower nodes for devices that require
DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit
NUMA machines commonly suffered from low memory exhaustion problems. On
64-bit machines the primary concern is devices that are 32-bit only which
is less severe than the low memory exhaustion problem on 32-bit NUMA. It
seems there are really few devices that depends on it.

AGP -- I assume this is getting more rare but even then I think the allocations
	happen early in boot time where lowmem pressure is less of a problem

DRM -- If the device is 32-bit only then there may be low pressure. I didn't
	evaluate these in detail but it looks like some of these are mobile
	graphics card. Not many NUMA laptops out there. DRM folk should know
	better though.

Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?

B43 wireless card -- again not really a NUMA thing.

I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
machines in case someone throws a brain damanged TV or graphics card in there.
This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
to make it default everywhere but I understand that some embedded arches may
be using 32-bit NUMA where I cannot predict the consequences.

The performance impact depends on the workload and the characteristics of the
machine and the machine I tested on had a large Normal zone on node 0 so the
impact is within the noise for the majority of tests. The allocation stats
show more allocation requests were from DMA32 and local node. Running SpecJBB
with multiple JVMs and automatic NUMA balancing disabled the results were

specjbb
                     3.17.0-rc2            3.17.0-rc2
                        vanilla        nodeorder-v1r1
Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)

                            3.17.0-rc2  3.17.0-rc2
                               vanillanodeorder-v1r1
DMA allocs                           0           0
DMA32 allocs                        57     1491992
Normal allocs                 32543566    30026383
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
THP fault alloc                  55164       52987
THP collapse alloc                 139         147
THP splits                          26          21
NUMA alloc hit                 4169066     4250692
NUMA alloc miss                      0           0

Note that there were more DMA32 allocations with the patch applied.  In this
particular case there was no difference in numa_hit and numa_miss. The
expectation is that DMA32 was being used at the low watermark instead of
falling into the slow path. kswapd was not woken but it's not worken for
THP allocations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18cee0d..20059aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3579,6 +3579,17 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 	zonelist->_zonerefs[pos].zone_idx = 0;
 }
 
+#if defined(CONFIG_64BIT)
+
+/* Devices that require DMA32/DMA are relatively rare and do not justify a
+ * penalty to every machine in case the specialised case applies. Default
+ * to Node-ordering on 64-bit NUMA machines
+ */
+static int default_zonelist_order(void)
+{
+	return ZONELIST_ORDER_NODE;
+}
+#else
 static int default_zonelist_order(void)
 {
 	int nid, zone_type;
@@ -3641,6 +3652,7 @@ static int default_zonelist_order(void)
 	}
 	return ZONELIST_ORDER_ZONE;
 }
+#endif /* CONFIG_64BIT */
 
 static void set_zonelist_order(void)
 {

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-01 12:55 [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines Mel Gorman
@ 2014-09-02  1:12 ` Kamezawa Hiroyuki
  2014-09-02 13:51 ` Johannes Weiner
  1 sibling, 0 replies; 9+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-02  1:12 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, David Rientjes, Fengguang Wu,
	linux-mm, linux-kernel

(2014/09/01 21:55), Mel Gorman wrote:
> Zones are allocated by the page allocator in either node or zone order.
> Node ordering is preferred in terms of locality and is applied automatically
> in one of three cases.
>
>    1. If a node has only low memory
>
>    2. If DMA/DMA32 is a high percentage of memory
>
>    3. If low memory on a single node is greater than 70% of the node size
>
> Otherwise zone ordering is used to preserve low memory. Unfortunately
> a consequence of this is that a machine with balanced NUMA nodes will
> experience different performance characteristics depending on which node
> they happen to start from.
>
> The point of zone ordering is to protect lower nodes for devices that require
> DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit
> NUMA machines commonly suffered from low memory exhaustion problems. On
> 64-bit machines the primary concern is devices that are 32-bit only which
> is less severe than the low memory exhaustion problem on 32-bit NUMA. It
> seems there are really few devices that depends on it.
>
> AGP -- I assume this is getting more rare but even then I think the allocations
> 	happen early in boot time where lowmem pressure is less of a problem
>
> DRM -- If the device is 32-bit only then there may be low pressure. I didn't
> 	evaluate these in detail but it looks like some of these are mobile
> 	graphics card. Not many NUMA laptops out there. DRM folk should know
> 	better though.
>
> Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
>
> B43 wireless card -- again not really a NUMA thing.
>
> I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> machines in case someone throws a brain damanged TV or graphics card in there.
> This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> to make it default everywhere but I understand that some embedded arches may
> be using 32-bit NUMA where I cannot predict the consequences.
>
> The performance impact depends on the workload and the characteristics of the
> machine and the machine I tested on had a large Normal zone on node 0 so the
> impact is within the noise for the majority of tests. The allocation stats
> show more allocation requests were from DMA32 and local node. Running SpecJBB
> with multiple JVMs and automatic NUMA balancing disabled the results were
>
> specjbb
>                       3.17.0-rc2            3.17.0-rc2
>                          vanilla        nodeorder-v1r1
> Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
> Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
> Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
> Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
> Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
> Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
> Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
> Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
> Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
> Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
> Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
> Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
> Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
> Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
> Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
> Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
> Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
> Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
> Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
> Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
> Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
> Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
> Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
> Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
> TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
> TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
> TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
> TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
> TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
> TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
>
>                              3.17.0-rc2  3.17.0-rc2
>                                 vanillanodeorder-v1r1
> DMA allocs                           0           0
> DMA32 allocs                        57     1491992
> Normal allocs                 32543566    30026383
> Movable allocs                       0           0
> Direct pages scanned                 0           0
> Kswapd pages scanned                 0           0
> Kswapd pages reclaimed               0           0
> Direct pages reclaimed               0           0
> Kswapd efficiency                 100%        100%
> Kswapd velocity                  0.000       0.000
> Direct efficiency                 100%        100%
> Direct velocity                  0.000       0.000
> Percentage direct scans             0%          0%
> Zone normal velocity             0.000       0.000
> Zone dma32 velocity              0.000       0.000
> Zone dma velocity                0.000       0.000
> THP fault alloc                  55164       52987
> THP collapse alloc                 139         147
> THP splits                          26          21
> NUMA alloc hit                 4169066     4250692
> NUMA alloc miss                      0           0
>
> Note that there were more DMA32 allocations with the patch applied.  In this
> particular case there was no difference in numa_hit and numa_miss. The
> expectation is that DMA32 was being used at the low watermark instead of
> falling into the slow path. kswapd was not woken but it's not worken for
> THP allocations.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I agree. When I introduced the option(at ia64 numa development), there might be
some troubles. Recent NUMA tend to have enough memory on node0.

And problems can be avoided with numa_zonelist_order" "boot option, anyway.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>





> ---
>   mm/page_alloc.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 18cee0d..20059aa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3579,6 +3579,17 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
>   	zonelist->_zonerefs[pos].zone_idx = 0;
>   }
>
> +#if defined(CONFIG_64BIT)
> +
> +/* Devices that require DMA32/DMA are relatively rare and do not justify a
> + * penalty to every machine in case the specialised case applies. Default
> + * to Node-ordering on 64-bit NUMA machines
> + */
> +static int default_zonelist_order(void)
> +{
> +	return ZONELIST_ORDER_NODE;
> +}
> +#else
>   static int default_zonelist_order(void)
>   {
>   	int nid, zone_type;
> @@ -3641,6 +3652,7 @@ static int default_zonelist_order(void)
>   	}
>   	return ZONELIST_ORDER_ZONE;
>   }
> +#endif /* CONFIG_64BIT */
>
>   static void set_zonelist_order(void)
>   {
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-01 12:55 [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines Mel Gorman
  2014-09-02  1:12 ` Kamezawa Hiroyuki
@ 2014-09-02 13:51 ` Johannes Weiner
  2014-09-02 14:01   ` Kamezawa Hiroyuki
  2014-09-02 15:21   ` Mel Gorman
  1 sibling, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2014-09-02 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

On Mon, Sep 01, 2014 at 01:55:51PM +0100, Mel Gorman wrote:
> Zones are allocated by the page allocator in either node or zone order.
> Node ordering is preferred in terms of locality and is applied automatically
> in one of three cases.
> 
>   1. If a node has only low memory
> 
>   2. If DMA/DMA32 is a high percentage of memory
> 
>   3. If low memory on a single node is greater than 70% of the node size
> 
> Otherwise zone ordering is used to preserve low memory. Unfortunately
> a consequence of this is that a machine with balanced NUMA nodes will
> experience different performance characteristics depending on which node
> they happen to start from.
> 
> The point of zone ordering is to protect lower nodes for devices that require
> DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit
> NUMA machines commonly suffered from low memory exhaustion problems. On
> 64-bit machines the primary concern is devices that are 32-bit only which
> is less severe than the low memory exhaustion problem on 32-bit NUMA. It
> seems there are really few devices that depends on it.
> 
> AGP -- I assume this is getting more rare but even then I think the allocations
> 	happen early in boot time where lowmem pressure is less of a problem
> 
> DRM -- If the device is 32-bit only then there may be low pressure. I didn't
> 	evaluate these in detail but it looks like some of these are mobile
> 	graphics card. Not many NUMA laptops out there. DRM folk should know
> 	better though.
> 
> Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
> 
> B43 wireless card -- again not really a NUMA thing.
> 
> I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> machines in case someone throws a brain damanged TV or graphics card in there.
> This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> to make it default everywhere but I understand that some embedded arches may
> be using 32-bit NUMA where I cannot predict the consequences.

This patch is a step in the right direction, but I'm not too fond of
further fragmenting this code and where it applies, while leaving all
the complexity from the heuristics and the zonelist building in, just
on spec.  Could we at least remove the heuristics too?  If anybody is
affected by this, they can always override the default on the cmdline.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-02 13:51 ` Johannes Weiner
@ 2014-09-02 14:01   ` Kamezawa Hiroyuki
  2014-09-02 15:21   ` Mel Gorman
  1 sibling, 0 replies; 9+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-02 14:01 UTC (permalink / raw)
  To: Johannes Weiner, Mel Gorman
  Cc: Andrew Morton, Rik van Riel, David Rientjes, Fengguang Wu,
	linux-mm, linux-kernel

(2014/09/02 22:51), Johannes Weiner wrote:
> On Mon, Sep 01, 2014 at 01:55:51PM +0100, Mel Gorman wrote:
>> Zones are allocated by the page allocator in either node or zone order.
>> Node ordering is preferred in terms of locality and is applied automatically
>> in one of three cases.
>>
>>    1. If a node has only low memory
>>
>>    2. If DMA/DMA32 is a high percentage of memory
>>
>>    3. If low memory on a single node is greater than 70% of the node size
>>
>> Otherwise zone ordering is used to preserve low memory. Unfortunately
>> a consequence of this is that a machine with balanced NUMA nodes will
>> experience different performance characteristics depending on which node
>> they happen to start from.
>>
>> The point of zone ordering is to protect lower nodes for devices that require
>> DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit
>> NUMA machines commonly suffered from low memory exhaustion problems. On
>> 64-bit machines the primary concern is devices that are 32-bit only which
>> is less severe than the low memory exhaustion problem on 32-bit NUMA. It
>> seems there are really few devices that depends on it.
>>
>> AGP -- I assume this is getting more rare but even then I think the allocations
>> 	happen early in boot time where lowmem pressure is less of a problem
>>
>> DRM -- If the device is 32-bit only then there may be low pressure. I didn't
>> 	evaluate these in detail but it looks like some of these are mobile
>> 	graphics card. Not many NUMA laptops out there. DRM folk should know
>> 	better though.
>>
>> Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
>>
>> B43 wireless card -- again not really a NUMA thing.
>>
>> I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
>> machines in case someone throws a brain damanged TV or graphics card in there.
>> This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
>> to make it default everywhere but I understand that some embedded arches may
>> be using 32-bit NUMA where I cannot predict the consequences.
>
> This patch is a step in the right direction, but I'm not too fond of
> further fragmenting this code and where it applies, while leaving all
> the complexity from the heuristics and the zonelist building in, just
> on spec.  Could we at least remove the heuristics too?  If anybody is
> affected by this, they can always override the default on the cmdline.
>
I'm okay with removing heuristics. There were a request to add "automatic detection"
at the time this feature was developped. But I'm not sure whether the logic is
still required. i.e. at that age, node-0 memory was small and default node order
can cause OOM easily.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-02 13:51 ` Johannes Weiner
  2014-09-02 14:01   ` Kamezawa Hiroyuki
@ 2014-09-02 15:21   ` Mel Gorman
  2014-09-04 15:29     ` Johannes Weiner
  1 sibling, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2014-09-02 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

On Tue, Sep 02, 2014 at 09:51:20AM -0400, Johannes Weiner wrote:
> On Mon, Sep 01, 2014 at 01:55:51PM +0100, Mel Gorman wrote:
> > Zones are allocated by the page allocator in either node or zone order.
> > Node ordering is preferred in terms of locality and is applied automatically
> > in one of three cases.
> > 
> >   1. If a node has only low memory
> > 
> >   2. If DMA/DMA32 is a high percentage of memory
> > 
> >   3. If low memory on a single node is greater than 70% of the node size
> > 
> > Otherwise zone ordering is used to preserve low memory. Unfortunately
> > a consequence of this is that a machine with balanced NUMA nodes will
> > experience different performance characteristics depending on which node
> > they happen to start from.
> > 
> > The point of zone ordering is to protect lower nodes for devices that require
> > DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit
> > NUMA machines commonly suffered from low memory exhaustion problems. On
> > 64-bit machines the primary concern is devices that are 32-bit only which
> > is less severe than the low memory exhaustion problem on 32-bit NUMA. It
> > seems there are really few devices that depends on it.
> > 
> > AGP -- I assume this is getting more rare but even then I think the allocations
> > 	happen early in boot time where lowmem pressure is less of a problem
> > 
> > DRM -- If the device is 32-bit only then there may be low pressure. I didn't
> > 	evaluate these in detail but it looks like some of these are mobile
> > 	graphics card. Not many NUMA laptops out there. DRM folk should know
> > 	better though.
> > 
> > Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
> > 
> > B43 wireless card -- again not really a NUMA thing.
> > 
> > I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> > machines in case someone throws a brain damanged TV or graphics card in there.
> > This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> > to make it default everywhere but I understand that some embedded arches may
> > be using 32-bit NUMA where I cannot predict the consequences.
> 
> This patch is a step in the right direction, but I'm not too fond of
> further fragmenting this code and where it applies, while leaving all
> the complexity from the heuristics and the zonelist building in, just
> on spec.  Could we at least remove the heuristics too?  If anybody is
> affected by this, they can always override the default on the cmdline.

I see no problem with deleting the heuristics. Default node for 64-bit
and default zone for 32-bit sound ok to you?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-02 15:21   ` Mel Gorman
@ 2014-09-04 15:29     ` Johannes Weiner
  2014-09-05 10:30       ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2014-09-04 15:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

On Tue, Sep 02, 2014 at 04:21:43PM +0100, Mel Gorman wrote:
> On Tue, Sep 02, 2014 at 09:51:20AM -0400, Johannes Weiner wrote:
> > On Mon, Sep 01, 2014 at 01:55:51PM +0100, Mel Gorman wrote:
> > > I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> > > machines in case someone throws a brain damanged TV or graphics card in there.
> > > This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> > > to make it default everywhere but I understand that some embedded arches may
> > > be using 32-bit NUMA where I cannot predict the consequences.
> > 
> > This patch is a step in the right direction, but I'm not too fond of
> > further fragmenting this code and where it applies, while leaving all
> > the complexity from the heuristics and the zonelist building in, just
> > on spec.  Could we at least remove the heuristics too?  If anybody is
> > affected by this, they can always override the default on the cmdline.
> 
> I see no problem with deleting the heuristics. Default node for 64-bit
> and default zone for 32-bit sound ok to you?

Is there a strong reason against defaulting both to node order?  Zone
ordering, if anything, is a niche application.  We might even be able
to remove it in the future.  We still have the backup of allowing the
user to explicitely request zone ordering on the commandline, should
someone depend on it unexpectedly.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines
  2014-09-04 15:29     ` Johannes Weiner
@ 2014-09-05 10:30       ` Mel Gorman
  2014-09-09 14:46         ` [PATCH] mm: page_alloc: Default node-ordering on 64-bit NUMA, zone-ordering on 32-bit v2 Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2014-09-05 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

On Thu, Sep 04, 2014 at 11:29:29AM -0400, Johannes Weiner wrote:
> On Tue, Sep 02, 2014 at 04:21:43PM +0100, Mel Gorman wrote:
> > On Tue, Sep 02, 2014 at 09:51:20AM -0400, Johannes Weiner wrote:
> > > On Mon, Sep 01, 2014 at 01:55:51PM +0100, Mel Gorman wrote:
> > > > I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> > > > machines in case someone throws a brain damanged TV or graphics card in there.
> > > > This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> > > > to make it default everywhere but I understand that some embedded arches may
> > > > be using 32-bit NUMA where I cannot predict the consequences.
> > > 
> > > This patch is a step in the right direction, but I'm not too fond of
> > > further fragmenting this code and where it applies, while leaving all
> > > the complexity from the heuristics and the zonelist building in, just
> > > on spec.  Could we at least remove the heuristics too?  If anybody is
> > > affected by this, they can always override the default on the cmdline.
> > 
> > I see no problem with deleting the heuristics. Default node for 64-bit
> > and default zone for 32-bit sound ok to you?
> 
> Is there a strong reason against defaulting both to node order?  Zone
> ordering, if anything, is a niche application.  We might even be able
> to remove it in the future.  We still have the backup of allowing the
> user to explicitely request zone ordering on the commandline, should
> someone depend on it unexpectedly.

Low memory depletion is the reason to default to zone order on 32-bit
NUMA. If processes on node 0 deplete the Normal zone from normal
activity then other nodes must keep reclaiming from Normal for all kernel
allocations. The problem is worse if CONFIG_HIGHPTE is not set.

A default of node-ordering on 32-bit NUMA increases low memory pressure
leading to increased reclaim and potentially easier to trigger OOM. I
expect this problem was worse in the past when the normal zone could be
filled with dirty pages under writeback.  However low memory pressure is
still enough of a concern that I'm wary of changing the default of 32-bit
NUMA without knowing who even cares about 32-bit NUMA.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] mm: page_alloc: Default node-ordering on 64-bit NUMA, zone-ordering on 32-bit v2
  2014-09-05 10:30       ` Mel Gorman
@ 2014-09-09 14:46         ` Mel Gorman
  2014-09-09 14:52           ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2014-09-09 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

Changelog since v1
o Default to zone-ordering on 32-bit and remove heuristics
o Expand changelog

Zones are allocated by the page allocator in either node or zone order.
Node ordering is preferred in terms of locality and is applied automatically
in one of three cases.

  1. If a node has only low memory

  2. If DMA/DMA32 is a high percentage of memory

  3. If low memory on a single node is greater than 70% of the node size

Otherwise zone ordering is used to preserve low memory for devices that
require it. Unfortunately a consequence of this is that a machine with
balanced NUMA nodes will experience different performance characteristics
depending on which node they happen to start from.

The point of zone ordering is to protect lower nodes for devices that
require DMA/DMA32 memory. When NUMA was first introduced, this was critical
as 32-bit NUMA machines existed and exhausting low memory triggered OOMs
easily as so many allocations required low memory. On 64-bit machines the
primary concern is devices that are 32-bit only which is less severe than
the low memory exhaustion problem on 32-bit NUMA. It seems there are really
few devices that depends on it.

AGP -- I assume this is getting more rare but even then I think the allocations
	happen early in boot time where lowmem pressure is less of a problem

DRM -- If the device is 32-bit only then there may be low pressure. I didn't
	evaluate these in detail but it looks like some of these are mobile
	graphics card. Not many NUMA laptops out there. DRM folk should know
	better though.

Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?

B43 wireless card -- again not really a NUMA thing.

I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
machines in case someone throws a brain damanged TV or graphics card in there.
This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
to make it default everywhere but I understand that some embedded arches may
be using 32-bit NUMA where I cannot predict the consequences.

The performance impact depends on the workload and the characteristics of the
machine and the machine I tested on had a large Normal zone on node 0 so the
impact is within the noise for the majority of tests. The allocation stats
show more allocation requests were from DMA32 and local node. Running SpecJBB
with multiple JVMs and automatic NUMA balancing disabled the results were

specjbb
                     3.17.0-rc2            3.17.0-rc2
                        vanilla        nodeorder-v1r1
Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)

                            3.17.0-rc2  3.17.0-rc2
                               vanillanodeorder-v1r1
DMA allocs                           0           0
DMA32 allocs                        57     1491992
Normal allocs                 32543566    30026383
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
THP fault alloc                  55164       52987
THP collapse alloc                 139         147
THP splits                          26          21
NUMA alloc hit                 4169066     4250692
NUMA alloc miss                      0           0

Note that there were more DMA32 allocations with the patch applied.  In this
particular case there was no difference in numa_hit and numa_miss. The
expectation is that DMA32 was being used at the low watermark instead of
falling into the slow path. kswapd was not woken but it's not worken for
THP allocations.

On 32-bit, this patch defaults to zone-ordering as low memory depletion
can be a serious problem on 32-bit large memory machines. If the default
ordering was node then processes on node 0 will deplete the Normal zone
due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
set. If combined with large amounts of dirty/writeback pages in Normal
zone then there is also a high risk of OOM. The heuristics are removed
as it's not clear they were ever important on 32-bit. They were only
relevant for setting node-ordering on 64-bit.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 78 +++++++++++++++------------------------------------------
 1 file changed, 20 insertions(+), 58 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18cee0d..0a7ed33 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3579,68 +3579,30 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 	zonelist->_zonerefs[pos].zone_idx = 0;
 }
 
+#if defined(CONFIG_64BIT)
+/*
+ * Devices that require DMA32/DMA are relatively rare and do not justify a
+ * penalty to every machine in case the specialised case applies. Default
+ * to Node-ordering on 64-bit NUMA machines
+ */
+static int default_zonelist_order(void)
+{
+	return ZONELIST_ORDER_NODE;
+}
+#else
+/*
+ * On 32-bit, the Normal zone needs to be preserved for allocations accessible
+ * by the kernel. If processes running on node 0 deplete the low memory zone
+ * then reclaim will occur more frequency increasing stalls and potentially
+ * be easier to OOM if a large percentage of the zone is under writeback or
+ * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
+ * Hence, default to zone ordering on 32-bit.
+ */
 static int default_zonelist_order(void)
 {
-	int nid, zone_type;
-	unsigned long low_kmem_size, total_size;
-	struct zone *z;
-	int average_size;
-	/*
-	 * ZONE_DMA and ZONE_DMA32 can be very small area in the system.
-	 * If they are really small and used heavily, the system can fall
-	 * into OOM very easily.
-	 * This function detect ZONE_DMA/DMA32 size and configures zone order.
-	 */
-	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
-	low_kmem_size = 0;
-	total_size = 0;
-	for_each_online_node(nid) {
-		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-			z = &NODE_DATA(nid)->node_zones[zone_type];
-			if (populated_zone(z)) {
-				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->managed_pages;
-				total_size += z->managed_pages;
-			} else if (zone_type == ZONE_NORMAL) {
-				/*
-				 * If any node has only lowmem, then node order
-				 * is preferred to allow kernel allocations
-				 * locally; otherwise, they can easily infringe
-				 * on other nodes when there is an abundance of
-				 * lowmem available to allocate from.
-				 */
-				return ZONELIST_ORDER_NODE;
-			}
-		}
-	}
-	if (!low_kmem_size ||  /* there are no DMA area. */
-	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
-		return ZONELIST_ORDER_NODE;
-	/*
-	 * look into each node's config.
-	 * If there is a node whose DMA/DMA32 memory is very big area on
-	 * local memory, NODE_ORDER may be suitable.
-	 */
-	average_size = total_size /
-				(nodes_weight(node_states[N_MEMORY]) + 1);
-	for_each_online_node(nid) {
-		low_kmem_size = 0;
-		total_size = 0;
-		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-			z = &NODE_DATA(nid)->node_zones[zone_type];
-			if (populated_zone(z)) {
-				if (zone_type < ZONE_NORMAL)
-					low_kmem_size += z->present_pages;
-				total_size += z->present_pages;
-			}
-		}
-		if (low_kmem_size &&
-		    total_size > average_size && /* ignore small node */
-		    low_kmem_size > total_size * 70/100)
-			return ZONELIST_ORDER_NODE;
-	}
 	return ZONELIST_ORDER_ZONE;
 }
+#endif /* CONFIG_64BIT */
 
 static void set_zonelist_order(void)
 {

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: page_alloc: Default node-ordering on 64-bit NUMA, zone-ordering on 32-bit v2
  2014-09-09 14:46         ` [PATCH] mm: page_alloc: Default node-ordering on 64-bit NUMA, zone-ordering on 32-bit v2 Mel Gorman
@ 2014-09-09 14:52           ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2014-09-09 14:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, David Rientjes, KAMEZAWA Hiroyuki,
	Fengguang Wu, linux-mm, linux-kernel

On Tue, Sep 09, 2014 at 03:46:30PM +0100, Mel Gorman wrote:
> Changelog since v1
> o Default to zone-ordering on 32-bit and remove heuristics
> o Expand changelog
> 
> Zones are allocated by the page allocator in either node or zone order.
> Node ordering is preferred in terms of locality and is applied automatically
> in one of three cases.
> 
>   1. If a node has only low memory
> 
>   2. If DMA/DMA32 is a high percentage of memory
> 
>   3. If low memory on a single node is greater than 70% of the node size
> 
> Otherwise zone ordering is used to preserve low memory for devices that
> require it. Unfortunately a consequence of this is that a machine with
> balanced NUMA nodes will experience different performance characteristics
> depending on which node they happen to start from.
> 
> The point of zone ordering is to protect lower nodes for devices that
> require DMA/DMA32 memory. When NUMA was first introduced, this was critical
> as 32-bit NUMA machines existed and exhausting low memory triggered OOMs
> easily as so many allocations required low memory. On 64-bit machines the
> primary concern is devices that are 32-bit only which is less severe than
> the low memory exhaustion problem on 32-bit NUMA. It seems there are really
> few devices that depends on it.
> 
> AGP -- I assume this is getting more rare but even then I think the allocations
> 	happen early in boot time where lowmem pressure is less of a problem
> 
> DRM -- If the device is 32-bit only then there may be low pressure. I didn't
> 	evaluate these in detail but it looks like some of these are mobile
> 	graphics card. Not many NUMA laptops out there. DRM folk should know
> 	better though.
> 
> Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
> 
> B43 wireless card -- again not really a NUMA thing.
> 
> I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
> machines in case someone throws a brain damanged TV or graphics card in there.
> This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
> to make it default everywhere but I understand that some embedded arches may
> be using 32-bit NUMA where I cannot predict the consequences.
> 
> The performance impact depends on the workload and the characteristics of the
> machine and the machine I tested on had a large Normal zone on node 0 so the
> impact is within the noise for the majority of tests. The allocation stats
> show more allocation requests were from DMA32 and local node. Running SpecJBB
> with multiple JVMs and automatic NUMA balancing disabled the results were
> 
> specjbb
>                      3.17.0-rc2            3.17.0-rc2
>                         vanilla        nodeorder-v1r1
> Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
> Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
> Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
> Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
> Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
> Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
> Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
> Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
> Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
> Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
> Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
> Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
> Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
> Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
> Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
> Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
> Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
> Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
> Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
> Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
> Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
> Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
> Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
> Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
> TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
> TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
> TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
> TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
> TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
> TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
> 
>                             3.17.0-rc2  3.17.0-rc2
>                                vanillanodeorder-v1r1
> DMA allocs                           0           0
> DMA32 allocs                        57     1491992
> Normal allocs                 32543566    30026383
> Movable allocs                       0           0
> Direct pages scanned                 0           0
> Kswapd pages scanned                 0           0
> Kswapd pages reclaimed               0           0
> Direct pages reclaimed               0           0
> Kswapd efficiency                 100%        100%
> Kswapd velocity                  0.000       0.000
> Direct efficiency                 100%        100%
> Direct velocity                  0.000       0.000
> Percentage direct scans             0%          0%
> Zone normal velocity             0.000       0.000
> Zone dma32 velocity              0.000       0.000
> Zone dma velocity                0.000       0.000
> THP fault alloc                  55164       52987
> THP collapse alloc                 139         147
> THP splits                          26          21
> NUMA alloc hit                 4169066     4250692
> NUMA alloc miss                      0           0
> 
> Note that there were more DMA32 allocations with the patch applied.  In this
> particular case there was no difference in numa_hit and numa_miss. The
> expectation is that DMA32 was being used at the low watermark instead of
> falling into the slow path. kswapd was not woken but it's not worken for
> THP allocations.
> 
> On 32-bit, this patch defaults to zone-ordering as low memory depletion
> can be a serious problem on 32-bit large memory machines. If the default
> ordering was node then processes on node 0 will deplete the Normal zone
> due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
> set. If combined with large amounts of dirty/writeback pages in Normal
> zone then there is also a high risk of OOM. The heuristics are removed
> as it's not clear they were ever important on 32-bit. They were only
> relevant for setting node-ordering on 64-bit.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-09-09 14:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-01 12:55 [PATCH] mm: page_alloc: Default to node-ordering on 64-bit NUMA machines Mel Gorman
2014-09-02  1:12 ` Kamezawa Hiroyuki
2014-09-02 13:51 ` Johannes Weiner
2014-09-02 14:01   ` Kamezawa Hiroyuki
2014-09-02 15:21   ` Mel Gorman
2014-09-04 15:29     ` Johannes Weiner
2014-09-05 10:30       ` Mel Gorman
2014-09-09 14:46         ` [PATCH] mm: page_alloc: Default node-ordering on 64-bit NUMA, zone-ordering on 32-bit v2 Mel Gorman
2014-09-09 14:52           ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).