linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Jiang Liu <jiang.liu@linux.intel.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"David Rientjes" <rientjes@google.com>,
	"Mike Galbraith" <umgwanakikbuti@gmail.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	"Tang Chen" <tangchen@cn.fujitsu.com>,
	"Tejun Heo" <tj@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"Len Brown" <len.brown@intel.com>, "Pavel Machek" <pavel@ucw.cz>,
	"Borislav Petkov" <bp@alien8.de>,
	"Andy Lutomirski" <luto@amacapital.net>,
	"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Jan H. Schönherr" <jschoenh@amazon.de>,
	"Igor Mammedov" <imammedo@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"Xishi Qiu" <qiuxishi@huawei.com>,
	"Luiz Capitulino" <lcapitulino@redhat.com>,
	"Dave Young" <dyoung@redhat.com>,
	"Tony Luck" <tony.luck@intel.com>,
	linux-mm@kvack.org, linux-hotplug@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
Date: Tue, 18 Aug 2015 09:31:37 +0200	[thread overview]
Message-ID: <20150818073136.GA22311@gmail.com> (raw)
In-Reply-To: <1439781546-7217-10-git-send-email-jiang.liu@linux.intel.com>


* Jiang Liu <jiang.liu@linux.intel.com> wrote:

> With current implementation, all CPUs within a NUMA node will be
> assocaited with another NUMA node if the node has no memory installed.

typo.

> 
> For example, on a four-node system, CPUs on node 2 and 3 are associated
> with node 0 when are no memory install on node 2 and 3, which may
> confuse users.
>
> root@bkd01sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 15014 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15686 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> To be worse, the CPU affinity relationship won't get fixed even after
> memory has been added to those nodes. After memory hot-addition to
> node 2, CPUs on node 2 are still associated with node 0. This may cause
> sub-optimal performance.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 14743 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15715 MB
> node 2 cpus:
> node 2 size: 128 MB
> node 2 free: 128 MB
> node distances:
> node   0   1   2
>   0:  10  21  21
>   1:  21  10  21
>   2:  21  21  10
> 
> With support of memoryless node enabled, it will correctly report system
> hardware topology for nodes without memory installed.
> root@bkd01sdp:~# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 15129 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15627 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  21  21
>   1:  21  10  21  21
>   2:  21  21  10  21
>   3:  21  21  21  10
> 
> With memoryless node enabled, CPUs are correctly associated with node 2
> after memory hot-addition to node 2.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 14872 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15641 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 128 MB
> node 2 free: 127 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  21  21
>   1:  21  10  21  21
>   2:  21  21  10  21
>   3:  21  21  21  10
> 
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  arch/x86/Kconfig            |    3 +++
>  arch/x86/kernel/acpi/boot.c |    4 +++-
>  arch/x86/kernel/smpboot.c   |    2 ++
>  arch/x86/mm/numa.c          |   49 +++++++++++++++++++++++++++++++------------
>  4 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..5d7ad70ace0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
>  	def_bool y
>  	depends on NUMA
>  
> +config HAVE_MEMORYLESS_NODES
> +	def_bool NUMA
> +
>  config ARCH_ENABLE_SPLIT_PMD_PTLOCK
>  	def_bool y
>  	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 07930e1d2fe9..3403f1f0f28d 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  		}
>  		set_apicid_to_node(physid, nid);
>  		numa_set_node(cpu, nid);
> +		set_cpu_numa_mem(cpu, local_memory_node(nid));
>  	}
>  #endif
>  }
> @@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>  	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> +	set_cpu_numa_mem(cpu, NUMA_NO_NODE);
>  #endif
>  
> -	per_cpu(x86_cpu_to_apicid, cpu) = -1;
> +	per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
>  	set_cpu_present(cpu, false);
>  	num_processors--;
>  
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b1f3ed9c7a9e..aeec91ac6fd4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
>  	 */
>  	phys_id = read_apic_id();
>  
> +	set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +
>  	/*
>  	 * the boot CPU has finished the init stage and is spinning
>  	 * on callin_map until we finish. We are free to set up this
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 08860bdf5744..f2a4e23bd14d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -22,6 +22,7 @@
>  
>  int __initdata numa_off;
>  nodemask_t numa_nodes_parsed __initdata;
> +static nodemask_t numa_nodes_empty __initdata;
>  
>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>  EXPORT_SYMBOL(node_data);
> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>  			end = max(mi->blk[i].end, end);
>  		}
>  
> -		if (start >= end)
> -			continue;
> -
>  		/*
>  		 * Don't confuse VM with a node that doesn't have the
>  		 * minimum amount of memory:
>  		 */
> -		if (end && (end - start) < NODE_MIN_SIZE)
> -			continue;
> -
> -		alloc_node_data(nid);
> +		if (start < end && (end - start) >= NODE_MIN_SIZE) {
> +			alloc_node_data(nid);
> +		} else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +			alloc_node_data(nid);
> +			node_set(nid, numa_nodes_empty);
> +		}
>  	}
>  
>  	/* Dump memblock with node info and return. */
> @@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>   */
>  static void __init numa_init_array(void)
>  {
> -	int rr, i;
> +	int i, rr = MAX_NUMNODES;
>  
> -	rr = first_node(node_online_map);
>  	for (i = 0; i < nr_cpu_ids; i++) {
> +		/* Search for an onlined node with memory */
> +		do {
> +			if (rr != MAX_NUMNODES)
> +				rr = next_node(rr, node_online_map);
> +			if (rr == MAX_NUMNODES)
> +				rr = first_node(node_online_map);
> +		} while (node_isset(rr, numa_nodes_empty));
> +
>  		numa_set_node(i, rr);
> -		rr = next_node(rr, node_online_map);
> -		if (rr == MAX_NUMNODES)
> -			rr = first_node(node_online_map);
>  	}
>  }
>  
> @@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
>  {
>  	int n, val;
>  	int min_val = INT_MAX;
> -	int best_node = -1;
> +	int best_node = NUMA_NO_NODE;
>  
>  	for_each_online_node(n) {
> +		if (node_isset(n, numa_nodes_empty))
> +			continue;
> +
>  		val = node_distance(node, n);
>  
>  		if (val < min_val) {
> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>  		if (!node_online(node))
>  			node = find_near_online_node(node);
>  		numa_set_node(cpu, node);
> +		if (node_spanned_pages(node))
> +			set_cpu_numa_mem(cpu, node);
> +		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +			node_clear(node, numa_nodes_empty);
> +	}
> +
> +	/* Destroy empty nodes */
> +	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +		int nid;
> +		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +
> +		for_each_node_mask(nid, numa_nodes_empty) {
> +			node_set_offline(nid);
> +			memblock_free(__pa(node_data[nid]), nd_size);
> +			node_data[nid] = NULL;
> +		}
>  	}
>  }

So this patch makes messy code even messier.

I'd like to see the fixes, but this really needs to be done cleaner.

There are several problems:

1) the naming is not clear enough between VM and scheduling nodes and their masks. 
   For example we are mixing uses of 'numa_nodes_empty' (a memory space concept),
   'node_online_map' (a scheduling concept), which makes the code hard to read.

   To add insult to injury, 'numa_nodes_empty' is added with zero comments:

       > +static nodemask_t numa_nodes_empty __initdata;

   To resolve this the names should be clearer I think. Something like 
   numa_nomem_mask or so.

2) the existing code is (unfortunately) confusing to begin with. For example what 
   does find_near_online_node() do? It's not commented.

   init_cpu_to_node() has comments but it's mostly implementational gibberish that 
   does not answer the question of what the function's main, high level purpose 
   is. I'm uneasy about modifying code that is hard to read - it should be 
   improved first.

3)

   So I'm wondering about logic like this:

> +		if (node_spanned_pages(node))
> +			set_cpu_numa_mem(cpu, node);

   So first we link the node in the _numa_mem_ array if the node has memory (?).

> +		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +			node_clear(node, numa_nodes_empty);

   But we unconditionally clear it in the numa_nodes_empty - i.e. it has memory?
   Shouldn't the node_clear() be inside the 'has memory' condition?

4)

    Bits like this are confusing:

> +	/* Destroy empty nodes */
> +	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +		int nid;
> +		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);

    Why do we 'destroy' them? What does 'destroy' mean here?

So I think this series should first make the whole code readable and 
understandable - then fix the bugs as gradually as possible: one bug one patch.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2015-08-18  7:31 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
2015-08-17  3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
2015-08-17  3:18 ` [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node() Jiang Liu
2015-08-18  0:31   ` David Rientjes
2015-08-19  7:18     ` Jiang Liu
2015-08-20  0:00       ` David Rientjes
2015-10-09  2:35         ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
2015-08-18  0:25   ` David Rientjes
2015-08-19  8:20     ` Jiang Liu
2015-08-20  0:02       ` David Rientjes
2015-08-20  6:36         ` Jiang Liu
2015-10-09  5:04           ` Jiang Liu
2015-08-19 11:52   ` Robin Holt
2015-08-19 12:45     ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 4/9] openvswitch: " Jiang Liu
2015-08-18  0:14   ` Pravin Shelar
2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
2015-08-18  0:35   ` David Rientjes
2015-08-19 22:38   ` [Intel-wired-lan] " Patil, Kiran
2015-08-20  0:18     ` David Rientjes
2015-10-08 20:20       ` Andrew Morton
2015-10-09  5:52         ` Jiang Liu
2015-10-09  9:08           ` Kamezawa Hiroyuki
2015-10-09  9:25             ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 6/9] i40evf: " Jiang Liu
2015-08-17 19:03   ` [Intel-wired-lan] " Patil, Kiran
2015-08-18 21:34     ` Jeff Kirsher
2015-08-17  3:19 ` [Patch V3 7/9] x86, numa: Kill useless code to improve code readability Jiang Liu
2015-08-17  3:19 ` [Patch V3 8/9] mm: Update _mem_id_[] for every possible CPU when memory configuration changes Jiang Liu
2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
2015-08-18  6:11   ` Tang Chen
2015-08-18  6:59     ` Jiang Liu
2015-08-18 11:28       ` Tang Chen
2015-08-18  7:31   ` Ingo Molnar [this message]
2015-08-17 21:35 ` [Patch V3 0/9] Enable memoryless node support for x86 Andrew Morton
2015-08-18 10:02 ` Tang Chen
2015-08-19  8:09   ` Jiang Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150818073136.GA22311@gmail.com \
    --to=mingo@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dyoung@redhat.com \
    --cc=hpa@zytor.com \
    --cc=imammedo@redhat.com \
    --cc=jiang.liu@linux.intel.com \
    --cc=jschoenh@amazon.de \
    --cc=lcapitulino@redhat.com \
    --cc=len.brown@intel.com \
    --cc=linux-hotplug@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=pavel@ucw.cz \
    --cc=peterz@infradead.org \
    --cc=qiuxishi@huawei.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=rientjes@google.com \
    --cc=rjw@rjwysocki.net \
    --cc=tangchen@cn.fujitsu.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=tony.luck@intel.com \
    --cc=umgwanakikbuti@gmail.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).