* [GIT PULL tip:x86/mm]
@ 2011-02-24 14:51 Tejun Heo
  2011-02-24 14:52 ` [GIT PULL tip:x86/mm] bootmem,x86: cleanup changes Tejun Heo
  2011-02-24 19:08 ` [GIT PULL tip:x86/mm] Yinghai Lu
  0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2011-02-24 14:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: yinghai, tglx, hpa, linux-kernel

Ingo, please pull from the following git branch to receive four
commits from Yinghai.  HEAD is d1b19426b0 (x86: Rename e820_table_* to
pgt_buf_*).

The first three separate nobootmem code into mm/nobootmem.c and the
last one renames e820_table_* variables to pgt_buf_*.  All four
patches are cleanups and shouldn't cause any behavior difference.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm

As usual, if HEAD doesn't appear, please pull from master.

  ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm

Thanks.

Yinghai Lu (4):
      bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
      bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
      bootmem: Move __alloc_memory_core_early() to nobootmem.c
      x86: Rename e820_table_* to pgt_buf_*

 arch/x86/include/asm/init.h |    6 +-
 arch/x86/mm/init.c          |   20 +-
 arch/x86/mm/init_32.c       |    8 +-
 arch/x86/mm/init_64.c       |    4 +-
 arch/x86/xen/mmu.c          |    2 +-
 include/linux/mm.h          |    2 -
 mm/Makefile                 |    8 +-
 mm/bootmem.c                |  180 +-----------------
 mm/nobootmem.c              |  435 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |   37 ----
 10 files changed, 472 insertions(+), 230 deletions(-)
 create mode 100644 mm/nobootmem.c

-- 
tejun


* [GIT PULL tip:x86/mm] bootmem,x86: cleanup changes
  2011-02-24 14:51 [GIT PULL tip:x86/mm] Tejun Heo
@ 2011-02-24 14:52 ` Tejun Heo
  2011-02-24 19:08 ` [GIT PULL tip:x86/mm] Yinghai Lu
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-02-24 14:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: yinghai, tglx, hpa, linux-kernel

Sorry, forgot the subject.

-- 
tejun


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 14:51 [GIT PULL tip:x86/mm] Tejun Heo
  2011-02-24 14:52 ` [GIT PULL tip:x86/mm] bootmem,x86: cleanup changes Tejun Heo
@ 2011-02-24 19:08 ` Yinghai Lu
  2011-02-24 19:23   ` Ingo Molnar
  1 sibling, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-02-24 19:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ingo Molnar, tglx, hpa, linux-kernel

On 02/24/2011 06:51 AM, Tejun Heo wrote:
> Ingo, please pull from the following git branch to receive four
> commits from Yinghai.  HEAD is d1b19426b0 (x86: Rename e820_table_* to
> pgt_buf_*).
> 
> The first three separate nobootmem code into mm/nobootmem.c and the
> last one renames e820_table_* variables to pgt_buf_*.  All four
> patches are cleanups and shouldn't cause any behavior difference.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> 
> As usual, if HEAD doesn't appear, please pull from master.
> 
>   ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> 
> Thanks.
> 
> Yinghai Lu (4):
>       bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
>       bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
>       bootmem: Move __alloc_memory_core_early() to nobootmem.c
>       x86: Rename e820_table_* to pgt_buf_*
> 
>  arch/x86/include/asm/init.h |    6 +-
>  arch/x86/mm/init.c          |   20 +-
>  arch/x86/mm/init_32.c       |    8 +-
>  arch/x86/mm/init_64.c       |    4 +-
>  arch/x86/xen/mmu.c          |    2 +-
>  include/linux/mm.h          |    2 -
>  mm/Makefile                 |    8 +-
>  mm/bootmem.c                |  180 +-----------------
>  mm/nobootmem.c              |  435 +++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c             |   37 ----
>  10 files changed, 472 insertions(+), 230 deletions(-)
>  create mode 100644 mm/nobootmem.c
> 

Better to put the first three into a separate branch, since they are
core code - something like tip/mm.

That way they will not pollute tip/x86/mm and can be pushed separately.

Thanks

Yinghai


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 19:08 ` [GIT PULL tip:x86/mm] Yinghai Lu
@ 2011-02-24 19:23   ` Ingo Molnar
  2011-02-24 19:28     ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2011-02-24 19:23 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Tejun Heo, tglx, hpa, linux-kernel


* Yinghai Lu <yinghai@kernel.org> wrote:

> On 02/24/2011 06:51 AM, Tejun Heo wrote:
> > Ingo, please pull from the following git branch to receive four
> > commits from Yinghai.  HEAD is d1b19426b0 (x86: Rename e820_table_* to
> > pgt_buf_*).
> > 
> > The first three separate nobootmem code into mm/nobootmem.c and the
> > last one renames e820_table_* variables to pgt_buf_*.  All four
> > patches are cleanups and shouldn't cause any behavior difference.
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> > 
> > As usual, if HEAD doesn't appear, please pull from master.
> > 
> >   ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> > 
> > Thanks.
> > 
> > Yinghai Lu (4):
> >       bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
> >       bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
> >       bootmem: Move __alloc_memory_core_early() to nobootmem.c
> >       x86: Rename e820_table_* to pgt_buf_*
> > 
> >  arch/x86/include/asm/init.h |    6 +-
> >  arch/x86/mm/init.c          |   20 +-
> >  arch/x86/mm/init_32.c       |    8 +-
> >  arch/x86/mm/init_64.c       |    4 +-
> >  arch/x86/xen/mmu.c          |    2 +-
> >  include/linux/mm.h          |    2 -
> >  mm/Makefile                 |    8 +-
> >  mm/bootmem.c                |  180 +-----------------
> >  mm/nobootmem.c              |  435 +++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c             |   37 ----
> >  10 files changed, 472 insertions(+), 230 deletions(-)
> >  create mode 100644 mm/nobootmem.c
> > 
> 
> Better to put the first three into a separate branch, since they are
> core code - something like tip/mm.
> 
> That way they will not pollute tip/x86/mm and can be pushed separately.

Well, realistically they will be tested together and will go to Linus under the 
x86/mm label anyway, so there's little reason to keep them separate at this point.

So i've pulled them. Thanks guys!

	Ingo


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 19:23   ` Ingo Molnar
@ 2011-02-24 19:28     ` Yinghai Lu
  2011-02-24 19:32       ` Ingo Molnar
  2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
  0 siblings, 2 replies; 68+ messages in thread
From: Yinghai Lu @ 2011-02-24 19:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, tglx, hpa, linux-kernel, David Rientjes

On 02/24/2011 11:23 AM, Ingo Molnar wrote:
> 
> * Yinghai Lu <yinghai@kernel.org> wrote:
> 
>> On 02/24/2011 06:51 AM, Tejun Heo wrote:
>>> Ingo, please pull from the following git branch to receive four
>>> commits from Yinghai.  HEAD is d1b19426b0 (x86: Rename e820_table_* to
>>> pgt_buf_*).
>>>
>>> The first three separate nobootmem code into mm/nobootmem.c and the
>>> last one renames e820_table_* variables to pgt_buf_*.  All four
>>> patches are cleanups and shouldn't cause any behavior difference.
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
>>>
>>> As usual, if HEAD doesn't appear, please pull from master.
>>>
>>>   ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
>>>
>>> Thanks.
>>>
>>> Yinghai Lu (4):
>>>       bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
>>>       bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
>>>       bootmem: Move __alloc_memory_core_early() to nobootmem.c
>>>       x86: Rename e820_table_* to pgt_buf_*
>>>
>>>  arch/x86/include/asm/init.h |    6 +-
>>>  arch/x86/mm/init.c          |   20 +-
>>>  arch/x86/mm/init_32.c       |    8 +-
>>>  arch/x86/mm/init_64.c       |    4 +-
>>>  arch/x86/xen/mmu.c          |    2 +-
>>>  include/linux/mm.h          |    2 -
>>>  mm/Makefile                 |    8 +-
>>>  mm/bootmem.c                |  180 +-----------------
>>>  mm/nobootmem.c              |  435 +++++++++++++++++++++++++++++++++++++++++++
>>>  mm/page_alloc.c             |   37 ----
>>>  10 files changed, 472 insertions(+), 230 deletions(-)
>>>  create mode 100644 mm/nobootmem.c
>>>
>>
>> Better to put the first three into a separate branch, since they are
>> core code - something like tip/mm.
>>
>> That way they will not pollute tip/x86/mm and can be pushed separately.
> 
> Well, realistically they will be tested together and will go to Linus under the 
> x86/mm label anyway, so there's little reason to keep them separate at this point.
> 
> So i've pulled them. Thanks guys!

DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.

So I wonder whether that would hold you back from pushing the whole of
tip/x86/mm to Linus for .39, or whether it needs to be rebased with
tip/x86/numa-emulation-unify taken out.

Thanks

Yinghai


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 19:28     ` Yinghai Lu
@ 2011-02-24 19:32       ` Ingo Molnar
  2011-02-24 19:46         ` Tejun Heo
  2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
  1 sibling, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2011-02-24 19:32 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Tejun Heo, tglx, hpa, linux-kernel, David Rientjes


* Yinghai Lu <yinghai@kernel.org> wrote:

> On 02/24/2011 11:23 AM, Ingo Molnar wrote:
> > 
> > * Yinghai Lu <yinghai@kernel.org> wrote:
> > 
> >> On 02/24/2011 06:51 AM, Tejun Heo wrote:
> >>> Ingo, please pull from the following git branch to receive four
> >>> commits from Yinghai.  HEAD is d1b19426b0 (x86: Rename e820_table_* to
> >>> pgt_buf_*).
> >>>
> >>> The first three separate nobootmem code into mm/nobootmem.c and the
> >>> last one renames e820_table_* variables to pgt_buf_*.  All four
> >>> patches are cleanups and shouldn't cause any behavior difference.
> >>>
> >>>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> >>>
> >>> As usual, if HEAD doesn't appear, please pull from master.
> >>>
> >>>   ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc.git x86-mm
> >>>
> >>> Thanks.
> >>>
> >>> Yinghai Lu (4):
> >>>       bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
> >>>       bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
> >>>       bootmem: Move __alloc_memory_core_early() to nobootmem.c
> >>>       x86: Rename e820_table_* to pgt_buf_*
> >>>
> >>>  arch/x86/include/asm/init.h |    6 +-
> >>>  arch/x86/mm/init.c          |   20 +-
> >>>  arch/x86/mm/init_32.c       |    8 +-
> >>>  arch/x86/mm/init_64.c       |    4 +-
> >>>  arch/x86/xen/mmu.c          |    2 +-
> >>>  include/linux/mm.h          |    2 -
> >>>  mm/Makefile                 |    8 +-
> >>>  mm/bootmem.c                |  180 +-----------------
> >>>  mm/nobootmem.c              |  435 +++++++++++++++++++++++++++++++++++++++++++
> >>>  mm/page_alloc.c             |   37 ----
> >>>  10 files changed, 472 insertions(+), 230 deletions(-)
> >>>  create mode 100644 mm/nobootmem.c
> >>>
> >>
> >> Better to put the first three into a separate branch, since they are
> >> core code - something like tip/mm.
> >>
> >> That way they will not pollute tip/x86/mm and can be pushed separately.
> > 
> > Well, realistically they will be tested together and will go to Linus under the 
> > x86/mm label anyway, so there's little reason to keep them separate at this point.
> > 
> > So i've pulled them. Thanks guys!
> 
> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.

That regression needs to be fixed. Tejun, do you know about that bug?

Thanks,

	Ingo


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 19:32       ` Ingo Molnar
@ 2011-02-24 19:46         ` Tejun Heo
  2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-02-24 19:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Yinghai Lu, tglx, hpa, linux-kernel, David Rientjes

Hello,

On Thu, Feb 24, 2011 at 8:32 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.
>
> That regression needs to be fixed. Tejun, do you know about that bug?

Nope, David said he was gonna look into what happened but never got
back.  David?

Thanks.

-- 
tejun


* [patch] x86, mm: Fix size of numa_distance array
  2011-02-24 19:46         ` Tejun Heo
@ 2011-02-24 22:46           ` David Rientjes
  2011-02-24 23:30             ` Yinghai Lu
                               ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: David Rientjes @ 2011-02-24 22:46 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar; +Cc: Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Thu, 24 Feb 2011, Tejun Heo wrote:

> >> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.
> >
> > That regression needs to be fixed. Tejun, do you know about that bug?
> 
> Nope, David said he was gonna look into what happened but never got
> back.  David?
> 

I merged x86/mm with Linus' tree; it booted fine without numa=fake but
then panicked with numa=fake=128M (the oops could only be captured by
earlyprintk):

[    0.000000] BUG: unable to handle kernel paging request at ffff88007ff00000
[    0.000000] IP: [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
[    0.000000] PGD 1804063 PUD 7fefd067 PMD 7fefe067 PTE 0
[    0.000000] Oops: 0002 [#1] SMP 
[    0.000000] last sysfs file: 
[    0.000000] CPU 0 
[    0.000000] Modules linked in:
[    0.000000] 
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.38-x86-mm #1
[    0.000000] RIP: 0010:[<ffffffff818ffc15>]  [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
[    0.000000] RSP: 0000:ffffffff81801d28  EFLAGS: 00010006
[    0.000000] RAX: 0000000000000009 RBX: 00000000000001ff RCX: 0000000000000ff8
[    0.000000] RDX: 0000000000000008 RSI: 000000007feff014 RDI: ffffffff8199ed0a
[    0.000000] RBP: ffffffff81801dc8 R08: 0000000000001000 R09: 000000008199ed0a
[    0.000000] R10: 000000007feff004 R11: 000000007fefd000 R12: 00000000000001ff
[    0.000000] R13: ffff88007feff000 R14: ffffffff81801d28 R15: ffffffff819b7ca0
[    0.000000] FS:  0000000000000000(0000) GS:ffffffff818da000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: ffff88007ff00000 CR3: 0000000001803000 CR4: 00000000000000b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
[    0.000000] Stack:
[    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[    0.000000]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff818ffc6d>] numa_set_distance+0x24/0xac
[    0.000000]  [<ffffffff81901581>] numa_emulation+0x236/0x284
[    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
[    0.000000]  [<ffffffff8190020a>] initmem_init+0xe8/0x56c
[    0.000000]  [<ffffffff8104fa43>] ? native_apic_mem_read+0x9/0x13
[    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
[    0.000000]  [<ffffffff8190068e>] ? amd_numa_init+0x0/0x376
[    0.000000]  [<ffffffff818ffa69>] ? dummy_numa_init+0x0/0x66
[    0.000000]  [<ffffffff818f974f>] ? register_lapic_address+0x75/0x85
[    0.000000]  [<ffffffff818f1b86>] setup_arch+0xa29/0xae9
[    0.000000]  [<ffffffff81456552>] ? printk+0x41/0x47
[    0.000000]  [<ffffffff818eda0d>] start_kernel+0x8a/0x386
[    0.000000]  [<ffffffff818ed2a4>] x86_64_start_reservations+0xb4/0xb8
[    0.000000]  [<ffffffff818ed39a>] x86_64_start_kernel+0xf2/0xf9

That's this:

430		numa_distance_cnt = cnt;
431	
432		/* fill with the default distances */
433		for (i = 0; i < cnt; i++)
434			for (j = 0; j < cnt; j++)
435	===>			numa_distance[i * cnt + j] = i == j ?
436					LOCAL_DISTANCE : REMOTE_DISTANCE;
437		printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
438	
439		return 0;

We're overflowing the array and it's easy to see why:

        for_each_node_mask(i, nodes_parsed)
                cnt = i;
        size = ++cnt * sizeof(numa_distance[0]);

cnt is the highest node id parsed, so numa_distance[] must be cnt * cnt.  
The following patch fixes the issue on top of x86/mm.
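
To make the overflow concrete, here is a minimal userspace sketch (just
an illustration, not kernel code) of the old sizing versus what the fill
loop actually writes:

  #include <stdio.h>

  int main(void)
  {
  	int cnt = 512;	/* highest parsed node id 511, plus one */
  	/* old calculation: one entry per node, not per node pair */
  	size_t old_size = cnt * sizeof(unsigned char);
  	/* the fill loop writes an NxN matrix */
  	size_t needed = (size_t)cnt * cnt * sizeof(unsigned char);

  	printf("allocated %zu, loop writes %zu\n", old_size, needed);
  	/* prints: allocated 512, loop writes 262144 */
  	return 0;
  }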

I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
numa_distance (and that's not __initdata).  Before these changes, we 
calculated numa_distance() using pxms without this additional mapping; is
there any way to reduce this?  (Admittedly real NUMA machines with 512 
nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)



x86, mm: Fix size of numa_distance array

numa_distance should be sized like the SLIT, an NxN matrix where N is the
highest node id.  This patch fixes the calculation to avoid overflowing
the array on the subsequent iteration.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 arch/x86/mm/numa_64.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index cccc01d..abf0131 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
 
 	for_each_node_mask(i, nodes_parsed)
 		cnt = i;
-	size = ++cnt * sizeof(numa_distance[0]);
+	size = cnt * cnt * sizeof(numa_distance[0]);
 
 	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
 				      size, PAGE_SIZE);


* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
@ 2011-02-24 23:30             ` Yinghai Lu
  2011-02-24 23:31             ` David Rientjes
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Yinghai Lu @ 2011-02-24 23:30 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tejun Heo, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 02/24/2011 02:46 PM, David Rientjes wrote:
> On Thu, 24 Feb 2011, Tejun Heo wrote:
> 
>>>> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.
>>>
>>> That regression needs to be fixed. Tejun, do you know about that bug?
>>
>> Nope, David said he was gonna look into what happened but never got
>> back.  David?
>>
> 
> I merged x86/mm with Linus' tree; it booted fine without numa=fake but
> then panicked with numa=fake=128M (the oops could only be captured by
> earlyprintk):
> 
> [    0.000000] BUG: unable to handle kernel paging request at ffff88007ff00000
> [    0.000000] IP: [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
> [    0.000000] PGD 1804063 PUD 7fefd067 PMD 7fefe067 PTE 0
> [    0.000000] Oops: 0002 [#1] SMP 
> [    0.000000] last sysfs file: 
> [    0.000000] CPU 0 
> [    0.000000] Modules linked in:
> [    0.000000] 
> [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.38-x86-mm #1
> [    0.000000] RIP: 0010:[<ffffffff818ffc15>]  [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
> [    0.000000] RSP: 0000:ffffffff81801d28  EFLAGS: 00010006
> [    0.000000] RAX: 0000000000000009 RBX: 00000000000001ff RCX: 0000000000000ff8
> [    0.000000] RDX: 0000000000000008 RSI: 000000007feff014 RDI: ffffffff8199ed0a
> [    0.000000] RBP: ffffffff81801dc8 R08: 0000000000001000 R09: 000000008199ed0a
> [    0.000000] R10: 000000007feff004 R11: 000000007fefd000 R12: 00000000000001ff
> [    0.000000] R13: ffff88007feff000 R14: ffffffff81801d28 R15: ffffffff819b7ca0
> [    0.000000] FS:  0000000000000000(0000) GS:ffffffff818da000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: ffff88007ff00000 CR3: 0000000001803000 CR4: 00000000000000b0
> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
> [    0.000000] Stack:
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [    0.000000]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffff818ffc6d>] numa_set_distance+0x24/0xac
> [    0.000000]  [<ffffffff81901581>] numa_emulation+0x236/0x284
> [    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [<ffffffff8190020a>] initmem_init+0xe8/0x56c
> [    0.000000]  [<ffffffff8104fa43>] ? native_apic_mem_read+0x9/0x13
> [    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [<ffffffff8190068e>] ? amd_numa_init+0x0/0x376
> [    0.000000]  [<ffffffff818ffa69>] ? dummy_numa_init+0x0/0x66
> [    0.000000]  [<ffffffff818f974f>] ? register_lapic_address+0x75/0x85
> [    0.000000]  [<ffffffff818f1b86>] setup_arch+0xa29/0xae9
> [    0.000000]  [<ffffffff81456552>] ? printk+0x41/0x47
> [    0.000000]  [<ffffffff818eda0d>] start_kernel+0x8a/0x386
> [    0.000000]  [<ffffffff818ed2a4>] x86_64_start_reservations+0xb4/0xb8
> [    0.000000]  [<ffffffff818ed39a>] x86_64_start_kernel+0xf2/0xf9
> 
> That's this:
> 
> 430		numa_distance_cnt = cnt;
> 431	
> 432		/* fill with the default distances */
> 433		for (i = 0; i < cnt; i++)
> 434			for (j = 0; j < cnt; j++)
> 435	===>			numa_distance[i * cnt + j] = i == j ?
> 436					LOCAL_DISTANCE : REMOTE_DISTANCE;
> 437		printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
> 438	
> 439		return 0;
> 
> We're overflowing the array and it's easy to see why:
> 
>         for_each_node_mask(i, nodes_parsed)
>                 cnt = i;
>         size = ++cnt * sizeof(numa_distance[0]);
> 
> cnt is the highest node id parsed, so numa_distance[] must be cnt * cnt.  
> The following patch fixes the issue on top of x86/mm.
> 
> I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
> numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
> numa_distance (and that's not __initdata).  Before these changes, we 
> calculated numa_distance() using pxms without this additional mapping; is
> there any way to reduce this?  (Admittedly real NUMA machines with 512 
> nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
> 
> 
> 
> x86, mm: Fix size of numa_distance array
> 
> numa_distance should be sized like the SLIT, an NxN matrix where N is the
> highest node id.  This patch fixes the calculation to avoid overflowing
> the array on the subsequent iteration.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  arch/x86/mm/numa_64.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index cccc01d..abf0131 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
>  
>  	for_each_node_mask(i, nodes_parsed)
>  		cnt = i;
> -	size = ++cnt * sizeof(numa_distance[0]);
> +	size = cnt * cnt * sizeof(numa_distance[0]);
should be:

+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);


>  
>  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
>  				      size, PAGE_SIZE);



* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
  2011-02-24 23:30             ` Yinghai Lu
@ 2011-02-24 23:31             ` David Rientjes
  2011-02-25  9:05               ` Tejun Heo
  2011-02-25  9:03             ` Tejun Heo
  2011-02-25  9:11             ` [PATCH x86-mm] x86-64, NUMA: " Tejun Heo
  3 siblings, 1 reply; 68+ messages in thread
From: David Rientjes @ 2011-02-24 23:31 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar; +Cc: Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Thu, 24 Feb 2011, David Rientjes wrote:

> numa_distance should be sized like the SLIT, an NxN matrix where N is the
> highest node id.  This patch fixes the calculation to avoid overflowing
> the array on the subsequent iteration.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  arch/x86/mm/numa_64.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index cccc01d..abf0131 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
>  
>  	for_each_node_mask(i, nodes_parsed)
>  		cnt = i;
> -	size = ++cnt * sizeof(numa_distance[0]);
> +	size = cnt * cnt * sizeof(numa_distance[0]);
>  
>  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
>  				      size, PAGE_SIZE);
> 

This also looks like it needs the following to not erroneously consider a 
node id to be out of bounds.  Why were we oversizing cnt in the old code 
above by 1?

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -454,7 +454,7 @@ void __init numa_set_distance(int from, int to, int distance)
 	if (!numa_distance && numa_alloc_distance() < 0)
 		return;
 
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt) {
+	if (from > numa_distance_cnt || to > numa_distance_cnt) {
 		printk_once(KERN_DEBUG "NUMA: Debug: distance out of bound, from=%d to=%d distance=%d\n",
 			    from, to, distance);
 		return;
@@ -472,7 +472,7 @@ void __init numa_set_distance(int from, int to, int distance)
 
 int __node_distance(int from, int to)
 {
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
+	if (from > numa_distance_cnt || to > numa_distance_cnt)
 		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
 	return numa_distance[from * numa_distance_cnt + to];
 }


* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
  2011-02-24 23:30             ` Yinghai Lu
  2011-02-24 23:31             ` David Rientjes
@ 2011-02-25  9:03             ` Tejun Heo
  2011-02-25 10:58               ` Tejun Heo
  2011-02-25  9:11             ` [PATCH x86-mm] x86-64, NUMA: " Tejun Heo
  3 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-02-25  9:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

Hello,

On Thu, Feb 24, 2011 at 02:46:38PM -0800, David Rientjes wrote:
> That's this:
> 
> 430		numa_distance_cnt = cnt;
> 431	
> 432		/* fill with the default distances */
> 433		for (i = 0; i < cnt; i++)
> 434			for (j = 0; j < cnt; j++)
> 435	===>			numa_distance[i * cnt + j] = i == j ?
> 436					LOCAL_DISTANCE : REMOTE_DISTANCE;
> 437		printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
> 438	
> 439		return 0;
> 
> We're overflowing the array and it's easy to see why:
> 
>         for_each_node_mask(i, nodes_parsed)
>                 cnt = i;
>         size = ++cnt * sizeof(numa_distance[0]);
> 
> cnt is the highest node id parsed, so numa_distance[] must be cnt * cnt.  
> The following patch fixes the issue on top of x86/mm.

Oops, that was stupid.

> I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
> numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
> numa_distance (and that's not __initdata).  Before these changes, we 
> calculated numa_distance() using pxms without this additional mapping; is
> there any way to reduce this?  (Admittedly real NUMA machines with 512 
> nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)

We can leave the physical distance table unmodified and map through
emu_nid_to_phys[] while dereferencing.  It just seemed simpler this
way.  Does it actually matter?  Anyways, I'll give it a shot.  Do you
guys actually use 512 nodes?
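
Something like the following minimal userspace sketch of that
lookup-time translation (illustrative only - the toy SLIT and the
emulated-to-physical mapping below are made up):

  #include <stdio.h>

  #define LOCAL_DISTANCE	10
  #define REMOTE_DISTANCE	20

  /* 8 emulated nodes interleaved over 4 physical nodes */
  static int emu_nid_to_phys[8] = { 0, 1, 2, 3, 0, 1, 2, 3 };
  static int numa_distance_cnt = 4;
  static unsigned char numa_distance[16] = {
  	10, 20, 20, 30,
  	20, 10, 20, 20,
  	20, 20, 10, 20,
  	30, 20, 20, 10,
  };

  /* keep the physical table untouched, translate nids at lookup time */
  static int node_distance(int from, int to)
  {
  	from = emu_nid_to_phys[from];
  	to = emu_nid_to_phys[to];
  	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
  		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
  	return numa_distance[from * numa_distance_cnt + to];
  }

  int main(void)
  {
  	/* emulated nodes 0 and 4 share physical node 0, so local */
  	printf("%d %d\n", node_distance(0, 4), node_distance(0, 3));
  	/* prints: 10 30 */
  	return 0;
  }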

> x86, mm: Fix size of numa_distance array
> 
> numa_distance should be sized like the SLIT, an NxN matrix where N is the
> highest node id.  This patch fixes the calculation to avoid overflowing
> the array on the subsequent iteration.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  arch/x86/mm/numa_64.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index cccc01d..abf0131 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
>  
>  	for_each_node_mask(i, nodes_parsed)
>  		cnt = i;
> -	size = ++cnt * sizeof(numa_distance[0]);
> +	size = cnt * cnt * sizeof(numa_distance[0]);

It should be cnt++; cnt * cnt; as Yinghai wrote.

>  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
>  				      size, PAGE_SIZE);

Thanks.

-- 
tejun


* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-24 23:31             ` David Rientjes
@ 2011-02-25  9:05               ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-02-25  9:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

Hello,

On Thu, Feb 24, 2011 at 03:31:24PM -0800, David Rientjes wrote:
> > -	size = ++cnt * sizeof(numa_distance[0]);
> > +	size = cnt * cnt * sizeof(numa_distance[0]);
> >  
> >  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
> >  				      size, PAGE_SIZE);
> > 
> 
> This also looks like it needs the following to not erroneously consider a 
> node id to be out of bounds.  Why were we oversizing cnt in the old code 
> above by 1?

Umm... because @cnt should be the length, not the last index?

> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -454,7 +454,7 @@ void __init numa_set_distance(int from, int to, int distance)
>  	if (!numa_distance && numa_alloc_distance() < 0)
>  		return;
>  
> -	if (from >= numa_distance_cnt || to >= numa_distance_cnt) {
> +	if (from > numa_distance_cnt || to > numa_distance_cnt) {
>  		printk_once(KERN_DEBUG "NUMA: Debug: distance out of bound, from=%d to=%d distance=%d\n",
>  			    from, to, distance);
>  		return;
> @@ -472,7 +472,7 @@ void __init numa_set_distance(int from, int to, int distance)
>  
>  int __node_distance(int from, int to)
>  {
> -	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
> +	if (from > numa_distance_cnt || to > numa_distance_cnt)

Again, numa_distance_cnt is the number of elements in one dimension of
the table, not the last valid index, while @from and @to are 0-based
indices.  With numa_distance_cnt == 4, the valid indices are 0..3, so
the original `from >= numa_distance_cnt' check is correct.

Thanks.

-- 
tejun


* [PATCH x86-mm] x86-64, NUMA: Fix size of numa_distance array
  2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
                               ` (2 preceding siblings ...)
  2011-02-25  9:03             ` Tejun Heo
@ 2011-02-25  9:11             ` Tejun Heo
  3 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-02-25  9:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

From 1f565a896ee139a70e1a16f74a4ec29707691b0b Mon Sep 17 00:00:00 2001
From: David Rientjes <rientjes@google.com>
Date: Fri, 25 Feb 2011 10:06:39 +0100

numa_distance should be sized like the SLIT, an NxN matrix where N is
the highest node id + 1.  This patch fixes the calculation to avoid
overflowing the array on the subsequent iteration.

-tj: The original patch used the last index to calculate the size.
     Yinghai pointed out that it should be incremented to the number
     of elements before calculating the size of the table.  Updated
     accordingly.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Applied with cnt++ added as Yinghai pointed out.  Thanks.

 arch/x86/mm/numa_64.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index cccc01d..7757d22 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -414,7 +414,8 @@ static int __init numa_alloc_distance(void)
 
 	for_each_node_mask(i, nodes_parsed)
 		cnt = i;
-	size = ++cnt * sizeof(numa_distance[0]);
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
 
 	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
 				      size, PAGE_SIZE);
-- 
1.7.1



* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-25  9:03             ` Tejun Heo
@ 2011-02-25 10:58               ` Tejun Heo
  2011-02-25 11:05                 ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-02-25 10:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Fri, Feb 25, 2011 at 10:03:01AM +0100, Tejun Heo wrote:
> > I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
> > numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
> > numa_distance (and that's not __initdata).  Before these changes, we 
> > calculated numa_distance() using pxms without this additional mapping; is
> > there any way to reduce this?  (Admittedly real NUMA machines with 512 
> > nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
> 
> We can leave the physical distance table unmodified and map through
> emu_nid_to_phys[] while dereferencing.  It just seemed simpler this
> way.  Does it actually matter?  Anyways, I'll give it a shot.  Do you
> guys actually use 512 nodes?

So, the patch looks like the following and it even reduces LOC but I'm
not sure whether I want to apply this.  Previously, once the emulation
step was complete, the rest of the system wouldn't care whether nodes
are being emulated or not.  After this change, although it's still
contained in numa_64.c, we end up with some configurations remapped
and some still using physical nodes.  Unless someone tells me that
2MiB is frigging precious on machines with 512 emulated nodes, I don't
think I'm gonna apply this one.

Thanks.

Index: work/arch/x86/mm/numa_internal.h
===================================================================
--- work.orig/arch/x86/mm/numa_internal.h
+++ work/arch/x86/mm/numa_internal.h
@@ -17,15 +17,24 @@ struct numa_meminfo {
 
 void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
-void __init numa_reset_distance(void);
 
 #ifdef CONFIG_NUMA_EMU
-void __init numa_emulation(struct numa_meminfo *numa_meminfo,
-			   int numa_dist_cnt);
+extern int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
+
+void __init numa_emulation(struct numa_meminfo *numa_meminfo);
+
+static inline int numa_nid_to_phys(int nid)
+{
+	return nid == NUMA_NO_NODE ? nid : emu_nid_to_phys[nid];
+}
 #else
 static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
 				  int numa_dist_cnt)
 { }
+static inline int numa_nid_to_phys(int nid)
+{
+	return nid;
+}
 #endif
 
 #endif	/* __X86_MM_NUMA_INTERNAL_H */
Index: work/arch/x86/mm/numa_emulation.c
===================================================================
--- work.orig/arch/x86/mm/numa_emulation.c
+++ work/arch/x86/mm/numa_emulation.c
@@ -9,7 +9,8 @@
 
 #include "numa_internal.h"
 
-static int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
+int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
+
 static char *emu_cmdline __initdata;
 
 void __init numa_emu_cmdline(char *str)
@@ -270,12 +271,10 @@ static int __init split_nodes_size_inter
 /**
  * numa_emulation - Emulate NUMA nodes
  * @numa_meminfo: NUMA configuration to massage
- * @numa_dist_cnt: The size of the physical NUMA distance table
  *
  * Emulate NUMA nodes according to the numa=fake kernel parameter.
  * @numa_meminfo contains the physical memory configuration and is modified
- * to reflect the emulated configuration on success.  @numa_dist_cnt is
- * used to determine the size of the physical distance table.
+ * to reflect the emulated configuration on success.
  *
  * On success, the following modifications are made.
  *
@@ -284,22 +283,18 @@ static int __init split_nodes_size_inter
  * - __apicid_to_node[] is updated such that APIC IDs are mapped to the
  *   emulated nodes.
  *
- * - NUMA distance table is rebuilt to represent distances between emulated
- *   nodes.  The distances are determined considering how emulated nodes
- *   are mapped to physical nodes and match the actual distances.
- *
  * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical
- *   nodes.  This is used by numa_add_cpu() and numa_remove_cpu().
+ *   nodes.  This is used by numa_add_cpu(), numa_remove_cpu() and to index
+ *   numa_distance table.
  *
  * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with
  * identity mapping and no other modification is made.
  */
-void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
+void __init numa_emulation(struct numa_meminfo *numa_meminfo)
 {
 	static struct numa_meminfo ei __initdata;
 	static struct numa_meminfo pi __initdata;
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
-	u8 *phys_dist = NULL;
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -336,29 +331,6 @@ void __init numa_emulation(struct numa_m
 		goto no_emu;
 	}
 
-	/*
-	 * Copy the original distance table.  It's temporary so no need to
-	 * reserve it.
-	 */
-	if (numa_dist_cnt) {
-		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
-		u64 phys;
-
-		phys = memblock_find_in_range(0,
-					      (u64)max_pfn_mapped << PAGE_SHIFT,
-					      size, PAGE_SIZE);
-		if (phys == MEMBLOCK_ERROR) {
-			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
-			goto no_emu;
-		}
-		phys_dist = __va(phys);
-
-		for (i = 0; i < numa_dist_cnt; i++)
-			for (j = 0; j < numa_dist_cnt; j++)
-				phys_dist[i * numa_dist_cnt + j] =
-					node_distance(i, j);
-	}
-
 	/* commit */
 	*numa_meminfo = ei;
 
@@ -381,23 +353,6 @@ void __init numa_emulation(struct numa_m
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
 			emu_nid_to_phys[i] = 0;
 
-	/* transform distance table */
-	numa_reset_distance();
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		for (j = 0; j < MAX_NUMNODES; j++) {
-			int physi = emu_nid_to_phys[i];
-			int physj = emu_nid_to_phys[j];
-			int dist;
-
-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
-				dist = physi == physj ?
-					LOCAL_DISTANCE : REMOTE_DISTANCE;
-			else
-				dist = phys_dist[physi * numa_dist_cnt + physj];
-
-			numa_set_distance(i, j, dist);
-		}
-	}
 	return;
 
 no_emu:
Index: work/arch/x86/mm/numa_64.c
===================================================================
--- work.orig/arch/x86/mm/numa_64.c
+++ work/arch/x86/mm/numa_64.c
@@ -35,6 +35,15 @@ static unsigned long __initdata nodemap_
 
 static struct numa_meminfo numa_meminfo __initdata;
 
+/*
+ * Table recording distances between nodes.  The distance from node A to
+ * node B, where both A and B are less than numa_distance_cnt, is stored at
+ * numa_distance[A * numa_distance_cnt + B].
+ *
+ * Note that numa_distance table is always indexed by the physical node IDs
+ * even when NUMA emulation is enabled to simplify implementation and avoid
+ * creating large distance table when there are a lot of emulated nodes.
+ */
 static int numa_distance_cnt;
 static u8 *numa_distance;
 
@@ -388,7 +397,7 @@ static void __init numa_nodemask_from_me
  * The current table is freed.  The next numa_set_distance() call will
  * create a new one.
  */
-void __init numa_reset_distance(void)
+static void __init numa_reset_distance(void)
 {
 	size_t size;
 
@@ -471,10 +480,32 @@ void __init numa_set_distance(int from,
 	numa_distance[from * numa_distance_cnt + to] = distance;
 }
 
+/**
+ * __node_distance - Determine NUMA distance from one node to another
+ * @from: the 'from' node
+ * @to: the 'to' node
+ *
+ * Determine the distance from @from to @to.
+ *
+ * RETURNS:
+ * NUMA distance.
+ */
 int __node_distance(int from, int to)
 {
+	/*
+	 * First, convert the nids to physical as numa_distance is always
+	 * indexed with physical nids.
+	 */
+	from = numa_nid_to_phys(from);
+	to = numa_nid_to_phys(to);
+
+	/*
+	 * If either one is beyond the table dimension, just compare the
+	 * physical nids directly and claim LOCAL if they're the same.
+	 */
 	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
 		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+
 	return numa_distance[from * numa_distance_cnt + to];
 }
 EXPORT_SYMBOL(__node_distance);
@@ -604,7 +635,7 @@ void __init initmem_init(void)
 		if (numa_cleanup_meminfo(&numa_meminfo) < 0)
 			continue;
 
-		numa_emulation(&numa_meminfo, numa_distance_cnt);
+		numa_emulation(&numa_meminfo);
 
 		if (numa_register_memblks(&numa_meminfo) < 0)
 			continue;


* Re: [patch] x86, mm: Fix size of numa_distance array
  2011-02-25 10:58               ` Tejun Heo
@ 2011-02-25 11:05                 ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-02-25 11:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Fri, Feb 25, 2011 at 11:58:46AM +0100, Tejun Heo wrote:
> On Fri, Feb 25, 2011 at 10:03:01AM +0100, Tejun Heo wrote:
> > > I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
> > > numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
> > > numa_distance (and that's not __initdata).  Before these changes, we 
> > > calculated numa_distance() using pxms without this additional mapping; is
> > > there any way to reduce this?  (Admittedly real NUMA machines with 512 
> > > nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
> > 
> > We can leave the physical distance table unmodified and map through
> > emu_nid_to_phys[] while dereferencing.  It just seemed simpler this
> > way.  Does it actually matter?  Anyways, I'll give it a shot.  Do you
> > guys actually use 512 nodes?
> 
> So, the patch looks like the following and it even reduces LOC but I'm
> not sure whether I want to apply this.  Previously, once the emulation
> step was complete, the rest of the system wouldn't care whether nodes
> are being emulated or not.  After this change, although it's still
> contained in numa_64.c, we end up with some configurations remapped
> and some still using physical nodes.  Unless someone tells me that
> 2MiB is frigging precious on machines with 512 emulated nodes, I don't
> think I'm gonna apply this one.

Also, the calculation isn't quite right.  If you have 512 nodes,
that's 2^9 * 2^9 entries and, with one byte per entry, 2^18 == 256KiB.
With 1024 nodes, it becomes 1MiB.  I suggest just swallowing it.  I
really want to avoid the emulated/physical distinction spilling out of
the emulation code proper.
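
A quick userspace check of those numbers, with one byte per entry in an
N x N table:

  #include <stdio.h>

  int main(void)
  {
  	for (unsigned long n = 512; n <= 1024; n *= 2)
  		printf("%4lu nodes: %4lu KiB\n", n, n * n / 1024);
  	return 0;
  }

which prints 256 KiB for 512 nodes and 1024 KiB (1 MiB) for 1024.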

Thanks.

-- 
tejun


* Re: [GIT PULL tip:x86/mm]
  2011-02-24 19:28     ` Yinghai Lu
  2011-02-24 19:32       ` Ingo Molnar
@ 2011-03-01 17:18       ` David Rientjes
  2011-03-01 18:25         ` Tejun Heo
                           ` (2 more replies)
  1 sibling, 3 replies; 68+ messages in thread
From: David Rientjes @ 2011-03-01 17:18 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Ingo Molnar, Tejun Heo, tglx, H. Peter Anvin, linux-kernel

On Thu, 24 Feb 2011, Yinghai Lu wrote:

> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.
> 
> So I wonder whether that would hold you back from pushing the whole of
> tip/x86/mm to Linus for .39, or whether it needs to be rebased with
> tip/x86/numa-emulation-unify taken out.
> 

Ok, so 1f565a896ee1 (x86-64, NUMA: Fix size of numa_distance array) fixes 
the boot failure when using numa=fake, but there's still another issue 
that was introduced with regard to emulated distances between fake nodes 
sitting on hardware using a SLIT.

This is important because we want to ensure that the physical topology of
the machine is still represented in an emulated environment to 
appropriately describe the expected latencies between the nodes.  It also 
allows users who are using numa=fake purely as a debugging tool to test 
more interesting configurations and benchmark memory accesses between 
emulated nodes as though they were real.

For example, on my four-node system with a custom SLIT, this is the 
distance when booting without numa=fake:

	$ cat /sys/devices/system/node/node*/distance 
	10 20 20 30
	20 10 20 20
	20 20 10 20
	30 20 20 10

These physical nodes are all symmetric in size.

With numa=fake=16, we expect to see the fake nodes interleaved (as the 
default) over the set of physical nodes.  This would suggest distance 
files for these nodes to be:

	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20

(And that is what we see with 2.6.37.)
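
The expected table follows mechanically from the interleave: every
emulated pair's distance is the SLIT entry for the underlying physical
pair.  A minimal userspace sketch (illustrative only - it assumes a
plain round-robin emulated-to-physical assignment, while the real
interleave order depends on the memory layout):

  #include <stdio.h>

  static unsigned char slit[4][4] = {
  	{ 10, 20, 20, 30 },
  	{ 20, 10, 20, 20 },
  	{ 20, 20, 10, 20 },
  	{ 30, 20, 20, 10 },
  };

  int main(void)
  {
  	for (int i = 0; i < 16; i++) {	/* emulated nid i -> phys i % 4 */
  		for (int j = 0; j < 16; j++)
  			printf("%d ", slit[i % 4][j % 4]);
  		printf("\n");
  	}
  	return 0;
  }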

However, x86/mm describes these distances differently:

	node0/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
	node1/distance:10 10 20 20 10 20 20 20 10 20 20 20 10 20 20 20
	node2/distance:10 20 10 20 10 20 20 20 10 20 20 20 10 20 20 20
	node3/distance:10 20 20 10 10 20 20 20 10 20 20 20 10 20 20 20
	node4/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
	node5/distance:10 20 20 20 10 10 20 20 10 20 20 20 10 20 20 20
	node6/distance:10 20 20 20 10 20 10 20 10 20 20 20 10 20 20 20
	node7/distance:10 20 20 20 10 20 20 10 10 20 20 20 10 20 20 20
	node8/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
	node9/distance:10 20 20 20 10 20 20 20 10 10 20 20 10 20 20 20
	node10/distance:10 20 20 20 10 20 20 20 10 20 10 20 10 20 20 20
	node11/distance:10 20 20 20 10 20 20 20 10 20 20 10 10 20 20 20
	node12/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
	node13/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 10 20 20
	node14/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 10 20
	node15/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 10

It looks as though the emulation changes sitting in x86/mm have dropped 
the SLIT and are merely describing the emulated nodes as either having 
physical affinity or not.


* Re: [GIT PULL tip:x86/mm]
  2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
@ 2011-03-01 18:25         ` Tejun Heo
  2011-03-01 22:19         ` Yinghai Lu
  2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
  2 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-01 18:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hey, David.

On Tue, Mar 01, 2011 at 09:18:45AM -0800, David Rientjes wrote:
> This is important because we want to ensure that the physical topology of
> the machine is still represented in an emulated environment to 
> appropriately describe the expected latencies between the nodes.  It also 
> allows users who are using numa=fake purely as a debugging tool to test 
> more interesting configurations and benchmark memory accesses between 
> emulated nodes as though they were real.

Sure, definitely.

> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 
> (And that is what we see with 2.6.37.)

And this should have been the result.  I've actually tested with a fake
original distance table including asymmetric distances.

> However, x86/mm describes these distances differently:
> 
> 	node0/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node1/distance:10 10 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node2/distance:10 20 10 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node3/distance:10 20 20 10 10 20 20 20 10 20 20 20 10 20 20 20
> 	node4/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node5/distance:10 20 20 20 10 10 20 20 10 20 20 20 10 20 20 20
> 	node6/distance:10 20 20 20 10 20 10 20 10 20 20 20 10 20 20 20
> 	node7/distance:10 20 20 20 10 20 20 10 10 20 20 20 10 20 20 20
> 	node8/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node9/distance:10 20 20 20 10 20 20 20 10 10 20 20 10 20 20 20
> 	node10/distance:10 20 20 20 10 20 20 20 10 20 10 20 10 20 20 20
> 	node11/distance:10 20 20 20 10 20 20 20 10 20 20 10 10 20 20 20
> 	node12/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node13/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 10 20 20
> 	node14/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 10 20
> 	node15/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 10
> 
> It looks as though the emulation changes sitting in x86/mm have dropped 
> the SLIT and are merely describing the emulated nodes as either having 
> physical affinity or not.

It looks like I missed something.  I'll look into it first thing
tomorrow.  If you feel like looking into where it's going wrong,
please go ahead.  BTW, how did you insert the custom SLIT table?  If
it's some ACPI trick I can use here, do you care to share?

Thanks.

-- 
tejun


* Re: [GIT PULL tip:x86/mm]
  2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
  2011-03-01 18:25         ` Tejun Heo
@ 2011-03-01 22:19         ` Yinghai Lu
  2011-03-02  9:17           ` Tejun Heo
  2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
  2 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-01 22:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: Ingo Molnar, Tejun Heo, tglx, H. Peter Anvin, linux-kernel

On 03/01/2011 09:18 AM, David Rientjes wrote:
> On Thu, 24 Feb 2011, Yinghai Lu wrote:
> 
>> DavidR reported that x86/mm broke his NUMA emulation with numa=fake=128M etc.
>>
>> So I wonder whether that would hold you back from pushing the whole of
>> tip/x86/mm to Linus for .39, or whether it needs to be rebased with
>> tip/x86/numa-emulation-unify taken out.
>>
> 
> Ok, so 1f565a896ee1 (x86-64, NUMA: Fix size of numa_distance array) fixes 
> the boot failure when using numa=fake, but there's still another issue 
> that was introduced with regard to emulated distances between fake nodes 
> sitting on hardware using a SLIT.
> 
> This is important because we want to ensure that the physical topology of
> the machine is still represented in an emulated environment to 
> appropriately describe the expected latencies between the nodes.  It also 
> allows users who are using numa=fake purely as a debugging tool to test 
> more interesting configurations and benchmark memory accesses between 
> emulated nodes as though they were real.
> 
> For example, on my four-node system with a custom SLIT, this is the 
> distance when booting without numa=fake:
> 
> 	$ cat /sys/devices/system/node/node*/distance 
> 	10 20 20 30
> 	20 10 20 20
> 	20 20 10 20
> 	30 20 20 10
> 
> These physical nodes are all symmetric in size.
> 
> With numa=fake=16, we expect to see the fake nodes interleaved (as the 
> default) over the set of physical nodes.  This would suggest distance 
> files for these nodes to be:
> 
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 
> (And that is what we see with 2.6.37.)
> 
> However, x86/mm describes these distances differently:
> 
> 	node0/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node1/distance:10 10 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node2/distance:10 20 10 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node3/distance:10 20 20 10 10 20 20 20 10 20 20 20 10 20 20 20
> 	node4/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node5/distance:10 20 20 20 10 10 20 20 10 20 20 20 10 20 20 20
> 	node6/distance:10 20 20 20 10 20 10 20 10 20 20 20 10 20 20 20
> 	node7/distance:10 20 20 20 10 20 20 10 10 20 20 20 10 20 20 20
> 	node8/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node9/distance:10 20 20 20 10 20 20 20 10 10 20 20 10 20 20 20
> 	node10/distance:10 20 20 20 10 20 20 20 10 20 10 20 10 20 20 20
> 	node11/distance:10 20 20 20 10 20 20 20 10 20 20 10 10 20 20 20
> 	node12/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node13/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 10 20 20
> 	node14/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 10 20
> 	node15/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 10
> 
> It looks as though the emulation changes sitting in x86/mm have dropped 
> the SLIT and are merely describing the emulated nodes as either having 
> physical affinity or not.

please check:

[PATCH] x86, numa, emu: Fix SLIT being ignored

David reported that after the numa_emu cleanup the SLIT is not honored
anymore.

After looking at the code, it seems the cleanup has several problems:
1. Need to reserve the temp numa dist table.
	We can only use the find_...without_reserve trick when we are done
	with the old one before getting a new one.
2. During copying we should only copy with the NEW numa_dist_cnt size,
	so we need to call numa_alloc_distance first, before copying.
3. phys_dist should be numa_dist_cnt squared in size.
4. numa_reset_distance should free numa_dist_cnt squared entries.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/numa_64.c        |    6 ++---
 arch/x86/mm/numa_emulation.c |   50 ++++++++++++++++++++++++++++++-------------
 arch/x86/mm/numa_internal.h  |    1 
 3 files changed, 40 insertions(+), 17 deletions(-)

Index: linux-2.6/arch/x86/mm/numa_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_64.c
+++ linux-2.6/arch/x86/mm/numa_64.c
@@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
 	size_t size;
 
 	if (numa_distance_cnt) {
-		size = numa_distance_cnt * sizeof(numa_distance[0]);
+		size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
 		numa_distance_cnt = 0;
@@ -401,7 +401,7 @@ void __init numa_reset_distance(void)
 	numa_distance = NULL;
 }
 
-static int __init numa_alloc_distance(void)
+int __init numa_alloc_distance(void)
 {
 	nodemask_t nodes_parsed;
 	size_t size;
@@ -437,7 +437,7 @@ static int __init numa_alloc_distance(vo
 				LOCAL_DISTANCE : REMOTE_DISTANCE;
 	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
 
-	return 0;
+	return cnt;
 }
 
 /**
Index: linux-2.6/arch/x86/mm/numa_emulation.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_emulation.c
+++ linux-2.6/arch/x86/mm/numa_emulation.c
@@ -300,7 +300,9 @@ void __init numa_emulation(struct numa_m
 	static struct numa_meminfo pi __initdata;
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
+	int phys_size = 0;
 	int i, j, ret;
+	int new_nr;
 
 	if (!emu_cmdline)
 		goto no_emu;
@@ -341,16 +343,17 @@ void __init numa_emulation(struct numa_m
 	 * reserve it.
 	 */
 	if (numa_dist_cnt) {
-		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
 		u64 phys;
 
+		phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
 		phys = memblock_find_in_range(0,
 					      (u64)max_pfn_mapped << PAGE_SHIFT,
-					      size, PAGE_SIZE);
+					      phys_size, PAGE_SIZE);
 		if (phys == MEMBLOCK_ERROR) {
 			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
+		memblock_x86_reserve_range(phys, phys + phys_size, "TMP NUMA DIST");
 		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
@@ -383,21 +386,40 @@ void __init numa_emulation(struct numa_m
 
 	/* transform distance table */
 	numa_reset_distance();
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		for (j = 0; j < MAX_NUMNODES; j++) {
-			int physi = emu_nid_to_phys[i];
-			int physj = emu_nid_to_phys[j];
-			int dist;
-
-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
-				dist = physi == physj ?
-					LOCAL_DISTANCE : REMOTE_DISTANCE;
-			else
+	/* allocate numa_distance at first, it will set new numa_dist_cnt */
+	new_nr = numa_alloc_distance();
+	if (new_nr < 0)
+		goto free_temp_phys;
+
+	/*
+	 * only set it when we have old phys_dist,
+	 * numa_alloc_distance already set default values
+	 */
+	if (phys_dist)
+		for (i = 0; i < new_nr; i++) {
+			for (j = 0; j < new_nr; j++) {
+				int physi = emu_nid_to_phys[i];
+				int physj = emu_nid_to_phys[j];
+				int dist;
+
+				/* really need this check ? */
+				if (physi >= numa_dist_cnt ||
+				    physj >= numa_dist_cnt)
+					continue;
+
 				dist = phys_dist[physi * numa_dist_cnt + physj];
 
-			numa_set_distance(i, j, dist);
+				numa_set_distance(i, j, dist);
+			}
 		}
-	}
+
+free_temp_phys:
+
+	/* Free the temp storage for phys */
+	if (phys_dist)
+		memblock_x86_free_range(__pa(phys_dist),
+					__pa(phys_dist) + phys_size);
+
 	return;
 
 no_emu:
Index: linux-2.6/arch/x86/mm/numa_internal.h
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_internal.h
+++ linux-2.6/arch/x86/mm/numa_internal.h
@@ -18,6 +18,7 @@ struct numa_meminfo {
 void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
 void __init numa_reset_distance(void);
+int numa_alloc_distance(void);
 
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [GIT PULL tip:x86/mm]
  2011-03-01 22:19         ` Yinghai Lu
@ 2011-03-02  9:17           ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-02  9:17 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hello, Yinghai.

On Tue, Mar 01, 2011 at 02:19:13PM -0800, Yinghai Lu wrote:
> After looking at the code, the cleanup does seem to have several problems:
> 1. The temporary NUMA distance copy needs to be reserved.
> 	The find_...without_reserve trick only works when we are done with
> 	the old buffer before allocating a new one.

Indeed, thanks for catching this.

> 2. Copying should only cover the NEW numa_dist_cnt size,
> 	so numa_alloc_dist needs to be called first, before the copy.

It doesn't matter.  numa_set_distance() ignores distances for nodes
which are out of scope.
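
(Roughly, the top of numa_set_distance() works like this; a paraphrase
of the idea, not the literal source:

	void __init numa_set_distance(int from, int to, int distance)
	{
		/* nids beyond the current table are silently ignored,
		 * so callers may loop wider than the table */
		if (from >= numa_distance_cnt || to >= numa_distance_cnt)
			return;
		numa_distance[from * numa_distance_cnt + to] = distance;
	}

so copying with the old count just turns the extra calls into no-ops.)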

> 3. phys_dist should be numa_dist_cnt * numa_dist_cnt in size.
> 4. numa_reset_distance should likewise free numa_dist_cnt * numa_dist_cnt entries.

Yeah, I seem to have completely forgotten about the square thing.
Stupid.

I'll regenerate patches with changes for 1, 3 and 4.  For 2, I'll add
comments explaining why it's okay.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
  2011-03-01 18:25         ` Tejun Heo
  2011-03-01 22:19         ` Yinghai Lu
@ 2011-03-02 10:04         ` Tejun Heo
  2011-03-02 10:07           ` Ingo Molnar
                             ` (2 more replies)
  2 siblings, 3 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

>From dcd414c795a28fff6e511d58e4cfd1202323c703 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 2 Mar 2011 10:54:10 +0100

NUMA distance table handling has the following problems.

* numa_reset_distance() uses numa_distance_cnt * sizeof(numa_distance[0])
  as the table size when it should be using the square of
  numa_distance_cnt.

* The same size miscalculation when allocating space for phys_dist in
  numa_emulation().

* In numa_emulation(), phys_dist must be reserved; otherwise, the new
  emulated distance table may overlap it.

Fix them and, while at it, take numa_distance_cnt resetting in
numa_reset_distance() out of the if block to simplify the code a bit.

David Rientjes reported incorrect handling of distance table during
emulation and Yinghai identified the above problems and wrote the
original patch to fix the problems.  This patch is based on Yinghai's
patch.

Reported-by: David Rientjes <rientjes@google.com>
Patch-originally-from: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
David, can you please verify whether this patch fixes the problem?
Thanks.

 arch/x86/mm/numa_64.c        |    9 ++++-----
 arch/x86/mm/numa_emulation.c |   16 ++++++++++------
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index cccc01d..a94f12c 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -390,14 +390,13 @@ static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
  */
 void __init numa_reset_distance(void)
 {
-	size_t size;
+	size_t size = numa_distance_cnt * numa_distance_cnt *
+		      sizeof(numa_distance[0]);
 
-	if (numa_distance_cnt) {
-		size = numa_distance_cnt * sizeof(numa_distance[0]);
+	if (numa_distance_cnt)
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
-		numa_distance_cnt = 0;
-	}
+	numa_distance_cnt = 0;
 	numa_distance = NULL;
 }
 
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 607a2e8..50fda06 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -300,6 +300,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	static struct numa_meminfo pi __initdata;
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
+	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -336,21 +337,19 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		goto no_emu;
 	}
 
-	/*
-	 * Copy the original distance table.  It's temporary so no need to
-	 * reserve it.
-	 */
+	/* copy the physical distance table */
 	if (numa_dist_cnt) {
-		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
 		u64 phys;
 
 		phys = memblock_find_in_range(0,
 					      (u64)max_pfn_mapped << PAGE_SHIFT,
-					      size, PAGE_SIZE);
+					      phys_size, PAGE_SIZE);
 		if (phys == MEMBLOCK_ERROR) {
 			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
+		memblock_x86_reserve_range(phys, phys + phys_size,
+					   "TMP NUMA DIST");
 		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
@@ -398,6 +397,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 			numa_set_distance(i, j, dist);
 		}
 	}
+
+	/* free the copied physical distance table */
+	if (phys_dist)
+		memblock_x86_free_range(__pa(phys_dist),
+					__pa(phys_dist) + phys_size);
 	return;
 
 no_emu:
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
@ 2011-03-02 10:07           ` Ingo Molnar
  2011-03-02 10:15             ` Tejun Heo
  2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
  2011-03-02 10:43           ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Ingo Molnar
  2 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2011-03-02 10:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> +	size_t size = numa_distance_cnt * numa_distance_cnt *
> +		      sizeof(numa_distance[0]);

> +		memblock_x86_reserve_range(phys, phys + phys_size,
> +					   "TMP NUMA DIST");

> +		memblock_x86_free_range(__pa(phys_dist),
> +					__pa(phys_dist) + phys_size);

These silly linebreaks are really annoying. Please ignore checkpatch when the 
solution makes the result so much uglier. Having line width up to 90-95 is still 
fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:07           ` Ingo Molnar
@ 2011-03-02 10:15             ` Tejun Heo
  2011-03-02 10:36               ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 11:07:27AM +0100, Ingo Molnar wrote:
> 
> * Tejun Heo <tj@kernel.org> wrote:
> 
> > +	size_t size = numa_distance_cnt * numa_distance_cnt *
> > +		      sizeof(numa_distance[0]);
> 
> > +		memblock_x86_reserve_range(phys, phys + phys_size,
> > +					   "TMP NUMA DIST");
> 
> > +		memblock_x86_free_range(__pa(phys_dist),
> > +					__pa(phys_dist) + phys_size);
> 
> These silly linebreaks are really annoying. Please ignore checkpatch when the 
> solution makes the result so much uglier. Having line width up to 90-95 is still 
> fine.

I'm letting all the printk format strings go over the limit (whatever
that may be), but these files already mostly conform to the 80-column
limit, so I'm a bit hesitant to break it for code.  Hey, but it's
ultimately your call.

FWIW, I'm not really decided on the 80-vs.-whatever-column issue.
Having a common limit definitely helps a lot, but it seems almost
impossible to agree on one - is it 90, 95, 100 or 120?  Given that, it
almost seems sticking to 80 might be the only doable solution.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
  2011-03-02 10:07           ` Ingo Molnar
@ 2011-03-02 10:25           ` Tejun Heo
  2011-03-02 10:39             ` [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling Tejun Heo
                               ` (2 more replies)
  2011-03-02 10:43           ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Ingo Molnar
  2 siblings, 3 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

>From d968be2ff381c667bfd09795f82248558902a1ae Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 2 Mar 2011 11:22:14 +0100

NUMA distance table handling has the following problems.

* numa_reset_distance() uses numa_distance_cnt * sizeof(numa_distance[0])
  as the table size when it should be using the square of
  numa_distance_cnt.

* The same size miscalculation when allocating space for phys_dist in
  numa_emulation().

* In numa_emulation(), phys_dist must be reserved; otherwise, the new
  emulated distance table may overlap it.

Fix them and, while at it, take numa_distance_cnt resetting in
numa_reset_distance() out of the if block to simplify the code a bit.

David Rientjes reported incorrect handling of distance table during
emulation and Yinghai identified the above problems and wrote the
original patch to fix the problems.  This patch is based on Yinghai's
patch.

-v2: Ingo was unhappy with the 80-column-limit-induced linebreaks.  Let
     lines run over 80 columns.

Reported-by: David Rientjes <rientjes@google.com>
Patch-originally-from: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/numa_64.c        |    8 +++-----
 arch/x86/mm/numa_emulation.c |   14 ++++++++------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 7757d22..541746f 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -390,14 +390,12 @@ static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
  */
 void __init numa_reset_distance(void)
 {
-	size_t size;
+	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 
-	if (numa_distance_cnt) {
-		size = numa_distance_cnt * sizeof(numa_distance[0]);
+	if (numa_distance_cnt)
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
-		numa_distance_cnt = 0;
-	}
+	numa_distance_cnt = 0;
 	numa_distance = NULL;
 }
 
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 607a2e8..0afa25d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -300,6 +300,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	static struct numa_meminfo pi __initdata;
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
+	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -336,21 +337,18 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		goto no_emu;
 	}
 
-	/*
-	 * Copy the original distance table.  It's temporary so no need to
-	 * reserve it.
-	 */
+	/* copy the physical distance table */
 	if (numa_dist_cnt) {
-		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
 		u64 phys;
 
 		phys = memblock_find_in_range(0,
 					      (u64)max_pfn_mapped << PAGE_SHIFT,
-					      size, PAGE_SIZE);
+					      phys_size, PAGE_SIZE);
 		if (phys == MEMBLOCK_ERROR) {
 			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
+		memblock_x86_reserve_range(phys, phys + phys_size, "TMP NUMA DIST");
 		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
@@ -398,6 +396,10 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 			numa_set_distance(i, j, dist);
 		}
 	}
+
+	/* free the copied physical distance table */
+	if (phys_dist)
+		memblock_x86_free_range(__pa(phys_dist), __pa(phys_dist) + phys_size);
 	return;
 
 no_emu:
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:15             ` Tejun Heo
@ 2011-03-02 10:36               ` Ingo Molnar
  0 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2011-03-02 10:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> FWIW, I'm not really decided about 80 vs. whatever column issue.

It only really matters when the underlying code structure is clearly inefficient: 
too many indentation levels, etc.

But printks or function calls that go a bit beyond 80 cols do not deserve to be 
line-broken.

> Having a common limit definitely helps a lot but it seems almost
> impossible to agree on one - is it 90, 95, 100 or 120?  Given that, it
> almost seems just sticking to 80 might be the only doable solution.

The problem is that many sensible code structures break with a limit of 80.

So I'd suggest being permissive when the code is fine (printks, function calls at 
the first or second level of indentation, etc.) and being conservative when the 
underlying code is not fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling
  2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
@ 2011-03-02 10:39             ` Tejun Heo
  2011-03-02 10:42               ` [PATCH UPDATED " Tejun Heo
  2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
  2011-03-02 16:16             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling Yinghai Lu
  2 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

>From a168b5aa6f8f9db7d8aeda0d85a1d19ed1373981 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 2 Mar 2011 11:32:47 +0100

Handling of out-of-bounds distances and allocation failure can use
better documentation.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/numa_64.c        |   10 +++++++++-
 arch/x86/mm/numa_emulation.c |    6 +++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 541746f..01ab53c 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -392,11 +392,12 @@ void __init numa_reset_distance(void)
 {
 	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 
+	/* numa_distance could be 1LU marking allocation failure, test cnt */
 	if (numa_distance_cnt)
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
 	numa_distance_cnt = 0;
-	numa_distance = NULL;
+	numa_distance = NULL;	/* enable table creation */
 }
 
 static int __init numa_alloc_distance(void)
@@ -447,6 +448,13 @@ static int __init numa_alloc_distance(void)
  * Set the distance from node @from to @to to @distance.  If distance table
  * doesn't exist, one which is large enough to accomodate all the currently
  * known nodes will be created.
+ *
+ * If such table cannot be allocated, a warning is printed and further
+ * calls are ignored until the distance table is reset with
+ * numa_reset_distance().
+ *
+ * If @from or @to is higher than the highest known node, the call is
+ * ignored.
  */
 void __init numa_set_distance(int from, int to, int distance)
 {
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 0afa25d..aeecea9 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -379,7 +379,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
 			emu_nid_to_phys[i] = 0;
 
-	/* transform distance table */
+	/*
+	 * Transform distance table.  numa_set_distance() ignores all
+	 * out-of-bound distances.  Just call it for every possible node
+	 * combination.
+	 */
 	numa_reset_distance();
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		for (j = 0; j < MAX_NUMNODES; j++) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH UPDATED x86/mm] x86-64, NUMA: Better explain numa_distance handling
  2011-03-02 10:39             ` [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling Tejun Heo
@ 2011-03-02 10:42               ` Tejun Heo
  2011-03-02 14:31                 ` David Rientjes
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

>From 615e1632e55b6b8fb52da00c5e8540ef159f759b Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 2 Mar 2011 11:32:47 +0100

Handling of out-of-bounds distances and allocation failure can use
better documentation.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
---
Sorry, sent out the wrong version.  This one is a bit more detailed
and is the version I put into the tree.

Thanks.

 arch/x86/mm/numa_64.c        |   11 ++++++++++-
 arch/x86/mm/numa_emulation.c |    6 +++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 541746f..74064e8 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -392,11 +392,12 @@ void __init numa_reset_distance(void)
 {
 	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 
+	/* numa_distance could be 1LU marking allocation failure, test cnt */
 	if (numa_distance_cnt)
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
 	numa_distance_cnt = 0;
-	numa_distance = NULL;
+	numa_distance = NULL;	/* enable table creation */
 }
 
 static int __init numa_alloc_distance(void)
@@ -447,6 +448,14 @@ static int __init numa_alloc_distance(void)
  * Set the distance from node @from to @to to @distance.  If distance table
  * doesn't exist, one which is large enough to accomodate all the currently
  * known nodes will be created.
+ *
+ * If such table cannot be allocated, a warning is printed and further
+ * calls are ignored until the distance table is reset with
+ * numa_reset_distance().
+ *
+ * If @from or @to is higher than the highest known node at the time of
+ * table creation or @distance doesn't make sense, the call is ignored.
+ * This is to allow simplification of specific NUMA config implementations.
  */
 void __init numa_set_distance(int from, int to, int distance)
 {
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 0afa25d..aeecea9 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -379,7 +379,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
 			emu_nid_to_phys[i] = 0;
 
-	/* transform distance table */
+	/*
+	 * Transform distance table.  numa_set_distance() ignores all
+	 * out-of-bound distances.  Just call it for every possible node
+	 * combination.
+	 */
 	numa_reset_distance();
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		for (j = 0; j < MAX_NUMNODES; j++) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
  2011-03-02 10:07           ` Ingo Molnar
  2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
@ 2011-03-02 10:43           ` Ingo Molnar
  2011-03-02 10:53             ` Tejun Heo
  2 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2011-03-02 10:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> >From dcd414c795a28fff6e511d58e4cfd1202323c703 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 2 Mar 2011 10:54:10 +0100

Btw., given that you basically took most of Yinghai's fixes, wouldn't it be better to 
have this as a 'From: Yinghai' commit, with -v2 comments about the minor updates you 
did?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:43           ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Ingo Molnar
@ 2011-03-02 10:53             ` Tejun Heo
  2011-03-02 10:59               ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 11:43:36AM +0100, Ingo Molnar wrote:
> 
> * Tejun Heo <tj@kernel.org> wrote:
> 
> > >From dcd414c795a28fff6e511d58e4cfd1202323c703 Mon Sep 17 00:00:00 2001
> > From: Tejun Heo <tj@kernel.org>
> > Date: Wed, 2 Mar 2011 10:54:10 +0100
> 
> Btw., given that you basically took most of Yinghai's fixes, wouldn't it be better to 
> have this as a 'From: Yinghai' commit, with -v2 comments about the minor updates you 
> did?

Yeah, this one was on the boundary for me.  I felt that I had edited too
much of the original patch to keep Yinghai's From and S-O-B.  Please
don't take it the wrong way.  I wasn't trying to emphasize my editing; I
just wasn't sure whether the chain of custody could be maintained after
such an edit.  Maybe I should just ask for a respin, though I haven't had
a lot of luck with those lately.  Eh well, I'll just change it this time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:53             ` Tejun Heo
@ 2011-03-02 10:59               ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 10:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Rientjes, Yinghai Lu, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 11:53:25AM +0100, Tejun Heo wrote:
> Yeah, this one was on the boundary for me.  I felt that I had edited too
> much of the original patch to keep Yinghai's From and S-O-B.  Please
> don't take it the wrong way.  I wasn't trying to emphasize my editing; I
> just wasn't sure whether the chain of custody could be maintained after
> such an edit.  Maybe I should just ask for a respin, though I haven't had
> a lot of luck with those lately.  Eh well, I'll just change it this time.

Alright, updated in the git tree.  I'll send a pull request in a few
hours.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
  2011-03-02 10:39             ` [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling Tejun Heo
@ 2011-03-02 14:30             ` David Rientjes
  2011-03-02 15:42               ` Tejun Heo
                                 ` (2 more replies)
  2011-03-02 16:16             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling Yinghai Lu
  2 siblings, 3 replies; 68+ messages in thread
From: David Rientjes @ 2011-03-02 14:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, 2 Mar 2011, Tejun Heo wrote:

> From d968be2ff381c667bfd09795f82248558902a1ae Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 2 Mar 2011 11:22:14 +0100
> 
> NUMA distance table handling has the following problems.
> 
> * numa_reset_distance() uses numa_distance_cnt * sizeof(numa_distance[0])
>   as the table size when it should be using the square of
>   numa_distance_cnt.
> 
> * The same size miscalculation when allocating space for phys_dist in
>   numa_emulation().
> 
> * In numa_emulation(), phys_dist must be reserved; otherwise, the new
>   emulated distance table may overlap it.
> 
> Fix them and, while at it, take numa_distance_cnt resetting in
> numa_reset_distance() out of the if block to simplify the code a bit.
> 
> David Rientjes reported incorrect handling of distance table during
> emulation and Yinghai identified the above problems and wrote the
> original patch to fix the problems.  This patch is based on Yinghai's
> patch.
> 
> -v2: Ingo was unhappy with the 80-column-limit-induced linebreaks.  Let
>      lines run over 80 columns.
> 
> Reported-by: David Rientjes <rientjes@google.com>
> Patch-originally-from: Yinghai Lu <yinghai@kernel.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@elte.hu>

Acked-by: David Rientjes <rientjes@google.com>

There's also this in numa_emulation() that isn't a safe assumption:

        /* make sure all emulated nodes are mapped to a physical node */
        for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
                if (emu_nid_to_phys[i] == NUMA_NO_NODE)
                        emu_nid_to_phys[i] = 0;

Node id 0 is not always online depending on how you set up your SRAT.  I'm 
not sure why emu_nid_to_phys[] would ever map a fake node id that doesn't 
exist to a physical node id rather than NUMA_NO_NODE, so I think it can 
just be removed.  Otherwise, it should be mapped to a physical node id 
that is known to be online.
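
If the mapping is kept, something like this (untested sketch; which
nodemask is authoritative at this point in init is an open question)
would avoid the assumption:

	/* sketch: fall back to a node we know was parsed, not nid 0 */
	int i, fb = first_node(numa_nodes_parsed);

	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
			emu_nid_to_phys[i] = fb;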

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH UPDATED x86/mm] x86-64, NUMA: Better explain numa_distance handling
  2011-03-02 10:42               ` [PATCH UPDATED " Tejun Heo
@ 2011-03-02 14:31                 ` David Rientjes
  0 siblings, 0 replies; 68+ messages in thread
From: David Rientjes @ 2011-03-02 14:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, 2 Mar 2011, Tejun Heo wrote:

> From 615e1632e55b6b8fb52da00c5e8540ef159f759b Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 2 Mar 2011 11:32:47 +0100
> 
> Handling of out-of-bounds distances and allocation failure can use
> better documentation.  Add it.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Yinghai Lu <yinghai@kernel.org>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
@ 2011-03-02 15:42               ` Tejun Heo
  2011-03-02 21:12                 ` Yinghai Lu
  2011-03-03 20:00                 ` David Rientjes
  2011-03-04 15:31               ` [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling Tejun Heo
  2011-03-05 15:50               ` [tip:x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() tip-bot for Tejun Heo
  2 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 15:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hey,

On Wed, Mar 02, 2011 at 06:30:59AM -0800, David Rientjes wrote:
> Acked-by: David Rientjes <rientjes@google.com>
> 
> There's also this in numa_emulation() that isn't a safe assumption:
> 
>         /* make sure all emulated nodes are mapped to a physical node */
>         for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
>                 if (emu_nid_to_phys[i] == NUMA_NO_NODE)
>                         emu_nid_to_phys[i] = 0;
> 
> Node id 0 is not always online depending on how you set up your SRAT.  I'm 
> not sure why emu_nid_to_phys[] would ever map a fake node id that doesn't 
> exist to a physical node id rather than NUMA_NO_NODE, so I think it can 
> just be removed.  Otherwise, it should be mapped to a physical node id 
> that is known to be online.

Unless I screwed up, that behavior isn't new.  It's just put in a
different form.  Looking through the code... Okay, I think node 0
always exists.  The SRAT PXM isn't used as the node number directly.
It goes through acpi_map_pxm_to_node(), which allocates nids from 0 up.
amdtopology also guarantees the existence of node 0, so I think we're
safe, and that probably is the reason why we had the above behavior in
the first place.

IIRC, there are other places which assume the existence of node 0.
Whether it's a good idea or not, I'm not sure, but requiring node 0 to
always be allocated doesn't sound too wrong to me.  Maybe we can add a
BUG_ON() somewhere if node 0 is offline.
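
Something as simple as this after the SRAT/amdtopology parsing, maybe
(sketch only, exact placement to be decided):

	/* sketch: make the "nid 0 always exists" assumption explicit */
	BUG_ON(!node_isset(0, numa_nodes_parsed));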

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
  2011-03-02 10:39             ` [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling Tejun Heo
  2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
@ 2011-03-02 16:16             ` Yinghai Lu
  2011-03-02 16:37               ` Tejun Heo
  2 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 16:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 02:25 AM, Tejun Heo wrote:
>>From d968be2ff381c667bfd09795f82248558902a1ae Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 2 Mar 2011 11:22:14 +0100
> 
> NUMA distance table handling has the following problems.
> 
> * numa_reset_distance() uses numa_distance_cnt * sizeof(numa_distance[0])
>   as the table size when it should be using the square of
>   numa_distance_cnt.
> 
> * The same size miscalculation when allocating space for phys_dist in
>   numa_emulation().
> 
> * In numa_emulation(), phys_dist must be reserved; otherwise, the new
>   emulated distance table may overlap it.
> 
> Fix them and, while at it, take numa_distance_cnt resetting in
> numa_reset_distance() out of the if block to simplify the code a bit.
> 
> David Rientjes reported incorrect handling of distance table during
> emulation and Yinghai identified the above problems and wrote the
> original patch to fix the problems.  This patch is based on Yinghai's
> patch.
> 
> -v2: Ingo was unhappy with the 80-column-limit-induced linebreaks.  Let
>      lines run over 80 columns.
> 
> Reported-by: David Rientjes <rientjes@google.com>
> Patch-originally-from: Yinghai Lu <yinghai@kernel.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/mm/numa_64.c        |    8 +++-----
>  arch/x86/mm/numa_emulation.c |   14 ++++++++------
>  2 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index 7757d22..541746f 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -390,14 +390,12 @@ static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
>   */
>  void __init numa_reset_distance(void)
>  {
> -	size_t size;
> +	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>  
> -	if (numa_distance_cnt) {
> -		size = numa_distance_cnt * sizeof(numa_distance[0]);
> +	if (numa_distance_cnt)
>  		memblock_x86_free_range(__pa(numa_distance),
>  					__pa(numa_distance) + size);
> -		numa_distance_cnt = 0;
> -	}
> +	numa_distance_cnt = 0;
>  	numa_distance = NULL;
>  }

my original part:

@@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
        size_t size;
 
        if (numa_distance_cnt) {
-               size = numa_distance_cnt * sizeof(numa_distance[0]);
+               size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
                memblock_x86_free_range(__pa(numa_distance),
                                        __pa(numa_distance) + size);
                numa_distance_cnt = 0;

So can you tell me why you needed to make those changes?
	i.e., moving the assignments of numa_distance_cnt and size out of the if block

>  
> diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
> index 607a2e8..0afa25d 100644
> --- a/arch/x86/mm/numa_emulation.c
> +++ b/arch/x86/mm/numa_emulation.c
> @@ -300,6 +300,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  	static struct numa_meminfo pi __initdata;
>  	const u64 max_addr = max_pfn << PAGE_SHIFT;
>  	u8 *phys_dist = NULL;
> +	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
>  	int i, j, ret;
>  
>  	if (!emu_cmdline)
> @@ -336,21 +337,18 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  		goto no_emu;
>  	}
>  
> -	/*
> -	 * Copy the original distance table.  It's temporary so no need to
> -	 * reserve it.
> -	 */
> +	/* copy the physical distance table */
>  	if (numa_dist_cnt) {
> -		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
>  		u64 phys;
>  
>  		phys = memblock_find_in_range(0,
>  					      (u64)max_pfn_mapped << PAGE_SHIFT,
> -					      size, PAGE_SIZE);
> +					      phys_size, PAGE_SIZE);
>  		if (phys == MEMBLOCK_ERROR) {
>  			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
>  			goto no_emu;
>  		}
> +		memblock_x86_reserve_range(phys, phys + phys_size, "TMP NUMA DIST");
>  		phys_dist = __va(phys);
>  
>  		for (i = 0; i < numa_dist_cnt; i++)
> @@ -398,6 +396,10 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  			numa_set_distance(i, j, dist);
>  		}
>  	}
> +
> +	/* free the copied physical distance table */
> +	if (phys_dist)
> +		memblock_x86_free_range(__pa(phys_dist), __pa(phys_dist) + phys_size);
>  	return;
>  
>  no_emu:

you missed:

@@ -383,21 +386,40 @@ void __init numa_emulation(struct numa_m
 
 	/* transform distance table */
 	numa_reset_distance();
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		for (j = 0; j < MAX_NUMNODES; j++) {
-			int physi = emu_nid_to_phys[i];
-			int physj = emu_nid_to_phys[j];
-			int dist;
-
-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
-				dist = physi == physj ?
-					LOCAL_DISTANCE : REMOTE_DISTANCE;
-			else
+	/* allocate numa_distance at first, it will set new numa_dist_cnt */
+	new_nr = numa_alloc_distance();
+	if (new_nr < 0)
+		goto free_temp_phys;
+
+	/*
+	 * only set it when we have old phys_dist,
+	 * numa_alloc_distance already set default values
+	 */
+	if (phys_dist)
+		for (i = 0; i < new_nr; i++) {
+			for (j = 0; j < new_nr; j++) {
+				int physi = emu_nid_to_phys[i];
+				int physj = emu_nid_to_phys[j];
+				int dist;
+
+				/* really need this check ? */
+				if (physi >= numa_dist_cnt ||
+				    physj >= numa_dist_cnt)
+					continue;
+
 				dist = phys_dist[physi * numa_dist_cnt + physj];
 
-			numa_set_distance(i, j, dist);
+				numa_set_distance(i, j, dist);
+			}
 		}
-	}
+

the changes include:
1. you only need to go over new_nr*new_nr instead of the huge MAX_NUMNODES * MAX_NUMNODES
2. you do NOT need to go over it if you don't have phys_dist assigned beforehand.
   numa_alloc_distance has already set the defaults.
3. you do need to check that phys_dist is assigned before dereferencing it.

so please just use my original patch.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 16:16             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling Yinghai Lu
@ 2011-03-02 16:37               ` Tejun Heo
  2011-03-02 16:46                 ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 16:37 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hey, Yinghai.

On Wed, Mar 02, 2011 at 08:16:18AM -0800, Yinghai Lu wrote:
> my original part:
> 
> @@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
>         size_t size;
>  
>         if (numa_distance_cnt) {
> -               size = numa_distance_cnt * sizeof(numa_distance[0]);
> +               size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>                 memblock_x86_free_range(__pa(numa_distance),
>                                         __pa(numa_distance) + size);
>                 numa_distance_cnt = 0;
> 
> So can you tell me why you needed to make those changes?
> 	i.e., moving the assignments of numa_distance_cnt and size out of the if block

Please read the patch description.  I actually wrote that down.  :-)

> the changes include:
> 1. you only need to go over new_nr*new_nr instead of the huge MAX_NUMNODES * MAX_NUMNODES
> 2. you do NOT need to go over it if you don't have phys_dist assigned beforehand.
>    numa_alloc_distance has already set the defaults.
> 3. you do need to check that phys_dist is assigned before dereferencing it.

* If you wanted to make that change, split it into a separate patch.
  Don't mix it with changes which actually fix the bug.

* I don't think it's gonna matter all that much.  It only runs once
  and only when emulation is enabled; then again, yes, MAX_NUMNODES *
  MAX_NUMNODES can get quite high, but it looks way too complicated
  for what it achieves.  Just looping over the enabled nodes should
  achieve about the same thing in a much simpler way, right?
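
Something along these lines, say (sketch only; max_emu_nid assumed to
be tracked while the emulated nodes are set up):

	/* sketch: bound the transform by the highest emulated nid in
	 * use instead of walking all of MAX_NUMNODES */
	for (i = 0; i < max_emu_nid; i++)
		for (j = 0; j < max_emu_nid; j++) {
			int pi = emu_nid_to_phys[i], pj = emu_nid_to_phys[j];
			int d = (pi < numa_dist_cnt && pj < numa_dist_cnt) ?
				phys_dist[pi * numa_dist_cnt + pj] :
				(pi == pj ? LOCAL_DISTANCE : REMOTE_DISTANCE);

			numa_set_distance(i, j, d);
		}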

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 16:37               ` Tejun Heo
@ 2011-03-02 16:46                 ` Yinghai Lu
  2011-03-02 16:55                   ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 16:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 08:37 AM, Tejun Heo wrote:
> Hey, Yinghai.
> 
> On Wed, Mar 02, 2011 at 08:16:18AM -0800, Yinghai Lu wrote:
>> my original part:
>>
>> @@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
>>         size_t size;
>>  
>>         if (numa_distance_cnt) {
>> -               size = numa_distance_cnt * sizeof(numa_distance[0]);
>> +               size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>>                 memblock_x86_free_range(__pa(numa_distance),
>>                                         __pa(numa_distance) + size);
>>                 numa_distance_cnt = 0;
>>
>> So can you tell me why you needed to make those changes?
>> 	i.e., moving the assignments of numa_distance_cnt and size out of the if block
> 
> Please read the patch description.  I actually wrote that down.  :-)

Well, you said:
> while at it, take numa_distance_cnt resetting in
> numa_reset_distance() out of the if block to simplify the code a bit.

What are you talking about?  What do you mean by "simplify the code a bit"?

> 
>> the changes include:
>> 1. you only need to go over new_nr*new_nr instead of the huge MAX_NUMNODES * MAX_NUMNODES
>> 2. you do NOT need to go over it if you don't have phys_dist assigned beforehand.
>>    numa_alloc_distance has already set the defaults.
>> 3. you do need to check that phys_dist is assigned before dereferencing it.
> 
> * If you wanted to make that change, split it into a separate patch.
>   Don't mix it with changes which actually fix the bug.
> 
> * I don't think it's gonna matter all that much.  It only runs once
>   and only when emulation is enabled; then again, yes, MAX_NUMNODES *
>   MAX_NUMNODES can get quite high, but it looks way too complicated
>   for what it achieves.  Just looping over the enabled nodes should
>   achieve about the same thing in a much simpler way, right?

What kind of excuse is that for putting inefficient code there!

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 16:46                 ` Yinghai Lu
@ 2011-03-02 16:55                   ` Tejun Heo
  2011-03-02 18:52                     ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 16:55 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 08:46:17AM -0800, Yinghai Lu wrote:
> > * I don't think it's gonna matter all that much.  It only runs once
> >   and only when emulation is enabled; then again, yes, MAX_NUMNODES *
> >   MAX_NUMNODES can get quite high, but it looks way too complicated
> >   for what it achieves.  Just looping over the enabled nodes should
> >   achieve about the same thing in a much simpler way, right?
> 
> What kind of excuse is that for putting inefficient code there!

Complexity of a solution should match the benefit of the complexity.
Code complexity is one of the most important metrics that we need to
keep an eye on.  If you don't do that, the code base becomes very ugly
and difficult to maintain very quickly.  So, yes, some amount of
execution inefficiency is acceptable depending on circumstances.
Efficiency too is something which should be traded off against other
benefits.

In this case, it's not a performance critical path at all and similar
level of efficiency can be achieved in much simpler way, so let's do
that, okay?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 16:55                   ` Tejun Heo
@ 2011-03-02 18:52                     ` Yinghai Lu
  2011-03-02 19:02                       ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 18:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 08:55 AM, Tejun Heo wrote:
> On Wed, Mar 02, 2011 at 08:46:17AM -0800, Yinghai Lu wrote:
>>> * I don't think it's gonna matter all that much.  It only runs once
>>>   and only when emulation is enabled; then again, yes, MAX_NUMNODES *
>>>   MAX_NUMNODES can get quite high, but it looks way too complicated
>>>   for what it achieves.  Just looping over the enabled nodes should
>>>   achieve about the same thing in a much simpler way, right?
>>
>> What kind of excuse is that for putting inefficient code there!
> 
> Complexity of a solution should match the benefit of the complexity.
> Code complexity is one of the most important metrics that we need to
> keep an eye on.  If you don't do that, the code base becomes very ugly
> and difficult to maintain very quickly.  So, yes, some amount of
> execution inefficiency is acceptable depending on circumstances.
> Efficiency too is something which should be traded off against other
> benefits.

No, it is not acceptable in your case.

We can accept something like that during the init stage: do some probing
and extra call paths to keep things happy, like subarch does.

Also, why did you omit my first question?

>>>>diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
>>>>> index 7757d22..541746f 100644
>>>>> --- a/arch/x86/mm/numa_64.c
>>>>> +++ b/arch/x86/mm/numa_64.c
>>>>> @@ -390,14 +390,12 @@ static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
>>>>>   */
>>>>>  void __init numa_reset_distance(void)
>>>>>  {
>>>>> -	size_t size;
>>>>> +	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>>>>>  
>>>>> -	if (numa_distance_cnt) {
>>>>> -		size = numa_distance_cnt * sizeof(numa_distance[0]);
>>>>> +	if (numa_distance_cnt)
>>>>>  		memblock_x86_free_range(__pa(numa_distance),
>>>>>  					__pa(numa_distance) + size);
>>>>> -		numa_distance_cnt = 0;
>>>>> -	}
>>>>> +	numa_distance_cnt = 0;
>>>>>  	numa_distance = NULL;
>>>>>  }

>> my original part:
>> >>
>> >> @@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
>> >>         size_t size;
>> >>  
>> >>         if (numa_distance_cnt) {
>> >> -               size = numa_distance_cnt * sizeof(numa_distance[0]);
>> >> +               size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>> >>                 memblock_x86_free_range(__pa(numa_distance),
>> >>                                         __pa(numa_distance) + size);
>> >>                 numa_distance_cnt = 0;
>> >>
>> >> So can you tell me why you needed to make those changes?
>> >> 	i.e., moving the assignments of numa_distance_cnt and size out of the if block
> > 
> > Please read the patch description.  I actually wrote that down.  :-)
Well, you said:
> > while at it, take numa_distance_cnt resetting in
> > numa_reset_distance() out of the if block to simplify the code a bit.
What are you talking about?  What do you mean by "simplify the code a bit"?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 18:52                     ` Yinghai Lu
@ 2011-03-02 19:02                       ` Tejun Heo
  2011-03-02 19:06                         ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 19:02 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hey, Yinghai.

On Wed, Mar 02, 2011 at 10:52:28AM -0800, Yinghai Lu wrote:
> > Complexity of a solution should match the benefit of the complexity.
> > Code complexity is one of the most important metrics that we need to
> > keep an eye on.  If you don't do that, the code base becomes very ugly
> > and difficult to maintain very quickly.  So, yes, some amount of
> > execution inefficiency is acceptable depending on circumstances.
> > Efficiency too is something which should be traded off against other
> > benefits.
> 
> No, it is not acceptable in your case.
> 
> We can accept something like that during the init stage: do some probing
> and extra call paths to keep things happy, like subarch does.

Hmmm?  I can't really follow your sentence.  This is the init stage.
Anyways, why can't it just walk over the enabled nodes?  What would be
the difference?

> Also, why did you omit my first question?

Yeah, well, because that wasn't completely consistent with what I said
earlier.  I wanted to tell you to take the assignments out of the if ()
in your earlier patch, but I just let it pass, and now I had another
patch touching the same code, so I just had to do it.

I know it's a petty style thing but it's my pet peeve and I can't help
it when a related change goes through me, so there it is.  I'm sorry
but I'll probably do it again.  I beg your understanding.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 19:02                       ` Tejun Heo
@ 2011-03-02 19:06                         ` Yinghai Lu
  2011-03-02 19:13                           ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 19:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 11:02 AM, Tejun Heo wrote:
> Hey, Yinghai.
> 
> On Wed, Mar 02, 2011 at 10:52:28AM -0800, Yinghai Lu wrote:
>>> Complexity of a solution should match the benefit of the complexity.
>>> Code complexity is one of the most important metrics that we need to
>>> keep an eye on.  If you don't do that, the code base becomes very ugly
>>> and difficult to maintain very quickly.  So, yes, some amount of
>>> execution inefficiency is acceptable depending on circumstances.
>>> Efficiency too is something which should be traded off against other
>>> benefits.
>>
>> No, it is not acceptable in your case.
>>
>> We can accept something like that during the init stage: do some probing
>> and extra call paths to keep things happy, like subarch does.
> 
> Hmmm?  I can't really follow your sentence.  This is the init stage.
> Anyways, why can't it just walk over the enabled nodes?  What would be
> the difference?

My point is that we really don't need to go over it if the original isn't there.

> 
>> Also, why did you omit my first question?
> 
> Yeah, well, because that wasn't completely consistent with what I said
> earlier.  I wanted to tell you to take the assignments out of the if ()
> in your earlier patch, but I just let it pass, and now I had another
> patch touching the same code, so I just had to do it.
> 
> I know it's a petty style thing but it's my pet peeve and I can't help
> it when a related change goes through me, so there it is.  I'm sorry
> but I'll probably do it again.  I beg your understanding.

never mind.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 19:06                         ` Yinghai Lu
@ 2011-03-02 19:13                           ` Tejun Heo
  2011-03-02 20:32                             ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 19:13 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hello,

On Wed, Mar 02, 2011 at 11:06:41AM -0800, Yinghai Lu wrote:
> > Hmmm?  I can't really follow your sentence.  This is the init stage.
> > Anyways, why can't it just walk over the enabled nodes?  What would be
> > the difference?
> 
> My point is that we really don't need to go over it if the original isn't there.

Oh, you mean if (!phys_dist)?  Yeah, yeah, sure, I was mostly talking
about allocating the new table separately and returning the count and
all those things.  Can you just do the phys_dist test and go over the
enabled nodes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 19:13                           ` Tejun Heo
@ 2011-03-02 20:32                             ` Yinghai Lu
  2011-03-02 20:57                               ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 20:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 11:13 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 02, 2011 at 11:06:41AM -0800, Yinghai Lu wrote:
>>> Hmmm?  I can't really follow your sentence.  This is the init stage.
>>> Anyways, why can't it just walk over the enabled nodes?  What would be
>>> the difference?
>>
>> My point is that we really don't need to go over it if the original isn't there.
> 
> Oh, you mean if (!phys_dist)?  Yeah, yeah, sure, I was mostly talking
> about allocating the new table separately and returning the count and
> all those things.  Can you just do the phys_dist test and go over the
> enabled nodes?

Why use enabled nodes?

numa_alloc_distance has already calculated this, using numa_nodes_parsed and numa_meminfo:

        /* size the new table and allocate it */
        nodes_parsed = numa_nodes_parsed;
        numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);

        for_each_node_mask(i, nodes_parsed)
                cnt = i;
        cnt++;


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 20:32                             ` Yinghai Lu
@ 2011-03-02 20:57                               ` Tejun Heo
  2011-03-02 21:14                                 ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-02 20:57 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 12:32:53PM -0800, Yinghai Lu wrote:
> Why use enabled nodes?
> 
> numa_alloc_distance has already calculated this, using numa_nodes_parsed and numa_meminfo.

Because it's simpler that way.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 15:42               ` Tejun Heo
@ 2011-03-02 21:12                 ` Yinghai Lu
  2011-03-02 21:36                   ` Yinghai Lu
  2011-03-03 20:04                   ` David Rientjes
  2011-03-03 20:00                 ` David Rientjes
  1 sibling, 2 replies; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 21:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 07:42 AM, Tejun Heo wrote:
> Hey,
> 
> On Wed, Mar 02, 2011 at 06:30:59AM -0800, David Rientjes wrote:
>> Acked-by: David Rientjes <rientjes@google.com>
>>
>> There's also this in numa_emulation() that isn't a safe assumption:
>>
>>         /* make sure all emulated nodes are mapped to a physical node */
>>         for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
>>                 if (emu_nid_to_phys[i] == NUMA_NO_NODE)
>>                         emu_nid_to_phys[i] = 0;
>>
>> Node id 0 is not always online depending on how you setup your SRAT.  I'm 
>> not sure why emu_nid_to_phys[] would ever map a fake node id that doesn't 
>> exist to a physical node id rather than NUMA_NO_NODE, so I think it can 
>> just be removed.  Otherwise, it should be mapped to a physical node id 
>> that is known to be online.
> 
> Unless I screwed up, that behavior isn't new.  It's just put in a
> different form.  Looking through the code... Okay, I think node 0
> always exists.  SRAT PXM isn't used as the node number directly.  It
> goes through acpi_map_pxm_to_node(), which allocates nids from 0 up.
> amdtopology also guarantees the existence of node 0, so I think we're
> safe, and that probably is the reason why we had the above behavior in
> the first place.
> 
> IIRC, there are other places which assume the existence of node 0.
> Whether it's a good idea or not, I'm not sure, but requiring node 0 to
> always be allocated doesn't sound too wrong to me.  Maybe we can add a
> BUG_ON() if node 0 is offline somewhere.


When the first socket does not have memory, node 0 will not be online,
and cpu_to_node() will round those cpus to a nearby node like node 1 or
node 7.

BTW: this configuration has been broken several times, and fixed several times.

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 20:57                               ` Tejun Heo
@ 2011-03-02 21:14                                 ` Yinghai Lu
  2011-03-03  6:17                                   ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 21:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 12:57 PM, Tejun Heo wrote:
> On Wed, Mar 02, 2011 at 12:32:53PM -0800, Yinghai Lu wrote:
>> Why use the enabled nodes?
>>
>> numa_alloc_distance() has already calculated this with numa_nodes_parsed and numa_meminfo:
> 
> Because it's simpler that way.
> 
What do you mean by "enabled nodes"?  numa_nodes_parsed?  Or the new numa_meminfo?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 21:12                 ` Yinghai Lu
@ 2011-03-02 21:36                   ` Yinghai Lu
  2011-03-03 20:07                     ` David Rientjes
  2011-03-03 20:04                   ` David Rientjes
  1 sibling, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-02 21:36 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tejun Heo, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 01:12 PM, Yinghai Lu wrote:
> On 03/02/2011 07:42 AM, Tejun Heo wrote:
>> Hey,
>>
>> On Wed, Mar 02, 2011 at 06:30:59AM -0800, David Rientjes wrote:
>>> Acked-by: David Rientjes <rientjes@google.com>
>>>
>>> There's also this in numa_emulation() that isn't a safe assumption:
>>>
>>>         /* make sure all emulated nodes are mapped to a physical node */
>>>         for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
>>>                 if (emu_nid_to_phys[i] == NUMA_NO_NODE)
>>>                         emu_nid_to_phys[i] = 0;
>>>
>>> Node id 0 is not always online depending on how you setup your SRAT.  I'm 
>>> not sure why emu_nid_to_phys[] would ever map a fake node id that doesn't 
>>> exist to a physical node id rather than NUMA_NO_NODE, so I think it can 
>>> just be removed.  Otherwise, it should be mapped to a physical node id 
>>> that is known to be online.
>>
>> Unless I screwed up, that behavior isn't new.  It's just put in a
>> different form.  Looking through the code... Okay, I think node 0
>> always exists.  SRAT PXM isn't used as the node number directly.  It
>> goes through acpi_map_pxm_to_node(), which allocates nids from 0 up.
>> amdtopology also guarantees the existence of node 0, so I think we're
>> safe, and that probably is the reason why we had the above behavior in
>> the first place.
>>
>> IIRC, there are other places which assume the existence of node 0.
>> Whether it's a good idea or not, I'm not sure, but requiring node 0 to
>> always be allocated doesn't sound too wrong to me.  Maybe we can add a
>> BUG_ON() if node 0 is offline somewhere.
> 
> 
> When the first socket does not have memory, node 0 will not be online,
> and cpu_to_node() will round those cpus to a nearby node like node 1 or
> node 7.
> 
> BTW: this configuration has been broken several times, and fixed several times.

David,

It looks like numa emulation already does not support that configuration.

old code:
void __cpuinit numa_add_cpu(int cpu)
{
        unsigned long addr;
        u16 apicid;
        int physnid;
        int nid = NUMA_NO_NODE;

        apicid = early_per_cpu(x86_cpu_to_apicid, cpu);
        if (apicid != BAD_APICID)
                nid = apicid_to_node[apicid];
        if (nid == NUMA_NO_NODE)
                nid = early_cpu_to_node(cpu);
        BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));


current code:
void __cpuinit numa_add_cpu(int cpu)
{
        int physnid, nid;

        nid = numa_cpu_node(cpu);
        if (nid == NUMA_NO_NODE)
                nid = early_cpu_to_node(cpu);
        BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));

        physnid = emu_nid_to_phys[nid];

        /*
         * Map the cpu to each emulated node that is allocated on the physical
         * node of the cpu's apic id.
         */
        for_each_online_node(nid)
                if (emu_nid_to_phys[nid] == physnid)
                        cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
}


Please note that numa_cpu_node() or the old code will return a nid that
is node 0 even when node 0 has no memory and is not onlined.

Maybe we can just change it to nid = cpu_to_node() to get a node id that is onlined.
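
An untested sketch of that idea (illustrative only; it keeps the
current function body and only changes how nid is picked):

        void __cpuinit numa_add_cpu(int cpu)
        {
                int physnid, nid;

                /* cpu_to_node() should already point at an onlined node */
                nid = cpu_to_node(cpu);
                if (nid == NUMA_NO_NODE)
                        nid = early_cpu_to_node(cpu);
                BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));

                physnid = emu_nid_to_phys[nid];

                /*
                 * Map the cpu to each emulated node that is allocated on
                 * the physical node of the cpu's apic id.
                 */
                for_each_online_node(nid)
                        if (emu_nid_to_phys[nid] == physnid)
                                cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
        }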

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 21:14                                 ` Yinghai Lu
@ 2011-03-03  6:17                                   ` Tejun Heo
  2011-03-10 18:46                                     ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-03  6:17 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, Mar 02, 2011 at 01:14:29PM -0800, Yinghai Lu wrote:
> What do you mean by "enabled nodes"?  numa_nodes_parsed?  Or the new
> numa_meminfo?

Yeah.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 15:42               ` Tejun Heo
  2011-03-02 21:12                 ` Yinghai Lu
@ 2011-03-03 20:00                 ` David Rientjes
  1 sibling, 0 replies; 68+ messages in thread
From: David Rientjes @ 2011-03-03 20:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, 2 Mar 2011, Tejun Heo wrote:

> > There's also this in numa_emulation() that isn't a safe assumption:
> > 
> >         /* make sure all emulated nodes are mapped to a physical node */
> >         for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
> >                 if (emu_nid_to_phys[i] == NUMA_NO_NODE)
> >                         emu_nid_to_phys[i] = 0;
> > 
> > Node id 0 is not always online depending on how you setup your SRAT.  I'm 
> > not sure why emu_nid_to_phys[] would ever map a fake node id that doesn't 
> > exist to a physical node id rather than NUMA_NO_NODE, so I think it can 
> > just be removed.  Otherwise, it should be mapped to a physical node id 
> > that is known to be online.
> 
> Unless I screwed up, that behavior isn't new.  It's just put in a
> different form.  Looking through the code... Okay, I think node 0
> always exists.  SRAT PXM isn't used as the node number directly.  It
> goes through acpi_map_pxm_to_node(), which allocates nids from 0 up.
> amdtopology also guarantees the existence of node 0, so I think we're
> safe, and that probably is the reason why we had the above behavior in
> the first place.
> 

The node may not have any memory or may be offline, which means we
shouldn't be mapping fake nodes to it.  There shouldn't be any code that
can't handle emu_nid_to_phys[] being NUMA_NO_NODE; otherwise there's a
design issue.

> IIRC, there are other places which assume the existence of node 0.
> Whether it's a good idea or not, I'm not sure, but requiring node 0 to
> always be allocated doesn't sound too wrong to me.  Maybe we can add a
> BUG_ON() if node 0 is offline somewhere.
> 

There shouldn't be, because you can't guarantee it is online; it's not
a BUG_ON() condition, and it can easily happen if the pxm for node 0 in
the SRAT is empty or non-existent.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 21:12                 ` Yinghai Lu
  2011-03-02 21:36                   ` Yinghai Lu
@ 2011-03-03 20:04                   ` David Rientjes
  1 sibling, 0 replies; 68+ messages in thread
From: David Rientjes @ 2011-03-03 20:04 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Tejun Heo, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, 2 Mar 2011, Yinghai Lu wrote:

> BTW: this configuration has been broken several times, and fixed several times.
> 

Yes, it has been.  I think it's because there's an assumption that if
NUMA initialization fails, we automatically fake a single node 0
covering all RAM.  The pxm for node 0 doesn't actually need to have any
memory bound to it in the SRAT, so it is trivial with a custom BIOS to
avoid onlining the node.  There don't appear to be any other problems
related to this that we run into throughout the kernel, but it
certainly needs to be fixed in this new emulation code, even if those
fake-to-phys mappings aren't currently used, because it's a bad
assumption.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-02 21:36                   ` Yinghai Lu
@ 2011-03-03 20:07                     ` David Rientjes
  2011-03-04 14:32                       ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: David Rientjes @ 2011-03-03 20:07 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Tejun Heo, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Wed, 2 Mar 2011, Yinghai Lu wrote:

> David,
> 
> It looks like numa emulation already does not support that configuration.
> 
> old code:
> void __cpuinit numa_add_cpu(int cpu)
> {
>         unsigned long addr;
>         u16 apicid;
>         int physnid;
>         int nid = NUMA_NO_NODE;
> 
>         apicid = early_per_cpu(x86_cpu_to_apicid, cpu);
>         if (apicid != BAD_APICID)
>                 nid = apicid_to_node[apicid];
>         if (nid == NUMA_NO_NODE)
>                 nid = early_cpu_to_node(cpu);
>         BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
> 

This isn't broken: the node must be online for the cpu to be bound to
it.  I'm specifically concerned about the case where the node is
offline because it has no memory mapped to it and all cpus are bound to
node 1, for instance.

The new emulation code needs to remove the assumption that node 0 always 
is online and become more robust since there's no such restriction in the 
ACPI spec.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-03 20:07                     ` David Rientjes
@ 2011-03-04 14:32                       ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-04 14:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hello,

On Thu, Mar 03, 2011 at 12:07:10PM -0800, David Rientjes wrote:
> The new emulation code needs to remove the assumption that node 0 always 
> is online and become more robust since there's no such restriction in the 
> ACPI spec.

Yeah, indeed, that's possible.  The ACPI spec isn't really the deciding
factor here.  The node IDs are purely software, and a system is
required to have at least one node, so it seems reasonable to require
node 0 to always be online - or even to require all online node IDs to
be consecutive starting from 0.  The only reason we may have holes in
node IDs now is that the PXM -> nid mapping isn't released for empty,
memory-only nodes.

The biggest downside of deferring checks for this type of rare
condition to upper layers when not strictly necessary is that it
tends to lead to one-off bugs which trigger very infrequently.

That said, maybe we're already too far in that direction, and it's
much easier to weed out the assumptions that node 0 is always online.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling
  2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
  2011-03-02 15:42               ` Tejun Heo
@ 2011-03-04 15:31               ` Tejun Heo
  2011-03-04 21:33                 ` David Rientjes
  2011-03-05 15:50               ` [tip:x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() tip-bot for Tejun Heo
  2 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-04 15:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Undetermined entries in emu_nid_to_phys[] are filled with zero
assuming that physical node 0 is always online; however, this might
not be true depending on hardware configuration.  Find a physical node
which is actually online and use it instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
---
 arch/x86/mm/numa_emulation.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 75b31dc..3696be0 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -301,6 +301,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
 	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
+	int dfl_phys_nid;
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -357,6 +358,19 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 					node_distance(i, j);
 	}
 
+	/* determine the default phys nid to use for unmapped nodes */
+	dfl_phys_nid = NUMA_NO_NODE;
+	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
+		if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
+			dfl_phys_nid = emu_nid_to_phys[i];
+			break;
+		}
+	}
+	if (dfl_phys_nid == NUMA_NO_NODE) {
+		pr_warning("NUMA: Warning: can't determine default physical node, disabling emulation\n");
+		goto no_emu;
+	}
+
 	/* commit */
 	*numa_meminfo = ei;
 
@@ -377,7 +391,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	/* make sure all emulated nodes are mapped to a physical node */
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
-			emu_nid_to_phys[i] = 0;
+			emu_nid_to_phys[i] = dfl_phys_nid;
 
 	/*
 	 * Transform distance table.  numa_set_distance() ignores all

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling
  2011-03-04 15:31               ` [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling Tejun Heo
@ 2011-03-04 21:33                 ` David Rientjes
  2011-03-05  7:50                   ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: David Rientjes @ 2011-03-04 21:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, 4 Mar 2011, Tejun Heo wrote:

> Undetermined entries in emu_nid_to_phys[] are filled with zero
> assuming that physical node 0 is always online; however, this might
> not be true depending on hardware configuration.  Find a physical node
> which is actually online and use it instead.
> 

I still don't understand why emu_nid_to_phys[] can't return
NUMA_NO_NODE and why this needs to be done at all.  An emulated node
should be mapped to a physical node based on the address ranges of
that emulated node, so it's impossible for that to not map directly to
an online physical node.  We don't support memoryless emulated nodes
for hotpluggable SRAT entries.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling
  2011-03-04 21:33                 ` David Rientjes
@ 2011-03-05  7:50                   ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-05  7:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yinghai Lu, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 04, 2011 at 01:33:29PM -0800, David Rientjes wrote:
> On Fri, 4 Mar 2011, Tejun Heo wrote:
> 
> > Undetermined entries in emu_nid_to_phys[] are filled with zero
> > assuming that physical node 0 is always online; however, this might
> > not be true depending on hardware configuration.  Find a physical node
> > which is actually online and use it instead.
> > 
> 
> I still don't understand why emu_nid_to_phys[] can't return
> NUMA_NO_NODE and why this needs to be done at all.  An emulated node
> should be mapped to a physical node based on the address ranges of
> that emulated node, so it's impossible for that to not map directly to
> an online physical node.  We don't support memoryless emulated nodes
> for hotpluggable SRAT entries.

Me neither.  That was the original logic.  I just transformed the
logic into emu_nid_to_phys[].  If you think you can remove that
safely, please go ahead and submit a patch.  That said, it's not
exactly a high-priority thing to do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()
  2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
  2011-03-02 15:42               ` Tejun Heo
  2011-03-04 15:31               ` [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling Tejun Heo
@ 2011-03-05 15:50               ` tip-bot for Tejun Heo
  2 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Tejun Heo @ 2011-03-05 15:50 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tj, tglx, rientjes

Commit-ID:  078a198906c796981f93ff100c210506e91aade5
Gitweb:     http://git.kernel.org/tip/078a198906c796981f93ff100c210506e91aade5
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Fri, 4 Mar 2011 16:32:02 +0100
Committer:  Tejun Heo <tj@kernel.org>
CommitDate: Fri, 4 Mar 2011 16:32:37 +0100

x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()

Undetermined entries in emu_nid_to_phys[] are filled with zero
assuming that physical node 0 is always online; however, this might
not be true depending on hardware configuration.  Find a physical node
which is actually online and use it instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
---
 arch/x86/mm/numa_emulation.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 75b31dc..3696be0 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -301,6 +301,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
 	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
+	int dfl_phys_nid;
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -357,6 +358,19 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 					node_distance(i, j);
 	}
 
+	/* determine the default phys nid to use for unmapped nodes */
+	dfl_phys_nid = NUMA_NO_NODE;
+	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
+		if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
+			dfl_phys_nid = emu_nid_to_phys[i];
+			break;
+		}
+	}
+	if (dfl_phys_nid == NUMA_NO_NODE) {
+		pr_warning("NUMA: Warning: can't determine default physical node, disabling emulation\n");
+		goto no_emu;
+	}
+
 	/* commit */
 	*numa_meminfo = ei;
 
@@ -377,7 +391,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	/* make sure all emulated nodes are mapped to a physical node */
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
-			emu_nid_to_phys[i] = 0;
+			emu_nid_to_phys[i] = dfl_phys_nid;
 
 	/*
 	 * Transform distance table.  numa_set_distance() ignores all

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-03  6:17                                   ` Tejun Heo
@ 2011-03-10 18:46                                     ` Yinghai Lu
  2011-03-11  8:29                                       ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-10 18:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/02/2011 10:17 PM, Tejun Heo wrote:
> On Wed, Mar 02, 2011 at 01:14:29PM -0800, Yinghai Lu wrote:
>> What do you mean by "enabled nodes"?  numa_nodes_parsed?  Or the new
>> numa_meminfo?
> 
> Yeah.
> 

That will duplicate the code:

        nodes_parsed = numa_nodes_parsed;
        numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);

        for_each_node_mask(i, nodes_parsed)
                cnt = i;
        cnt++;

in numa_alloc_distance().

Why not just let numa_alloc_distance() be called first and return cnt?

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-10 18:46                                     ` Yinghai Lu
@ 2011-03-11  8:29                                       ` Tejun Heo
  2011-03-11  8:33                                         ` Tejun Heo
  2011-03-11  9:31                                         ` [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation Tejun Heo
  0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-11  8:29 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hey, Yinghai.

On Thu, Mar 10, 2011 at 10:46:55AM -0800, Yinghai Lu wrote:
> That will duplicate the code:
> 
>         nodes_parsed = numa_nodes_parsed;
>         numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
> 
>         for_each_node_mask(i, nodes_parsed)
>                 cnt = i;
>         cnt++;
> 
> in numa_alloc_distance().
> 
> Why not just let numa_alloc_distance() be called first and return cnt?

Look at your patch.  It exports numa_alloc_distance() and calls the
function explicitly just to use a for (i...) loop instead of
for_each_node_mask().

If you look at how numa_set_distance() is used, it's supposed to be a
fire-and-forget thing.  Specific NUMA configuration implementations
provide the information to the generic code, but they don't have to
care how it's managed or whether something fails for whatever reason,
and I'd like to keep it that way.  It's a small distinction, but small
abstractions and simplifications like that add up to make the code
base easier to understand and maintain.  And you wanna break that for
what?  A for (i...) loop instead of for_each_node_mask().
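
To make the contrast concrete (illustrative only):

        /* open-coded: walks every possible node id and relies on the
         * callee to reject out-of-bound or disabled entries */
        for (i = 0; i < MAX_NUMNODES; i++)
                for (j = 0; j < MAX_NUMNODES; j++)
                        numa_set_distance(i, j, node_distance(i, j));

        /* nodemask iterator: visits only the node ids actually set */
        for_each_node_mask(i, numa_nodes_parsed)
                for_each_node_mask(j, numa_nodes_parsed)
                        numa_set_distance(i, j, node_distance(i, j));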

Also, I don't think your patch is correct.  Even if there is phys_dist
table, emu distance tables should be built because emulated nids don't
map to physical nids one to one.  ie. Two different emulated nids can
share a physical node and the distance table should explicitly reflect
that.

Nobody cares if we spend a few extra cycles going over the numa bitmap
a few more times during boot for frigging NUMA emulation.  It doesn't
matter AT ALL.  Don't micro-optimize where it doesn't matter at the
cost of code readability and maintainability.  It is actively harmful.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11  8:29                                       ` Tejun Heo
@ 2011-03-11  8:33                                         ` Tejun Heo
  2011-03-11 15:48                                           ` Yinghai Lu
  2011-03-11  9:31                                         ` [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-11  8:33 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 11, 2011 at 09:29:38AM +0100, Tejun Heo wrote:
> Also, I don't think your patch is correct.  Even if there is phys_dist
                                                              ^
							      no
> table, emu distance tables should be built because emulated nids don't
> map to physical nids one to one.  ie. Two different emulated nids can
> share a physical node and the distance table should explicitly reflect
> that.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation
  2011-03-11  8:29                                       ` Tejun Heo
  2011-03-11  8:33                                         ` Tejun Heo
@ 2011-03-11  9:31                                         ` Tejun Heo
  2011-03-11 15:42                                           ` Yinghai Lu
  2011-03-11 19:05                                           ` Yinghai Lu
  1 sibling, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-11  9:31 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

The distance transformation in numa_emulation() used to call
numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node
combinations regardless of which are enabled.  As numa_set_distance()
ignores all out-of-bound distance settings, this doesn't cause any
problem other than looping unnecessarily many times during boot.

However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the
code such that it iterates through only the enabled combinations.

Yinghai Lu identified the issue and provided an initial patch to
address it; however, that patch was incorrect in that it didn't build
the emulated distance table when there was no physical distance table,
and it was unnecessarily complex.

  http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/numa_emulation.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 3696be0..ad091e4 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -301,7 +301,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
 	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
-	int dfl_phys_nid;
+	int max_emu_nid, dfl_phys_nid;
 	int i, j, ret;
 
 	if (!emu_cmdline)
@@ -358,12 +358,17 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 					node_distance(i, j);
 	}
 
-	/* determine the default phys nid to use for unmapped nodes */
+	/*
+	 * Determine the max emulated nid and the default phys nid to use
+	 * for unmapped nodes.
+	 */
+	max_emu_nid = 0;
 	dfl_phys_nid = NUMA_NO_NODE;
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
 		if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
-			dfl_phys_nid = emu_nid_to_phys[i];
-			break;
+			max_emu_nid = i;
+			if (dfl_phys_nid == NUMA_NO_NODE)
+				dfl_phys_nid = emu_nid_to_phys[i];
 		}
 	}
 	if (dfl_phys_nid == NUMA_NO_NODE) {
@@ -393,14 +398,10 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
 			emu_nid_to_phys[i] = dfl_phys_nid;
 
-	/*
-	 * Transform distance table.  numa_set_distance() ignores all
-	 * out-of-bound distances.  Just call it for every possible node
-	 * combination.
-	 */
+	/* transform distance table */
 	numa_reset_distance();
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		for (j = 0; j < MAX_NUMNODES; j++) {
+	for (i = 0; i < max_emu_nid + 1; i++) {
+		for (j = 0; j < max_emu_nid + 1; j++) {
 			int physi = emu_nid_to_phys[i];
 			int physj = emu_nid_to_phys[j];
 			int dist;

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation
  2011-03-11  9:31                                         ` [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation Tejun Heo
@ 2011-03-11 15:42                                           ` Yinghai Lu
  2011-03-11 16:03                                             ` Tejun Heo
  2011-03-11 19:05                                           ` Yinghai Lu
  1 sibling, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 15:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/11/2011 01:31 AM, Tejun Heo wrote:
> The distance transformation in numa_emulation() used to call
> numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node
> combinations regardless of which are enabled.  As numa_set_distance()
> ignores all out-of-bound distance settings, this doesn't cause any
> problem other than looping unnecessarily many times during boot.
> 
> However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the
> code such that it iterates through only the enabled combinations.
> 
> Yinghai Lu identified the issue and provided an initial patch to
> address it; however, that patch was incorrect in that it didn't build
> the emulated distance table when there was no physical distance table,
> and it was unnecessarily complex.
> 
>   http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/mm/numa_emulation.c |   23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
> index 3696be0..ad091e4 100644
> --- a/arch/x86/mm/numa_emulation.c
> +++ b/arch/x86/mm/numa_emulation.c
> @@ -301,7 +301,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  	const u64 max_addr = max_pfn << PAGE_SHIFT;
>  	u8 *phys_dist = NULL;
>  	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
> -	int dfl_phys_nid;
> +	int max_emu_nid, dfl_phys_nid;
>  	int i, j, ret;
>  
>  	if (!emu_cmdline)
> @@ -358,12 +358,17 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  					node_distance(i, j);
>  	}
>  
> -	/* determine the default phys nid to use for unmapped nodes */
> +	/*
> +	 * Determine the max emulated nid and the default phys nid to use
> +	 * for unmapped nodes.
> +	 */
> +	max_emu_nid = 0;
>  	dfl_phys_nid = NUMA_NO_NODE;
>  	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
>  		if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
> -			dfl_phys_nid = emu_nid_to_phys[i];
> -			break;
> +			max_emu_nid = i;
> +			if (dfl_phys_nid == NUMA_NO_NODE)
> +				dfl_phys_nid = emu_nid_to_phys[i];
>  		}
>  	}
>  	if (dfl_phys_nid == NUMA_NO_NODE) {
> @@ -393,14 +398,10 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
>  			emu_nid_to_phys[i] = dfl_phys_nid;
>  
> -	/*
> -	 * Transform distance table.  numa_set_distance() ignores all
> -	 * out-of-bound distances.  Just call it for every possible node
> -	 * combination.
> -	 */
> +	/* transform distance table */
>  	numa_reset_distance();
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		for (j = 0; j < MAX_NUMNODES; j++) {
> +	for (i = 0; i < max_emu_nid + 1; i++) {
> +		for (j = 0; j < max_emu_nid + 1; j++) {

Would using num_emu_nids be better?

Yinghai


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11  8:33                                         ` Tejun Heo
@ 2011-03-11 15:48                                           ` Yinghai Lu
  2011-03-11 15:54                                             ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 15:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/11/2011 12:33 AM, Tejun Heo wrote:
> On Fri, Mar 11, 2011 at 09:29:38AM +0100, Tejun Heo wrote:
>> Also, I don't think your patch is correct.  Even if there is phys_dist
>                                                               ^
> 							      no
>> table, emu distance tables should be built because emulated nids don't
>> map to physical nids one to one.  ie. Two different emulated nids can
>> share a physical node and the distance table should explicitly reflect
>> that.

OK, when the original SLIT is not there, that means there is really
only one node, so the new distance table should be filled with
LOCAL_DISTANCE.

Yinghai


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 15:48                                           ` Yinghai Lu
@ 2011-03-11 15:54                                             ` Tejun Heo
  2011-03-11 18:02                                               ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-11 15:54 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 11, 2011 at 07:48:00AM -0800, Yinghai Lu wrote:
> On 03/11/2011 12:33 AM, Tejun Heo wrote:
> > On Fri, Mar 11, 2011 at 09:29:38AM +0100, Tejun Heo wrote:
> >> Also, I don't think your patch is correct.  Even if there is phys_dist
> >                                                               ^
> > 							      no
> >> table, emu distance tables should be built because emulated nids don't
> >> map to physical nids one to one.  ie. Two different emulated nids can
> >> share a physical node and the distance table should explicitly reflect
> >> that.
> 
> OK, when the original SLIT is not there, that means there is really
> only one node, so the new distance table should be filled with
> LOCAL_DISTANCE.

No, a NUMA implementation can skip numa_set_distance() entirely if its
distance is LOCAL_DISTANCE when the nids are equal and REMOTE_DISTANCE
otherwise.  In fact, any amdtopology configuration would behave this
way, so it's incorrect to fill the table with LOCAL_DISTANCE.  You
have to check the physnid mapping and build the new table whether the
physical table exists or not.  Lack of a physical distance table
doesn't mean all nodes are LOCAL_DISTANCE.
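
The lookup side already degrades gracefully, along the lines of this
sketch (the exact code may differ slightly):

        int __node_distance(int from, int to)
        {
                /* no table allocated, or out-of-range nids: fall back
                 * to the LOCAL/REMOTE default */
                if (from >= numa_distance_cnt || to >= numa_distance_cnt)
                        return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
                return numa_distance[from * numa_distance_cnt + to];
        }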

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation
  2011-03-11 15:42                                           ` Yinghai Lu
@ 2011-03-11 16:03                                             ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2011-03-11 16:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hello, Yinghai.

On Fri, Mar 11, 2011 at 07:42:26AM -0800, Yinghai Lu wrote:
> > @@ -393,14 +398,10 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
> >  		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
> >  			emu_nid_to_phys[i] = dfl_phys_nid;
> >  
> > -	/*
> > -	 * Transform distance table.  numa_set_distance() ignores all
> > -	 * out-of-bound distances.  Just call it for every possible node
> > -	 * combination.
> > -	 */
> > +	/* transform distance table */
> >  	numa_reset_distance();
> > -	for (i = 0; i < MAX_NUMNODES; i++) {
> > -		for (j = 0; j < MAX_NUMNODES; j++) {
> > +	for (i = 0; i < max_emu_nid + 1; i++) {
> > +		for (j = 0; j < max_emu_nid + 1; j++) {
> 
> using num_emu_nids would be better?

Well, this is mostly frivolous but I think max is better here.  We
don't really care about the number of emulated nodes.  The number
we're looking for is the number which allows the code to cover all
enabled configurations.  So, I think max reflects the logic better.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 15:54                                             ` Tejun Heo
@ 2011-03-11 18:02                                               ` Yinghai Lu
  2011-03-11 18:19                                                 ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 18:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 11, 2011 at 7:54 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Mar 11, 2011 at 07:48:00AM -0800, Yinghai Lu wrote:
>> On 03/11/2011 12:33 AM, Tejun Heo wrote:
>> > On Fri, Mar 11, 2011 at 09:29:38AM +0100, Tejun Heo wrote:
>> >> Also, I don't think your patch is correct.  Even if there is phys_dist
>> >                                                               ^
>> >                                                           no
>> >> table, emu distance tables should be built because emulated nids don't
>> >> map to physical nids one to one.  ie. Two different emulated nids can
>> >> share a physical node and the distance table should explicitly reflect
>> >> that.
>>
>> OK, when the original SLIT is not there, that means there is really
>> only one node, so the new distance table should be filled with
>> LOCAL_DISTANCE.
>
> No, a NUMA implementation can skip numa_set_distance() entirely if its
> distance is LOCAL_DISTANCE when the nids are equal and REMOTE_DISTANCE
> otherwise.  In fact, any amdtopology configuration would behave this
> way, so it's incorrect to fill the table with LOCAL_DISTANCE.  You
> have to check the physnid mapping and build the new table whether the
> physical table exists or not.  Lack of a physical distance table
> doesn't mean all nodes are LOCAL_DISTANCE.

Too bad.  We should call numa_alloc_distance() in amdtopology to set
the default values in that array.

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 18:02                                               ` Yinghai Lu
@ 2011-03-11 18:19                                                 ` Tejun Heo
  2011-03-11 18:25                                                   ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-11 18:19 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

Hello,

On Fri, Mar 11, 2011 at 10:02:41AM -0800, Yinghai Lu wrote:
> > No, a NUMA implementation can skip numa_set_distance() entirely if its
> > distance is LOCAL_DISTANCE when the nids are equal and REMOTE_DISTANCE
> > otherwise.  In fact, any amdtopology configuration would behave this
> > way, so it's incorrect to fill the table with LOCAL_DISTANCE.  You
> > have to check the physnid mapping and build the new table whether the
> > physical table exists or not.  Lack of a physical distance table
> > doesn't mean all nodes are LOCAL_DISTANCE.
> 
> Too bad.  We should call numa_alloc_distance() in amdtopology to set
> the default values in that array.

I'm not following.  If there's no distance table, the distance is
assumed to be LOCAL between the same node and REMOTE if the nodes are
different, which is exactly the way it should be for those machines.
Why is this bad, and why would you allocate a distance table for such
configurations?

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 18:19                                                 ` Tejun Heo
@ 2011-03-11 18:25                                                   ` Yinghai Lu
  2011-03-11 18:29                                                     ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 18:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 11, 2011 at 10:19 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Mar 11, 2011 at 10:02:41AM -0800, Yinghai Lu wrote:
>> > No, a NUMA implementation can skip numa_set_distance() entirely if its
>> > distance is LOCAL_DISTANCE when the nids are equal and REMOTE_DISTANCE
>> > otherwise.  In fact, any amdtopology configuration would behave this
>> > way, so it's incorrect to fill the table with LOCAL_DISTANCE.  You
>> > have to check the physnid mapping and build the new table whether the
>> > physical table exists or not.  Lack of a physical distance table
>> > doesn't mean all nodes are LOCAL_DISTANCE.
>>
>> Too bad.  We should call numa_alloc_distance() in amdtopology to set
>> the default values in that array.
>
> I'm not following.  If there's no distance table, the distance is
> assumed to be LOCAL between the same node and REMOTE if the nodes are
> different, which is exactly the way it should be for those machines.
> Why is this bad, and why would you allocate a distance table for such
> configurations?

Now even emulation has that distance array.

Why not keep it simple and make all paths have that array?

Yinghai

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 18:25                                                   ` Yinghai Lu
@ 2011-03-11 18:29                                                     ` Tejun Heo
  2011-03-11 18:45                                                       ` Yinghai Lu
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2011-03-11 18:29 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On Fri, Mar 11, 2011 at 10:25:23AM -0800, Yinghai Lu wrote:
> Now even emulation has that distance array.
> 
> Why not keep it simple and make all paths have that array?

Okay, I don't know, maybe, but I don't think it's gonna buy much.  We
need a boundary check in the distance testing function anyway in case
someone calls in with out-of-bound nids, and not having a distance
table is just a degenerate case of the generic sanity check, so there
really isn't much to be gained by allocating a dummy table.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling
  2011-03-11 18:29                                                     ` Tejun Heo
@ 2011-03-11 18:45                                                       ` Yinghai Lu
  0 siblings, 0 replies; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 18:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/11/2011 10:29 AM, Tejun Heo wrote:
> On Fri, Mar 11, 2011 at 10:25:23AM -0800, Yinghai Lu wrote:
>> Now even emulation has that distance array.
>>
>> Why not keep it simple and make all paths have that array?
> 
> Okay, I don't know, maybe, but I don't think it's gonna buy much.  We
> need a boundary check in the distance testing function anyway in case
> someone calls in with out-of-bound nids, and not having a distance
> table is just a degenerate case of the generic sanity check, so there
> really isn't much to be gained by allocating a dummy table.

For out-of-bound nid access, it looks like we should generate a BUG_ON() for it?
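
As an untested sketch, that would be:

        int __node_distance(int from, int to)
        {
                /* hypothetical: treat out-of-bound nids as a hard bug */
                BUG_ON(from >= numa_distance_cnt || to >= numa_distance_cnt);
                return numa_distance[from * numa_distance_cnt + to];
        }

Note that on a machine with no distance table at all, numa_distance_cnt
stays 0, so every lookup would trip the BUG_ON() -- which is exactly the
degenerate case the silent fallback handles.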

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation
  2011-03-11  9:31                                         ` [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation Tejun Heo
  2011-03-11 15:42                                           ` Yinghai Lu
@ 2011-03-11 19:05                                           ` Yinghai Lu
  1 sibling, 0 replies; 68+ messages in thread
From: Yinghai Lu @ 2011-03-11 19:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Ingo Molnar, tglx, H. Peter Anvin, linux-kernel

On 03/11/2011 01:31 AM, Tejun Heo wrote:
> The distance transforming in numa_emulation() used to call
> numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node
> combinations regardless of which are enabled.  As numa_set_distance()
> ignores all out-of-bound distance settings, this doesn't cause any
> problem other than looping unnecessarily many times during boot.
> 
> However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the
> code such that it iterates through only the enabled combinations.
> 
> Yinghai Lu identified the issue and provided an initial patch to
> address the issue; however, the patch was incorrect in that it didn't
> build emulated distance table when there's no physical distance table
> and unnecessarily complex.
> 
>   http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Yinghai Lu <yinghai@kernel.org>

Acked-by: Yinghai Lu <yinghai@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread

Thread overview: 68+ messages
2011-02-24 14:51 [GIT PULL tip:x86/mm] Tejun Heo
2011-02-24 14:52 ` [GIT PULL tip:x86/mm] bootmem,x86: cleanup changes Tejun Heo
2011-02-24 19:08 ` [GIT PULL tip:x86/mm] Yinghai Lu
2011-02-24 19:23   ` Ingo Molnar
2011-02-24 19:28     ` Yinghai Lu
2011-02-24 19:32       ` Ingo Molnar
2011-02-24 19:46         ` Tejun Heo
2011-02-24 22:46           ` [patch] x86, mm: Fix size of numa_distance array David Rientjes
2011-02-24 23:30             ` Yinghai Lu
2011-02-24 23:31             ` David Rientjes
2011-02-25  9:05               ` Tejun Heo
2011-02-25  9:03             ` Tejun Heo
2011-02-25 10:58               ` Tejun Heo
2011-02-25 11:05                 ` Tejun Heo
2011-02-25  9:11             ` [PATCH x86-mm] x86-64, NUMA: " Tejun Heo
2011-03-01 17:18       ` [GIT PULL tip:x86/mm] David Rientjes
2011-03-01 18:25         ` Tejun Heo
2011-03-01 22:19         ` Yinghai Lu
2011-03-02  9:17           ` Tejun Heo
2011-03-02 10:04         ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Tejun Heo
2011-03-02 10:07           ` Ingo Molnar
2011-03-02 10:15             ` Tejun Heo
2011-03-02 10:36               ` Ingo Molnar
2011-03-02 10:25           ` [PATCH x86/mm UPDATED] " Tejun Heo
2011-03-02 10:39             ` [PATCH x86/mm] x86-64, NUMA: Better explain numa_distance handling Tejun Heo
2011-03-02 10:42               ` [PATCH UPDATED " Tejun Heo
2011-03-02 14:31                 ` David Rientjes
2011-03-02 14:30             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling David Rientjes
2011-03-02 15:42               ` Tejun Heo
2011-03-02 21:12                 ` Yinghai Lu
2011-03-02 21:36                   ` Yinghai Lu
2011-03-03 20:07                     ` David Rientjes
2011-03-04 14:32                       ` Tejun Heo
2011-03-03 20:04                   ` David Rientjes
2011-03-03 20:00                 ` David Rientjes
2011-03-04 15:31               ` [PATCH x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() handling Tejun Heo
2011-03-04 21:33                 ` David Rientjes
2011-03-05  7:50                   ` Tejun Heo
2011-03-05 15:50               ` [tip:x86/mm] x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() tip-bot for Tejun Heo
2011-03-02 16:16             ` [PATCH x86/mm UPDATED] x86-64, NUMA: Fix distance table handling Yinghai Lu
2011-03-02 16:37               ` Tejun Heo
2011-03-02 16:46                 ` Yinghai Lu
2011-03-02 16:55                   ` Tejun Heo
2011-03-02 18:52                     ` Yinghai Lu
2011-03-02 19:02                       ` Tejun Heo
2011-03-02 19:06                         ` Yinghai Lu
2011-03-02 19:13                           ` Tejun Heo
2011-03-02 20:32                             ` Yinghai Lu
2011-03-02 20:57                               ` Tejun Heo
2011-03-02 21:14                                 ` Yinghai Lu
2011-03-03  6:17                                   ` Tejun Heo
2011-03-10 18:46                                     ` Yinghai Lu
2011-03-11  8:29                                       ` Tejun Heo
2011-03-11  8:33                                         ` Tejun Heo
2011-03-11 15:48                                           ` Yinghai Lu
2011-03-11 15:54                                             ` Tejun Heo
2011-03-11 18:02                                               ` Yinghai Lu
2011-03-11 18:19                                                 ` Tejun Heo
2011-03-11 18:25                                                   ` Yinghai Lu
2011-03-11 18:29                                                     ` Tejun Heo
2011-03-11 18:45                                                       ` Yinghai Lu
2011-03-11  9:31                                         ` [PATCH x86/mm] x86-64, NUMA: Don't call numa_set_distance() for all possible node combinations during emulation Tejun Heo
2011-03-11 15:42                                           ` Yinghai Lu
2011-03-11 16:03                                             ` Tejun Heo
2011-03-11 19:05                                           ` Yinghai Lu
2011-03-02 10:43           ` [PATCH x86/mm] x86-64, NUMA: Fix distance table handling Ingo Molnar
2011-03-02 10:53             ` Tejun Heo
2011-03-02 10:59               ` Tejun Heo
