* [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
@ 2010-10-25 22:41 David Howells
  2010-10-26  9:10 ` Tejun Heo
  2010-10-26 10:22 ` David Howells
  0 siblings, 2 replies; 63+ messages in thread
From: David Howells @ 2010-10-25 22:41 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: Tejun Heo, linux-am33-list, linux-kernel, Akira Takeuchi, Mark Salter

In the MN10300 arch, we occasionally see an assertion being tripped in
alloc_cwqs() at the following line:

        /* just in case, make sure it's actually aligned */
  --->  BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
        return wq->cpu_wq.v ? 0 : -ENOMEM;

The values are:

        wq->cpu_wq.v => 0x902776e0
        align => 0x100

and align is calculated by the following:

        const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
                                   __alignof__(unsigned long long));

This is because the pointer in question (wq->cpu_wq.v) loses some of its lower
bits to control flags, and so the object it points to must be sufficiently
aligned to avoid the need to use those bits for pointing to things.

Currently, 4 control bits and 4 colour bits are used in normal circumstances,
plus a debugging bit if debugging is set.  This requires the
cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512 bytes
aligned with debugging).
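
As a rough illustration of the arithmetic above, here is a minimal userspace
sketch.  It is not the workqueue code itself: the sketch_* names are
hypothetical, and the bit counts are simply the ones quoted in the paragraph
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from the description above: 4 control bits plus
 * 4 colour bits, and one extra bit when debugging is enabled. */
enum {
	SKETCH_CONTROL_BITS = 4,
	SKETCH_COLOUR_BITS  = 4,
};

/* Required alignment: 1 << (number of flag bits), but never less than the
 * natural alignment of unsigned long long. */
static size_t sketch_required_align(bool debug)
{
	size_t flag_bits = SKETCH_CONTROL_BITS + SKETCH_COLOUR_BITS +
			   (debug ? 1 : 0);
	size_t from_flags = (size_t)1 << flag_bits;
	size_t natural = _Alignof(unsigned long long);

	return from_flags > natural ? from_flags : natural;
}

/* Same kind of test as IS_ALIGNED(); align must be a power of two. */
static bool sketch_is_aligned(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

Under these assumptions sketch_required_align(false) is 0x100 and
sketch_required_align(true) is 0x200, and
sketch_is_aligned(0x902776e0, 0x100) is false because the low byte of the
pointer is 0xe0.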

PERCPU() alignment on MN10300, however, is only 32 bytes as set in
vmlinux.lds.S.  So we set this to PAGE_SIZE (4096) to match most other arches
and stick a comment in alloc_cwqs() for anyone else who triggers the assertion.

Reported-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mark Salter <msalter@redhat.com>
cc: Tejun Heo <tj@kernel.org>
---

 arch/mn10300/kernel/vmlinux.lds.S |    2 +-
 kernel/workqueue.c                |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/arch/mn10300/kernel/vmlinux.lds.S b/arch/mn10300/kernel/vmlinux.lds.S
index 10549dc..febbeee 100644
--- a/arch/mn10300/kernel/vmlinux.lds.S
+++ b/arch/mn10300/kernel/vmlinux.lds.S
@@ -70,7 +70,7 @@ SECTIONS
 	.exit.text : { EXIT_TEXT; }
 	.exit.data : { EXIT_DATA; }
 
-  PERCPU(32)
+  PERCPU(PAGE_SIZE)
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 30acdb7..e5ff2cb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2791,7 +2791,9 @@ static int alloc_cwqs(struct workqueue_struct *wq)
 		}
 	}
 
-	/* just in case, make sure it's actually aligned */
+	/* just in case, make sure it's actually aligned
+	 * - this is affected by PERCPU() alignment in vmlinux.lds.S
+	 */
 	BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
 	return wq->cpu_wq.v ? 0 : -ENOMEM;
 }



* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-25 22:41 [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
@ 2010-10-26  9:10 ` Tejun Heo
  2010-10-26 10:22 ` David Howells
  1 sibling, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-26  9:10 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, linux-am33-list, linux-kernel, Akira Takeuchi,
	Mark Salter

Hello,

On 10/26/2010 12:41 AM, David Howells wrote:
> PERCPU() alignment on MN10300, however, is only 32 bytes as set in
> vmlinux.lds.S.  So we set this to PAGE_SIZE (4096) to match most other arches
> and stick a comment in alloc_cwqs() for anyone else who triggers the assertion.

Ah, okay, but I'm not quite sure how that would affect the alignment
of dynamically allocated percpu memory.  Is this SMP or UP build?  Can
you please double check the bug doesn't trigger with the section
alignment updated?

> -  PERCPU(32)
> +  PERCPU(PAGE_SIZE)

Hmmm... during initialization, the initial percpu memory is
re-allocated using bootmem allocator with proper alignment and the
output section is just used as data source and discarded once init is
complete.  So, unless I'm mistaken, I don't think this would affect
anything.  That said, I think it might be better to just remove the
alignment parameter from the macro and force align to PAGE_SIZE.  It
doesn't really help anything.

Thanks.

-- 
tejun


* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-25 22:41 [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
  2010-10-26  9:10 ` Tejun Heo
@ 2010-10-26 10:22 ` David Howells
  2010-10-26 12:14   ` Tejun Heo
  2010-10-26 14:50   ` [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
  1 sibling, 2 replies; 63+ messages in thread
From: David Howells @ 2010-10-26 10:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, torvalds, akpm, linux-am33-list, linux-kernel,
	Akira Takeuchi, Mark Salter

Tejun Heo <tj@kernel.org> wrote:

> Ah, okay, but I'm not quite sure how that would affect the alignment
> of dynamically allocated percpu memory.  Is this SMP or UP build?

It is definitely SMP.

> Can you please double check the bug doesn't trigger with the section
> alignment updated?

It can be made to trigger consistently without the change, and simply updating
that alignment makes it go away.  It seems unlikely that it's affecting
subsequent stuff in the final link since the PERCPU() is immediately followed
by an alignment to PAGE_SIZE:

	PERCPU(PAGE_SIZE)
	. = ALIGN(PAGE_SIZE);

I've attached the kernel log below.  CPUID is 0 indicating this happened on
CPU 0 (the boot CPU).

> That said, I think it might be better to just remove the alignment parameter
> from the macro and force align to PAGE_SIZE.

That's not necessarily good.  Two arches to note:

	arch/x86/kernel/vmlinux.lds.S:  PERCPU(THREAD_SIZE)

which may be bigger than PAGE_SIZE and:

	arch/frv/kernel/vmlinux.lds.S:  PERCPU(4096)

FRV's page size is 16KB, so on that we really don't want it to be PAGE_SIZE.

David
---
Linux version 2.6.36-rc7-01208-g3e148cd (takeuchi@shampoo.scd.mei.co.jp) (gcc version 4.2.1 20100927 (GNUPro 07r1) (Based on: GCC 4.2, BINUTILS 2.17, GDB 6.6)) #3 SMP PREEMPT Fri Oct 22 16:37:22 JST 2010
Panasonic am34-2, rev 1
DDR2-SDRAM: 384MB/512MB memory available @0x84000000.
On node 0 totalpages: 16384
free_area_init_node: node 0, pgdat 9025cf80, node_mem_map 902c5000
  Normal zone: 128 pages used for memmap
  Normal zone: 0 pages reserved
  Normal zone: 16256 pages, LIFO batch:3
PERCPU: Embedded 7 pages/cpu @90348000 s6176 r8192 d14304 u65536
pcpu-alloc: s6176 r8192 d14304 u65536 alloc=16*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
Kernel command line: root=/dev/nfs rw console=ttySM0,115200 nfsroot=192.168.10.1:/home/takeuchi/AM-Linux-2.6/nfsroot init=/bin/sh -l ip=192.168.10.72:192.168.10.1:192.168.10.1:255.255.255.0:iar-takeuchi:eth0:off debug mem=64M
PID hash table entries: 256 (order: -2, 1024 bytes)
Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
Memory: 62072k/65536k available (1859k kernel code, 3464k reserved, 557k data, 96k init, 0k highmem)
Hierarchical RCU implementation.
        RCU-based detection of stalled CPUs is disabled.
        Verbose stalled-CPUs detection is disabled.
NR_IRQS:197
timestamp counter I/O clock running at 100.00 (calibrated against RTC)
console [ttySM0] enabled
Calibrating delay loop... 162.30 BogoMIPS (lpj=324608)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512
Initializing cgroup subsys ns
Initializing cgroup subsys cpuacct
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
CPU#0 : ioclk speed: 100.00MHz : bogomips : 162.30
Booting CPU#1
Initializing CPU#1
CPU#1 : ioclk speed: 100.00MHz : bogomips : 162.30
------------[ cut here ]------------
Kernel BUG at 9002a159 [verbose debug info unavailable]

An unsupported syscall insn was used by the kernel
: 0378
PC:  9002a157 EPSW:  00000f00  SSP: 93c25ec4 mode: Super
d0:  902776e0   d1:  000000e0   d2: 90217fec   d3: 93c16c60
a0:  902776e0   a1:  902538ec   a2: 93c1ef60   a3: 93c1ef60
e0:  00000002   e1:  00000000   e2: 00000000   e3: 00000002
e4:  00000000   e5:  9002ab90   e6: 00000100   e7: 93c1ef68
lar: 900ed2d8   lir: 41f00ef1  mdr: 901cf10d  usp: 00000000
cvf: 00000000   crl: 00000000  crh: 00000000  drq: 00000000
threadinfo=93c24000 task=93c22be0)
Process swapper (pid: 1)
CPUID:  00000000
CPUP:   0080
TBR:    900002a0
DEAR:   6f9f2709
sISR:   04000000
NMICR:  0000
BCBERR: 00000000
BCBEAR: 4c00050c
MMUFCR: 00000000
IPTEU : 08063c00  IPTEL2: 00000200
DPTEU:  300fa000  DPTEL2: 00000200

Stack:
  9002a037 00000001 9002ab90 902b695c 90011805 902b5740 00000002 93c16c60
  00000000 00000004 00000001 9002ab90 902b573c 902617c6 90217fec 90019280
  00000000 00000000 00000000 90261628 9027365c 00000000 00000000 00000000

Call Trace: [<9002a037>] __alloc_workqueue_key+0xa7/0x370
 [<9002ab90>] idle_worker_timeout+0x0/0x68
 [<90011805>] wake_up_process+0x11/0x18
 [<9002ab90>] idle_worker_timeout+0x0/0x68
 [<902617c6>] init_workqueues+0x19e/0x2b4
 [<90019280>] cpu_maps_update_done+0x10/0x14
 [<90261628>] init_workqueues+0x0/0x2b4
 [<9000149a>] do_one_initcall+0x116/0x1f0
 [<9025e431>] kernel_init+0x65/0x224
 [<9025e3cc>] kernel_init+0x0/0x224
 [<9001b188>] do_exit+0x0/0x714
 [<9000108c>] loop_set_secondary_icr+0x7c/0x8e


Kernel panic - not syncing: Attempted to kill init!


* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-26 10:22 ` David Howells
@ 2010-10-26 12:14   ` Tejun Heo
  2010-10-26 12:27     ` Tejun Heo
  2010-10-26 14:50   ` [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
  1 sibling, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 12:14 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, linux-am33-list, linux-kernel, Akira Takeuchi,
	Mark Salter

Hello,

On 10/26/2010 12:22 PM, David Howells wrote:
>> Can you please double check the bug doesn't trigger with the section
>> alignment updated?
> 
> It can be made to trigger consistently without the change, and simply updating
> that alignment makes it go away.  It seems unlikely that it's affecting
> subsequent stuff in the final link since the PERCPU() is immediately followed
> by an alignment to PAGE_SIZE:
> 
> 	PERCPU(PAGE_SIZE)
> 	. = ALIGN(PAGE_SIZE);
> 
> I've attached the kernel log below.  CPUID is 0 indicating this happened on
> CPU 0 (the boot CPU).

Ah, I see now.  The actual areas are properly aligned but the percpu
address is determined as offset from the percpu output section base so
the percpu pointers in the percpu address space end up misaligned with
the actual kernel addresses and the code in workqueue checks the
address in percpu AS, so, yeap, it's caused by the misalignment of the
percpu section.  Except for triggering BUG_ON(), it shouldn't cause a
real issue tho as work_data points to the translated addresses in the
kernel AS for specific CPU.  Needs to be fixed anyways.
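
To make the above concrete, here is a small userspace model.  It is only a
sketch: the sketch_* names are hypothetical, and the model assumes a percpu
pointer is the output section base plus an offset, while the address a CPU
actually uses is that CPU's page-aligned unit base plus the same offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two address spaces discussed above. */
struct sketch_percpu_layout {
	uintptr_t section_base;	/* start of the percpu output section */
	uintptr_t unit_base;	/* page-aligned percpu unit for one CPU */
};

/* Address in the percpu address space: section base plus offset. */
static uintptr_t sketch_percpu_ptr(const struct sketch_percpu_layout *l,
				   uintptr_t offset)
{
	return l->section_base + offset;
}

/* Translated kernel address for the CPU that owns unit_base. */
static uintptr_t sketch_cpu_addr(const struct sketch_percpu_layout *l,
				 uintptr_t percpu_ptr)
{
	return l->unit_base + (percpu_ptr - l->section_base);
}

/* Power-of-two alignment test, in the spirit of IS_ALIGNED(). */
static bool sketch_is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

With a made-up section base of 0x902750e0 (32-byte aligned only), the unit
base 0x90348000 from the log, and a 256-byte aligned offset of 0x2600, the
percpu pointer comes out as 0x902776e0, which fails a 0x100 alignment check,
while the translated address 0x9034a600 passes it.  Only the section base and
the offset are invented here; they were picked so that the pointer matches
the reported value.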

>> That said, I think it might be better to just remove the alignment parameter
>> from the macro and force align to PAGE_SIZE.
> 
> That's not necessarily good.  Two arches to note:
> 
> 	arch/x86/kernel/vmlinux.lds.S:  PERCPU(THREAD_SIZE)

I don't think the current percpu allocator honors alignment larger
than PAGE_SIZE no matter how large the alignment for the percpu output
section is.  I'll look into it deeper but I think we might just have
been lucky and the alignment somehow didn't bite us yet.  The only
user of THREAD_SIZE mask at this point seems to be cpu_init().  Maybe
we can remove this requirement.  I'll look into it.

> which may be bigger than PAGE_SIZE and:
> 
> 	arch/frv/kernel/vmlinux.lds.S:  PERCPU(4096)
> 
> FRV's page size is 16KB, so on that we really don't want it to be PAGE_SIZE.

Why not?  It's in the init section which will be freed anyway and with
the kernel image compression it's not even gonna add any noticeable
amount to the kernel image size.  There isn't any benefit in using
anything smaller than PAGE_SIZE for alignment.  Also, percpu allocator
guarantees that alignment requirements up to PAGE_SIZE are honored.  If the
output section uses smaller alignment, the percpu AS will end up being
misaligned.

Thanks.

-- 
tejun


* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-26 12:14   ` Tejun Heo
@ 2010-10-26 12:27     ` Tejun Heo
  2010-10-26 12:45       ` [PATCH] x86, percpu: revert commit fe8e0c25 Tejun Heo
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 12:27 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, linux-am33-list, linux-kernel, Akira Takeuchi,
	Mark Salter

On 10/26/2010 02:14 PM, Tejun Heo wrote:
>>> That said, I think it might be better to just remove the alignment parameter
>>> from the macro and force align to PAGE_SIZE.
>>
>> That's not necessarily good.  Two arches to note:
>>
>> 	arch/x86/kernel/vmlinux.lds.S:  PERCPU(THREAD_SIZE)
> 
> I don't think the current percpu allocator honors alignment larger
> than PAGE_SIZE no matter how large the alignment for the percpu output
> section is.  I'll look into it deeper but I think we might just have
> been lucky and the alignment somehow didn't bite us yet.  The only
> user of THREAD_SIZE mask at this point seems to be cpu_init().  Maybe
> we can remove this requirement.  I'll look into it.

Okay, this was added by commit fe8e0c2 in this merge window.  It's
broken and needs to be reverted.  I'll send a patch to revert it.

Thanks.

-- 
tejun


* [PATCH] x86, percpu: revert commit fe8e0c25
  2010-10-26 12:27     ` Tejun Heo
@ 2010-10-26 12:45       ` Tejun Heo
  2010-10-26 13:25         ` Ingo Molnar
  2010-10-26 14:06         ` [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE Tejun Heo
  0 siblings, 2 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 12:45 UTC (permalink / raw)
  To: torvalds, Alexander van Heukelum
  Cc: David Howells, akpm, linux-am33-list, linux-kernel,
	Akira Takeuchi, Mark Salter, Ingo Molnar

Commit fe8e0c25 (x86, 32-bit: Align percpu area and irq stacks to
THREAD_SIZE) aligned PERCPU section to THREAD_SIZE which can be larger
than PAGE_SIZE, introduced DEFINE_PER_CPU_MULTIPAGE_ALIGNED() and used
it to make irq stacks aligned to THREAD_SIZE on x86_32.

This won't work.  The PERCPU output section is used as the template to
prepare the percpu area and the actual percpu area is _always_ aligned
to PAGE_SIZE whether the source area is aligned to larger size or not.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Ingo Molnar <mingo@elte.hu>
---
Sorry about not catching this earlier.  If x86_32 irq stacks need
percpu areas with larger alignment, it needs to implement it itself by
either explicitly allocating more space for padding or allocating
stack area manually for each CPU.  I don't think it is justifiable to
make generic percpu allocator honor alignments larger than PAGE_SIZE
for this case.

Thanks.

 arch/x86/kernel/irq_32.c      |    4 ++--
 arch/x86/kernel/vmlinux.lds.S |    2 +-
 include/linux/percpu-defs.h   |   12 ------------
 3 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 50fbbe6..3b5609f 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -60,8 +60,8 @@ union irq_ctx {
 static DEFINE_PER_CPU(union irq_ctx *, hardirq_ctx);
 static DEFINE_PER_CPU(union irq_ctx *, softirq_ctx);

-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, hardirq_stack, THREAD_SIZE);
-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, softirq_stack, THREAD_SIZE);
+static DEFINE_PER_CPU_PAGE_ALIGNED(union irq_ctx, hardirq_stack);
+static DEFINE_PER_CPU_PAGE_ALIGNED(union irq_ctx, softirq_stack);

 static void call_on_stack(void *func, void *stack)
 {
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index e03530a..38e2b67 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -301,7 +301,7 @@ SECTIONS
 	}

 #if !defined(CONFIG_X86_64) || !defined(CONFIG_SMP)
-	PERCPU(THREAD_SIZE)
+	PERCPU(PAGE_SIZE)
 #endif

 	. = ALIGN(PAGE_SIZE);
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 018db9a..27ef6b1 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -148,18 +148,6 @@
 	DEFINE_PER_CPU_SECTION(type, name, "..readmostly")

 /*
- * Declaration/definition used for large per-CPU variables that must be
- * aligned to something larger than the pagesize.
- */
-#define DECLARE_PER_CPU_MULTIPAGE_ALIGNED(type, name, size)		\
-	DECLARE_PER_CPU_SECTION(type, name, "..page_aligned")		\
-	__aligned(size)
-
-#define DEFINE_PER_CPU_MULTIPAGE_ALIGNED(type, name, size)		\
-	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
-	__aligned(size)
-
-/*
  * Intermodule exports for per-CPU variables.  sparse forgets about
  * address space across EXPORT_SYMBOL(), change EXPORT_SYMBOL() to
  * noop if __CHECKER__.


* Re: [PATCH] x86, percpu: revert commit fe8e0c25
  2010-10-26 12:45       ` [PATCH] x86, percpu: revert commit fe8e0c25 Tejun Heo
@ 2010-10-26 13:25         ` Ingo Molnar
  2010-10-26 13:34           ` Tejun Heo
  2010-10-26 14:06         ` [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE Tejun Heo
  1 sibling, 1 reply; 63+ messages in thread
From: Ingo Molnar @ 2010-10-26 13:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, Alexander van Heukelum, David Howells, akpm,
	linux-am33-list, linux-kernel, Akira Takeuchi, Mark Salter


* Tejun Heo <tj@kernel.org> wrote:

> Commit fe8e0c25 (x86, 32-bit: Align percpu area and irq stacks to THREAD_SIZE) 
> aligned PERCPU section to THREAD_SIZE which can be larger than PAGE_SIZE, 
> introduced DEFINE_PER_CPU_MULTIPAGE_ALIGNED() and used it to make irq stacks 
> aligned to THREAD_SIZE on x86_32.
> 
> This won't work.  The PERCPU output section is used as the template to prepare the 
> percpu area and the actual percpu area is _always_ aligned to PAGE_SIZE whether 
> the source area is aligned to larger size or not.

The problem is, this will reintroduce a nasty boot crash which commit fe8e0c25 
fixed. Do you say that fe8e0c25 didn't have the alignment effect?

	Ingo


* Re: [PATCH] x86, percpu: revert commit fe8e0c25
  2010-10-26 13:25         ` Ingo Molnar
@ 2010-10-26 13:34           ` Tejun Heo
  2010-10-26 13:49             ` Brian Gerst
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 13:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, Alexander van Heukelum, David Howells, akpm,
	linux-am33-list, linux-kernel, Akira Takeuchi, Mark Salter

Hello,

On 10/26/2010 03:25 PM, Ingo Molnar wrote:
> 
> * Tejun Heo <tj@kernel.org> wrote:
> 
>> Commit fe8e0c25 (x86, 32-bit: Align percpu area and irq stacks to THREAD_SIZE) 
>> aligned PERCPU section to THREAD_SIZE which can be larger than PAGE_SIZE, 
>> introduced DEFINE_PER_CPU_MULTIPAGE_ALIGNED() and used it to make irq stacks 
>> aligned to THREAD_SIZE on x86_32.
>>
>> This won't work.  The PERCPU output section is used as the template to prepare the 
>> percpu area and the actual percpu area is _always_ aligned to PAGE_SIZE whether 
>> the source area is aligned to larger size or not.
> 
> The problem is, this will reintroduce a nasty boot crash which commit fe8e0c25 
> fixed. Do you say that fe8e0c25 didn't have the alignment effect?

AFAICS, not in a way which is correct.  The patch probably made the
following two differences.

* The stack in the template area is THREAD_SIZE aligned.  If something
  was dereferencing it before percpu init, this could have helped.
  IIRC, x86 early init code does use the template area.

* The percpu address would be THREAD_SIZE aligned while the translated
  kernel address for each cpu wouldn't be.  For masking stack pointer
  to find out task struct, I don't think aligning the percpu address
  would have been helpful.
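
As a sketch of why page alignment alone is not enough for mask-based lookups,
consider the following userspace model.  The sketch_* names and the sizes are
assumptions made for illustration (a 4096-byte page and a THREAD_SIZE of two
pages); this is not the x86 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u	/* assumed page size */
#define SKETCH_THREAD_SIZE 8192u	/* assumed: two pages per stack */

/* Round a stack pointer down to a THREAD_SIZE boundary, which is how a
 * mask-based lookup derives the base of the stack. */
static uintptr_t sketch_stack_base_from_sp(uintptr_t sp)
{
	return sp & ~((uintptr_t)SKETCH_THREAD_SIZE - 1);
}

/* Does masking a pointer inside the stack recover the real stack base? */
static bool sketch_mask_finds_base(uintptr_t base, uintptr_t offset)
{
	assert(offset < SKETCH_THREAD_SIZE);
	return sketch_stack_base_from_sp(base + offset) == base;
}
```

For a stack based at 0x2000 (THREAD_SIZE aligned), masking any pointer inside
it gives back 0x2000.  For a stack based at 0x3000 (page aligned only), a
pointer at 0x3ff0 masks down to 0x2000, which is not the stack base, so the
lookup would land in unrelated memory.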

It's simply broken and needs to be reverted.  If the patch somehow
fixed boot crash, yeah, we probably want to put a fix for it first
tho.

Thanks.

-- 
tejun


* Re: [PATCH] x86, percpu: revert commit fe8e0c25
  2010-10-26 13:34           ` Tejun Heo
@ 2010-10-26 13:49             ` Brian Gerst
  2010-10-26 15:08               ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Brian Gerst @ 2010-10-26 13:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, torvalds, Alexander van Heukelum, David Howells,
	akpm, linux-am33-list, linux-kernel, Akira Takeuchi, Mark Salter

On Tue, Oct 26, 2010 at 9:34 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On 10/26/2010 03:25 PM, Ingo Molnar wrote:
>>
>> * Tejun Heo <tj@kernel.org> wrote:
>>
>>> Commit fe8e0c25 (x86, 32-bit: Align percpu area and irq stacks to THREAD_SIZE)
>>> aligned PERCPU section to THREAD_SIZE which can be larger than PAGE_SIZE,
>>> introduced DEFINE_PER_CPU_MULTIPAGE_ALIGNED() and used it to make irq stacks
>>> aligned to THREAD_SIZE on x86_32.
>>>
>>> This won't work.  The PERCPU output section is used as the template to prepare the
>>> percpu area and the actual percpu area is _always_ aligned to PAGE_SIZE whether
>>> the source area is aligned to larger size or not.
>>
>> The problem is, this will reintroduce a nasty boot crash which commit fe8e0c25
>> fixed. Do you say that fe8e0c25 didn't have the alignment effect?
>
> AFAICS, not in a way which is correct.  The patch probably made the
> following two differences.
>
> * The stack in the template area is THREAD_SIZE aligned.  If something
>  was dereferencing it before percpu init, this could have helped.
>  IIRC, x86 early init code does use the template area.
>
> * The percpu address would be THREAD_SIZE aligned while the translated
>  kernel address for each cpu wouldn't be.  For masking stack pointer
>  to find out task struct, I don't think aligning the percpu address
>  would have been helpful.
>
> It's simply broken and needs to be reverted.  If the patch somehow
> fixed boot crash, yeah, we probably want to put a fix for it first
> tho.
>
> Thanks.

Probably the best fix is to go back to allocating the stacks with
get_free_pages(), and only keep the pointers in percpu memory.
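
A rough userspace analogy of that shape, for illustration only: the stacks
are allocated separately with the alignment they need, and only the pointers
live in the per-CPU table.  Everything named sketch_* is hypothetical, a
plain array stands in for percpu memory, and C11 aligned_alloc() stands in
for the kernel page allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_THREAD_SIZE 8192u	/* assumed stack size and alignment */
#define SKETCH_NR_CPUS     4		/* assumed CPU count */

/* Stand-in for percpu storage: it holds only pointers, not the stacks. */
static void *sketch_irq_stack[SKETCH_NR_CPUS];

/* Release every stack allocated so far and clear the pointers. */
static void sketch_free_irq_stacks(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		free(sketch_irq_stack[cpu]);
		sketch_irq_stack[cpu] = NULL;
	}
}

/* Allocate one THREAD_SIZE-aligned stack per CPU; 0 on success, -1 on
 * failure (with anything already allocated released again). */
static int sketch_alloc_irq_stacks(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		sketch_irq_stack[cpu] = aligned_alloc(SKETCH_THREAD_SIZE,
						      SKETCH_THREAD_SIZE);
		if (!sketch_irq_stack[cpu]) {
			sketch_free_irq_stacks();
			return -1;
		}
	}
	return 0;
}
```

The point of the design is that the alignment guarantee comes from the
allocation itself, so it no longer depends on how the percpu area happens to
be placed.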

--
Brian Gerst


* [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2010-10-26 12:45       ` [PATCH] x86, percpu: revert commit fe8e0c25 Tejun Heo
  2010-10-26 13:25         ` Ingo Molnar
@ 2010-10-26 14:06         ` Tejun Heo
  2011-03-24  6:46             ` Mike Frysinger
  2011-03-24  8:54           ` [PATCH UPDATED] " Tejun Heo
  1 sibling, 2 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 14:06 UTC (permalink / raw)
  To: torvalds, Alexander van Heukelum
  Cc: David Howells, akpm, linux-am33-list, linux-kernel,
	Akira Takeuchi, Mark Salter, Ingo Molnar, Mike Frysinger,
	uclinux-dist-devel, Jeff Dike, user-mode-linux-devel

The percpu allocator honors alignment requests up to PAGE_SIZE and both the
percpu addresses in the percpu address space and the translated kernel
addresses should be aligned accordingly.  The calculation of the
former depends on the alignment of percpu output section in the kernel
image.

The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section and the latter takes @align parameter.
Several architectures are using @align smaller than PAGE_SIZE breaking
percpu memory alignment.

This patch removes @align parameter from PERCPU(), renames it to
PERCPU_SECTION and makes it always align to PAGE_SIZE.  While at it,
add PCPU_SETUP_BUG_ON() checks such that alignment problems are
reliably detected and remove percpu alignment comment recently added
in workqueue.c as the condition would trigger BUG way before reaching
there.

For blackfin, frv and um, this patch raises the alignment of percpu
area.  As the area is in .init, there shouldn't be any noticeable
difference.

This problem was discovered by David Howells while debugging boot
failure on mn10300.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
---
Linus, please don't apply the previous revert and this patch yet.
Ingo, with mn10300 fixed, I don't think these patches are urgent.
I'll hold onto revert patch and this one for now and push them to
Linus after irqstacks is fixed on x86_32.

Thanks.

 arch/alpha/kernel/vmlinux.lds.S    |    2 +-
 arch/arm/kernel/vmlinux.lds.S      |    2 +-
 arch/blackfin/kernel/vmlinux.lds.S |    2 +-
 arch/cris/kernel/vmlinux.lds.S     |    2 +-
 arch/frv/kernel/vmlinux.lds.S      |    2 +-
 arch/m32r/kernel/vmlinux.lds.S     |    2 +-
 arch/mips/kernel/vmlinux.lds.S     |    2 +-
 arch/mn10300/kernel/vmlinux.lds.S  |    2 +-
 arch/parisc/kernel/vmlinux.lds.S   |    2 +-
 arch/powerpc/kernel/vmlinux.lds.S  |    2 +-
 arch/s390/kernel/vmlinux.lds.S     |    2 +-
 arch/sh/kernel/vmlinux.lds.S       |    2 +-
 arch/sparc/kernel/vmlinux.lds.S    |    2 +-
 arch/tile/kernel/vmlinux.lds.S     |    2 +-
 arch/um/include/asm/common.lds.S   |    2 +-
 arch/x86/kernel/vmlinux.lds.S      |    2 +-
 arch/xtensa/kernel/vmlinux.lds.S   |    2 +-
 include/asm-generic/vmlinux.lds.h  |   19 +++++++++----------
 kernel/workqueue.c                 |    4 +---
 mm/percpu.c                        |    2 ++
 20 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index 003ef4c..92d6d88 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -38,7 +38,7 @@ SECTIONS
 	__init_begin = ALIGN(PAGE_SIZE);
 	INIT_TEXT_SECTION(PAGE_SIZE)
 	INIT_DATA_SECTION(16)
-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION
 	/* Align to THREAD_SIZE rather than PAGE_SIZE here so any padding page
 	   needed for the THREAD_SIZE aligned init_task gets freed after init */
 	. = ALIGN(THREAD_SIZE);
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 1953e3d..092d83a 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -70,7 +70,7 @@ SECTIONS
 #endif
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION

 #ifndef CONFIG_XIP_KERNEL
 	. = ALIGN(PAGE_SIZE);
diff --git a/arch/blackfin/kernel/vmlinux.lds.S b/arch/blackfin/kernel/vmlinux.lds.S
index 4122678..d188a7e 100644
--- a/arch/blackfin/kernel/vmlinux.lds.S
+++ b/arch/blackfin/kernel/vmlinux.lds.S
@@ -136,7 +136,7 @@ SECTIONS

 	. = ALIGN(16);
 	INIT_DATA_SECTION(16)
-	PERCPU(4)
+	PERCPU_SECTION

 	.exit.data :
 	{
diff --git a/arch/cris/kernel/vmlinux.lds.S b/arch/cris/kernel/vmlinux.lds.S
index 4422189..cbdc41d 100644
--- a/arch/cris/kernel/vmlinux.lds.S
+++ b/arch/cris/kernel/vmlinux.lds.S
@@ -107,7 +107,7 @@ SECTIONS
 #endif
 	__vmlinux_end = .;		/* Last address of the physical file. */
 #ifdef CONFIG_ETRAX_ARCH_V32
-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION

 	.init.ramfs : {
 		INIT_RAM_FS
diff --git a/arch/frv/kernel/vmlinux.lds.S b/arch/frv/kernel/vmlinux.lds.S
index 8b973f3..43e59e6 100644
--- a/arch/frv/kernel/vmlinux.lds.S
+++ b/arch/frv/kernel/vmlinux.lds.S
@@ -37,7 +37,7 @@ SECTIONS
   _einittext = .;

   INIT_DATA_SECTION(8)
-  PERCPU(4096)
+  PERCPU_SECTION

   . = ALIGN(PAGE_SIZE);
   __init_end = .;
diff --git a/arch/m32r/kernel/vmlinux.lds.S b/arch/m32r/kernel/vmlinux.lds.S
index 7da94ea..a676ecb 100644
--- a/arch/m32r/kernel/vmlinux.lds.S
+++ b/arch/m32r/kernel/vmlinux.lds.S
@@ -53,7 +53,7 @@ SECTIONS
   __init_begin = .;
   INIT_TEXT_SECTION(PAGE_SIZE)
   INIT_DATA_SECTION(16)
-  PERCPU(PAGE_SIZE)
+  PERCPU_SECTION
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff --git a/arch/mips/kernel/vmlinux.lds.S b/arch/mips/kernel/vmlinux.lds.S
index f25df73..6fcf586 100644
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -108,7 +108,7 @@ SECTIONS
 		EXIT_DATA
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
 	/* freed after init ends here */
diff --git a/arch/mn10300/kernel/vmlinux.lds.S b/arch/mn10300/kernel/vmlinux.lds.S
index febbeee..2e7ecf9 100644
--- a/arch/mn10300/kernel/vmlinux.lds.S
+++ b/arch/mn10300/kernel/vmlinux.lds.S
@@ -70,7 +70,7 @@ SECTIONS
 	.exit.text : { EXIT_TEXT; }
 	.exit.data : { EXIT_DATA; }

-  PERCPU(PAGE_SIZE)
+  PERCPU_SECTION
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S
index d64a6bb..8573902 100644
--- a/arch/parisc/kernel/vmlinux.lds.S
+++ b/arch/parisc/kernel/vmlinux.lds.S
@@ -145,7 +145,7 @@ SECTIONS
 		EXIT_DATA
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
 	/* freed after init ends here */
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 8a0deef..e17d0b6 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -160,7 +160,7 @@ SECTIONS
 		INIT_RAM_FS
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION

 	. = ALIGN(8);
 	.machine.desc : AT(ADDR(.machine.desc) - LOAD_OFFSET) {
diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index a68ac10..b7f26f8 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -77,7 +77,7 @@ SECTIONS
 	. = ALIGN(PAGE_SIZE);
 	INIT_DATA_SECTION(0x100)

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;		/* freed after init ends here */

diff --git a/arch/sh/kernel/vmlinux.lds.S b/arch/sh/kernel/vmlinux.lds.S
index 7f8a709..f08347a 100644
--- a/arch/sh/kernel/vmlinux.lds.S
+++ b/arch/sh/kernel/vmlinux.lds.S
@@ -66,7 +66,7 @@ SECTIONS
 		__machvec_end = .;
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION

 	/*
 	 * .exit.text is discarded at runtime, not link time, to deal with
diff --git a/arch/sparc/kernel/vmlinux.lds.S b/arch/sparc/kernel/vmlinux.lds.S
index 0c1e678..c1ea7c4 100644
--- a/arch/sparc/kernel/vmlinux.lds.S
+++ b/arch/sparc/kernel/vmlinux.lds.S
@@ -108,7 +108,7 @@ SECTIONS
 		__sun4v_2insn_patch_end = .;
 	}

-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION

 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
diff --git a/arch/tile/kernel/vmlinux.lds.S b/arch/tile/kernel/vmlinux.lds.S
index 25fdc0c..adb8168 100644
--- a/arch/tile/kernel/vmlinux.lds.S
+++ b/arch/tile/kernel/vmlinux.lds.S
@@ -63,7 +63,7 @@ SECTIONS
     *(.init.page)
   } :data =0
   INIT_DATA_SECTION(16)
-  PERCPU(PAGE_SIZE)
+  PERCPU_SECTION
   . = ALIGN(PAGE_SIZE);
   VMLINUX_SYMBOL(_einitdata) = .;

diff --git a/arch/um/include/asm/common.lds.S b/arch/um/include/asm/common.lds.S
index ac55b9e..e8bfd8a 100644
--- a/arch/um/include/asm/common.lds.S
+++ b/arch/um/include/asm/common.lds.S
@@ -42,7 +42,7 @@
 	INIT_SETUP(0)
   }

-  PERCPU(32)
+  PERCPU_SECTION
 	
   .initcall.init : {
 	INIT_CALLS
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 38e2b67..6cfd560 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -301,7 +301,7 @@ SECTIONS
 	}

 #if !defined(CONFIG_X86_64) || !defined(CONFIG_SMP)
-	PERCPU(PAGE_SIZE)
+	PERCPU_SECTION
 #endif

 	. = ALIGN(PAGE_SIZE);
diff --git a/arch/xtensa/kernel/vmlinux.lds.S b/arch/xtensa/kernel/vmlinux.lds.S
index 9b52615..2520ac8 100644
--- a/arch/xtensa/kernel/vmlinux.lds.S
+++ b/arch/xtensa/kernel/vmlinux.lds.S
@@ -155,7 +155,7 @@ SECTIONS
     INIT_RAM_FS
   }

-  PERCPU(PAGE_SIZE)
+  PERCPU_SECTION

   /* We need this dummy segment here */

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index f4229fb..f708bdd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -15,7 +15,7 @@
  *	HEAD_TEXT_SECTION
  *	INIT_TEXT_SECTION(PAGE_SIZE)
  *	INIT_DATA_SECTION(...)
- *	PERCPU(PAGE_SIZE)
+ *	PERCPU_SECTION
  *	__init_end = .;
  *
  *	_stext = .;
@@ -679,7 +679,7 @@
  *
  * Note that this macros defines __per_cpu_load as an absolute symbol.
  * If there is no need to put the percpu section at a predetermined
- * address, use PERCPU().
+ * address, use PERCPU_SECTION.
  */
 #define PERCPU_VADDR(vaddr, phdr)					\
 	VMLINUX_SYMBOL(__per_cpu_load) = .;				\
@@ -697,20 +697,19 @@
 	. = VMLINUX_SYMBOL(__per_cpu_load) + SIZEOF(.data..percpu);

 /**
- * PERCPU - define output section for percpu area, simple version
- * @align: required alignment
+ * PERCPU_SECTION - define output section for percpu area, simple version
  *
- * Align to @align and outputs output section for percpu area.  This
- * macro doesn't maniuplate @vaddr or @phdr and __per_cpu_load and
+ * Align to PAGE_SIZE and output section for percpu area.  This macro
+ * doesn't manipulate @vaddr or @phdr and __per_cpu_load and
  * __per_cpu_start will be identical.
  *
- * This macro is equivalent to ALIGN(align); PERCPU_VADDR( , ) except
- * that __per_cpu_load is defined as a relative symbol against
+ * This macro is equivalent to ALIGN(PAGE_SIZE); PERCPU_VADDR( , )
+ * except that __per_cpu_load is defined as a relative symbol against
  * .data..percpu which is required for relocatable x86_32
  * configuration.
  */
-#define PERCPU(align)							\
-	. = ALIGN(align);						\
+#define PERCPU_SECTION							\
+	. = ALIGN(PAGE_SIZE);						\
 	.data..percpu	: AT(ADDR(.data..percpu) - LOAD_OFFSET) {	\
 		VMLINUX_SYMBOL(__per_cpu_load) = .;			\
 		VMLINUX_SYMBOL(__per_cpu_start) = .;			\
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e5ff2cb..30acdb7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2791,9 +2791,7 @@ static int alloc_cwqs(struct workqueue_struct *wq)
 		}
 	}

-	/* just in case, make sure it's actually aligned
-	 * - this is affected by PERCPU() alignment in vmlinux.lds.S
-	 */
+	/* just in case, make sure it's actually aligned */
 	BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
 	return wq->cpu_wq.v ? 0 : -ENOMEM;
 }
diff --git a/mm/percpu.c b/mm/percpu.c
index efe8168..1aa65d3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1220,8 +1220,10 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 #ifdef CONFIG_SMP
 	PCPU_SETUP_BUG_ON(!ai->static_size);
+	PCPU_SETUP_BUG_ON((unsigned long)__per_cpu_start & ~PAGE_MASK);
 #endif
 	PCPU_SETUP_BUG_ON(!base_addr);
+	PCPU_SETUP_BUG_ON((unsigned long)base_addr & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
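
The two checks added to pcpu_setup_first_chunk() use the
"addr & ~PAGE_MASK" idiom: PAGE_MASK clears the offset-within-page bits,
so its complement isolates them, and any non-zero result means the
address is not page aligned.  A minimal userspace sketch, assuming 4 KiB
pages; is_page_aligned() is a hypothetical helper, not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical helper: true when addr sits exactly on a page boundary. */
static bool is_page_aligned(uintptr_t addr)
{
	/* ~PAGE_MASK keeps only the offset-within-page bits. */
	return (addr & ~PAGE_MASK) == 0;
}
```

With these definitions 0x90277000 passes the check, while 0x902776e0
(the pointer value from the original report) does not.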


* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-26 10:22 ` David Howells
  2010-10-26 12:14   ` Tejun Heo
@ 2010-10-26 14:50   ` David Howells
  2010-10-26 14:56     ` Tejun Heo
  1 sibling, 1 reply; 63+ messages in thread
From: David Howells @ 2010-10-26 14:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, torvalds, akpm, linux-am33-list, linux-kernel,
	Akira Takeuchi, Mark Salter

Tejun Heo <tj@kernel.org> wrote:

> Ah, I see now.  The actual areas are properly aligned but the percpu
> address is determined as offset from the percpu output section base so
> the percpu pointers in the percpu address space end up misaligned with
> the actual kernel addresses and the code in workqueue checks the
> address in percpu AS, so, yeap, it's caused by the misalignment of the
> percpu section. 

Okay, I see that.

> > FRV's page size is 16KB, so on that we really don't want it to be
> > PAGE_SIZE.
> 
> Why not?  It's in the init section which will be freed anyway and with
> the kernel image compression it's not even gonna add any noticeable
> amount to the kernel image size.  There isn't any benefit in using
> anything smaller than PAGE_SIZE for alignment.  Also, percpu allocator
> guarantees alignment requirement up to PAGE_SIZE is honored.  If the
> output section uses smaller alignment, the percpu AS will end up being
> misaligned.

The bootloader we have doesn't do decompression.  On the other hand, does the
PERCPU stuff need to be allocated space in the image by the linker?  Can it be
initialised to anything other than zeros?

David
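
The effect Tejun describes can be reproduced with plain arithmetic.  The
sketch below uses the pointer value (0x902776e0) and alignment (0x100)
reported at the start of the thread; the section base 0x902770e0, the
unit base 0x90300000 and the offset 0x600 are made-up values, chosen
only so that the arithmetic reproduces the reported pointer.  All helper
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Power-of-two alignment test, the same idea as the kernel's IS_ALIGNED(). */
static bool is_aligned(uintptr_t x, uintptr_t align)
{
	return (x & (align - 1)) == 0;
}

/* Address in the percpu address space: output section base plus offset. */
static uintptr_t percpu_ptr(uintptr_t per_cpu_start, uintptr_t offset)
{
	return per_cpu_start + offset;
}

/* Address that actually backs the object inside one CPU's unit. */
static uintptr_t unit_addr(uintptr_t unit_base, uintptr_t offset)
{
	return unit_base + offset;
}
```

With a page-aligned unit base, unit_addr(0x90300000, 0x600) is 256-byte
aligned, but percpu_ptr(0x902770e0, 0x600) gives 0x902776e0, which is
not.  If the section base were page aligned instead (say 0x90277000),
the same offset would yield an aligned percpu pointer as well.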


* Re: [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues
  2010-10-26 14:50   ` [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
@ 2010-10-26 14:56     ` Tejun Heo
  0 siblings, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-26 14:56 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, linux-am33-list, linux-kernel, Akira Takeuchi,
	Mark Salter

Hello, David.

On 10/26/2010 04:50 PM, David Howells wrote:
>>> FRV's page size is 16KB, so on that we really don't want it to be
>>> PAGE_SIZE.
>>
>> Why not?  It's in the init section which will be freed anyway and with
>> the kernel image compression it's not even gonna add any noticeable
>> amount to the kernel image size.  There isn't any benefit in using
>> anything smaller than PAGE_SIZE for alignment.  Also, percpu allocator
>> guarantees alignment requirement up to PAGE_SIZE is honored.  If the
>> output section uses smaller alignment, the percpu AS will end up being
>> misaligned.
> 
> The bootloader we have doesn't do decompression.

I see.  :-(

> On the other hand, does the PERCPU stuff need to be allocated space
> in the image by the linker?  Can it be initialised to anything other
> than zeros?

Sure, for example, in kernel/timer.c.

  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;

It doesn't seem too popular at this point tho.

Thanks.

-- 
tejun
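
Why a non-zero initialiser needs space in the image can be sketched in
userspace: the linked percpu section acts as a template that is copied
into each CPU's unit at boot, so any initial value other than zero has
to be stored somewhere.  This is a simplified model, not the kernel's
code; the struct, the unit count and setup_units() are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_UNITS 4	/* hypothetical number of CPUs */

/* Stand-in for the contents of the linked percpu section. */
struct percpu_image {
	int counter;		/* zero-initialised member */
	const char *name;	/* member with a non-zero initial value */
};

/* The template that must be present in the kernel image. */
static const struct percpu_image initial_image = { 0, "boot" };

/* One private copy per CPU. */
static struct percpu_image units[NR_UNITS];

/* Copy the template into every unit, as first-chunk setup would. */
static void setup_units(void)
{
	for (size_t cpu = 0; cpu < NR_UNITS; cpu++)
		memcpy(&units[cpu], &initial_image, sizeof(initial_image));
}
```

After setup_units() every unit sees name == "boot".  Had the section
been treated as all zeros, that value would be lost.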


* Re: [PATCH] x86, percpu: revert commit fe8e0c25
  2010-10-26 13:49             ` Brian Gerst
@ 2010-10-26 15:08               ` Linus Torvalds
  2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2010-10-26 15:08 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Tejun Heo, Ingo Molnar, Alexander van Heukelum, David Howells,
	akpm, linux-am33-list, linux-kernel, Akira Takeuchi, Mark Salter

On Tue, Oct 26, 2010 at 6:49 AM, Brian Gerst <brgerst@gmail.com> wrote:
>
> Probably the best fix is to go back to allocating the stacks with
> get_free_pages(), and only keep the pointers in percpu memory.

Yes, please do. The "irq stack in percpu area" upsides are almost zero
afaik, and the complexity required for it has been ridiculous.

                                 Linus


* [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-26 15:08               ` Linus Torvalds
@ 2010-10-27  5:43                 ` Brian Gerst
  2010-10-27  6:07                   ` Eric Dumazet
                                     ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Brian Gerst @ 2010-10-27  5:43 UTC (permalink / raw)
  To: tj; +Cc: x86, linux-kernel, torvalds, mingo

The percpu allocator cannot handle alignments larger than one page.
Allocate the irq stacks separately, and only keep the pointers as
percpu data.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/include/asm/irq.h |    2 --
 arch/x86/kernel/irq_32.c   |   12 ++----------
 arch/x86/kernel/smpboot.c  |    1 -
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 0bf5b00..13b0eba 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -21,10 +21,8 @@ static inline int irq_canonicalize(int irq)
 
 #ifdef CONFIG_X86_32
 extern void irq_ctx_init(int cpu);
-extern void irq_ctx_exit(int cpu);
 #else
 # define irq_ctx_init(cpu) do { } while (0)
-# define irq_ctx_exit(cpu) do { } while (0)
 #endif
 
 #define __ARCH_HAS_DO_SOFTIRQ
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 50fbbe6..64668db 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -60,9 +60,6 @@ union irq_ctx {
 static DEFINE_PER_CPU(union irq_ctx *, hardirq_ctx);
 static DEFINE_PER_CPU(union irq_ctx *, softirq_ctx);
 
-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, hardirq_stack, THREAD_SIZE);
-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, softirq_stack, THREAD_SIZE);
-
 static void call_on_stack(void *func, void *stack)
 {
 	asm volatile("xchgl	%%ebx,%%esp	\n"
@@ -128,7 +125,7 @@ void __cpuinit irq_ctx_init(int cpu)
 	if (per_cpu(hardirq_ctx, cpu))
 		return;
 
-	irqctx = &per_cpu(hardirq_stack, cpu);
+	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -137,7 +134,7 @@ void __cpuinit irq_ctx_init(int cpu)
 
 	per_cpu(hardirq_ctx, cpu) = irqctx;
 
-	irqctx = &per_cpu(softirq_stack, cpu);
+	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -150,11 +147,6 @@ void __cpuinit irq_ctx_init(int cpu)
 	       cpu, per_cpu(hardirq_ctx, cpu),  per_cpu(softirq_ctx, cpu));
 }
 
-void irq_ctx_exit(int cpu)
-{
-	per_cpu(hardirq_ctx, cpu) = NULL;
-}
-
 asmlinkage void do_softirq(void)
 {
 	unsigned long flags;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6af1185..90baf56 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1373,7 +1373,6 @@ void play_dead_common(void)
 {
 	idle_task_exit();
 	reset_lazy_tlbstate();
-	irq_ctx_exit(raw_smp_processor_id());
 	c1e_remove_cpu(raw_smp_processor_id());
 
 	mb();
-- 
1.7.2.3



* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
@ 2010-10-27  6:07                   ` Eric Dumazet
  2010-10-27  9:57                     ` Peter Zijlstra
  2010-10-27 15:19                   ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Linus Torvalds
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27  6:07 UTC (permalink / raw)
  To: Brian Gerst; +Cc: tj, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 01:43 -0400, Brian Gerst wrote:
> The percpu allocator cannot handle alignments larger than one page.
> Allocate the irq stacks separately, and only keep the pointers as
> percpu data.
> 
> Signed-off-by: Brian Gerst <brgerst@gmail.com>

>  
> -	irqctx = &per_cpu(hardirq_stack, cpu);
> +	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);

Hmm, then we lose NUMA affinity for stacks.





* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27  6:07                   ` Eric Dumazet
@ 2010-10-27  9:57                     ` Peter Zijlstra
  2010-10-27 13:33                       ` Eric Dumazet
  2010-10-28 14:40                       ` [PATCH] x86-32: NUMA irq stacks allocations Eric Dumazet
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2010-10-27  9:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Brian Gerst, tj, x86, linux-kernel, torvalds, mingo

On Wed, 2010-10-27 at 08:07 +0200, Eric Dumazet wrote:
> > -     irqctx = &per_cpu(hardirq_stack, cpu);
> > +     irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
> 
> Hmm, then we lose NUMA affinity for stacks. 

I guess we could use:

  alloc_pages_node(cpu_to_node(cpu), THREAD_FLAGS, THREAD_ORDER);




* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27  9:57                     ` Peter Zijlstra
@ 2010-10-27 13:33                       ` Eric Dumazet
  2010-10-27 13:42                         ` Tejun Heo
  2010-10-28 14:40                       ` [PATCH] x86-32: NUMA irq stacks allocations Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 13:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Brian Gerst, tj, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 11:57 +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-27 at 08:07 +0200, Eric Dumazet wrote:
> > > -     irqctx = &per_cpu(hardirq_stack, cpu);
> > > +     irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
> > 
> > Hmm, then we lose NUMA affinity for stacks. 
> 
> I guess we could use:
> 
>   alloc_pages_node(cpu_to_node(cpu), THREAD_FLAGS, THREAD_ORDER);
> 
> 

Anyway, I just discovered that per_cpu data on my (NUMA-capable) machine
all sits on a single node when a 32-bit kernel is used.

# cat /proc/buddyinfo 
Node 0, zone      DMA      0      1      0      1      2      1      1      0      1      1      3 
Node 0, zone   Normal     94    251     81     16      3      2      1      2      1      2    187 
Node 0, zone  HighMem    113     88     47     36     18      5      4      3      2      0    268 
Node 1, zone  HighMem    154     97     43     16      9      4      3      2      3      2    482 

# dmesg | grep pcpu
[    0.000000] pcpu-alloc: s41920 r0 d23616 u65536 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 


dual socket machine (E5540  @ 2.53GHz), total of 8 cores, 16 threads.

# dmesg | grep irqstack
[    0.000000] CPU 0 irqstacks, hard=f4a00000 soft=f4a02000
[    0.173397] CPU 1 irqstacks, hard=f4a10000 soft=f4a12000
[    0.284939] CPU 2 irqstacks, hard=f4a20000 soft=f4a22000
[    0.392908] CPU 3 irqstacks, hard=f4a30000 soft=f4a32000
[    0.500757] CPU 4 irqstacks, hard=f4a40000 soft=f4a42000
[    0.608713] CPU 5 irqstacks, hard=f4a50000 soft=f4a52000
[    0.716665] CPU 6 irqstacks, hard=f4a60000 soft=f4a62000
[    0.828668] CPU 7 irqstacks, hard=f4a70000 soft=f4a72000
[    0.936555] CPU 8 irqstacks, hard=f4a80000 soft=f4a82000
[    1.044525] CPU 9 irqstacks, hard=f4a90000 soft=f4a92000
[    1.152470] CPU 10 irqstacks, hard=f4aa0000 soft=f4aa2000
[    1.260367] CPU 11 irqstacks, hard=f4ab0000 soft=f4ab2000
[    1.368313] CPU 12 irqstacks, hard=f4ac0000 soft=f4ac2000
[    1.476313] CPU 13 irqstacks, hard=f4ad0000 soft=f4ad2000
[    1.584167] CPU 14 irqstacks, hard=f4ae0000 soft=f4ae2000
[    1.692222] CPU 15 irqstacks, hard=f4af0000 soft=f4af2000


With a 64-bit kernel it's fine:

[    0.000000] pcpu-alloc: s76992 r8192 d21312 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 02 04 06 08 10 12 14 17 19 21 23 25 27 29 31 
[    0.000000] pcpu-alloc: [1] 01 03 05 07 09 11 13 15 16 18 20 22 24 26 28 30 

I presume node 1 having only HighMem could be the reason?
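
The numbers in the pcpu-alloc lines above can be cross-checked with a
little arithmetic.  The sketch reads s, r, d and u as static, reserved,
dynamic and unit size (my reading of the message format, so treat it as
an assumption) and applies the kind of checks pcpu_setup_first_chunk()
makes: the unit must hold the sum of the three areas and be a whole
number of pages.  4 KiB pages and both helper names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_DEMO 4096UL	/* assumption: 4 KiB pages */

/* Space one unit must provide: static + reserved + dynamic areas. */
static size_t pcpu_size_sum(size_t s, size_t r, size_t d)
{
	return s + r + d;
}

/* A unit is acceptable if it holds size_sum and is page granular. */
static bool pcpu_unit_ok(size_t s, size_t r, size_t d, size_t u)
{
	return u >= pcpu_size_sum(s, r, d) && (u % PAGE_SIZE_DEMO) == 0;
}
```

For the 32-bit boot, 41920 + 0 + 23616 is exactly the 65536-byte unit,
and 32 such units fill the single 2097152-byte allocation.  For the
64-bit boot the sum is 106496, which fits in the 131072-byte unit, and
16 units per group again give 2097152 bytes.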





* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 13:33                       ` Eric Dumazet
@ 2010-10-27 13:42                         ` Tejun Heo
  2010-10-27 13:57                           ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 13:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

Hello,

On 10/27/2010 03:33 PM, Eric Dumazet wrote:
> On Wednesday, 27 October 2010 at 11:57 +0200, Peter Zijlstra wrote:
>> On Wed, 2010-10-27 at 08:07 +0200, Eric Dumazet wrote:
>>>> -     irqctx = &per_cpu(hardirq_stack, cpu);
>>>> +     irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
>>>
>>> Hmm, then we lose NUMA affinity for stacks. 
>>
>> I guess we could use:
>>
>>   alloc_pages_node(cpu_to_node(cpu), THREAD_FLAGS, THREAD_ORDER);
>>
>>
> 
> Anyway, I just discovered that per_cpu data on my (NUMA-capable) machine
> all sits on a single node when a 32-bit kernel is used.
> 
> # cat /proc/buddyinfo 
> Node 0, zone      DMA      0      1      0      1      2      1      1      0      1      1      3 
> Node 0, zone   Normal     94    251     81     16      3      2      1      2      1      2    187 
> Node 0, zone  HighMem    113     88     47     36     18      5      4      3      2      0    268 
> Node 1, zone  HighMem    154     97     43     16      9      4      3      2      3      2    482 
...
> 
> I presume node 1 having only HighMem could be the reason?

What does cpu_to_node() on each cpu say?  Also, do you know why
num_possible_cpus() is 32, not 16?

Thanks.

-- 
tejun


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 13:42                         ` Tejun Heo
@ 2010-10-27 13:57                           ` Eric Dumazet
  2010-10-27 14:00                             ` Tejun Heo
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 13:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 15:42 +0200, Tejun Heo wrote:
> Hello,
> 
> On 10/27/2010 03:33 PM, Eric Dumazet wrote:
> > On Wednesday, 27 October 2010 at 11:57 +0200, Peter Zijlstra wrote:
> >> On Wed, 2010-10-27 at 08:07 +0200, Eric Dumazet wrote:
> >>>> -     irqctx = &per_cpu(hardirq_stack, cpu);
> >>>> +     irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
> >>>
> >>> Hmm, then we lose NUMA affinity for stacks. 
> >>
> >> I guess we could use:
> >>
> >>   alloc_pages_node(cpu_to_node(cpu), THREAD_FLAGS, THREAD_ORDER);
> >>
> >>
> > 
> > Anyway, I just discovered that per_cpu data on my (NUMA-capable) machine
> > all sits on a single node when a 32-bit kernel is used.
> > 
> > # cat /proc/buddyinfo 
> > Node 0, zone      DMA      0      1      0      1      2      1      1      0      1      1      3 
> > Node 0, zone   Normal     94    251     81     16      3      2      1      2      1      2    187 
> > Node 0, zone  HighMem    113     88     47     36     18      5      4      3      2      0    268 
> > Node 1, zone  HighMem    154     97     43     16      9      4      3      2      3      2    482 
> ...
> > 
> > I presume node 1 having only HighMem could be the reason?
> 
> What does cpu_to_node() on each cpu say?  Also, do you know why
> num_possible_cpus() is 32, not 16?
> 

I don't know; the machine is an HP ProLiant BL460c G6.
[    0.000000] SMP: Allowing 32 CPUs, 16 hotplug CPUs

for_each_possible_cpu(cpu) {
	pr_err("cpu=%d node=%d\n", cpu, cpu_to_node(cpu));
}

cpu=0 node=1
cpu=1 node=0
cpu=2 node=1
cpu=3 node=0
cpu=4 node=1
cpu=5 node=0
cpu=6 node=1
cpu=7 node=0
cpu=8 node=1
cpu=9 node=0
cpu=10 node=1
cpu=11 node=0
cpu=12 node=1
cpu=13 node=0
cpu=14 node=1
cpu=15 node=0
cpu=16 node=0
cpu=17 node=0
cpu=18 node=0
cpu=19 node=0
cpu=20 node=0
cpu=21 node=0
cpu=22 node=0
cpu=23 node=0
cpu=24 node=0
cpu=25 node=0
cpu=26 node=0
cpu=27 node=0
cpu=28 node=0
cpu=29 node=0
cpu=30 node=0
cpu=31 node=0






* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 13:57                           ` Eric Dumazet
@ 2010-10-27 14:00                             ` Tejun Heo
  2010-10-27 14:24                               ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 14:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On 10/27/2010 03:57 PM, Eric Dumazet wrote:
>> What does cpu_to_node() on each cpu say?  Also, do you know why
>> num_possible_cpus() is 32, not 16?
>>
> 
> I don't know; the machine is an HP ProLiant BL460c G6.
> [    0.000000] SMP: Allowing 32 CPUs, 16 hotplug CPUs
> 
> for_each_possible_cpu(cpu) {
> 	pr_err("cpu=%d node=%d\n", cpu, cpu_to_node(cpu));
> }
> 
> cpu=0 node=1
> cpu=1 node=0
> cpu=2 node=1
> cpu=3 node=0
> cpu=4 node=1
> cpu=5 node=0
> cpu=6 node=1
> cpu=7 node=0
> cpu=8 node=1
> cpu=9 node=0
> cpu=10 node=1
> cpu=11 node=0
> cpu=12 node=1
> cpu=13 node=0
> cpu=14 node=1
> cpu=15 node=0
> cpu=16 node=0
> cpu=17 node=0
> cpu=18 node=0
> cpu=19 node=0
> cpu=20 node=0
> cpu=21 node=0
> cpu=22 node=0
> cpu=23 node=0
> cpu=24 node=0
> cpu=25 node=0
> cpu=26 node=0
> cpu=27 node=0
> cpu=28 node=0
> cpu=29 node=0
> cpu=30 node=0
> cpu=31 node=0

Heh, interesting table.  What does the same code say on 64bit?  Is it
the same?

-- 
tejun


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 14:00                             ` Tejun Heo
@ 2010-10-27 14:24                               ` Eric Dumazet
  2010-10-27 14:39                                 ` Tejun Heo
  2010-10-27 14:39                                 ` Eric Dumazet
  0 siblings, 2 replies; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 14:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 16:00 +0200, Tejun Heo wrote:

> Heh, interesting table.  What does the same code say on 64bit?  Is it
> the same?
> 

Yes this is the same

32bit : # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 1 3 5 7 9 11 13 15
node 0 size: 2047 MB
node 0 free: 144 MB
node 1 cpus: 0 2 4 6 8 10 12 14
node 1 size: 2038 MB
node 1 free: 197 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 


64bit : # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 1 3 5 7 9 11 13 15
node 0 size: 2047 MB
node 0 free: 1868 MB
node 1 cpus: 0 2 4 6 8 10 12 14
node 1 size: 2038 MB
node 1 free: 1912 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
# cat /proc/buddyinfo 
Node 0, zone      DMA      1      0      1      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32    454    206     93     15      3      1      1      2      1      3    460 
Node 1, zone    DMA32      3     19     17      1      0      1      1      1      1      1    380 
Node 1, zone   Normal     89     67     15      6      0      2      2      0      0      3     95 




* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 14:24                               ` Eric Dumazet
@ 2010-10-27 14:39                                 ` Tejun Heo
  2010-10-27 14:39                                 ` Eric Dumazet
  1 sibling, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 14:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On 10/27/2010 04:24 PM, Eric Dumazet wrote:
> On Wednesday, 27 October 2010 at 16:00 +0200, Tejun Heo wrote:
> 
>> Heh, interesting table.  What does the same code say on 64bit?  Is it
>> the same?
>>
> 
> Yes this is the same

Weird, then why did the percpu code interleave cpus 16-31 between
node 0 and 1?  Percpu layout code tries pretty hard to group cpus into
percpu units according to the NUMA mapping, but if the nodes are so
unbalanced that doing so would waste too much address space, it gives
up.  I _think_ that's what happened with the weird 24:8 NUMA split on
32bit.  So, the interesting part is cpus 16-31, not 0-15.

-- 
tejun
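
The 24:8 split can be checked by tallying the cpu-to-node tables posted
in this thread.  In the sketch below node_32bit() encodes the 32-bit
table (cpus 0-15 alternate between node 1 and node 0, cpus 16-31 all
report node 0) and node_64bit() encodes the 64-bit table posted by Eric
(cpus 16-31 alternate as well).  The function names are hypothetical and
the mappings are transcribed from the listings, not queried from a
kernel.

```c
#include <assert.h>

#define NR_POSSIBLE 32

/* 32-bit table: even cpus below 16 on node 1, everything else on node 0. */
static int node_32bit(int cpu)
{
	return (cpu < 16 && cpu % 2 == 0) ? 1 : 0;
}

/* 64-bit table: as above for cpus 0-15; for 16-31 odd cpus are on node 1. */
static int node_64bit(int cpu)
{
	if (cpu < 16)
		return cpu % 2 == 0 ? 1 : 0;
	return cpu % 2;
}

/* Count how many possible cpus a given mapping places on one node. */
static int cpus_on_node(int (*cpu_to_node_fn)(int), int node)
{
	int count = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE; cpu++)
		if (cpu_to_node_fn(cpu) == node)
			count++;
	return count;
}
```

The 32-bit mapping puts 24 cpus on node 0 and 8 on node 1, while the
64-bit mapping gives an even 16:16, which matches the two balanced
groups in the 64-bit pcpu-alloc output.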


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 14:24                               ` Eric Dumazet
  2010-10-27 14:39                                 ` Tejun Heo
@ 2010-10-27 14:39                                 ` Eric Dumazet
  2010-10-27 14:43                                   ` Tejun Heo
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 14:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 16:24 +0200, Eric Dumazet wrote:
> On Wednesday, 27 October 2010 at 16:00 +0200, Tejun Heo wrote:
> 
> > Heh, interesting table.  What does the same code say on 64bit?  Is it
> > the same?
> > 
> 
> Yes this is the same

Oops sorry :!)

On 64bit kernel, the 16 'possible but not online' cpus are not on node
0, but balanced between two nodes.

cpu=0 node=1
cpu=1 node=0
cpu=2 node=1
cpu=3 node=0
cpu=4 node=1
cpu=5 node=0
cpu=6 node=1
cpu=7 node=0
cpu=8 node=1
cpu=9 node=0
cpu=10 node=1
cpu=11 node=0
cpu=12 node=1
cpu=13 node=0
cpu=14 node=1
cpu=15 node=0
cpu=16 node=0
cpu=17 node=1
cpu=18 node=0
cpu=19 node=1
cpu=20 node=0
cpu=21 node=1
cpu=22 node=0
cpu=23 node=1
cpu=24 node=0
cpu=25 node=1
cpu=26 node=0
cpu=27 node=1
cpu=28 node=0
cpu=29 node=1
cpu=30 node=0
cpu=31 node=1




* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 14:39                                 ` Eric Dumazet
@ 2010-10-27 14:43                                   ` Tejun Heo
  2010-10-27 15:21                                     ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 14:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On 10/27/2010 04:39 PM, Eric Dumazet wrote:
> On Wednesday, 27 October 2010 at 16:24 +0200, Eric Dumazet wrote:
>> On Wednesday, 27 October 2010 at 16:00 +0200, Tejun Heo wrote:
>>
>>> Heh, interesting table.  What does the same code say on 64bit?  Is it
>>> the same?
>>>
>>
>> Yes this is the same
> 
> Oops sorry :!)
> 
> On 64bit kernel, the 16 'possible but not online' cpus are not on node
> 0, but balanced between two nodes.

Ah, okay, that explains it.  So, your NUMA table is screwed up.  It
would be interesting to dig down where the difference between 32 and
64bit comes from.  Maybe it's coming from differences in our init code
rather than from BIOS?

-- 
tejun


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
  2010-10-27  6:07                   ` Eric Dumazet
@ 2010-10-27 15:19                   ` Linus Torvalds
  2010-10-27 15:30                     ` Ingo Molnar
  2010-10-27 16:03                   ` [tip:x86/urgent] " tip-bot for Brian Gerst
  2010-10-27 16:04                   ` [tip:x86/urgent] percpu: Remove the multi-page alignment facility tip-bot for Ingo Molnar
  3 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2010-10-27 15:19 UTC (permalink / raw)
  To: Brian Gerst; +Cc: tj, x86, linux-kernel, mingo

On Tue, Oct 26, 2010 at 10:43 PM, Brian Gerst <brgerst@gmail.com> wrote:
> The percpu allocator cannot handle alignments larger than one page.
> Allocate the irq stacks separately, and only keep the pointers as
> percpu data.

Ok, so I definitely want this (although it sounds like it would be
good to do the allocation numa-aware - possibly a separate issue).

However, I also want to remove all the crap that got added for the
multi-page percpu support. It was ugly, and apparently never really
worked. All the PER_CPU_MULTIPAGE_ALIGNED crud just needs to go away.

Ingo, can you take care of this all, or should I just take the patch
and remove the multipage stuff manually?

                                  Linus


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 14:43                                   ` Tejun Heo
@ 2010-10-27 15:21                                     ` Eric Dumazet
  2010-10-27 15:35                                       ` Tejun Heo
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 15:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 16:43 +0200, Tejun Heo wrote:

> Ah, okay, that explains it.  So, your NUMA table is screwed up.  It
> would be interesting to dig down where the difference between 32 and
> 64bit comes from.  Maybe it's coming from differences in our init code
> rather than from BIOS?
> 

I wish it could explain it.
I upgraded the BIOS to the latest one from HP; no change.

If I remove HOTPLUG support I still get:


cpu=0 node=1
cpu=1 node=0
cpu=2 node=1
cpu=3 node=0
cpu=4 node=1
cpu=5 node=0
cpu=6 node=1
cpu=7 node=0
cpu=8 node=1
cpu=9 node=0
cpu=10 node=1
cpu=11 node=0
cpu=12 node=1
cpu=13 node=0
cpu=14 node=1
cpu=15 node=0

[    0.000000] SMP: Allowing 16 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 64
[    0.000000] Allocating PCI resources starting at e4000000 (gap: e4000000:1ac00000)
[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:16 nr_node_ids:8
[    0.000000] PERCPU: Embedded 16 pages/cpu @f4600000 s42752 r0 d22784 u131072
[    0.000000] pcpu-alloc: s42752 r0 d22784 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 


# cat /proc/buddyinfo 
Node 0, zone      DMA      0      1      1      1      2      1      1      0      1      1      3 
Node 0, zone   Normal    362    205     46     13      5      2      2      3      3      3    186 
Node 0, zone  HighMem    182    132    102     70     30      2      1      1      1      1    275 
Node 1, zone  HighMem    140     86    107     41     13      3      4      3      2      2    489 





* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:19                   ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Linus Torvalds
@ 2010-10-27 15:30                     ` Ingo Molnar
  2010-10-27 15:33                       ` Ingo Molnar
  0 siblings, 1 reply; 63+ messages in thread
From: Ingo Molnar @ 2010-10-27 15:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Brian Gerst, tj, x86, linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Oct 26, 2010 at 10:43 PM, Brian Gerst <brgerst@gmail.com> wrote:
> > The percpu allocator cannot handle alignments larger than one page.
> > Allocate the irq stacks separately, and only keep the pointers as
> > percpu data.
> 
> Ok, so I definitely want this (although it sounds like it would be good to do the 
> allocation numa-aware - possibly a separate issue).
> 
> However, I also want to remove all the crap that got added for the multi-page 
> percpu support. It was ugly, and apparently never really worked. All the 
> PER_CPU_MULTIPAGE_ALIGNED crud just needs to go away.
> 
> Ingo, can you take care of this all, or should I just take the patch and remove 
> the multipage stuff manually?

Sure, i'm queuing up Brian's patch (initially wanted to wait for the NUMA-aware 
version) and will remove all the multipage pcpu bits - will send you a pull request 
later today, after a bit of testing.

Thanks,

	Ingo


* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:30                     ` Ingo Molnar
@ 2010-10-27 15:33                       ` Ingo Molnar
  2010-10-27 15:40                         ` Tejun Heo
  0 siblings, 1 reply; 63+ messages in thread
From: Ingo Molnar @ 2010-10-27 15:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Brian Gerst, tj, x86, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > Ingo, can you take care of this all, or should I just take the patch and remove 
> > the multipage stuff manually?
> 
> Sure, i'm queuing up Brian's patch (initially wanted to wait for the NUMA-aware 
> version) and will remove all the multipage pcpu bits - will send you a pull 
> request later today, after a bit of testing.

Btw., the NUMA stuff never really worked with percpu alloc, as per Eric's observation:

 | Anyway, I just discovered per_cpu data on my machine (NUMA capable) all sit on a 
 | single node, if 32bit kernel used.

... and in practice it's not really relevant on 32-bit anyway. So we can decouple 
the two issues just fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:21                                     ` Eric Dumazet
@ 2010-10-27 15:35                                       ` Tejun Heo
  2010-10-27 16:07                                         ` Eric Dumazet
  2010-10-27 20:55                                         ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Eric Dumazet
  0 siblings, 2 replies; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 15:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On 10/27/2010 05:21 PM, Eric Dumazet wrote:
> I wish it could explain it.
> I upgraded BIOS to latest one from HP. no change.
> 
> If I remove HOTPLUG support I still get :
> 
> cpu=0 node=1
> cpu=1 node=0
> cpu=2 node=1
> cpu=3 node=0
> cpu=4 node=1
> cpu=5 node=0
> cpu=6 node=1
> cpu=7 node=0
> cpu=8 node=1
> cpu=9 node=0
> cpu=10 node=1
> cpu=11 node=0
> cpu=12 node=1
> cpu=13 node=0
> cpu=14 node=1
> cpu=15 node=0
> 
> [    0.000000] SMP: Allowing 16 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 64
> [    0.000000] Allocating PCI resources starting at e4000000 (gap: e4000000:1ac00000)
> [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:16 nr_node_ids:8
> [    0.000000] PERCPU: Embedded 16 pages/cpu @f4600000 s42752 r0 d22784 u131072
> [    0.000000] pcpu-alloc: s42752 r0 d22784 u131072 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

Hmmm, okay.  Can you please print out early_cpu_to_node() output for
each cpu from arch/x86/kernel/setup_percpu.c::setup_per_cpu_areas()?
BTW, some clarifications.

* In the pcpu-alloc debug message, the n of [n] might not necessarily
  match the NUMA node.

* I was confused before.  If CPU distance reported by
  early_cpu_to_node() is greater than LOCAL_DISTANCE (ie. NUMA
  configuration), cpus will always belong to different [n].  What gets
  adjusted is the size of each unit.

* No matter what, here, the end result is correct.  As there's no low
  memory on node 1, it doesn't matter how the groups are organized in
  the first chunk as long as embedding is used.  And for other chunks,
  pages for each cpu are allocated separately w/ cpu_to_node() anyway
  so NUMA affinity will be correct, again, regardless of the group
  organization.
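
The numbers in the quoted pcpu-alloc lines can be cross-checked with a
little arithmetic.  The sketch below is a user-space illustration only:
the struct and function names are hypothetical, the field meanings
(s = static, r = reserved, d = dynamic, u = unit size) are my reading of
the log line, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u	/* assumption: 4 KiB pages, as on x86-32 */

/* Hypothetical mirror of the fields printed in the "PERCPU: Embedded" line. */
struct pcpu_layout {
	size_t static_size;	/* s: static percpu data */
	size_t reserved_size;	/* r: reserved area */
	size_t dyn_size;	/* d: dynamic area */
	size_t unit_size;	/* u: stride between two units */
};

/* Pages of percpu data per CPU: (s + r + d) rounded up to whole pages. */
static size_t pcpu_pages_per_cpu(const struct pcpu_layout *l)
{
	size_t used = l->static_size + l->reserved_size + l->dyn_size;

	return (used + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* Bytes covered by nr_units consecutive units of this layout. */
static size_t pcpu_alloc_bytes(const struct pcpu_layout *l, size_t nr_units)
{
	assert(l->unit_size % PAGE_SIZE == 0);
	return l->unit_size * nr_units;
}
```

With s42752 r0 d22784 u131072 this gives 16 pages per cpu, and 16 units
of 131072 bytes fill exactly the single 2097152-byte allocation shown in
the log.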

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:33                       ` Ingo Molnar
@ 2010-10-27 15:40                         ` Tejun Heo
  2010-10-27 15:43                           ` Ingo Molnar
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-27 15:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Brian Gerst, x86, linux-kernel

Hello, Ingo, Linus.

On 10/27/2010 05:33 PM, Ingo Molnar wrote:
> Btw., the NUMA stuff never really worked with percpu alloc, as per Eric's observation:

Oh, it works in general.  In this case, it's probably because the NUMA
configuration is rather weird in 32bit (but then again NUMA on 32bit
is supposed to be so).  I think it's just 32bit init code setting up
early_cpu_to_node() differently.

>  | Anyway, I just discovered per_cpu data on my machine (NUMA capable) all sit on a 
>  | single node, if 32bit kernel used.
> 
> ... and in practice it's not really relevant on 32-bit anyway. So we can decouple 
> the two issues just fine.

Yeah, the two issues are separate and as I wrote in the other message
the end result is correct (as much as it could be).  I'll try to dig
down what exactly is going on but I don't think there's anything to be
alarmed about.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:40                         ` Tejun Heo
@ 2010-10-27 15:43                           ` Ingo Molnar
  0 siblings, 0 replies; 63+ messages in thread
From: Ingo Molnar @ 2010-10-27 15:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linus Torvalds, Brian Gerst, x86, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo, Linus.
> 
> On 10/27/2010 05:33 PM, Ingo Molnar wrote:
> > Btw., the NUMA stuff never really worked with percpu alloc, as per Eric's observation:
> 
> Oh, it works in general.  In this case, it's probably because the NUMA 
> configuration is rather weird in 32bit (but then again NUMA on 32bit is supposed 
> to be so).  I think it's just 32bit init code setting up early_cpu_to_node() 
> differently.

Yeah, i meant on x86 32-bit only - which the whole thread is about.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [tip:x86/urgent] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
  2010-10-27  6:07                   ` Eric Dumazet
  2010-10-27 15:19                   ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Linus Torvalds
@ 2010-10-27 16:03                   ` tip-bot for Brian Gerst
  2010-10-27 16:04                   ` [tip:x86/urgent] percpu: Remove the multi-page alignment facility tip-bot for Ingo Molnar
  3 siblings, 0 replies; 63+ messages in thread
From: tip-bot for Brian Gerst @ 2010-10-27 16:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, brgerst, tglx, mingo

Commit-ID:  22d4cd4c4dce6d7b7d9a7e396aa4f87fe7a649b1
Gitweb:     http://git.kernel.org/tip/22d4cd4c4dce6d7b7d9a7e396aa4f87fe7a649b1
Author:     Brian Gerst <brgerst@gmail.com>
AuthorDate: Wed, 27 Oct 2010 01:43:02 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 27 Oct 2010 17:31:42 +0200

x86-32: Allocate irq stacks seperate from percpu area

The percpu allocator cannot handle alignments larger than one
page. Allocate the irq stacks seperately, and only keep the
pointers as percpu data.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tj@kernel.org
LKML-Reference: <1288158182-1753-1-git-send-email-brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/irq.h |    2 --
 arch/x86/kernel/irq_32.c   |   12 ++----------
 arch/x86/kernel/smpboot.c  |    1 -
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 0bf5b00..13b0eba 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -21,10 +21,8 @@ static inline int irq_canonicalize(int irq)
 
 #ifdef CONFIG_X86_32
 extern void irq_ctx_init(int cpu);
-extern void irq_ctx_exit(int cpu);
 #else
 # define irq_ctx_init(cpu) do { } while (0)
-# define irq_ctx_exit(cpu) do { } while (0)
 #endif
 
 #define __ARCH_HAS_DO_SOFTIRQ
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 50fbbe6..64668db 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -60,9 +60,6 @@ union irq_ctx {
 static DEFINE_PER_CPU(union irq_ctx *, hardirq_ctx);
 static DEFINE_PER_CPU(union irq_ctx *, softirq_ctx);
 
-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, hardirq_stack, THREAD_SIZE);
-static DEFINE_PER_CPU_MULTIPAGE_ALIGNED(union irq_ctx, softirq_stack, THREAD_SIZE);
-
 static void call_on_stack(void *func, void *stack)
 {
 	asm volatile("xchgl	%%ebx,%%esp	\n"
@@ -128,7 +125,7 @@ void __cpuinit irq_ctx_init(int cpu)
 	if (per_cpu(hardirq_ctx, cpu))
 		return;
 
-	irqctx = &per_cpu(hardirq_stack, cpu);
+	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -137,7 +134,7 @@ void __cpuinit irq_ctx_init(int cpu)
 
 	per_cpu(hardirq_ctx, cpu) = irqctx;
 
-	irqctx = &per_cpu(softirq_stack, cpu);
+	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -150,11 +147,6 @@ void __cpuinit irq_ctx_init(int cpu)
 	       cpu, per_cpu(hardirq_ctx, cpu),  per_cpu(softirq_ctx, cpu));
 }
 
-void irq_ctx_exit(int cpu)
-{
-	per_cpu(hardirq_ctx, cpu) = NULL;
-}
-
 asmlinkage void do_softirq(void)
 {
 	unsigned long flags;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6af1185..90baf56 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1373,7 +1373,6 @@ void play_dead_common(void)
 {
 	idle_task_exit();
 	reset_lazy_tlbstate();
-	irq_ctx_exit(raw_smp_processor_id());
 	c1e_remove_cpu(raw_smp_processor_id());
 
 	mb();
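
Why the stacks need THREAD_SIZE alignment at all can be shown with a
small user-space sketch.  It assumes 8 KiB stacks and that the code
finds the base of the current stack by masking the stack pointer, which
is how I understand current_thread_info() to work on 32-bit x86 of this
era; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u	/* assumption: 8 KiB stacks (two 4 KiB pages) */

/* Round an address inside a stack down to a THREAD_SIZE boundary. */
static uintptr_t stack_base_of(uintptr_t sp)
{
	/* THREAD_SIZE must be a power of two for the mask to be valid. */
	assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
	return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}

/* Returns 1 if masking an address near the top of the stack recovers base. */
static int masking_recovers_base(uintptr_t base)
{
	uintptr_t sp = base + THREAD_SIZE - 16;

	return stack_base_of(sp) == base;
}
```

A base of 0xf4602000 is recovered correctly, while a merely page-aligned
base such as 0xf4601000 is not: the mask lands 4 KiB above the real
start of the stack.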

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [tip:x86/urgent] percpu: Remove the multi-page alignment facility
  2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
                                     ` (2 preceding siblings ...)
  2010-10-27 16:03                   ` [tip:x86/urgent] " tip-bot for Brian Gerst
@ 2010-10-27 16:04                   ` tip-bot for Ingo Molnar
  3 siblings, 0 replies; 63+ messages in thread
From: tip-bot for Ingo Molnar @ 2010-10-27 16:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, brgerst, tj, tglx, mingo

Commit-ID:  47f19a0814e80e1d4e5c17d61b70fca85ea09162
Gitweb:     http://git.kernel.org/tip/47f19a0814e80e1d4e5c17d61b70fca85ea09162
Author:     Ingo Molnar <mingo@elte.hu>
AuthorDate: Wed, 27 Oct 2010 17:41:17 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 27 Oct 2010 17:53:25 +0200

percpu: Remove the multi-page alignment facility

[DECLARE|DEFINE]_PER_CPU_MULTIPAGE_ALIGNED never really worked because
the head percpu section was only page aligned. Now that the last user
is gone (32-bit IRQ stacks), remove the generic percpu facility.

Cc: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1288158182-1753-1-git-send-email-brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/percpu-defs.h |   12 ------------
 1 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 018db9a..27ef6b1 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -148,18 +148,6 @@
 	DEFINE_PER_CPU_SECTION(type, name, "..readmostly")
 
 /*
- * Declaration/definition used for large per-CPU variables that must be
- * aligned to something larger than the pagesize.
- */
-#define DECLARE_PER_CPU_MULTIPAGE_ALIGNED(type, name, size)		\
-	DECLARE_PER_CPU_SECTION(type, name, "..page_aligned")		\
-	__aligned(size)
-
-#define DEFINE_PER_CPU_MULTIPAGE_ALIGNED(type, name, size)		\
-	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
-	__aligned(size)
-
-/*
  * Intermodule exports for per-CPU variables.  sparse forgets about
  * address space across EXPORT_SYMBOL(), change EXPORT_SYMBOL() to
  * noop if __CHECKER__.
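
The changelog's point, that an offset aligned inside the section does
not help when the section base is only page aligned, can be illustrated
in user space.  This is a sketch with made-up addresses and hypothetical
helper names, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Address of a variable placed at 'offset' inside a unit at 'unit_base'. */
static uintptr_t percpu_addr(uintptr_t unit_base, uintptr_t offset)
{
	return unit_base + offset;
}
```

With an 8 KiB alignment request and an offset of 0x4000, a unit based at
0xf4602000 yields an aligned address, but a unit based at 0xf4601000
(page aligned only) yields 0xf4605000, which is aligned to 4 KiB and
nothing more.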

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:35                                       ` Tejun Heo
@ 2010-10-27 16:07                                         ` Eric Dumazet
  2010-10-27 17:33                                           ` [PATCH] numa: fix slab_node(MPOL_BIND) Eric Dumazet
  2010-10-27 20:55                                         ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 16:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 17:35 +0200, Tejun Heo wrote:

> Hmmm, okay.  Can you please print out early_cpu_to_node() output for
> each cpu from arch/x86/kernel/setup_percpu.c::setup_per_cpu_areas()?
> BTW, some clarifications.
> 
> * In the pcpu-alloc debug message, the n of [n] might not necessarily
>   match the NUMA node.
> 
> * I was confused before.  If CPU distance reported by
>   early_cpu_to_node() is greater than LOCAL_DISTANCE (ie. NUMA
>   configuration), cpus will always belong to different [n].  What gets
>   adjusted is the size of each unit.
> 
> * No matter what, here, the end result is correct.  As there's no low
>   memory on node 1, it doesn't matter how the groups are organized in
>   the first chunk as long as embedding is used.  And for other chunks,
>   pages for each cpu are allocated separately w/ cpu_to_node() anyway
>   so NUMA affinity will be correct, again, regardless of the group
>   organization.
> 
> Thanks.
> 

Will do in a few moments, once I recover from a frozen machine  :(
(See end of this mail)

Thanks !

By the way, booting with hashdist=1 to make alloc_large_system_hash()
use vmalloc() shows that only pages from node 0 were used at boot.

# grep alloc_large /proc/vmallocinfo 
0xf7a01000-0xf7a82000  528384 alloc_large_system_hash+0x144/0x1d9 pages=128 vmalloc N0=128
0xf7a83000-0xf7ac4000  266240 alloc_large_system_hash+0x144/0x1d9 pages=64 vmalloc N0=64
0xf7b11000-0xf7b32000  135168 alloc_large_system_hash+0x144/0x1d9 pages=32 vmalloc N0=32
0xf7b33000-0xf7c34000 1052672 alloc_large_system_hash+0x144/0x1d9 pages=256 vmalloc N0=256
0xf7c39000-0xf7cba000  528384 alloc_large_system_hash+0x144/0x1d9 pages=128 vmalloc N0=128
0xf7cbb000-0xf7cc0000   20480 alloc_large_system_hash+0x144/0x1d9 pages=4 vmalloc N0=4
0xf7cc1000-0xf7cc6000   20480 alloc_large_system_hash+0x144/0x1d9 pages=4 vmalloc N0=4


So I tried the following experiment:

# swapoff
# numactl --membind=0 swapon -a
# grep swap /proc/vmallocinfo 
0xf9bf3000-0xf9cf4000 1052672 sys_swapon+0x4aa/0xb24 pages=256 vmalloc N0=256
# swapoff -a
# numactl --membind=1 swapon -a

<<FREEZE>>



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-27 16:07                                         ` Eric Dumazet
@ 2010-10-27 17:33                                           ` Eric Dumazet
  2010-10-28 15:59                                             ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 17:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 18:07 +0200, Eric Dumazet wrote:

> So I tried the following experiment:
> 
> # swapoff
> # numactl --membind=0 swapon -a
> # grep swap /proc/vmallocinfo 
> 0xf9bf3000-0xf9cf4000 1052672 sys_swapon+0x4aa/0xb24 pages=256 vmalloc N0=256
> # swapoff -a
> # numactl --membind=1 swapon -a
> 
> <<FREEZE>>
> 

It is in fact a crash, not a freeze, in slab_node().

The problem is that we dereference a NULL zone pointer.

(node 1 has HighMem only)

The following patch seems to solve the problem for me:

# swapoff -a
# numactl --membind=1 swapon -a
# grep swap /proc/vmallocinfo 
0xf9da5000-0xf9ea6000 1052672 sys_swapon+0x3f9/0xa34 pages=256 vmalloc N1=256


Thanks


[PATCH] numa: fix slab_node(MPOL_BIND) 

When a node contains only HighMem memory, slab_node(MPOL_BIND)
dereferences a NULL pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 mm/mempolicy.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 81a1276..4a57f13 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1597,7 +1597,7 @@ unsigned slab_node(struct mempolicy *policy)
 		(void)first_zones_zonelist(zonelist, highest_zoneidx,
 							&policy->v.nodes,
 							&zone);
-		return zone->node;
+		return zone ? zone->node : numa_node_id();
 	}
 
 	default:
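
For readers who want to see why the zone pointer ends up NULL, here is a
user-space model.  It is only a sketch: the struct, the bitmask nodemask
and both functions are hypothetical stand-ins, and it assumes the lookup
is limited to zones no higher than Normal, which is my understanding of
what slab_node() asks for.

```c
#include <assert.h>
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Hypothetical, reduced to the two properties the lookup depends on. */
struct zone {
	int node;
	enum zone_type type;
};

/* First zone that is not above 'highest' and sits on an allowed node. */
static const struct zone *first_zone(const struct zone *zones, size_t n,
				     enum zone_type highest, unsigned nodemask)
{
	for (size_t i = 0; i < n; i++)
		if (zones[i].type <= highest &&
		    (nodemask & (1u << zones[i].node)))
			return &zones[i];
	return NULL;	/* e.g. every allowed node has HighMem only */
}

/* Mirrors the patched return: fall back to the local node on NULL. */
static int slab_node_bind(const struct zone *zones, size_t n,
			  unsigned nodemask, int local_node)
{
	const struct zone *zone = first_zone(zones, n, ZONE_NORMAL, nodemask);

	return zone ? zone->node : local_node;
}
```

Binding to node 1 on a layout like the one reported above (node 0 with
DMA, Normal and HighMem, node 1 with HighMem only) finds no zone, so the
unpatched "zone->node" would be a NULL dereference.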



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 15:35                                       ` Tejun Heo
  2010-10-27 16:07                                         ` Eric Dumazet
@ 2010-10-27 20:55                                         ` Eric Dumazet
  2010-10-28 12:01                                           ` Tejun Heo
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-27 20:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Wednesday, 27 October 2010 at 17:35 +0200, Tejun Heo wrote:

> Hmmm, okay.  Can you please print out early_cpu_to_node() output for
> each cpu from arch/x86/kernel/setup_percpu.c::setup_per_cpu_areas()?
> BTW, some clarifications.
> 
> * In the pcpu-alloc debug message, the n of [n] might not necessarily
>   match the NUMA node.
> 
> * I was confused before.  If CPU distance reported by
>   early_cpu_to_node() is greater than LOCAL_DISTANCE (ie. NUMA
>   configuration), cpus will always belong to different [n].  What gets
>   adjusted is the size of each unit.
> 
> * No matter what, here, the end result is correct.  As there's no low
>   memory on node 1, it doesn't matter how the groups are organized in
>   the first chunk as long as embedding is used.  And for other chunks,
>   pages for each cpu are allocated separately w/ cpu_to_node() anyway
>   so NUMA affinity will be correct, again, regardless of the group
>   organization.
> 
> Thanks.
> 

Hi Tejun

I changed the User/Kernel split from 3G/1G to 1G/3G so that I have
LOWMEM on both nodes. pcpu still allocates all percpu data from node 0.

With the following patch:

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 002b796..0611256 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -212,6 +212,7 @@ void __init setup_per_cpu_areas(void)
 		per_cpu(cpu_number, cpu) = cpu;
 		setup_percpu_segment(cpu);
 		setup_stack_canary_segment(cpu);
+		pr_err("cpu=%d early_cpu_to_node()=%d\n", cpu, early_cpu_to_node(cpu));
 		/*
 		 * Copy data used in early init routines from the
 		 * initial arrays to the per cpu data areas.  These


I get :

[    0.000000] Linux version 2.6.36-06800-g80ca147-dirty (root@svivoipvnx021) (gcc version 4.5.1 (GCC) ) #200 SMP Wed Oct 27 22:40:59 CEST 2010
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
[    0.000000]  BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000df62f000 (usable)
[    0.000000]  BIOS-e820: 00000000df62f000 - 00000000df63c000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000df63c000 - 00000000df63d000 (usable)
[    0.000000]  BIOS-e820: 00000000df63d000 - 00000000e4000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved)
[    0.000000]  BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 000000011ffff000 (usable)
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI 2.6 present.
[    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[    0.000000] last_pfn = 0x11ffff max_arch_pfn = 0x1000000
[    0.000000] MTRR default type: write-back
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 00E0000000 mask FFE0000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] found SMP MP-table at [400f4f80] f4f80
[    0.000000] initial memory mapped : 0 - 01e00000
[    0.000000] init_memory_mapping: 0000000000000000-00000000b71fe000
[    0.000000]  0000000000 - 0000200000 page 4k
[    0.000000]  0000200000 - 00b7000000 page 2M
[    0.000000]  00b7000000 - 00b71fe000 page 4k
[    0.000000] kernel direct mapping tables up to b71fe000 @ 1df4000-1e00000
[    0.000000] RAMDISK: 37f75000 - 37ff0000
[    0.000000] ACPI: RSDP 000f4f00 00024 (v02 HP    )
[    0.000000] ACPI: XSDT df630080 000AC (v01 HP     ProLiant 00000002   �? 0000162E)
[    0.000000] ACPI: FACP df630180 000F4 (v03 HP     ProLiant 00000002   �? 0000162E)
[    0.000000] ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20101013/tbfadt-607)
[    0.000000] ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 (20101013/tbfadt-607)
[    0.000000] ACPI: DSDT df630280 01F88 (v01 HP         DSDT 00000001 INTL 20030228)
[    0.000000] ACPI: FACS df62f100 00040
[    0.000000] ACPI: SPCR df62f140 00050 (v01 HP     SPCRRBSU 00000001   �? 0000162E)
[    0.000000] ACPI: MCFG df62f1c0 0003C (v01 HP     ProLiant 00000001      00000000)
[    0.000000] ACPI: HPET df62f200 00038 (v01 HP     ProLiant 00000002   �? 0000162E)
[    0.000000] ACPI: FFFF df62f240 00064 (v02 HP     ProLiant 00000002   �? 0000162E)
[    0.000000] ACPI: SPMI df62f2c0 00040 (v05 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: ERST df62f300 001D0 (v01 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: APIC df62f500 0015E (v01 HP     ProLiant 00000002      00000000)
[    0.000000] ACPI: SRAT df62f680 00570 (v01 HP     Proliant 00000001   �? 0000162E)
[    0.000000] ACPI: FFFF df62fc00 00176 (v01 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: BERT df62fd80 00030 (v01 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: HEST df62fdc0 000BC (v01 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: DMAR df62fe80 00154 (v01 HP     ProLiant 00000001   �? 0000162E)
[    0.000000] ACPI: SSDT df632240 00125 (v03     HP  CRSPCI0 00000002   HP 00000001)
[    0.000000] ACPI: SSDT df632380 003BB (v01     HP      pcc 00000001 INTL 20090625)
[    0.000000] ACPI: SSDT df632740 00377 (v01     HP     pmab 00000001 INTL 20090625)
[    0.000000] ACPI: SSDT df632ac0 02B64 (v01  INTEL PPM RCM  00000001 INTL 20061109)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] CPU 00 in proximity domain 00
[    0.000000] CPU 01 in proximity domain 00
[    0.000000] CPU 02 in proximity domain 00
[    0.000000] CPU 03 in proximity domain 00
[    0.000000] CPU 04 in proximity domain 00
[    0.000000] CPU 05 in proximity domain 00
[    0.000000] CPU 06 in proximity domain 00
[    0.000000] CPU 07 in proximity domain 00
[    0.000000] CPU 10 in proximity domain 01
[    0.000000] CPU 11 in proximity domain 01
[    0.000000] CPU 12 in proximity domain 01
[    0.000000] CPU 13 in proximity domain 01
[    0.000000] CPU 14 in proximity domain 01
[    0.000000] CPU 15 in proximity domain 01
[    0.000000] CPU 16 in proximity domain 01
[    0.000000] CPU 17 in proximity domain 01
[    0.000000] Memory range 00000000 to 00080000 in proximity domain 00 enabled
[    0.000000] Memory range 00080000 to 000e0000 in proximity domain 01 enabled
[    0.000000] Memory range 00100000 to 00120000 in proximity domain 01 enabled
[    0.000000] pxm bitmap: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[    0.000000] Number of logical nodes in system = 2
[    0.000000] Number of memory chunks in system = 3
[    0.000000] chunk 0 nid 0 start_pfn 00000000 end_pfn 00080000
[    0.000000] chunk 1 nid 1 start_pfn 00080000 end_pfn 000e0000
[    0.000000] chunk 2 nid 1 start_pfn 00100000 end_pfn 00120000
[    0.000000] Node: 0, start_pfn: 0, end_pfn: 80000
[    0.000000]   Setting physnode_map array to node 0 for pfns:
[    0.000000]   0 4000 8000 c000 10000 14000 18000 1c000 20000 24000 28000 2c000 30000 34000 38000 3c000 40000 44000 48000 4c000 50000 54000 58000 5c000 60000 64000 68000 6c000 70000 74000 78000 7c000 
[    0.000000] Node: 1, start_pfn: 80000, end_pfn: 11ffff
[    0.000000]   Setting physnode_map array to node 1 for pfns:
[    0.000000]   80000 84000 88000 8c000 90000 94000 98000 9c000 a0000 a4000 a8000 ac000 b0000 b4000 b8000 bc000 c0000 c4000 c8000 cc000 d0000 d4000 d8000 dc000 e0000 e4000 e8000 ec000 f0000 f4000 f8000 fc000 100000 104000 108000 10c000 110000 114000 118000 11c000 
[    0.000000] node 0 pfn: [0 - 80000]
[    0.000000] Reserving 4608 pages of KVA for lmem_map of node 0 at 7ee00
[    0.000000] node 1 pfn: [80000 - 120000]
[    0.000000] Reserving 5632 pages of KVA for lmem_map of node 1 at 11e800
[    0.000000] Reserving total of 2800 pages for numa KVA remap
[    0.000000] kva_start_pfn ~ b4800 max_low_pfn ~ b71fe
[    0.000000] max_pfn = 11ffff
[    0.000000] 1678MB HIGHMEM available.
[    0.000000] 2929MB LOWMEM available.
[    0.000000] max_low_pfn = b71fe, highstart_pfn = b71fe
[    0.000000] Low memory ends at vaddr f71fe000
[    0.000000] node 0 will remap to vaddr f4800000 - f5a00000
[    0.000000] allocate_pgdat: node 0 NODE_DATA f4800000
[    0.000000] node 1 will remap to vaddr f5a00000 - f7000000
[    0.000000] allocate_pgdat: node 1 NODE_DATA f5a00000
[    0.000000] remap_numa_kva: node 0
[    0.000000] remap_numa_kva: f4800000 to pfn 0007ee00
[    0.000000] remap_numa_kva: f4a00000 to pfn 0007f000
[    0.000000] remap_numa_kva: f4c00000 to pfn 0007f200
[    0.000000] remap_numa_kva: f4e00000 to pfn 0007f400
[    0.000000] remap_numa_kva: f5000000 to pfn 0007f600
[    0.000000] remap_numa_kva: f5200000 to pfn 0007f800
[    0.000000] remap_numa_kva: f5400000 to pfn 0007fa00
[    0.000000] remap_numa_kva: f5600000 to pfn 0007fc00
[    0.000000] remap_numa_kva: f5800000 to pfn 0007fe00
[    0.000000] remap_numa_kva: node 1
[    0.000000] remap_numa_kva: f5a00000 to pfn 0011e800
[    0.000000] remap_numa_kva: f5c00000 to pfn 0011ea00
[    0.000000] remap_numa_kva: f5e00000 to pfn 0011ec00
[    0.000000] remap_numa_kva: f6000000 to pfn 0011ee00
[    0.000000] remap_numa_kva: f6200000 to pfn 0011f000
[    0.000000] remap_numa_kva: f6400000 to pfn 0011f200
[    0.000000] remap_numa_kva: f6600000 to pfn 0011f400
[    0.000000] remap_numa_kva: f6800000 to pfn 0011f600
[    0.000000] remap_numa_kva: f6a00000 to pfn 0011f800
[    0.000000] remap_numa_kva: f6c00000 to pfn 0011fa00
[    0.000000] remap_numa_kva: f6e00000 to pfn 0011fc00
[    0.000000] High memory starts at vaddr f71fe000
[    0.000000]   mapped low ram: 0 - b71fe000
[    0.000000]   low ram: 0 - b71fe000
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA      0x00000010 -> 0x00001000
[    0.000000]   Normal   0x00001000 -> 0x000b71fe
[    0.000000]   HighMem  0x000b71fe -> 0x0011ffff
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[5] active PFN ranges
[    0.000000]     0: 0x00000010 -> 0x0000009f
[    0.000000]     0: 0x00000100 -> 0x00080000
[    0.000000]     1: 0x00080000 -> 0x000df62f
[    0.000000]     1: 0x000df63c -> 0x000df63d
[    0.000000]     1: 0x00100000 -> 0x0011ffff
[    0.000000] On node 0 totalpages: 524175
[    0.000000] free_area_init_node: node 0, pgdat f4800000, node_mem_map f4802200
[    0.000000]   DMA zone: 32 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 3951 pages, LIFO batch:0
[    0.000000]   Normal zone: 4064 pages used for memmap
[    0.000000]   Normal zone: 516128 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 521775
[    0.000000] free_area_init_node: node 1, pgdat f5a00000, node_mem_map f5a02000
[    0.000000]   Normal zone: 1764 pages used for memmap
[    0.000000]   Normal zone: 224026 pages, LIFO batch:31
[    0.000000]   HighMem zone: 3357 pages used for memmap
[    0.000000]   HighMem zone: 292628 pages, LIFO batch:31
[    0.000000] Using APIC driver default
[    0.000000] ACPI: PM-Timer IO Port: 0x908
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x10] lapic_id[0x20] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x18] lapic_id[0x30] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x14] lapic_id[0x24] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x14] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1c] lapic_id[0x34] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x04] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x12] lapic_id[0x22] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x12] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1a] lapic_id[0x32] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x02] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x16] lapic_id[0x26] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x16] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1e] lapic_id[0x36] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x06] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x11] lapic_id[0x21] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x11] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x19] lapic_id[0x31] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x15] lapic_id[0x25] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x15] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1d] lapic_id[0x35] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x05] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x13] lapic_id[0x23] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x13] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1b] lapic_id[0x33] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x03] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x17] lapic_id[0x27] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x17] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x1f] lapic_id[0x37] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x07] enabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: IOAPIC (id[0x00] address[0xfec80000] gsi_base[24])
[    0.000000] IOAPIC[1]: apic_id 0, version 32, address 0xfec80000, GSI 24-47
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] SMP: Allowing 16 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 64
[    0.000000] Allocating PCI resources starting at e4000000 (gap: e4000000:1ac00000)
[    0.000000] converting mcount calls to 0f 1f 44 00 00
[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:16 nr_node_ids:8
[    0.000000] PERCPU: Embedded 16 pages/cpu @bea00000 s41984 r0 d23552 u131072
[    0.000000] pcpu-alloc: s41984 r0 d23552 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 
[    0.000000] setup_percpu: cpu=0 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=1 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=2 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=3 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=4 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=5 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=6 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=7 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=8 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=9 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=10 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=11 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=12 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=13 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=14 early_cpu_to_node()=0
[    0.000000] setup_percpu: cpu=15 early_cpu_to_node()=0
[    0.000000] Built 2 zonelists in Zone order, mobility grouping on.  Total pages: 1036733
[    0.000000] Policy zone: HighMem
[    0.000000] Kernel command line: root=/dev/cciss/c0d0p2 nofb sysrq_always_enabled=1 vga=6 hashdist=1
[    0.000000] sysrq: sysrq always enabled.
[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Initializing CPU#0
[    0.000000] Initializing HighMem for node 0 (00000000:00000000)
[    0.000000] Initializing HighMem for node 1 (000b71fe:0011ffff)
[    0.000000] Memory: 4093536k/4718588k available (2921k kernel code, 67736k reserved, 2068k data, 420k init, 1161412k highmem)
[    0.000000] virtual kernel memory layout:
[    0.000000]     fixmap  : 0xff577000 - 0xfffff000   (10784 kB)
[    0.000000]     pkmap   : 0xff200000 - 0xff400000   (2048 kB)
[    0.000000]     vmalloc : 0xf79fe000 - 0xff1fe000   ( 120 MB)
[    0.000000]     lowmem  : 0x40000000 - 0xf71fe000   (2929 MB)
[    0.000000]       .init : 0x414e0000 - 0x41549000   ( 420 kB)
[    0.000000]       .data : 0x412da723 - 0x414df8f8   (2068 kB)
[    0.000000]       .text : 0x41000000 - 0x412da723   (2921 kB)
[    0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS:2304
[    0.000000] CPU 0 irqstacks, hard=bea00000 soft=bea02000
[    0.000000] Extended CMOS year: 2000
[    0.000000] Console: colour VGA+ 80x60
[    0.000000] console [tty0] enabled
[    0.000000] hpet clockevent registered
[    0.000000] Fast TSC calibration using PIT
[    0.004000] Detected 2533.451 MHz processor.
[    0.000007] Calibrating delay loop (skipped), value calculated using timer frequency.. 5066.90 BogoMIPS (lpj=10133804)
[    0.000227] pid_max: default: 32768 minimum: 301
[    0.000373] Security Framework initialized
[    0.000477] SELinux:  Initializing.
[    0.000587] SELinux:  Starting in permissive mode
[    0.000696] Dentry cache hash table entries: 524288 (order: 9, 2097152 bytes)
[    0.001478] Inode-cache hash table entries: 262144 (order: 8, 1048576 bytes)
[    0.001908] Mount-cache hash table entries: 512
[    0.002152] CPU: Physical Processor ID: 1
[    0.002253] CPU: Processor Core ID: 0
[    0.002355] mce: CPU supports 9 MCE banks
[    0.002465] CPU0: Thermal monitoring enabled (TM1)
[    0.002575] using mwait in idle threads.
[    0.002679] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU driver.
[    0.002902] ... version:                3
[    0.003004] ... bit width:              48
[    0.003105] ... generic registers:      4
[    0.003207] ... value mask:             0000ffffffffffff
[    0.003316] ... max period:             000000007fffffff
[    0.003426] ... fixed-purpose events:   3
[    0.003527] ... event mask:             000000070000000f
[    0.003912] Freeing SMP alternatives: 16k freed
[    0.004008] ACPI: Core revision 20101013
[    0.008688] Overriding APIC driver with bigsmp
[    0.008793] Enabling APIC mode:  Physflat.  Using 2 I/O APICs
[    0.009176] Leaving ESR disabled.
[    0.009274] Mapping cpu 0 to node 1
[    0.009539] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.049316] CPU0: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz stepping 05
[    0.157369] CPU 1 irqstacks, hard=bea20000 soft=bea22000
[    0.157372] Booting Node   0, Processors  #1
[    0.167896] Initializing CPU#1
[    0.168480] Leaving ESR disabled.
[    0.168483] Mapping cpu 1 to node 0
[    0.264905] CPU 2 irqstacks, hard=bea40000 soft=bea42000
[    0.265275]  #2
[    0.276090] Initializing CPU#2
[    0.276332] Leaving ESR disabled.
[    0.276334] Mapping cpu 2 to node 1
[    0.372880] CPU 3 irqstacks, hard=bea60000 soft=bea62000
[    0.373254]  #3
[    0.384173] Initializing CPU#3
[    0.384310] Leaving ESR disabled.
[    0.384313] Mapping cpu 3 to node 0
[    0.480803] CPU 4 irqstacks, hard=bea80000 soft=bea82000
[    0.481171]  #4
[    0.491497] Initializing CPU#4
[    0.492227] Leaving ESR disabled.
[    0.492229] Mapping cpu 4 to node 1
[    0.588747] CPU 5 irqstacks, hard=beaa0000 soft=beaa2000
[    0.589119]  #5
[    0.599503] Initializing CPU#5
[    0.600176] Leaving ESR disabled.
[    0.600178] Mapping cpu 5 to node 0
[    0.696707] CPU 6 irqstacks, hard=beac0000 soft=beac2000
[    0.697076]  #6
[    0.707401] Initializing CPU#6
[    0.708132] Leaving ESR disabled.
[    0.708134] Mapping cpu 6 to node 1
[    0.804673] CPU 7 irqstacks, hard=beae0000 soft=beae2000
[    0.805046]  #7
[    0.815429] Initializing CPU#7
[    0.816102] Leaving ESR disabled.
[    0.816105] Mapping cpu 7 to node 0
[    0.912500] CPU 8 irqstacks, hard=beb00000 soft=beb02000
[    0.912867]  #8
[    0.923731] Initializing CPU#8
[    0.923925] Leaving ESR disabled.
[    0.923928] Mapping cpu 8 to node 1
[    1.020523] CPU 9 irqstacks, hard=beb20000 soft=beb22000
[    1.020894]  #9
[    1.031277] Initializing CPU#9
[    1.031950] Leaving ESR disabled.
[    1.031953] Mapping cpu 9 to node 0
[    1.128415] CPU 10 irqstacks, hard=beb40000 soft=beb42000
[    1.128622]  #10
[    1.138939] Initializing CPU#10
[    1.139670] Leaving ESR disabled.
[    1.139672] Mapping cpu 10 to node 1
[    1.236377] CPU 11 irqstacks, hard=beb60000 soft=beb62000
[    1.236751]  #11
[    1.247135] Initializing CPU#11
[    1.247808] Leaving ESR disabled.
[    1.247811] Mapping cpu 11 to node 0
[    1.344334] CPU 12 irqstacks, hard=beb80000 soft=beb82000
[    1.344704]  #12
[    1.355029] Initializing CPU#12
[    1.355761] Leaving ESR disabled.
[    1.355762] Mapping cpu 12 to node 1
[    1.452301] CPU 13 irqstacks, hard=beba0000 soft=beba2000
[    1.452674]  #13
[    1.463594] Initializing CPU#13
[    1.463731] Leaving ESR disabled.
[    1.463734] Mapping cpu 13 to node 0
[    1.560218] CPU 14 irqstacks, hard=bebc0000 soft=bebc2000
[    1.560425]  #14
[    1.570742] Initializing CPU#14
[    1.571474] Leaving ESR disabled.
[    1.571475] Mapping cpu 14 to node 1
[    1.668181] CPU 15 irqstacks, hard=bebe0000 soft=bebe2000
[    1.668555]  #15 Ok.
[    1.678985] Initializing CPU#15
[    1.679658] Leaving ESR disabled.
[    1.679661] Mapping cpu 15 to node 0
[    1.775946] Brought up 16 CPUs
[    1.776333] Total of 16 processors activated (81068.16 BogoMIPS).
[    1.783989] kworker/u:0 used greatest stack depth: 7100 bytes left
[    1.784370] NET: Registered protocol family 16
[    1.784868] ACPI: bus type pci registered
[    1.785054] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xe0000000-0xe3ffffff] (base 0xe0000000)
[    1.785133] PCI: MMCONFIG at [mem 0xe0000000-0xe3ffffff] reserved in E820
[    1.785193] PCI: Using MMCONFIG for extended config space
[    1.785251] PCI: Using configuration type 1 for base access
[    1.789673] bio: create slab <bio-0> at 0
[    1.790492] ACPI: EC: Look up EC in DSDT
[    1.790617] ACPI Error: Field [CDW3] at 96 exceeds Buffer [NULL] size 64 (bits) (20101013/dsopcode-597)
[    1.790867] ACPI Error: Method parse/execution failed [\_SB_._OSC] (Node f45a7e90), AE_AML_BUFFER_LIMIT (20101013/psparse-537)
[    1.796427] ACPI: Interpreter enabled
[    1.796527] ACPI: (supports S0 S5)
[    1.796698] ACPI: Using IOAPIC for interrupt routing
[    1.802633] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    1.802734] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-12])
[    1.802860] pci_root PNP0A03:00: host bridge window [mem 0xe7000000-0xfbffffff]
[    1.802940] pci_root PNP0A03:00: host bridge window [io  0x1000-0x4fff]
[    1.803001] pci_root PNP0A03:00: host bridge window [io  0x0000-0x03af]
[    1.803061] pci_root PNP0A03:00: host bridge window [io  0x03e0-0x0cf7]
[    1.803122] pci_root PNP0A03:00: host bridge window [io  0x0d00-0x0fff]
[    1.803182] pci_root PNP0A03:00: host bridge window [mem 0xfed00000-0xfed03fff]
[    1.803258] pci_root PNP0A03:00: host bridge window [mem 0xfed00000-0xfed44fff]
[    1.803423] pci_root PNP0A03:00: host bridge window [io  0x03b0-0x03bb]
[    1.803541] pci_root PNP0A03:00: host bridge window [io  0x03c0-0x03df]
[    1.803659] pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff]
[    1.803872] pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
[    1.803875] pci 0000:00:00.0: PME# disabled
[    1.803928] pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
[    1.803931] pci 0000:00:01.0: PME# disabled
[    1.803983] pci 0000:00:02.0: PME# supported from D0 D3hot D3cold
[    1.803986] pci 0000:00:02.0: PME# disabled
[    1.804038] pci 0000:00:03.0: PME# supported from D0 D3hot D3cold
[    1.804041] pci 0000:00:03.0: PME# disabled
[    1.804094] pci 0000:00:07.0: PME# supported from D0 D3hot D3cold
[    1.804097] pci 0000:00:07.0: PME# disabled
[    1.804148] pci 0000:00:08.0: PME# supported from D0 D3hot D3cold
[    1.804151] pci 0000:00:08.0: PME# disabled
[    1.804207] pci 0000:00:09.0: PME# supported from D0 D3hot D3cold
[    1.804210] pci 0000:00:09.0: PME# disabled
[    1.804263] pci 0000:00:0a.0: PME# supported from D0 D3hot D3cold
[    1.804269] pci 0000:00:0a.0: PME# disabled
[    1.804965] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
[    1.804968] pci 0000:00:1c.0: PME# disabled
[    1.805058] pci 0000:00:1d.0: reg 20: [io  0x1000-0x101f]
[    1.805183] pci 0000:00:1d.1: reg 20: [io  0x1020-0x103f]
[    1.805309] pci 0000:00:1d.2: reg 20: [io  0x1040-0x105f]
[    1.805435] pci 0000:00:1d.3: reg 20: [io  0x1060-0x107f]
[    1.805503] pci 0000:00:1d.7: reg 10: [mem 0xf35f0000-0xf35f03ff]
[    1.805569] pci 0000:00:1d.7: PME# supported from D0 D3hot D3cold
[    1.805573] pci 0000:00:1d.7: PME# disabled
[    1.805782] pci 0000:02:00.0: reg 10: [mem 0xfb000000-0xfb7fffff 64bit]
[    1.805793] pci 0000:02:00.0: reg 18: [mem 0xfa800000-0xfaffffff 64bit]
[    1.805811] pci 0000:02:00.0: reg 30: [mem 0x00000000-0x0000ffff pref]
[    1.805836] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    1.805839] pci 0000:02:00.0: PME# disabled
[    1.805872] pci 0000:02:00.1: reg 10: [mem 0xfa000000-0xfa7fffff 64bit]
[    1.805883] pci 0000:02:00.1: reg 18: [mem 0xf9800000-0xf9ffffff 64bit]
[    1.805901] pci 0000:02:00.1: reg 30: [mem 0x00000000-0x0000ffff pref]
[    1.805926] pci 0000:02:00.1: PME# supported from D0 D3hot D3cold
[    1.805929] pci 0000:02:00.1: PME# disabled
[    1.805940] pci 0000:00:01.0: PCI bridge to [bus 02-02]
[    1.806051] pci 0000:00:01.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806054] pci 0000:00:01.0:   bridge window [mem 0xf3800000-0xfb7fffff]
[    1.806059] pci 0000:00:01.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806090] pci 0000:00:02.0: PCI bridge to [bus 0d-0d]
[    1.806201] pci 0000:00:02.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806205] pci 0000:00:02.0:   bridge window [mem 0xfff00000-0x000fffff] (disabled)
[    1.806209] pci 0000:00:02.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806240] pci 0000:00:03.0: PCI bridge to [bus 03-05]
[    1.806350] pci 0000:00:03.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806354] pci 0000:00:03.0:   bridge window [mem 0xfff00000-0x000fffff] (disabled)
[    1.806358] pci 0000:00:03.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806389] pci 0000:00:07.0: PCI bridge to [bus 06-08]
[    1.806500] pci 0000:00:07.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806503] pci 0000:00:07.0:   bridge window [mem 0xfff00000-0x000fffff] (disabled)
[    1.806508] pci 0000:00:07.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806539] pci 0000:00:08.0: PCI bridge to [bus 11-11]
[    1.806650] pci 0000:00:08.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806653] pci 0000:00:08.0:   bridge window [mem 0xfff00000-0x000fffff] (disabled)
[    1.806658] pci 0000:00:08.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806735] pci 0000:09:00.0: PME# supported from D0 D3hot D3cold
[    1.806739] pci 0000:09:00.0: PME# disabled
[    1.806749] pci 0000:00:09.0: PCI bridge to [bus 09-0b]
[    1.806859] pci 0000:00:09.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.806863] pci 0000:00:09.0:   bridge window [mem 0xfba00000-0xfbafffff]
[    1.806867] pci 0000:00:09.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.806923] pci 0000:0a:04.0: reg 10: [mem 0xfbaf0000-0xfbafffff 64bit]
[    1.806935] pci 0000:0a:04.0: reg 18: [mem 0xfbae0000-0xfbaeffff 64bit]
[    1.806955] pci 0000:0a:04.0: reg 30: [mem 0x00000000-0x0001ffff pref]
[    1.806977] pci 0000:0a:04.0: PME# supported from D3hot D3cold
[    1.806980] pci 0000:0a:04.0: PME# disabled
[    1.807018] pci 0000:0a:04.1: reg 10: [mem 0xfbad0000-0xfbadffff 64bit]
[    1.807030] pci 0000:0a:04.1: reg 18: [mem 0xfbac0000-0xfbacffff 64bit]
[    1.807050] pci 0000:0a:04.1: reg 30: [mem 0x00000000-0x0001ffff pref]
[    1.807072] pci 0000:0a:04.1: PME# supported from D3hot D3cold
[    1.807075] pci 0000:0a:04.1: PME# disabled
[    1.807113] pci 0000:09:00.0: PCI bridge to [bus 0a-0a]
[    1.807226] pci 0000:09:00.0:   bridge window [io  0xfffff000-0x0000] (disabled)
[    1.807230] pci 0000:09:00.0:   bridge window [mem 0xfba00000-0xfbafffff]
[    1.807235] pci 0000:09:00.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.807270] pci 0000:00:0a.0: PCI bridge to [bus 12-12]
[    1.807381] pci 0000:00:0a.0:   bridge window [io  0xf000-0x0000] (disabled)
[    1.807385] pci 0000:00:0a.0:   bridge window [mem 0xfff00000-0x000fffff] (disabled)
[    1.807389] pci 0000:00:0a.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.807459] pci 0000:0c:00.0: reg 10: [mem 0xfbc00000-0xfbffffff 64bit]
[    1.807476] pci 0000:0c:00.0: reg 18: [mem 0xfbbf0000-0xfbbf0fff 64bit]
[    1.807487] pci 0000:0c:00.0: reg 20: [io  0x4000-0x40ff]
[    1.807508] pci 0000:0c:00.0: reg 30: [mem 0x00000000-0x0007ffff pref]
[    1.807541] pci 0000:0c:00.0: supports D1
[    1.807543] pci 0000:0c:00.0: PME# supported from D0
[    1.807547] pci 0000:0c:00.0: PME# disabled
[    1.807564] pci 0000:00:1c.0: PCI bridge to [bus 0c-0c]
[    1.807675] pci 0000:00:1c.0:   bridge window [io  0x4000-0x4fff]
[    1.807678] pci 0000:00:1c.0:   bridge window [mem 0xfbb00000-0xfbffffff]
[    1.807684] pci 0000:00:1c.0:   bridge window [mem 0xfff00000-0x000fffff pref] (disabled)
[    1.807732] pci 0000:01:03.0: reg 10: [mem 0xe8000000-0xefffffff pref]
[    1.807742] pci 0000:01:03.0: reg 14: [io  0x3000-0x30ff]
[    1.807752] pci 0000:01:03.0: reg 18: [mem 0xf37f0000-0xf37fffff]
[    1.807787] pci 0000:01:03.0: reg 30: [mem 0x00000000-0x0001ffff pref]
[    1.807806] pci 0000:01:03.0: supports D1 D2
[    1.807837] pci 0000:01:04.0: reg 10: [io  0x2800-0x28ff]
[    1.807847] pci 0000:01:04.0: reg 14: [mem 0xf37e0000-0xf37e01ff]
[    1.807906] pci 0000:01:04.0: PME# supported from D0 D3hot D3cold
[    1.807910] pci 0000:01:04.0: PME# disabled
[    1.807945] pci 0000:01:04.2: reg 10: [io  0x3400-0x34ff]
[    1.807956] pci 0000:01:04.2: reg 14: [mem 0xf37d0000-0xf37d07ff]
[    1.807967] pci 0000:01:04.2: reg 18: [mem 0xf37c0000-0xf37c3fff]
[    1.807978] pci 0000:01:04.2: reg 1c: [mem 0xf3700000-0xf377ffff]
[    1.808008] pci 0000:01:04.2: reg 30: [mem 0x00000000-0x0000ffff pref]
[    1.808028] pci 0000:01:04.2: PME# supported from D0 D3hot D3cold
[    1.808033] pci 0000:01:04.2: PME# disabled
[    1.808106] pci 0000:01:04.4: reg 20: [io  0x3800-0x381f]
[    1.808144] pci 0000:01:04.4: PME# supported from D0 D3hot D3cold
[    1.808148] pci 0000:01:04.4: PME# disabled
[    1.808181] pci 0000:01:04.6: reg 10: [mem 0xf36f0000-0xf36f00ff]
[    1.808247] pci 0000:01:04.6: PME# supported from D0 D3hot D3cold
[    1.808251] pci 0000:01:04.6: PME# disabled
[    1.808294] pci 0000:00:1e.0: PCI bridge to [bus 01-01] (subtractive decode)
[    1.808416] pci 0000:00:1e.0:   bridge window [io  0x2000-0x3fff]
[    1.808419] pci 0000:00:1e.0:   bridge window [mem 0xf3600000-0xf37fffff]
[    1.808425] pci 0000:00:1e.0:   bridge window [mem 0xe8000000-0xefffffff 64bit pref]
[    1.808427] pci 0000:00:1e.0:   bridge window [mem 0xe7000000-0xfbffffff] (subtractive decode)
[    1.808429] pci 0000:00:1e.0:   bridge window [io  0x1000-0x4fff] (subtractive decode)
[    1.808432] pci 0000:00:1e.0:   bridge window [io  0x0000-0x03af] (subtractive decode)
[    1.808434] pci 0000:00:1e.0:   bridge window [io  0x03e0-0x0cf7] (subtractive decode)
[    1.808436] pci 0000:00:1e.0:   bridge window [io  0x0d00-0x0fff] (subtractive decode)
[    1.808438] pci 0000:00:1e.0:   bridge window [mem 0xfed00000-0xfed03fff] (subtractive decode)
[    1.808440] pci 0000:00:1e.0:   bridge window [mem 0xfed00000-0xfed44fff] (subtractive decode)
[    1.808442] pci 0000:00:1e.0:   bridge window [io  0x03b0-0x03bb] (subtractive decode)
[    1.808444] pci 0000:00:1e.0:   bridge window [io  0x03c0-0x03df] (subtractive decode)
[    1.808447] pci 0000:00:1e.0:   bridge window [mem 0x000a0000-0x000bffff] (subtractive decode)
[    1.808475] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    1.808627] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.IP2P._PRT]
[    1.808673] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.IPT1._PRT]
[    1.808722] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PT01._PRT]
[    1.808770] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PT03._PRT]
[    1.808840] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PT07._PRT]
[    1.808910] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PT09._PRT]
[    1.813581] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *7 10 11)
[    1.813948] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 7 *10 11)
[    1.814312] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 7 *10 11)
[    1.814678] ACPI: PCI Interrupt Link [LNKD] (IRQs *5 7 10 11)
[    1.832181] ACPI: PCI Interrupt Link [LNKE] (IRQs *5 7 10 11)
[    1.832549] ACPI: PCI Interrupt Link [LNKF] (IRQs 5 7 *10 11)
[    1.832912] ACPI: PCI Interrupt Link [LNKG] (IRQs 5 7 *10 11)
[    1.833276] ACPI: PCI Interrupt Link [LNKH] (IRQs 5 *7 10 11)
[    1.833781] vgaarb: device added: PCI:0000:01:03.0,decodes=io+mem,owns=io+mem,locks=none
[    1.833956] vgaarb: loaded
[    1.834279] SCSI subsystem initialized
[    1.834497] libata version 3.00 loaded.
[    1.834587] usbcore: registered new interface driver usbfs
[    1.834768] usbcore: registered new interface driver hub
[    1.834924] usbcore: registered new device driver usb
[    1.835190] PCI: Using ACPI for IRQ routing
[    1.835298] PCI: pci_cache_line_size set to 64 bytes
[    1.835405] reserve RAM buffer: 000000000009f400 - 000000000009ffff 
[    1.835408] reserve RAM buffer: 00000000df62f000 - 00000000dfffffff 
[    1.835410] reserve RAM buffer: 00000000df63d000 - 00000000dfffffff 
[    1.835411] reserve RAM buffer: 000000011ffff000 - 000000011fffffff 
[    1.835591] HPET: 4 timers in total, 0 timers will be used for per-cpu timer
[    1.867954] Switching to clocksource tsc
[    1.870005] pnp: PnP ACPI init
[    1.870117] ACPI: bus type pnp registered
[    1.870258] pnp 00:00: [bus 00-12]
[    1.870262] pnp 00:00: [mem 0xe7000000-0xfbffffff window]
[    1.870265] pnp 00:00: [io  0x1000-0x4fff window]
[    1.870269] pnp 00:00: [io  0x0000-0x03af window]
[    1.870272] pnp 00:00: [io  0x03e0-0x0cf7 window]
[    1.870274] pnp 00:00: [io  0x0d00-0x0fff window]
[    1.870277] pnp 00:00: [mem 0xfed00000-0xfed03fff window]
[    1.870280] pnp 00:00: [mem 0xfed00000-0xfed44fff window]
[    1.870283] pnp 00:00: [io  0x03b0-0x03bb window]
[    1.870286] pnp 00:00: [io  0x03c0-0x03df window]
[    1.870289] pnp 00:00: [mem 0x000a0000-0x000bffff window]
[    1.870397] pnp 00:00: Plug and Play ACPI device, IDs PNP0a03 PNP0a08 (active)
[    1.870670] pnp 00:01: [io  0x0070-0x0077]
[    1.870673] pnp 00:01: [io  0x0408-0x040f]
[    1.870675] pnp 00:01: [io  0x04d0-0x04d1]
[    1.870678] pnp 00:01: [io  0x0020-0x003f]
[    1.870680] pnp 00:01: [io  0x00a0-0x00bf]
[    1.870683] pnp 00:01: [io  0x0090-0x009f]
[    1.870685] pnp 00:01: [io  0x0050-0x0053]
[    1.870688] pnp 00:01: [io  0x0700-0x071f]
[    1.870690] pnp 00:01: [io  0x0880-0x08ff]
[    1.870693] pnp 00:01: [io  0x0900-0x097f]
[    1.870695] pnp 00:01: [io  0x0010-0x001f]
[    1.870697] pnp 00:01: [io  0x0c80-0x0c83]
[    1.870700] pnp 00:01: [io  0x0cd4-0x0cd7]
[    1.870703] pnp 00:01: [io  0x0f50-0x0f58]
[    1.870705] pnp 00:01: [io  0x00f0]
[    1.870708] pnp 00:01: [io  0x0ca0-0x0ca1]
[    1.870710] pnp 00:01: [io  0x0ca4-0x0ca5]
[    1.870713] pnp 00:01: [mem 0xe0000000-0xe3ffffff]
[    1.870716] pnp 00:01: [mem 0xfe000000-0xfebfffff]
[    1.870719] pnp 00:01: [mem 0xe7ffe000-0xe7ffffff]
[    1.870722] pnp 00:01: [mem 0x00000000-0xffffffffffffffff disabled]
[    1.870725] pnp 00:01: [io  0x03f8-0x03ff]
[    1.870867] pnp 00:01: Plug and Play ACPI device, IDs PNP0c02 (active)
[    1.870881] pnp 00:02: [io  0x0ca2-0x0ca3]
[    1.870951] pnp 00:02: Plug and Play ACPI device, IDs IPI0001 (active)
[    1.870985] pnp 00:03: [mem 0xfed00000-0xfed003ff]
[    1.871058] pnp 00:03: Plug and Play ACPI device, IDs PNP0103 (active)
[    1.871072] pnp 00:04: [dma 7]
[    1.871074] pnp 00:04: [io  0x0000-0x000f]
[    1.871077] pnp 00:04: [io  0x0080-0x008f]
[    1.871080] pnp 00:04: [io  0x00c0-0x00df]
[    1.871148] pnp 00:04: Plug and Play ACPI device, IDs PNP0200 (active)
[    1.871160] pnp 00:05: [io  0x0061]
[    1.871228] pnp 00:05: Plug and Play ACPI device, IDs PNP0800 (active)
[    1.871241] pnp 00:06: [io  0x0060]
[    1.871244] pnp 00:06: [io  0x0064]
[    1.871253] pnp 00:06: [irq 1]
[    1.871321] pnp 00:06: Plug and Play ACPI device, IDs PNP0303 (active)
[    1.871337] pnp 00:07: [irq 12]
[    1.871408] pnp 00:07: Plug and Play ACPI device, IDs PNP0f13 PNP0f0e (active)
[    1.871422] pnp 00:08: [io  0x002e-0x002f]
[    1.871425] pnp 00:08: [io  0x0620-0x065f]
[    1.871428] pnp 00:08: [io  0x0680-0x069f]
[    1.871430] pnp 00:08: [io  0x0600-0x061f]
[    1.871433] pnp 00:08: [io  0x0660-0x067f]
[    1.871435] pnp 00:08: [io  0x0300-0x031f]
[    1.871504] pnp 00:08: Plug and Play ACPI device, IDs PNP0a06 (active)
[    1.871721] pnp 00:09: [irq 3]
[    1.871723] pnp 00:09: [io  0x02f8-0x02ff]
[    1.871943] pnp 00:09: Plug and Play ACPI device, IDs PNP0501 PNP0500 (active)
[    1.872033] pnp 00:0a: [io  0x0070-0x0071]
[    1.872108] pnp 00:0a: Plug and Play ACPI device, IDs PNP0b00 (active)
[    1.872432] pnp: PnP ACPI: found 11 devices
[    1.872538] ACPI: ACPI bus type pnp unregistered
[    1.872654] system 00:01: [io  0x0408-0x040f] has been reserved
[    1.872771] system 00:01: [io  0x04d0-0x04d1] has been reserved
[    1.872888] system 00:01: [io  0x0700-0x071f] has been reserved
[    1.873004] system 00:01: [io  0x0880-0x08ff] has been reserved
[    1.873120] system 00:01: [io  0x0900-0x097f] has been reserved
[    1.873236] system 00:01: [io  0x0c80-0x0c83] has been reserved
[    1.873352] system 00:01: [io  0x0cd4-0x0cd7] has been reserved
[    1.873468] system 00:01: [io  0x0f50-0x0f58] has been reserved
[    1.873584] system 00:01: [io  0x0ca0-0x0ca1] has been reserved
[    1.873700] system 00:01: [io  0x0ca4-0x0ca5] has been reserved
[    1.873817] system 00:01: [io  0x03f8-0x03ff] has been reserved
[    1.873935] system 00:01: [mem 0xe0000000-0xe3ffffff] has been reserved
[    1.874056] system 00:01: [mem 0xfe000000-0xfebfffff] has been reserved
[    1.874177] system 00:01: [mem 0xe7ffe000-0xe7ffffff] has been reserved
[    1.912733] pci 0000:00:01.0: BAR 9: assigned [mem 0xe7000000-0xe70fffff pref]
[    1.912902] pci 0000:00:09.0: BAR 9: assigned [mem 0xe7100000-0xe71fffff pref]
[    1.913070] pci 0000:00:1c.0: BAR 9: assigned [mem 0xe7200000-0xe72fffff pref]
[    1.913238] pci 0000:02:00.0: BAR 6: assigned [mem 0xe7000000-0xe700ffff pref]
[    1.913405] pci 0000:02:00.1: BAR 6: assigned [mem 0xe7010000-0xe701ffff pref]
[    1.913572] pci 0000:00:01.0: PCI bridge to [bus 02-02]
[    1.913684] pci 0000:00:01.0:   bridge window [io  disabled]
[    1.913801] pci 0000:00:01.0:   bridge window [mem 0xf3800000-0xfb7fffff]
[    1.913923] pci 0000:00:01.0:   bridge window [mem 0xe7000000-0xe70fffff pref]
[    1.914093] pci 0000:00:02.0: PCI bridge to [bus 0d-0d]
[    1.914205] pci 0000:00:02.0:   bridge window [io  disabled]
[    1.914322] pci 0000:00:02.0:   bridge window [mem disabled]
[    1.914436] pci 0000:00:02.0:   bridge window [mem pref disabled]
[    1.914556] pci 0000:00:03.0: PCI bridge to [bus 03-05]
[    1.914668] pci 0000:00:03.0:   bridge window [io  disabled]
[    1.914784] pci 0000:00:03.0:   bridge window [mem disabled]
[    1.914899] pci 0000:00:03.0:   bridge window [mem pref disabled]
[    1.915019] pci 0000:00:07.0: PCI bridge to [bus 06-08]
[    1.915131] pci 0000:00:07.0:   bridge window [io  disabled]
[    1.915247] pci 0000:00:07.0:   bridge window [mem disabled]
[    1.915363] pci 0000:00:07.0:   bridge window [mem pref disabled]
[    1.915484] pci 0000:00:08.0: PCI bridge to [bus 11-11]
[    1.915595] pci 0000:00:08.0:   bridge window [io  disabled]
[    1.915710] pci 0000:00:08.0:   bridge window [mem disabled]
[    1.915825] pci 0000:00:08.0:   bridge window [mem pref disabled]
[    1.915946] pci 0000:09:00.0: BAR 9: assigned [mem 0xe7100000-0xe71fffff pref]
[    1.916117] pci 0000:0a:04.0: BAR 6: assigned [mem 0xe7100000-0xe711ffff pref]
[    1.916283] pci 0000:0a:04.1: BAR 6: assigned [mem 0xe7120000-0xe713ffff pref]
[    1.916450] pci 0000:09:00.0: PCI bridge to [bus 0a-0a]
[    1.916562] pci 0000:09:00.0:   bridge window [io  disabled]
[    1.916678] pci 0000:09:00.0:   bridge window [mem 0xfba00000-0xfbafffff]
[    1.916801] pci 0000:09:00.0:   bridge window [mem 0xe7100000-0xe71fffff pref]
[    1.916970] pci 0000:00:09.0: PCI bridge to [bus 09-0b]
[    1.917082] pci 0000:00:09.0:   bridge window [io  disabled]
[    1.917199] pci 0000:00:09.0:   bridge window [mem 0xfba00000-0xfbafffff]
[    1.917322] pci 0000:00:09.0:   bridge window [mem 0xe7100000-0xe71fffff pref]
[    1.917491] pci 0000:00:0a.0: PCI bridge to [bus 12-12]
[    1.917602] pci 0000:00:0a.0:   bridge window [io  disabled]
[    1.917718] pci 0000:00:0a.0:   bridge window [mem disabled]
[    1.917832] pci 0000:00:0a.0:   bridge window [mem pref disabled]
[    1.917954] pci 0000:0c:00.0: BAR 6: assigned [mem 0xe7200000-0xe727ffff pref]
[    1.918120] pci 0000:00:1c.0: PCI bridge to [bus 0c-0c]
[    1.918233] pci 0000:00:1c.0:   bridge window [io  0x4000-0x4fff]
[    1.918352] pci 0000:00:1c.0:   bridge window [mem 0xfbb00000-0xfbffffff]
[    1.918475] pci 0000:00:1c.0:   bridge window [mem 0xe7200000-0xe72fffff pref]
[    1.918648] pci 0000:01:03.0: BAR 6: assigned [mem 0xf3600000-0xf361ffff pref]
[    1.918815] pci 0000:01:04.2: BAR 6: assigned [mem 0xf3620000-0xf362ffff pref]
[    1.918981] pci 0000:00:1e.0: PCI bridge to [bus 01-01]
[    1.919094] pci 0000:00:1e.0:   bridge window [io  0x2000-0x3fff]
[    1.919214] pci 0000:00:1e.0:   bridge window [mem 0xf3600000-0xf37fffff]
[    1.919337] pci 0000:00:1e.0:   bridge window [mem 0xe8000000-0xefffffff 64bit pref]
[    1.919518] pci 0000:00:01.0: setting latency timer to 64
[    1.919526] pci 0000:00:02.0: setting latency timer to 64
[    1.919535] pci 0000:00:03.0: setting latency timer to 64
[    1.919544] pci 0000:00:07.0: setting latency timer to 64
[    1.919552] pci 0000:00:08.0: setting latency timer to 64
[    1.919560] pci 0000:00:09.0: setting latency timer to 64
[    1.919570] pci 0000:09:00.0: setting latency timer to 64
[    1.919578] pci 0000:00:0a.0: setting latency timer to 64
[    1.919592] pci 0000:00:1c.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    1.919714] pci 0000:00:1c.0: setting latency timer to 64
[    1.919722] pci 0000:00:1e.0: setting latency timer to 64
[    1.919726] pci_bus 0000:00: resource 4 [mem 0xe7000000-0xfbffffff]
[    1.919730] pci_bus 0000:00: resource 5 [io  0x1000-0x4fff]
[    1.919733] pci_bus 0000:00: resource 6 [io  0x0000-0x03af]
[    1.919736] pci_bus 0000:00: resource 7 [io  0x03e0-0x0cf7]
[    1.919739] pci_bus 0000:00: resource 8 [io  0x0d00-0x0fff]
[    1.919742] pci_bus 0000:00: resource 9 [mem 0xfed00000-0xfed03fff]
[    1.919746] pci_bus 0000:00: resource 10 [mem 0xfed00000-0xfed44fff]
[    1.919749] pci_bus 0000:00: resource 11 [io  0x03b0-0x03bb]
[    1.919752] pci_bus 0000:00: resource 12 [io  0x03c0-0x03df]
[    1.919755] pci_bus 0000:00: resource 13 [mem 0x000a0000-0x000bffff]
[    1.919759] pci_bus 0000:02: resource 1 [mem 0xf3800000-0xfb7fffff]
[    1.919762] pci_bus 0000:02: resource 2 [mem 0xe7000000-0xe70fffff pref]
[    1.919766] pci_bus 0000:09: resource 1 [mem 0xfba00000-0xfbafffff]
[    1.919769] pci_bus 0000:09: resource 2 [mem 0xe7100000-0xe71fffff pref]
[    1.919772] pci_bus 0000:0a: resource 1 [mem 0xfba00000-0xfbafffff]
[    1.919775] pci_bus 0000:0a: resource 2 [mem 0xe7100000-0xe71fffff pref]
[    1.919779] pci_bus 0000:0c: resource 0 [io  0x4000-0x4fff]
[    1.919782] pci_bus 0000:0c: resource 1 [mem 0xfbb00000-0xfbffffff]
[    1.919785] pci_bus 0000:0c: resource 2 [mem 0xe7200000-0xe72fffff pref]
[    1.919789] pci_bus 0000:01: resource 0 [io  0x2000-0x3fff]
[    1.919792] pci_bus 0000:01: resource 1 [mem 0xf3600000-0xf37fffff]
[    1.919795] pci_bus 0000:01: resource 2 [mem 0xe8000000-0xefffffff 64bit pref]
[    1.919799] pci_bus 0000:01: resource 4 [mem 0xe7000000-0xfbffffff]
[    1.919802] pci_bus 0000:01: resource 5 [io  0x1000-0x4fff]
[    1.919805] pci_bus 0000:01: resource 6 [io  0x0000-0x03af]
[    1.919808] pci_bus 0000:01: resource 7 [io  0x03e0-0x0cf7]
[    1.919811] pci_bus 0000:01: resource 8 [io  0x0d00-0x0fff]
[    1.919814] pci_bus 0000:01: resource 9 [mem 0xfed00000-0xfed03fff]
[    1.919817] pci_bus 0000:01: resource 10 [mem 0xfed00000-0xfed44fff]
[    1.919820] pci_bus 0000:01: resource 11 [io  0x03b0-0x03bb]
[    1.919823] pci_bus 0000:01: resource 12 [io  0x03c0-0x03df]
[    1.919826] pci_bus 0000:01: resource 13 [mem 0x000a0000-0x000bffff]
[    1.919913] NET: Registered protocol family 2
[    1.920142] IP route cache hash table entries: 131072 (order: 7, 524288 bytes)
[    1.920845] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
[    1.922565] TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
[    1.922841] TCP: Hash tables configured (established 524288 bind 65536)
[    1.922961] TCP reno registered
[    1.923066] UDP hash table entries: 2048 (order: 4, 65536 bytes)
[    1.923204] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
[    1.923481] NET: Registered protocol family 1
[    1.936085] pci 0000:01:03.0: Boot video device
[    1.936724] PCI: CLS 64 bytes, default 64
[    1.936786] Trying to unpack rootfs image as initramfs...
[    1.952603] Freeing initrd memory: 492k freed
[    1.953202] udev used greatest stack depth: 7068 bytes left
[    1.955023] udev used greatest stack depth: 6904 bytes left
[    1.960152] audit: initializing netlink socket (disabled)
[    1.960228] type=2000 audit(1288219525.604:1): initialized
[    1.960337] udev used greatest stack depth: 6860 bytes left
[    1.998555] highmem bounce pool size: 64 pages
[    1.998617] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[    1.998865] VFS: Disk quotas dquot_6.5.2
[    1.998962] Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
[    1.999267] msgmni has been set to 5727
[    1.999368] SELinux:  Registering netfilter hooks
[    1.999637] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
[    1.999716] io scheduler noop registered
[    1.999774] io scheduler deadline registered
[    1.999864] io scheduler cfq registered (default)
[    2.000437] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    2.001177] ACPI: acpi_idle registered with cpuidle
[    2.001344] Monitor-Mwait will be used to enter C-1 state
[    2.001378] Monitor-Mwait will be used to enter C-3 state
[    2.001406] Monitor-Mwait will be used to enter C-3 state
[    2.009175] thermal LNXTHERM:00: registered as thermal_zone0
[    2.009238] ACPI: Thermal Zone [THM0] (8 C)
[    2.009712] Real Time Clock Driver v1.12b
[    2.009770] Linux agpgart interface v0.103
[    2.010662] [drm] Initialized drm 1.1.0 20060810
[    2.010722] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    2.284148] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[    2.560020] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[    2.581013] 00:09: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[    2.584619] brd: module loaded
[    2.584675] HP CISS Driver (v 3.6.26)
[    2.584851] cciss 0000:0c:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    2.584994] cciss 0000:0c:00.0: irq 64 for MSI/MSI-X
[    2.584998] cciss 0000:0c:00.0: irq 65 for MSI/MSI-X
[    2.585001] cciss 0000:0c:00.0: irq 66 for MSI/MSI-X
[    2.585005] cciss 0000:0c:00.0: irq 67 for MSI/MSI-X
[    2.599835] cciss 0000:0c:00.0: cciss0: <0x323a> at PCI 0000:0c:00.0 IRQ 64 using DAC
[    2.637524]  cciss/c0d0: p1 p2 p3 p4 < p5 p6 p7 >
[    2.638449] Uniform Multi-Platform E-IDE driver
[    2.656512] ide_generic: please use "probe_mask=0x3f" module parameter for probing all legacy ISA IDE ports
[    2.656604] Probing IDE interface ide0...
[    3.215571] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
[    3.215676] Probing IDE interface ide1...
[    3.775281] ide1 at 0x170-0x177,0x376 on irq 15
[    3.775458] ide-gd driver 1.18
[    3.775591] ide-cd driver 5.00
[    3.776080] PNP: PS/2 Controller [PNP0303:KBD,PNP0f0e:PS2M] at 0x60,0x64 irq 1,12
[    3.777842] serio: i8042 KBD port at 0x60,0x64 irq 1
[    3.777906] serio: i8042 AUX port at 0x60,0x64 irq 12
[    3.778170] mice: PS/2 mouse device common for all mice
[    3.782955] cpuidle: using governor ladder
[    3.791748] cpuidle: using governor menu
[    3.792157] usbcore: registered new interface driver usbhid
[    3.792219] usbhid: USB HID core driver
[    3.792323] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[    3.793005] ip_tables: (C) 2000-2006 Netfilter Core Team
[    3.793111] TCP cubic registered
[    3.793167] Initializing XFRM netlink socket
[    3.793228] NET: Registered protocol family 17
[    3.793290] Registering the dns_resolver key type
[    3.793453] Using IPI Shortcut mode
[    3.794786] Freeing unused kernel memory: 420k freed
[    3.795432] Write protecting the kernel text: 2924k
[    3.795873] Write protecting the kernel read-only data: 1564k
[    3.850624] EXT3-fs: barriers not enabled
[    3.861932] kjournald starting.  Commit interval 5 seconds
[    3.862003] EXT3-fs (cciss/c0d0p2): mounted filesystem with writeback data mode
[    4.127764] SELinux:  Disabled at runtime.
[    4.127844] SELinux:  Unregistering netfilter hooks
[    4.279034] type=1404 audit(1288219527.924:2): selinux=0 auid=4294967295 ses=4294967295
[    4.562099] hostname used greatest stack depth: 6724 bytes left
[    4.639764] awk used greatest stack depth: 6112 bytes left



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-27 20:55                                         ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Eric Dumazet
@ 2010-10-28 12:01                                           ` Tejun Heo
  2010-10-28 12:30                                             ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Tejun Heo @ 2010-10-28 12:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

Hello, Eric.

On 10/27/2010 10:55 PM, Eric Dumazet wrote:
> I changed the User/Kernel split from 3G/1G to 1G/3G so that I have
> LOWMEM on both nodes. Still pcpu allocates all percpu from node0.
...
> [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:16 nr_node_ids:8
> [    0.000000] PERCPU: Embedded 16 pages/cpu @bea00000 s41984 r0 d23552 u131072
> [    0.000000] pcpu-alloc: s41984 r0 d23552 u131072 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 
> [    0.000000] setup_percpu: cpu=0 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=1 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=2 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=3 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=4 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=5 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=6 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=7 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=8 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=9 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=10 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=11 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=12 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=13 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=14 early_cpu_to_node()=0
> [    0.000000] setup_percpu: cpu=15 early_cpu_to_node()=0

So, this is the problem.  percpu uses early_cpu_to_node() to determine
which cpu belongs to which NUMA node and according to it all CPUs are
on node 0, so percpu is configured accordingly.  I have no idea why
early_cpu_to_node() is set up like that tho.  Ingo, Thomas, any ideas?

-- 
tejun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] x86-32: Allocate irq stacks seperate from percpu area
  2010-10-28 12:01                                           ` Tejun Heo
@ 2010-10-28 12:30                                             ` Eric Dumazet
  0 siblings, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2010-10-28 12:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Brian Gerst, x86, linux-kernel, torvalds, mingo

On Thursday, 28 October 2010 at 14:01 +0200, Tejun Heo wrote:
> Hello, Eric.
> 
> On 10/27/2010 10:55 PM, Eric Dumazet wrote:
> > I changed the User/Kernel split from 3G/1G to 1G/3G so that I have
> > LOWMEM on both nodes. Still pcpu allocates all percpu from node0.
> ...
> > [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:16 nr_node_ids:8
> > [    0.000000] PERCPU: Embedded 16 pages/cpu @bea00000 s41984 r0 d23552 u131072
> > [    0.000000] pcpu-alloc: s41984 r0 d23552 u131072 alloc=1*2097152
> > [    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 
> > [    0.000000] setup_percpu: cpu=0 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=1 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=2 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=3 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=4 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=5 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=6 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=7 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=8 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=9 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=10 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=11 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=12 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=13 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=14 early_cpu_to_node()=0
> > [    0.000000] setup_percpu: cpu=15 early_cpu_to_node()=0
> 
> So, this is the problem.  percpu uses early_cpu_to_node() to determine
> which cpu belongs to which NUMA node and according to it all CPUs are
> on node 0, so percpu is configured accordingly.  I have no idea why
> early_cpu_to_node() is set up like that tho.  Ingo, Thomas, any ideas?
> 

CONFIG_X86_32

early_cpu_to_node() uses cpu_to_node_map[]

Set in map_cpu_to_node(), _after_ pcpu stuff if you look at my previous
dmesg output.

arch/x86/kernel/smpboot.c
int cpu_to_node_map[NR_CPUS] __read_mostly = { [0 ... NR_CPUS-1] = 0 };

static void map_cpu_to_node(int cpu, int node)
{
	printk(KERN_INFO "Mapping cpu %d to node %d\n", cpu, node);
	cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
	cpu_to_node_map[cpu] = node;
}

[    0.013437] Mapping cpu 0 to node 1
[    0.172421] Mapping cpu 1 to node 0
[    0.280357] Mapping cpu 2 to node 1
[    0.388310] Mapping cpu 3 to node 0
[    0.496494] Mapping cpu 4 to node 1
[    0.604182] Mapping cpu 5 to node 0
[    0.712050] Mapping cpu 6 to node 1
[    0.820102] Mapping cpu 7 to node 0
...


I added this bit in acpi_map_cpu2node(), just in case.

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index c05872a..f995d3a 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -561,6 +561,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 	numa_set_node(cpu, nid);
 #else /* CONFIG_X86_32 */
 	apicid_2_node[physid] = nid;
+	pr_err("cpu_to_node_map(cpu=%d)=%d\n", cpu, nid);
 	cpu_to_node_map[cpu] = nid;
 #endif
 


It does not seem to be called.
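
The ordering issue described above can be sketched in plain userspace C.
This is only a model under stated assumptions, not the kernel code: all
names carrying a _model suffix and the NR_CPUS_MODEL constant are
hypothetical. It shows that a consumer which reads a zero-initialized
cpu-to-node map before the map is populated sees every CPU on node 0,
and that a later update does not change what was already recorded.

```c
#include <assert.h>

#define NR_CPUS_MODEL 8

/* Zero-initialized, so every CPU starts out mapped to node 0. */
static int cpu_to_node_map_model[NR_CPUS_MODEL];

static int early_cpu_to_node_model(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	return cpu_to_node_map_model[cpu];
}

/* Stand-in for the late update done while CPUs are brought up. */
static void map_cpu_to_node_model(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	cpu_to_node_map_model[cpu] = node;
}

/* Stand-in for percpu setup: records the node chosen for each CPU. */
static void setup_percpu_model(int unit_node[NR_CPUS_MODEL])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		unit_node[cpu] = early_cpu_to_node_model(cpu);
}
```

If setup_percpu_model() runs before map_cpu_to_node_model(0, 1), the
recorded node for cpu 0 stays 0 even though the map later says 1.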




^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH] x86-32: NUMA irq stacks allocations
  2010-10-27  9:57                     ` Peter Zijlstra
  2010-10-27 13:33                       ` Eric Dumazet
@ 2010-10-28 14:40                       ` Eric Dumazet
  2010-10-29  6:43                         ` [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations tip-bot for Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-28 14:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: Brian Gerst, tj, x86, linux-kernel, torvalds

commit 22d4cd4c4d (Allocate irq stacks seperate from percpu area)
removed NUMA affinity of IRQ stacks.

Using alloc_pages_node() instead of __get_free_pages() is safe, even if
the target node has no available LOWMEM pages: alloc_pages_node()
falls back to another node.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: tj@kernel.org
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/irq_32.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 64668db..96656f2 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -17,6 +17,7 @@
 #include <linux/delay.h>
 #include <linux/uaccess.h>
 #include <linux/percpu.h>
+#include <linux/mm.h>
 
 #include <asm/apic.h>
 
@@ -125,7 +126,9 @@ void __cpuinit irq_ctx_init(int cpu)
 	if (per_cpu(hardirq_ctx, cpu))
 		return;
 
-	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
+	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
+					       THREAD_FLAGS,
+					       THREAD_ORDER));
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -134,7 +137,9 @@ void __cpuinit irq_ctx_init(int cpu)
 
 	per_cpu(hardirq_ctx, cpu) = irqctx;
 
-	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
+	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
+					       THREAD_FLAGS,
+					       THREAD_ORDER));
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-27 17:33                                           ` [PATCH] numa: fix slab_node(MPOL_BIND) Eric Dumazet
@ 2010-10-28 15:59                                             ` Linus Torvalds
  2010-10-28 16:27                                               ` Eric Dumazet
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Linus Torvalds @ 2010-10-28 15:59 UTC (permalink / raw)
  To: Eric Dumazet, Mel Gorman, Christoph Lameter, Lee Schermerhorn,
	Andrew Morton
  Cc: Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

Hmm. More people added to the discussion..

This code seems to go back all the way to commit 19770b32609b: "mm:
filter based on a nodemask as well as a gfp_mask". Which was back in
April 2008, and got merged into 2.6.26.

And I'd be happy to commit it (in fact, I was going to), but when
looking for other uses of first_zones_zonelist(), I found
local_memory_node() which does the exact same thing: ignore the return
value, and unconditionally dereference the resulting 'zone' variable.

And so does - although less obviously - mm/vmscan.c for the
wait_iff_congested() thing.

So are those buggy too, since first_zones_zonelist() can apparently return NULL?

Please advise...

                  Linus

On Wed, Oct 27, 2010 at 10:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wednesday, 27 October 2010 at 18:07 +0200, Eric Dumazet wrote:
>
>> So I tried following experiment :
>>
>> # swapoff
>> # numactl --membind=0 swapon -a
>> # grep swap /proc/vmallocinfo
>> 0xf9bf3000-0xf9cf4000 1052672 sys_swapon+0x4aa/0xb24 pages=256 vmalloc N0=256
>> # swapoff -a
>> # numactl --membind=1 swapon -a
>>
>> <<FREEZE>>
>>
>
> Crash in fact, not freeze, in slab_node()
>
> Problem is : we dereference a NULL zone pointer.
>
> (node 1 has HighMem only)
>
> Following patch seems to solve the problem for me
>
> # swapoff -a
> # numactl --membind=1 swapon -a
> # grep swap /proc/vmallocinfo
> 0xf9da5000-0xf9ea6000 1052672 sys_swapon+0x3f9/0xa34 pages=256 vmalloc N1=256
>
>
> Thanks
>
>
> [PATCH] numa: fix slab_node(MPOL_BIND)
>
> When a node contains only HighMem memory, slab_node(MPOL_BIND)
> dereferences a NULL pointer.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  mm/mempolicy.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 81a1276..4a57f13 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1597,7 +1597,7 @@ unsigned slab_node(struct mempolicy *policy)
>                (void)first_zones_zonelist(zonelist, highest_zoneidx,
>                                                        &policy->v.nodes,
>                                                        &zone);
> -               return zone->node;
> +               return zone ? zone->node : numa_node_id();
>        }
>
>        default:
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-28 15:59                                             ` Linus Torvalds
@ 2010-10-28 16:27                                               ` Eric Dumazet
  2010-10-28 16:45                                               ` Mel Gorman
  2010-10-28 16:55                                               ` Christoph Lameter
  2 siblings, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2010-10-28 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Christoph Lameter, Lee Schermerhorn, Andrew Morton,
	Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

On Thursday, 28 October 2010 at 08:59 -0700, Linus Torvalds wrote:
> Hmm. More people added to the discussion..
> 
> This code seems to go back all the way to commit 19770b32609b: "mm:
> filter based on a nodemask as well as a gfp_mask". Which was back in
> April 2008, and got merged into 2.6.26.
> 
> And I'd be happy to commit it (in fact, I was going to), but when
> looking for other uses of first_zones_zonelist(), I found
> local_memory_node() which does the exact same thing: ignore the return
> value, and unconditionally dereference the resulting 'zone' variable.
> 
> And so does - although less obviously - mm/vmscan.c for the
> wait_iff_congested() thing.
> 
> So are those buggy too, since first_zones_zonelist() can apparently return NULL?
> 
> Please advise...
> 

local_memory_node() is for ia64 only, and it is a bit different :

(void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
	gfp_zone(GFP_KERNEL),
	NULL,
	&zone);

Apparently node_zonelist(node, GFP_KERNEL) returns a pointer to a
zonelist that must contain a zone suitable for GFP_KERNEL.

While the code I tried to fix uses :

zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
without any guarantee it contains a LOWMEM zone


Also check commit 7eb54824b76793dd86afb54f182ef9aa64b3a45a
for a similar fix in the past.

About do_try_to_free_pages(), I can not comment on this one, sorry.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-28 15:59                                             ` Linus Torvalds
  2010-10-28 16:27                                               ` Eric Dumazet
@ 2010-10-28 16:45                                               ` Mel Gorman
  2010-10-28 16:55                                               ` Christoph Lameter
  2 siblings, 0 replies; 63+ messages in thread
From: Mel Gorman @ 2010-10-28 16:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Christoph Lameter, Lee Schermerhorn, Andrew Morton,
	Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

On Thu, Oct 28, 2010 at 08:59:42AM -0700, Linus Torvalds wrote:
> Hmm. More people added to the discussion..
> 
> This code seems to go back all the way to commit 19770b32609b: "mm:
> filter based on a nodemask as well as a gfp_mask". Which was back in
> April 2008, and got merged into 2.6.26.
> 

I am about to run out the door so I didn't read the thread but
first_zones_zonelist() can indeed return NULL. It happens when the
zonelist is empty (unlikely) or when a nodemask is applied restricting
the allowable nodes and that results in no valid zones (more likely).

> And I'd be happy to commit it (in fact, I was going to), but when
> looking for other uses of first_zones_zonelist(), I found
> local_memory_node() which does the exact same thing: ignore the return
> value, and unconditionally dereference the resulting 'zone' variable.
> 

That does look unsafe.

> And so does - although less obviously - mm/vmscan.c for the
> wait_iff_congested() thing.
> 

It should be implicitly safe although it is non-obvious.  wait_iff_congested()
in mm/vmscan.c is called from do_try_to_free_pages() which is in the direct
reclaim path. To get there, it must have passed this check in page_alloc.c:

        first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
        if (!preferred_zone) {
                put_mems_allowed();
                return NULL;
        }

Did I miss anything?

The memory controller also can end up there but for it to get into
trouble, they would have to be trying to shrink a cgroup with an invalid
zonelist. Is that possible?

> So are those buggy too, since first_zones_zonelist() can apparently return NULL?
> 

Yes, it can.

> Please advise...
> 

Callers need to check for NULL or be sure they are not dealing with an
empty zonelist.

>                   Linus
> 
> On Wed, Oct 27, 2010 at 10:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wednesday, 27 October 2010 at 18:07 +0200, Eric Dumazet wrote:
> >
> >> So I tried following experiment :
> >>
> >> # swapoff
> >> # numactl --membind=0 swapon -a
> >> # grep swap /proc/vmallocinfo
> >> 0xf9bf3000-0xf9cf4000 1052672 sys_swapon+0x4aa/0xb24 pages=256 vmalloc N0=256
> >> # swapoff -a
> >> # numactl --membind=1 swapon -a
> >>
> >> <<FREEZE>>
> >>
> >
> > Crash in fact, not freeze, in slab_node()
> >
> > Problem is : we dereference a NULL zone pointer.
> >
> > (node 1 has HighMem only)
> >
> > Following patch seems to solve the problem for me
> >
> > # swapoff -a
> > # numactl --membind=1 swapon -a
> > # grep swap /proc/vmallocinfo
> > 0xf9da5000-0xf9ea6000 1052672 sys_swapon+0x3f9/0xa34 pages=256 vmalloc N1=256
> >
> >
> > Thanks
> >
> >
> > [PATCH] numa: fix slab_node(MPOL_BIND)
> >
> > When a node contains only HighMem memory, slab_node(MPOL_BIND)
> > dereferences a NULL pointer.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > ---
> >  mm/mempolicy.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 81a1276..4a57f13 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1597,7 +1597,7 @@ unsigned slab_node(struct mempolicy *policy)
> >                (void)first_zones_zonelist(zonelist, highest_zoneidx,
> >                                                        &policy->v.nodes,
> >                                                        &zone);
> > -               return zone->node;
> > +               return zone ? zone->node : numa_node_id();
> >        }
> >
> >        default:
> >
> >
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-28 15:59                                             ` Linus Torvalds
  2010-10-28 16:27                                               ` Eric Dumazet
  2010-10-28 16:45                                               ` Mel Gorman
@ 2010-10-28 16:55                                               ` Christoph Lameter
  2010-10-28 21:07                                                 ` Andrew Morton
  2 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2010-10-28 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Mel Gorman, Lee Schermerhorn, Andrew Morton,
	Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

On Thu, 28 Oct 2010, Linus Torvalds wrote:

> And so does - although less obviously - mm/vmscan.c for the
> wait_iff_congested() thing.
>
> So are those buggy too, since first_zones_zonelist() can apparently return NULL?

The code is fine.

first_zones_zonelist() can only return NULL for the case that a nodemask
was specified and the code in vmscan.c does not specify a nodemask.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-28 16:55                                               ` Christoph Lameter
@ 2010-10-28 21:07                                                 ` Andrew Morton
  2010-10-29 14:55                                                   ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2010-10-28 21:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Eric Dumazet, Mel Gorman, Lee Schermerhorn,
	Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

On Thu, 28 Oct 2010 11:55:18 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Thu, 28 Oct 2010, Linus Torvalds wrote:
> 
> > And so does - although less obviously - mm/vmscan.c for the
> > wait_iff_congested() thing.
> >
> > So are those buggy too, since first_zones_zonelist() can apparently return NULL?
> 
> The code is fine.
> 
> first_zones_zonelist() can only return NULL for the case that a nodemask
> was specified and the code in vmscan.c does not specify a nodemask.

Geeze, how did you work that out and how the heck was anyone else
supposed to know this :(


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-28 14:40                       ` [PATCH] x86-32: NUMA irq stacks allocations Eric Dumazet
@ 2010-10-29  6:43                         ` tip-bot for Eric Dumazet
  2010-10-29 18:32                           ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: tip-bot for Eric Dumazet @ 2010-10-29  6:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, eric.dumazet, brgerst, peterz, tglx, mingo

Commit-ID:  5c1eb08936693cd78c71164c8bea0b086ae72c67
Gitweb:     http://git.kernel.org/tip/5c1eb08936693cd78c71164c8bea0b086ae72c67
Author:     Eric Dumazet <eric.dumazet@gmail.com>
AuthorDate: Thu, 28 Oct 2010 16:40:54 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Oct 2010 08:17:07 +0200

x86-32: Restore irq stacks NUMA-aware allocations

Commit 22d4cd4c4d ("Allocate irq stacks seperate from percpu
area") removed NUMA affinity of IRQ stacks as side-effect of
the fix.

Using alloc_pages_node() instead of __get_free_pages() is safe,
even if the target node has no available LOWMEM pages:
alloc_pages_node() falls back to another node.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Brian Gerst <brgerst@gmail.com>
Cc: tj@kernel.org
Cc: torvalds@linux-foundation.org
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1288276854.2649.607.camel@edumazet-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/irq_32.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 64668db..96656f2 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -17,6 +17,7 @@
 #include <linux/delay.h>
 #include <linux/uaccess.h>
 #include <linux/percpu.h>
+#include <linux/mm.h>
 
 #include <asm/apic.h>
 
@@ -125,7 +126,9 @@ void __cpuinit irq_ctx_init(int cpu)
 	if (per_cpu(hardirq_ctx, cpu))
 		return;
 
-	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
+	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
+					       THREAD_FLAGS,
+					       THREAD_ORDER));
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -134,7 +137,9 @@ void __cpuinit irq_ctx_init(int cpu)
 
 	per_cpu(hardirq_ctx, cpu) = irqctx;
 
-	irqctx = (union irq_ctx *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER);
+	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
+					       THREAD_FLAGS,
+					       THREAD_ORDER));
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH] numa: fix slab_node(MPOL_BIND)
  2010-10-28 21:07                                                 ` Andrew Morton
@ 2010-10-29 14:55                                                   ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2010-10-29 14:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Eric Dumazet, Mel Gorman, Lee Schermerhorn,
	Tejun Heo, Peter Zijlstra, Brian Gerst, x86, linux-kernel, mingo

On Thu, 28 Oct 2010, Andrew Morton wrote:

> On Thu, 28 Oct 2010 11:55:18 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
>
> > On Thu, 28 Oct 2010, Linus Torvalds wrote:
> >
> > > And so does - although less obviously - mm/vmscan.c for the
> > > wait_iff_congested() thing.
> > >
> > > So are those buggy too, since first_zones_zonelist() can apparently return NULL?
> >
> > The code is fine.
> >
> > first_zones_zonelist() can only return NULL for the case that a nodemask
> > was specified and the code in vmscan.c does not specify a nodemask.
>
> Geeze, how did you work that out and how the heck was anyone else
> supposed to know this :(

Look at the code and how it was modified by Lee. The initial assumption
before his patch was that the zonelist contains all zones of the system.
Therefore you will always find all possible zone types in the system.
The function did not contain a check for the end of the zonelist, and it
still does not for the nodemask == NULL case.

However, the modification to filter the zonelist then makes it possible to
have subsets of zones not containing the requested zone types now.
Therefore Lee added a check for the end of the zonelist for that case.

Could be better documented. Took me some staring at the code to figure it
out.
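
As a rough illustration of the filtering described above, here is a
small userspace model. It is a sketch under stated assumptions, not the
kernel implementation: every name with a _model or _M suffix is
hypothetical, the node mask is a plain bit mask, and unlike the kernel
function the model checks for the end of the list in both cases. Node 0
has all zone types and node 1 has HighMem only, as in Eric's report.

```c
#include <assert.h>
#include <stddef.h>

enum zone_type_model { ZONE_DMA_M, ZONE_NORMAL_M, ZONE_HIGHMEM_M };

struct zone_model {
	int node;
	enum zone_type_model idx;
};

/* Node 0 has every zone type; node 1 has HighMem only. */
static const struct zone_model n0_high = { 0, ZONE_HIGHMEM_M };
static const struct zone_model n0_norm = { 0, ZONE_NORMAL_M };
static const struct zone_model n0_dma  = { 0, ZONE_DMA_M };
static const struct zone_model n1_high = { 1, ZONE_HIGHMEM_M };

/* NULL-terminated list, ordered as seen from node 0. */
static const struct zone_model *const zonelist_model[] = {
	&n0_high, &n0_norm, &n0_dma, &n1_high, NULL
};

/* First zone at or below 'highest' whose node is allowed by the mask. */
static const struct zone_model *
first_zone_model(const struct zone_model *const *zl,
		 enum zone_type_model highest, const unsigned long *nodemask)
{
	assert(zl != NULL);
	for (; *zl; zl++) {
		if ((*zl)->idx > highest)
			continue;	/* zone type too high for this request */
		if (nodemask && !(*nodemask & (1UL << (*zl)->node)))
			continue;	/* node filtered out by the mask */
		return *zl;
	}
	return NULL;	/* ran off the end: nothing matched */
}

/* Mirrors the shape of the proposed fix: fall back to the local node. */
static int slab_node_model(enum zone_type_model highest,
			   const unsigned long *nodemask, int local_node)
{
	const struct zone_model *zone =
		first_zone_model(zonelist_model, highest, nodemask);

	return zone ? zone->node : local_node;
}
```

Without a mask the full list always yields a Normal zone. With a mask
that allows only node 1 and a request limited to Normal, nothing
matches and the caller has to handle NULL.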


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29  6:43                         ` [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations tip-bot for Eric Dumazet
@ 2010-10-29 18:32                           ` Peter Zijlstra
  2010-10-29 20:09                             ` Cyrill Gorcunov
  2010-10-29 20:28                             ` Cyrill Gorcunov
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2010-10-29 18:32 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, eric.dumazet, brgerst, tglx, mingo
  Cc: linux-tip-commits

On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> +                                              THREAD_FLAGS,
> +                                              THREAD_ORDER)); 

Shouldn't we be checking for a NULL return from alloc_pages_node()
before calling page_address() on it?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 18:32                           ` Peter Zijlstra
@ 2010-10-29 20:09                             ` Cyrill Gorcunov
  2010-10-29 20:28                             ` Cyrill Gorcunov
  1 sibling, 0 replies; 63+ messages in thread
From: Cyrill Gorcunov @ 2010-10-29 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, eric.dumazet, brgerst, tglx, mingo,
	linux-tip-commits

On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > +                                              THREAD_FLAGS,
> > +                                              THREAD_ORDER)); 
> 
> Shouldn't we be checking for a NULL return from alloc_pages_node()
> before calling page_address() on it?

 It seems that at this stage it's a panic() or BUG_ON() candidate. (Which
reminds me of the GFP_PANIC flag proposal back in 2009 ;)

  Cyrill

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 18:32                           ` Peter Zijlstra
  2010-10-29 20:09                             ` Cyrill Gorcunov
@ 2010-10-29 20:28                             ` Cyrill Gorcunov
  2010-10-29 20:53                               ` Eric Dumazet
  2010-10-29 20:58                               ` Eric Dumazet
  1 sibling, 2 replies; 63+ messages in thread
From: Cyrill Gorcunov @ 2010-10-29 20:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, eric.dumazet, brgerst, tglx, mingo,
	linux-tip-commits

On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > +                                              THREAD_FLAGS,
> > +                                              THREAD_ORDER)); 
> 
> Shouldn't we be checking for a NULL return from alloc_pages_node()
> before calling page_address() on it?
> --

 Something like the patch below, I guess. We could also try to allocate
on the appropriate NUMA node first, fall back to the old alloc_pages if
that fails, and panic only if that fails in turn.

  Cyrill
---
 arch/x86/kernel/irq_32.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6.git/arch/x86/kernel/irq_32.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/irq_32.c
+++ linux-2.6.git/arch/x86/kernel/irq_32.c
@@ -122,13 +122,16 @@ execute_on_irq_stack(int overflow, struc
 void __cpuinit irq_ctx_init(int cpu)
 {
 	union irq_ctx *irqctx;
+	struct page *page;
 
 	if (per_cpu(hardirq_ctx, cpu))
 		return;
 
-	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
-					       THREAD_FLAGS,
-					       THREAD_ORDER));
+	page = alloc_pages_node(cpu_to_node(cpu),
+				THREAD_FLAGS, THREAD_ORDER);
+	BUG_ON(!page);
+
+	irqctx				= page_address(page);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;
@@ -137,9 +140,11 @@ void __cpuinit irq_ctx_init(int cpu)
 
 	per_cpu(hardirq_ctx, cpu) = irqctx;
 
-	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
-					       THREAD_FLAGS,
-					       THREAD_ORDER));
+	page = alloc_pages_node(cpu_to_node(cpu),
+				THREAD_FLAGS, THREAD_ORDER);
+	BUG_ON(!page);
+
+	irqctx				= page_address(page);
 	irqctx->tinfo.task		= NULL;
 	irqctx->tinfo.exec_domain	= NULL;
 	irqctx->tinfo.cpu		= cpu;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 20:28                             ` Cyrill Gorcunov
@ 2010-10-29 20:53                               ` Eric Dumazet
  2010-10-29 20:59                                 ` Cyrill Gorcunov
  2010-10-29 20:58                               ` Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-29 20:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Peter Zijlstra, mingo, hpa, linux-kernel, brgerst, tglx, mingo,
	linux-tip-commits

On Saturday, 30 October 2010 at 00:28 +0400, Cyrill Gorcunov wrote:
> On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > > +                                              THREAD_FLAGS,
> > > +                                              THREAD_ORDER)); 
> > 
> > Shouldn't we be checking for a NULL return from alloc_pages_node()
> > before calling page_address() on it?
> > --
> 

I didn't check for NULL because the original code didn't either.

If you cannot allocate memory for the IRQ stack, the only choice is to
crash anyway.

Adding BUG_ON() is not that helpful in this respect.

>  Something like below I guess, but probably we could try to allocate
> on appropriate NUMA node first and if it fails -- via old alloc_pages
> and if it fail in turn -- then we panic.
> 
>   Cyrill
> ---
>  arch/x86/kernel/irq_32.c |   17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6.git/arch/x86/kernel/irq_32.c
> =====================================================================
> --- linux-2.6.git.orig/arch/x86/kernel/irq_32.c
> +++ linux-2.6.git/arch/x86/kernel/irq_32.c
> @@ -122,13 +122,16 @@ execute_on_irq_stack(int overflow, struc
>  void __cpuinit irq_ctx_init(int cpu)
>  {
>  	union irq_ctx *irqctx;
> +	struct page *page;
>  
>  	if (per_cpu(hardirq_ctx, cpu))
>  		return;
>  
> -	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> -					       THREAD_FLAGS,
> -					       THREAD_ORDER));
> +	page = alloc_pages_node(cpu_to_node(cpu),
> +				THREAD_FLAGS, THREAD_ORDER);
> +	BUG_ON(!page);
> +
> +	irqctx				= page_address(page);
>  	irqctx->tinfo.task		= NULL;
>  	irqctx->tinfo.exec_domain	= NULL;
>  	irqctx->tinfo.cpu		= cpu;
> @@ -137,9 +140,11 @@ void __cpuinit irq_ctx_init(int cpu)
>  
>  	per_cpu(hardirq_ctx, cpu) = irqctx;
>  
> -	irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> -					       THREAD_FLAGS,
> -					       THREAD_ORDER));
> +	page = alloc_pages_node(cpu_to_node(cpu),
> +				THREAD_FLAGS, THREAD_ORDER);
> +	BUG_ON(!page);
> +
> +	irqctx				= page_address(page);
>  	irqctx->tinfo.task		= NULL;
>  	irqctx->tinfo.exec_domain	= NULL;
>  	irqctx->tinfo.cpu		= cpu;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 20:28                             ` Cyrill Gorcunov
  2010-10-29 20:53                               ` Eric Dumazet
@ 2010-10-29 20:58                               ` Eric Dumazet
  2010-10-29 21:21                                 ` Cyrill Gorcunov
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2010-10-29 20:58 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Peter Zijlstra, mingo, hpa, linux-kernel, brgerst, tglx, mingo,
	linux-tip-commits

On Saturday, 30 October 2010 at 00:28 +0400, Cyrill Gorcunov wrote:
> On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > > +                                              THREAD_FLAGS,
> > > +                                              THREAD_ORDER)); 
> > 
> > Shouldn't we be checking for a NULL return from alloc_pages_node()
> > before calling page_address() on it?
> > --
> 
>  Something like below I guess, but probably we could try to allocate
> on appropriate NUMA node first and if it fails -- via old alloc_pages
> and if it fail in turn -- then we panic.

Maybe my commit message was not clear:

There is no need to test the return value of alloc_pages_node() and do
the fallback. It is already done properly.

If NULL is returned, then there is no memory at all on the machine.

I tested my patch on my machine, where node 1 has HighMem only, and
alloc_pages_node(1, flags, order) gave me a page from node 0.
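
The fallback behaviour described above can be illustrated with a toy
user-space model. This is only a sketch under stated assumptions, not
the kernel's allocator: struct sketch_node, sketch_alloc_node() and
SKETCH_NR_NODES are hypothetical names, and the real fallback order
comes from the preferred node's zonelist rather than from node index
order.

```c
#include <assert.h>

#define SKETCH_NR_NODES 2	/* hypothetical two-node machine */

/* Hypothetical per-node state: pages usable for kernel stacks (lowmem). */
struct sketch_node {
	int free_lowmem_pages;
};

/*
 * Try the preferred node first, then the remaining nodes in index order.
 * Returns the id of the node that satisfied the request, or -1 when no
 * node has enough memory (the "NULL" case in the discussion above).
 */
static int sketch_alloc_node(struct sketch_node *nodes, int preferred,
			     int npages)
{
	assert(preferred >= 0 && preferred < SKETCH_NR_NODES);

	for (int i = 0; i < SKETCH_NR_NODES; i++) {
		int nid = (preferred + i) % SKETCH_NR_NODES;

		if (nodes[nid].free_lowmem_pages >= npages) {
			nodes[nid].free_lowmem_pages -= npages;
			return nid;
		}
	}
	return -1;
}
```

With node 1 holding no lowmem, a request that prefers node 1 is served
from node 0, and failure is reported only once every node has been
tried.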




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 20:53                               ` Eric Dumazet
@ 2010-10-29 20:59                                 ` Cyrill Gorcunov
  0 siblings, 0 replies; 63+ messages in thread
From: Cyrill Gorcunov @ 2010-10-29 20:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, mingo, hpa, linux-kernel, brgerst, tglx, mingo,
	linux-tip-commits

On Fri, Oct 29, 2010 at 10:53:07PM +0200, Eric Dumazet wrote:
> On Saturday, 30 October 2010 at 00:28 +0400, Cyrill Gorcunov wrote:
> > On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > > > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > > > +                                              THREAD_FLAGS,
> > > > +                                              THREAD_ORDER)); 
> > > 
> > > Shouldn't we be checking for a NULL return from alloc_pages_node()
> > > before calling page_address() on it?
> > > --
> > 
> 
> I didnt check for NULL because original code was not either.
> 
> If you cannot allocate memory for the IRQ stack, only choice is to crash
> anyway.
> 
> Adding BUG_ON() is not that helpful in this respect.
> 

 I believe it was a nit in the first place; the BUG_ON simply tells
readers of the code that we know NULL can happen here but there is no
way to continue if it does.

 Cyrill

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations
  2010-10-29 20:58                               ` Eric Dumazet
@ 2010-10-29 21:21                                 ` Cyrill Gorcunov
  0 siblings, 0 replies; 63+ messages in thread
From: Cyrill Gorcunov @ 2010-10-29 21:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, mingo, hpa, linux-kernel, brgerst, tglx, mingo,
	linux-tip-commits

On Fri, Oct 29, 2010 at 10:58:03PM +0200, Eric Dumazet wrote:
> On Saturday, 30 October 2010 at 00:28 +0400, Cyrill Gorcunov wrote:
> > On Fri, Oct 29, 2010 at 08:32:26PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-10-29 at 06:43 +0000, tip-bot for Eric Dumazet wrote:
> > > > +       irqctx = page_address(alloc_pages_node(cpu_to_node(cpu),
> > > > +                                              THREAD_FLAGS,
> > > > +                                              THREAD_ORDER)); 
> > > 
> > > Shouldn't we be checking for a NULL return from alloc_pages_node()
> > > before calling page_address() on it?
> > > --
> > 
> >  Something like below I guess, but probably we could try to allocate
> > on appropriate NUMA node first and if it fails -- via old alloc_pages
> > and if it fail in turn -- then we panic.
> 
> Maybe my commit message was not clear : 
> 
> There is no need to test return from alloc_pages_node() and do the
> fallback. It already done properly.
>

Yes Eric, I somehow managed to miss that, my bad.
 
  Cyrill

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Uclinux-dist-devel] [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2010-10-26 14:06         ` [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE Tejun Heo
@ 2011-03-24  6:46             ` Mike Frysinger
  2011-03-24  8:54           ` [PATCH UPDATED] " Tejun Heo
  1 sibling, 0 replies; 63+ messages in thread
From: Mike Frysinger @ 2011-03-24  6:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, Alexander van Heukelum, linux-am33-list,
	user-mode-linux-devel, Jeff Dike, linux-kernel, David Howells,
	Mark Salter, uclinux-dist-devel, akpm, Ingo Molnar,
	Akira Takeuchi

On Tue, Oct 26, 2010 at 10:06, Tejun Heo wrote:
> The linker script macros PERCPU_VADDR() and PERCPU() are used to
> define this output section and the latter takes @align parameter.
> Several architectures are using @align smaller than PAGE_SIZE breaking
> percpu memory alignment.

hmm, i just pushed through a fix in the Blackfin tree as we hit a boot
failure otherwise:
-   PERCPU(4)
+   PERCPU(PAGE_SIZE)

> This patch removes @align parameter from PERCPU(), renames it to
> PERCPU_SECTION and makes it always align to PAGE_SIZE.  While at it,
> add PCPU_SETUP_BUG_ON() checks such that alignment problems are
> reliably detected and remove percpu alignment comment recently added
> in workqueue.c as the condition would trigger BUG way before reaching
> there.

seems this still hasn't made it to mainline.  has it stalled or
something?  feel free to add, for the Blackfin bits:
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
-mike

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Uclinux-dist-devel] [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2011-03-24  6:46             ` Mike Frysinger
@ 2011-03-24  8:25               ` Tejun Heo
  -1 siblings, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2011-03-24  8:25 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: torvalds, Alexander van Heukelum, linux-am33-list,
	user-mode-linux-devel, Jeff Dike, linux-kernel, David Howells,
	Mark Salter, uclinux-dist-devel, akpm, Ingo Molnar,
	Akira Takeuchi

Hello,

On Thu, Mar 24, 2011 at 02:46:01AM -0400, Mike Frysinger wrote:
> On Tue, Oct 26, 2010 at 10:06, Tejun Heo wrote:
> > The linker script macros PERCPU_VADDR() and PERCPU() are used to
> > define this output section and the latter takes @align parameter.
> > Several architectures are using @align smaller than PAGE_SIZE breaking
> > percpu memory alignment.
> 
> hmm, i just pushed through a fix in the Blackfin tree as we hit a boot
> failure otherwise:
> -   PERCPU(4)
> +   PERCPU(PAGE_SIZE)
> 
> > This patch removes @align parameter from PERCPU(), renames it to
> > PERCPU_SECTION and makes it always align to PAGE_SIZE.  While at it,
> > add PCPU_SETUP_BUG_ON() checks such that alignment problems are
> > reliably detected and remove percpu alignment comment recently added
> > in workqueue.c as the condition would trigger BUG way before reaching
> > there.
> 
> seems this still hasnt made it to mainline.  has it stalled or
> something ?  feel free for the Blackfin bits:
> Signed-off-by: Mike Frysinger <vapier@gentoo.org>

Heh, I just forgot.  I'll queue it with your Acked-by added.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Uclinux-dist-devel] [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2011-03-24  8:25               ` Tejun Heo
@ 2011-03-24  8:51                 ` Tejun Heo
  -1 siblings, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2011-03-24  8:51 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: torvalds, Alexander van Heukelum, linux-am33-list,
	user-mode-linux-devel, Jeff Dike, linux-kernel, David Howells,
	Mark Salter, uclinux-dist-devel, akpm, Ingo Molnar,
	Akira Takeuchi

On Thu, Mar 24, 2011 at 09:25:23AM +0100, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 24, 2011 at 02:46:01AM -0400, Mike Frysinger wrote:
> > On Tue, Oct 26, 2010 at 10:06, Tejun Heo wrote:
> > > The linker script macros PERCPU_VADDR() and PERCPU() are used to
> > > define this output section and the latter takes @align parameter.
> > > Several architectures are using @align smaller than PAGE_SIZE breaking
> > > percpu memory alignment.
> > 
> > hmm, i just pushed through a fix in the Blackfin tree as we hit a boot
> > failure otherwise:
> > -   PERCPU(4)
> > +   PERCPU(PAGE_SIZE)
> > 
> > > This patch removes @align parameter from PERCPU(), renames it to
> > > PERCPU_SECTION and makes it always align to PAGE_SIZE.  While at it,
> > > add PCPU_SETUP_BUG_ON() checks such that alignment problems are
> > > reliably detected and remove percpu alignment comment recently added
> > > in workqueue.c as the condition would trigger BUG way before reaching
> > > there.
> > 
> > seems this still hasnt made it to mainline.  has it stalled or
> > something ?  feel free for the Blackfin bits:
> > Signed-off-by: Mike Frysinger <vapier@gentoo.org>
> 
> Heh, I just forgot.  I'll queue it with your Acked-by added.

BTW, you're gonna push out the blackfin change to mainline soon,
right?  I'll queue the patch for the next merge window once the
blackfin fix hits mainline.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH UPDATED] percpu: always align percpu output section to PAGE_SIZE
  2010-10-26 14:06         ` [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE Tejun Heo
  2011-03-24  6:46             ` Mike Frysinger
@ 2011-03-24  8:54           ` Tejun Heo
  1 sibling, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2011-03-24  8:54 UTC (permalink / raw)
  To: torvalds, Alexander van Heukelum
  Cc: David Howells, akpm, linux-am33-list, linux-kernel,
	Akira Takeuchi, Mark Salter, Ingo Molnar, Mike Frysinger,
	uclinux-dist-devel, Jeff Dike, user-mode-linux-devel

The percpu allocator honors alignment requests up to PAGE_SIZE, and both
the percpu addresses in the percpu address space and the translated
kernel addresses should be aligned accordingly.  The calculation of the
former depends on the alignment of the percpu output section in the
kernel image.

The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section, and the latter takes an @align parameter.
Several architectures use an @align smaller than PAGE_SIZE, which
breaks percpu memory alignment.

This patch removes the @align parameter from PERCPU(), renames it to
PERCPU_SECTION() and makes it always align to PAGE_SIZE.  While at it,
add PCPU_SETUP_BUG_ON() checks so that alignment problems are
reliably detected, and remove the percpu alignment comment recently
added in workqueue.c, as the condition would trigger BUG way before
reaching there.

For um, this patch raises the alignment of the percpu area.  As the area
is in .init, there shouldn't be any noticeable difference.

This problem was discovered by David Howells while debugging a boot
failure on mn10300.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
---
I'll apply this to percpu:for-2.6.40 once the blackfin fix hits
mainline.

Thanks.
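
For readers following along, the two kinds of alignment test involved
(the IS_ALIGNED() assertion in alloc_cwqs() and the new
"addr & ~PAGE_MASK" checks in mm/percpu.c) can be tried out in user
space.  This is only a sketch under stated assumptions: the SKETCH_*
macros and sketch_* helpers are hypothetical stand-ins, not kernel
definitions, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel gets it from the arch. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* True if addr is a multiple of align; align must be a power of two. */
static bool sketch_is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Same shape as the added checks: non-zero low bits mean misaligned. */
static bool sketch_page_misaligned(uintptr_t addr)
{
	return (addr & ~SKETCH_PAGE_MASK) != 0;
}
```

Feeding in the pointer from the original report, 0x902776e0 fails the
0x100 alignment test because its low byte is 0xe0, and it also fails
the page-level check, whereas 0x90277000 passes both.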

 arch/alpha/kernel/vmlinux.lds.S    |    2 +-
 arch/arm/kernel/vmlinux.lds.S      |    2 +-
 arch/blackfin/kernel/vmlinux.lds.S |    2 +-
 arch/cris/kernel/vmlinux.lds.S     |    2 +-
 arch/frv/kernel/vmlinux.lds.S      |    2 +-
 arch/m32r/kernel/vmlinux.lds.S     |    2 +-
 arch/mips/kernel/vmlinux.lds.S     |    2 +-
 arch/mn10300/kernel/vmlinux.lds.S  |    2 +-
 arch/parisc/kernel/vmlinux.lds.S   |    2 +-
 arch/powerpc/kernel/vmlinux.lds.S  |    2 +-
 arch/s390/kernel/vmlinux.lds.S     |    2 +-
 arch/sh/kernel/vmlinux.lds.S       |    2 +-
 arch/sparc/kernel/vmlinux.lds.S    |    2 +-
 arch/tile/kernel/vmlinux.lds.S     |    2 +-
 arch/um/include/asm/common.lds.S   |    2 +-
 arch/x86/kernel/vmlinux.lds.S      |    2 +-
 arch/xtensa/kernel/vmlinux.lds.S   |    2 +-
 include/asm-generic/vmlinux.lds.h  |   17 ++++++++---------
 kernel/workqueue.c                 |    4 +---
 mm/percpu.c                        |    2 ++
 20 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index 433be2a..8d57948 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -39,7 +39,7 @@ SECTIONS
 	__init_begin = ALIGN(PAGE_SIZE);
 	INIT_TEXT_SECTION(PAGE_SIZE)
 	INIT_DATA_SECTION(16)
-	PERCPU(L1_CACHE_BYTES, PAGE_SIZE)
+	PERCPU_SECTION(L1_CACHE_BYTES)
 	/* Align to THREAD_SIZE rather than PAGE_SIZE here so any padding page
 	   needed for the THREAD_SIZE aligned init_task gets freed after init */
 	. = ALIGN(THREAD_SIZE);
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 28fea9b..62d0768 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -78,7 +78,7 @@ SECTIONS
 #endif
 	}
 
-	PERCPU(32, PAGE_SIZE)
+	PERCPU_SECTION(32)
 
 #ifndef CONFIG_XIP_KERNEL
 	. = ALIGN(PAGE_SIZE);
diff --git a/arch/blackfin/kernel/vmlinux.lds.S b/arch/blackfin/kernel/vmlinux.lds.S
index c40d07f..7c51016 100644
--- a/arch/blackfin/kernel/vmlinux.lds.S
+++ b/arch/blackfin/kernel/vmlinux.lds.S
@@ -136,7 +136,7 @@ SECTIONS
 
 	. = ALIGN(16);
 	INIT_DATA_SECTION(16)
-	PERCPU(32, 4)
+	PERCPU_SECTION(32)
 
 	.exit.data :
 	{
diff --git a/arch/cris/kernel/vmlinux.lds.S b/arch/cris/kernel/vmlinux.lds.S
index 728bbd9..a6990cb 100644
--- a/arch/cris/kernel/vmlinux.lds.S
+++ b/arch/cris/kernel/vmlinux.lds.S
@@ -102,7 +102,7 @@ SECTIONS
 #endif
 	__vmlinux_end = .;		/* Last address of the physical file. */
 #ifdef CONFIG_ETRAX_ARCH_V32
-	PERCPU(32, PAGE_SIZE)
+	PERCPU_SECTION(32)
 
 	.init.ramfs : {
 		INIT_RAM_FS
diff --git a/arch/frv/kernel/vmlinux.lds.S b/arch/frv/kernel/vmlinux.lds.S
index 0daae8a..7e958d8 100644
--- a/arch/frv/kernel/vmlinux.lds.S
+++ b/arch/frv/kernel/vmlinux.lds.S
@@ -37,7 +37,7 @@ SECTIONS
   _einittext = .;
 
   INIT_DATA_SECTION(8)
-  PERCPU(L1_CACHE_BYTES, 4096)
+  PERCPU_SECTION(L1_CACHE_BYTES)
 
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
diff --git a/arch/m32r/kernel/vmlinux.lds.S b/arch/m32r/kernel/vmlinux.lds.S
index c194d64..2e7ccf7 100644
--- a/arch/m32r/kernel/vmlinux.lds.S
+++ b/arch/m32r/kernel/vmlinux.lds.S
@@ -53,7 +53,7 @@ SECTIONS
   __init_begin = .;
   INIT_TEXT_SECTION(PAGE_SIZE)
   INIT_DATA_SECTION(16)
-  PERCPU(32, PAGE_SIZE)
+  PERCPU_SECTION(32)
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff --git a/arch/mips/kernel/vmlinux.lds.S b/arch/mips/kernel/vmlinux.lds.S
index 832afbb..8616709 100644
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -115,7 +115,7 @@ SECTIONS
 		EXIT_DATA
 	}
 
-	PERCPU(1 << CONFIG_MIPS_L1_CACHE_SHIFT, PAGE_SIZE)
+	PERCPU_SECTION(1 << CONFIG_MIPS_L1_CACHE_SHIFT)
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
 	/* freed after init ends here */
diff --git a/arch/mn10300/kernel/vmlinux.lds.S b/arch/mn10300/kernel/vmlinux.lds.S
index 968bcd2..6f702a6 100644
--- a/arch/mn10300/kernel/vmlinux.lds.S
+++ b/arch/mn10300/kernel/vmlinux.lds.S
@@ -70,7 +70,7 @@ SECTIONS
 	.exit.text : { EXIT_TEXT; }
 	.exit.data : { EXIT_DATA; }
 
-  PERCPU(32, PAGE_SIZE)
+  PERCPU_SECTION(32)
   . = ALIGN(PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S
index 8f1e4ef..85b8661 100644
--- a/arch/parisc/kernel/vmlinux.lds.S
+++ b/arch/parisc/kernel/vmlinux.lds.S
@@ -145,7 +145,7 @@ SECTIONS
 		EXIT_DATA
 	}
 
-	PERCPU(L1_CACHE_BYTES, PAGE_SIZE)
+	PERCPU_SECTION(L1_CACHE_BYTES)
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
 	/* freed after init ends here */
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index b9150f0..920276c 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -160,7 +160,7 @@ SECTIONS
 		INIT_RAM_FS
 	}
 
-	PERCPU(L1_CACHE_BYTES, PAGE_SIZE)
+	PERCPU_SECTION(L1_CACHE_BYTES)
 
 	. = ALIGN(8);
 	.machine.desc : AT(ADDR(.machine.desc) - LOAD_OFFSET) {
diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index 1bc18cd..56fe6bc 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -77,7 +77,7 @@ SECTIONS
 	. = ALIGN(PAGE_SIZE);
 	INIT_DATA_SECTION(0x100)
 
-	PERCPU(0x100, PAGE_SIZE)
+	PERCPU_SECTION(0x100)
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;		/* freed after init ends here */
 
diff --git a/arch/sh/kernel/vmlinux.lds.S b/arch/sh/kernel/vmlinux.lds.S
index af4d461..731c10c 100644
--- a/arch/sh/kernel/vmlinux.lds.S
+++ b/arch/sh/kernel/vmlinux.lds.S
@@ -66,7 +66,7 @@ SECTIONS
 		__machvec_end = .;
 	}
 
-	PERCPU(L1_CACHE_BYTES, PAGE_SIZE)
+	PERCPU_SECTION(L1_CACHE_BYTES)
 
 	/*
 	 * .exit.text is discarded at runtime, not link time, to deal with
diff --git a/arch/sparc/kernel/vmlinux.lds.S b/arch/sparc/kernel/vmlinux.lds.S
index 92b557a..c022075 100644
--- a/arch/sparc/kernel/vmlinux.lds.S
+++ b/arch/sparc/kernel/vmlinux.lds.S
@@ -108,7 +108,7 @@ SECTIONS
 		__sun4v_2insn_patch_end = .;
 	}
 
-	PERCPU(SMP_CACHE_BYTES, PAGE_SIZE)
+	PERCPU_SECTION(SMP_CACHE_BYTES)
 
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
diff --git a/arch/tile/kernel/vmlinux.lds.S b/arch/tile/kernel/vmlinux.lds.S
index c6ce378..c2feb64 100644
--- a/arch/tile/kernel/vmlinux.lds.S
+++ b/arch/tile/kernel/vmlinux.lds.S
@@ -63,7 +63,7 @@ SECTIONS
     *(.init.page)
   } :data =0
   INIT_DATA_SECTION(16)
-  PERCPU(L2_CACHE_BYTES, PAGE_SIZE)
+  PERCPU_SECTION(L2_CACHE_BYTES)
   . = ALIGN(PAGE_SIZE);
   VMLINUX_SYMBOL(_einitdata) = .;
 
diff --git a/arch/um/include/asm/common.lds.S b/arch/um/include/asm/common.lds.S
index 34bede8..4938de5 100644
--- a/arch/um/include/asm/common.lds.S
+++ b/arch/um/include/asm/common.lds.S
@@ -42,7 +42,7 @@
 	INIT_SETUP(0)
   }
 
-  PERCPU(32, 32)
+  PERCPU_SECTION(32)
 	
   .initcall.init : {
 	INIT_CALLS
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index cef446f..b61d6bb 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -305,7 +305,7 @@ SECTIONS
 	}
 
 #if !defined(CONFIG_X86_64) || !defined(CONFIG_SMP)
-	PERCPU(INTERNODE_CACHE_BYTES, THREAD_SIZE)
+	PERCPU_SECTION(INTERNODE_CACHE_BYTES)
 #endif
 
 	. = ALIGN(PAGE_SIZE);
diff --git a/arch/xtensa/kernel/vmlinux.lds.S b/arch/xtensa/kernel/vmlinux.lds.S
index a282006..88ecea3 100644
--- a/arch/xtensa/kernel/vmlinux.lds.S
+++ b/arch/xtensa/kernel/vmlinux.lds.S
@@ -155,7 +155,7 @@ SECTIONS
     INIT_RAM_FS
   }
 
-  PERCPU(XCHAL_ICACHE_LINESIZE, PAGE_SIZE)
+  PERCPU_SECTION(XCHAL_ICACHE_LINESIZE)
 
   /* We need this dummy segment here */
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 22d3342..a8c386c 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -15,7 +15,7 @@
  *	HEAD_TEXT_SECTION
  *	INIT_TEXT_SECTION(PAGE_SIZE)
  *	INIT_DATA_SECTION(...)
- *	PERCPU(CACHELINE_SIZE, PAGE_SIZE)
+ *	PERCPU_SECTION(CACHELINE_SIZE)
  *	__init_end = .;
  *
  *	_stext = .;
@@ -703,7 +703,7 @@
  *
  * Note that this macros defines __per_cpu_load as an absolute symbol.
  * If there is no need to put the percpu section at a predetermined
- * address, use PERCPU().
+ * address, use PERCPU_SECTION.
  */
 #define PERCPU_VADDR(cacheline, vaddr, phdr)				\
 	VMLINUX_SYMBOL(__per_cpu_load) = .;				\
@@ -723,20 +723,19 @@
 	. = VMLINUX_SYMBOL(__per_cpu_load) + SIZEOF(.data..percpu);
 
 /**
- * PERCPU - define output section for percpu area, simple version
+ * PERCPU_SECTION - define output section for percpu area, simple version
  * @cacheline: cacheline size
- * @align: required alignment
  *
- * Align to @align and outputs output section for percpu area.  This macro
- * doesn't manipulate @vaddr or @phdr and __per_cpu_load and
+ * Align to PAGE_SIZE and outputs output section for percpu area.  This
+ * macro doesn't manipulate @vaddr or @phdr and __per_cpu_load and
  * __per_cpu_start will be identical.
  *
- * This macro is equivalent to ALIGN(@align); PERCPU_VADDR(@cacheline,,)
+ * This macro is equivalent to ALIGN(PAGE_SIZE); PERCPU_VADDR(@cacheline,,)
  * except that __per_cpu_load is defined as a relative symbol against
  * .data..percpu which is required for relocatable x86_32 configuration.
  */
-#define PERCPU(cacheline, align)					\
-	. = ALIGN(align);						\
+#define PERCPU_SECTION(cacheline)					\
+	. = ALIGN(PAGE_SIZE);						\
 	.data..percpu	: AT(ADDR(.data..percpu) - LOAD_OFFSET) {	\
 		VMLINUX_SYMBOL(__per_cpu_load) = .;			\
 		VMLINUX_SYMBOL(__per_cpu_start) = .;			\
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ee6578b..0ef7b43 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2850,9 +2850,7 @@ static int alloc_cwqs(struct workqueue_struct *wq)
 		}
 	}
 
-	/* just in case, make sure it's actually aligned
-	 * - this is affected by PERCPU() alignment in vmlinux.lds.S
-	 */
+	/* just in case, make sure it's actually aligned */
 	BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
 	return wq->cpu_wq.v ? 0 : -ENOMEM;
 }
diff --git a/mm/percpu.c b/mm/percpu.c
index 8a11cd2..8eb5366 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1216,8 +1216,10 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 #ifdef CONFIG_SMP
 	PCPU_SETUP_BUG_ON(!ai->static_size);
+	PCPU_SETUP_BUG_ON((unsigned long)__per_cpu_start & ~PAGE_MASK);
 #endif
 	PCPU_SETUP_BUG_ON(!base_addr);
+	PCPU_SETUP_BUG_ON((unsigned long)base_addr & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);

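[Editorial illustration, not part of the patch.]  The two new
PCPU_SETUP_BUG_ON() lines test whether an address has any bits set below
the page boundary.  The sketch below reproduces that mask arithmetic in
plain user-space C.  The EX_PAGE_SIZE/EX_PAGE_MASK constants and the ex_*
helper names are hypothetical, chosen for this example only; 4096 is the
page size quoted earlier in the thread, and the kernel takes the real
values from arch headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only (4 KiB pages). */
#define EX_PAGE_SIZE 4096UL
#define EX_PAGE_MASK (~(EX_PAGE_SIZE - 1))

/*
 * Same shape as the conditions added to pcpu_setup_first_chunk():
 * any bit set below the page boundary means the address is not
 * page aligned.
 */
static bool ex_page_misaligned(uintptr_t addr)
{
	return (addr & ~EX_PAGE_MASK) != 0;
}

/* Power-of-two alignment test, in the spirit of the kernel's IS_ALIGNED(). */
static bool ex_is_aligned(uintptr_t addr, uintptr_t align)
{
	return (addr & (align - 1)) == 0;
}

/* Round up to a power-of-two boundary, as the linker's ALIGN() does. */
static uintptr_t ex_align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + align - 1) & ~(align - 1);
}
```

With the address from the original report, ex_is_aligned(0x902776e0, 0x100)
is false, which is the condition that tripped the BUG_ON() in alloc_cwqs().
Rounding the same address up with ex_align_up(0x902776e0, EX_PAGE_SIZE)
gives 0x90278000, which passes both the page check and the 0x100 check,
since any page-aligned address is also 256-byte aligned.
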

* Re: [Uclinux-dist-devel] [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2011-03-24  8:51                 ` Tejun Heo
@ 2011-03-24 13:46                   ` Mike Frysinger
  -1 siblings, 0 replies; 63+ messages in thread
From: Mike Frysinger @ 2011-03-24 13:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, Alexander van Heukelum, linux-am33-list,
	user-mode-linux-devel, Jeff Dike, linux-kernel, David Howells,
	Mark Salter, uclinux-dist-devel, akpm, Ingo Molnar,
	Akira Takeuchi

On Thu, Mar 24, 2011 at 4:51 AM, Tejun Heo wrote:
> On Thu, Mar 24, 2011 at 09:25:23AM +0100, Tejun Heo wrote:
>> On Thu, Mar 24, 2011 at 02:46:01AM -0400, Mike Frysinger wrote:
>> > On Tue, Oct 26, 2010 at 10:06, Tejun Heo wrote:
>> > > The linker script macros PERCPU_VADDR() and PERCPU() are used to
>> > > define this output section and the latter takes @align parameter.
>> > > Several architectures are using @align smaller than PAGE_SIZE breaking
>> > > percpu memory alignment.
>> >
>> > hmm, i just pushed through a fix in the Blackfin tree as we hit a boot
>> > failure otherwise:
>> > -   PERCPU(4)
>> > +   PERCPU(PAGE_SIZE)
>> >
>> > > This patch removes @align parameter from PERCPU(), renames it to
>> > > PERCPU_SECTION and makes it always align to PAGE_SIZE.  While at it,
>> > > add PCPU_SETUP_BUG_ON() checks such that alignment problems are
>> > > reliably detected and remove percpu alignment comment recently added
>> > > in workqueue.c as the condition would trigger BUG way before reaching
>> > > there.
>> >
>> > seems this still hasnt made it to mainline.  has it stalled or
>> > something ?  feel free for the Blackfin bits:
>> > Signed-off-by: Mike Frysinger <vapier@gentoo.org>
>>
>> Heh, I just forgot.  I'll queue it with your Acked-by added.
>
> BTW, you're gonna push out the blackfin change to mainline soon,
> right?  I'll queue the patch for the next merge window once the
> blackfin fix hits mainline.

Linus has already pulled it
-mike


* Re: [Uclinux-dist-devel] [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE
  2011-03-24 13:46                   ` Mike Frysinger
  (?)
@ 2011-03-24 17:51                   ` Tejun Heo
  -1 siblings, 0 replies; 63+ messages in thread
From: Tejun Heo @ 2011-03-24 17:51 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: torvalds, Alexander van Heukelum, linux-am33-list,
	user-mode-linux-devel, Jeff Dike, linux-kernel, David Howells,
	Mark Salter, uclinux-dist-devel, akpm, Ingo Molnar,
	Akira Takeuchi

On Thu, Mar 24, 2011 at 09:46:53AM -0400, Mike Frysinger wrote:
> Linus has already pulled it

Thanks.  Patch queued for 2.6.40.

-- 
tejun


end of thread, other threads:[~2011-03-24 18:00 UTC | newest]

Thread overview: 63+ messages
2010-10-25 22:41 [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
2010-10-26  9:10 ` Tejun Heo
2010-10-26 10:22 ` David Howells
2010-10-26 12:14   ` Tejun Heo
2010-10-26 12:27     ` Tejun Heo
2010-10-26 12:45       ` [PATCH] x86, percpu: revert commit fe8e0c25 Tejun Heo
2010-10-26 13:25         ` Ingo Molnar
2010-10-26 13:34           ` Tejun Heo
2010-10-26 13:49             ` Brian Gerst
2010-10-26 15:08               ` Linus Torvalds
2010-10-27  5:43                 ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Brian Gerst
2010-10-27  6:07                   ` Eric Dumazet
2010-10-27  9:57                     ` Peter Zijlstra
2010-10-27 13:33                       ` Eric Dumazet
2010-10-27 13:42                         ` Tejun Heo
2010-10-27 13:57                           ` Eric Dumazet
2010-10-27 14:00                             ` Tejun Heo
2010-10-27 14:24                               ` Eric Dumazet
2010-10-27 14:39                                 ` Tejun Heo
2010-10-27 14:39                                 ` Eric Dumazet
2010-10-27 14:43                                   ` Tejun Heo
2010-10-27 15:21                                     ` Eric Dumazet
2010-10-27 15:35                                       ` Tejun Heo
2010-10-27 16:07                                         ` Eric Dumazet
2010-10-27 17:33                                           ` [PATCH] numa: fix slab_node(MPOL_BIND) Eric Dumazet
2010-10-28 15:59                                             ` Linus Torvalds
2010-10-28 16:27                                               ` Eric Dumazet
2010-10-28 16:45                                               ` Mel Gorman
2010-10-28 16:55                                               ` Christoph Lameter
2010-10-28 21:07                                                 ` Andrew Morton
2010-10-29 14:55                                                   ` Christoph Lameter
2010-10-27 20:55                                         ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Eric Dumazet
2010-10-28 12:01                                           ` Tejun Heo
2010-10-28 12:30                                             ` Eric Dumazet
2010-10-28 14:40                       ` [PATCH] x86-32: NUMA irq stacks allocations Eric Dumazet
2010-10-29  6:43                         ` [tip:x86/urgent] x86-32: Restore irq stacks NUMA-aware allocations tip-bot for Eric Dumazet
2010-10-29 18:32                           ` Peter Zijlstra
2010-10-29 20:09                             ` Cyrill Gorcunov
2010-10-29 20:28                             ` Cyrill Gorcunov
2010-10-29 20:53                               ` Eric Dumazet
2010-10-29 20:59                                 ` Cyrill Gorcunov
2010-10-29 20:58                               ` Eric Dumazet
2010-10-29 21:21                                 ` Cyrill Gorcunov
2010-10-27 15:19                   ` [PATCH] x86-32: Allocate irq stacks seperate from percpu area Linus Torvalds
2010-10-27 15:30                     ` Ingo Molnar
2010-10-27 15:33                       ` Ingo Molnar
2010-10-27 15:40                         ` Tejun Heo
2010-10-27 15:43                           ` Ingo Molnar
2010-10-27 16:03                   ` [tip:x86/urgent] " tip-bot for Brian Gerst
2010-10-27 16:04                   ` [tip:x86/urgent] percpu: Remove the multi-page alignment facility tip-bot for Ingo Molnar
2010-10-26 14:06         ` [RFC PATCH] percpu: always align percpu output section to PAGE_SIZE Tejun Heo
2011-03-24  6:46           ` [Uclinux-dist-devel] " Mike Frysinger
2011-03-24  6:46             ` Mike Frysinger
2011-03-24  8:25             ` Tejun Heo
2011-03-24  8:25               ` Tejun Heo
2011-03-24  8:51               ` Tejun Heo
2011-03-24  8:51                 ` Tejun Heo
2011-03-24 13:46                 ` Mike Frysinger
2011-03-24 13:46                   ` Mike Frysinger
2011-03-24 17:51                   ` Tejun Heo
2011-03-24  8:54           ` [PATCH UPDATED] " Tejun Heo
2010-10-26 14:50   ` [PATCH] MN10300: Fix the PERCPU() alignment to allow for workqueues David Howells
2010-10-26 14:56     ` Tejun Heo
