All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] percpu section full of holes
@ 2015-06-07 16:55 Eric Dumazet
  2015-06-08  3:09 ` Tejun Heo
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Dumazet @ 2015-06-07 16:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: H. Peter Anvin, linux-kernel, Linus Torvalds

Hi Tejun

In commit bdf977b37418cdf8a2252504779a7e12a09b7575
("x86, percpu: Collect hot percpu variables into one cacheline")

You wrote that forcing ____cacheline_aligned on
current_task would put all hot variables together.

However this seems not generally true.


nm -v vmlinux
...
000000000000a140 d cpu_loops_per_jiffy
<56 bytes hole :( >
000000000000a180 d current_vcpu
...
000000000000a980 D debug_idt_ctr
000000000000a984 D debug_stack_usage
<hole>
000000000000a9a0 D orig_ist
000000000000a9d8 D fpu_owner_task
000000000000a9e0 D __preempt_count
000000000000a9e4 D irq_count
000000000000a9e8 D irq_stack_ptr
<hole>
000000000000aa00 D current_task
000000000000aa08 D kernel_stack
000000000000aa10 d debug_stack_addr

000000000000aa20 D cpu_hw_events

compiler/linker do not seem to care about the order in the source file.

$ grep DEFINE_PER_CPU arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
DEFINE_PER_CPU(unsigned long, kernel_stack) =
DEFINE_PER_CPU_FIRST(union irq_stack_union,
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
DEFINE_PER_CPU(char *, irq_stack_ptr) =
DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
DEFINE_PER_CPU(struct orig_ist, orig_ist);
static DEFINE_PER_CPU(unsigned long, debug_stack_addr);
DEFINE_PER_CPU(int, debug_stack_usage);
DEFINE_PER_CPU(u32, debug_idt_ctr);
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) =
DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);

I wish we had a way to remove the automatic alignment of ELF sections
based on size of objects/structures.
Why __alignof__ is not respected I dont know.

linker propagates the biggest alignment to various built-in.o builds,
and we have all these holes.

objdump -h kernel/built-in.o | grep data..percpu
 28 .data..percpu 00003ce0  0000000000000000  0000000000000000  0011c940  2**6
283 .data..percpu..shared_aligned 00001720  0000000000000000  0000000000000000  00146e80  2**6


We might add a DEFINE_PER_CPU_SMALL() for objects less than sizeof(long),
but this would mean a lot of changes in the tree.

And we would still have holes when per cpu objects are 40 bytes long,
because we would not use the 24 bytes after them.

Oh well.



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC] percpu section full of holes
  2015-06-07 16:55 [RFC] percpu section full of holes Eric Dumazet
@ 2015-06-08  3:09 ` Tejun Heo
  2015-06-08  3:15   ` Tejun Heo
  0 siblings, 1 reply; 3+ messages in thread
From: Tejun Heo @ 2015-06-08  3:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: H. Peter Anvin, linux-kernel, Linus Torvalds

Hello, Eric.

On Sun, Jun 07, 2015 at 09:55:07AM -0700, Eric Dumazet wrote:
> In commit bdf977b37418cdf8a2252504779a7e12a09b7575
> ("x86, percpu: Collect hot percpu variables into one cacheline")
> 
> You wrote that forcing ____cacheline_aligned on
> current_task would put all hot variables together.
>
> However this seems not generally true.

Heh, that's way back.  It's likely something caused by changes in the
toolchain.

...
> compiler/linker do not seem to care about the order in the source file.

Indeed.  It seems to be related to the top level reordering.
-fno-toplevel-reorder seems to behave weirdly on gcc 5.1.1 - the
option reverses the symbols but the order is different from the source
code order regardless of the option.

Source code order.

 kernel_stack
 current_task
 irq_stack_ptr
 irq_count
 __preempt_count
 fpu_owner_task
 orig_ist
 debug_stack_usage
 debug_idt_ctr

w/o -fno-toplevel-reorder

 193: 0000000000000008     8 OBJECT  GLOBAL DEFAULT   27 fpu_owner_task
 151: 0000000000000010     4 OBJECT  GLOBAL DEFAULT   27 debug_stack_usage
 153: 0000000000000014     4 OBJECT  GLOBAL DEFAULT   27 debug_idt_ctr
 172: 0000000000000018     4 OBJECT  GLOBAL DEFAULT   27 __preempt_count
 187: 0000000000000020     8 OBJECT  GLOBAL DEFAULT   27 kernel_stack
 195: 0000000000000028     4 OBJECT  GLOBAL DEFAULT   27 irq_count
 196: 0000000000000030     8 OBJECT  GLOBAL DEFAULT   27 irq_stack_ptr
 164: 0000000000000040    56 OBJECT  GLOBAL DEFAULT   27 orig_ist
 167: 0000000000000080     8 OBJECT  GLOBAL DEFAULT   27 current_task
			   
w/ -fno-toplevel-reorder (mirror image of w/o -fno-toplevel-reorder?????)

  97: 0000000000000000     8 OBJECT  GLOBAL DEFAULT    6 current_task
 106: 0000000000000020    56 OBJECT  GLOBAL DEFAULT    6 orig_ist
 108: 0000000000000058     8 OBJECT  GLOBAL DEFAULT    6 irq_stack_ptr
 110: 0000000000000060     4 OBJECT  GLOBAL DEFAULT    6 irq_count
 111: 0000000000000068     8 OBJECT  GLOBAL DEFAULT    6 kernel_stack
 113: 0000000000000070     4 OBJECT  GLOBAL DEFAULT    6 __preempt_count
 115: 0000000000000074     4 OBJECT  GLOBAL DEFAULT    6 debug_idt_ctr
 116: 0000000000000078     4 OBJECT  GLOBAL DEFAULT    6 debug_stack_usage
 117: 0000000000000080     8 OBJECT  GLOBAL DEFAULT    6 fpu_owner_task

and then the linker goes onto interpose symbols from different object
files.

...
> I wish we had a way to remove the automatic alignment of ELF sections
> based on size of objects/structures.
> Why __alignof__ is not respected I dont know.

It's still following the alignments specified tho.  It's just that all
the symbols get shuffled.

> linker propagates the biggest alignment to various built-in.o builds,
> and we have all these holes.
> 
> objdump -h kernel/built-in.o | grep data..percpu
>  28 .data..percpu 00003ce0  0000000000000000  0000000000000000  0011c940  2**6
> 283 .data..percpu..shared_aligned 00001720  0000000000000000  0000000000000000  00146e80  2**6
> 
> We might add a DEFINE_PER_CPU_SMALL() for objects less than sizeof(long),
> but this would mean a lot of changes in the tree.

ld has --sort-common option which sorts symbols by their alignments
which isn't ideal but could help packing smaller percpu variables.  I
can't find a way to prevent ld from interposing symbols and we seem to
have lost control over offset placement no matter what.

> And we would still have holes when per cpu objects are 40 bytes long,
> because we would not use the 24 bytes after them.

It's not like we didn't have holes before tho.  We just had holes
according to source code order and could control it when we explicitly
wanted to.  Overall, compiler / linker reshuffling them at least has
possibility of packing things better.

For things we want to pack together, I suppose our only option is
packing them into a struct and defining access macros into it.  It's
unfortunately clumsier but expresses the intended effect clearer to
both readers and toolchain.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC] percpu section full of holes
  2015-06-08  3:09 ` Tejun Heo
@ 2015-06-08  3:15   ` Tejun Heo
  0 siblings, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2015-06-08  3:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: H. Peter Anvin, linux-kernel, Linus Torvalds

On Mon, Jun 08, 2015 at 12:09:32PM +0900, Tejun Heo wrote:
> For things we want to pack together, I suppose our only option is
> packing them into a struct and defining access macros into it.  It's
> unfortunately clumsier but expresses the intended effect clearer to
> both readers and toolchain.

One annoying thing is that we prolly have quite a few places where
people put things which are likely to be used together close to each
other intentionally and by convention and we're now losing all that
contextual information.  I have no idea whatsoever whether the gain
from toolchain shuffling symbols around is larger or smaller than
whatever gain which can be gained from such colocations.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-06-08  3:15 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-07 16:55 [RFC] percpu section full of holes Eric Dumazet
2015-06-08  3:09 ` Tejun Heo
2015-06-08  3:15   ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.