* [PATCH 0/6] sparc64: MM/IRQ patch queue.
@ 2014-09-25 19:40 David Miller
  2014-09-25 22:37 ` Bob Picco
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: David Miller @ 2014-09-25 19:40 UTC (permalink / raw)
  To: sparclinux


Bob, here is the queue of changes that are in my local tree and I
think are just about ready to push out.

They include all of the MM work we did to increase the max phys
bits and fix DEBUG_PAGEALLOC, as well as the sparseirq stuff.

The kernel is so much smaller now, about 7.4MB compared to what used
to be nearly 14MB.  We almost halved the size, and I bet there is some
more low hanging fruit out there.  So we are significantly within the
range of only needing 2 locked TLB entries to hold the kernel (we used
to need 4).

I'm eager to push this, but I also want it to get tested so I'll hold
off for about a day or so in order to give some time for that.

In particular, I'd be really interested in how the new code handles that
stress test wherein a guest was created with an insanely fragmented
memory map; I suspect we still need a bump of MAX_BANKS for that guy.
If you could figure out what kind of value that test needs and let
me know, I'd appreciate it.

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
@ 2014-09-25 22:37 ` Bob Picco
  2014-09-26  3:59 ` David Miller
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Bob Picco @ 2014-09-25 22:37 UTC (permalink / raw)
  To: sparclinux

Hi,
David Miller wrote:	[Thu Sep 25 2014, 03:40:47PM EDT]
> 
> Bob, here is the queue of changes that are in my local tree and I
> think are just about ready to push out.
> 
> They include all of the MM work we did to increase the max phys
> bits and fix DEBUG_PAGEALLOC, as well as the sparseirq stuff.
> 
> The kernel is so much smaller now, about 7.4MB compared to what used
> to be nearly 14MB.  We almost halved the size, and I bet there is some
> more low hanging fruit out there.  So we are significantly within the
> range of only needing 2 locked TLB entries to hold the kernel (we used
> to need 4).
You might want to tone these down:
[10014000000-100147fffff] PMD -> [ffff801fda800000-ffff801fdaffffff] on node
or terminate them altogether. Only a suggestion; I'll inspect further.
> 
> I'm eager to push this, but I also want it to get tested so I'll hold
> off for about a day or so in order to give some time for that.
DEBUG_PAGEALLOC wasn't healthy on T5-2. I'll scrutinize further in the
morning. It could be a legitimate issue. Ah, I've seen this in the kexec
restart-on-oops case:
[37729.365306] ixgbe 0001:03:00.1 eth1: NIC Link is Up 1 Gbps, Flow Control: RX/TX
[37729.380874] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[37733.378191] ixgbe 0001:03:00.1 eth1: Detected Tx Unit Hang
[37733.378191]   Tx Queue             <11>
[37733.378191]   TDH, TDT             <0>, <1>
[37733.378191]   next_to_use          <1>
[37733.378191]   next_to_clean        <0>
[37733.378191] tx_buffer_info[next_to_clean]
[37733.378191]   time_stamp           <ffffae52>
[37733.378191]   jiffies              <ffffaf76>
[37733.445218] ixgbe 0001:03:00.1 eth1: tx hang 1 detected on queue 11, resetting adapter
[37733.460961] ixgbe 0001:03:00.1 eth1: initiating reset due to tx timeout
[37733.474246] ixgbe 0001:03:00.1 eth1: Detected Tx Unit Hang
> 
> In particular, I'd be really interested in how the new code handles that
> stress test wherein a guest was created with an insanely fragmented
> memory map; I suspect we still need a bump of MAX_BANKS for that guy.
> If you could figure out what kind of value that test needs and let
> me know, I'd appreciate it.
> 
> Thanks!
You're welcome!

Later!


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
  2014-09-25 22:37 ` Bob Picco
@ 2014-09-26  3:59 ` David Miller
  2014-09-26 14:28 ` Bob Picco
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2014-09-26  3:59 UTC (permalink / raw)
  To: sparclinux

From: Bob Picco <bpicco@meloft.net>
Date: Thu, 25 Sep 2014 18:37:18 -0400

> You might want to tone these down:
> [10014000000-100147fffff] PMD -> [ffff801fda800000-ffff801fdaffffff] on node
> or terminate them altogether. Only a suggestion; I'll inspect further.

I used x86 as a model, but I guess with fragmented guest memory it can
be a bit overboard.

It's KERN_DEBUG too btw, which means that unless you ask for it you
won't see those messages.  I bet on the real console they don't
appear, but if you look at dmesg you'll certainly see them.

>> I'm eager to push this, but I also want it to get tested so I'll hold
>> off for about a day or so in order to give some time for that.
> DEBUG_PAGEALLOC wasn't healthy on T5-2. I'll scrutinize further in the
> morning. It could be a legitimate issue.

Strange, does it hang or OOPS?  Can you show me the OOPS messages if
you have them?


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
  2014-09-25 22:37 ` Bob Picco
  2014-09-26  3:59 ` David Miller
@ 2014-09-26 14:28 ` Bob Picco
  2014-09-26 20:04 ` Bob Picco
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Bob Picco @ 2014-09-26 14:28 UTC (permalink / raw)
  To: sparclinux

David Miller wrote:	[Thu Sep 25 2014, 11:59:48PM EDT]
> From: Bob Picco <bpicco@meloft.net>
> Date: Thu, 25 Sep 2014 18:37:18 -0400
> 
> > You might want to tone these down:
> > [10014000000-100147fffff] PMD -> [ffff801fda800000-ffff801fdaffffff] on node
> > or terminate them altogether. Only a suggestion; I'll inspect further.
> 
> I used x86 as a model, but I guess with fragmented guest memory it can
> be a bit overboard.
Well, I'm on a guest about as much as you are. I did ask a question about
the guest fragmented memory (MAX_BANKS) but haven't heard back. Though a
guest really has little knowledge of the MCU.
> 
> It's KERN_DEBUG too btw, which means that unless you ask for it you
> won't see those messages.  I bet on the real console they don't
> appear, but if you look at dmesg you'll certainly see them.
I agree. The real benefit is to people like you and me who don't want to
walk the page table state for a struct page. Possibly also for those that
don't have an equivalent to crash. This is made more of a challenge by our
changes.
> 
> >> I'm eager to push this, but I also want it to get tested so I'll hold
> >> off for about a day or so in order to give some time for that.
> > DEBUG_PAGEALLOC wasn't healthy on T5-2. I'll scrutinize further in the
> > morning. It could be a legitimate issue.
> 
> Strange, does it hang or OOPS?  Can you show me the OOPS messages if
> you have them?
Let me examine first. Yesterday I put my shorts on backwards, so I doubt
certain cranial activity.


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
                   ` (2 preceding siblings ...)
  2014-09-26 14:28 ` Bob Picco
@ 2014-09-26 20:04 ` Bob Picco
  2014-09-26 20:41 ` David Miller
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Bob Picco @ 2014-09-26 20:04 UTC (permalink / raw)
  To: sparclinux

Hi,
Bob Picco wrote:	[Fri Sep 26 2014, 10:28:16AM EDT]
> David Miller wrote:	[Thu Sep 25 2014, 11:59:48PM EDT]
> > From: Bob Picco <bpicco@meloft.net>
> > Date: Thu, 25 Sep 2014 18:37:18 -0400
> > 
> > > You might want to tone these down:
> > > [10014000000-100147fffff] PMD -> [ffff801fda800000-ffff801fdaffffff] on node
> > > or terminate them altogether. Only a suggestion; I'll inspect further.
> > 
> > I used x86 as a model, but I guess with fragmented guest memory it can
> > be a bit overboard.
> Well I'm on a guest about as much as you. I did ask a question about the
> guest fragmented memory (MAX_BANKS) but haven't heard back. Though a guest
> really has little knowledge of the MCU.
> > 
> > It's KERN_DEBUG too btw, which means that unless you ask for it you
> > won't see those messages.  I bet on the real console they don't
> > appear, but if you look at dmesg you'll certainly see them.
> I agree. The real benefit is to people like you and me who don't want to
> walk the page table state for a struct page. Possibly also for those that
> don't have an equivalent to crash. This is made more of a challenge by our
> changes.
> > 
> > >> I'm eager to push this, but I also want it to get tested so I'll hold
> > >> off for about a day or so in order to give some time for that.
> > > DEBUG_PAGEALLOC wasn't healthy on T5-2. I'll scrutinize further in the
> > > morning. It could be a legitimate issue.
> > 
> > Strange, does it hang or OOPS?  Can you show me the OOPS messages if
> > you have them?
> Let me examine first. Yesterday I put my shorts on backwards, so I doubt
> certain cranial activity.
We aren't doing so well, or I made a mistake. M7 is flaky, which has caused
me grief and probably requires, as Karl suggested, an AC power cycle.
This seems(?) to point at vmemmap. I will pursue it.

Plus, rumor has it CPU upgrades are today. This may require a FW upgrade too.
You know as much as I do.

I've barely had time to examine DEBUG_PAGEALLOC on T5-2.

This should provide you sufficient data to arrive at an initial vmalloc size
for percpu. I'll eventually examine my T5-8 information too. Hm, I just did,
quickly:
PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
Note I haven't considered the impact of this in quite a while.

First I will examine the preprocessor output of several files and examine
boundary conditions.

Thanks! My energy level is decaying exponentially.

boot: 3.17.0-rc4 
Allocated 64 Megs of memory at 0x40000000 for kernel
Uncompressing image...
Loading initial ramdisk (194296953 bytes at 0xC00004000000 phys, 0x40C00000 virt)
PROMLIB: Sun IEEE Boot Prom 'OBP 042914_5b6e3f9d8b48 2014/04/29 21:34'
PROMLIB: Root node compatible: sun4v
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Initializing cgroup subsys cpuacct
Linux version 3.17.0-rc4 (root@ca-sparc30.us.oracle.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #2 SMP Fri Sep 26 13:34:44 ADT 2014
bootconsole [earlyprom0] enabled
ARCH: SUN4V
Ethernet address: 00:10:e0:56:96:4a
PAGE_OFFSET is 0xfffe000000000000 (max_phys_bits = 49)
Kernel: Using 3 locked TLB entries for main kernel image.
Remapping the kernel... done.
OF stdout device is: /virtual-devices@100/console@1
PROM: Built device tree with 1254513 bytes of memory.
MDESC: Size is 713520 bytes.
PLATFORM: banner-name [SPARC T7-4]
PLATFORM: name [sun4v-platform]
PLATFORM: hostid [8656964a]
PLATFORM: serial# [0056964a]
PLATFORM: stick-frequency [3b9aca00]
PLATFORM: mac-address [10e056964a]
PLATFORM: watchdog-resolution [1000 ms]
PLATFORM: watchdog-max-timeout [31536000000 ms]
PLATFORM: max-cpus [1024]
Allocated 49152 bytes for kernel page tables.
Zone ranges:
  DMA      [mem 0x50400000-0xffffffffffffffff]
  Normal   [mem 0x00000000-0xc03bffd8bfff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x50400000-0x3fbf0fbfff]
  node   0: [mem 0x3fbf102000-0x3fbf103fff]
  node   1: [mem 0x400000000000-0x4003ffffffff]
  node   1: [mem 0x404000000000-0x4043ffffffff]
  node   1: [mem 0x408000000000-0x4083ffffffff]
  node   1: [mem 0x40c000000000-0x40c3ffffffff]
  node   1: [mem 0x414000000000-0x4143ffffffff]
  node   1: [mem 0x41c000000000-0x41c3ffffffff]
  node   1: [mem 0x420000000000-0x4203ffffffff]
  node   1: [mem 0x424000000000-0x4243ffffffff]
  node   1: [mem 0x428000000000-0x4283ffffffff]
  node   1: [mem 0x42c000000000-0x42c3ffffffff]
  node   1: [mem 0x434000000000-0x4343ffffffff]
  node   1: [mem 0x43c000000000-0x43c3ffffffff]
  node   2: [mem 0x800000000000-0x803bffffffff]
  node   3: [mem 0xc00000000000-0xc03bffd35fff]
  node   3: [mem 0xc03bffd68000-0xc03bffd8bfff]
Booting Linux...
CPU CAPS: [flush,stbar,swap,muldiv,v9,mul32,div32,v8plus]
CPU CAPS: [popc,vis,vis2,ASIBlkInit,fmaf,vis3,hpc,ima]
CPU CAPS: [pause,cbcond,aes,des,camellia,md5,sha1,sha256]
CPU CAPS: [sha512,mpmul,montmul,montsqr,crc32c]
PERCPU: max_distance=0xbffcc0410000 too large for vmalloc space 0xff00000000
PERCPU: auto allocator failed (-22), falling back to page size
PERCPU: 6 8K pages/cpu @0000000100000000 s23552 r8192 d17408
SUN4V: Mondo queue sizes [cpu(131072) dev(16384) r(8192) nr(256)]
Built 4 zonelists in Node order, mobility grouping on.  Total pages: 120389184
Policy zone: Normal
Kernel command line: root=/dev/mapper/VolGroup-lv_root ro rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=VolGroup/lv_swap SYSFONT=latarcyrheb-sun16 rd_LVM_LV=VolGroup/lv_root KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
log_buf_len individual max cpu contribution: 4096 bytes
log_buf_len total cpu_extra contributions: 4190208 bytes
log_buf_len min size: 1048576 bytes
log_buf_len: 8388608 bytes
early log buf free: 938440(89%)
PID hash table entries: 4096 (order: 2, 32768 bytes)
Sorting __ex_table...
BUG: Bad page state in process swapper  pfn:20a000000
page:0000018280000000 count:0 mapcount:-127 mapping:          (null) index:0x2
page flags: 0x1400000000000000()
page dumped because: nonzero mapcount
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc4 #2
Call Trace:
 [000000000053fb40] bad_page+0xa0/0x100
 [000000000053fda0] free_pages_prepare+0x100/0x160
 [00000000005406b0] __free_pages_ok+0x10/0xe0
 [0000000000541f4c] __free_pages+0x2c/0x60
 [0000000000c944ac] __free_pages_bootmem+0xa4/0xb4
 [0000000000c980b4] free_all_bootmem+0xcc/0x12c
 [0000000000c87390] mem_init+0x6c/0xf0
 [0000000000c7e9c4] start_kernel+0x180/0x3dc
 [0000000000976f3c] tlb_fixup_done+0x98/0xbc
 [0000000000000000]           (null)


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
                   ` (3 preceding siblings ...)
  2014-09-26 20:04 ` Bob Picco
@ 2014-09-26 20:41 ` David Miller
  2014-09-26 23:18 ` David Miller
  2014-09-26 23:57 ` David Miller
  6 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2014-09-26 20:41 UTC (permalink / raw)
  To: sparclinux

From: Bob Picco <bpicco@meloft.net>
Date: Fri, 26 Sep 2014 16:04:18 -0400

> This should provide you sufficient data to arrive at an initial vmalloc size
> for percpu. I'll eventually examine my T5-8 information too. Hm, just did
> quickly:
> PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
> . Note I haven't considered impact of this in quite a bit.

Thanks, several things certainly are not happy.

I'll take a look, thanks Bob.


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
                   ` (4 preceding siblings ...)
  2014-09-26 20:41 ` David Miller
@ 2014-09-26 23:18 ` David Miller
  2014-09-26 23:57 ` David Miller
  6 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2014-09-26 23:18 UTC (permalink / raw)
  To: sparclinux

From: David Miller <davem@davemloft.net>
Date: Fri, 26 Sep 2014 16:41:37 -0400 (EDT)

> From: Bob Picco <bpicco@meloft.net>
> Date: Fri, 26 Sep 2014 16:04:18 -0400
> 
>> This should provide you sufficient data to arrive at an initial vmalloc size
>> for percpu. I'll eventually examine my T5-8 information too. Hm, just did
>> quickly:
>> PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
>> . Note I haven't considered impact of this in quite a bit.
> 
> Thanks, several things certainly are not happy.
> 
> I'll take a look, thanks Bob.

Bob can I see the 'memory' OF node on the machine where this happens?

Thanks.


* Re: [PATCH 0/6] sparc64: MM/IRQ patch queue.
  2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
                   ` (5 preceding siblings ...)
  2014-09-26 23:18 ` David Miller
@ 2014-09-26 23:57 ` David Miller
  6 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2014-09-26 23:57 UTC (permalink / raw)
  To: sparclinux

From: David Miller <davem@davemloft.net>
Date: Fri, 26 Sep 2014 19:18:03 -0400 (EDT)

> From: David Miller <davem@davemloft.net>
> Date: Fri, 26 Sep 2014 16:41:37 -0400 (EDT)
> 
>> From: Bob Picco <bpicco@meloft.net>
>> Date: Fri, 26 Sep 2014 16:04:18 -0400
>> 
>>> This should provide you sufficient data to arrive at an initial vmalloc size
>>> for percpu. I'll eventually examine my T5-8 information too. Hm, just did
>>> quickly:
>>> PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
>>> . Note I haven't considered impact of this in quite a bit.
>> 
>> Thanks, several things certainly are not happy.
>> 
>> I'll take a look, thanks Bob.
> 
> Bob can I see the 'memory' OF node on the machine where this happens?

Nevermind, I have it in my 'prtconfs' GIT repo.

The embedded percpu allocator always failed on this machine; it should
always have fallen back to the page-based percpu allocator.

You can verify this with past boot logs.

It shouldn't crash later though; that's troubling :-)

As to why the embedded allocator can't cope with this config: it's because
of how far apart the various NUMA memory nodes are, physical-address-wise.

What happens is that the percpu allocator allocates per-cpu memory on each
NUMA node, enough for the cpus on that node.  Then it walks all of these
pointers and computes the largest distance, in bytes, between any two of
them: basically the lowest pointer value subtracted from the highest
pointer value.

This is the "max distance" thing, and it cannot be larger than 3/4 of the
size of the vmalloc area.

The distance between numa node physical areas seems to be something like
0x80000000000 on the T5 machine.

Anyway, we don't have enough virtual address space with 3-level
kernel page tables to accommodate what the embedded percpu allocator
wants.

Sound like a familiar problem? :-/

So we might have to backtrack and really move to 4-level page tables.



end of thread (newest: 2014-09-26 23:57 UTC)

Thread overview: 8+ messages
2014-09-25 19:40 [PATCH 0/6] sparc64: MM/IRQ patch queue David Miller
2014-09-25 22:37 ` Bob Picco
2014-09-26  3:59 ` David Miller
2014-09-26 14:28 ` Bob Picco
2014-09-26 20:04 ` Bob Picco
2014-09-26 20:41 ` David Miller
2014-09-26 23:18 ` David Miller
2014-09-26 23:57 ` David Miller
