xen-devel.lists.xenproject.org archive mirror
From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Elena Ufimtseva <elena.ufimtseva@oracle.com>, xen-devel@lists.xen.org
Cc: adnan.misherfi@oracle.com, jbeulich@suse.com
Subject: Re: page faults on machines with > 4TB memory
Date: Thu, 23 Jul 2015 18:01:45 +0100	[thread overview]
Message-ID: <55B11DF9.4030002@citrix.com> (raw)
In-Reply-To: <20150723163556.GA13434@elena.ufimtseva>

On 23/07/15 17:35, Elena Ufimtseva wrote:
> Hi
>
> While working on boot-time bugs on the large Oracle Server X4-8, we hit
> a problem booting Xen on machines with more than 4TB of memory, such as
> the Oracle X4-8.
> The page fault initially occurred while loading Xen PM info into the
> hypervisor (you can see it in the attached serial log named
> 4.4.2_no_mem_override).
> Tracing the issue down shows that the page fault occurs in the timer.c
> code while getting the heap size.
>
> Here is the original call trace:
> rocessor: Uploading Xen processor PM info 
> @ (XEN) ----[ Xen-4.4.3-preOVM  x86_64  debug=n  Tainted:    C ]---- 
> @ (XEN) CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor 
> @ (XEN) rax: ffff8a2d080513a20   rbx: ffff83808e802300   rcx: 00000000000000e8 
> @ (XEN) rdx: 00000000000000e8   rsi: 00000000000000e8   rdi: ffff83808e802300 
> @ (XEN) rbp: ffff82d080513a20   rsp: ffff82d0804d7c70   r8:  ffff8840ffdb5010 
> @ (XEN) r9:  0000000000000017   r10: ffff83808e802180   r11: 0200200200200200 
> @ (XEN) r12: ffff82d080533080   r13: 0000000000000296   r14: 0100100100100100 
> @ (XEN) r15: 00000000000000e8   cr0: 0000000080050033   cr4: 00000000001526f0 
> @ (XEN) cr3: 00000100818b2000   cr2: ffff8840ffdb5010 
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7c70: 
> @ (XEN)    ffff83808e802300 ffff82d080513a20 ffff82d08022f59b ffff82d080533080 
> @ (XEN)    ffff82d080532f50 00000000000000e8 ffff83808e802328 0000000000000000 
> @ (XEN)    ffff82d080513a20 ffff83808e8022c0 ffff82d080533200 00000000000000e8 
> @ (XEN)    00000000000000f0 ffff82d0805331c0 ffff82d0802458e2 0000000000000000 
> @ (XEN)    00000000000000e8 ffff83808e802334 ffff8384be7979b0 ffff82d0804d7d78 
> @ (XEN)    0000000000000000 ffff8384be77c700 ffff82d0804d7d78 ffff82d080513a20 
> @ (XEN)    ffff82d080246207 00000000000000e8 00000000000000e8 ffff8384be7979b0 
> @ (XEN)    ffff82d08024518a ffff82d080533080 0000000000000070 ffff82d080533da8 
> @ (XEN)    00000001000000e8 ffff8384be797a00 000000e800000001 002ab980002abd68 
> @ (XEN)    0000271000124f80 002abd6800124f80 00000000002ab980 ffff82d0803753e0 
> @ (XEN)    0000000000010101 0000000000000001 ffff82d0804d7e18 ffff881fb4afbc88 
> @ (XEN)    ffff82d0804d0000 ffff881fb28a4400 ffff82d0804fca80 ffffffff819b7080 
> @ (XEN)    ffff82d080266c16 ffff83808fb46ba8 ffff82d080208a82 ffff83006bddd190 
> @ (XEN)    0000000000000292 0300000100000036 00000001000000f6 000000000000000f 
> @ (XEN)    0000007f000c0082 0000000000000000 0000007f000c0082 0000000000000000 
> @ (XEN)    000000000000000a ffff881fb28a4400 0000000000000005 0000000000000000 
> @ (XEN)    0000000000000000 00000000000000fe 0000000000000001 0000000000000001 
> @ (XEN)    0000000000000000 0000000000000000 ffff82d08031f521 0000000000000000 
> @ (XEN)    0000000000000246 ffffffff810010ea 0000000000000000 ffffffff810010ea 
> @ (XEN)    000000000000e030 0000000000000246 ffff83006bddd000 ffff881fb4afbd48 
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN)    [<ffff82d08022f59b>] set_timer+0x10b/0x220 
> @ (XEN)    [<ffff82d0802458e2>] cpufreq_governor_dbs+0x1e2/0x2f0 
> @ (XEN)    [<ffff82d080246207>] __cpufreq_set_policy+0x87/0x120 
> @ (XEN)    [<ffff82d08024518a>] cpufreq_add_cpu+0x24a/0x4f0 
> @ (XEN)    [<ffff82d080266c16>] do_platform_op+0x9c6/0x1650 
> @ (XEN)    [<ffff82d080208a82>] evtchn_check_pollers+0x22/0xb0 
> @ (XEN)    [<ffff82d08031f521>] do_iret+0xc1/0x1a0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb5010: 
> @ (XEN)  L4[0x110] = 00000100818b3067 00000000000018b3 
> @ (XEN)  L3[0x103] = 0000000000000000 ffffffffffffffff 
> @ (XEN) 
> @ (XEN) ****************************************
>
>    0xffff82d08022e720 <add_entry>:      movzwl 0x28(%rdi),%edx
>    0xffff82d08022e724 <add_entry+4>:    push   %rbp
>    0xffff82d08022e725 <add_entry+5>:    lea    0x2e52f4(%rip),%rax        # 0xffff82d080513a20 <__per_cpu_offset>
>    0xffff82d08022e72c <add_entry+12>:   lea    0x30494d(%rip),%r10        # 0xffff82d080533080 <per_cpu__timers>
>    0xffff82d08022e733 <add_entry+19>:   push   %rbx
>    0xffff82d08022e734 <add_entry+20>:   add    (%rax,%rdx,8),%r10
>    0xffff82d08022e738 <add_entry+24>:   movl   $0x0,0x8(%rdi)
>    0xffff82d08022e73f <add_entry+31>:   movb   $0x3,0x2a(%rdi)
>    0xffff82d08022e743 <add_entry+35>:   mov    0x8(%r10),%r8
>    0xffff82d08022e747 <add_entry+39>:   movzwl (%r8),%ecx
>
> And this points to
>     int sz = GET_HEAP_SIZE(heap);
> in add_to_heap(), which is inlined into add_entry() in timer.c.
>
> static int add_entry(struct timer *t)                                           
> {                                                                               
> ffff82d08022cad3:   53                      push   %rbx                         
>     struct timers *timers = &per_cpu(timers, t->cpu);                           
> ffff82d08022cad4:   4c 03 14 d0             add    (%rax,%rdx,8),%r10           
>     int rc;                                                                     
>                                                                                 
>     ASSERT(t->status == TIMER_STATUS_invalid);                                  
>                                                                                 
>     /* Try to add to heap. t->heap_offset indicates whether we succeed. */      
>     t->heap_offset = 0;                                                         
> ffff82d08022cad8:   c7 47 08 00 00 00 00    movl   $0x0,0x8(%rdi)               
>     t->status = TIMER_STATUS_in_heap;                                           
> ffff82d08022cadf:   c6 47 2a 03             movb   $0x3,0x2a(%rdi)              
>     rc = add_to_heap(timers->heap, t);                                          
> ffff82d08022cae3:   4d 8b 42 08             mov    0x8(%r10),%r8                
>                                                                                 
>                                                                                 
> /* Add new entry @t to @heap. Return TRUE if new top of heap. */                
> static int add_to_heap(struct timer **heap, struct timer *t)                    
> {                                                                               
>     int sz = GET_HEAP_SIZE(heap);                                               
> ffff82d08022cae7:   41 0f b7 08             movzwl (%r8),%ecx                   
>                                                                                 
>     /* Fail if the heap is full. */                                             
>     if ( unlikely(sz == GET_HEAP_LIMIT(heap)) )    
>
> But checking the values of nr_cpumask_bits, nr_cpu_ids and NR_CPUS did
> not provide any clues as to why it fails here.
>
> After disabling the Xen cpufreq driver in Linux, the page fault did not
> appear, but creating a new guest caused another fatal page fault:
>
> CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor 
> @ (XEN) rax: 0000000000000000   rbx: 00000000ffdb53c0   rcx: 0000000000000004 
> @ (XEN) rdx: ffff82d080513a20   rsi: 00000000000000f0   rdi: ffff8840ffdb53c0 
> @ (XEN) rbp: 00000000000000e9   rsp: ffff82d0804d7d88   r8:  0000000000000000 
> @ (XEN) r9:  0000000000000000   r10: 0000000000000017   r11: 0000000000000000 
> @ (XEN) r12: ffff8381875ee3e0   r13: ffff82d0804d7e98   r14: 00000000000000e9 
> @ (XEN) r15: 00000000000000f0   cr0: 0000000080050033   cr4: 00000000001526f0 
> @ (XEN) cr3: 0000008174093000   cr2: ffff8840ffdb53c0 
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7d88: 
> @ (XEN)    00000000000000e7 ffff82d080206030 000000cf7d47d0a2 00000000000000e9 
> @ (XEN)    00000000000000f0 0000000000000002 ffff83808fb6ffd0 ffff82d080533db8 
> @ (XEN)    0000000000000000 ffff82d080532f50 ffff82d0804d0000 ffff82d080533db8 
> @ (XEN)    00007fa8c83e5004 ffff82d0804d7e08 ffff82d080533db8 ffff83818b4e5000 
> @ (XEN)    000000090000000f 00007fa8c8390001 00007fa800000002 00007fa8ae7f8eb8 
> @ (XEN)    0000000000000002 00007fa898004170 000000000159c320 00000034ccc6cffe 
> @ (XEN)    00007fa8c83e5000 0000000000000000 000000000159c320 fffffc73ffffffff 
> @ (XEN)    00000034ccf6e920 00000034ccf6e920 00000034ccf6e920 00000034ccc94298 
> @ (XEN)    00007fa898004170 00000034ccc94220 ffffffffffffffff ffffffffffffffff 
> @ (XEN)    ffffffffffffffff 000000ffffffffff 00000034ca0e08c7 0000000000000100 
> @ (XEN)    00000034ca0e08c7 0000000000000033 0000000000000246 ffff83006bddd000 
> @ (XEN)    ffff8808456f1e98 00007fa8ae7f8d90 ffff88084ad1d900 0000000000000001 
> @ (XEN)    00007fa8ae7f8d90 ffff82d0803243a9 00000000ffffffff 0000000001d0085c 
> @ (XEN)    00007fa8c84549c0 00007fa898004170 ffff8808456f1e98 00007fa8ae7f8d90 
> @ (XEN)    0000000000000282 00000000019c9998 0000000000000003 0000000001d00a49 
> @ (XEN)    0000000000000024 ffffffff8100148a 00007fa898004170 00007fa8ae7f8ed0 
> @ (XEN)    00007fa8c83e5004 0001010000000000 ffffffff8100148a 000000000000e033 
> @ (XEN)    0000000000000282 ffff8808456f1e40 000000000000e02b 0000000000000000 
> @ (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000 
> @ (XEN)    ffff83006bddd000 0000000000000000 0000000000000000 
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN)    [<ffff82d080206030>] do_domctl+0x12b0/0x13d0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb53c0: 
> @ (XEN)  L4[0x110] = 00000080818b3067 00000000000018b3
>
> Booting upstream Xen on the same server (same command line as in the
> other cases) causes another page fault (see the attached
> upstream_no_mem_override.log).
>
> We remembered that there is another open bug about starting with more
> than 4TB of memory.  The workaround for that was to override mem on the
> Xen command line.  We tried this, and with both upstream Xen and 4.4.3
> with the cpufreq Linux driver enabled, the problem disappears.  See the
> attached logs upstream_with_mem_override.log and
> 4.4.3_with_mem_override.log.
>
> Any information on what the issue might be, or any other pointers,
> would be very helpful.
> I will provide additional info if needed.
>
> Thank you
> Elena

This is an issue we have found in XenServer as well.

Observe that ffff8840ffdb53c0 is actually a pointer in the 64bit PV
virtual region, because the xenheap allocator has wandered off the top
of the directmap region.  This is a direct result of passing numa node
information to alloc_xenheap_page(), which overrides the check which
keeps the allocation inside the directmap region.
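
For reference, the arithmetic checks out: both faulting cr2 values sit just
past the 5TiB directmap.  A quick sketch (the DIRECTMAP_VIRT_START value of
0xffff830000000000 and the 5TiB size are assumptions from Xen's x86-64
memory layout):

```python
# Sketch: check that both faulting addresses (the cr2 values from the two
# crashes) lie beyond the top of the directmap.  Assumes Xen's x86-64
# layout: directmap base 0xffff830000000000, covering 5 TiB on Xen 4.5.
DIRECTMAP_VIRT_START = 0xffff830000000000
DIRECTMAP_VIRT_END = DIRECTMAP_VIRT_START + (5 << 40)   # 0xffff880000000000

for cr2 in (0xffff8840ffdb5010, 0xffff8840ffdb53c0):
    tib = (cr2 - DIRECTMAP_VIRT_START) / (1 << 40)
    print(f"{cr2:#x} is {tib:.2f} TiB above the directmap base")
    assert cr2 >= DIRECTMAP_VIRT_END   # past the mapped region -> page fault
```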

I have worked around this in XenServer with:


diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);
 
 /*

This causes Xen to ignore any RAM above the top of the directmap region,
which happens to be 5TiB on Xen 4.5.
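
The same clamp can also be applied without rebuilding Xen, via the boot
command line.  A hypothetical GRUB2 fragment (the paths, the dom0 kernel
arguments, and the 5120G value for the 5TiB directmap top are all
assumptions):

```shell
# Hypothetical grub.cfg fragment: clamp Xen to the RAM the directmap covers.
# "mem=" is parsed by size_param("mem", opt_mem) in xen/arch/x86/e820.c.
multiboot2 /boot/xen.gz mem=5120G
module2 /boot/vmlinuz root=/dev/sda1
```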

In some copious free time, I was going to look into segmenting the
directmap region by numa node, rather than having it linear from 0, so
xenheap pages can still be properly numa-located.
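
An illustrative sketch of that idea (not Xen code; the slot size and the
per-node machine-address bases below are made up):

```python
# Sketch of a node-segmented directmap: each NUMA node gets a fixed-size
# virtual slot, so its pages stay mappable even when the node's machine
# addresses sit above a single linear directmap limit.
DIRECTMAP_BASE = 0xffff830000000000   # Xen's x86-64 directmap start
SLOT_SIZE = 1 << 40                   # hypothetical: 1 TiB of virtual space per node

# Hypothetical per-node machine-address bases, made up for illustration.
node_maddr_base = {0: 0x0, 1: 0x2_0000000000}

def directmap_virt(node, maddr):
    """Translate a machine address into its node's directmap slot."""
    off = maddr - node_maddr_base[node]
    assert 0 <= off < SLOT_SIZE, "node RAM exceeds its slot"
    return DIRECTMAP_BASE + node * SLOT_SIZE + off

print(hex(directmap_virt(1, 0x2_0000001000)))
```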

~Andrew


Thread overview: 4+ messages
2015-07-23 16:35 page faults on machines with > 4TB memory Elena Ufimtseva
2015-07-23 17:01 ` Andrew Cooper [this message]
2015-07-23 17:45   ` Elena Ufimtseva
2015-07-24 10:01   ` Jan Beulich
