* Xen BUG at page_alloc.c:1738 (Xen 4.5)
@ 2015-05-19 18:06 Major Hayden
  2015-05-19 18:16 ` Andrew Cooper
  2015-05-20 10:41 ` Jan Beulich
  0 siblings, 2 replies; 17+ messages in thread
From: Major Hayden @ 2015-05-19 18:06 UTC (permalink / raw)
  To: xen-devel

Hello there,

I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) and I have an error that prevents the server from booting in the very early boot process:

> (XEN) Xen call trace:
> (XEN)    [<ffff82d08011d160>] free_domheap_pages+0x240/0x430
> (XEN)    [<ffff82d08018c944>] mmio_ro_do_page_fault+0x114/0x160
> (XEN)    [<ffff82d0801a4c10>] do_page_fault+0x1a0/0x4f0
> (XEN)    [<ffff82d080239768>] handle_exception_saved+0x2e/0x6c
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Xen BUG at page_alloc.c:1738
> (XEN) ****************************************

The full output is over in a Github Gist[1].

I've tested this on some physical machines (Dell, HP, and SuperMicro servers) as well as within a KVM virtual machine but I get the same boot error each time.  It occurs with Xen 4.5 and Linux 3.17-4.0.x.  Xen 4.5.1-rc1 fails in the same way.  I've opened a Red Hat Bug[2] as well as a Xen bug[3] on it.

The code within free_domheap_pages() hasn't changed much since late 2014 so I'm not sure if that's the culprit.  Does anyone have any suggestions on how to debug it further?

Thanks in advance!

[1] https://gist.github.com/major/baa0e2eee7de51a2bcd1
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1219197
[3] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1908

--
Major Hayden

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-19 18:06 Xen BUG at page_alloc.c:1738 (Xen 4.5) Major Hayden
@ 2015-05-19 18:16 ` Andrew Cooper
  2015-05-20  2:11   ` Major Hayden
  2015-05-20 10:41 ` Jan Beulich
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2015-05-19 18:16 UTC (permalink / raw)
  To: Major Hayden, xen-devel

On 19/05/15 19:06, Major Hayden wrote:
> Hello there,
>
> I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) and I have an error that prevents the server from booting in the very early boot process:
>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d08011d160>] free_domheap_pages+0x240/0x430
>> (XEN)    [<ffff82d08018c944>] mmio_ro_do_page_fault+0x114/0x160
>> (XEN)    [<ffff82d0801a4c10>] do_page_fault+0x1a0/0x4f0
>> (XEN)    [<ffff82d080239768>] handle_exception_saved+0x2e/0x6c
>> (XEN) 
>> (XEN) 
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at page_alloc.c:1738
>> (XEN) ****************************************
> The full output is over in a Github Gist[1].
>
> I've tested this on some physical machines (Dell, HP, and SuperMicro servers) as well as within a KVM virtual machine but I get the same boot error each time.  It occurs with Xen 4.5 and Linux 3.17-4.0.x.  Xen 4.5.1-rc1 fails in the same way.  I've opened a Red Hat Bug[2] as well as a Xen bug[3] on it.
>
> The code within free_domheap_pages() hasn't changed much since late 2014 so I'm not sure if that's the culprit.  Does anyone have any suggestions on how to debug it further?
>
> Thanks in advance!
>
> [1] https://gist.github.com/major/baa0e2eee7de51a2bcd1
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1219197
> [3] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1908

Can you try a debug hypervisor and rerun, to confirm the stack trace and
see whether any assertions fire?

Can you identify exactly which line xen/common/page_alloc.c:1738 is in
your source?

~Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-19 18:16 ` Andrew Cooper
@ 2015-05-20  2:11   ` Major Hayden
  0 siblings, 0 replies; 17+ messages in thread
From: Major Hayden @ 2015-05-20  2:11 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

I compiled Xen with debugging enabled and it appears to pass the
initial boot but then fails later in the boot process.  I'm working
through that now.  Here's what my source of page_alloc.c looks like
around line 1738:

1731         if ( likely(d) && likely(d != dom_cow) )
1732         {
1733             /* NB. May recursively lock from relinquish_memory(). */
1734             spin_lock_recursive(&d->page_alloc_lock);
1735
1736             for ( i = 0; i < (1 << order); i++ )
1737             {
1738                 BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);
1739                 page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
1740             }
1741
1742             drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
1743
1744             spin_unlock_recursive(&d->page_alloc_lock);
1745
1746             /*
1747              * Normally we expect a domain to clear pages before freeing them,
1748              * if it cares about the secrecy of their contents. However, after
1749              * a domain has died we assume responsibility for erasure.
1750              */
1751             scrub = !!d->is_dying;
1752         }

On Tue, May 19, 2015 at 1:16 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> Can you try a debug hypervisor and rerun, to confirm the stack trace and
> see whether any assertions fire?
>
> Can you identify exactly which line xen/common/page_alloc.c:1738 is in
> your source?

-- 
Major Hayden

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-19 18:06 Xen BUG at page_alloc.c:1738 (Xen 4.5) Major Hayden
  2015-05-19 18:16 ` Andrew Cooper
@ 2015-05-20 10:41 ` Jan Beulich
  2015-05-20 16:52   ` Major Hayden
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2015-05-20 10:41 UTC (permalink / raw)
  To: Major Hayden; +Cc: xen-devel

>>> On 19.05.15 at 20:06, <major@mhtx.net> wrote:
> I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) 
> and I have an error that prevents the server from booting in the very early 
> boot process:
> 
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d08011d160>] free_domheap_pages+0x240/0x430
>> (XEN)    [<ffff82d08018c944>] mmio_ro_do_page_fault+0x114/0x160
>> (XEN)    [<ffff82d0801a4c10>] do_page_fault+0x1a0/0x4f0
>> (XEN)    [<ffff82d080239768>] handle_exception_saved+0x2e/0x6c
>> (XEN) 
>> (XEN) 
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at page_alloc.c:1738
>> (XEN) ****************************************
> 
> The full output is over in a Github Gist[1].
> 
> I've tested this on some physical machines (Dell, HP, and SuperMicro 
> servers) as well as within a KVM virtual machine but I get the same boot 
> error each time.

Considering that no-one else is seeing this - is this perhaps connected
to you building Xen with pre-release gcc 5.0.1? This is also because in
order for the above to indeed occur, mmio_ro_do_page_fault()'s
put_page() would need to drop the last reference of a page, yet
page_get_owner_and_reference() doesn't obtain a reference when
a page is unallocated (and hence unowned), i.e. normally a page
would have a refcount of at least 2 here. Hence this would be
possible only due to a race, but the exact same race to be observed
on different hardware _and_ under an emulator is extremely unlikely.
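
To make the refcount argument concrete, here is a minimal single-threaded
sketch. It is not the Xen implementation: COUNT_MASK, put_ref() and the
struct layouts are assumptions made for illustration, and the real code
updates the count atomically. It shows only that a reference is refused on
an unallocated page, so a successful get followed by a put on the same page
leaves at least one reference behind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; the mask width is assumed for illustration only. */
#define COUNT_MASK 0xffUL

struct domain { int id; };

struct page_info {
    unsigned long count_info;   /* general reference count */
    struct domain *owner;       /* NULL while the page is unallocated */
};

/* Take a reference only if the page is allocated (count != 0) and the
 * count would not wrap; otherwise report "no owner". */
static struct domain *get_owner_and_reference(struct page_info *pg)
{
    if (((pg->count_info + 1) & COUNT_MASK) <= 1)
        return NULL;
    pg->count_info++;
    return pg->owner;
}

/* Drop a reference; returns 1 if this was the last one, i.e. the point
 * at which the page would be handed back to the allocator. */
static int put_ref(struct page_info *pg)
{
    return (--pg->count_info & COUNT_MASK) == 0;
}
```

In this model an allocated page starts at a count of 1, rises to 2 after
the get, and returns to 1 after the put, so put_ref() never reports "last
reference" on this path unless something else dropped a reference in
between.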

Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-20 10:41 ` Jan Beulich
@ 2015-05-20 16:52   ` Major Hayden
  2015-05-20 19:51     ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: Major Hayden @ 2015-05-20 16:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 05/20/2015 05:41 AM, Jan Beulich wrote:
> Considering that no-one else is seeing this - is this perhaps connected
> to you building Xen with pre-release gcc 5.0.1? This is also because in
> order for the above to indeed occur, mmio_ro_do_page_fault()'s
> put_page() would need to drop the last reference of a page, yet
> page_get_owner_and_reference() doesn't obtain a reference when
> a page is unallocated (and hence unowned), i.e. normally a page
> would have a refcount of at least 2 here. Hence this would be
> possible only due to a race, but the exact same race to be observed
> on different hardware _and_ under an emulator is extremely unlikely.

That could be a possibility.  There is one Fedora patch[1] to fix a GCC 5 compile error but that's probably unrelated to the crash.

I'm still hunting around to see what I can figure out.

[1] http://pkgs.fedoraproject.org/cgit/xen.git/tree/?h=f22

--
Major Hayden

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-20 16:52   ` Major Hayden
@ 2015-05-20 19:51     ` M A Young
  0 siblings, 0 replies; 17+ messages in thread
From: M A Young @ 2015-05-20 19:51 UTC (permalink / raw)
  To: Major Hayden; +Cc: Jan Beulich, xen-devel

On Wed, 20 May 2015, Major Hayden wrote:

> On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > Considering that no-one else is seeing this - is this perhaps connected
> > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > put_page() would need to drop the last reference of a page, yet
> > page_get_owner_and_reference() doesn't obtain a reference when
> > a page is unallocated (and hence unowned), i.e. normally a page
> > would have a refcount of at least 2 here. Hence this would be
> > possible only due to a race, but the exact same race to be observed
> > on different hardware _and_ under an emulator is extremely unlikely.

You could try with the xen.gz file from 
https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
It is roughly the same version of xen but built against Fedora 21 and gcc 
4.9.2. If that works then it probably is gcc 5.

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-06-01  7:47                 ` M A Young
@ 2015-06-06 21:00                   ` M A Young
  0 siblings, 0 replies; 17+ messages in thread
From: M A Young @ 2015-06-06 21:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Jason Fritcher, xen-devel

On Mon, 1 Jun 2015, M A Young wrote:

> On Mon, 1 Jun 2015, Jan Beulich wrote:
> 
> > >>> On 31.05.15 at 00:43, <andrew.cooper3@citrix.com> wrote:
> > > On 30/05/2015 23:07, M A Young wrote:
> > >> On Fri, 29 May 2015, Andrew Cooper wrote:
> > >>> FC22 is miscompiling the C to:
> > >>>
> > >>> struct page_info *page = mfn_to_page(mfn);
> > >>> struct domain *owner = page_get_owner_and_reference(page);
> > >>> if ( owner )
> > >>>     put_page(mfn_to_page(0));
> > >>>
> > >>> which is wrong, and why free_domheap_pages() does legitimately complain
> > >>> about the wonky refcount.
> > >> With a bit of experimentation I have found that compiling with the 
> > >> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> > >> version, thus avoiding the bug.
> > > 
> > > After sending this email, I wondered whether the optimiser was assuming
> > > that %rdi was preserved.  Indeed, it turns out that the generated code
> > > for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> > > reuse after return.
> > > 
> > > If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> > > still contains the correct result of the original calculation.
> > 
> > And %r8 is known to be preserved too?
> > 
> > > Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> > > code.
> > 
> > I suppose together with us allowing it to do such for global functions
> > by marking everything hidden (i.e. something possibly not seeing much
> > testing).
> > 
> > Questions now are:
> > 1) Was a bug against gcc opened already?
> > 2) What do we do about it? Working around the issue by setting
> > -fno-caller-saves seems awkward, as we'd likely have nothing but
> > the gcc version to tie this to. And considering distros carry their
> > own patch sets, the version alone may not even be enough. (I
> > didn't see any reports against our tip facing a similar issue despite
> > it being built with gcc 5 now too afaik.)
> 
> There is a Fedora bug on this
> https://bugzilla.redhat.com/show_bug.cgi?id=1219197
> which I updated and reassigned to gcc yesterday.

The Fedora gcc maintainer has now filed an upstream bug which is 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=66444

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-06-01  7:40               ` Jan Beulich
@ 2015-06-01  7:47                 ` M A Young
  2015-06-06 21:00                   ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: M A Young @ 2015-06-01  7:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Jason Fritcher, xen-devel

On Mon, 1 Jun 2015, Jan Beulich wrote:

> >>> On 31.05.15 at 00:43, <andrew.cooper3@citrix.com> wrote:
> > On 30/05/2015 23:07, M A Young wrote:
> >> On Fri, 29 May 2015, Andrew Cooper wrote:
> >>> FC22 is miscompiling the C to:
> >>>
> >>> struct page_info *page = mfn_to_page(mfn);
> >>> struct domain *owner = page_get_owner_and_reference(page);
> >>> if ( owner )
> >>>     put_page(mfn_to_page(0));
> >>>
> >>> which is wrong, and why free_domheap_pages() does legitimately complain
> >>> about the wonky refcount.
> >> With a bit of experimentation I have found that compiling with the 
> >> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> >> version, thus avoiding the bug.
> > 
> > After sending this email, I wondered whether the optimiser was assuming
> > that %rdi was preserved.  Indeed, it turns out that the generated code
> > for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> > reuse after return.
> > 
> > If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> > still contains the correct result of the original calculation.
> 
> And %r8 is known to be preserved too?
> 
> > Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> > code.
> 
> I suppose together with us allowing it to do such for global functions
> by marking everything hidden (i.e. something possibly not seeing much
> testing).
> 
> Questions now are:
> 1) Was a bug against gcc opened already?
> 2) What do we do about it? Working around the issue by setting
> -fno-caller-saves seems awkward, as we'd likely have nothing but
> the gcc version to tie this to. And considering distros carry their
> own patch sets, the version alone may not even be enough. (I
> didn't see any reports against our tip facing a similar issue despite
> it being built with gcc 5 now too afaik.)

There is a Fedora bug on this
https://bugzilla.redhat.com/show_bug.cgi?id=1219197
which I updated and reassigned to gcc yesterday.

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-30 22:43             ` Andrew Cooper
@ 2015-06-01  7:40               ` Jan Beulich
  2015-06-01  7:47                 ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2015-06-01  7:40 UTC (permalink / raw)
  To: Andrew Cooper, M A Young; +Cc: Jason Fritcher, xen-devel

>>> On 31.05.15 at 00:43, <andrew.cooper3@citrix.com> wrote:
> On 30/05/2015 23:07, M A Young wrote:
>> On Fri, 29 May 2015, Andrew Cooper wrote:
>>> FC22 is miscompiling the C to:
>>>
>>> struct page_info *page = mfn_to_page(mfn);
>>> struct domain *owner = page_get_owner_and_reference(page);
>>> if ( owner )
>>>     put_page(mfn_to_page(0));
>>>
>>> which is wrong, and why free_domheap_pages() does legitimately complain
>>> about the wonky refcount.
>> With a bit of experimentation I have found that compiling with the 
>> -fno-caller-saves flag gets this code segment back to the Fedora 21 
>> version, thus avoiding the bug.
> 
> After sending this email, I wondered whether the optimiser was assuming
> that %rdi was preserved.  Indeed, it turns out that the generated code
> for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> reuse after return.
> 
> If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> still contains the correct result of the original calculation.

And %r8 is known to be preserved too?

> Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> code.

I suppose together with us allowing it to do such for global functions
by marking everything hidden (i.e. something possibly not seeing much
testing).

Questions now are:
1) Was a bug against gcc opened already?
2) What do we do about it? Working around the issue by setting
-fno-caller-saves seems awkward, as we'd likely have nothing but
the gcc version to tie this to. And considering distros carry their
own patch sets, the version alone may not even be enough. (I
didn't see any reports against our tip facing a similar issue despite
it being built with gcc 5 now too afaik.)

Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-30 22:07           ` M A Young
@ 2015-05-30 22:43             ` Andrew Cooper
  2015-06-01  7:40               ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2015-05-30 22:43 UTC (permalink / raw)
  To: M A Young; +Cc: xen-devel, Jan Beulich, Jason Fritcher

On 30/05/2015 23:07, M A Young wrote:
> On Fri, 29 May 2015, Andrew Cooper wrote:
>
>> On 29/05/15 12:17, M A Young wrote:
>>>>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
>>>>> boot for me, but if I replace xen.gz with one from the same code built on 
>>>>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
>>>>> available via 
>>>>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
>>>>> if anyone else wants to do some testing.
>>>>>
>>>>> 	Michael Young
>>>> Do you have easy access to xen-syms from each build?
>>> Yes.
>>>
>> Thank you very much.
>>
>> GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:
>>
>> The C snippet from mmio_ro_do_page_fault():
>>
>> struct page_info *page = mfn_to_page(mfn);
>> struct domain *owner = page_get_owner_and_reference(page);
>> if ( owner )
>>     put_page(page);
>>
>> In fc21 is:
>>
>> movabs $0xffff82e000000000,%rbp
>> shr    %cl,%rax
>> or     %rdx,%rax
>> shl    $0x5,%rax
>> add    %rax,%rbp
>> mov    %rbp,%rdi
>> callq  ffff82d080186900 <page_get_owner_and_reference>
>> test   %rax,%rax
>> mov    %rax,%r12
>> je     ffff82d080189c4e <mmio_ro_do_page_fault+0x11e>
>> mov    %rbp,%rdi
>> callq  ffff82d080188ec0 <put_page>
>>
>> and in fc22 is:
>>
>> movabs $0xffff82e000000000,%r8
>> shr    %cl,%rax
>> or     %rdx,%rax
>> shl    $0x5,%rax
>> lea    (%r8,%rax,1),%rdi
>> callq  ffff82d0801874f0 <page_get_owner_and_reference>
>> test   %rax,%rax
>> mov    %rax,%rbp
>> je     ffff82d08018ca14 <mmio_ro_do_page_fault+0x114>
>> mov    %r8,%rdi
>> callq  ffff82d080189a90 <put_page>
>>
>> "lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
>> mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
>> snippet.
>>
>> In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
>> for call to put_page().
>>
>> However, in FC22, the result of the calculation is only held in %rdi,
>> and clobbered by the call to page_get_owner_and_reference().  When it
>> comes to call put_page(), %r8 is reloaded, which is still a pointer to
>> the base of the frametable, not the page we actually took a reference on.
>>
>> FC22 is miscompiling the C to:
>>
>> struct page_info *page = mfn_to_page(mfn);
>> struct domain *owner = page_get_owner_and_reference(page);
>> if ( owner )
>>     put_page(mfn_to_page(0));
>>
>> which is wrong, and why free_domheap_pages() does legitimately complain
>> about the wonky refcount.
> With a bit of experimentation I have found that compiling with the 
> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> version, thus avoiding the bug.

After sending this email, I wondered whether the optimiser was assuming
that %rdi was preserved.  Indeed, it turns out that the generated code
for page_get_owner_and_reference leaves %rdi unmodified, and safe for
reuse after return.

If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
still contains the correct result of the original calculation.

Therefore, I suspect that the bug is in the -fcaller-saves optimisation
code.

~Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29 18:12         ` Andrew Cooper
@ 2015-05-30 22:07           ` M A Young
  2015-05-30 22:43             ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: M A Young @ 2015-05-30 22:07 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jan Beulich, Jason Fritcher

On Fri, 29 May 2015, Andrew Cooper wrote:

> On 29/05/15 12:17, M A Young wrote:
> >
> >>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> >>> boot for me, but if I replace xen.gz with one from the same code built on 
> >>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> >>> available via 
> >>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> >>> if anyone else wants to do some testing.
> >>>
> >>> 	Michael Young
> >> Do you have easy access to xen-syms from each build?
> > Yes.
> >
> 
> Thank you very much.
> 
> GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:
> 
> The C snippet from mmio_ro_do_page_fault():
> 
> struct page_info *page = mfn_to_page(mfn);
> struct domain *owner = page_get_owner_and_reference(page);
> if ( owner )
>     put_page(page);
> 
> In fc21 is:
> 
> movabs $0xffff82e000000000,%rbp
> shr    %cl,%rax
> or     %rdx,%rax
> shl    $0x5,%rax
> add    %rax,%rbp
> mov    %rbp,%rdi
> callq  ffff82d080186900 <page_get_owner_and_reference>
> test   %rax,%rax
> mov    %rax,%r12
> je     ffff82d080189c4e <mmio_ro_do_page_fault+0x11e>
> mov    %rbp,%rdi
> callq  ffff82d080188ec0 <put_page>
> 
> and in fc22 is:
> 
> movabs $0xffff82e000000000,%r8
> shr    %cl,%rax
> or     %rdx,%rax
> shl    $0x5,%rax
> lea    (%r8,%rax,1),%rdi
> callq  ffff82d0801874f0 <page_get_owner_and_reference>
> test   %rax,%rax
> mov    %rax,%rbp
> je     ffff82d08018ca14 <mmio_ro_do_page_fault+0x114>
> mov    %r8,%rdi
> callq  ffff82d080189a90 <put_page>
> 
> "lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
> mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
> snippet.
> 
> In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
> for call to put_page().
> 
> However, in FC22, the result of the calculation is only held in %rdi,
> and clobbered by the call to page_get_owner_and_reference().  When it
> comes to call put_page(), %r8 is reloaded, which is still a pointer to
> the base of the frametable, not the page we actually took a reference on.
> 
> FC22 is miscompiling the C to:
> 
> struct page_info *page = mfn_to_page(mfn);
> struct domain *owner = page_get_owner_and_reference(page);
> if ( owner )
>     put_page(mfn_to_page(0));
> 
> which is wrong, and why free_domheap_pages() does legitimately complain
> about the wonky refcount.

With a bit of experimentation I have found that compiling with the 
-fno-caller-saves flag gets this code segment back to the Fedora 21 
version, thus avoiding the bug.

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29 11:17       ` M A Young
@ 2015-05-29 18:12         ` Andrew Cooper
  2015-05-30 22:07           ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2015-05-29 18:12 UTC (permalink / raw)
  To: M A Young; +Cc: xen-devel, Jan Beulich, Jason Fritcher

On 29/05/15 12:17, M A Young wrote:
>
>>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
>>> boot for me, but if I replace xen.gz with one from the same code built on 
>>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
>>> available via 
>>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
>>> if anyone else wants to do some testing.
>>>
>>> 	Michael Young
>> Do you have easy access to xen-syms from each build?
> Yes.
>

Thank you very much.

GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:

The C snippet from mmio_ro_do_page_fault():

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
    put_page(page);

In fc21 is:

movabs $0xffff82e000000000,%rbp
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
add    %rax,%rbp
mov    %rbp,%rdi
callq  ffff82d080186900 <page_get_owner_and_reference>
test   %rax,%rax
mov    %rax,%r12
je     ffff82d080189c4e <mmio_ro_do_page_fault+0x11e>
mov    %rbp,%rdi
callq  ffff82d080188ec0 <put_page>

and in fc22 is:

movabs $0xffff82e000000000,%r8
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
lea    (%r8,%rax,1),%rdi
callq  ffff82d0801874f0 <page_get_owner_and_reference>
test   %rax,%rax
mov    %rax,%rbp
je     ffff82d08018ca14 <mmio_ro_do_page_fault+0x114>
mov    %r8,%rdi
callq  ffff82d080189a90 <put_page>

"lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
snippet.

In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
for call to put_page().

However, in FC22, the result of the calculation is only held in %rdi,
and clobbered by the call to page_get_owner_and_reference().  When it
comes to call put_page(), %r8 is reloaded, which is still a pointer to
the base of the frametable, not the page we actually took a reference on.

FC22 is miscompiling the C to:

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
    put_page(mfn_to_page(0));

which is wrong, and why free_domheap_pages() does legitimately complain
about the wonky refcount.
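
As a rough standalone illustration of that effect (a sketch, not Xen code:
the flat frame_table[] array, NR_PAGES, get_ref()/put_ref() and the direct
index in mfn_to_page() are simplifying assumptions; only the 32-byte
struct size is taken from the "shl $0x5" above), dropping the reference on
the base of the frame table instead of on 'page' leaks one reference and
removes one from an unrelated page:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page_info; 32 bytes, matching the
 * shift by 5 in the disassembly. */
struct page_info {
    uint64_t count_info;
    uint64_t pad[3];
};

#define NR_PAGES 8
static struct page_info frame_table[NR_PAGES];

/* Simplified mfn_to_page(): frame table base plus index. */
static struct page_info *mfn_to_page(unsigned long mfn)
{
    return &frame_table[mfn];
}

/* Give every page a single reference, as if each were allocated. */
static void reset_table(void)
{
    for (unsigned long i = 0; i < NR_PAGES; i++)
        frame_table[i].count_info = 1;
}

static void get_ref(struct page_info *pg) { pg->count_info++; }
static void put_ref(struct page_info *pg) { pg->count_info--; }

/* What the source asks for: get and put on the same page. */
static void intended_sequence(unsigned long mfn)
{
    struct page_info *page = mfn_to_page(mfn);
    get_ref(page);
    put_ref(page);
}

/* What the fc22 binary effectively does: the put lands on the frame
 * table base, i.e. mfn_to_page(0). */
static void miscompiled_sequence(unsigned long mfn)
{
    struct page_info *page = mfn_to_page(mfn);
    get_ref(page);
    put_ref(mfn_to_page(0));
}
```

Starting from a count of 1 everywhere, intended_sequence(5) leaves all
counts untouched, whereas miscompiled_sequence(5) leaves page 5 at 2 and
page 0 at 0, which is the "last reference dropped" condition that sends a
page down the free path.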

~Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29 10:57     ` Andrew Cooper
@ 2015-05-29 11:17       ` M A Young
  2015-05-29 18:12         ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: M A Young @ 2015-05-29 11:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jason Fritcher

On Fri, 29 May 2015, Andrew Cooper wrote:

> On 29/05/15 11:50, M A Young wrote:
> > On Fri, 29 May 2015, Andrew Cooper wrote:
> >
> >> Are you in a position to compile identical Xen 4.5 source with two different
> >> versions of gcc?  (current staging-4.5 even has the gcc5 build fix in)
> >>
> >> If it is a gcc compiler bug, we would expect the version compiled with gcc
> >> 4.9 to work fine, but the one compiled with 5 to fail in the identified
> >> manner.
> > I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> > boot for me, but if I replace xen.gz with one from the same code built on 
> > Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> > available via 
> > http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> > if anyone else wants to do some testing.
> >
> > 	Michael Young
> 
> Do you have easy access to xen-syms from each build?

Yes.

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29 10:50   ` M A Young
@ 2015-05-29 10:57     ` Andrew Cooper
  2015-05-29 11:17       ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2015-05-29 10:57 UTC (permalink / raw)
  To: M A Young; +Cc: xen-devel, Jason Fritcher

On 29/05/15 11:50, M A Young wrote:
> On Fri, 29 May 2015, Andrew Cooper wrote:
>
>> Are you in a position to compile identical Xen 4.5 source with two different
>> versions of gcc?  (current staging-4.5 even has the gcc5 build fix in)
>>
>> If it is a gcc compiler bug, we would expect the version compiled with gcc
>> 4.9 to work fine, but the one compiled with 5 to fail in the identified
>> manner.
> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> boot for me, but if I replace xen.gz with one from the same code built on 
> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> available via 
> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> if anyone else wants to do some testing.
>
> 	Michael Young

Do you have easy access to xen-syms from each build?

~Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29 10:09 ` Andrew Cooper
@ 2015-05-29 10:50   ` M A Young
  2015-05-29 10:57     ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: M A Young @ 2015-05-29 10:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jason Fritcher


On Fri, 29 May 2015, Andrew Cooper wrote:

> Are you in a position to compile identical Xen 4.5 source with two different
> versions of gcc?  (current staging-4.5 even has the gcc5 build fix in)
> 
> If it is a gcc compiler bug, we would expect the version compiled with gcc
> 4.9 to work fine, but the one compiled with 5 to fail in the identified
> manner.

I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
boot for me, but if I replace xen.gz with one from the same code built on 
Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
available via 
http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
if anyone else wants to do some testing.

	Michael Young

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
  2015-05-29  6:24 Jason Fritcher
@ 2015-05-29 10:09 ` Andrew Cooper
  2015-05-29 10:50   ` M A Young
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2015-05-29 10:09 UTC (permalink / raw)
  To: Jason Fritcher, xen-devel


On 29/05/15 07:24, Jason Fritcher wrote:
> On Wed, 20 May 2015, Major Hayden wrote:
>
> > On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > > Considering that no-one else is seeing this - is this perhaps connected
> > > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > > put_page() would need to drop the last reference of a page, yet
> > > page_get_owner_and_reference() doesn't obtain a reference when
> > > a page is unallocated (and hence unowned), i.e. normally a page
> > > would have a refcount of at least 2 here. Hence this would be
> > > possible only due to a race, but the exact same race to be observed
> > > on different hardware _and_ under an emulator is extremely unlikely.
>
> You could try with the xen.gz file from
> https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
> It is roughly the same version of xen but built against Fedora 21 and gcc
> 4.9.2. If that works then it probably is gcc 5.
>
> Greetings,
>
> I have run into pretty much the same issue as the original poster.
>
> I am running a recently updated Arch Linux system, with GCC 5.1.0,
> using UEFI and gummiboot to boot. I currently have a build of Xen
> 4.4.1, built with GCC 4.9.2 from before my last update, that is
> functioning correctly on this machine. But the builds of Xen 4.5.0,
> using GCC 5 and mingw64-binutils for the EFI binary, are all failing
> when Xen starts the Linux kernel, with the same error mentioned in the
> subject. Below is the boot log I captured via the serial port.
>
> http://pastebin.com/bBC78306
>
> Wondering if my specific toolchain was the issue, I downloaded the
> Fedora 22 version of xen-hypervisor and installed its EFI Xen binary
> over my compiled binary and received an identical error message, with
> slightly different addresses in the panic dump. The Fedora version was
> compiled with GCC 5.0.1. Below is the boot log I captured from that
> binary.
>
> http://pastebin.com/jvg1JazC
>
> After finding this thread, and specifically, the quoted message above,
> I downloaded that xen-hypervisor package and installed its EFI Xen
> binary. That binary boots successfully, as seen by the captured boot
> log below.
>
> http://pastebin.com/DKxwaU2U
>
> So, while I’m not familiar enough with Xen to begin to have an idea of
> what could possibly be wrong with Xen or GCC 5 to be causing this bug,
> I’d like to do what I can to track down the issue so I can get a
> working build of Xen 4.5. :)

Are you in a position to compile identical Xen 4.5 source with two
different versions of gcc?  (The current staging-4.5 branch even has the
gcc5 build fix in.)

If it is a gcc compiler bug, we would expect the version compiled with
gcc 4.9 to work fine, but the one compiled with gcc 5 to fail in the
identified manner.
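
For readers following the refcount argument quoted earlier in this message, here is a toy model of why a get/put pair on an allocated page should never be the one that frees it. This is only a sketch: `ToyPage`, `toy_get_page`, `toy_put_page` and `fault_path_frees` are hypothetical names made up for illustration, and none of this is Xen's actual code.

```python
class ToyPage:
    """Toy stand-in for a page with a reference count (not Xen's page structure)."""

    def __init__(self, refcount):
        # 0 models an unallocated (and hence unowned) page;
        # 1 models an allocated page holding only its allocation reference.
        self.refcount = refcount


def toy_get_page(page):
    """Take an extra reference only if the page already holds one."""
    if page.refcount == 0:
        return False  # unallocated page: no reference is obtained
    page.refcount += 1
    return True


def toy_put_page(page):
    """Drop one reference; return True if it was the last (page would be freed)."""
    assert page.refcount > 0
    page.refcount -= 1
    return page.refcount == 0


def fault_path_frees(page):
    """Model a get followed by a put; True means the put freed the page."""
    if not toy_get_page(page):
        return False  # nothing was taken, so nothing is put
    # For an allocated page the count is at least 2 here.
    return toy_put_page(page)
```

In this model neither an allocated page (`ToyPage(1)`) nor an unallocated one (`ToyPage(0)`) is ever freed by the get/put pair, so reaching the free path from there would need either a race or code that does not behave as written, which is why a miscompilation is the suspect.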

~Andrew

* Re: Xen BUG at page_alloc.c:1738 (Xen 4.5)
@ 2015-05-29  6:24 Jason Fritcher
  2015-05-29 10:09 ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Fritcher @ 2015-05-29  6:24 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 2728 bytes --]

On Wed, 20 May 2015, Major Hayden wrote:

> On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > Considering that no-one else is seeing this - is this perhaps connected
> > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > put_page() would need to drop the last reference of a page, yet
> > page_get_owner_and_reference() doesn't obtain a reference when
> > a page is unallocated (and hence unowned), i.e. normally a page
> > would have a refcount of at least 2 here. Hence this would be
> > possible only due to a race, but the exact same race to be observed
> > on different hardware _and_ under an emulator is extremely unlikely.

You could try with the xen.gz file from 
https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
It is roughly the same version of xen but built against Fedora 21 and gcc 
4.9.2. If that works then it probably is gcc 5.

Greetings,

I have run into pretty much the same issue as the original poster.

I am running a recently updated Arch Linux system, with GCC 5.1.0, using UEFI and gummiboot to boot. I currently have a build of Xen 4.4.1, built with GCC 4.9.2 from before my last update, that is functioning correctly on this machine. But the builds of Xen 4.5.0, using GCC 5 and mingw64-binutils for the EFI binary, are all failing when Xen starts the Linux kernel, with the same error mentioned in the subject. Below is the boot log I captured via the serial port.

http://pastebin.com/bBC78306

Wondering if my specific toolchain was the issue, I downloaded the Fedora 22 version of xen-hypervisor and installed its EFI Xen binary over my compiled binary and received an identical error message, with slightly different addresses in the panic dump. The Fedora version was compiled with GCC 5.0.1. Below is the boot log I captured from that binary.

http://pastebin.com/jvg1JazC

After finding this thread, and specifically, the quoted message above, I downloaded that xen-hypervisor package and installed its EFI Xen binary. That binary boots successfully, as seen by the captured boot log below.

http://pastebin.com/DKxwaU2U

So, while I’m not familiar enough with Xen to begin to have an idea of what could possibly be wrong with Xen or GCC 5 to be causing this bug, I’d like to do what I can to track down the issue so I can get a working build of Xen 4.5. :)

Thanks!

--
Jason Fritcher


Thread overview: 17+ messages
2015-05-19 18:06 Xen BUG at page_alloc.c:1738 (Xen 4.5) Major Hayden
2015-05-19 18:16 ` Andrew Cooper
2015-05-20  2:11   ` Major Hayden
2015-05-20 10:41 ` Jan Beulich
2015-05-20 16:52   ` Major Hayden
2015-05-20 19:51     ` M A Young
2015-05-29  6:24 Jason Fritcher
2015-05-29 10:09 ` Andrew Cooper
2015-05-29 10:50   ` M A Young
2015-05-29 10:57     ` Andrew Cooper
2015-05-29 11:17       ` M A Young
2015-05-29 18:12         ` Andrew Cooper
2015-05-30 22:07           ` M A Young
2015-05-30 22:43             ` Andrew Cooper
2015-06-01  7:40               ` Jan Beulich
2015-06-01  7:47                 ` M A Young
2015-06-06 21:00                   ` M A Young
