* Xen PV domain regression with KASLR enabled (kernel 3.16)
@ 2014-08-08 11:20 Stefan Bader
  2014-08-08 12:43 ` [Xen-devel] " David Vrabel
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-08 11:20 UTC (permalink / raw)
  To: xen-devel, Linux Kernel Mailing List; +Cc: Kees Cook, David Vrabel


Unfortunately I have not yet figured out why this happens, but can confirm by
compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
is ok, but with it enabled there are issues (actually a dom0 does not even boot
as a follow up error).

Details can be seen in [1] but basically this is always some portion of a
vmalloc allocation failing after hitting a freshly allocated PTE space not being
PTE_NONE (usually from a module load triggered by systemd-udevd). In the
non-dom0 case this repeats many times but ends in a guest that allows login. In
the dom0 case there is a more fatal error at some point causing a crash.

I have not tried this for a normal PV guest but for dom0 it also does not help
to add "nokaslr" to the kernel command-line.

-Stefan

19:35:02 [ 2.547049] ------------[ cut here ]------------
19:35:02 [ 2.547065] WARNING: CPU: 0 PID: 97 at
/build/buildd/linux-3.16.0/mm/vmalloc.c:128 vmap_page_range_noflush+0x2d1/0x370()
19:35:02 [ 2.547069] Modules linked in:
19:35:02 [ 2.547073] CPU: 0 PID: 97 Comm: systemd-udevd Not tainted
3.16.0-6-generic #11-Ubuntu
19:35:02 [ 2.547077] 0000000000000009 ffff880002defb98 ffffffff81755538
0000000000000000
19:35:02 [ 2.547082] ffff880002defbd0 ffffffff8106bb0d ffff88000400ec88
0000000000000001
19:35:02 [ 2.547086] ffff880002fcfb00 ffffffffc0391000 0000000000000000
ffff880002defbe0
19:35:02 [ 2.547090] Call Trace:
19:35:02 [ 2.547096] [<ffffffff81755538>] dump_stack+0x45/0x56
19:35:02 [ 2.547101] [<ffffffff8106bb0d>] warn_slowpath_common+0x7d/0xa0
19:35:02 [ 2.547104] [<ffffffff8106bbea>] warn_slowpath_null+0x1a/0x20
19:35:02 [ 2.547108] [<ffffffff81197c31>] vmap_page_range_noflush+0x2d1/0x370
19:35:02 [ 2.547112] [<ffffffff81197cfe>] map_vm_area+0x2e/0x40
19:35:02 [ 2.547115] [<ffffffff8119a058>] __vmalloc_node_range+0x188/0x280
19:35:02 [ 2.547120] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
19:35:02 [ 2.547124] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
19:35:02 [ 2.547129] [<ffffffff8104f294>] module_alloc+0x74/0xd0
19:35:02 [ 2.547132] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
19:35:02 [ 2.547135] [<ffffffff810e92b4>] module_alloc_update_bounds+0x14/0x70
19:35:02 [ 2.547146] [<ffffffff810e9a6c>] layout_and_allocate+0x74c/0xc70
19:35:02 [ 2.547149] [<ffffffff810ea063>] load_module+0xd3/0x1b70
19:35:02 [ 2.547154] [<ffffffff811cfeb1>] ? vfs_read+0xf1/0x170
19:35:02 [ 2.547157] [<ffffffff810e7aa1>] ? copy_module_from_fd.isra.46+0x121/0x180
19:35:02 [ 2.547161] [<ffffffff810ebc76>] SyS_finit_module+0x86/0xb0
19:35:02 [ 2.547167] [<ffffffff8175de7f>] tracesys+0xe1/0xe6
19:35:02 [ 2.547169] ---[ end trace 8a5de7fc66e75fe4 ]---
19:35:02 [ 2.547172] vmalloc: allocation failure, allocated 20480 of 24576 bytes
19:35:02 [ 2.547175] systemd-udevd: page allocation failure: order:0, mode:0xd2


[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350522




* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-08 11:20 Xen PV domain regression with KASLR enabled (kernel 3.16) Stefan Bader
@ 2014-08-08 12:43 ` David Vrabel
  2014-08-08 14:35   ` Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: David Vrabel @ 2014-08-08 12:43 UTC (permalink / raw)
  To: Stefan Bader, xen-devel, Linux Kernel Mailing List
  Cc: Kees Cook, David Vrabel

On 08/08/14 12:20, Stefan Bader wrote:
> Unfortunately I have not yet figured out why this happens, but can confirm by
> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
> is ok, but with it enabled there are issues (actually a dom0 does not even boot
> as a follow up error).
> 
> Details can be seen in [1] but basically this is always some portion of a
> vmalloc allocation failing after hitting a freshly allocated PTE space not being
> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
> non-dom0 case this repeats many times but ends in a guest that allows login. In
> the dom0 case there is a more fatal error at some point causing a crash.
> 
> I have not tried this for a normal PV guest but for dom0 it also does not help
> to add "nokaslr" to the kernel command-line.

Maybe it's overlapping with regions of the virtual address space
reserved for Xen?  What is the VA that fails?

David


* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-08 12:43 ` [Xen-devel] " David Vrabel
@ 2014-08-08 14:35   ` Stefan Bader
  2014-08-12 17:28     ` Kees Cook
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-08 14:35 UTC (permalink / raw)
  To: David Vrabel, xen-devel, Linux Kernel Mailing List; +Cc: Kees Cook


On 08.08.2014 14:43, David Vrabel wrote:
> On 08/08/14 12:20, Stefan Bader wrote:
>> Unfortunately I have not yet figured out why this happens, but can confirm by
>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>> as a follow up error).
>>
>> Details can be seen in [1] but basically this is always some portion of a
>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>> the dom0 case there is a more fatal error at some point causing a crash.
>>
>> I have not tried this for a normal PV guest but for dom0 it also does not help
>> to add "nokaslr" to the kernel command-line.
> 
> Maybe it's overlapping with regions of the virtual address space
> reserved for Xen?  What is the VA that fails?
> 
> David
> 
Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
missing p2m tables? I probably need to add debugging to find the failing VA (iow
not sure whether it might be somewhere in the stacktraces in the report).

The kernel-command line does not seem to be looked at. It should put something
into dmesg and that never shows up. Also today's random feature is other PV
guests crashing after a bit somewhere in the check_for_corruption area...

-Stefan




* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-08 14:35   ` Stefan Bader
@ 2014-08-12 17:28     ` Kees Cook
  2014-08-12 18:05       ` Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: Kees Cook @ 2014-08-12 17:28 UTC (permalink / raw)
  To: Stefan Bader; +Cc: David Vrabel, xen-devel, Linux Kernel Mailing List

On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 08.08.2014 14:43, David Vrabel wrote:
>> On 08/08/14 12:20, Stefan Bader wrote:
>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>> as a follow up error).
>>>
>>> Details can be seen in [1] but basically this is always some portion of a
>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>
>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>> to add "nokaslr" to the kernel command-line.
>>
>> Maybe it's overlapping with regions of the virtual address space
>> reserved for Xen?  What is the VA that fails?
>>
>> David
>>
> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
> missing p2m tables? I probably need to add debugging to find the failing VA (iow
> not sure whether it might be somewhere in the stacktraces in the report).
>
> The kernel-command line does not seem to be looked at. It should put something
> into dmesg and that never shows up. Also today's random feature is other PV
> guests crashing after a bit somewhere in the check_for_corruption area...

Right now, the kaslr code just deals with initrd, cmdline, etc. If
there are other reserved regions that aren't listed in the e820, it'll
need to locate and skip them.

-Kees

-- 
Kees Cook
Chrome OS Security


* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-12 17:28     ` Kees Cook
@ 2014-08-12 18:05       ` Stefan Bader
  2014-08-12 18:53         ` Kees Cook
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-12 18:05 UTC (permalink / raw)
  To: Kees Cook; +Cc: David Vrabel, xen-devel, Linux Kernel Mailing List


On 12.08.2014 19:28, Kees Cook wrote:
> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>> On 08.08.2014 14:43, David Vrabel wrote:
>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>> as a follow up error).
>>>>
>>>> Details can be seen in [1] but basically this is always some portion of a
>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>
>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>> to add "nokaslr" to the kernel command-line.
>>>
>>> Maybe it's overlapping with regions of the virtual address space
>>> reserved for Xen?  What is the VA that fails?
>>>
>>> David
>>>
>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>> not sure whether it might be somewhere in the stacktraces in the report).
>>
>> The kernel-command line does not seem to be looked at. It should put something
>> into dmesg and that never shows up. Also today's random feature is other PV
>> guests crashing after a bit somewhere in the check_for_corruption area...
> 
> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> there are other reserved regions that aren't listed in the e820, it'll
> need to locate and skip them.
> 
> -Kees
> 
Making my little steps towards more understanding I figured out that it isn't
the code that does the relocation. Even with that completely disabled there were
the vmalloc issues. What causes it seems to be the default of the upper limit
and that this changes the split between kernel and modules to 1G+1G instead of
512M+1.5G. That is the reason why nokaslr has no effect.

-Stefan




* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-12 18:05       ` Stefan Bader
@ 2014-08-12 18:53         ` Kees Cook
  2014-08-12 19:07           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 29+ messages in thread
From: Kees Cook @ 2014-08-12 18:53 UTC (permalink / raw)
  To: Stefan Bader; +Cc: David Vrabel, xen-devel, Linux Kernel Mailing List

On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
<stefan.bader@canonical.com> wrote:
> On 12.08.2014 19:28, Kees Cook wrote:
>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>> as a follow up error).
>>>>>
>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>
>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>> to add "nokaslr" to the kernel command-line.
>>>>
>>>> Maybe it's overlapping with regions of the virtual address space
>>>> reserved for Xen?  What is the VA that fails?
>>>>
>>>> David
>>>>
>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>
>>> The kernel-command line does not seem to be looked at. It should put something
>>> into dmesg and that never shows up. Also today's random feature is other PV
>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>
>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>> there are other reserved regions that aren't listed in the e820, it'll
>> need to locate and skip them.
>>
>> -Kees
>>
> Making my little steps towards more understanding I figured out that it isn't
> the code that does the relocation. Even with that completely disabled there were
> the vmalloc issues. What causes it seems to be the default of the upper limit
> and that this changes the split between kernel and modules to 1G+1G instead of
> 512M+1.5G. That is the reason why nokaslr has no effect.

Oh! That's very interesting. There must be some assumption in Xen
about the kernel VM layout then?

-Kees

-- 
Kees Cook
Chrome OS Security


* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-12 18:53         ` Kees Cook
@ 2014-08-12 19:07           ` Konrad Rzeszutek Wilk
  2014-08-21 16:03             ` Kees Cook
  0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-12 19:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Stefan Bader, xen-devel, David Vrabel, Linux Kernel Mailing List

On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
> <stefan.bader@canonical.com> wrote:
> > On 12.08.2014 19:28, Kees Cook wrote:
> >> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
> >>> On 08.08.2014 14:43, David Vrabel wrote:
> >>>> On 08/08/14 12:20, Stefan Bader wrote:
> >>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
> >>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
> >>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
> >>>>> as a follow up error).
> >>>>>
> >>>>> Details can be seen in [1] but basically this is always some portion of a
> >>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
> >>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
> >>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
> >>>>> the dom0 case there is a more fatal error at some point causing a crash.
> >>>>>
> >>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
> >>>>> to add "nokaslr" to the kernel command-line.
> >>>>
> >>>> Maybe it's overlapping with regions of the virtual address space
> >>>> reserved for Xen?  What is the VA that fails?
> >>>>
> >>>> David
> >>>>
> >>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
> >>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
> >>> not sure whether it might be somewhere in the stacktraces in the report).
> >>>
> >>> The kernel-command line does not seem to be looked at. It should put something
> >>> into dmesg and that never shows up. Also today's random feature is other PV
> >>> guests crashing after a bit somewhere in the check_for_corruption area...
> >>
> >> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> >> there are other reserved regions that aren't listed in the e820, it'll
> >> need to locate and skip them.
> >>
> >> -Kees
> >>
> > Making my little steps towards more understanding I figured out that it isn't
> > the code that does the relocation. Even with that completely disabled there were
> > the vmalloc issues. What causes it seems to be the default of the upper limit
> > and that this changes the split between kernel and modules to 1G+1G instead of
> > 512M+1.5G. That is the reason why nokaslr has no effect.
> 
> Oh! That's very interesting. There must be some assumption in Xen
> about the kernel VM layout then?

No. I think most of the changes that look at PTE and PMDs are all
in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
too aggressive.
> 
> -Kees
> 
> -- 
> Kees Cook
> Chrome OS Security
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-12 19:07           ` Konrad Rzeszutek Wilk
@ 2014-08-21 16:03             ` Kees Cook
  2014-08-22  9:20               ` Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: Kees Cook @ 2014-08-21 16:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefan Bader, xen-devel, David Vrabel, Linux Kernel Mailing List

On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>> <stefan.bader@canonical.com> wrote:
>> > On 12.08.2014 19:28, Kees Cook wrote:
>> >> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>> >>> On 08.08.2014 14:43, David Vrabel wrote:
>> >>>> On 08/08/14 12:20, Stefan Bader wrote:
>> >>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>> >>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>> >>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>> >>>>> as a follow up error).
>> >>>>>
>> >>>>> Details can be seen in [1] but basically this is always some portion of a
>> >>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>> >>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>> >>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>> >>>>> the dom0 case there is a more fatal error at some point causing a crash.
>> >>>>>
>> >>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>> >>>>> to add "nokaslr" to the kernel command-line.
>> >>>>
>> >>>> Maybe it's overlapping with regions of the virtual address space
>> >>>> reserved for Xen?  What is the VA that fails?
>> >>>>
>> >>>> David
>> >>>>
>> >>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>> >>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>> >>> not sure whether it might be somewhere in the stacktraces in the report).
>> >>>
>> >>> The kernel-command line does not seem to be looked at. It should put something
>> >>> into dmesg and that never shows up. Also today's random feature is other PV
>> >>> guests crashing after a bit somewhere in the check_for_corruption area...
>> >>
>> >> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>> >> there are other reserved regions that aren't listed in the e820, it'll
>> >> need to locate and skip them.
>> >>
>> >> -Kees
>> >>
>> > Making my little steps towards more understanding I figured out that it isn't
>> > the code that does the relocation. Even with that completely disabled there were
>> > the vmalloc issues. What causes it seems to be the default of the upper limit
>> > and that this changes the split between kernel and modules to 1G+1G instead of
>> > 512M+1.5G. That is the reason why nokaslr has no effect.
>>
>> Oh! That's very interesting. There must be some assumption in Xen
>> about the kernel VM layout then?
>
> No. I think most of the changes that look at PTE and PMDs are all
> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
> too aggressive

(Sorry I had to cut our chat short at Kernel Summit!)

It sounded like there was another region of memory that Xen was setting
aside for page tables? But Stefan's investigation seems to show this
isn't about layout at boot (since the kaslr=0 case means no relocation
is done). Sounds more like the split between kernel and modules area,
so I'm not sure how the memory area after the initrd would be part of
this. What should next steps be, do you think?

-Kees


>>
>> -Kees
>>
>> --
>> Kees Cook
>> Chrome OS Security
>>



-- 
Kees Cook
Chrome OS Security


* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-21 16:03             ` Kees Cook
@ 2014-08-22  9:20               ` Stefan Bader
  2014-08-26 16:01                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-22  9:20 UTC (permalink / raw)
  To: Kees Cook, Konrad Rzeszutek Wilk
  Cc: xen-devel, David Vrabel, Linux Kernel Mailing List


On 21.08.2014 18:03, Kees Cook wrote:
> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>>> <stefan.bader@canonical.com> wrote:
>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>>>>> as a follow up error).
>>>>>>>>
>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>>>>
>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>>>>> to add "nokaslr" to the kernel command-line.
>>>>>>>
>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>> reserved for Xen?  What is the VA that fails?
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>>>>
>>>>>> The kernel-command line does not seem to be looked at. It should put something
>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>>>>
>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>> need to locate and skip them.
>>>>>
>>>>> -Kees
>>>>>
>>>> Making my little steps towards more understanding I figured out that it isn't
>>>> the code that does the relocation. Even with that completely disabled there were
>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
>>>> and that this changes the split between kernel and modules to 1G+1G instead of
>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
>>>
>>> Oh! That's very interesting. There must be some assumption in Xen
>>> about the kernel VM layout then?
>>
>> No. I think most of the changes that look at PTE and PMDs are all
>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>> too aggressive
> 
> (Sorry I had to cut our chat short at Kernel Summit!)
> 
> It sounded like there was another region of memory that Xen was setting
> aside for page tables? But Stefan's investigation seems to show this
> isn't about layout at boot (since the kaslr=0 case means no relocation
> is done). Sounds more like the split between kernel and modules area,
> so I'm not sure how the memory area after the initrd would be part of
> this. What should next steps be, do you think?

Maybe layout, but not about placement of the kernel. Basically leaving KASLR
enabled but shrink the possible range back to the original kernel/module split
is fine as well.

I am bouncing between feeling close to understanding and being confused. Konrad
suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
round. The warning that occurs first indicates that the PTE obtained for
some vmalloc mapping is not unused (0) as expected. So it feels rather
like some cleanup has *not* been done.

Let me think aloud a bit... What seems to cause this is the change of the
kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of
vsyscalls and a 2M hole at the end). In vaddr terms this means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is
pointed at by:

PGD[511]->PUD[510]->PMD[0]->PTE[0]

In the old layout the module vaddr area would start in the same PUD area, but
with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
and the hole would cover PUD[511].

xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit
since I am not sure I understand enough details) I believe is the one PMD
pointed at by PGD[511]->PUD[510]. That could mean that before the change
xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but
not after the change. Maybe that also means it always should have covered more
but this would not be observed as long as modules would not claim more than
512M? I still need to check the vaddr ranges for which xen_cleanhighmap is
actually called. The modules vaddr space would normally not be touched (only
with DEBUG set). I moved that to be unconditionally done but then this might be
of no use when it needs to cover a different PMD...

Really not sure here. But maybe a starter for others...

-Stefan

> 
> -Kees
> 
> 
>>>
>>> -Kees
>>>
>>> --
>>> Kees Cook
>>> Chrome OS Security
>>>
> 
> 
> 





* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-22  9:20               ` Stefan Bader
@ 2014-08-26 16:01                 ` Konrad Rzeszutek Wilk
  2014-08-27  8:03                   ` Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-26 16:01 UTC (permalink / raw)
  To: Stefan Bader
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List

On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
> On 21.08.2014 18:03, Kees Cook wrote:
> > On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> >> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
> >>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
> >>> <stefan.bader@canonical.com> wrote:
> >>>> On 12.08.2014 19:28, Kees Cook wrote:
> >>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
> >>>>>> On 08.08.2014 14:43, David Vrabel wrote:
> >>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
> >>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
> >>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
> >>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
> >>>>>>>> as a follow up error).
> >>>>>>>>
> >>>>>>>> Details can be seen in [1] but basically this is always some portion of a
> >>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
> >>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
> >>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
> >>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
> >>>>>>>>
> >>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
> >>>>>>>> to add "nokaslr" to the kernel command-line.
> >>>>>>>
> >>>>>>> Maybe it's overlapping with regions of the virtual address space
> >>>>>>> reserved for Xen?  What is the VA that fails?
> >>>>>>>
> >>>>>>> David
> >>>>>>>
> >>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
> >>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
> >>>>>> not sure whether it might be somewhere in the stacktraces in the report).
> >>>>>>
> >>>>>> The kernel-command line does not seem to be looked at. It should put something
> >>>>>> into dmesg and that never shows up. Also today's random feature is other PV
> >>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
> >>>>>
> >>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> >>>>> there are other reserved regions that aren't listed in the e820, it'll
> >>>>> need to locate and skip them.
> >>>>>
> >>>>> -Kees
> >>>>>
> >>>> Making my little steps towards more understanding I figured out that it isn't
> >>>> the code that does the relocation. Even with that completely disabled there were
> >>>> the vmalloc issues. What causes it seems to be the default of the upper limit
> >>>> and that this changes the split between kernel and modules to 1G+1G instead of
> >>>> 512M+1.5G. That is the reason why nokaslr has no effect.
> >>>
> >>> Oh! That's very interesting. There must be some assumption in Xen
> >>> about the kernel VM layout then?
> >>
> >> No. I think most of the changes that look at PTE and PMDs are all
> >> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
> >> too aggressive
> > 
> > (Sorry I had to cut our chat short at Kernel Summit!)
> > 
> > It sounded like there was another region of memory that Xen was setting
> > aside for page tables? But Stefan's investigation seems to show this
> > isn't about layout at boot (since the kaslr=0 case means no relocation
> > is done). Sounds more like the split between kernel and modules area,
> > so I'm not sure how the memory area after the initrd would be part of
> > this. What should next steps be, do you think?
> 
> Maybe layout, but not about placement of the kernel. Basically leaving KASLR
> enabled but shrink the possible range back to the original kernel/module split
> is fine as well.
> 
> I am bouncing between feeling close to understanding and being confused. Konrad
> suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
> round. The warning that occurs first indicates that the PTE that was obtained
> for some vmalloc mapping is not unused (0) as expected. So it feels rather
> like some cleanup has *not* been done.
> 
> Let me think aloud a bit... What seems to cause this is the change of the
> kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of
> vsyscalls and a 2M hole at the end). Which in vaddr terms means:
> 
> Before:
> ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> 
> After:
> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
> 
> Now, *if* I got this right, this means the kernel starts on a vaddr that is
> pointed at by:
> 
> PGD[510]->PUD[510]->PMD[0]->PTE[0]
> 
> In the old layout the module vaddr area would start in the same PUD area, but
> with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
> and the hole would cover PUD[511].

I think there is a fixmap there too?
> 
> xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit
> since I am not sure I understand enough details) I believe is the one PMD
> pointed at by PGD[510]->PUD[510]. That could mean that before the change

That sounds right.

I don't know if you saw:

1248 #ifdef DEBUG                                                                    
1249         /* This is superflous and is not neccessary, but you know what          
1250          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of      
1251          * anything at this stage. */                                           
1252         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);  
1253 #endif                                                                          
1254 }                                    

Which was me being a bit paranoid and figured it might help in troubleshooting.
If you disable that does it work?

> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space
> before the change, but not after it. Maybe that also means it should always
> have covered more, but this would not be observed as long as modules did not
> claim more than 512M? I still need to check the vaddr ranges for which
> xen_cleanhighmap is actually called. The module vaddr space would normally not
> be touched (only with DEBUG set). I moved that to be done unconditionally, but
> then it might be of no use when it needs to cover a different PMD...

What does the toolstack say with regard to allocating the memory? It is pretty
verbose (domainloginfo..something) in printing out the vaddr of where
it stashes the kernel, ramdisk, P2M, and the pagetables (which of course
all need to fit within the 512MB, now 1GB, area).

> 
> Really not sure here. But maybe a starter for others...
> 
> -Stefan
> 
> > 
> > -Kees
> > 
> > 
> >>>
> >>> -Kees
> >>>
> >>> --
> >>> Kees Cook
> >>> Chrome OS Security
> >>>
> >>> _______________________________________________
> >>> Xen-devel mailing list
> >>> Xen-devel@lists.xen.org
> >>> http://lists.xen.org/xen-devel
> > 
> > 
> > 
> 
> 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-26 16:01                 ` Konrad Rzeszutek Wilk
@ 2014-08-27  8:03                   ` Stefan Bader
  2014-08-27 20:49                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-27  8:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 10238 bytes --]

On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
>> On 21.08.2014 18:03, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>>>>> <stefan.bader@canonical.com> wrote:
>>>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>>>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>>>>>>> as a follow up error).
>>>>>>>>>>
>>>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>>>>>>
>>>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>>>>>>> to add "nokaslr" to the kernel command-line.
>>>>>>>>>
>>>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>>>> reserved for Xen?  What is the VA that fails?
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>>>>>>
>>>>>>>> The kernel-command line does not seem to be looked at. It should put something
>>>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
>>>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>>>>>>
>>>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>>>> need to locate and skip them.
>>>>>>>
>>>>>>> -Kees
>>>>>>>
>>>>>> Making my little steps towards more understanding I figured out that it isn't
>>>>>> the code that does the relocation. Even with that completely disabled there were
>>>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
>>>>>> and that this changes the split between kernel and modules to 1G+1G instead of
>>>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
>>>>>
>>>>> Oh! That's very interesting. There must be some assumption in Xen
>>>>> about the kernel VM layout then?
>>>>
>>>> No. I think most of the changes that look at PTE and PMDs are all
>>>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>>>> too aggressive
>>>
>>> (Sorry I had to cut our chat short at Kernel Summit!)
>>>
>>> It sounded like there was another region of memory that Xen was setting
>>> aside for page tables? But Stefan's investigation seems to show this
>>> isn't about layout at boot (since the kaslr=0 case means no relocation
>>> is done). Sounds more like the split between kernel and modules area,
>>> so I'm not sure how the memory area after the initrd would be part of
>>> this. What should next steps be, do you think?
>>
>> Maybe layout, but not about placement of the kernel. Basically leaving KASLR
>> enabled but shrink the possible range back to the original kernel/module split
>> is fine as well.
>>
>> I am bouncing between feeling close to understand to being confused. Konrad
>> suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
>> round. The warning that occurs first indicates that PTE that was obtained for
>> some vmalloc mapping is not unused (0) as it is expected. So it feels rather
>> like some cleanup has *not* been done.
>>
>> Let me think aloud a bit... What seems to cause this, is the change of the
>> kernel/module split from 512M:1.5G to 1G:1G (not exactly since there is 8M
>> vsyscalls and 2M hole at the end). Which in vaddr terms means:
>>
>> Before:
>> ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
>> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
>>
>> After:
>> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
>> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
>>
>> Now, *if* I got this right, this means the kernel starts on a vaddr that is
>> pointed at by:
>>
>> PGD[510]->PUD[510]->PMD[0]->PTE[0]
>>
>> In the old layout the module vaddr area would start in the same PUD area, but
>> with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
>> and the hole would cover PUD[511].
> 
> I think there is a fixmap there too?

Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it.
So fixmap seems to be in the 2M space before the vsyscalls.
Btw, apparently I got the PGD index wrong. It is of course 511, not 510.

init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..255]->kernel
                                                               [256..511]->mod
                                       [511]->level2_fixmap_pgt[0..505]->mod
                                                               [506]->fixmap
                                                               [507..510]->vsysc
                                                               [511]->hole

With the change being level2_kernel_pgt completely covering kernel only.

>>
>> xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit
>> since I am not sure I understand enough details) I believe is the one PMD
>> pointed at by PGD[510]->PUD[510]. That could mean that before the change
> 
> That sounds right.
> 
> I don't know if you saw:
> 
> 1248 #ifdef DEBUG                                                                    
> 1249         /* This is superflous and is not neccessary, but you know what          
> 1250          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of      
> 1251          * anything at this stage. */                                           
> 1252         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);  
> 1253 #endif                                                                          
> 1254 }                                    

I saw that, but it would have no effect even when run: xen_cleanhighmap clamps
the PMDs it walks over to the kernel_level2_pgt page, and MODULES_VADDR is now
mapped only from level2_fixmap_pgt.
Even with the old layout it might do less than anticipated, as it would only
cover 512M and then stop. But I think it really does not matter.
> 
> Which was me being a bit paranoid and figured it might help in troubleshooting.
> If you disable that does it work?
> 
>> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but
>> not after the change. Maybe that also means it always should have covered more
>> but this would not be observed as long as modules would not claim more than
>> 512M? I still need to check the vaddr ranges for which xen_cleanhighmap is
>> actually called. The modules vaddr space would normally not be touched (only
>> with DEBUG set). I moved that to be unconditionally done but then this might be
>> of no use when it needs to cover a different PMD...
> 
> What does the toolstack say in regards to allocating the memory? It is pretty
> verbose (domainloginfo..something) in printing out the vaddr of where
> it stashes the kernel, ramdisk, P2M, and the pagetables (which of course
> need to fit all within the 512MB, now 1GB area).

That is taken from starting a 2G PV domU with pvgrub (not pygrub):

Xen Minimal OS!
  start_info: 0xd90000(VA)
    nr_pages: 0x80000
  shared_inf: 0xdfe92000(MA)
     pt_base: 0xd93000(VA)
nr_pt_frames: 0xb
    mfn_list: 0x990000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line:
  stack:      0x94f860-0x96f860
MM: Init
      _text: 0x0(VA)
     _etext: 0x6000d(VA)
   _erodata: 0x78000(VA)
     _edata: 0x80b00(VA)
stack start: 0x94f860(VA)
       _end: 0x98fe68(VA)
  start_pfn: da1
    max_pfn: 80000
Mapping memory range 0x1000000 - 0x80000000
setting 0x0-0x78000 readonly


For a moment I was puzzled by the use of max_pfn_mapped in the generic
cleanup_highmap function of 64bit x86. It limits the cleanup to the start of the
mfn_list, and the max_pfn_mapped value changes soon after to reflect the total
amount of memory of the guest.
Making a copy showed it to be around 51M at the time of cleanup. That initially
looks suspect, but Xen has already replaced the page tables. The compile-time
variants would have 2M large pages on the whole level2_kernel_pgt range. But as
far as I can see, the Xen-provided ones don't put in mappings for anything
beyond the provided boot stack, which is cleaned up in xen_cleanhighmap.

So not much further... but then I think I know what I do next. Probably should
have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
and at least get a crash dump of that situation when it occurs. Then I can dig
in there with crash (really should have thought of that before)...

-Stefan
> 
>>
>> Really not sure here. But maybe a starter for others...
>>
>> -Stefan
>>
>>>
>>> -Kees
>>>
>>>
>>>>>
>>>>> -Kees
>>>>>
>>>>> --
>>>>> Kees Cook
>>>>> Chrome OS Security
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xen.org
>>>>> http://lists.xen.org/xen-devel
>>>
>>>
>>>
>>
>>
> 
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
  2014-08-27  8:03                   ` Stefan Bader
@ 2014-08-27 20:49                     ` Konrad Rzeszutek Wilk
  2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 20:49 UTC (permalink / raw)
  To: Stefan Bader
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List

On Wed, Aug 27, 2014 at 10:03:10AM +0200, Stefan Bader wrote:
> On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
> >> On 21.08.2014 18:03, Kees Cook wrote:
> >>> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
> >>> <konrad.wilk@oracle.com> wrote:
> >>>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
> >>>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
> >>>>> <stefan.bader@canonical.com> wrote:
> >>>>>> On 12.08.2014 19:28, Kees Cook wrote:
> >>>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
> >>>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
> >>>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
> >>>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
> >>>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
> >>>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
> >>>>>>>>>> as a follow up error).
> >>>>>>>>>>
> >>>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
> >>>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
> >>>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
> >>>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
> >>>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
> >>>>>>>>>>
> >>>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
> >>>>>>>>>> to add "nokaslr" to the kernel command-line.
> >>>>>>>>>
> >>>>>>>>> Maybe it's overlapping with regions of the virtual address space
> >>>>>>>>> reserved for Xen?  What is the VA that fails?
> >>>>>>>>>
> >>>>>>>>> David
> >>>>>>>>>
> >>>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
> >>>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
> >>>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
> >>>>>>>>
> >>>>>>>> The kernel-command line does not seem to be looked at. It should put something
> >>>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
> >>>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
> >>>>>>>
> >>>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> >>>>>>> there are other reserved regions that aren't listed in the e820, it'll
> >>>>>>> need to locate and skip them.
> >>>>>>>
> >>>>>>> -Kees
> >>>>>>>
> >>>>>> Making my little steps towards more understanding I figured out that it isn't
> >>>>>> the code that does the relocation. Even with that completely disabled there were
> >>>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
> >>>>>> and that this changes the split between kernel and modules to 1G+1G instead of
> >>>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
> >>>>>
> >>>>> Oh! That's very interesting. There must be some assumption in Xen
> >>>>> about the kernel VM layout then?
> >>>>
> >>>> No. I think most of the changes that look at PTE and PMDs are all
> >>>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
> >>>> too aggressive
> >>>
> >>> (Sorry I had to cut our chat short at Kernel Summit!)
> >>>
> >>> It sounded like there was another region of memory that Xen was setting
> >>> aside for page tables? But Stefan's investigation seems to show this
> >>> isn't about layout at boot (since the kaslr=0 case means no relocation
> >>> is done). Sounds more like the split between kernel and modules area,
> >>> so I'm not sure how the memory area after the initrd would be part of
> >>> this. What should next steps be, do you think?
> >>
> >> Maybe layout, but not about placement of the kernel. Basically leaving KASLR
> >> enabled but shrink the possible range back to the original kernel/module split
> >> is fine as well.
> >>
> >> I am bouncing between feeling close to understand to being confused. Konrad
> >> suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
> >> round. The warning that occurs first indicates that PTE that was obtained for
> >> some vmalloc mapping is not unused (0) as it is expected. So it feels rather
> >> like some cleanup has *not* been done.
> >>
> >> Let me think aloud a bit... What seems to cause this, is the change of the
> >> kernel/module split from 512M:1.5G to 1G:1G (not exactly since there is 8M
> >> vsyscalls and 2M hole at the end). Which in vaddr terms means:
> >>
> >> Before:
> >> ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
> >> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> >>
> >> After:
> >> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
> >> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
> >>
> >> Now, *if* I got this right, this means the kernel starts on a vaddr that is
> >> pointed at by:
> >>
> >> PGD[510]->PUD[510]->PMD[0]->PTE[0]
> >>
> >> In the old layout the module vaddr area would start in the same PUD area, but
> >> with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
> >> and the hole would cover PUD[511].
> > 
> > I think there is a fixmap there too?
> 
> Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it.
> So fixmap seems to be in the 2M space before the vsyscalls.
> Btw, apparently I got the PGD index wrong. It is of course 511, not 510.
> 
> init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..255]->kernel
>                                                                [256..511]->mod
>                                        [511]->level2_fixmap_pgt[0..505]->mod
>                                                                [506]->fixmap
>                                                                [507..510]->vsysc
>                                                                [511]->hole
> 
> With the change being level2_kernel_pgt completely covering kernel only.
> 
> >>
> >> xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit
> >> since I am not sure I understand enough details) I believe is the one PMD
> >> pointed at by PGD[510]->PUD[510]. That could mean that before the change
> > 
> > That sounds right.
> > 
> > I don't know if you saw:
> > 
> > 1248 #ifdef DEBUG                                                                    
> > 1249         /* This is superflous and is not neccessary, but you know what          
> > 1250          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of      
> > 1251          * anything at this stage. */                                           
> > 1252         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);  
> > 1253 #endif                                                                          
> > 1254 }                                    
> 
> I saw that but it would have no effect, even with running it. Because
> xen_cleanhighmap clamps the pmds it walks over to the kernel_level2_pgt page.
> Now MODULES_VADDR is mapped only from level2_fixmap_pgt.
> Even with the old layout it might do less than anticipated as it would only
> cover 512M and stop then. But I think it really does not matter.
> > 
> > Which was me being a bit paranoid and figured it might help in troubleshooting.
> > If you disable that does it work?
> > 
> >> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but
> >> not after the change. Maybe that also means it always should have covered more
> >> but this would not be observed as long as modules would not claim more than
> >> 512M? I still need to check the vaddr ranges for which xen_cleanhighmap is
> >> actually called. The modules vaddr space would normally not be touched (only
> >> with DEBUG set). I moved that to be unconditionally done but then this might be
> >> of no use when it needs to cover a different PMD...
> > 
> > What does the toolstack say in regards to allocating the memory? It is pretty
> > verbose (domainloginfo..something) in printing out the vaddr of where
> > it stashes the kernel, ramdisk, P2M, and the pagetables (which of course
> > need to fit all within the 512MB, now 1GB area).
> 
> That is taken from starting a 2G PV domU with pvgrub (not pygrub):
> 
> Xen Minimal OS!
>   start_info: 0xd90000(VA)
>     nr_pages: 0x80000
>   shared_inf: 0xdfe92000(MA)
>      pt_base: 0xd93000(VA)
> nr_pt_frames: 0xb
>     mfn_list: 0x990000(VA)
>    mod_start: 0x0(VA)
>      mod_len: 0
>        flags: 0x0
>     cmd_line:
>   stack:      0x94f860-0x96f860
> MM: Init
>       _text: 0x0(VA)
>      _etext: 0x6000d(VA)
>    _erodata: 0x78000(VA)
>      _edata: 0x80b00(VA)
> stack start: 0x94f860(VA)
>        _end: 0x98fe68(VA)
>   start_pfn: da1
>     max_pfn: 80000
> Mapping memory range 0x1000000 - 0x80000000
> setting 0x0-0x78000 readonly
> 
> 
> For a moment I was puzzled by the use of max_pfn_mapped in the generic
> cleanup_highmap function of 64bit x86. It limits the cleanup to the start of the
> mfn_list. And the max_pfn_mapped value changes soon after to reflect the total
> amount of memory of the guest.
> Making a copy showed it to be around 51M at the time of cleanup. That initially
> looks suspect but Xen already replaced the page tables. The compile-time
> variants would have 2M large pages on the whole level2_kernel_pgt range. But as
> far as I can see, the Xen provided ones don't put in mappings for anything
> beyond the provided boot stack which is clean in the xen_cleanhighmap.
> 
> So not much further... but then I think I know what I do next. Probably should
> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
> and at least get a crash dump of that situation when it occurs. Then I can dig
> in there with crash (really should have thought of that before)...

<nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
that screams at me, so I fear I will have to wait until you get the crash
and get some clues from that.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-27 20:49                     ` Konrad Rzeszutek Wilk
@ 2014-08-28 18:01                       ` Stefan Bader
  2014-08-28 22:22                         ` Kees Cook
                                           ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Stefan Bader @ 2014-08-28 18:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List,
	Stefan Bader

> > So not much further... but then I think I know what I do next. Probably should
> > have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
> > and at least get a crash dump of that situation when it occurs. Then I can dig
> > in there with crash (really should have thought of that before)...
> 
> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
> that screams at me, so I fear I will have to wait until you get the crash
> and get some clues from that.

Ok, what a journey. So after long hours of painful staring at the code...
(and btw, if someone could tell me how the heck one can do a mfn_to_pfn
in crash, I really would appreciate it :-P)

So at some point I realized that level2_fixmap_pgt seemed to contain
an oddly high number of entries (given that the virtual address that
failed would be mapped by entry 0). And suddenly I realized that apart
from entries 506 and 507 (actual fixmap and vsyscalls) the whole list
actually was the same as level2_kernel_pgt (without the first 16M
cleared).

And then I realized that xen_setup_kernel_pagetable is wrong to a
degree which makes one wonder how this ever worked. Adding PMD_SIZE
as an offset (2M) doesn't change which l2 gets picked at all. This
just copies Xen's kernel mapping, AGAIN!

I guess we all just were lucky that in most cases modules would not
use more than 512M (which is the correctly cleaned up remainder
of kernel_level2_pgt)...

I still need to compile a kernel with the patch and the old layout,
but I kind of got excited by the find. At least this is tested with
the 1G/~1G layout and it comes up without vmalloc errors.

-Stefan

From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Thu, 28 Aug 2014 19:17:00 +0200
Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables

This seemed to be one of those what-the-heck moments. While trying to
figure out why changing the kernel/module split (which enabling KASLR
does) causes vmalloc to run wild on boot of 64bit PV guests, after
much scratching my head, I found that the current Xen code copies the
same L2 table not only to the level2_ident_pgt and level2_kernel_pgt,
but also (due to miscalculating the offset) to level2_fixmap_pgt.

This only worked because the normal kernel image size only covers the
first half of level2_kernel_pgt and module space starts after that.
With the split changing, the kernel image uses the full PUD range of
1G and module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
                          [511]->level2_fixmap_pgt

And now the incorrect copy of the kernel mapping in that range bites
(hard).

This change might not be the fully correct approach, as it basically
removes the pre-set page table entry for the fixmap that is set at
compile time (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one, the
level1 page table is not yet declared in C headers (that might be
fixed). But also with the current bug, it was removed, too. Since
the Xen mappings for level2_kernel_pgt only covered kernel + initrd
and some Xen data this did never reach that far. And still, something
does create entries at level2_fixmap_pgt[506..507]. So it should be
ok. At least I was able to successfully boot a kernel with 1G kernel
image size without any vmalloc whining.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/xen/mmu.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8a1201..803034c 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
 		/* L3_k[510] -> level2_kernel_pgt
-		 * L3_i[511] -> level2_fixmap_pgt */
+		 * L3_k[511] -> level2_fixmap_pgt */
 		convert_pfn_mfn(level3_kernel_pgt);
+
+		/* level2_fixmap_pgt contains a single entry for the
+		 * fixmap area at offset 506. The correct way would
+		 * be to convert level2_fixmap_pgt to mfn and set the
+		 * level1_fixmap_pgt (which is completely empty) to RO,
+		 * too. But currently this page table is not declared,
+		 * so it would be a bit of voodoo to get its address.
+		 * And also the fixmap entry was never set anyway due
+		 * to using the wrong l2 when getting Xen's tables.
+		 * So let's just nuke it.
+		 * This orphans level1_fixmap_pgt, but that should be
+		 * as it always has been.
+		 */
+		memset(level2_fixmap_pgt, 0, 512*sizeof(long));
 	}
 	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
@@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
 	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][511] have entries that point to the same
+	 * Both L4[272][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
 	copy_page(level2_ident_pgt, l2);
-	/* Graft it onto L4[511][511] */
+	/* Graft it onto L4[511][510] */
 	copy_page(level2_kernel_pgt, l2);
 
-	/* Get [511][510] and graft that in level2_fixmap_pgt */
-	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
-	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	copy_page(level2_fixmap_pgt, l2);
-	/* Note that we don't do anything with level1_fixmap_pgt which
-	 * we don't need. */
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
@ 2014-08-28 22:22                         ` Kees Cook
  2014-08-28 22:42                           ` Andrew Cooper
  2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 29+ messages in thread
From: Kees Cook @ 2014-08-28 22:22 UTC (permalink / raw)
  To: Stefan Bader
  Cc: Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Linux Kernel Mailing List

On Thu, Aug 28, 2014 at 11:01 AM, Stefan Bader
<stefan.bader@canonical.com> wrote:
>> > So not much further... but then I think I know what I do next. Probably should
>> > have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>> > and at least get a crash dump of that situation when it occurs. Then I can dig
>> > in there with crash (really should have thought of that before)...
>>
>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>> that screams at me, so I fear I will have to wait until you get the crash
>> and get some clues from that.
>
> Ok, what a journey. So after long hours of painful staring at the code...
> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
> in crash, I really would appreciate :-P)
>
> So at some point I realized that level2_fixmap_pgt seemed to contain
> an oddly high number of entries (given that the virtual address that
> failed would be mapped by entry 0). And suddenly I realized that apart
> from entries 506 and 507 (actual fixmap and vsyscalls) the whole list
> actually was the same as level2_kernel_pgt (without the first 16M
> cleared).
>
> And then I realized that xen_setup_kernel_pagetable is wrong to a
> degree which makes one wonder how this ever worked. Adding PMD_SIZE
> as an offset (2M) isn't changing l2 at all. This just copies Xen's
> kernel mapping, AGAIN!

Woo! Nice find!

-Kees

>
> I guess we all just were lucky that in most cases modules would not
> use more than 512M (which is the correctly cleaned up remainder
> of level2_kernel_pgt)...
>
> I still need to compile a kernel with the patch and the old layout
> but I kind of got excited by the find. At least this is tested with
> the 1G/~1G layout and it comes up without vmalloc errors.
>
> -Stefan
>
> From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Thu, 28 Aug 2014 19:17:00 +0200
> Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables
>
> This seemed to be one of those what-the-heck moments. When trying to
> figure out why changing the kernel/module split (which enabling KASLR
> does) causes vmalloc to run wild on boot of 64bit PV guests, after
> much scratching my head, found that the current Xen code copies the
> same L2 table not only to the level2_ident_pgt and level2_kernel_pgt,
> but also (due to miscalculating the offset) to level2_fixmap_pgt.
>
> This only worked because the normal kernel image size only covers the
> first half of level2_kernel_pgt and module space starts after that.
> With the split changing, the kernel image uses the full PUD range of
> 1G and module space starts in the level2_fixmap_pgt. So basically:
>
> L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
>                           [511]->level2_fixmap_pgt
>
> And now the incorrect copy of the kernel mapping in that range bites
> (hard).
>
> This change might not be the fully correct approach as it basically
> removes the pre-set page table entry for the fixmap that is compile
> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the
> level1 page table is not yet declared in C headers (that might be
> fixed). But also with the current bug, it was removed, too. Since
> the Xen mappings for level2_kernel_pgt only covered kernel + initrd
> and some Xen data this did never reach that far. And still, something
> does create entries at level2_fixmap_pgt[506..507]. So it should be
> ok. At least I was able to successfully boot a kernel with 1G kernel
> image size without any vmalloc whinings.
>
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  arch/x86/xen/mmu.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index e8a1201..803034c 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>                 /* L3_i[0] -> level2_ident_pgt */
>                 convert_pfn_mfn(level3_ident_pgt);
>                 /* L3_k[510] -> level2_kernel_pgt
> -                * L3_i[511] -> level2_fixmap_pgt */
> +                * L3_k[511] -> level2_fixmap_pgt */
>                 convert_pfn_mfn(level3_kernel_pgt);
> +
> +               /* level2_fixmap_pgt contains a single entry for the
> +                * fixmap area at offset 506. The correct way would
> +                * be to convert level2_fixmap_pgt to mfn and set the
> +                * level1_fixmap_pgt (which is completely empty) to RO,
> +                * too. But currently this page table is not delcared,
> +                * so it would be a bit of voodoo to get its address.
> +                * And also the fixmap entry was never set anyway due
> +                * to using the wrong l2 when getting Xen's tables.
> +                * So let's just nuke it.
> +                * This orphans level1_fixmap_pgt, but that should be
> +                * as it always has been.
> +                */
> +               memset(level2_fixmap_pgt, 0, 512*sizeof(long));
>         }
>         /* We get [511][511] and have Xen's version of level2_kernel_pgt */
>         l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
> @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>         addr[1] = (unsigned long)l3;
>         addr[2] = (unsigned long)l2;
>         /* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
> -        * Both L4[272][0] and L4[511][511] have entries that point to the same
> +        * Both L4[272][0] and L4[511][510] have entries that point to the same
>          * L2 (PMD) tables. Meaning that if you modify it in __va space
>          * it will be also modified in the __ka space! (But if you just
>          * modify the PMD table to point to other PTE's or none, then you
>          * are OK - which is what cleanup_highmap does) */
>         copy_page(level2_ident_pgt, l2);
> -       /* Graft it onto L4[511][511] */
> +       /* Graft it onto L4[511][510] */
>         copy_page(level2_kernel_pgt, l2);
>
> -       /* Get [511][510] and graft that in level2_fixmap_pgt */
> -       l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
> -       l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
> -       copy_page(level2_fixmap_pgt, l2);
> -       /* Note that we don't do anything with level1_fixmap_pgt which
> -        * we don't need. */
>         if (!xen_feature(XENFEAT_auto_translated_physmap)) {
>                 /* Make pagetable pieces RO */
>                 set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
> --
> 1.9.1
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
@ 2014-08-28 22:42                           ` Andrew Cooper
  2014-08-28 22:42                           ` Andrew Cooper
  2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2014-08-28 22:42 UTC (permalink / raw)
  To: Stefan Bader, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, David Vrabel

On 28/08/2014 19:01, Stefan Bader wrote:
>>> So not much further... but then I think I know what I do next. Probably should
>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>> in there with crash (really should have thought of that before)...
>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>> that screams at me, so I fear I will have to wait until you get the crash
>> and get some clues from that.
> Ok, what a journey. So after long hours of painful staring at the code...
> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
> in crash, I really would appreciate :-P)

The M2P map lives in the Xen reserved virtual address space in each PV
guest, and forms part of the PV ABI.  It is mapped read-only, in the
native width of the guest.

32bit PV (PAE) at 0xF5800000
64bit PV at 0xFFFF800000000000

This is represented by the MACH2PHYS_VIRT_START symbol from the Xen
public header files.  You should be able to blindly construct a pointer
to it (if you have nothing better to hand), as it will be hooked into
the guests pagetables before execution starts.  Therefore,
"MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch.

~Andrew

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-28 22:42                           ` Andrew Cooper
  (?)
@ 2014-08-29  8:37                           ` Stefan Bader
  2014-08-29 14:19                             ` Andrew Cooper
  -1 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-29  8:37 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, David Vrabel

[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]

On 29.08.2014 00:42, Andrew Cooper wrote:
> On 28/08/2014 19:01, Stefan Bader wrote:
>>>> So not much further... but then I think I know what I do next. Probably should
>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>>> in there with crash (really should have thought of that before)...
>>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>>> that screams at me, so I fear I will have to wait until you get the crash
>>> and get some clues from that.
>> Ok, what a journey. So after long hours of painful staring at the code...
>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
>> in crash, I really would appreciate :-P)
> 
> The M2P map lives in the Xen reserved virtual address space in each PV
> guest, and forms part of the PV ABI.  It is mapped read-only, in the
> native width of the guest.
> 
> 32bit PV (PAE) at 0xF5800000
> 64bit PV at 0xFFFF800000000000
> 
> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen
> public header files.  You should be able to blindly construct a pointer
> to it (if you have nothing better to hand), as it will be hooked into
> the guests pagetables before execution starts.  Therefore,
> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch.

machine_to_phys_mapping is set to that address but it's not mapped inside the
crash dump. Somehow vtop in crash handles translations. I need to have a look at
their code, I guess.

Thanks,
Stefan
> 
> ~Andrew
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
  2014-08-28 22:22                         ` Kees Cook
  2014-08-28 22:42                           ` Andrew Cooper
@ 2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
  2014-08-29 14:27                           ` Stefan Bader
  2 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-29 14:08 UTC (permalink / raw)
  To: Stefan Bader
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List

On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote:
> > > So not much further... but then I think I know what I do next. Probably should
> > > have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
> > > and at least get a crash dump of that situation when it occurs. Then I can dig
> > > in there with crash (really should have thought of that before)...
> > 
> > <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
> > that screams at me, so I fear I will have to wait until you get the crash
> > and get some clues from that.
> 
> Ok, what a journey. So after long hours of painful staring at the code...
> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
> in crash, I really would appreciate :-P)
> 
> So at some point I realized that level2_fixmap_pgt seemed to contain
> an oddly high number of entries (given that the virtual address that
> failed would be mapped by entry 0). And suddenly I realized that apart
> from entries 506 and 507 (actual fixmap and vsyscalls) the whole list
> actually was the same as level2_kernel_pgt (without the first 16M
> cleared).
> 
> And then I realized that xen_setup_kernel_pagetable is wrong to a
> degree which makes one wonder how this ever worked. Adding PMD_SIZE
> as an offset (2M) isn't changing l2 at all. This just copies Xen's
> kernel mapping, AGAIN!
> 
> I guess we all just were lucky that in most cases modules would not
> use more than 512M (which is the correctly cleaned up remainder
> of level2_kernel_pgt)..
> 
> I still need to compile a kernel with the patch and the old layout
> but I kind of got excited by the find. At least this is tested with
> the 1G/~1G layout and it comes up without vmalloc errors.

Woot!
> 
> -Stefan
> 
> >From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Thu, 28 Aug 2014 19:17:00 +0200
> Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables
> 
> This seemed to be one of those what-the-heck moments. When trying to
> figure out why changing the kernel/module split (which enabling KASLR
> does) causes vmalloc to run wild on boot of 64bit PV guests, after
> much scratching my head, found that the current Xen code copies the

s/current Xen/xen_setup_kernel_pagetable/

> same L2 table not only to the level2_ident_pgt and level2_kernel_pgt,
> but also (due to miscalculating the offset) to level2_fixmap_pgt.
> 
> This only worked because the normal kernel image size only covers the
> first half of level2_kernel_pgt and module space starts after that.
> With the split changing, the kernel image uses the full PUD range of
> 1G and module space starts in the level2_fixmap_pgt. So basically:
> 
> L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
>                           [511]->level2_fixmap_pgt
> 

Perhaps you could add a similar drawing of what it looked like
without the kaslr enabled? As in the 'normal kernel image' scenario?

> And now the incorrect copy of the kernel mapping in that range bites
> (hard).

Want to include the vmalloc warning you got?

> 
> This change might not be the fully correct approach as it basically
> removes the pre-set page table entry for the fixmap that is compile
> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the
> level1 page table is not yet declared in C headers (that might be
> fixed). But also with the current bug, it was removed, too. Since
> the Xen mappings for level2_kernel_pgt only covered kernel + initrd
> and some Xen data this did never reach that far. And still, something
> does create entries at level2_fixmap_pgt[506..507]. So it should be
> ok. At least I was able to successfully boot a kernel with 1G kernel
> image size without any vmalloc whinings.
> 
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  arch/x86/xen/mmu.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index e8a1201..803034c 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  		/* L3_i[0] -> level2_ident_pgt */
>  		convert_pfn_mfn(level3_ident_pgt);
>  		/* L3_k[510] -> level2_kernel_pgt
> -		 * L3_i[511] -> level2_fixmap_pgt */
> +		 * L3_k[511] -> level2_fixmap_pgt */
>  		convert_pfn_mfn(level3_kernel_pgt);
> +
> +		/* level2_fixmap_pgt contains a single entry for the
> +		 * fixmap area at offset 506. The correct way would
> +		 * be to convert level2_fixmap_pgt to mfn and set the
> +		 * level1_fixmap_pgt (which is completely empty) to RO,
> +		 * too. But currently this page table is not delcared,

declared.
> +		 * so it would be a bit of voodoo to get its address.
> +		 * And also the fixmap entry was never set anyway due

s/anyway//
> +		 * to using the wrong l2 when getting Xen's tables.
> +		 * So let's just nuke it.
> +		 * This orphans level1_fixmap_pgt, but that should be
> +		 * as it always has been.

'as it always has been.' ? Not sure I follow that sentence?

> +		 */
> +		memset(level2_fixmap_pgt, 0, 512*sizeof(long));
>  	}
>  	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
>  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
> @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	addr[1] = (unsigned long)l3;
>  	addr[2] = (unsigned long)l2;
>  	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
> -	 * Both L4[272][0] and L4[511][511] have entries that point to the same
> +	 * Both L4[272][0] and L4[511][510] have entries that point to the same
>  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
>  	 * it will be also modified in the __ka space! (But if you just
>  	 * modify the PMD table to point to other PTE's or none, then you
>  	 * are OK - which is what cleanup_highmap does) */
>  	copy_page(level2_ident_pgt, l2);
> -	/* Graft it onto L4[511][511] */
> +	/* Graft it onto L4[511][510] */
>  	copy_page(level2_kernel_pgt, l2);
>  
> -	/* Get [511][510] and graft that in level2_fixmap_pgt */
> -	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
> -	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
> -	copy_page(level2_fixmap_pgt, l2);
> -	/* Note that we don't do anything with level1_fixmap_pgt which
> -	 * we don't need. */

Later during bootup we do set the fixmap with entries. I recall (vaguely)
that on the SLES kernels (Classic) the fixmap was needed during early
bootup. The reason was that it was used right away for bootparams (maybe?).

I think your patch is correct in ripping out level1_fixmap_pgt
and level2_fixmap_pgt.

>  	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
>  		/* Make pagetable pieces RO */
>  		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29  8:37                           ` [Xen-devel] " Stefan Bader
@ 2014-08-29 14:19                             ` Andrew Cooper
  2014-08-29 14:32                               ` Stefan Bader
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2014-08-29 14:19 UTC (permalink / raw)
  To: Stefan Bader, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, David Vrabel

On 29/08/14 09:37, Stefan Bader wrote:
> On 29.08.2014 00:42, Andrew Cooper wrote:
>> On 28/08/2014 19:01, Stefan Bader wrote:
>>>>> So not much further... but then I think I know what I do next. Probably should
>>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>>>> in there with crash (really should have thought of that before)...
>>>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>>>> that screams at me, so I fear I will have to wait until you get the crash
>>>> and get some clues from that.
>>> Ok, what a journey. So after long hours of painful staring at the code...
>>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
>>> in crash, I really would appreciate :-P)
>> The M2P map lives in the Xen reserved virtual address space in each PV
>> guest, and forms part of the PV ABI.  It is mapped read-only, in the
>> native width of the guest.
>>
>> 32bit PV (PAE) at 0xF5800000
>> 64bit PV at 0xFFFF800000000000
>>
>> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen
>> public header files.  You should be able to blindly construct a pointer
>> to it (if you have nothing better to hand), as it will be hooked into
>> the guests pagetables before execution starts.  Therefore,
>> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch.
> machine_to_phys_mapping is set to that address but it's not mapped inside the
> crash dump. Somehow vtop in crash handles translations. I need to have a look at
> their code, I guess.
>
> Thanks,
> Stefan

What context is the crash dump?  If it is a Xen+dom0 kexec()d to native
linux, then the m2p should still be accessible given dom0's cr3.  If it
is some state copied off-host then you will need to adjust the copy to
include that virtual range.

~Andrew

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
@ 2014-08-29 14:27                           ` Stefan Bader
  2014-08-29 14:31                             ` David Vrabel
  2014-08-29 14:44                             ` [Xen-devel] " Jan Beulich
  0 siblings, 2 replies; 29+ messages in thread
From: Stefan Bader @ 2014-08-29 14:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kees Cook, xen-devel, David Vrabel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 8573 bytes --]

On 29.08.2014 16:08, Konrad Rzeszutek Wilk wrote:
> On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote:
>>>> So not much further... but then I think I know what I do next. Probably should
>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>>> in there with crash (really should have thought of that before)...
>>>
>>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>>> that screams at me, so I fear I will have to wait until you get the crash
>>> and get some clues from that.
>>
>> Ok, what a journey. So after long hours of painful staring at the code...
>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
>> in crash, I really would appreciate :-P)
>>
>> So at some point I realized that level2_fixmap_pgt seemed to contain
>> an oddly high number of entries (given that the virtual address that
>> failed would be mapped by entry 0). And suddenly I realized that apart
>> from entries 506 and 507 (actual fixmap and vsyscalls) the whole list
>> actually was the same as level2_kernel_pgt (without the first 16M
>> cleared).
>>
>> And then I realized that xen_setup_kernel_pagetable is wrong to a
>> degree which makes one wonder how this ever worked. Adding PMD_SIZE
>> as an offset (2M) isn't changing l2 at all. This just copies Xen's
>> kernel mapping, AGAIN!
>>
>> I guess we all just were lucky that in most cases modules would not
>> use more than 512M (which is the correctly cleaned up remainder
>> of level2_kernel_pgt)..
>>
>> I still need to compile a kernel with the patch and the old layout
>> but I kind of got excited by the find. At least this is tested with
>> the 1G/~1G layout and it comes up without vmalloc errors.
> 
> Woot!

Oh yeah! :)

>>
>> -Stefan
>>
>> >From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001
>> From: Stefan Bader <stefan.bader@canonical.com>
>> Date: Thu, 28 Aug 2014 19:17:00 +0200
>> Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables
>>
>> This seemed to be one of those what-the-heck moments. When trying to
>> figure out why changing the kernel/module split (which enabling KASLR
>> does) causes vmalloc to run wild on boot of 64bit PV guests, after
>> much scratching my head, found that the current Xen code copies the
> 
> s/current Xen/xen_setup_kernel_pagetable/

ok

> 
>> same L2 table not only to the level2_ident_pgt and level2_kernel_pgt,
>> but also (due to miscalculating the offset) to level2_fixmap_pgt.
>>
>> This only worked because the normal kernel image size only covers the
>> first half of level2_kernel_pgt and module space starts after that.
>> With the split changing, the kernel image uses the full PUD range of
>> 1G and module space starts in the level2_fixmap_pgt. So basically:
>>
>> L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
>>                           [511]->level2_fixmap_pgt
>>
> 
> Perhaps you could add a similar drawing of what it looked like
> without the kaslr enabled? As in the 'normal kernel image' scenario?

Sure. Btw, someone also contacted me saying they see the same problem without
changing the layout but with a really big initrd (500M). That feels like it
should be impossible (if kernel+initrd+Xen data all have to fit within the 512M
kernel image area), but if it can happen, it would certainly place mappings
where the module space starts.

> 
>> And now the incorrect copy of the kernel mapping in that range bites
>> (hard).
> 
> Want to include the vmalloc warning you got?

Yeah, that is a good idea.

> 
>>
>> This change might not be the fully correct approach as it basically
>> removes the pre-set page table entry for the fixmap that is compile
>> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the
>> level1 page table is not yet declared in C headers (that might be
>> fixed). But also with the current bug, it was removed, too. Since
>> the Xen mappings for level2_kernel_pgt only covered kernel + initrd
>> and some Xen data this did never reach that far. And still, something
>> does create entries at level2_fixmap_pgt[506..507]. So it should be
>> ok. At least I was able to successfully boot a kernel with 1G kernel
>> image size without any vmalloc whinings.
>>
>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>> ---
>>  arch/x86/xen/mmu.c | 26 +++++++++++++++++---------
>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index e8a1201..803034c 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>>  		/* L3_i[0] -> level2_ident_pgt */
>>  		convert_pfn_mfn(level3_ident_pgt);
>>  		/* L3_k[510] -> level2_kernel_pgt
>> -		 * L3_i[511] -> level2_fixmap_pgt */
>> +		 * L3_k[511] -> level2_fixmap_pgt */
>>  		convert_pfn_mfn(level3_kernel_pgt);
>> +
>> +		/* level2_fixmap_pgt contains a single entry for the
>> +		 * fixmap area at offset 506. The correct way would
>> +		 * be to convert level2_fixmap_pgt to mfn and set the
>> +		 * level1_fixmap_pgt (which is completely empty) to RO,
>> +		 * too. But currently this page table is not delcared,
> 
> declared.
>> +		 * so it would be a bit of voodoo to get its address.
>> +		 * And also the fixmap entry was never set anyway due
> 
> s/anyway//

Too much slang I guess. Ok :)

>> +		 * to using the wrong l2 when getting Xen's tables.
>> +		 * So let's just nuke it.
>> +		 * This orphans level1_fixmap_pgt, but that should be
>> +		 * as it always has been.
> 
> 'as it always has been.' ? Not sure I follow that sentence?

Probably need to be more verbose and say "the same basically happens with the
current code, too".

> 
>> +		 */
>> +		memset(level2_fixmap_pgt, 0, 512*sizeof(long));
>>  	}
>>  	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
>>  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
>> @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>>  	addr[1] = (unsigned long)l3;
>>  	addr[2] = (unsigned long)l2;
>>  	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
>> -	 * Both L4[272][0] and L4[511][511] have entries that point to the same
>> +	 * Both L4[272][0] and L4[511][510] have entries that point to the same
>>  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
>>  	 * it will be also modified in the __ka space! (But if you just
>>  	 * modify the PMD table to point to other PTE's or none, then you
>>  	 * are OK - which is what cleanup_highmap does) */
>>  	copy_page(level2_ident_pgt, l2);
>> -	/* Graft it onto L4[511][511] */
>> +	/* Graft it onto L4[511][510] */
>>  	copy_page(level2_kernel_pgt, l2);
>>  
>> -	/* Get [511][510] and graft that in level2_fixmap_pgt */
>> -	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
>> -	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
>> -	copy_page(level2_fixmap_pgt, l2);
>> -	/* Note that we don't do anything with level1_fixmap_pgt which
>> -	 * we don't need. */
> 
> Later during bootup we do set the fixmap with entries. I recall (vaguely)
> that on the SLES kernels (Classic) the fixmap was needed during early
> bootup. The reason was that it used the right away for bootparams (maybe?).

Ah ok. Yeah, I was confident that it got set somewhere else after
xen_setup_kernel_pagetable ran. Because in the dump, fixmap and vsyscall entries
were in place, while I was pretty sure they are not set in the l2 table that is
copied in from the page tables provided by the Xen toolstack.

> 
> I think your patch is correct in ripping out level1_fixmap_pgt
> and level2_fixmap_pgt.

l1 effectively was already ripped out (not used). The only reason I could think
of for preserving the compile-time l2 table would have been if the fixmap were
not there. But obviously that is (and was) created along with some dynamically
allocated l1 table.


Ok, I'll rework the patch and re-send it (freshly, iow not part of this thread).
And while I am at it, I'll add the stable tag.

-Stefan

> 
>>  	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
>>  		/* Make pagetable pieces RO */
>>  		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
>> -- 
>> 1.9.1
>>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:27                           ` Stefan Bader
@ 2014-08-29 14:31                             ` David Vrabel
  2014-08-29 14:35                               ` Stefan Bader
  2014-08-29 14:44                             ` [Xen-devel] " Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: David Vrabel @ 2014-08-29 14:31 UTC (permalink / raw)
  To: Stefan Bader, Konrad Rzeszutek Wilk
  Cc: Kees Cook, xen-devel, Linux Kernel Mailing List

On 29/08/14 15:27, Stefan Bader wrote:
> 
> Ok, I rework the patch and re-send it (freshly, iow not part of this thread).
> And while I am at it, I would add the stable tag.

Can you use a different title? Perhaps:

x86/xen: fix 64-bit PV guest kernel page tables for KASLR

David


* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:19                             ` Andrew Cooper
@ 2014-08-29 14:32                               ` Stefan Bader
  2014-08-29 14:43                                 ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Stefan Bader @ 2014-08-29 14:32 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, David Vrabel

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

On 29.08.2014 16:19, Andrew Cooper wrote:
> On 29/08/14 09:37, Stefan Bader wrote:
>> On 29.08.2014 00:42, Andrew Cooper wrote:
>>> On 28/08/2014 19:01, Stefan Bader wrote:
>>>>>> So not much further... but then I think I know what I do next. Probably should
>>>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>>>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>>>>> in there with crash (really should have thought of that before)...
>>>>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>>>>> that screams at me, so I fear I will have to wait until you get the crash
>>>>> and get some clues from that.
>>>> Ok, what a journey. So after long hours of painful staring at the code...
>>>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
>>>> in crash, I really would appreciate it :-P)
>>> The M2P map lives in the Xen reserved virtual address space in each PV
>>> guest, and forms part of the PV ABI.  It is mapped read-only, in the
>>> native width of the guest.
>>>
>>> 32bit PV (PAE) at 0xF5800000
>>> 64bit PV at 0xFFFF800000000000
>>>
>>> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen
>>> public header files.  You should be able to blindly construct a pointer
>>> to it (if you have nothing better to hand), as it will be hooked into
>>> the guests pagetables before execution starts.  Therefore,
>>> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch.
>> machine_to_phys_mapping is set to that address but it's not mapped inside the
>> crash dump. Somehow vtop in crash handles translations. I need to have a look at
>> their code, I guess.
>>
>> Thanks,
>> Stefan
> 
> What context is the crash dump?  If it is a Xen+dom0 kexec()d to native
> linux, then the m2p should still be accessible given dom0's cr3.  If it
> is some state copied off-host then you will need to adjust the copy to
> include that virtual range.

No, it's a domU dump of a PV guest taken with "xl dump-core" (or actually the
result of an on-crash trigger).

> 
> ~Andrew
> 





* Re: [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:31                             ` David Vrabel
@ 2014-08-29 14:35                               ` Stefan Bader
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Bader @ 2014-08-29 14:35 UTC (permalink / raw)
  To: David Vrabel, Konrad Rzeszutek Wilk
  Cc: Kees Cook, xen-devel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 705 bytes --]

On 29.08.2014 16:31, David Vrabel wrote:
> On 29/08/14 15:27, Stefan Bader wrote:
>>
>> Ok, I rework the patch and re-send it (freshly, iow not part of this thread).
>> And while I am at it, I would add the stable tag.
> 
> Can you use a different title? Perhaps:
> 
> x86/xen: fix 64-bit PV guest kernel page tables for KASLR
> 
> David
> 
I can change the title but would not want to include KASLR, because it is only
indirectly responsible. This is a fix for a problem that was always there but
escaped notice until KASLR caused the layout to change. The fix could have been
done independently before.

Or, like I mentioned in the last email, the problem can also be triggered by a
large initrd.

-Stefan




* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:32                               ` Stefan Bader
@ 2014-08-29 14:43                                 ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2014-08-29 14:43 UTC (permalink / raw)
  To: Stefan Bader, Konrad Rzeszutek Wilk
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, David Vrabel

On 29/08/14 15:32, Stefan Bader wrote:
> On 29.08.2014 16:19, Andrew Cooper wrote:
>> On 29/08/14 09:37, Stefan Bader wrote:
>>> On 29.08.2014 00:42, Andrew Cooper wrote:
>>>> On 28/08/2014 19:01, Stefan Bader wrote:
>>>>>>> So not much further... but then I think I know what I do next. Probably should
>>>>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic
>>>>>>> and at least get a crash dump of that situation when it occurs. Then I can dig
>>>>>>> in there with crash (really should have thought of that before)...
>>>>>> <nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
>>>>>> that screams at me, so I fear I will have to wait until you get the crash
>>>>>> and get some clues from that.
>>>>> Ok, what a journey. So after long hours of painful staring at the code...
>>>>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn
>>>>> in crash, I really would appreaciate :-P)
>>>> The M2P map lives in the Xen reserved virtual address space in each PV
>>>> guest, and forms part of the PV ABI.  It is mapped read-only, in the
>>>> native width of the guest.
>>>>
>>>> 32bit PV (PAE) at 0xF5800000
>>>> 64bit PV at 0xFFFF800000000000
>>>>
>>>> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen
>>>> public header files.  You should be able to blindly construct a pointer
>>>> to it (if you have nothing better to hand), as it will be hooked into
>>>> the guests pagetables before execution starts.  Therefore,
>>>> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch.
>>> machine_to_phys_mapping is set to that address but its not mapped inside the
>>> crash dump. Somehow vtop in crash handles translations. I need to have a look at
>>> their code, I guess.
>>>
>>> Thanks,
>>> Stefan
>> What context is the crash dump?  If it is a Xen+dom0 kexec()d to native
>> linux, then the m2p should still be accessible given dom0's cr3.  If it
>> is some state copied off-host then you will need to adjust the copy to
>> include that virtual range.
> No its a domU dump of a PV guest taken with "xl dump-core" (or actually the
> result of on-crash trigger).

Ah - I believe the m2p lives in one of the Xen ELF notes for a domain
coredump.  See what readelf -n shows.

~Andrew


* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:27                           ` Stefan Bader
  2014-08-29 14:31                             ` David Vrabel
@ 2014-08-29 14:44                             ` Jan Beulich
  2014-08-29 14:55                               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2014-08-29 14:44 UTC (permalink / raw)
  To: Stefan Bader, Konrad Rzeszutek Wilk
  Cc: Kees Cook, David Vrabel, xen-devel, Linux Kernel Mailing List

>>> On 29.08.14 at 16:27, <stefan.bader@canonical.com> wrote:
> Sure. Btw, someone also contacted me saying they have the same problem
> without changing the layout but having a really big initrd (500M). While
> that feels like it should be impossible (if the kernel+initrd+xen stuff has
> to fit into the 512M kernel image size area). But if it can happen, then
> surely it does cause mappings to be where the module space starts.

Since the initrd doesn't really need to be mapped into the (limited)
virtual address space a pv guest starts with, we specifically got

/*
 * Whether or not the guest can deal with being passed an initrd not
 * mapped through its initial page tables.
 */
#define XEN_ELFNOTE_MOD_START_PFN 16

to deal with that situation. The hypervisor side for Dom0 is in place,
and the kernel side works in our (classic) kernels. Whether it got
implemented for DomU meanwhile I don't know; I'm pretty certain
pv-ops kernels don't support it so far.

Jan



* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:44                             ` [Xen-devel] " Jan Beulich
@ 2014-08-29 14:55                               ` Konrad Rzeszutek Wilk
  2014-09-01  4:03                                 ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-29 14:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefan Bader, Kees Cook, David Vrabel, xen-devel,
	Linux Kernel Mailing List

On Fri, Aug 29, 2014 at 03:44:06PM +0100, Jan Beulich wrote:
> >>> On 29.08.14 at 16:27, <stefan.bader@canonical.com> wrote:
> > Sure. Btw, someone also contacted me saying they have the same problem 
> > without
> > changing the layout but having really big initrd (500M). While that feels 
> > like
> > it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M
> > kernel image size area then). But if it can happen, then surely it does 
> > cause
> > mappings to be where the module space starts then.
> 
> Since the initrd doesn't really need to be mapped into the (limited)
> virtual address space a pv guest starts with, we specifically got
> 
> /*
>  * Whether or not the guest can deal with being passed an initrd not
>  * mapped through its initial page tables.
>  */
> #define XEN_ELFNOTE_MOD_START_PFN 16
> 
> to deal with that situation. The hypervisor side for Dom0 is in place,
> and the kernel side works in our (classic) kernels. Whether it got
> implemented for DomU meanwhile I don't know; I'm pretty certain
> pv-ops kernels don't support it so far.

Correct - Not implemented. Here is what I had mentioned in the past:
(see http://lists.xen.org/archives/html/xen-devel/2014-03/msg00580.html)


XEN_ELFNOTE_INIT_P2M, XEN_ELFNOTE_MOD_START_PFN - I had been looking
    at that but I can't figure out a nice way of implementing this
    without the usage of SPARSEMAP_VMAP virtual addresses - which is how
    the classic Xen does it. But then - I don't know who is using huge PV
    guests - as the PVHVM does a fine job? But then with PVH, now you can
    boot with large amount of memory (1TB?) - so some of these issues
    would go away? Except the 'large ramdisk' as that would eat in the
    MODULES_VADDR I think? Needs more thinking.

.. and then I left it and to my surprise saw on Luis's slides that
Juergen is going to take a look at that (500GB support).

> 
> Jan
> 


* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-08-29 14:55                               ` Konrad Rzeszutek Wilk
@ 2014-09-01  4:03                                 ` Juergen Gross
  2014-09-02 19:22                                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2014-09-01  4:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Jan Beulich
  Cc: Linux Kernel Mailing List, xen-devel, Kees Cook, Stefan Bader,
	David Vrabel

On 08/29/2014 04:55 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 29, 2014 at 03:44:06PM +0100, Jan Beulich wrote:
>>>>> On 29.08.14 at 16:27, <stefan.bader@canonical.com> wrote:
>>> Sure. Btw, someone also contacted me saying they have the same problem
>>> without
>>> changing the layout but having really big initrd (500M). While that feels
>>> like
>>> it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M
>>> kernel image size area then). But if it can happen, then surely it does
>>> cause
>>> mappings to be where the module space starts then.
>>
>> Since the initrd doesn't really need to be mapped into the (limited)
>> virtual address space a pv guest starts with, we specifically got
>>
>> /*
>>   * Whether or not the guest can deal with being passed an initrd not
>>   * mapped through its initial page tables.
>>   */
>> #define XEN_ELFNOTE_MOD_START_PFN 16
>>
>> to deal with that situation. The hypervisor side for Dom0 is in place,
>> and the kernel side works in our (classic) kernels. Whether it got
>> implemented for DomU meanwhile I don't know; I'm pretty certain
>> pv-ops kernels don't support it so far.
>
> Correct - Not implemented. Here is what I had mentioned in the past:
> (see http://lists.xen.org/archives/html/xen-devel/2014-03/msg00580.html)
>
>
> XEN_ELFNOTE_INIT_P2M, XEN_ELFNOTE_MOD_START_PFN - I had been looking
>      at that but I can't figure out a nice way of implementing this
>      without the usage of SPARSEMAP_VMAP virtual addresses - which is how
>      the classic Xen does it. But then - I don't know who is using huge PV
>      guests - as the PVHVM does a fine job? But then with PVH, now you can
>      boot with large amount of memory (1TB?) - so some of these issues
>      would go away? Except the 'large ramdisk' as that would eat in the
>      MODULES_VADDR I think? Needs more thinking.
>
> .. and then I left it and to my suprise saw on Luis's slides that
> Jurgen is going to take a look at that (500GB support).

I have a patch which should do the job. It is based on the classic
kernel patch Jan mentioned above. The system is coming up with it; I
haven't tested it with a huge initrd up to now. My plan was to post the
patch together with the rest of the >500GB support, but I can send it
on its own if required.

Juergen



* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-09-01  4:03                                 ` Juergen Gross
@ 2014-09-02 19:22                                   ` Konrad Rzeszutek Wilk
  2014-09-03  4:07                                     ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-02 19:22 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Jan Beulich, Linux Kernel Mailing List, xen-devel, Kees Cook,
	Stefan Bader, David Vrabel

On Mon, Sep 01, 2014 at 06:03:06AM +0200, Juergen Gross wrote:
> On 08/29/2014 04:55 PM, Konrad Rzeszutek Wilk wrote:
> >On Fri, Aug 29, 2014 at 03:44:06PM +0100, Jan Beulich wrote:
> >>>>>On 29.08.14 at 16:27, <stefan.bader@canonical.com> wrote:
> >>>Sure. Btw, someone also contacted me saying they have the same problem
> >>>without
> >>>changing the layout but having really big initrd (500M). While that feels
> >>>like
> >>>it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M
> >>>kernel image size area then). But if it can happen, then surely it does
> >>>cause
> >>>mappings to be where the module space starts then.
> >>
> >>Since the initrd doesn't really need to be mapped into the (limited)
> >>virtual address space a pv guest starts with, we specifically got
> >>
> >>/*
> >>  * Whether or not the guest can deal with being passed an initrd not
> >>  * mapped through its initial page tables.
> >>  */
> >>#define XEN_ELFNOTE_MOD_START_PFN 16
> >>
> >>to deal with that situation. The hypervisor side for Dom0 is in place,
> >>and the kernel side works in our (classic) kernels. Whether it got
> >>implemented for DomU meanwhile I don't know; I'm pretty certain
> >>pv-ops kernels don't support it so far.
> >
> >Correct - Not implemented. Here is what I had mentioned in the past:
> >(see http://lists.xen.org/archives/html/xen-devel/2014-03/msg00580.html)
> >
> >
> >XEN_ELFNOTE_INIT_P2M, XEN_ELFNOTE_MOD_START_PFN - I had been looking
> >     at that but I can't figure out a nice way of implementing this
> >     without the usage of SPARSEMAP_VMAP virtual addresses - which is how
> >     the classic Xen does it. But then - I don't know who is using huge PV
> >     guests - as the PVHVM does a fine job? But then with PVH, now you can
> >     boot with large amount of memory (1TB?) - so some of these issues
> >     would go away? Except the 'large ramdisk' as that would eat in the
> >     MODULES_VADDR I think? Needs more thinking.
> >
> >.. and then I left it and to my suprise saw on Luis's slides that
> >Jurgen is going to take a look at that (500GB support).
> 
> I have a patch which should do the job. It is based on the classic
> kernel patch Jan mentioned above. The system is coming up with it, I
> haven't tested it with a huge initrd up to now. My plan was to post the
> patch together with the rest of the >500GB support, but I can send it
> on it's own if required.

Oooh goodies! I think it makes sense to post it whenever you think
it is in the right state to be posted.

Now that your pvSCSI drivers are in, you have tons of free time
I suspect :-)


> 
> Juergen
> 


* Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
  2014-09-02 19:22                                   ` Konrad Rzeszutek Wilk
@ 2014-09-03  4:07                                     ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2014-09-03  4:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jan Beulich, Linux Kernel Mailing List, xen-devel, Kees Cook,
	Stefan Bader, David Vrabel

On 09/02/2014 09:22 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Sep 01, 2014 at 06:03:06AM +0200, Juergen Gross wrote:
>> On 08/29/2014 04:55 PM, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Aug 29, 2014 at 03:44:06PM +0100, Jan Beulich wrote:
>>>>>>> On 29.08.14 at 16:27, <stefan.bader@canonical.com> wrote:
>>>>> Sure. Btw, someone also contacted me saying they have the same problem
>>>>> without
>>>>> changing the layout but having really big initrd (500M). While that feels
>>>>> like
>>>>> it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M
>>>>> kernel image size area then). But if it can happen, then surely it does
>>>>> cause
>>>>> mappings to be where the module space starts then.
>>>>
>>>> Since the initrd doesn't really need to be mapped into the (limited)
>>>> virtual address space a pv guest starts with, we specifically got
>>>>
>>>> /*
>>>>   * Whether or not the guest can deal with being passed an initrd not
>>>>   * mapped through its initial page tables.
>>>>   */
>>>> #define XEN_ELFNOTE_MOD_START_PFN 16
>>>>
>>>> to deal with that situation. The hypervisor side for Dom0 is in place,
>>>> and the kernel side works in our (classic) kernels. Whether it got
>>>> implemented for DomU meanwhile I don't know; I'm pretty certain
>>>> pv-ops kernels don't support it so far.
>>>
>>> Correct - Not implemented. Here is what I had mentioned in the past:
>>> (see http://lists.xen.org/archives/html/xen-devel/2014-03/msg00580.html)
>>>
>>>
>>> XEN_ELFNOTE_INIT_P2M, XEN_ELFNOTE_MOD_START_PFN - I had been looking
>>>      at that but I can't figure out a nice way of implementing this
>>>      without the usage of SPARSEMAP_VMAP virtual addresses - which is how
>>>      the classic Xen does it. But then - I don't know who is using huge PV
>>>      guests - as the PVHVM does a fine job? But then with PVH, now you can
>>>      boot with large amount of memory (1TB?) - so some of these issues
>>>      would go away? Except the 'large ramdisk' as that would eat in the
>>>      MODULES_VADDR I think? Needs more thinking.
>>>
>>> .. and then I left it and to my suprise saw on Luis's slides that
>>> Jurgen is going to take a look at that (500GB support).
>>
>> I have a patch which should do the job. It is based on the classic
>> kernel patch Jan mentioned above. The system is coming up with it, I
>> haven't tested it with a huge initrd up to now. My plan was to post the
>> patch together with the rest of the >500GB support, but I can send it
>> on it's own if required.
>
> Oooh goodies! I think it makes sense to post it whenever you think
> it is in the right state to be posted.
>
> Now that your pvSCSI drivers are in, you have tons of free time
> I suspect :-)

Oh yeah. Only one or two lines missing in xl to support it. :-)

I hope to have the >500GB patch ready for testing soon. I'd prefer to
combine this and the large initrd patch in one series, as both need the
same headers to be synced with Xen. In case I run into serious issues,
I'll post the large initrd patch on Friday.

Juergen



end of thread, other threads:[~2014-09-03  4:07 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-08 11:20 Xen PV domain regression with KASLR enabled (kernel 3.16) Stefan Bader
2014-08-08 12:43 ` [Xen-devel] " David Vrabel
2014-08-08 14:35   ` Stefan Bader
2014-08-12 17:28     ` Kees Cook
2014-08-12 18:05       ` Stefan Bader
2014-08-12 18:53         ` Kees Cook
2014-08-12 19:07           ` Konrad Rzeszutek Wilk
2014-08-21 16:03             ` Kees Cook
2014-08-22  9:20               ` Stefan Bader
2014-08-26 16:01                 ` Konrad Rzeszutek Wilk
2014-08-27  8:03                   ` Stefan Bader
2014-08-27 20:49                     ` Konrad Rzeszutek Wilk
2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
2014-08-28 22:22                         ` Kees Cook
2014-08-28 22:42                         ` [Xen-devel] " Andrew Cooper
2014-08-28 22:42                           ` Andrew Cooper
2014-08-29  8:37                           ` [Xen-devel] " Stefan Bader
2014-08-29 14:19                             ` Andrew Cooper
2014-08-29 14:32                               ` Stefan Bader
2014-08-29 14:43                                 ` Andrew Cooper
2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
2014-08-29 14:27                           ` Stefan Bader
2014-08-29 14:31                             ` David Vrabel
2014-08-29 14:35                               ` Stefan Bader
2014-08-29 14:44                             ` [Xen-devel] " Jan Beulich
2014-08-29 14:55                               ` Konrad Rzeszutek Wilk
2014-09-01  4:03                                 ` Juergen Gross
2014-09-02 19:22                                   ` Konrad Rzeszutek Wilk
2014-09-03  4:07                                     ` Juergen Gross
