[PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
@ 2013-08-26 10:03 Tomasz Wroblewski
  2013-08-26 10:52 ` Andrew Cooper
  2013-08-26 11:12 ` Jan Beulich
  0 siblings, 2 replies; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-26 10:03 UTC (permalink / raw)
  To: xen-devel; +Cc: Tomasz Wroblewski

Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified.
This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer
is NULL before attempting to load from it - it's a global which is set to non-NULL when
policy module is detected.

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
---
 xen/xsm/flask/hooks.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index fa0589a..cfa2929 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1585,7 +1585,8 @@ static __init int flask_init(void)
     if ( register_xsm(&flask_ops) )
         panic("Flask: Unable to register with XSM.\n");
 
-    ret = security_load_policy(policy_buffer, policy_size);
+    if ( policy_buffer )
+        ret = security_load_policy(policy_buffer, policy_size);
 
     if ( flask_enforcing )
         printk("Flask:  Starting in enforcing mode.\n");
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 10:03 [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Tomasz Wroblewski
@ 2013-08-26 10:52 ` Andrew Cooper
  2013-08-26 13:27   ` Daniel De Graaf
  2013-08-26 11:12 ` Jan Beulich
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2013-08-26 10:52 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: xen-devel, Daniel De Graaf

On 26/08/2013 11:03, Tomasz Wroblewski wrote:
> Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified.
> This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer
> is NULL before attempting to load from it - it's a global which is set to non-NULL when
> policy module is detected.
>
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

CCing Daniel De Graaf, as the maintainer of this code.

However FWIW,
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

> ---
>  xen/xsm/flask/hooks.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index fa0589a..cfa2929 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>      if ( register_xsm(&flask_ops) )
>          panic("Flask: Unable to register with XSM.\n");
>  
> -    ret = security_load_policy(policy_buffer, policy_size);
> +    if ( policy_buffer )
> +        ret = security_load_policy(policy_buffer, policy_size);
>  
>      if ( flask_enforcing )
>          printk("Flask:  Starting in enforcing mode.\n");

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 10:03 [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Tomasz Wroblewski
  2013-08-26 10:52 ` Andrew Cooper
@ 2013-08-26 11:12 ` Jan Beulich
  2013-08-26 12:24   ` Tomasz Wroblewski
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2013-08-26 11:12 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: xen-devel, dgdegra

>>> On 26.08.13 at 12:03, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> Xen crashes on boot of xsm/flask enabled builds, if policy module is not 
> specified.
> This seems to have worked on 4.1 at least.

Looking at the code (4.1.5) I can't see what would prevent the
same NULL pointer deref. Care to explain?

> Can be fixed by testing whether 
> policy_buffer
> is NULL before attempting to load from it - it's a global which is set to 
> non-NULL when
> policy module is detected.
> 
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
> ---
>  xen/xsm/flask/hooks.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index fa0589a..cfa2929 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>      if ( register_xsm(&flask_ops) )
>          panic("Flask: Unable to register with XSM.\n");
>  
> -    ret = security_load_policy(policy_buffer, policy_size);
> +    if ( policy_buffer )
> +        ret = security_load_policy(policy_buffer, policy_size);

Question is whether policy_buffer == NULL really isn't supposed
to result in a -E... return value (as in fact flask initialization failed).

Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 11:12 ` Jan Beulich
@ 2013-08-26 12:24   ` Tomasz Wroblewski
  2013-08-26 12:41     ` Andrew Cooper
  2013-08-26 13:00     ` Jan Beulich
  0 siblings, 2 replies; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-26 12:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, dgdegra

On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
>> specified.
>> This seems to have worked on 4.1 at least.
> Looking at the code (4.1.5) I can't see what would prevent the
> same NULL pointer deref. Care to explain?
The crash doesn't happen at the NULL pointer dereference site though, 
but a bit later, when xen tries to flush tlbs for first time I believe, 
which happens during page allocation for the initial domain structure. I 
traced it to the following ASSERT in smp.c (so yes I should add this 
particular crash likely is limited to debug builds then)

void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int 
flags)
{
     ASSERT(local_irq_is_enabled());
     ...

The actual crash message is unhelpful since it's basically only

...
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Unknown interrupt (cr2=0000000000000000)


Either removing the assert (which is obviously bad), or checking for the 
null pointer deref as in the submitted patch seems to be fixing it. I'm 
suspecting it was always broken somehow but just was hidden or had 
different side effects on 4.1 than it does now. I do lack for a good 
explanation why fiddling with null addresses breaks up this assert, though.

>> Can be fixed by testing whether
>> policy_buffer
>> is NULL before attempting to load from it - it's a global which is set to
>> non-NULL when
>> policy module is detected.
>>
>> Signed-off-by: Tomasz Wroblewski<tomasz.wroblewski@citrix.com>
>> ---
>>   xen/xsm/flask/hooks.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>> index fa0589a..cfa2929 100644
>> --- a/xen/xsm/flask/hooks.c
>> +++ b/xen/xsm/flask/hooks.c
>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>       if ( register_xsm(&flask_ops) )
>>           panic("Flask: Unable to register with XSM.\n");
>>
>> -    ret = security_load_policy(policy_buffer, policy_size);
>> +    if ( policy_buffer )
>> +        ret = security_load_policy(policy_buffer, policy_size);
> Question is whether policy_buffer == NULL really isn't supposed
> to result in a -E... return value (as in fact flask initialization failed).
>
> Jan
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 12:24   ` Tomasz Wroblewski
@ 2013-08-26 12:41     ` Andrew Cooper
  2013-08-26 13:00     ` Jan Beulich
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Cooper @ 2013-08-26 12:41 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: xen-devel, dgdegra, Jan Beulich

On 26/08/2013 13:24, Tomasz Wroblewski wrote:
> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 12:03, Tomasz
>>>>> Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is
>>> not
>>> specified.
>>> This seems to have worked on 4.1 at least.
>> Looking at the code (4.1.5) I can't see what would prevent the
>> same NULL pointer deref. Care to explain?
> The crash doesn't happen at the NULL pointer dereference site though,
> but a bit later, when xen tries to flush tlbs for first time I
> believe, which happens during page allocation for the initial domain
> structure. I traced it to the following ASSERT in smp.c (so yes I
> should add this particular crash likely is limited to debug builds then)
>
> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned
> int flags)
> {
>     ASSERT(local_irq_is_enabled());
>     ...
>
> The actual crash message is unhelpful since it's basically only
>
> ...
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Unknown interrupt (cr2=0000000000000000)
>
>
> Either removing the assert (which is obviously bad), or checking for
> the null pointer deref as in the submitted patch seems to be fixing
> it. I'm suspecting it was always broken somehow but just was hidden or
> had different side effects on 4.1 than it does now. I do lack for a
> good explanation why fiddling with null addresses breaks up this
> assert, though.

Do you have any more information than that?  Stack trace or even a stack
dump?

I cant spot how those two would be connected.

~Andrew

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 12:24   ` Tomasz Wroblewski
  2013-08-26 12:41     ` Andrew Cooper
@ 2013-08-26 13:00     ` Jan Beulich
  2013-08-26 13:34       ` Tomasz Wroblewski
  2013-08-26 17:00       ` Tomasz Wroblewski
  1 sibling, 2 replies; 15+ messages in thread
From: Jan Beulich @ 2013-08-26 13:00 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: xen-devel, dgdegra

>>> On 26.08.13 at 14:24, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
>>> specified.
>>> This seems to have worked on 4.1 at least.
>> Looking at the code (4.1.5) I can't see what would prevent the
>> same NULL pointer deref. Care to explain?
> The crash doesn't happen at the NULL pointer dereference site though, 

Then does it deref the NULL pointer, or does it not? If it does and
merely doesn't crash because something happens to be mapped
there, that's still a bug.

> but a bit later, when xen tries to flush tlbs for first time I believe, 
> which happens during page allocation for the initial domain structure. I 
> traced it to the following ASSERT in smp.c (so yes I should add this 
> particular crash likely is limited to debug builds then)
> 
> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int 
> flags)
> {
>      ASSERT(local_irq_is_enabled());
>      ...
> 
> The actual crash message is unhelpful since it's basically only
> 
> ...
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Unknown interrupt (cr2=0000000000000000)
> 
> 
> Either removing the assert (which is obviously bad), or checking for the 

The assertion is in no way bad. It's the too early use of the
function that is the problem here.

> null pointer deref as in the submitted patch seems to be fixing it. I'm 
> suspecting it was always broken somehow but just was hidden or had 
> different side effects on 4.1 than it does now. I do lack for a good 
> explanation why fiddling with null addresses breaks up this assert, though.

Also, you didn't show the call trace that made things get here (yes,
you may need to construct this manually). I'm in no way convinced
that there's a NULL pointer involved here at all - the fact that CR2
is zero doesn't mean a page fault occurred in the first place.

Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 10:52 ` Andrew Cooper
@ 2013-08-26 13:27   ` Daniel De Graaf
  2013-08-26 13:32     ` Tomasz Wroblewski
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel De Graaf @ 2013-08-26 13:27 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich; +Cc: Tomasz Wroblewski, xen-devel

On 08/26/2013 06:52 AM, Andrew Cooper wrote:
> On 26/08/2013 11:03, Tomasz Wroblewski wrote:
>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified.
>> This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer
>> is NULL before attempting to load from it - it's a global which is set to non-NULL when
>> policy module is detected.
>>
>> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
>
> CCing Daniel De Graaf, as the maintainer of this code.
>
> However FWIW,
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
>> ---
>>   xen/xsm/flask/hooks.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>> index fa0589a..cfa2929 100644
>> --- a/xen/xsm/flask/hooks.c
>> +++ b/xen/xsm/flask/hooks.c
>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>       if ( register_xsm(&flask_ops) )
>>           panic("Flask: Unable to register with XSM.\n");
>>
>> -    ret = security_load_policy(policy_buffer, policy_size);
>> +    if ( policy_buffer )
>> +        ret = security_load_policy(policy_buffer, policy_size);
>>
>>       if ( flask_enforcing )
>>           printk("Flask:  Starting in enforcing mode.\n");
>

While this change is not wrong, I also don't see how it could fix
anything. The security_load_policy function will not dereference
policy_buffer if policy_size is zero (it'll return -EINVAL first),
and the only location that sets policy_size (xsm_policy_init) also
sets policy_buffer. If this function is setting the NULL pointer,
it also does a dereference - and it will be visible in the printk.

Also, on 08/26/2013 07:12 AM, Jan Beulich wrote:
> Question is whether policy_buffer == NULL really isn't supposed
> to result in a -E... return value (as in fact flask initialization failed).

The return value of flask_init isn't ever checked. A failure to
load the policy at boot is identical to no policy - waiting for a
flask_disable or "xl loadpolicy" hypercall from dom0.

-- 
Daniel De Graaf
National Security Agency

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 13:27   ` Daniel De Graaf
@ 2013-08-26 13:32     ` Tomasz Wroblewski
  0 siblings, 0 replies; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-26 13:32 UTC (permalink / raw)
  To: Daniel De Graaf; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On 08/26/2013 03:27 PM, Daniel De Graaf wrote:
> On 08/26/2013 06:52 AM, Andrew Cooper wrote:
>> On 26/08/2013 11:03, Tomasz Wroblewski wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is 
>>> not specified.
>>> This seems to have worked on 4.1 at least. Can be fixed by testing 
>>> whether policy_buffer
>>> is NULL before attempting to load from it - it's a global which is 
>>> set to non-NULL when
>>> policy module is detected.
>>>
>>> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
>>
>> CCing Daniel De Graaf, as the maintainer of this code.
>>
>> However FWIW,
>> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>>> ---
>>>   xen/xsm/flask/hooks.c |    3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>>> index fa0589a..cfa2929 100644
>>> --- a/xen/xsm/flask/hooks.c
>>> +++ b/xen/xsm/flask/hooks.c
>>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>>       if ( register_xsm(&flask_ops) )
>>>           panic("Flask: Unable to register with XSM.\n");
>>>
>>> -    ret = security_load_policy(policy_buffer, policy_size);
>>> +    if ( policy_buffer )
>>> +        ret = security_load_policy(policy_buffer, policy_size);
>>>
>>>       if ( flask_enforcing )
>>>           printk("Flask:  Starting in enforcing mode.\n");
>>
>
> While this change is not wrong, I also don't see how it could fix
> anything. The security_load_policy function will not dereference
> policy_buffer if policy_size is zero (it'll return -EINVAL first),
> and the only location that sets policy_size (xsm_policy_init) also
> sets policy_buffer. If this function is setting the NULL pointer,
> it also does a dereference - and it will be visible in the printk.
>
Right. I'm trying to sift through security_load_policy to figure out why 
induces the error later, if policy_buffer is null, because yes you are 
right it does not seem to deref the pointer at first glance, but it 
definitely causes the mentioned error later on if it's called. Guess 
this needs further investigation..


> Also, on 08/26/2013 07:12 AM, Jan Beulich wrote:
>> Question is whether policy_buffer == NULL really isn't supposed
>> to result in a -E... return value (as in fact flask initialization 
>> failed).
>
> The return value of flask_init isn't ever checked. A failure to
> load the policy at boot is identical to no policy - waiting for a
> flask_disable or "xl loadpolicy" hypercall from dom0.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 13:00     ` Jan Beulich
@ 2013-08-26 13:34       ` Tomasz Wroblewski
  2013-08-26 17:00       ` Tomasz Wroblewski
  1 sibling, 0 replies; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-26 13:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, dgdegra

On 08/26/2013 03:00 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can't see what would prevent the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn't happen at the NULL pointer dereference site though,
> Then does it deref the NULL pointer, or does it not? If it does and
> merely doesn't crash because something happens to be mapped
> there, that's still a bug.
>
>> but a bit later, when xen tries to flush tlbs for first time I believe,
>> which happens during page allocation for the initial domain structure. I
>> traced it to the following ASSERT in smp.c (so yes I should add this
>> particular crash likely is limited to debug builds then)
>>
>> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int
>> flags)
>> {
>>       ASSERT(local_irq_is_enabled());
>>       ...
>>
>> The actual crash message is unhelpful since it's basically only
>>
>> ...
>> (XEN) Using scheduler: SMP Credit Scheduler (credit)
>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>
>>
>> Either removing the assert (which is obviously bad), or checking for the
> The assertion is in no way bad. It's the too early use of the
> function that is the problem here.

Aye I meant removing the assertion is bad, not the assert. Looks like 
this needs bit more investigation to see what exact bits inside 
security_load_policy causes this, so I'll do that.

>> null pointer deref as in the submitted patch seems to be fixing it. I'm
>> suspecting it was always broken somehow but just was hidden or had
>> different side effects on 4.1 than it does now. I do lack for a good
>> explanation why fiddling with null addresses breaks up this assert, though.
> Also, you didn't show the call trace that made things get here (yes,
> you may need to construct this manually). I'm in no way convinced
> that there's a NULL pointer involved here at all - the fact that CR2
> is zero doesn't mean a page fault occurred in the first place.
Yeah will investigate more
> Jan
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 13:00     ` Jan Beulich
  2013-08-26 13:34       ` Tomasz Wroblewski
@ 2013-08-26 17:00       ` Tomasz Wroblewski
  2013-08-27  7:13         ` Jan Beulich
  2013-08-27  8:50         ` [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Andrew Cooper
  1 sibling, 2 replies; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-26 17:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, dgdegra, Andrew Cooper

On 08/26/2013 03:00 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can't see what would prevent the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn't happen at the NULL pointer dereference site though,
> Then does it deref the NULL pointer, or does it not? If it does and
> merely doesn't crash because something happens to be mapped
> there, that's still a bug.
So after bit more tracing it looks like the issue is not a null pointer 
deref (it is avoided as Daniel pointed out), but rather some 
xmalloc/xfree pairs which happen inside security_load_policy even if 
pointer is null.

security_load_policy calls policydb_init, then tries to read the policy 
which fails since length of is 0, so it calls policydb_destroy. Inside 
these functions there is some hashtable construction/destruction 
happening, and a reasonably sizable ones too (there are 7 or so 
hashtables, biggest one uses an array of 512 pointers, they total to use 
768 pointers). I've verified that replacing security_load_policy call 
completely with the following allocation/deallocaiton is enough to cause 
this crash:

     //ret = security_load_policy(policy_buffer, policy_size);
     {
             void ** p = xmalloc_array(void*, 768);
             xfree(p);
     }

Note that this allocation succeeds, and also if you would not call xfree 
(which is not called if say a policy was succesfully loaded), there is 
no crash. So yeah my original patch accidentaly fixes it by just 
avoiding the alloc/free completely.

The shaky manually constructed call graph for the assertion failure:

setup.c: init_idle_domain
schedule.c: scheduler_init
domain.c: domain_create
domain.c: alloc_domain_struct
domain.c: alloc_xenheap_pages
..
page_alloc.c: alloc_heap_pages
flushtlb.h: flush_tlb_mask
flushtlb.h: flush_mask
smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here

I apparently can't get a real stacktrace because adding 
dump_execution_state in flush_area_mask just causes the "Unknown 
interrupt" error, similarily to what hitting the ASSERT fail does. I 
printed the assert condition manually to verify it tho and interrupts 
are disabled there so its bound to fail.

So it looks like the flush_tlb is called too early (with interrupts 
disabled) due to large deallocations in the policy loader code?


On 08/26/2013 03:00 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can't see what would prevent the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn't happen at the NULL pointer dereference site though,

>> but a bit later, when xen tries to flush tlbs for first time I believe,
>> which happens during page allocation for the initial domain structure. I
>> traced it to the following ASSERT in smp.c (so yes I should add this
>> particular crash likely is limited to debug builds then)
>>
>> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int
>> flags)
>> {
>>       ASSERT(local_irq_is_enabled());
>>       ...
>>
>> The actual crash message is unhelpful since it's basically only
>>
>> ...
>> (XEN) Using scheduler: SMP Credit Scheduler (credit)
>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>
>>
>> Either removing the assert (which is obviously bad), or checking for the
> The assertion is in no way bad. It's the too early use of the
> function that is the problem here.
>
>> null pointer deref as in the submitted patch seems to be fixing it. I'm
>> suspecting it was always broken somehow but just was hidden or had
>> different side effects on 4.1 than it does now. I do lack for a good
>> explanation why fiddling with null addresses breaks up this assert, though.
> Also, you didn't show the call trace that made things get here (yes,
> you may need to construct this manually). I'm in no way convinced
> that there's a NULL pointer involved here at all - the fact that CR2
> is zero doesn't mean a page fault occurred in the first place.
>
> Jan
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 17:00       ` Tomasz Wroblewski
@ 2013-08-27  7:13         ` Jan Beulich
  2013-08-27  7:23           ` Tomasz Wroblewski
  2013-08-27  8:50         ` [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Andrew Cooper
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2013-08-27  7:13 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: Andrew Cooper, dgdegra, xen-devel

>>> On 26.08.13 at 19:00, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> I've verified that replacing security_load_policy call 
> completely with the following allocation/deallocaiton is enough to cause 
> this crash:
> 
>      //ret = security_load_policy(policy_buffer, policy_size);
>      {
>              void ** p = xmalloc_array(void*, 768);
>              xfree(p);
>      }
> 
> Note that this allocation succeeds, and also if you would not call xfree 
> (which is not called if say a policy was succesfully loaded), there is 
> no crash. So yeah my original patch accidentaly fixes it by just 
> avoiding the alloc/free completely.

But I then understand, together with the below, that the crash isn't
down the xfree() path, ...

> The shaky manually constructed call graph for the assertion failure:
> 
> setup.c: init_idle_domain
> schedule.c: scheduler_init
> domain.c: domain_create
> domain.c: alloc_domain_struct
> domain.c: alloc_xenheap_pages
> ..
> page_alloc.c: alloc_heap_pages
> flushtlb.h: flush_tlb_mask
> flushtlb.h: flush_mask
> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here

... but instead is on a _subsequent_ allocation. Hence the prior
free gets a heap page into a state that makes in non-suitable for
re-use. But you certainly noticed that free_heap_pages() sets a
page's u.free.need_tlbflush only if the page had an owner,
which shouldn't be the case for Xen-internal allocations.

With that, I think I can see where the bug really is: The owner
field (v.inuse._domain) is in a union with the order field re-used
by the xmalloc() implementation for whole page allocations. The
fix therefore ought to be as simple as the patch below.

Jan

--- a/xen/common/xmalloc_tlsf.c
+++ a/xen/common/xmalloc_tlsf.c
@@ -629,6 +629,7 @@ void xfree(void *p)
         unsigned int i, order = get_order_from_pages(size);
 
         BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
+        PFN_ORDER(virt_to_page(p)) = 0;
         for ( i = 0; ; ++i )
         {
             if ( !(size & (1 << i)) )

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-27  7:13         ` Jan Beulich
@ 2013-08-27  7:23           ` Tomasz Wroblewski
  2013-08-27  7:47             ` [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc() Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Tomasz Wroblewski @ 2013-08-27  7:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, dgdegra, xen-devel

On 08/27/2013 09:13 AM, Jan Beulich wrote:
>>>> On 26.08.13 at 19:00, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> I've verified that replacing security_load_policy call
>> completely with the following allocation/deallocaiton is enough to cause
>> this crash:
>>
>>       //ret = security_load_policy(policy_buffer, policy_size);
>>       {
>>               void ** p = xmalloc_array(void*, 768);
>>               xfree(p);
>>       }
>>
>> Note that this allocation succeeds, and also if you would not call xfree
>> (which is not called if say a policy was succesfully loaded), there is
>> no crash. So yeah my original patch accidentaly fixes it by just
>> avoiding the alloc/free completely.
> But I then understand, together with the below, that the crash isn't
> down the xfree() path, ...
>
>> The shaky manually constructed call graph for the assertion failure:
>>
>> setup.c: init_idle_domain
>> schedule.c: scheduler_init
>> domain.c: domain_create
>> domain.c: alloc_domain_struct
>> domain.c: alloc_xenheap_pages
>> ..
>> page_alloc.c: alloc_heap_pages
>> flushtlb.h: flush_tlb_mask
>> flushtlb.h: flush_mask
>> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here
> ... but instead is on a _subsequent_ allocation. Hence the prior
> free gets a heap page into a state that makes in non-suitable for
> re-use. But you certainly noticed that free_heap_pages() sets a
> page's u.free.need_tlbflush only if the page had an owner,
> which shouldn't be the case for Xen-internal allocations.
>
> With that, I think I can see where the bug really is: The owner
> field (v.inuse._domain) is in a union with the order field re-used
> by the xmalloc() implementation for whole page allocations. The
> fix therefore ought to be as simple as the patch below.
Just tested that fix, works! Thanks
> Jan
>
> --- a/xen/common/xmalloc_tlsf.c
> +++ a/xen/common/xmalloc_tlsf.c
> @@ -629,6 +629,7 @@ void xfree(void *p)
>           unsigned int i, order = get_order_from_pages(size);
>
>           BUG_ON((unsigned long)p&  ((PAGE_SIZE<<  order) - 1));
> +        PFN_ORDER(virt_to_page(p)) = 0;
>           for ( i = 0; ; ++i )
>           {
>               if ( !(size&  (1<<  i)) )
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()
  2013-08-27  7:23           ` Tomasz Wroblewski
@ 2013-08-27  7:47             ` Jan Beulich
  2013-09-09 11:14               ` Keir Fraser
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2013-08-27  7:47 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: Andrew Cooper, dgdegra, Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

Not doing this was found to cause problems with sequences of allocation
(multi-page), freeing, and then again allocation of the same page upon
boot when interrupts are still disabled (causing the owner field to be
non-zero, thus making the allocator attempt a TLB flush and, in its
processing, triggering an assertion).

Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

--- a/xen/common/xmalloc_tlsf.c
+++ b/xen/common/xmalloc_tlsf.c
@@ -629,6 +629,7 @@ void xfree(void *p)
         unsigned int i, order = get_order_from_pages(size);
 
         BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
+        PFN_ORDER(virt_to_page(p)) = 0;
         for ( i = 0; ; ++i )
         {
             if ( !(size & (1 << i)) )




[-- Attachment #2: xfree-whole-pages.patch --]
[-- Type: text/plain, Size: 944 bytes --]

xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()

Not doing this was found to cause problems with sequences of allocation
(multi-page), freeing, and then again allocation of the same page upon
boot when interrupts are still disabled (causing the owner field to be
non-zero, thus making the allocator attempt a TLB flush and, in its
processing, triggering an assertion).

Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

--- a/xen/common/xmalloc_tlsf.c
+++ b/xen/common/xmalloc_tlsf.c
@@ -629,6 +629,7 @@ void xfree(void *p)
         unsigned int i, order = get_order_from_pages(size);
 
         BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
+        PFN_ORDER(virt_to_page(p)) = 0;
         for ( i = 0; ; ++i )
         {
             if ( !(size & (1 << i)) )

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
  2013-08-26 17:00       ` Tomasz Wroblewski
  2013-08-27  7:13         ` Jan Beulich
@ 2013-08-27  8:50         ` Andrew Cooper
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Cooper @ 2013-08-27  8:50 UTC (permalink / raw)
  To: Tomasz Wroblewski; +Cc: xen-devel, dgdegra, Jan Beulich

On 26/08/2013 18:00, Tomasz Wroblewski wrote:
>
>
> The shaky manually constructed call graph for the assertion failure:
>
> setup.c: init_idle_domain
> schedule.c: scheduler_init
> domain.c: domain_create
> domain.c: alloc_domain_struct
> domain.c: alloc_xenheap_pages
> ..
> page_alloc.c: alloc_heap_pages
> flushtlb.h: flush_tlb_mask
> flushtlb.h: flush_mask
> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here
>
> I apparently can't get a real stacktrace because adding
> dump_execution_state in flush_area_mask just causes the "Unknown
> interrupt" error, similarily to what hitting the ASSERT fail does. I
> printed the assert condition manually to verify it tho and interrupts
> are disabled there so its bound to fail.
>

Just for reference here, as I went digging in the code.
dump_execution_state() makes use of run_in_exception_handler() which
makes use of the ud2 instruction to get its hands on an exception frame,
for the purpose of dumping the state.

The "Unknown Interrupt" means that the CPU is still running on the early
boot IDT, set up in xen/arch/x86/boot/x86_64.S which does a blanket
ignore on all interrupts, including exceptions.

We should probably see about setting up and using the arch traps earlier
in boot.  Failing that, an early_invalid_opcode() handler could at least
hint that it might have hit an early bug, and give some details.

~Andrew

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()
  2013-08-27  7:47             ` [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc() Jan Beulich
@ 2013-09-09 11:14               ` Keir Fraser
  0 siblings, 0 replies; 15+ messages in thread
From: Keir Fraser @ 2013-09-09 11:14 UTC (permalink / raw)
  To: Jan Beulich, Tomasz Wroblewski; +Cc: Andrew Cooper, dgdegra, xen-devel

On 27/08/2013 08:47, "Jan Beulich" <JBeulich@suse.com> wrote:

> Not doing this was found to cause problems with sequences of allocation
> (multi-page), freeing, and then again allocation of the same page upon
> boot when interrupts are still disabled (causing the owner field to be
> non-zero, thus making the allocator attempt a TLB flush and, in its
> processing, triggering an assertion).
> 
> Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Acked-by: Keir Fraser <keir@xen.org>

> --- a/xen/common/xmalloc_tlsf.c
> +++ b/xen/common/xmalloc_tlsf.c
> @@ -629,6 +629,7 @@ void xfree(void *p)
>          unsigned int i, order = get_order_from_pages(size);
>  
>          BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
> +        PFN_ORDER(virt_to_page(p)) = 0;
>          for ( i = 0; ; ++i )
>          {
>              if ( !(size & (1 << i)) )
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-09-09 11:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-26 10:03 [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Tomasz Wroblewski
2013-08-26 10:52 ` Andrew Cooper
2013-08-26 13:27   ` Daniel De Graaf
2013-08-26 13:32     ` Tomasz Wroblewski
2013-08-26 11:12 ` Jan Beulich
2013-08-26 12:24   ` Tomasz Wroblewski
2013-08-26 12:41     ` Andrew Cooper
2013-08-26 13:00     ` Jan Beulich
2013-08-26 13:34       ` Tomasz Wroblewski
2013-08-26 17:00       ` Tomasz Wroblewski
2013-08-27  7:13         ` Jan Beulich
2013-08-27  7:23           ` Tomasz Wroblewski
2013-08-27  7:47             ` [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc() Jan Beulich
2013-09-09 11:14               ` Keir Fraser
2013-08-27  8:50         ` [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.