All of lore.kernel.org
 help / color / mirror / Atom feed
* Hypervisor memory leak/corruption because of guest irqs
@ 2012-09-07 18:04 Andrew Cooper
  2012-09-07 18:42 ` Keir Fraser
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Cooper @ 2012-09-07 18:04 UTC (permalink / raw)
  To: xen-devel, Keir Fraser, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

Hello,

I appear to have opened a can of worms here.  This was discovered when
turning on ASSERT()s, looking for another crash (and adding in 98 new
ASSERTs, along with some 'type checking' magic constants)

The issue has been discovered against Xen 4.1.3, but a cursory
inspection of unstable shows no signs of it being fixed. The relevant
assertion (which I added) is attached.  (In the process of debugging, I
also developed the ASSERT_PRINTK(bool, fmt, ...) macro which will be
upstreamed in due course.)

The root cause of the problem is the compelete abuse of the
irq_desc->action pointer being cast to a irq_guest_action_t* when
in-fact it is an irqaction*, but the (correct) solution is not easy.

destroy_irq() calls dynamic_irq_cleanup() which xfree()'s desc->action. 
This would be all well and fine if it were only an irqaction pointer. 
However, in this case, it is actually an irq_guest_action_t pointer,
meaning that we have free()'d an inactive timer, which is on a pcpu's
inactive timer linked list.  This means that as soon as the free()'d
memory is reused for something new, the linked list gets trashed, which
which point all bets are off with regards to the validity of hypervisor
memory.

As far as I can tell, this bug only manifests in combination with PCI
Passthrough, as we only perform cleanup of guest irqs when a domain with
passthrough is shut down.  The issue was first found by the ASSERT()s in
__list_del(), when something tried to use the pcpu inactive timer list,
after the free()'d memory was reused.

In this specific case, a quick and dirty hack would be to check every
time we free an action and possibly kill the timer if it is a guest irq.

Having said that, it is not a correct fix; the utter abuse of
irq_desc->action has been a ticking timebomb for a long time. 
irq_guest_action_t is private to the 2nd half of irq.c(x86), whereas
irqaction is common and architecture independent.  The only acceptable
solution I see is to re-architect a substantial proportion of the irq code.

Am I missing something obvious? or is the best way to continue (in which
case I have my work cut out as, it is currently affecting XenServer
customers) ?

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


[-- Attachment #2: hypervisor-panic.log --]
[-- Type: text/x-log, Size: 4504 bytes --]

(XEN) [2012-09-07 11:35:18] Assertion '! (t->status == 1 || t->status == 3 || t->status == 4)' failed, line 229, file irq.c
(XEN) [2012-09-07 11:35:18] will free action with timer in status 1
(XEN) [2012-09-07 11:35:18] Xen BUG at irq.c:229
(XEN) [2012-09-07 11:35:18] ----[ Xen-4.1.3  x86_64  debug=n  Tainted:    C ]----
(XEN) [2012-09-07 11:35:18] CPU:    0
(XEN) [2012-09-07 11:35:18] RIP:    e008:[<ffff82c48016d89e>] destroy_irq+0x1ae/0x210
(XEN) [2012-09-07 11:35:18] RFLAGS: 0000000000010082   CONTEXT: hypervisor
(XEN) [2012-09-07 11:35:18] rax: 0000000000000000   rbx: ffff83083fe04800   rcx: 0000000000000000
(XEN) [2012-09-07 11:35:19] rdx: 000000000000000a   rsi: 000000000000000a   rdi: ffff82c4802662c4
(XEN) [2012-09-07 11:35:19] rbp: ffff830993cd77f0   rsp: ffff82c4802b7d88   r8:  0000000000000000
(XEN) [2012-09-07 11:35:19] r9:  0000000000000000   r10: 00000000ffffffff   r11: ffff82c480141560
(XEN) [2012-09-07 11:35:19] r12: 000000000000008f   r13: ffff83083fe04834   r14: 0000000000000286
(XEN) [2012-09-07 11:35:19] r15: ffff830993cd7810   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) [2012-09-07 11:35:19] cr3: 00000008369ae000   cr2: 00000000e9925e68
(XEN) [2012-09-07 11:35:19] ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
(XEN) [2012-09-07 11:35:19] Xen stack trace from rsp=ffff82c4802b7d88:
(XEN) [2012-09-07 11:35:19]    0000000000000000 ffff8303c40757d0 ffff830895660000 000000000000004f
(XEN) [2012-09-07 11:35:19]    ffff8303c40757d0 000000000000008f ffff83083fe04834 ffff82c480169647
(XEN) [2012-09-07 11:35:20]    ffff8303c40757d0 ffff830895660000 000000000000004f ffff83083fe04800
(XEN) [2012-09-07 11:35:20]    000000000000008f ffff82c48016e491 000000000000004f 000000000000013c
(XEN) [2012-09-07 11:35:20]    000000000000008f ffff830895660000 000000000000004f 000000000000013c
(XEN) [2012-09-07 11:35:20]    ffff830895660188 ffff82c4802d4600 0000000000000000 ffff82c48016e71a
(XEN) [2012-09-07 11:35:20]    ffff830895660000 ffff8300be6ac000 ffff830895660000 00000000ffffffff
(XEN) [2012-09-07 11:35:20]    ffff830895660000 ffff82c48015efb2 00000000ffffffff ffff8300be6ac000
(XEN) [2012-09-07 11:35:20]    fffffffffffffff8 ffff82c480104a80 ffff82c4802f4060 0000000000000000
(XEN) [2012-09-07 11:35:20]    0000000000000001 ffff82c4802f4320 ffff82c4802d0600 ffff82c480130191
(XEN) [2012-09-07 11:35:20]    0000000000000000 ffffffffffffffff ffff82c4802b7f18 ffff82c480126f85
(XEN) [2012-09-07 11:35:20]    ffff8300bf2d8000 00000000e98dde14 0000000000000000 0000000000000000
(XEN) [2012-09-07 11:35:21]    0000000000000000 ffff82c480223b26 0000000000000000 0000000000000000
(XEN) [2012-09-07 11:35:21]    0000000000000000 0000000000000000 00000000e98dde14 00000000e913584c
(XEN) [2012-09-07 11:35:21]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [2012-09-07 11:35:21]    0000000000000008 00000000e9135878 0000000000000000 00000000e9135854
(XEN) [2012-09-07 11:35:21]    00000000e98ddec4 000000f900000000 00000000c01d68f9 0000000000000061
(XEN) [2012-09-07 11:35:21]    0000000000000202 00000000e98dde0c 0000000000000069 0000000000000000
(XEN) [2012-09-07 11:35:21]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [2012-09-07 11:35:21]    ffff8300bf2d8000 0000000000000000 0000000000000000
(XEN) [2012-09-07 11:35:21] Xen call trace:
(XEN) [2012-09-07 11:35:21]       [<ffff82c48016d89e>] destroy_irq+0x1ae/0x210
(XEN) [2012-09-07 11:35:22]      7[<ffff82c480169647>] msi_free_irq+0x27/0x210
(XEN) [2012-09-07 11:35:22]     13[<ffff82c48016e491>] unmap_domain_pirq+0x181/0x3b0
(XEN) [2012-09-07 11:35:22]     23[<ffff82c48016e71a>] free_domain_pirqs+0x5a/0x90
(XEN) [2012-09-07 11:35:22]     29[<ffff82c48015efb2>] arch_domain_destroy+0x32/0x330
(XEN) [2012-09-07 11:35:22]     33[<ffff82c480104a80>] complete_domain_destroy+0x80/0x150
(XEN) [2012-09-07 11:35:22]     39[<ffff82c480130191>] rcu_process_callbacks+0xa1/0x200
(XEN) [2012-09-07 11:35:22]     43[<ffff82c480126f85>] __do_softirq+0x65/0x90
(XEN) [2012-09-07 11:35:22]     49[<ffff82c480223b26>] compat_process_softirqs+0x6/0x10
(XEN) [2012-09-07 11:35:22]    
(XEN) [2012-09-07 11:35:22] 
(XEN) [2012-09-07 11:35:22] ****************************************
(XEN) [2012-09-07 11:35:22] Panic on CPU 0:
(XEN) [2012-09-07 11:35:22] Xen BUG at irq.c:229
(XEN) [2012-09-07 11:35:22] ****************************************
(XEN) [2012-09-07 11:35:23] 
(XEN) [2012-09-07 11:35:23] Manual reset required ('noreboot' specified)

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Hypervisor memory leak/corruption because of guest irqs
  2012-09-07 18:04 Hypervisor memory leak/corruption because of guest irqs Andrew Cooper
@ 2012-09-07 18:42 ` Keir Fraser
  2012-09-07 19:08   ` Andrew Cooper
  0 siblings, 1 reply; 4+ messages in thread
From: Keir Fraser @ 2012-09-07 18:42 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel, Keir Fraser, Jan Beulich

Seems it means noone thought properly about teardown of guest-bound irqs.
Probably because a lot of that code was dumbly ported over from Linux later.

As for abuse of desc->action, you could turn that field explicitly into a
discriminated union; it is already precisely discriminated by
desc->status&IRQ_GUEST. Apart from that syntactic sugar, the idea of having
that pointer point at two different things dependent on irq type doesn't
seem ugly to me -- if it's irq-bound then it does not have, nor does it
need, an irqaction. But the irq_guest_action of course has to be dealt with
and your problem is that noone has thought about it!

 -- Keir

On 07/09/2012 19:04, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> Hello,
> 
> I appear to have opened a can of worms here.  This was discovered when
> turning on ASSERT()s, looking for another crash (and adding in 98 new
> ASSERTs, along with some 'type checking' magic constants)
> 
> The issue has been discovered against Xen 4.1.3, but a cursory
> inspection of unstable shows no signs of it being fixed. The relevant
> assertion (which I added) is attached.  (In the process of debugging, I
> also developed the ASSERT_PRINTK(bool, fmt, ...) macro which will be
> upstreamed in due course.)
> 
> The root cause of the problem is the compelete abuse of the
> irq_desc->action pointer being cast to a irq_guest_action_t* when
> in-fact it is an irqaction*, but the (correct) solution is not easy.
> 
> destroy_irq() calls dynamic_irq_cleanup() which xfree()'s desc->action.
> This would be all well and fine if it were only an irqaction pointer.
> However, in this case, it is actually an irq_guest_action_t pointer,
> meaning that we have free()'d an inactive timer, which is on a pcpu's
> inactive timer linked list.  This means that as soon as the free()'d
> memory is reused for something new, the linked list gets trashed, which
> which point all bets are off with regards to the validity of hypervisor
> memory.
> 
> As far as I can tell, this bug only manifests in combination with PCI
> Passthrough, as we only perform cleanup of guest irqs when a domain with
> passthrough is shut down.  The issue was first found by the ASSERT()s in
> __list_del(), when something tried to use the pcpu inactive timer list,
> after the free()'d memory was reused.
> 
> In this specific case, a quick and dirty hack would be to check every
> time we free an action and possibly kill the timer if it is a guest irq.
> 
> Having said that, it is not a correct fix; the utter abuse of
> irq_desc->action has been a ticking timebomb for a long time.
> irq_guest_action_t is private to the 2nd half of irq.c(x86), whereas
> irqaction is common and architecture independent.  The only acceptable
> solution I see is to re-architect a substantial proportion of the irq code.
> 
> Am I missing something obvious? or is the best way to continue (in which
> case I have my work cut out as, it is currently affecting XenServer
> customers) ?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Hypervisor memory leak/corruption because of guest irqs
  2012-09-07 18:42 ` Keir Fraser
@ 2012-09-07 19:08   ` Andrew Cooper
  2012-09-10 12:59     ` Jan Beulich
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Cooper @ 2012-09-07 19:08 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Keir (Xen.org), Jan Beulich, xen-devel


On 07/09/12 19:42, Keir Fraser wrote:
> Seems it means noone thought properly about teardown of guest-bound irqs.
> Probably because a lot of that code was dumbly ported over from Linux later.
>
> As for abuse of desc->action, you could turn that field explicitly into a
> discriminated union; it is already precisely discriminated by
> desc->status&IRQ_GUEST. Apart from that syntactic sugar, the idea of having
> that pointer point at two different things dependent on irq type doesn't
> seem ugly to me -- if it's irq-bound then it does not have, nor does it
> need, an irqaction. But the irq_guest_action of course has to be dealt with
> and your problem is that noone has thought about it!
>
>  -- Keir

Its not completely trivial to turn it into a discriminated union (in an
acceptable way), as irq_desc and irqaction is common while
irq_guest_action_t is x86 specific (and irq.c private, currently),
meaning the introduction of an arch_irqaction.

Now that the two actions are different, you could perhaps convert
__do_IRQ_guest() to be the 'handler' of the irqaction, which appears to
cause large amounts of "if(desc->status & IRQ_GUEST) then do something"
code to fall out.  Continuing along this line leads to large amounts of
code change.

I will see about working on a bare minimum patch to make the teardown of
guest IRQs safe, but I dont think it will be very small, and it will
open the way for some cleanup.

~Andrew

>
> On 07/09/2012 19:04, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> Hello,
>>
>> I appear to have opened a can of worms here.  This was discovered when
>> turning on ASSERT()s, looking for another crash (and adding in 98 new
>> ASSERTs, along with some 'type checking' magic constants)
>>
>> The issue has been discovered against Xen 4.1.3, but a cursory
>> inspection of unstable shows no signs of it being fixed. The relevant
>> assertion (which I added) is attached.  (In the process of debugging, I
>> also developed the ASSERT_PRINTK(bool, fmt, ...) macro which will be
>> upstreamed in due course.)
>>
>> The root cause of the problem is the compelete abuse of the
>> irq_desc->action pointer being cast to a irq_guest_action_t* when
>> in-fact it is an irqaction*, but the (correct) solution is not easy.
>>
>> destroy_irq() calls dynamic_irq_cleanup() which xfree()'s desc->action.
>> This would be all well and fine if it were only an irqaction pointer.
>> However, in this case, it is actually an irq_guest_action_t pointer,
>> meaning that we have free()'d an inactive timer, which is on a pcpu's
>> inactive timer linked list.  This means that as soon as the free()'d
>> memory is reused for something new, the linked list gets trashed, which
>> which point all bets are off with regards to the validity of hypervisor
>> memory.
>>
>> As far as I can tell, this bug only manifests in combination with PCI
>> Passthrough, as we only perform cleanup of guest irqs when a domain with
>> passthrough is shut down.  The issue was first found by the ASSERT()s in
>> __list_del(), when something tried to use the pcpu inactive timer list,
>> after the free()'d memory was reused.
>>
>> In this specific case, a quick and dirty hack would be to check every
>> time we free an action and possibly kill the timer if it is a guest irq.
>>
>> Having said that, it is not a correct fix; the utter abuse of
>> irq_desc->action has been a ticking timebomb for a long time.
>> irq_guest_action_t is private to the 2nd half of irq.c(x86), whereas
>> irqaction is common and architecture independent.  The only acceptable
>> solution I see is to re-architect a substantial proportion of the irq code.
>>
>> Am I missing something obvious? or is the best way to continue (in which
>> case I have my work cut out as, it is currently affecting XenServer
>> customers) ?
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Hypervisor memory leak/corruption because of guest irqs
  2012-09-07 19:08   ` Andrew Cooper
@ 2012-09-10 12:59     ` Jan Beulich
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Beulich @ 2012-09-10 12:59 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Keir (Xen.org), xen-devel

>>> On 07.09.12 at 21:08, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Its not completely trivial to turn it into a discriminated union (in an
> acceptable way), as irq_desc and irqaction is common while
> irq_guest_action_t is x86 specific (and irq.c private, currently),
> meaning the introduction of an arch_irqaction.
> 
> Now that the two actions are different, you could perhaps convert
> __do_IRQ_guest() to be the 'handler' of the irqaction, which appears to
> cause large amounts of "if(desc->status & IRQ_GUEST) then do something"
> code to fall out.  Continuing along this line leads to large amounts of
> code change.
> 
> I will see about working on a bare minimum patch to make the teardown of
> guest IRQs safe, but I dont think it will be very small, and it will
> open the way for some cleanup.

Is there any reason why simply tying the removal of the timer
to the clearing of IRQ_GUEST isn't possible? This is basically
already the case for __pirq_guest_unbind() (where the caller
takes care of correctly dealing with the returned action pointer).

Hence if dynamic_irq_cleanup() indeed can be reached with the
flag still set, it would seem quite obvious that this simply needs
adding similar treatment. But I would have expected that the
unbind path should be taken in any case (and destroy_irq()
only be used for subsequent, final cleanup) - is there perhaps
a problem in when the forced unbind happens?

Jan

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-09-10 12:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-07 18:04 Hypervisor memory leak/corruption because of guest irqs Andrew Cooper
2012-09-07 18:42 ` Keir Fraser
2012-09-07 19:08   ` Andrew Cooper
2012-09-10 12:59     ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.