* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
@ 2009-10-16  1:38 Cinco, Dante
  2009-10-16  2:34 ` Qing He
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-16  1:38 UTC (permalink / raw)
  To: Qing He; +Cc: xen-devel, Keir Fraser, xiantao.zhang

I'm still trying to track down the problem of lost interrupts when I change /proc/irq/<irq#>/smp_affinity in domU. I'm now at Xen 3.5-unstable changeset 20320 and using pvops dom0 2.6.31.1.

In domU, my PCI devices are at virtual slots 5, 6, 7 and 8, so I use "lspci -vv" to get their respective IRQs and MSI message address/data. I can also see their IRQs in /proc/interrupts (I'm not showing all 16 CPUs):

lspci -vv -s 00:05.0 | grep IRQ; lspci -vv -s 00:06.0 | grep IRQ; lspci -vv -s 00:07.0 | grep IRQ; lspci -vv -s 00:08.0 | grep IRQ
        Interrupt: pin A routed to IRQ 48
        Interrupt: pin B routed to IRQ 49
        Interrupt: pin C routed to IRQ 50
        Interrupt: pin D routed to IRQ 51
lspci -vv -s 00:05.0 | grep Address; lspci -vv -s 00:06.0 | grep Address; lspci -vv -s 00:07.0 | grep Address; lspci -vv -s 00:08.0 | grep Address
                Address: 00000000fee00000  Data: 4071 (vector=113)
                Address: 00000000fee00000  Data: 4089 (vector=137)
                Address: 00000000fee00000  Data: 4099 (vector=153)
                Address: 00000000fee00000  Data: 40a9 (vector=169)
egrep '(HW_TACHYON|CPU0)' /proc/interrupts 
            CPU0       CPU1       
  48:    1571765          0          PCI-MSI-edge      HW_TACHYON
  49:    3204403          0          PCI-MSI-edge      HW_TACHYON
  50:    2643008          0          PCI-MSI-edge      HW_TACHYON
  51:    3270322          0          PCI-MSI-edge      HW_TACHYON

In dom0, my PCI devices show up as a 4-function device: 0:07:0.0, 0:07:0.1, 0:07:0.2, 0:07:0.3 and I also use "lspci -vv" to get the IRQs and MSI info:

lspci -vv -s 0:07:0.0 | grep IRQ;lspci -vv -s 0:07:0.1 | grep IRQ;lspci -vv -s 0:07:0.2 | grep IRQ;lspci -vv -s 0:07:0.3 | grep IRQ
        Interrupt: pin A routed to IRQ 11
        Interrupt: pin B routed to IRQ 10
        Interrupt: pin C routed to IRQ 7
        Interrupt: pin D routed to IRQ 5
lspci -vv -s 0:07:0.0 | grep Address;lspci -vv -s 0:07:0.1 | grep Address;lspci -vv -s 0:07:0.2 | grep Address;lspci -vv -s 0:07:0.3 | grep Address
                Address: 00000000fee00000  Data: 403c (vector=60)
                Address: 00000000fee00000  Data: 4044 (vector=68)
                Address: 00000000fee00000  Data: 404c (vector=76)
                Address: 00000000fee00000  Data: 4054 (vector=84)

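For reference, the "(vector=...)" and dest-ID annotations above and further below follow directly from the standard x86 MSI layout: the destination APIC ID sits in address bits 19:12 and the vector in data bits 7:0. Below is a minimal standalone C sketch of that decoding using values from this thread; it is an illustration only, not Xen or driver code.

/* msi_decode.c -- decode the MSI Address/Data pairs printed by lspci.
 * Standard x86 MSI format: address bits 19:12 = destination APIC ID,
 * data bits 7:0 = vector.  Illustration only. */
#include <stdio.h>
#include <stdint.h>

static void decode(uint64_t addr, uint32_t data)
{
    unsigned dest_id = (unsigned)(addr >> 12) & 0xff;  /* destination APIC ID */
    unsigned vector  = data & 0xff;                    /* interrupt vector    */
    printf("addr %#010llx data %#06x -> dest ID %u, vector 0x%x (%u)\n",
           (unsigned long long)addr, (unsigned)data, dest_id, vector, vector);
}

int main(void)
{
    decode(0xfee00000ULL, 0x4071); /* domU slot 5 above: dest 0, vector 0x71 = 113 */
    decode(0xfee00000ULL, 0x403c); /* dom0 07:0.0 above: dest 0, vector 0x3c = 60  */
    decode(0xfee02000ULL, 0x40b1); /* after the affinity change later in this mail */
    return 0;
}
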
I used the "Ctrl-a" "Ctrl-a" "Ctrl-a" "i" key sequence from the Xen console to print the guest interrupt information and the PCI devices. The vectors shown here are actually the vectors as seen from dom0 so I don't understand the label "Guest interrupt information." Meanwhile, the IRQs (74 - 77) do not match those from dom0 (11, 10, 7, 5) or domU (48, 49, 50, 51) as seen by "lspci -vv" but they do match those reported by the "Ctrl-a" key sequence followed by "Q" for PCI devices.

(XEN) Guest interrupt information:
(XEN)    IRQ:  74, IRQ affinity:0x00000001, Vec: 60 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 79(----),
(XEN)    IRQ:  75, IRQ affinity:0x00000001, Vec: 68 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 78(----),
(XEN)    IRQ:  76, IRQ affinity:0x00000001, Vec: 76 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 77(----),
(XEN)    IRQ:  77, IRQ affinity:0x00000001, Vec: 84 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 76(----),

(XEN) ==== PCI devices ====
(XEN) 07:00.3 - dom 1   - MSIs < 77 >
(XEN) 07:00.2 - dom 1   - MSIs < 76 >
(XEN) 07:00.1 - dom 1   - MSIs < 75 >
(XEN) 07:00.0 - dom 1   - MSIs < 74 >

If I look at /var/log/xen/qemu-dm-dpm.log, I see these 4 lines that show the pirq's, which match those in the last column of the guest interrupt information:

pt_msi_setup: msi mapped with pirq 4f (79)
pt_msi_setup: msi mapped with pirq 4e (78)
pt_msi_setup: msi mapped with pirq 4d (77)
pt_msi_setup: msi mapped with pirq 4c (76)

The gvec's (71, 89, 99, a9) match the vectors as seen by lspci in domU:

pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4f gvec 71 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4e gvec 89 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4d gvec 99 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4c gvec a9 gflags 0

I see these same pirq's in the output of "xm dmesg":

(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.0
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4f device = 5 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.1
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.1
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4e device = 6 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.2
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.2
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4d device = 7 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.3
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.3
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4c device = 8 intx = 0

The machine_gsi's match the pirq's, while the m_irq's match the IRQs from lspci in dom0. What are the guest_gsi's?

(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=79 guest_gsi=36, device=5, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = b device = 5 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=78 guest_gsi=40, device=6, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4e device = 0x6 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = a device = 6 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=77 guest_gsi=44, device=7, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4d device = 0x7 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 7 device = 7 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=76 guest_gsi=17, device=8, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4c device = 0x8 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 5 device = 8 intx = 0

So now, when I finally get to the part where I change smp_affinity, I see corresponding changes in the guest interrupt information, qemu-dm-dpm.log, and lspci on both dom0 and domU:

cat /proc/irq/48/smp_affinity 
ffff
echo 2 > /proc/irq/48/smp_affinity
cat /proc/irq/48/smp_affinity 
0002

(XEN) Guest interrupt information: (IRQ affinity changed from 1 to 2, while vector changed from 60 to 92)
(XEN)    IRQ:  74, IRQ affinity:0x00000002, Vec: 92 type=PCI-MSI         status=00000010 in-flight=1 domain-list=1: 79(---M),

pt_msi_update: Update msi with pirq 4f gvec 71 gflags 2 (What is the significance of gflags 2?)
pt_msi_update: Update msi with pirq 4f gvec b1 gflags 2

domU: lspci -vv -s 00:05.0 | grep Address
                Address: 00000000fee02000  Data: 40b1 (dest ID changed from 0 to 2 and vector changed from 0x71 to 0xb1)

dom0: lspci -vv -s 0:07:0.0 | grep Address
                Address: 00000000fee00000  Data: 405c (vector changed from 0x3c (60 decimal) to 0x5c (92 decimal))

I'm confused about why there are 4 sets of IRQs: dom0 lspci: [11,10,7,5], domU lspci and /proc/interrupts: [48,49,50,51], pirq: [76,77,78,79], guest interrupt info: [74,75,76,77].

Are the changes resulting from changing the IRQ smp_affinity consistent with what is expected? Any recommendation on where to go from here?

Thanks in advance.

Dante


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  1:38 IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) Cinco, Dante
@ 2009-10-16  2:34 ` Qing He
  2009-10-16  6:37   ` Keir Fraser
  0 siblings, 1 reply; 55+ messages in thread
From: Qing He @ 2009-10-16  2:34 UTC (permalink / raw)
  To: Cinco, Dante; +Cc: xen-devel, Keir Fraser, Zhang, Xiantao

On Fri, 2009-10-16 at 09:38 +0800, Cinco, Dante wrote:
> I'm confused why there are 4 sets of IRQs: dom0 lspci:[11,10,7,5], domU
> lspci proc interrupts:[48,49,50,51], pirq:[76,77,78,79], guest int
> info:[74,75,76,77].

This is indeed a little confusing at first; I'll try to differentiate
them here:
1. dom0 IRQ [11,10,7,5]: this is decided by the dom0 kernel, based on
   information from the host ACPI
2. domU IRQ [48,49,50,51]: this is decided by the domU kernel, based on
   the virtual ACPI presented to the guest. If there are multiple domUs,
   this space overlaps
3. pirq [76,77,78,79]: this is a per-domain concept; it has nothing
   to do with the physical or virtual irq number, and its sole purpose
   is to provide an interface between domains (mainly PV) and the
   hypervisor. The GSI part happens to be identity-mapped, though.
4. irq [74,75,76,77]: a global hypervisor concept used to track all irqs
   for all domains. It was originally named `vector'; the name changed
   when per-CPU vectoring was introduced in the hypervisor.

> pt_msi_update: Update msi with pirq 4f gvec 71 gflags 2 (What is the significance of gflags 2?)
> pt_msi_update: Update msi with pirq 4f gvec b1 gflags 2

gflags is a custom interface that incorporates the address and data
fields: DM, dest, etc. gflags=2 means DM=0, dest=2.

The first line is an intermediate result, printed when the guest updates
the MSI address; the second line indicates an update to the MSI data.
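
Below is a minimal sketch of that decoding, assuming a layout with the destination ID in the low byte of gflags and DM in a single higher-order bit. The real masks are Xen's VMSI_DEST_ID_MASK / VMSI_DM_MASK, which appear in the patches later in this thread; the SKETCH_* values here are assumptions for illustration only.

/* gflags_decode.c -- toy decode of the qemu-dm "gflags" value above.
 * SKETCH_* masks are assumed for illustration; Xen's own masks are
 * VMSI_DEST_ID_MASK / VMSI_DM_MASK. */
#include <stdio.h>
#include <stdint.h>

#define SKETCH_DEST_ID_MASK 0x000000ffu  /* assumed: destination ID in bits 7:0 */
#define SKETCH_DM_MASK      0x00000200u  /* assumed: destination-mode bit       */

int main(void)
{
    uint32_t gflags = 0x2;  /* from "Update msi with pirq 4f gvec b1 gflags 2" */
    unsigned dm   = !!(gflags & SKETCH_DM_MASK);
    unsigned dest = gflags & SKETCH_DEST_ID_MASK;
    printf("gflags %#x -> DM=%u dest=%u\n", (unsigned)gflags, dm, dest);  /* DM=0 dest=2 */
    return 0;
}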

> (XEN) Guest interrupt information:
> (XEN)    IRQ:  74, IRQ affinity:0x00000001, Vec: 60 type=PCI-MSI
> status=00000010 in-flight=0 domain-list=1: 79(----),
>
> echo 2 > /proc/irq/48/smp_affinity
>
> (XEN) Guest interrupt information: (IRQ affinity changed from 1 to 2, while
> vector changed from 60 to 92)
> (XEN)    IRQ:  74, IRQ affinity:0x00000002, Vec: 92 type= PCI-MSI
> status=00000010 in-flight=1 domain-list=1: 79(---M),

`(---M)' means masked; that may be why the irq is not received.

Thanks,
Qing


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  2:34 ` Qing He
@ 2009-10-16  6:37   ` Keir Fraser
  2009-10-16  7:32     ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Keir Fraser @ 2009-10-16  6:37 UTC (permalink / raw)
  To: Qing He, Cinco, Dante; +Cc: xen-devel, Zhang, Xiantao

On 16/10/2009 03:34, "Qing He" <qing.he@intel.com> wrote:

>> (XEN) Guest interrupt information: (IRQ affinity changed from 1 to 2, while
>> vector changed from 60 to 92)
>> (XEN)    IRQ:  74, IRQ affinity:0x00000002, Vec: 92 type= PCI-MSI
>> status=00000010 in-flight=1 domain-list=1: 79(---M),
> 
> `(---M)' means masked, that may be why the irq is not received.

Glad you managed to pick that out of the information overload. :-) It does
look like the next obvious lead to chase down.

 -- Keir


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  6:37   ` Keir Fraser
@ 2009-10-16  7:32     ` Zhang, Xiantao
  2009-10-16  8:24       ` Qing He
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16  7:32 UTC (permalink / raw)
  To: Keir Fraser, He, Qing, Cinco, Dante; +Cc: xen-devel

Keir Fraser wrote:
> On 16/10/2009 03:34, "Qing He" <qing.he@intel.com> wrote:
> 
>>> (XEN) Guest interrupt information: (IRQ affinity changed from 1 to
>>> 2, while vector changed from 60 to 92) (XEN)    IRQ:  74, IRQ
>>> affinity:0x00000002, Vec: 92 type= PCI-MSI status=00000010
>>> in-flight=1 domain-list=1: 79(---M), 
>> 
>> `(---M)' means masked, that may be why the irq is not received.
> 
> Glad you managed to pick that out of the information overload. :-) It
> does look like the next obvious lead to chase down.

According to the description, the issue should be caused by a lost EOI write for the MSI interrupt, which leads to the interrupt being permanently masked. There should be a race between the guest setting a new vector and EOIing the old vector for the interrupt. Once the guest sets the new vector before it EOIs the old vector, the hypervisor can't find the pirq corresponding to the old vector (it has been changed to the new vector), so it can never EOI the old vector at the hardware level. Since the corresponding vector in the real processor can't be EOIed, the system may lose all interrupts and ultimately produce the reported issues. But I remembered there should be a timer to handle this case through a forcible EOI write to the real processor after a timeout; it seems it doesn't function in the expected way.
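
A self-contained toy sketch of that race follows (this is not Xen code; the names are simplified stand-ins for the per-pirq guest-MSI bookkeeping that the patches later in the thread modify). The point it shows: the EOI-time lookup keys on the currently recorded guest vector, so an EOI that arrives after the vector has already been rewritten finds no pirq and the physical EOI is never issued.

/* msi_eoi_race.c -- toy model of the lost-EOI race, not Xen code. */
#include <stdio.h>

struct gmsi { int pirq; int gvec; };       /* stand-in for per-pirq guest-MSI state   */
static struct gmsi msi = { 0x4f, 0x71 };   /* pirq 0x4f bound with guest vector 0x71  */

/* stand-in for the EOI-time "find pirq by guest vector" lookup */
static int find_pirq_by_gvec(int vector)
{
    return (msi.gvec == vector) ? msi.pirq : -1;
}

int main(void)
{
    int injected = msi.gvec;   /* interrupt fires: vector 0x71 injected into the guest */

    msi.gvec = 0xb1;           /* guest re-programs the MSI (affinity change) before
                                * it has EOIed the in-flight 0x71                      */

    int pirq = find_pirq_by_gvec(injected);   /* guest finally EOIs 0x71 */
    if (pirq < 0)
        printf("no pirq found for gvec 0x%x -> physical EOI never issued\n", injected);
    else
        printf("EOI forwarded for pirq 0x%x\n", pirq);
    return 0;
}
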
Xiantao 


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  8:24       ` Qing He
@ 2009-10-16  8:22         ` Zhang, Xiantao
  2009-10-16  8:34           ` Qing He
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16  8:22 UTC (permalink / raw)
  To: He, Qing; +Cc: Cinco, Dante, xen-devel, Keir Fraser

He, Qing wrote:
> On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
>> According to the description, the issue should be caused by lost EOI
>> write for the MSI interrupt and leads to permanent interrupt mask.
>> There should be a race between guest setting new vector and  EOIs
>> old vector for the interrupt.  Once guest sets new vector before it
>> EOIs the old vector, hypervisor can't find the pirq which
>> corresponds old vector(has changed 
>> to new vector) , so also can't EOI the old vector forever in hardware
>> level. Since the corresponding vector in real processor can't be
>> EOIed, 
>> so system may lose all interrupts and result the reported issues
>> ultimately. 
> 
>> But I remembered there should be a timer to handle this case
>> through a forcible EOI write to the real processor after timeout,
>> but seems it doesn't function in the expected way.
> 
> The EOI timer is supposed to deal with the irq sharing problem,
> since MSI doesn't share, this timer will not be started in the
> case of MSI.

That may be a problem, if so. If a malicious/buggy guest won't EOI the MSI vector, the host may hang due to the lack of a timeout mechanism?
Xiantao


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  7:32     ` Zhang, Xiantao
@ 2009-10-16  8:24       ` Qing He
  2009-10-16  8:22         ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Qing He @ 2009-10-16  8:24 UTC (permalink / raw)
  To: Zhang, Xiantao; +Cc: Cinco, Dante, xen-devel, Keir Fraser

On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
> According to the description, the issue should be caused by lost EOI write
> for the MSI interrupt and leads to permanent interrupt mask. There should
> be a race between guest setting new vector and  EOIs old vector for the
> interrupt.  Once guest sets new vector before it EOIs the old vector,
> hypervisor can't find the pirq which corresponds old vector(has changed
> to new vector) , so also can't EOI the old vector forever in hardware
> level. Since the corresponding vector in real processor can't be EOIed,
> so system may lose all interrupts and result the reported issues ultimately.

> But I remembered there should be a timer to handle this case
> through a forcible EOI write to the real processor after timeout,
> but seems it doesn't function in the expected way.

The EOI timer is supposed to deal with the irq sharing problem;
since MSI doesn't share, this timer is not started in the
case of MSI.


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  8:22         ` Zhang, Xiantao
@ 2009-10-16  8:34           ` Qing He
  2009-10-16  8:35             ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Qing He @ 2009-10-16  8:34 UTC (permalink / raw)
  To: Zhang, Xiantao; +Cc: Cinco, Dante, xen-devel, Keir Fraser

On Fri, 2009-10-16 at 16:22 +0800, Zhang, Xiantao wrote:
> He, Qing wrote:
> > On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
> >> According to the description, the issue should be caused by lost EOI
> >> write for the MSI interrupt and leads to permanent interrupt mask.
> >> There should be a race between guest setting new vector and  EOIs
> >> old vector for the interrupt.  Once guest sets new vector before it
> >> EOIs the old vector, hypervisor can't find the pirq which
> >> corresponds old vector(has changed 
> >> to new vector) , so also can't EOI the old vector forever in hardware
> >> level. Since the corresponding vector in real processor can't be
> >> EOIed, 
> >> so system may lose all interrupts and result the reported issues
> >> ultimately. 
> > 
> >> But I remembered there should be a timer to handle this case
> >> through a forcible EOI write to the real processor after timeout,
> >> but seems it doesn't function in the expected way.
> > 
> > The EOI timer is supposed to deal with the irq sharing problem,
> > since MSI doesn't share, this timer will not be started in the
> > case of MSI.
> 
> That maybe a problem if so. If a malicious/buggy guest won't EOI the
> MSI vector, so host may hang due to lack of timeout mechanism? 

Why would the host hang? Only the assigned interrupt will be blocked, and
that's exactly what the guest wants :-)


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  8:34           ` Qing He
@ 2009-10-16  8:35             ` Zhang, Xiantao
  2009-10-16  9:01               ` Qing He
  2009-10-16  9:41               ` Keir Fraser
  0 siblings, 2 replies; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16  8:35 UTC (permalink / raw)
  To: He, Qing; +Cc: Cinco, Dante, xen-devel, Keir Fraser

He, Qing wrote:
> On Fri, 2009-10-16 at 16:22 +0800, Zhang, Xiantao wrote:
>> He, Qing wrote:
>>> On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
>>>> According to the description, the issue should be caused by lost
>>>> EOI write for the MSI interrupt and leads to permanent interrupt
>>>> mask. There should be a race between guest setting new vector and 
>>>> EOIs old vector for the interrupt.  Once guest sets new vector
>>>> before it EOIs the old vector, hypervisor can't find the pirq which
>>>> corresponds old vector(has changed
>>>> to new vector) , so also can't EOI the old vector forever in
>>>> hardware level. Since the corresponding vector in real processor
>>>> can't be EOIed, so system may lose all interrupts and result the
>>>> reported issues ultimately.
>>> 
>>>> But I remembered there should be a timer to handle this case
>>>> through a forcible EOI write to the real processor after timeout,
>>>> but seems it doesn't function in the expected way.
>>> 
>>> The EOI timer is supposed to deal with the irq sharing problem,
>>> since MSI doesn't share, this timer will not be started in the
>>> case of MSI.
>> 
>> That maybe a problem if so. If a malicious/buggy guest won't EOI the
>> MSI vector, so host may hang due to lack of timeout mechanism?
> 
> Why does host hang? Only the assigned interrupt will block, and that's
> exactly what the guest wants :-)

The hypervisor shouldn't EOI the real vector until the guest EOIs the corresponding virtual vector, right? Not sure. :-)
Xiantao


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  8:35             ` Zhang, Xiantao
@ 2009-10-16  9:01               ` Qing He
  2009-10-16  9:42                 ` Qing He
  2009-10-16  9:49                 ` Zhang, Xiantao
  2009-10-16  9:41               ` Keir Fraser
  1 sibling, 2 replies; 55+ messages in thread
From: Qing He @ 2009-10-16  9:01 UTC (permalink / raw)
  To: Zhang, Xiantao; +Cc: Cinco, Dante, xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2017 bytes --]

On Fri, 2009-10-16 at 16:35 +0800, Zhang, Xiantao wrote:
> He, Qing wrote:
> > On Fri, 2009-10-16 at 16:22 +0800, Zhang, Xiantao wrote:
> >> He, Qing wrote:
> >>> On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
> >>>> According to the description, the issue should be caused by lost
> >>>> EOI write for the MSI interrupt and leads to permanent interrupt
> >>>> mask. There should be a race between guest setting new vector and 
> >>>> EOIs old vector for the interrupt.  Once guest sets new vector
> >>>> before it EOIs the old vector, hypervisor can't find the pirq which
> >>>> corresponds old vector(has changed
> >>>> to new vector) , so also can't EOI the old vector forever in
> >>>> hardware level. Since the corresponding vector in real processor
> >>>> can't be EOIed, so system may lose all interrupts and result the
> >>>> reported issues ultimately.
> >>> 
> >>>> But I remembered there should be a timer to handle this case
> >>>> through a forcible EOI write to the real processor after timeout,
> >>>> but seems it doesn't function in the expected way.
> >>> 
> >>> The EOI timer is supposed to deal with the irq sharing problem,
> >>> since MSI doesn't share, this timer will not be started in the
> >>> case of MSI.
> >> 
> >> That maybe a problem if so. If a malicious/buggy guest won't EOI the
> >> MSI vector, so host may hang due to lack of timeout mechanism?
> > 
> > Why does host hang? Only the assigned interrupt will block, and that's
> > exactly what the guest wants :-)
> 
> Hypervisor shouldn't EOI the real vector until guest EOI the corresponding
> virtual vector , right ?  Not sure.:-)

Yes, it is the algorithm used today.

After reviewing the code, if the guest really does something like
changing affinity within the window between an irq firing and its EOI,
there is indeed a problem; the patch is attached. Although I kind of
doubt it: shouldn't desc->lock in the guest protect these paths and make
the two operations mutually exclusive?

Dante,
Can you see if this patch helps?

Thanks,
Qing

[-- Attachment #2: msi-eoi-before-update.patch --]
[-- Type: text/x-diff, Size: 752 bytes --]

diff -r 1d7221667204 xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Thu Oct 08 09:24:32 2009 +0100
+++ b/xen/drivers/passthrough/io.c	Fri Oct 16 16:38:06 2009 +0800
@@ -26,6 +26,7 @@
 #include <xen/hvm/irq.h>
 
 static void hvm_dirq_assist(unsigned long _d);
+static void __msi_pirq_eoi(struct domain *d, int pirq);
 
 static int pt_irq_need_timer(uint32_t flags)
 {
@@ -194,7 +195,9 @@
 	            spin_unlock(&d->event_lock);
         	    return -EBUSY;
             }
- 
+
+            __msi_pirq_eoi(d, pirq);
+
             /* if pirq is already mapped as vmsi, update the guest data/addr */
             old_gvec = hvm_irq_dpci->mirq[pirq].gmsi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  8:35             ` Zhang, Xiantao
  2009-10-16  9:01               ` Qing He
@ 2009-10-16  9:41               ` Keir Fraser
  2009-10-16  9:57                 ` Qing He
  2009-10-16  9:58                 ` Zhang, Xiantao
  1 sibling, 2 replies; 55+ messages in thread
From: Keir Fraser @ 2009-10-16  9:41 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Cinco, Dante, xen-devel

On 16/10/2009 09:35, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

>>> That maybe a problem if so. If a malicious/buggy guest won't EOI the
>>> MSI vector, so host may hang due to lack of timeout mechanism?
>> 
>> Why does host hang? Only the assigned interrupt will block, and that's
>> exactly what the guest wants :-)
> 
> Hypervisor shouldn't EOI the real vector until guest EOI the corresponding
> virtual vector , right ?  Not sure.:-)

If the EOI is via the local APIC, which I suppose it must be, then a timeout
fallback probably is required. This is because priorities are assigned
arbitrarily to guest interrupts, and a non-EOIed interrupt blocks any
lower-priority interrupts. In particular, some of those could be owned by
dom0 for example, and be quite critical to forward progress of the entire
system.

 -- Keir


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:01               ` Qing He
@ 2009-10-16  9:42                 ` Qing He
  2009-10-16  9:49                 ` Zhang, Xiantao
  1 sibling, 0 replies; 55+ messages in thread
From: Qing He @ 2009-10-16  9:42 UTC (permalink / raw)
  To: Zhang, Xiantao; +Cc: Cinco, Dante, xen-devel, Keir Fraser

On Fri, 2009-10-16 at 17:01 +0800, Qing He wrote:
> Yes, it is the algorithm used today.
> 
> After reviewing the code, if the guest really does something like
> changing affinity within the window between an irq fire and eoi,
> there is indeed a problem, attached is the patch. Although I kinda
> doubt it, shouldn't desc->lock in guest protect and make these two
> operations mutual exclusive.
> 
> Dante,
> Can you see if this patch helps?

Please ignore this patch. I intended to use it to see if it could
confirm the analysis (at the cost of lost interrupts), but it may
actually introduce more severe problems.

Thanks,
Qing


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:01               ` Qing He
  2009-10-16  9:42                 ` Qing He
@ 2009-10-16  9:49                 ` Zhang, Xiantao
  2009-10-16 14:54                   ` Zhang, Xiantao
  1 sibling, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16  9:49 UTC (permalink / raw)
  To: He, Qing; +Cc: Cinco, Dante, xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2501 bytes --]

He, Qing wrote:
> On Fri, 2009-10-16 at 16:35 +0800, Zhang, Xiantao wrote:
>> He, Qing wrote:
>>> On Fri, 2009-10-16 at 16:22 +0800, Zhang, Xiantao wrote:
>>>> He, Qing wrote:
>>>>> On Fri, 2009-10-16 at 15:32 +0800, Zhang, Xiantao wrote:
>>>>>> According to the description, the issue should be caused by lost
>>>>>> EOI write for the MSI interrupt and leads to permanent interrupt
>>>>>> mask. There should be a race between guest setting new vector and
>>>>>> EOIs old vector for the interrupt.  Once guest sets new vector
>>>>>> before it EOIs the old vector, hypervisor can't find the pirq
>>>>>> which corresponds old vector(has changed
>>>>>> to new vector) , so also can't EOI the old vector forever in
>>>>>> hardware level. Since the corresponding vector in real processor
>>>>>> can't be EOIed, so system may lose all interrupts and result the
>>>>>> reported issues ultimately.
>>>>> 
>>>>>> But I remembered there should be a timer to handle this case
>>>>>> through a forcible EOI write to the real processor after timeout,
>>>>>> but seems it doesn't function in the expected way.
>>>>> 
>>>>> The EOI timer is supposed to deal with the irq sharing problem,
>>>>> since MSI doesn't share, this timer will not be started in the
>>>>> case of MSI.
>>>> 
>>>> That maybe a problem if so. If a malicious/buggy guest won't EOI
>>>> the MSI vector, so host may hang due to lack of timeout mechanism?
>>> 
>>> Why does host hang? Only the assigned interrupt will block, and
>>> that's exactly what the guest wants :-)
>> 
>> Hypervisor shouldn't EOI the real vector until guest EOI the
>> corresponding virtual vector , right ?  Not sure.:-)
> 
> Yes, it is the algorithm used today.

So it should still be a problem. If the guest won't do the EOI, the host can't do the EOI either, which leads to a system hang without a timeout mechanism. So we may need to introduce a timer for each MSI interrupt source to avoid hanging the host, Keir?

> After reviewing the code, if the guest really does something like
> changing affinity within the window between an irq fire and eoi,
> there is indeed a problem, attached is the patch. Although I kinda
> doubt it, shouldn't desc->lock in guest protect and make these two
> operations mutual exclusive.

We shouldn't let the hypervisor do the real EOI before the guest does the corresponding virtual EOI, so this patch may have a correctness issue. :-)

Attached is the fix according to my previous guess, and it should fix the issue.

Xiantao

[-- Attachment #2: fix-irq-affinity-msi.patch --]
[-- Type: application/octet-stream, Size: 4449 bytes --]

# HG changeset patch
# User Xiantao Zhang <xiantao.zhang@intel.com>
# Date 1255684803 -28800
# Node ID d1b3cb3fe044285093c923761d4bc40c7af4d199
# Parent  2eba302831c4534ac40283491f887263c7197b4a
x86: vMSI: Fix msi irq affinity issue for hvm guest.

There is a race between guest setting new vector and doing EOI on old vector.
Once guest sets new vector before its doing EOI on vector, when guest does eoi,
hypervisor may fail to find the related pirq, and hypervisor may miss to EOI real
vector and leads to system hang.  We may need to add a timer for each pirq interrupt
source to avoid host hang, but this is another topic, and will be addressed later.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r 2eba302831c4 -r d1b3cb3fe044 xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/drivers/passthrough/io.c	Fri Oct 16 17:20:03 2009 +0800
@@ -164,7 +164,11 @@ int pt_irq_create_bind_vtd(
         {
             hvm_irq_dpci->mirq[pirq].flags = HVM_IRQ_DPCI_MACH_MSI |
                                              HVM_IRQ_DPCI_GUEST_MSI;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                hvm_irq_dpci->mirq[pirq].gmsi.gvec ?:pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gflags =
+                hvm_irq_dpci->mirq[pirq].gmsi.gflags ?:pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], pirq, 0);
@@ -178,6 +182,7 @@ int pt_irq_create_bind_vtd(
             {
                 hvm_irq_dpci->mirq[pirq].gmsi.gflags = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = 0;
                 hvm_irq_dpci->mirq[pirq].flags = 0;
                 clear_bit(pirq, hvm_irq_dpci->mapping);
                 spin_unlock(&d->event_lock);
@@ -195,7 +200,11 @@ int pt_irq_create_bind_vtd(
             }
  
             /* if pirq is already mapped as vmsi, update the guest data/addr */
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                hvm_irq_dpci->mirq[pirq].gmsi.gvec ?:pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gflags =
+                hvm_irq_dpci->mirq[pirq].gmsi.gflags ?:pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
         }
         /* Caculate dest_vcpu_id for MSI-type pirq migration */
@@ -424,14 +433,21 @@ void hvm_dpci_msi_eoi(struct domain *d, 
           pirq = find_next_bit(hvm_irq_dpci->mapping, d->nr_pirqs, pirq + 1) )
     {
         if ( (!(hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_MACH_MSI)) ||
-                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector) )
+                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector &&
+                 hvm_irq_dpci->mirq[pirq].gmsi.old_gvec != vector) )
             continue;
 
-        dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
-        dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec == vector ) {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        } else {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DM_MASK);
+        }
         if ( vlapic_match_dest(vcpu_vlapic(current), NULL, 0, dest, dest_mode) )
             break;
     }
+
     if ( pirq < d->nr_pirqs )
         __msi_pirq_eoi(d, pirq);
     spin_unlock(&d->event_lock);
diff -r 2eba302831c4 -r d1b3cb3fe044 xen/include/xen/hvm/irq.h
--- a/xen/include/xen/hvm/irq.h	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/include/xen/hvm/irq.h	Fri Oct 16 17:20:03 2009 +0800
@@ -58,8 +58,10 @@ struct dev_intx_gsi_link {
 #define GLFAGS_SHIFT_TRG_MODE       15
 
 struct hvm_gmsi_info {
-    uint32_t gvec;
+    uint16_t gvec;
+    uint16_t old_gvec;
     uint32_t gflags;
+    uint32_t old_gflags;
     int dest_vcpu_id; /* -1 :multi-dest, non-negative: dest_vcpu_id */
 };
 


* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:41               ` Keir Fraser
@ 2009-10-16  9:57                 ` Qing He
  2009-10-16  9:58                 ` Zhang, Xiantao
  1 sibling, 0 replies; 55+ messages in thread
From: Qing He @ 2009-10-16  9:57 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Cinco, Dante, xen-devel, Zhang, Xiantao

On Fri, 2009-10-16 at 17:41 +0800, Keir Fraser wrote:
> On 16/10/2009 09:35, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
> 
> >>> That maybe a problem if so. If a malicious/buggy guest won't EOI the
> >>> MSI vector, so host may hang due to lack of timeout mechanism?
> >> 
> >> Why does host hang? Only the assigned interrupt will block, and that's
> >> exactly what the guest wants :-)
> > 
> > Hypervisor shouldn't EOI the real vector until guest EOI the corresponding
> > virtual vector , right ?  Not sure.:-)
> 
> If the EOI is via the local APIC, which I suppose it must be, then a timeout
> fallback probably is required. This is because priorities are assigned
> arbitrarily to guest interrupts, and a non-EOIed interrupt blocks any
> lower-priority interrupts. In particular, some of those could be owned by
> dom0 for example, and be quite critical to forward progress of the entire
> system.

Yeah, I just came to realize it.

Thanks,
Qing


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:41               ` Keir Fraser
  2009-10-16  9:57                 ` Qing He
@ 2009-10-16  9:58                 ` Zhang, Xiantao
  2009-10-16 10:21                   ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16  9:58 UTC (permalink / raw)
  To: Keir Fraser, He, Qing; +Cc: Cinco, Dante, xen-devel

Keir Fraser wrote:
> On 16/10/2009 09:35, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
> 
>>>> That maybe a problem if so. If a malicious/buggy guest won't EOI
>>>> the MSI vector, so host may hang due to lack of timeout mechanism?
>>> 
>>> Why does host hang? Only the assigned interrupt will block, and
>>> that's exactly what the guest wants :-)
>> 
>> Hypervisor shouldn't EOI the real vector until guest EOI the
>> corresponding virtual vector , right ?  Not sure.:-)
> 
> If the EOI is via the local APIC, which I suppose it must be, then a
> timeout fallback probably is required. This is because priorities are
> assigned arbitrarily to guest interrupts, and a non-EOIed interrupt
> blocks any lower-priority interrupts. In particular, some of those
> could be owned by dom0 for example, and be quite critical to forward
> progress of the entire system.

Yeah, exactly my concern. We may need to add a timeout mechanism for each interrupt source to prevent buggy/malicious guests from hanging the host by not writing the EOI.
Xiantao


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:58                 ` Zhang, Xiantao
@ 2009-10-16 10:21                   ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2009-10-16 10:21 UTC (permalink / raw)
  To: Qing He, Xiantao Zhang; +Cc: Dante Cinco, xen-devel, Keir Fraser

>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 16.10.09 11:58 >>>
>Keir Fraser wrote:
>> If the EOI is via the local APIC, which I suppose it must be, then a
>> timeout fallback probably is required. This is because priorities are
>> assigned arbitrarily to guest interrupts, and a non-EOIed interrupt
>> blocks any lower-priority interrupts. In particular, some of those
>> could be owned by dom0 for example, and be quite critical to forward
>> progress of the entire system.
>
>Yeah, exactly same with my concern.  We may need to add the timeout
>mechanism for each interrupt source to avoid that buggy/malicious
>guests hang host through not writing EOI.  

But that's (supposed to be) happening already: If an MSI interrupt is
maskable, the interrupt gets masked and the EOI is sent immediately.
If it's not maskable, a timer gets started to issue the EOI if the guest
doesn't.

Jan


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  9:49                 ` Zhang, Xiantao
@ 2009-10-16 14:54                   ` Zhang, Xiantao
  2009-10-16 18:24                     ` Cinco, Dante
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-16 14:54 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Cinco, Dante, xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2970 bytes --]

Attached is a new one, which should eliminate the race entirely.
Xiantao 

[-- Attachment #2: fix-irq-affinity-msi3.patch --]
[-- Type: application/octet-stream, Size: 5997 bytes --]

# HG changeset patch
# User Xiantao Zhang <xiantao.zhang@intel.com>
# Date 1255684803 -28800
# Node ID d1b3cb3fe044285093c923761d4bc40c7af4d199
# Parent  2eba302831c4534ac40283491f887263c7197b4a
x86: vMSI: Fix msi irq affinity issue for hvm guest.

There is a race between guest setting new vector and doing EOI on old vector.
Once guest sets new vector before its doing EOI on vector, when guest does eoi,
hypervisor may fail to find the related pirq, and hypervisor may miss to EOI real
vector and leads to system hang.  We may need to add a timer for each pirq interrupt
source to avoid host hang, but this is another topic, and will be addressed later.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r 2eba302831c4 xen/arch/x86/hvm/vmsi.c
--- a/xen/arch/x86/hvm/vmsi.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/arch/x86/hvm/vmsi.c	Fri Oct 16 22:10:36 2009 +0800
@@ -92,8 +92,11 @@ int vmsi_deliver(struct domain *d, int p
     case dest_LowestPrio:
     {
         target = vlapic_lowest_prio(d, NULL, 0, dest, dest_mode);
-        if ( target != NULL )
+        if ( target != NULL ) {
             vmsi_inj_irq(d, target, vector, trig_mode, delivery_mode);
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+        }
         else
             HVM_DBG_LOG(DBG_LEVEL_IOAPIC, "null round robin: "
                         "vector=%x delivery_mode=%x\n",
@@ -106,9 +109,12 @@ int vmsi_deliver(struct domain *d, int p
     {
         for_each_vcpu ( d, v )
             if ( vlapic_match_dest(vcpu_vlapic(v), NULL,
-                                   0, dest, dest_mode) )
+                                   0, dest, dest_mode) ) {
                 vmsi_inj_irq(d, vcpu_vlapic(v),
                              vector, trig_mode, delivery_mode);
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+            }
         break;
     }
 
diff -r 2eba302831c4 xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/drivers/passthrough/io.c	Fri Oct 16 21:54:55 2009 +0800
@@ -164,7 +164,9 @@ int pt_irq_create_bind_vtd(
         {
             hvm_irq_dpci->mirq[pirq].flags = HVM_IRQ_DPCI_MACH_MSI |
                                              HVM_IRQ_DPCI_GUEST_MSI;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], pirq, 0);
@@ -178,6 +180,8 @@ int pt_irq_create_bind_vtd(
             {
                 hvm_irq_dpci->mirq[pirq].gmsi.gflags = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = 0;
                 hvm_irq_dpci->mirq[pirq].flags = 0;
                 clear_bit(pirq, hvm_irq_dpci->mapping);
                 spin_unlock(&d->event_lock);
@@ -195,8 +199,14 @@ int pt_irq_create_bind_vtd(
             }
  
             /* if pirq is already mapped as vmsi, update the guest data/addr */
-            hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
-            hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec != pt_irq_bind->u.msi.gvec ) {
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gflags;
+                hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            }
         }
         /* Caculate dest_vcpu_id for MSI-type pirq migration */
         dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
@@ -424,14 +434,21 @@ void hvm_dpci_msi_eoi(struct domain *d, 
           pirq = find_next_bit(hvm_irq_dpci->mapping, d->nr_pirqs, pirq + 1) )
     {
         if ( (!(hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_MACH_MSI)) ||
-                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector) )
+                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector &&
+                 hvm_irq_dpci->mirq[pirq].gmsi.old_gvec != vector) )
             continue;
 
-        dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
-        dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec == vector ) {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        } else {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DM_MASK);
+        }
         if ( vlapic_match_dest(vcpu_vlapic(current), NULL, 0, dest, dest_mode) )
             break;
     }
+
     if ( pirq < d->nr_pirqs )
         __msi_pirq_eoi(d, pirq);
     spin_unlock(&d->event_lock);
diff -r 2eba302831c4 xen/include/xen/hvm/irq.h
--- a/xen/include/xen/hvm/irq.h	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/include/xen/hvm/irq.h	Fri Oct 16 21:48:04 2009 +0800
@@ -58,8 +58,10 @@ struct dev_intx_gsi_link {
 #define GLFAGS_SHIFT_TRG_MODE       15
 
 struct hvm_gmsi_info {
-    uint32_t gvec;
+    uint16_t gvec;
+    uint16_t old_gvec;
     uint32_t gflags;
+    uint32_t old_gflags;
     int dest_vcpu_id; /* -1 :multi-dest, non-negative: dest_vcpu_id */
 };
 


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16 14:54                   ` Zhang, Xiantao
@ 2009-10-16 18:24                     ` Cinco, Dante
  2009-10-17  0:59                       ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-16 18:24 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Keir, xen-devel, Fraser

Xiantao,
I'm still losing the interrupts with your patch, but I see some differences. To simplify the data, I'm only going to focus on the first function of my 4-function PCI device.

After changing the IRQ affinity, the IRQ is no longer masked (unlike before the patch). What stands out for me is that the new vector (219) reported by "guest interrupt information" does not match the vector (187) in dom0 lspci. Before the patch, the new vector in "guest interrupt information" matched the new vector in dom0 lspci (the dest ID in dom0 lspci was unchanged). I also saw this message pop up on the Xen console when I changed smp_affinity:

(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).

187 is the vector from dom0 lspci both before and after the smp_affinity change, but "guest interrupt information" reports that the new vector is 219. To me, this looks like the new MSI message data (with vector=219) did not get written to the PCI device, right?

Here's a comparison before and after changing smp_affinity from ffff to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):

------------------------------------------------------------------------

/proc/irq/48/smp_affinity=ffff (default):

dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)

domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)

qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
                 pt_msi_update: Update msi with pirq 4f gvec 71 gflags 0

Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001, Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.0
             (XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.0
             (XEN) [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5 intx = 0
             (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd: machine_gsi=79 guest_gsi=36, device=5, intx=0
             (XEN) io.c:381:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0

------------------------------------------------------------------------

/proc/irq/48/smp_affinity=2:

dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)

domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)

Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002, Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71 gflags 2
                 pt_msi_update: Update msi with pirq 4f gvec b1 gflags 2

------------------------------------------------------------------------


* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16 18:24                     ` Cinco, Dante
@ 2009-10-17  0:59                       ` Zhang, Xiantao
  2009-10-20  0:19                         ` Cinco, Dante
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-17  0:59 UTC (permalink / raw)
  To: Cinco, Dante, He, Qing; +Cc: xen-devel, Fraser

Dante,
It should be another issue, as you described. Can you try the following code to see whether it works for you? Just a try.
Xiantao

diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
+++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
@@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)
             continue;
         irq = desc - irq_desc;
         ASSERT(MSI_IRQ(irq));
-        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
+        //desc->handler->set_affinity(irq, *cpumask_of(v->processor));
         spin_unlock_irq(&desc->lock);
     }
     spin_unlock(&d->event_lock);

>>>>>> can't EOI the old vector forever in hardware level. Since the 
>>>>>> corresponding vector in real processor can't be EOIed, so system 
>>>>>> may lose all interrupts and result the reported issues 
>>>>>> ultimately.
>>>>> 
>>>>>> But I remembered there should be a timer to handle this case 
>>>>>> through a forcible EOI write to the real processor after timeout, 
>>>>>> but seems it doesn't function in the expected way.
>>>>> 
>>>>> The EOI timer is supposed to deal with the irq sharing problem, 
>>>>> since MSI doesn't share, this timer will not be started in the 
>>>>> case of MSI.
>>>> 
>>>> That maybe a problem if so. If a malicious/buggy guest won't EOI 
>>>> the MSI vector, so host may hang due to lack of timeout mechanism?
>>> 
>>> Why does host hang? Only the assigned interrupt will block, and 
>>> that's exactly what the guest wants :-)
>> 
>> Hypervisor shouldn't EOI the real vector until guest EOI the 
>> corresponding virtual vector , right ?  Not sure.:-)
> 
> Yes, it is the algorithm used today.

So it should be still a problem. If guest won't do eoi, host can't do eoi also, and leads to system hang without timeout mechanism. So we may need to introduce a timer for each MSI interrupt source to avoid hanging host, Keir? 

> After reviewing the code, if the guest really does something like 
> changing affinity within the window between an irq fire and eoi, there 
> is indeed a problem, attached is the patch. Although I kinda doubt it, 
> shouldn't desc->lock in guest protect and make these two operations 
> mutual exclusive.

We shouldn't let hypervisor do real EOI before guest does the correponding virtual EOI, so this patch maybe have a correctness issue. :-)

Attached the fix according to my privious guess, and it should fix the issue. 

Xiantao
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-17  0:59                       ` Zhang, Xiantao
@ 2009-10-20  0:19                         ` Cinco, Dante
  2009-10-20  5:46                           ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-20  0:19 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Keir, xen-devel, Fraser

Xiantao,
With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ smp_affinity to any one-hot value and see the interrupts routed to the specified CPU. Every now and then, though, both domU and dom0 will permanently lock up (cold reboot required) after changing the smp_affinity. If I change it manually from the command line, it seems to be okay, but if I change it within a script (such as shifting-left a walking "1" to test all 16 CPUs), it will lock up part way through the script.
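
For reference, a stand-alone reproducer along those lines could look like the sketch below (this is not the actual script used; the IRQ number, vcpu count and one-second delay are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Walk a one-hot CPU mask across all vcpus by writing to
 * /proc/irq/<irq>/smp_affinity, pausing between steps so the point of
 * any lockup can be observed. */
int main(int argc, char **argv)
{
    int irq = (argc > 1) ? atoi(argv[1]) : 48;    /* e.g. domU IRQ 48 */
    int vcpus = (argc > 2) ? atoi(argv[2]) : 16;
    char path[64];

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    for (int cpu = 0; cpu < vcpus; cpu++) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fprintf(f, "%x\n", 1u << cpu);            /* walking "1" mask */
        fclose(f);
        printf("IRQ %d -> CPU %d (mask %x)\n", irq, cpu, 1u << cpu);
        sleep(1);
    }
    return 0;
}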

Other observations:

The MSI message address/data in dom0 "lspci -vv" stays the same, as does the "guest interrupt information" from the Xen console, even though I see the destination ID and vector change in domU "lspci -vv." You're probably expecting this behavior since you removed the set_affinity call in the last patch.

With vcpus=5, I can only change smp_affinity to 1. Any other value aside from 1 or 1f (the default) results in an instant, permanent lockup of both domU and dom0 (the Xen console is still accessible). I also observed that when I tried changing the smp_affinity of the first function of the 4-function PCI device to 2, the 3rd and 4th functions got masked:

(XEN)    IRQ: 66, IRQ affinity:0x00000001, Vec:186 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)
(XEN)    IRQ: 67, IRQ affinity:0x00000001, Vec:194 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----)
(XEN)    IRQ: 68, IRQ affinity:0x00000001, Vec:202 type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 77(---M)
(XEN)    IRQ: 69, IRQ affinity:0x00000001, Vec:210 type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 76(---M)

In the above log, I had changed the smp_affinity for IRQ 66 but IRQ 68 and 69 got masked.

Dante

-----Original Message-----
From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com] 
Sent: Friday, October 16, 2009 5:59 PM
To: Cinco, Dante; He, Qing
Cc: xen-devel@lists.xensource.com; Fraser; Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

 Dante
 It should be another issue as you described.  Can you try the following code to see whether it works for you ?  Just a try.  
Xiantao

diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
+++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
@@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)
             continue;
         irq = desc - irq_desc;
         ASSERT(MSI_IRQ(irq));
-        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
+        //desc->handler->set_affinity(irq, *cpumask_of(v->processor));
         spin_unlock_irq(&desc->lock);
     }
     spin_unlock(&d->event_lock);

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco, Dante
Sent: Saturday, October 17, 2009 2:24 AM
To: Zhang, Xiantao; He, Qing
Cc: Keir; xen-devel@lists.xensource.com; Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Xiantao,
I'm still losing the interrupts with your patch but I see some differences. To simplifiy the data, I'm only going to focus on the first function of my 4-function PCI device.

After changing the IRQ affinity, the IRQ is not masked anymore (unlike before the patch). What stands out for me is the new vector (219) as reported by "guest interrupt information" does not match the vector (187) in dom0 lspci. Before the patch, the new vector in "guest interrupt information" matched the new vector in dom0 lspci (dest ID in dom0 lspci was unchanged). I also saw this message pop on the Xen console when I changed smp_affinity:

(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).

187 is the vector from dom0 lspci before and after the smp_affinity change but "guest interrupt information" reports the new vector is 219. To me, this looks like the new MSI message data (with vector=219) did not get written into the PCI device, right?

Here's a comparison before and after changing smp_affinity from ffff to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):

------------------------------------------------------------------------

/proc/irq/48/smp_affinity=ffff (default):

dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)

domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)

qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
                 pt_msi_update: Update msi with pirq 4f gvec 71 gflags 0

Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001, Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.0
             (XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.0
             (XEN) [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5 intx = 0
             (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd: machine_gsi=79 guest_gsi=36, device=5, intx=0
             (XEN) io.c:381:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0

------------------------------------------------------------------------

/proc/irq/48/smp_affinity=2:

dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)

domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)

Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002, Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71 gflags 2
                 pt_msi_update: Update msi with pirq 4f gvec b1 gflags 2

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-20  0:19                         ` Cinco, Dante
@ 2009-10-20  5:46                           ` Zhang, Xiantao
  2009-10-20  7:51                             ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-20  5:46 UTC (permalink / raw)
  To: Cinco, Dante, He, Qing; +Cc: Keir, xen-devel, Fraser

Cinco, Dante wrote:
> Xiantao,
> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ
> smp_affinity to any one-hot value and see the interrupts routed to
> the specified CPU. Every now and then though, both domU and dom0 will
> permanently lockup (cold reboot required) after changing the
> smp_affinity. If I change it manually via command-line, it seems to
> be okay but if I change it within a script (such as shifting-left a
> walking "1" to test all 16 CPUs), it will lockup part way through the
> script. 

I can't reproduce the failure on my side after applying the patches, even with a similar script that changes the IRQ's affinity.  Could you share your script with me? 



> Other observations:
> 
> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
> 68 and 69 got masked. 

We can see the warning "No irq handler for vector", but it shouldn't hang the host; it may be related to another potential issue and may need further investigation.  

Xiantao

> -----Original Message-----
> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> Sent: Friday, October 16, 2009 5:59 PM
> To: Cinco, Dante; He, Qing
> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
>  Dante
>  It should be another issue as you described.  Can you try the
> following code to see whether it works for you ?  Just a try. 
> Xiantao
> 
> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)
>              continue;
>          irq = desc - irq_desc;
>          ASSERT(MSI_IRQ(irq));
> -        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
> +        //desc->handler->set_affinity(irq,
>          *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock);
>      }
>      spin_unlock(&d->event_lock);
> 
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco,
> Dante  
> Sent: Saturday, October 17, 2009 2:24 AM
> To: Zhang, Xiantao; He, Qing
> Cc: Keir; xen-devel@lists.xensource.com; Fraser
> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
> Xiantao,
> I'm still losing the interrupts with your patch but I see some
> differences. To simplifiy the data, I'm only going to focus on the
> first function of my 4-function PCI device.  
> 
> After changing the IRQ affinity, the IRQ is not masked anymore
> (unlike before the patch). What stands out for me is the new vector
> (219) as reported by "guest interrupt information" does not match the
> vector (187) in dom0 lspci. Before the patch, the new vector in
> "guest interrupt information" matched the new vector in dom0 lspci
> (dest ID in dom0 lspci was unchanged). I also saw this message pop on
> the Xen console when I changed smp_affinity:      
> 
> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
> 
> 187 is the vector from dom0 lspci before and after the smp_affinity
> change but "guest interrupt information" reports the new vector is
> 219. To me, this looks like the new MSI message data (with
> vector=219) did not get written into the PCI device, right?   
> 
> Here's a comparison before and after changing smp_affinity from ffff
> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1): 
> 
> ------------------------------------------------------------------------
> 
> /proc/irq/48/smp_affinity=ffff (default):
> 
> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
> 
> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
> 
> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>                  pt_msi_update: Update msi with pirq 4f gvec 71
> gflags 0 
> 
> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
> 79(----)  
> 
> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0  
> 
> ------------------------------------------------------------------------
> 
> /proc/irq/48/smp_affinity=2:
> 
> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed
> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged) 
> 
> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed
> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177) 
> 
> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
> 79(----)  
> 
> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>                  gflags 2 pt_msi_update: Update msi with pirq 4f gvec
> b1 gflags 2 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-20  5:46                           ` Zhang, Xiantao
@ 2009-10-20  7:51                             ` Zhang, Xiantao
  2009-10-20 17:26                               ` Cinco, Dante
  2009-10-22  6:46                               ` Jan Beulich
  0 siblings, 2 replies; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-20  7:51 UTC (permalink / raw)
  To: Zhang, Xiantao, Cinco, Dante, He, Qing; +Cc: xen-devel, Fraser

[-- Attachment #1: Type: text/plain, Size: 6370 bytes --]

The two attached patches should fix the issues. For the issue that complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1)", I root-caused it.  Currently, when programming the MSI address & data, Xen doesn't perform the mask/unmask logic needed to avoid inconsistent interrupt generation. In this case, according to the spec, the interrupt generation behavior is undefined, and the device may generate MSI interrupts with the expected vector but an incorrect destination ID, which leads to the issue.  The two attached patches should address it. 
Fix-irq-affinity-msi3.patch:  same as the previous post.
Mask_msi_irq_when_programe_it.patch : mask the irq while programming the MSI registers. 

Xiantao


Zhang, Xiantao wrote:
> Cinco, Dante wrote:
>> Xiantao,
>> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ
>> smp_affinity to any one-hot value and see the interrupts routed to
>> the specified CPU. Every now and then though, both domU and dom0 will
>> permanently lockup (cold reboot required) after changing the
>> smp_affinity. If I change it manually via command-line, it seems to
>> be okay but if I change it within a script (such as shifting-left a
>> walking "1" to test all 16 CPUs), it will lockup part way through the
>> script.
> 
> I can't reproduce the failure at my side after applying the patches
> even with a similar script which changes irq's affinity.  Could you
> share your script with me ?  
> 
> 
> 
>> Other observations:
>> 
>> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
>> 68 and 69 got masked.
> 
> We can see the warning as "No irq handler for vector" but it
> shouldn't hang host, and it maybe related to another potential issue,
> and maybe need further investigation.  
> 
> Xiantao
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>> Sent: Friday, October 16, 2009 5:59 PM
>> To: Cinco, Dante; He, Qing
>> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>>  Dante
>>  It should be another issue as you described.  Can you try the
>> following code to see whether it works for you ?  Just a try. Xiantao
>> 
>> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
>> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
>> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
>> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)          
>>          continue; irq = desc - irq_desc;
>>          ASSERT(MSI_IRQ(irq));
>> -        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
>> +        //desc->handler->set_affinity(irq,
>>          *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock);  
>>      } spin_unlock(&d->event_lock);
>> 
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco,
>> Dante Sent: Saturday, October 17, 2009 2:24 AM
>> To: Zhang, Xiantao; He, Qing
>> Cc: Keir; xen-devel@lists.xensource.com; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>> Xiantao,
>> I'm still losing the interrupts with your patch but I see some
>> differences. To simplifiy the data, I'm only going to focus on the
>> first function of my 4-function PCI device.
>> 
>> After changing the IRQ affinity, the IRQ is not masked anymore
>> (unlike before the patch). What stands out for me is the new vector
>> (219) as reported by "guest interrupt information" does not match the
>> vector (187) in dom0 lspci. Before the patch, the new vector in
>> "guest interrupt information" matched the new vector in dom0 lspci
>> (dest ID in dom0 lspci was unchanged). I also saw this message pop on
>> the Xen console when I changed smp_affinity:
>> 
>> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
>> 
>> 187 is the vector from dom0 lspci before and after the smp_affinity
>> change but "guest interrupt information" reports the new vector is
>> 219. To me, this looks like the new MSI message data (with
>> vector=219) did not get written into the PCI device, right?
>> 
>> Here's a comparison before and after changing smp_affinity from ffff
>> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):
>> 
>> ------------------------------------------------------------------------
>> 
>> /proc/irq/48/smp_affinity=ffff (default):
>> 
>> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
>> 
>> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
>> 
>> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>>                  pt_msi_update: Update msi with pirq 4f gvec 71
>> gflags 0 
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
>> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----) 
>> 
>> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
>> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
>> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
>> 
>> ------------------------------------------------------------------------
>> 
>> /proc/irq/48/smp_affinity=2:
>> 
>> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed
>> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)
>> 
>> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed
>> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
>> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----) 
>> 
>> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>>                  gflags 2 pt_msi_update: Update msi with pirq 4f gvec
>> b1 gflags 2
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


[-- Attachment #2: mask_msi_irq_when_programe_it.patch --]
[-- Type: application/octet-stream, Size: 1762 bytes --]

# HG changeset patch
# User Xiantao Zhang <xiantao.zhang@intel.com>
# Date 1256023207 -28800
# Node ID dcfa6155b692a3f7ebfd2dc7db0335502f72c698
# Parent  7c65f0cdb1903ae1e3b8ecb9da5cf51699098ee3
x86: MSI: Mask/unmask the MSI irq during the window that
programs the MSI registers.

When programming MSI, it has to be masked first; otherwise it
may generate inconsistent interrupts. According to the spec,
if not masked, the interrupt generation behavior is undefined.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r 7c65f0cdb190 -r dcfa6155b692 xen/arch/x86/msi.c
--- a/xen/arch/x86/msi.c    Tue Oct 20 15:02:10 2009 +0800
+++ b/xen/arch/x86/msi.c    Tue Oct 20 15:20:07 2009 +0800
@@ -231,6 +231,7 @@ static void write_msi_msg(struct msi_des
         u8 slot = PCI_SLOT(dev->devfn);
         u8 func = PCI_FUNC(dev->devfn);
 
+        mask_msi_irq(entry->irq);
         pci_conf_write32(bus, slot, func, msi_lower_address_reg(pos),
                          msg->address_lo);
         if ( entry->msi_attrib.is_64 )
@@ -243,6 +244,7 @@ static void write_msi_msg(struct msi_des
         else
             pci_conf_write16(bus, slot, func, msi_data_reg(pos, 0),
                              msg->data);
+        unmask_msi_irq(entry->irq);
         break;
     }
     case PCI_CAP_ID_MSIX:
@@ -250,11 +252,13 @@ static void write_msi_msg(struct msi_des
         void __iomem *base;
         base = entry->mask_base;
 
+        mask_msi_irq(entry->irq);
         writel(msg->address_lo,
                base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
         writel(msg->address_hi,
                base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
         writel(msg->data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
+        unmask_msi_irq(entry->irq);
         break;
     }
     default:

[-- Attachment #3: fix-irq-affinity-msi3.patch --]
[-- Type: application/octet-stream, Size: 5997 bytes --]

# HG changeset patch
# User Xiantao Zhang <xiantao.zhang@intel.com>
# Date 1255684803 -28800
# Node ID d1b3cb3fe044285093c923761d4bc40c7af4d199
# Parent  2eba302831c4534ac40283491f887263c7197b4a
x86: vMSI: Fix msi irq affinity issue for hvm guest.

There is a race between the guest setting a new vector and doing the EOI on the old vector.
If the guest sets the new vector before it EOIs the old one, then when the guest does the EOI,
the hypervisor may fail to find the related pirq, miss the EOI of the real
vector, and hang the system.  We may need to add a timer for each pirq interrupt
source to avoid hanging the host, but that is another topic and will be addressed later.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r 2eba302831c4 xen/arch/x86/hvm/vmsi.c
--- a/xen/arch/x86/hvm/vmsi.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/arch/x86/hvm/vmsi.c	Fri Oct 16 22:10:36 2009 +0800
@@ -92,8 +92,11 @@ int vmsi_deliver(struct domain *d, int p
     case dest_LowestPrio:
     {
         target = vlapic_lowest_prio(d, NULL, 0, dest, dest_mode);
-        if ( target != NULL )
+        if ( target != NULL ) {
             vmsi_inj_irq(d, target, vector, trig_mode, delivery_mode);
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+        }
         else
             HVM_DBG_LOG(DBG_LEVEL_IOAPIC, "null round robin: "
                         "vector=%x delivery_mode=%x\n",
@@ -106,9 +109,12 @@ int vmsi_deliver(struct domain *d, int p
     {
         for_each_vcpu ( d, v )
             if ( vlapic_match_dest(vcpu_vlapic(v), NULL,
-                                   0, dest, dest_mode) )
+                                   0, dest, dest_mode) ) {
                 vmsi_inj_irq(d, vcpu_vlapic(v),
                              vector, trig_mode, delivery_mode);
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+            }
         break;
     }
 
diff -r 2eba302831c4 xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/drivers/passthrough/io.c	Fri Oct 16 21:54:55 2009 +0800
@@ -164,7 +164,9 @@ int pt_irq_create_bind_vtd(
         {
             hvm_irq_dpci->mirq[pirq].flags = HVM_IRQ_DPCI_MACH_MSI |
                                              HVM_IRQ_DPCI_GUEST_MSI;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], pirq, 0);
@@ -178,6 +180,8 @@ int pt_irq_create_bind_vtd(
             {
                 hvm_irq_dpci->mirq[pirq].gmsi.gflags = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = 0;
                 hvm_irq_dpci->mirq[pirq].flags = 0;
                 clear_bit(pirq, hvm_irq_dpci->mapping);
                 spin_unlock(&d->event_lock);
@@ -195,8 +199,14 @@ int pt_irq_create_bind_vtd(
             }
  
             /* if pirq is already mapped as vmsi, update the guest data/addr */
-            hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
-            hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec != pt_irq_bind->u.msi.gvec ) {
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gflags;
+                hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            }
         }
         /* Caculate dest_vcpu_id for MSI-type pirq migration */
         dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
@@ -424,14 +434,21 @@ void hvm_dpci_msi_eoi(struct domain *d, 
           pirq = find_next_bit(hvm_irq_dpci->mapping, d->nr_pirqs, pirq + 1) )
     {
         if ( (!(hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_MACH_MSI)) ||
-                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector) )
+                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector &&
+                 hvm_irq_dpci->mirq[pirq].gmsi.old_gvec != vector) )
             continue;
 
-        dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
-        dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec == vector ) {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        } else {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DM_MASK);
+        }
         if ( vlapic_match_dest(vcpu_vlapic(current), NULL, 0, dest, dest_mode) )
             break;
     }
+
     if ( pirq < d->nr_pirqs )
         __msi_pirq_eoi(d, pirq);
     spin_unlock(&d->event_lock);
diff -r 2eba302831c4 xen/include/xen/hvm/irq.h
--- a/xen/include/xen/hvm/irq.h	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/include/xen/hvm/irq.h	Fri Oct 16 21:48:04 2009 +0800
@@ -58,8 +58,10 @@ struct dev_intx_gsi_link {
 #define GLFAGS_SHIFT_TRG_MODE       15
 
 struct hvm_gmsi_info {
-    uint32_t gvec;
+    uint16_t gvec;
+    uint16_t old_gvec;
     uint32_t gflags;
+    uint32_t old_gflags;
     int dest_vcpu_id; /* -1 :multi-dest, non-negative: dest_vcpu_id */
 };
 

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-20  7:51                             ` Zhang, Xiantao
@ 2009-10-20 17:26                               ` Cinco, Dante
  2009-10-21  1:10                                 ` Zhang, Xiantao
  2009-10-22  6:46                               ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-20 17:26 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Keir, xen-devel, Fraser

Xiantao,
With the latest patches (Fix-irq-affinity-msi3.patch, Mask_msi_irq_when_programe_it.patch), should I still apply the previous patch which removes "desc->handler->set_affinity(irq, *cpumask_of(v->processor))", or was that just a one-time experiment that should now be discarded?
Dante

-----Original Message-----
From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com] 
Sent: Tuesday, October 20, 2009 12:51 AM
To: Zhang, Xiantao; Cinco, Dante; He, Qing
Cc: xen-devel@lists.xensource.com; Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Attached two patches should fix the issues. For the issue which complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),", I root-caused it.  Currenlty, when programs MSI address & data, Xen doesn't perform the mask/unmask logic to avoid inconsistent interrupt genernation. In this case, according to spec, the interrupt generation behavior is undfined, and device may generate MSI interrupts with the expected vector and incorrect destination ID, so leads to the issue.  The attached two patches should address it. 
Fix-irq-affinity-msi3.patch:  same with the previous post.
Mask_msi_irq_when_programe_it.patch : disable irq when program msi. 

Xiantao


Zhang, Xiantao wrote:
> Cinco, Dante wrote:
>> Xiantao,
>> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ 
>> smp_affinity to any one-hot value and see the interrupts routed to 
>> the specified CPU. Every now and then though, both domU and dom0 will 
>> permanently lockup (cold reboot required) after changing the 
>> smp_affinity. If I change it manually via command-line, it seems to 
>> be okay but if I change it within a script (such as shifting-left a 
>> walking "1" to test all 16 CPUs), it will lockup part way through the 
>> script.
> 
> I can't reproduce the failure at my side after applying the patches 
> even with a similar script which changes irq's affinity.  Could you 
> share your script with me ?
> 
> 
> 
>> Other observations:
>> 
>> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
>> 68 and 69 got masked.
> 
> We can see the warning as "No irq handler for vector" but it shouldn't 
> hang host, and it maybe related to another potential issue, and maybe 
> need further investigation.
> 
> Xiantao
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>> Sent: Friday, October 16, 2009 5:59 PM
>> To: Cinco, Dante; He, Qing
>> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>>  Dante
>>  It should be another issue as you described.  Can you try the 
>> following code to see whether it works for you ?  Just a try. Xiantao
>> 
>> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
>> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
>> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
>> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)          
>>          continue; irq = desc - irq_desc;
>>          ASSERT(MSI_IRQ(irq));
>> -        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
>> +        //desc->handler->set_affinity(irq,
>>          *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock);  
>>      } spin_unlock(&d->event_lock);
>> 
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco, 
>> Dante Sent: Saturday, October 17, 2009 2:24 AM
>> To: Zhang, Xiantao; He, Qing
>> Cc: Keir; xen-devel@lists.xensource.com; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>> Xiantao,
>> I'm still losing the interrupts with your patch but I see some 
>> differences. To simplifiy the data, I'm only going to focus on the 
>> first function of my 4-function PCI device.
>> 
>> After changing the IRQ affinity, the IRQ is not masked anymore 
>> (unlike before the patch). What stands out for me is the new vector
>> (219) as reported by "guest interrupt information" does not match the 
>> vector (187) in dom0 lspci. Before the patch, the new vector in 
>> "guest interrupt information" matched the new vector in dom0 lspci 
>> (dest ID in dom0 lspci was unchanged). I also saw this message pop on 
>> the Xen console when I changed smp_affinity:
>> 
>> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
>> 
>> 187 is the vector from dom0 lspci before and after the smp_affinity 
>> change but "guest interrupt information" reports the new vector is 
>> 219. To me, this looks like the new MSI message data (with
>> vector=219) did not get written into the PCI device, right?
>> 
>> Here's a comparison before and after changing smp_affinity from ffff 
>> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=ffff (default):
>> 
>> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
>> 
>> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
>> 
>> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>>                  pt_msi_update: Update msi with pirq 4f gvec 71 
>> gflags 0
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
>> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
>> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
>> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=2:
>> 
>> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed 
>> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)
>> 
>> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed 
>> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
>> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>>                  gflags 2 pt_msi_update: Update msi with pirq 4f gvec
>> b1 gflags 2
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-20 17:26                               ` Cinco, Dante
@ 2009-10-21  1:10                                 ` Zhang, Xiantao
  2009-10-22  1:00                                   ` Cinco, Dante
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-21  1:10 UTC (permalink / raw)
  To: Cinco, Dante, He, Qing; +Cc: Keir, xen-devel, Fraser

You only need to apply the two patches; the previous one should be discarded. 
Xiantao 

-----Original Message-----
From: Cinco, Dante [mailto:Dante.Cinco@lsi.com] 
Sent: Wednesday, October 21, 2009 1:27 AM
To: Zhang, Xiantao; He, Qing
Cc: xen-devel@lists.xensource.com; Keir Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Xintao,
With the latest patch (Fix-irq-affinity-msi3.patch, Mask_msi_irq_when_programe_it.patch), should I still apply the previous patch with removes "desc->handler->set_affinity(irq, *cpumask_of(v->processor))" or was that just a one-time experiment that should now be discarded?
Dante

-----Original Message-----
From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com] 
Sent: Tuesday, October 20, 2009 12:51 AM
To: Zhang, Xiantao; Cinco, Dante; He, Qing
Cc: xen-devel@lists.xensource.com; Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Attached two patches should fix the issues. For the issue which complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),", I root-caused it.  Currenlty, when programs MSI address & data, Xen doesn't perform the mask/unmask logic to avoid inconsistent interrupt genernation. In this case, according to spec, the interrupt generation behavior is undfined, and device may generate MSI interrupts with the expected vector and incorrect destination ID, so leads to the issue.  The attached two patches should address it. 
Fix-irq-affinity-msi3.patch:  same with the previous post.
Mask_msi_irq_when_programe_it.patch : disable irq when program msi. 

Xiantao


Zhang, Xiantao wrote:
> Cinco, Dante wrote:
>> Xiantao,
>> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ 
>> smp_affinity to any one-hot value and see the interrupts routed to 
>> the specified CPU. Every now and then though, both domU and dom0 will 
>> permanently lockup (cold reboot required) after changing the 
>> smp_affinity. If I change it manually via command-line, it seems to 
>> be okay but if I change it within a script (such as shifting-left a 
>> walking "1" to test all 16 CPUs), it will lockup part way through the 
>> script.
> 
> I can't reproduce the failure at my side after applying the patches 
> even with a similar script which changes irq's affinity.  Could you 
> share your script with me ?
> 
> 
> 
>> Other observations:
>> 
>> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
>> 68 and 69 got masked.
> 
> We can see the warning as "No irq handler for vector" but it shouldn't 
> hang host, and it maybe related to another potential issue, and maybe 
> need further investigation.
> 
> Xiantao
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>> Sent: Friday, October 16, 2009 5:59 PM
>> To: Cinco, Dante; He, Qing
>> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>>  Dante
>>  It should be another issue as you described.  Can you try the 
>> following code to see whether it works for you ?  Just a try. Xiantao
>> 
>> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
>> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
>> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
>> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)          
>>          continue; irq = desc - irq_desc;
>>          ASSERT(MSI_IRQ(irq));
>> -        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
>> +        //desc->handler->set_affinity(irq,
>>          *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock);  
>>      } spin_unlock(&d->event_lock);
>> 
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco, 
>> Dante Sent: Saturday, October 17, 2009 2:24 AM
>> To: Zhang, Xiantao; He, Qing
>> Cc: Keir; xen-devel@lists.xensource.com; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>> Xiantao,
>> I'm still losing the interrupts with your patch but I see some 
>> differences. To simplifiy the data, I'm only going to focus on the 
>> first function of my 4-function PCI device.
>> 
>> After changing the IRQ affinity, the IRQ is not masked anymore 
>> (unlike before the patch). What stands out for me is the new vector
>> (219) as reported by "guest interrupt information" does not match the 
>> vector (187) in dom0 lspci. Before the patch, the new vector in 
>> "guest interrupt information" matched the new vector in dom0 lspci 
>> (dest ID in dom0 lspci was unchanged). I also saw this message pop on 
>> the Xen console when I changed smp_affinity:
>> 
>> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
>> 
>> 187 is the vector from dom0 lspci before and after the smp_affinity 
>> change but "guest interrupt information" reports the new vector is 
>> 219. To me, this looks like the new MSI message data (with
>> vector=219) did not get written into the PCI device, right?
>> 
>> Here's a comparison before and after changing smp_affinity from ffff 
>> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=ffff (default):
>> 
>> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
>> 
>> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
>> 
>> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>>                  pt_msi_update: Update msi with pirq 4f gvec 71 
>> gflags 0
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
>> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
>> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
>> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=2:
>> 
>> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed 
>> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)
>> 
>> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed 
>> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
>> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>>                  gflags 2 pt_msi_update: Update msi with pirq 4f gvec
>> b1 gflags 2
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-21  1:10                                 ` Zhang, Xiantao
@ 2009-10-22  1:00                                   ` Cinco, Dante
  2009-10-22  1:58                                     ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-22  1:00 UTC (permalink / raw)
  To: Zhang, Xiantao, He, Qing; +Cc: Keir, xen-devel, Fraser

After adding a lot of dprintk's in the code (xen/arch/x86/msi.c, irq.c, pci.c, traps.c), I found out why I'm getting the message "do_IRQ: 1.186 No irq handler for vector (irq -1)." Some time after the new MSI message address (dest ID) and data (vector) were written to the PCI device, something or somebody called guest_io_write(), which overwrote the new vector (218) with the old vector (186).

I added an extra read_msi_msg() after write_msi_msg() just to make sure that the new MSI message address and data were actually written to the PCI device. I added some code in pci_conf_write() and pci_conf_read() to print the "cf8" and data if a write/read is targeted at the bus/dev/func/reg of the PCI device. One of my questions is: where did the old vector (186) come from? What data structure did guest_io_write() get the 186 from?
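
A sketch of that kind of instrumentation (the helper name and the bus 7 / dev 0 filter are illustrative, not the exact code that was added), called at the top of pci_conf_write() and pci_conf_read():

#include <xen/lib.h>
#include <xen/types.h>

/* Log config-space accesses that target the passed-through device
 * (bus 7, dev 0 here), mirroring the pci_conf_write/pci_conf_read
 * trace lines in the data below. */
static void trace_pci_cfg_access(const char *who, uint32_t cf8,
                                 unsigned int offset, unsigned int bytes,
                                 uint32_t data)
{
    unsigned int bus = (cf8 >> 16) & 0xff;
    unsigned int dev = (cf8 >> 11) & 0x1f;

    if ( bus == 7 && dev == 0 )   /* the 4-function device at 07:00.x */
        dprintk(XENLOG_INFO, "%s::cf8=%#x,offset=%u,bytes=%u,data=%#x\n",
                who, cf8, offset, bytes, data);
}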

I hope this data will help get to the bottom of this IRQ SMP affinity problem.

Dante

-------------------------------------------- BEGIN DATA

cat /proc/irq/48/smp_affinity 
ffff

(XEN) Guest interrupt information:
(XEN) IRQ: 66, IRQ affinity:0x00000001, Vec:186 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

dom0 lspci -vv -s 0:07:0.0 | grep Address
                Address: 00000000fee00000  Data: 40ba (dest ID=0 APIC ID of CPU0, vector=186)

domU lspci -vv -s 00:05.0 | grep IRQ
        Interrupt: pin A routed to IRQ 48
domU lspci -vv -s 00:05.0 | grep Address
                Address: 00000000fee00000  Data: 4071

--------------------------------------------

domU: echo 2 > /proc/irq/48/smp_affinity

(XEN) irq.c:415: __assign_irq_vector::irq=66,old_vector=186,cfg->vector=218,cpu=1
(XEN) hvm.c:248: hvm_migrate_pirqs::irq=66, v->processor=1
(XEN) io_apic.c:339: set_desc_affinity::irq=66,apicid=16,vector=218
(XEN) msi.c:270: write_msi_msg::msg->address_lo=0xfee10000,msg->data=0x40da
(XEN) pci.c:53: pci_conf_write::cf8=0x80070064,offset=0,bytes=4,data=0xfee10000 (MSI message address low, dest ID)
(XEN) pci.c:53: pci_conf_write::cf8=0x80070068,offset=0,bytes=4,data=0x0        (MSI message address high, 64-bit)
(XEN) pci.c:53: pci_conf_write::cf8=0x8007006c,offset=0,bytes=2,data=0x40da     (MSI message data, vector)
(XEN) pci.c:42: pci_conf_read::cf8=0x80070064,offset=0,bytes=4,value=0xfee10000
(XEN) pci.c:42: pci_conf_read::cf8=0x80070068,offset=0,bytes=4,value=0x0
(XEN) pci.c:42: pci_conf_read::cf8=0x8007006c,offset=0,bytes=2,value=0x40da
(XEN) msi.c:204: read_msi_msg::msg->address_lo=0xfee10000,msg->data=0x40da
(XEN) traps.c:1626: guest_io_write::pci_conf_write data=0x40ba                  <<<<<<<<<< culprit
(XEN) pci.c:53: pci_conf_write::cf8=0x8007006c,offset=0,bytes=2,data=0x40ba     <<<<<<<<<< vector reverted back to 186
(XEN) do_IRQ: 1.186 No irq handler for vector (irq -1)                          <<<<<<<<<< can't find handler because vector should have been 218

(XEN) Guest interrupt information:
(XEN) IRQ: 66, IRQ affinity:0x00000002, Vec:218 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 79(----)

dom0 lspci -vv -s 0:07:0.0 | grep Address 
                Address: 00000000fee10000  Data: 40ba (dest ID=16 APIC ID of CPU1, vector=186)

domU lspci -vv -s 00:05.0 | grep Address
                Address: 00000000fee02000  Data: 40b1

I followed the call hierarchy for guest_io_write() as far as I can:

do_page_fault
  fixup_page_fault
    handle_gdt_ldt_mapping_fault
      do_general_protection
        emulate_privileged_op
          guest_io_write

-------------------------------------------- END DATA
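
As a side note, the dest ID / vector annotations in the data above can be reproduced with a small decoder like this (a sketch; for these xAPIC-style messages the destination APIC ID sits in bits 12-19 of the MSI address and the vector in bits 0-7 of the MSI data):

#include <stdio.h>
#include <stdint.h>

/* Decode an MSI message address/data pair as printed by lspci. */
static void decode_msi(uint32_t addr, uint16_t data)
{
    unsigned int dest_id = (addr >> 12) & 0xff;  /* destination APIC ID */
    unsigned int vector  = data & 0xff;          /* interrupt vector */
    printf("addr=0x%08x data=0x%04x -> dest ID %u, vector %u\n",
           addr, data, dest_id, vector);
}

int main(void)
{
    decode_msi(0xfee00000, 0x40ba);  /* dest ID 0,  vector 186 */
    decode_msi(0xfee10000, 0x40ba);  /* dest ID 16, vector 186 */
    decode_msi(0xfee02000, 0x40b1);  /* dest ID 2,  vector 177 */
    return 0;
}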

-----Original Message-----
From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com] 
Sent: Tuesday, October 20, 2009 6:11 PM
To: Cinco, Dante; He, Qing
Cc: xen-devel@lists.xensource.com; Keir Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Only need to apply the two patches and the previous one should be discarded. 
Xiantao 

-----Original Message-----
From: Cinco, Dante [mailto:Dante.Cinco@lsi.com]
Sent: Wednesday, October 21, 2009 1:27 AM
To: Zhang, Xiantao; He, Qing
Cc: xen-devel@lists.xensource.com; Keir Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Xintao,
With the latest patch (Fix-irq-affinity-msi3.patch, Mask_msi_irq_when_programe_it.patch), should I still apply the previous patch with removes "desc->handler->set_affinity(irq, *cpumask_of(v->processor))" or was that just a one-time experiment that should now be discarded?
Dante

-----Original Message-----
From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
Sent: Tuesday, October 20, 2009 12:51 AM
To: Zhang, Xiantao; Cinco, Dante; He, Qing
Cc: xen-devel@lists.xensource.com; Fraser
Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

Attached two patches should fix the issues. For the issue which complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),", I root-caused it.  Currenlty, when programs MSI address & data, Xen doesn't perform the mask/unmask logic to avoid inconsistent interrupt genernation. In this case, according to spec, the interrupt generation behavior is undfined, and device may generate MSI interrupts with the expected vector and incorrect destination ID, so leads to the issue.  The attached two patches should address it. 
Fix-irq-affinity-msi3.patch:  same with the previous post.
Mask_msi_irq_when_programe_it.patch : disable irq when program msi. 

Xiantao


Zhang, Xiantao wrote:
> Cinco, Dante wrote:
>> Xiantao,
>> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ 
>> smp_affinity to any one-hot value and see the interrupts routed to 
>> the specified CPU. Every now and then though, both domU and dom0 will 
>> permanently lockup (cold reboot required) after changing the 
>> smp_affinity. If I change it manually via command-line, it seems to 
>> be okay but if I change it within a script (such as shifting-left a 
>> walking "1" to test all 16 CPUs), it will lockup part way through the 
>> script.
> 
> I can't reproduce the failure at my side after applying the patches 
> even with a similar script which changes irq's affinity.  Could you 
> share your script with me ?
> 
> 
> 
>> Other observations:
>> 
>> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
>> 68 and 69 got masked.
> 
> We can see the warning as "No irq handler for vector" but it shouldn't 
> hang host, and it maybe related to another potential issue, and maybe 
> need further investigation.
> 
> Xiantao
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>> Sent: Friday, October 16, 2009 5:59 PM
>> To: Cinco, Dante; He, Qing
>> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>>  Dante
>>  It should be another issue as you described.  Can you try the 
>> following code to see whether it works for you ?  Just a try. Xiantao
>> 
>> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
>> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
>> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
>> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)          
>>          continue; irq = desc - irq_desc;
>>          ASSERT(MSI_IRQ(irq));
>> -        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
>> +        //desc->handler->set_affinity(irq,
>>          *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock);  
>>      } spin_unlock(&d->event_lock);
>> 
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco, 
>> Dante Sent: Saturday, October 17, 2009 2:24 AM
>> To: Zhang, Xiantao; He, Qing
>> Cc: Keir; xen-devel@lists.xensource.com; Fraser
>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>> 
>> Xiantao,
>> I'm still losing the interrupts with your patch but I see some 
>> differences. To simplifiy the data, I'm only going to focus on the 
>> first function of my 4-function PCI device.
>> 
>> After changing the IRQ affinity, the IRQ is not masked anymore 
>> (unlike before the patch). What stands out for me is the new vector
>> (219) as reported by "guest interrupt information" does not match the 
>> vector (187) in dom0 lspci. Before the patch, the new vector in 
>> "guest interrupt information" matched the new vector in dom0 lspci 
>> (dest ID in dom0 lspci was unchanged). I also saw this message pop on 
>> the Xen console when I changed smp_affinity:
>> 
>> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
>> 
>> 187 is the vector from dom0 lspci before and after the smp_affinity 
>> change but "guest interrupt information" reports the new vector is 
>> 219. To me, this looks like the new MSI message data (with
>> vector=219) did not get written into the PCI device, right?
>> 
>> Here's a comparison before and after changing smp_affinity from ffff 
>> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=ffff (default):
>> 
>> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
>> 
>> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
>> 
>> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>>                  pt_msi_update: Update msi with pirq 4f gvec 71 
>> gflags 0
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
>> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
>> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
>> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
>> 
>> ---------------------------------------------------------------------
>> ---
>> 
>> /proc/irq/48/smp_affinity=2:
>> 
>> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed 
>> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)
>> 
>> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed 
>> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)
>> 
>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
>> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>> 79(----)
>> 
>> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>>                  gflags 2 pt_msi_update: Update msi with pirq 4f gvec
>> b1 gflags 2
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  1:00                                   ` Cinco, Dante
@ 2009-10-22  1:58                                     ` Zhang, Xiantao
  2009-10-22  2:42                                       ` Zhang, Xiantao
  2009-10-22  5:10                                       ` Qing He
  0 siblings, 2 replies; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-22  1:58 UTC (permalink / raw)
  To: Cinco, Dante, He, Qing; +Cc: xen-devel, Fraser

Dante, 
   Have you applied the two patches when you did the testing?   Without them, we can reproduce the issue you reported, but with them the issue is gone.  The root cause is that when programming MSI, we have to mask the MSI interrupt source first; otherwise the device may generate inconsistent interrupts with an incorrect destination and the right vector, or an incorrect vector and the right destination.

For example, suppose the old MSI interrupt info is 0.186, meaning the destination ID is 0 and the vector is 186. When the IRQ migrates to another CPU (e.g. CPU 1), the MSI info should change to 1.194. If the MSI info is programmed into the PCI device without masking it first, the device may generate the interrupt as 1.186 or 0.194. Obviously, interrupts with the info 1.186 or 0.194 do not exist, yet according to the spec any combination is possible. Since Xen writes the address field first, it is likely to generate 1.186 rather than 0.194, so your PCI device may generate an interrupt with the new destination and the old vector (1.186).    Of my two patches, one fixes the guest interrupt affinity issue (a race exists between the guest EOIing the old vector and the guest setting the new vector), and the other safely programs the MSI info into the PCI device to avoid inconsistent interrupt generation.
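
To make the hazard window concrete, here is a minimal hypothetical sketch of the unmasked two-step update. The constants and parameters are made up for illustration; only the write ordering matters:

    /* Hypothetical illustration of the race described above.
     * Old message: destination 0, vector 186 (0xba).
     * New message: destination 1, vector 194 (0xc2). */
    #define NEW_MSI_ADDR_LO  0xfee01000u   /* encodes the new destination ID */
    #define NEW_MSI_DATA     0x40c2u       /* low byte = new vector 194 */

    static void reprogram_msi_unmasked(u8 bus, u8 slot, u8 func, unsigned int pos)
    {
        /* Step 1: the address register is written first (new destination). */
        pci_conf_write32(bus, slot, func, msi_lower_address_reg(pos),
                         NEW_MSI_ADDR_LO);

        /* If the device raises an interrupt in this window it sends
         * new destination + old vector, i.e. 1.186 -- the combination
         * behind the "No irq handler for vector" console messages. */

        /* Step 2: only now does the data register get the new vector. */
        pci_conf_write16(bus, slot, func, msi_data_reg(pos, 0), NEW_MSI_DATA);
    }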

> (XEN) traps.c:1626: guest_io_write::pci_conf_write data=0x40ba 

This write should come from dom0 (most likely from Qemu).  If it does, we may have to prohibit such unsafe MSI writes in Qemu.  

Xiantao

     
> <<<<<<<<<< culprit (XEN) pci.c:53:
> pci_conf_write::cf8=0x8007006c,offset=0,bytes=2,data=0x40ba    
> <<<<<<<<<< vector reverted back to 186 (XEN) do_IRQ: 1.186 No irq
> handler for vector (irq -1)                          <<<<<<<<<< can't
> find handler because vector should have been 218         
> 
> (XEN) Guest interrupt information:
> (XEN) IRQ: 66, IRQ affinity:0x00000002, Vec:218 type=PCI-MSI
> status=00000010 in-flight=0 domain-list=1: 79(----) 
> 
> dom0 lspci -vv -s 0:07:0.0 | grep Address
>                 Address: 00000000fee10000  Data: 40ba (dest ID=16
> APIC ID of CPU1, vector=186) 
> 
> domU lspci -vv -s 00:05.0 | grep Address
>                 Address: 00000000fee02000  Data: 40b1
> 
> I followed the call hierarchy for guest_io_write() as far as I can:
> 
> do_page_fault
>   fixup_page_fault
>     handle_gdt_ldt_mapping_fault
>       do_general_protection
>         emulate_privileged_op
>           guest_io_write
> 
> -------------------------------------------- END DATA
> 
> -----Original Message-----
> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> Sent: Tuesday, October 20, 2009 6:11 PM
> To: Cinco, Dante; He, Qing
> Cc: xen-devel@lists.xensource.com; Keir Fraser
> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
> Only need to apply the two patches and the previous one should be
> discarded. 
> Xiantao
> 
> -----Original Message-----
> From: Cinco, Dante [mailto:Dante.Cinco@lsi.com]
> Sent: Wednesday, October 21, 2009 1:27 AM
> To: Zhang, Xiantao; He, Qing
> Cc: xen-devel@lists.xensource.com; Keir Fraser
> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
> Xintao,
> With the latest patch (Fix-irq-affinity-msi3.patch,
> Mask_msi_irq_when_programe_it.patch), should I still apply the
> previous patch with removes "desc->handler->set_affinity(irq,
> *cpumask_of(v->processor))" or was that just a one-time experiment
> that should now be discarded?    
> Dante
> 
> -----Original Message-----
> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> Sent: Tuesday, October 20, 2009 12:51 AM
> To: Zhang, Xiantao; Cinco, Dante; He, Qing
> Cc: xen-devel@lists.xensource.com; Fraser
> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
> Attached two patches should fix the issues. For the issue which
> complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),",
> I root-caused it.  Currenlty, when programs MSI address & data, Xen
> doesn't perform the mask/unmask logic to avoid inconsistent interrupt
> genernation. In this case, according to spec, the interrupt
> generation behavior is undfined, and device may generate MSI
> interrupts with the expected vector and incorrect destination ID, so
> leads to the issue.  The attached two patches should address it.
> Fix-irq-affinity-msi3.patch:  same with the previous post.       
> Mask_msi_irq_when_programe_it.patch : disable irq when program msi.
> 
> Xiantao
> 
> 
> Zhang, Xiantao wrote:
>> Cinco, Dante wrote:
>>> Xiantao,
>>> With vcpus=16 (all CPUs) in domU, I'm able to change the IRQ
>>> smp_affinity to any one-hot value and see the interrupts routed to
>>> the specified CPU. Every now and then though, both domU and dom0
>>> will permanently lockup (cold reboot required) after changing the
>>> smp_affinity. If I change it manually via command-line, it seems to
>>> be okay but if I change it within a script (such as shifting-left a
>>> walking "1" to test all 16 CPUs), it will lockup part way through
>>> the script.
>> 
>> I can't reproduce the failure at my side after applying the patches
>> even with a similar script which changes irq's affinity.  Could you
>> share your script with me ? 
>> 
>> 
>> 
>>> Other observations:
>>> 
>>> In the above log, I had changed the smp_affinity for IRQ 66 but IRQ
>>> 68 and 69 got masked.
>> 
>> We can see the warning as "No irq handler for vector" but it
>> shouldn't hang host, and it maybe related to another potential
>> issue, and maybe need further investigation. 
>> 
>> Xiantao
>> 
>>> -----Original Message-----
>>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>>> Sent: Friday, October 16, 2009 5:59 PM
>>> To: Cinco, Dante; He, Qing
>>> Cc: xen-devel@lists.xensource.com; Fraser; Fraser
>>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with
>>> vcpus 
>>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>>> 
>>>  Dante
>>>  It should be another issue as you described.  Can you try the
>>> following code to see whether it works for you ?  Just a try.
>>> Xiantao 
>>> 
>>> diff -r 0705efd9c69e xen/arch/x86/hvm/hvm.c
>>> --- a/xen/arch/x86/hvm/hvm.c    Fri Oct 16 09:04:53 2009 +0100
>>> +++ b/xen/arch/x86/hvm/hvm.c    Sat Oct 17 08:48:23 2009 +0800
>>> @@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)
>>>          continue; irq = desc - irq_desc;
>>>          ASSERT(MSI_IRQ(irq));
>>> -        desc->handler->set_affinity(irq,
>>> *cpumask_of(v->processor)); +       
>>>          //desc->handler->set_affinity(irq,
>>>      *cpumask_of(v->processor)); spin_unlock_irq(&desc->lock); }
>>> spin_unlock(&d->event_lock); 
>>> 
>>> -----Original Message-----
>>> From: xen-devel-bounces@lists.xensource.com
>>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Cinco,
>>> Dante Sent: Saturday, October 17, 2009 2:24 AM
>>> To: Zhang, Xiantao; He, Qing
>>> Cc: Keir; xen-devel@lists.xensource.com; Fraser
>>> Subject: RE: [Xen-devel] IRQ SMP affinity problems in domU with
>>> vcpus 
>>>> 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
>>> 
>>> Xiantao,
>>> I'm still losing the interrupts with your patch but I see some
>>> differences. To simplifiy the data, I'm only going to focus on the
>>> first function of my 4-function PCI device.
>>> 
>>> After changing the IRQ affinity, the IRQ is not masked anymore
>>> (unlike before the patch). What stands out for me is the new vector
>>> (219) as reported by "guest interrupt information" does not match
>>> the vector (187) in dom0 lspci. Before the patch, the new vector in
>>> "guest interrupt information" matched the new vector in dom0 lspci
>>> (dest ID in dom0 lspci was unchanged). I also saw this message pop
>>> on the Xen console when I changed smp_affinity:
>>> 
>>> (XEN) do_IRQ: 1.187 No irq handler for vector (irq -1).
>>> 
>>> 187 is the vector from dom0 lspci before and after the smp_affinity
>>> change but "guest interrupt information" reports the new vector is
>>> 219. To me, this looks like the new MSI message data (with
>>> vector=219) did not get written into the PCI device, right?
>>> 
>>> Here's a comparison before and after changing smp_affinity from ffff
>>> to 2 (dom0 is pvops 2.6.31.1, domU is 2.6.30.1):
>>> 
>>> ---------------------------------------------------------------------
>>> ---
>>> 
>>> /proc/irq/48/smp_affinity=ffff (default):
>>> 
>>> dom0 lspci: Address: 00000000fee00000  Data: 40bb (vector=187)
>>> 
>>> domU lspci: Address: 00000000fee00000  Data: 4071 (vector=113)
>>> 
>>> qemu-dm-dpm.log: pt_msi_setup: msi mapped with pirq 4f (79)
>>>                  pt_msi_update: Update msi with pirq 4f gvec 71
>>> gflags 0 
>>> 
>>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000001,
>>> Vec:187 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>>> 79(----) 
>>> 
>>> Xen console: (XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe:
>>>              bdf = 7:0.0 (XEN) [VT-D]iommu.c:1175:d0
>>>              domain_context_mapping:PCIe: bdf = 7:0.0 (XEN)
>>>              [VT-D]io.c:301:d0 VT-d irq bind: m_irq = 4f device = 5
>>>              intx = 0 (XEN) io.c:326:d0 pt_irq_destroy_bind_vtd:
>>> machine_gsi=79 guest_gsi=36, device=5, intx=0 (XEN) io.c:381:d0
>>> XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
>>> 
>>> ---------------------------------------------------------------------
>>> ---
>>> 
>>> /proc/irq/48/smp_affinity=2:
>>> 
>>> dom0 lspci: Address: 00000000fee10000  Data: 40bb (dest ID changed
>>> from 0 (APIC ID of CPU0) to 16 (APIC ID of CPU1), vector unchanged)
>>> 
>>> domU lspci: Address: 00000000fee02000  Data: 40b1 (dest ID changed
>>> from 0 (APIC ID of CPU0) to 2 (APIC ID of CPU1), new vector=177)
>>> 
>>> Guest interrupt information: (XEN) IRQ: 74, IRQ affinity:0x00000002,
>>> Vec:219 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:
>>> 79(----) 
>>> 
>>> qemu-dm-dpm.log: pt_msi_update: Update msi with pirq 4f gvec 71
>>>                  gflags 2 pt_msi_update: Update msi with pirq 4f
>>> gvec b1 gflags 2
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  1:58                                     ` Zhang, Xiantao
@ 2009-10-22  2:42                                       ` Zhang, Xiantao
  2009-10-22  6:25                                         ` Keir Fraser
  2009-10-22  5:10                                       ` Qing He
  1 sibling, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-22  2:42 UTC (permalink / raw)
  To: Zhang, Xiantao, Cinco, Dante, He, Qing; +Cc: xen-devel, Ian Jackson, Fraser

Zhang, Xiantao wrote:
> Dante,
>    Have you applied the two patches when you did the testing?  
> Without them, we can reproduce the issue you reported, but with them,
> the issue is gone.  The root-cause is that when program MSI, we have
> to mask the MSI interrupt source first, otherwise it may generate
> inconistent interrupts with incorrect destination and right vector or
> incorrect vector and right destination.     
> 
> For exmaple, if the old MSI interrupt info is 0.186 which means the
> destination id is 0 and the vector is 186, but when the IRQ migrates
> to another cpu(e.g.  Cpu 1), the MSI info should be changed to 1.194.
> When you programs MSI info to pci device, if not mask it first, it
> may generate the interrupt as 1.186 or 0.194. Obviously, ther
> interrupts with the info 1.186 and 0.194 doesn't exist, and according
> to the spec, any combination is possible. Since Xen writes addr field
> first, so it is likely to generate 1.186 instead of 0.194, so your
> pci devices may generate interrupt with new destination and old
> vector(1.186).    In my two patches, one is used to fix guest
> interrupt affinity issue(a race exists between guest eoi old vector
> and guest setting new vector), and another one is used to safely
> program MSI info to pci devices to avoid inconsistent interrupts
> generation.             
> 
>> (XEN) traps.c:1626: guest_io_write::pci_conf_write data=0x40ba
> 
> This should be written by dom0(likely to be Qemu).  And if it does
> exist, we may have to prohibit such unsafe writings about MSI in
> Qemu.  

Another issue may also be contributing.  Currently, both Qemu and the hypervisor can program MSI, but Xen lacks a synchronization mechanism between them to avoid the race.  As said in the last mail, Qemu shouldn't be allowed to do unsafe writes of the MSI info; instead, it should go through a hypercall to the hypervisor for MSI programming. Otherwise, Qemu may write stale MSI info to the PCI devices, which leads to these strange issues.   
Keir/Ian
	What's your opinion on this potential issue?  Should we add a lock between them, or just allow only the hypervisor to do the writing?    
Xiantao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  1:58                                     ` Zhang, Xiantao
  2009-10-22  2:42                                       ` Zhang, Xiantao
@ 2009-10-22  5:10                                       ` Qing He
  2009-10-23  0:10                                         ` Cinco, Dante
  1 sibling, 1 reply; 55+ messages in thread
From: Qing He @ 2009-10-22  5:10 UTC (permalink / raw)
  To: Zhang, Xiantao; +Cc: Cinco, Dante, xen-devel, keir.fraser

[-- Attachment #1: Type: text/plain, Size: 1343 bytes --]

On Thu, 2009-10-22 at 09:58 +0800, Zhang, Xiantao wrote:
> > (XEN) traps.c:1626: guest_io_write::pci_conf_write data=0x40ba 
> 
> This should be written by dom0(likely to be Qemu).  And if it does
> exist, we may have to prohibit such unsafe writings about MSI in
> Qemu.  

Yes, that is the case: the problem happens in Qemu. The algorithm looks
like the following:

    pt_pci_write_config(new_value)
    {
        dev_value = pci_read_block();

        value = msi_write_handler(dev_value, new_value);

        pci_write_block(value);
    }

    msi_write_handler(dev_value, new_value)
    {
        HYPERVISOR_bind_pt_irq(); // updates MSI binding

        return dev_value;   // it decides not to change it
    }

The problem lies here: when bind_pt_irq is called, the real physical
data/address is updated by the hypervisor. No problem was exposed
before because at that time the hypervisor used a universal vector,
so the MSI data/address remained unchanged. But this is no longer the
case now that per-CPU vectors are used: the pci_write_block call is
undesirable in QEmu, because it writes the stale value back into the
register and invalidates any modifications.

Clearly, if QEmu decides to hand the management of these registers
to the hypervisor, it shouldn't touch them again. Here is a patch
to fix this by introducing a no_wb flag. Can you have a try?

Thanks,
Qing

[-- Attachment #2: qemu-msi-no-wb.patch --]
[-- Type: text/x-diff, Size: 2471 bytes --]

diff --git a/hw/pass-through.c b/hw/pass-through.c
index 8d80755..b1a3b09 100644
--- a/hw/pass-through.c
+++ b/hw/pass-through.c
@@ -626,6 +626,7 @@ static struct pt_reg_info_tbl pt_emu_reg_msi_tbl[] = {
         .init_val   = 0x00000000,
         .ro_mask    = 0x00000003,
         .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
         .init       = pt_common_reg_init,
         .u.dw.read  = pt_long_reg_read,
         .u.dw.write = pt_msgaddr32_reg_write,
@@ -638,6 +639,7 @@ static struct pt_reg_info_tbl pt_emu_reg_msi_tbl[] = {
         .init_val   = 0x00000000,
         .ro_mask    = 0x00000000,
         .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
         .init       = pt_msgaddr64_reg_init,
         .u.dw.read  = pt_long_reg_read,
         .u.dw.write = pt_msgaddr64_reg_write,
@@ -650,6 +652,7 @@ static struct pt_reg_info_tbl pt_emu_reg_msi_tbl[] = {
         .init_val   = 0x0000,
         .ro_mask    = 0x0000,
         .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
         .init       = pt_msgdata_reg_init,
         .u.w.read   = pt_word_reg_read,
         .u.w.write  = pt_msgdata_reg_write,
@@ -662,6 +665,7 @@ static struct pt_reg_info_tbl pt_emu_reg_msi_tbl[] = {
         .init_val   = 0x0000,
         .ro_mask    = 0x0000,
         .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
         .init       = pt_msgdata_reg_init,
         .u.w.read   = pt_word_reg_read,
         .u.w.write  = pt_msgdata_reg_write,
@@ -1550,10 +1554,12 @@ static void pt_pci_write_config(PCIDevice *d, uint32_t address, uint32_t val,
     val >>= ((address & 3) << 3);
 
 out:
-    ret = pci_write_block(pci_dev, address, (uint8_t *)&val, len);
+    if (!reg->no_wb) {
+        ret = pci_write_block(pci_dev, address, (uint8_t *)&val, len);
 
-    if (!ret)
-        PT_LOG("Error: pci_write_block failed. return value[%d].\n", ret);
+        if (!ret)
+            PT_LOG("Error: pci_write_block failed. return value[%d].\n", ret);
+    }
 
     if (pm_state != NULL && pm_state->flags & PT_FLAG_TRANSITING)
         /* set QEMUTimer */
diff --git a/hw/pass-through.h b/hw/pass-through.h
index 028a03e..3c79885 100644
--- a/hw/pass-through.h
+++ b/hw/pass-through.h
@@ -364,6 +364,8 @@ struct pt_reg_info_tbl {
     uint32_t ro_mask;
     /* reg emulate field mask (ON:emu, OFF:passthrough) */
     uint32_t emu_mask;
+    /* no write back allowed */
+    uint32_t no_wb;
     /* emul reg initialize method */
     conf_reg_init init;
     union {

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  2:42                                       ` Zhang, Xiantao
@ 2009-10-22  6:25                                         ` Keir Fraser
  2009-10-22 21:11                                           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 55+ messages in thread
From: Keir Fraser @ 2009-10-22  6:25 UTC (permalink / raw)
  To: Zhang, Xiantao, Cinco, Dante, He, Qing; +Cc: xen-devel, Ian Jackson

On 22/10/2009 03:42, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

>> This should be written by dom0(likely to be Qemu).  And if it does
>> exist, we may have to prohibit such unsafe writings about MSI in
>> Qemu.  
> 
> Another issue may exist which leads to the issue.  Currenlty, both Qemu and
> hypervisor can program MSI but Xen lacks synchronization mechnism between them
> to avoid race.  As said in the last mail,  Qemu shouldn't be allowed to do the
> unsafe writing about MSI Info, and insteadly,  it should resort to hypervisor
> through hypercall for MSI programing, otherwise, Qemu may write staled MSI
> info to PCI devices  and leads to the strange issues.
> Keir/Ian
> What's your opinion about the potential issue ?  Maybe we need to add a lock
> between them or just allow hypervisor to do the writing ?

In general, having qemu make pci updates via the cf8/cfc method is clearly
unsafe, and cannot be made safe. I would certainly be happy to see some of
the low-level PCI management pushed into pciback (and/or pci-stub, depending
on whether pciback is to be ported to pv_ops).

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-20  7:51                             ` Zhang, Xiantao
  2009-10-20 17:26                               ` Cinco, Dante
@ 2009-10-22  6:46                               ` Jan Beulich
  2009-10-22  7:11                                 ` Zhang, Xiantao
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2009-10-22  6:46 UTC (permalink / raw)
  To: Xiantao Zhang; +Cc: Dante Cinco, xen-devel, Fraser, Qing He

>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 20.10.09 09:51 >>>
>Attached two patches should fix the issues. For the issue which complains
>"(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),", I root-caused it.
>Currenlty, when programs MSI address & data, Xen doesn't perform the
>mask/unmask logic to avoid inconsistent interrupt genernation. In this
>case, according to spec, the interrupt generation behavior is undfined,
>and device may generate MSI interrupts with the expected vector and
>incorrect destination ID, so leads to the issue.  The attached two patches
>should address it. 

What about the case of MSI not having a mask bit? Shouldn't movement
(i.e. vector or affinity changes) be disallowed for non-maskable ones?

Jan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  6:46                               ` Jan Beulich
@ 2009-10-22  7:11                                 ` Zhang, Xiantao
  2009-10-22  7:31                                   ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-22  7:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dante Cinco, xen-devel, Fraser, He, Qing

Jan Beulich wrote:
>>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 20.10.09 09:51 >>>
>> Attached two patches should fix the issues. For the issue which
>> complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),",
>> I root-caused it. Currenlty, when programs MSI address & data, Xen
>> doesn't perform the mask/unmask logic to avoid inconsistent
>> interrupt genernation. In this case, according to spec, the
>> interrupt generation behavior is undfined, 
>> and device may generate MSI interrupts with the expected vector and
>> incorrect destination ID, so leads to the issue.  The attached two
>> patches should address it.
> 
> What about the case of MSI not having a mask bit? Shouldn't movement
> (i.e. vector or affinity changes) be disallowed for non-maskable ones?

IRQ migration shouldn't depend on the interrupt status (masked/unmasked), and the hypervisor can handle a non-masked IRQ during the migration. 
Xiantao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  7:11                                 ` Zhang, Xiantao
@ 2009-10-22  7:31                                   ` Jan Beulich
  2009-10-22  8:41                                     ` Zhang, Xiantao
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2009-10-22  7:31 UTC (permalink / raw)
  To: Xiantao Zhang; +Cc: Dante Cinco, xen-devel, Fraser, Qing He

>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 22.10.09 09:11 >>>
>Jan Beulich wrote:
>>>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 20.10.09 09:51 >>>
>>> Attached two patches should fix the issues. For the issue which
>>> complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq -1),",
>>> I root-caused it. Currenlty, when programs MSI address & data, Xen
>>> doesn't perform the mask/unmask logic to avoid inconsistent
>>> interrupt genernation. In this case, according to spec, the
>>> interrupt generation behavior is undfined, 
>>> and device may generate MSI interrupts with the expected vector and
>>> incorrect destination ID, so leads to the issue.  The attached two
>>> patches should address it.
>> 
>> What about the case of MSI not having a mask bit? Shouldn't movement
>> (i.e. vector or affinity changes) be disallowed for non-maskable ones?
>
>IRQ migration shouldn't depend on the interrupt status(mask/unmask),
>and hyperviosr can handle non-masked irq during the migration. 

Hmm, then I don't understand which case your patch was a fix for: I
understood that it addresses an issue when the affinity of an interrupt
gets changed (requiring a re-write of the address/data pair). If the
hypervisor can deal with it without masking, then why did you add it?

Jan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus >  4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  7:31                                   ` Jan Beulich
@ 2009-10-22  8:41                                     ` Zhang, Xiantao
  2009-10-22  9:42                                       ` Keir Fraser
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-22  8:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dante Cinco, xen-devel, Fraser, He, Qing

Jan Beulich wrote:
>>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 22.10.09 09:11 >>>
>> Jan Beulich wrote:
>>>>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 20.10.09 09:51 >>>
>>>> Attached two patches should fix the issues. For the issue which
>>>> complains "(XEN) do_IRQ: 1.187 No irq handler for vector (irq
>>>> -1),", I root-caused it. Currenlty, when programs MSI address &
>>>> data, Xen doesn't perform the mask/unmask logic to avoid
>>>> inconsistent interrupt genernation. In this case, according to
>>>> spec, the interrupt generation behavior is undfined,
>>>> and device may generate MSI interrupts with the expected vector and
>>>> incorrect destination ID, so leads to the issue.  The attached two
>>>> patches should address it.
>>> 
>>> What about the case of MSI not having a mask bit? Shouldn't movement
>>> (i.e. vector or affinity changes) be disallowed for non-maskable
>>> ones? 
>> 
>> IRQ migration shouldn't depend on the interrupt status(mask/unmask),
>> and hyperviosr can handle non-masked irq during the migration.
> 
> Hmm, then I don't understand which case your patch was a fix for: I
> understood that it addresses an issue when the affinity of an
> interrupt gets changed (requiring a re-write of the address/data
> pair). If the hypervisor can deal with it without masking, then why
> did you add it?

Hmm, sorry, it seems I misunderstood your question. If the MSI doesn't support a mask bit (clearing the MSI enable bit doesn't help in this case), the issue may still exist. I just checked the Linux side; it seems Linux doesn't perform the mask operation when programming MSI, but I don't know why Linux doesn't have such issues.  Actually, we do see inconsistent interrupt messages from the device without this patch, and after applying the patch the issue is gone.  It may need further investigation why Linux doesn't need the mask operation.   
Xiantao  

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  8:41                                     ` Zhang, Xiantao
@ 2009-10-22  9:42                                       ` Keir Fraser
  2009-10-22 16:32                                         ` Zhang, Xiantao
                                                           ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Keir Fraser @ 2009-10-22  9:42 UTC (permalink / raw)
  To: Zhang, Xiantao, Jan Beulich; +Cc: Dante Cinco, xen-devel, He, Qing

On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

>> Hmm, then I don't understand which case your patch was a fix for: I
>> understood that it addresses an issue when the affinity of an
>> interrupt gets changed (requiring a re-write of the address/data
>> pair). If the hypervisor can deal with it without masking, then why
>> did you add it?
> 
> Hmm, sorry, seems I misunderstood your question. If the msi doesn't support
> mask bit(clearing MSI enable bit doesn't help in this case), the issue may
> still exist. Just checked Linux side, seems it doesn't perform mask operation
> when program MSI, but don't know why Linux hasn't such issues.  Actaully, we
> do see inconsisten interrupt message from the device without this patch, and
> after applying the patch, the issue is gone.  May need further investigation
> why Linux doesn't need the mask operation.

Linux is quite careful about when it will reprogram vector/affinity info
isn't it? Doesn't it mark such an update pending and only flush it through
during next interrupt delivery, or something like that? Do we need some of
the upstream Linux patches for this?

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  9:42                                       ` Keir Fraser
@ 2009-10-22 16:32                                         ` Zhang, Xiantao
  2009-10-22 16:33                                         ` Cinco, Dante
  2009-10-26 13:02                                         ` Zhang, Xiantao
  2 siblings, 0 replies; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-22 16:32 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Dante Cinco, xen-devel, He, Qing

Keir Fraser wrote:
> On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
> 
>>> Hmm, then I don't understand which case your patch was a fix for: I
>>> understood that it addresses an issue when the affinity of an
>>> interrupt gets changed (requiring a re-write of the address/data
>>> pair). If the hypervisor can deal with it without masking, then why
>>> did you add it?
>> 
>> Hmm, sorry, seems I misunderstood your question. If the msi doesn't
>> support mask bit(clearing MSI enable bit doesn't help in this case),
>> the issue may still exist. Just checked Linux side, seems it doesn't
>> perform mask operation when program MSI, but don't know why Linux
>> hasn't such issues.  Actaully, we do see inconsisten interrupt
>> message from the device without this patch, and after applying the
>> patch, the issue is gone.  May need further investigation why Linux
>> doesn't need the mask operation. 
> 
> Linux is quite careful about when it will reprogram vector/affinity
> info isn't it? Doesn't it mark such an update pending and only flush
> it through during next interrupt delivery, or something like that? Do
> we need some of the upstream Linux patches for this?
Yeah, after checking the related logic in Linux, I think we need to port more of its IRQ-migration logic to avoid the races reported in this thread.   To set the affinity for a specific IRQ, the first step is to mark the move as pending, and then do the real setting just before acking the IRQ on the next interrupt delivery; at that point a normal device shouldn't generate new interrupts before the ack.  I will post the backport patch later. 
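
A condensed sketch of that scheme is below, using the field and handler names from the patch posted later in this thread; it is an illustration of the idea, not the actual backport.

    /* Condensed sketch of the deferred-migration scheme described above
     * (the full backport attached later in this thread also handles
     * level-triggered IO-APIC interrupts and more corner cases). */

    /* 1. Changing affinity only records the request on the descriptor. */
    static void irq_set_affinity_deferred(struct irq_desc *desc, cpumask_t mask)
    {
        desc->status |= IRQ_MOVE_PENDING;   /* a move is wanted ... */
        desc->pending_mask = mask;          /* ... to this CPU set */
    }

    /* 2. The ack path applies it, with the source briefly disabled so the
     *    MSI address/data pair is never observed half-updated. */
    static void ack_and_apply_pending_move(struct irq_desc *desc, int irq)
    {
        if ( desc->status & IRQ_MOVE_PENDING )
        {
            desc->handler->disable(irq);
            desc->handler->set_affinity(irq, desc->pending_mask);  /* rewrite addr+data */
            desc->handler->enable(irq);
            desc->status &= ~IRQ_MOVE_PENDING;
        }
        ack_APIC_irq();
    }
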
Xiantao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus >  4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  9:42                                       ` Keir Fraser
  2009-10-22 16:32                                         ` Zhang, Xiantao
@ 2009-10-22 16:33                                         ` Cinco, Dante
  2009-10-23  1:06                                           ` Zhang, Xiantao
  2009-10-26 13:02                                         ` Zhang, Xiantao
  2 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-22 16:33 UTC (permalink / raw)
  To: Keir Fraser, Zhang, Xiantao, Jan Beulich; +Cc: xen-devel, He, Qing

Xiantao,

I'm sorry, I forgot to mention that I did apply your two patches, but they didn't have any effect (interrupts are still lost after changing smp_affinity, and the "No irq handler for vector" message still appears). I added a dprintk in msi_set_mask_bit() and realized that MSI on my device does not have a mask bit (MSI-X does). My PCI device uses MSI, not MSI-X. I placed my dprintk inside the condition below and it never triggered.

    switch (entry->msi_attrib.type) {
    case PCI_CAP_ID_MSI:
        if (entry->msi_attrib.maskbit) {

While debugging this problem, I thought about the potential problem of an interrupt firing between the writes of the MSI message address and the MSI message data. I noticed that pci_conf_write() uses spin_lock_irqsave() to disable interrupts before issuing the "out" instruction, but the writes of the address and data are two separate pci_conf_write() calls. To me, it would be safer to write the address and data in a single call preceded by spin_lock_irqsave(). That way, by the time interrupts are re-enabled, both the address and the data have been updated.
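
A rough sketch of that suggestion, with a hypothetical helper and lock name (neither exists in Xen as written):

    /* Hypothetical combined helper: write the MSI address and data
     * back-to-back inside a single interrupts-off critical section,
     * as suggested above. */
    static DEFINE_SPINLOCK(msi_pcicfg_lock);

    static void pci_conf_write_msi(u8 bus, u8 slot, u8 func, unsigned int pos,
                                   u32 address_lo, u16 data)
    {
        unsigned long flags;

        spin_lock_irqsave(&msi_pcicfg_lock, flags);
        pci_conf_write32(bus, slot, func, msi_lower_address_reg(pos), address_lo);
        pci_conf_write16(bus, slot, func, msi_data_reg(pos, 0), data);
        spin_unlock_irqrestore(&msi_pcicfg_lock, flags);
    }

Note this only serializes writers on the local CPU; the device itself can still raise an MSI between the two config-space writes, which is why the device-side masking (or the deferred reprogramming discussed elsewhere in the thread) is still relevant.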

Dante

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
Sent: Thursday, October 22, 2009 2:42 AM
To: Zhang, Xiantao; Jan Beulich
Cc: He, Qing; xen-devel@lists.xensource.com; Cinco, Dante
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

>> Hmm, then I don't understand which case your patch was a fix for: I 
>> understood that it addresses an issue when the affinity of an 
>> interrupt gets changed (requiring a re-write of the address/data 
>> pair). If the hypervisor can deal with it without masking, then why 
>> did you add it?
> 
> Hmm, sorry, seems I misunderstood your question. If the msi doesn't 
> support mask bit(clearing MSI enable bit doesn't help in this case), 
> the issue may still exist. Just checked Linux side, seems it doesn't 
> perform mask operation when program MSI, but don't know why Linux 
> hasn't such issues.  Actaully, we do see inconsisten interrupt message 
> from the device without this patch, and after applying the patch, the 
> issue is gone.  May need further investigation why Linux doesn't need the mask operation.

Linux is quite careful about when it will reprogram vector/affinity info isn't it? Doesn't it mark such an update pending and only flush it through during next interrupt delivery, or something like that? Do we need some of the upstream Linux patches for this?

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  6:25                                         ` Keir Fraser
@ 2009-10-22 21:11                                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 55+ messages in thread
From: Jeremy Fitzhardinge @ 2009-10-22 21:11 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, He, Qing, Konrad Rzeszutek Wilk, Ian Jackson, Cinco,
	Dante, Zhang, Xiantao

On 10/21/09 23:25, Keir Fraser wrote:
> In general, having qemu make pci updates via the cf8/cfc method is clearly
> unsafe, and cannot be made safe. I would certainly be happy to see some of
> the low-level PCI management pushed into pciback (and/or pci-stub, depending
> on whether pciback is to be ported to pv_ops).
>   

I've got Konrad's forward-port of pciback in xen/master at the moment.

    J

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  5:10                                       ` Qing He
@ 2009-10-23  0:10                                         ` Cinco, Dante
  0 siblings, 0 replies; 55+ messages in thread
From: Cinco, Dante @ 2009-10-23  0:10 UTC (permalink / raw)
  To: Qing He, Zhang, Xiantao; +Cc: xen-devel, keir.fraser

Qing,

Your patch worked. It suppressed the extra write that previously overwrote the MSI message data with the old vector. There are no more "No irq handler for vector" messages, and the interrupts were successfully migrated to the new CPU. I still experienced a hang on both domU and dom0 when I changed the smp_affinity of all 4 PCI devices (I have a 4-function PCI device) simultaneously (the "echo <new_smp_affinity> > /proc/irq/<irq#>/smp_affinity" commands are in a shell script), but I didn't get a chance to pursue this today.

Dante

-----Original Message-----
From: Qing He [mailto:qing.he@intel.com] 
Sent: Wednesday, October 21, 2009 10:11 PM
To: Zhang, Xiantao
Cc: Cinco, Dante; xen-devel@lists.xensource.com; keir.fraser@eu.citrix.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On Thu, 2009-10-22 at 09:58 +0800, Zhang, Xiantao wrote:
> > (XEN) traps.c:1626: guest_io_write::pci_conf_write data=0x40ba
> 
> This should be written by dom0(likely to be Qemu).  And if it does 
> exist, we may have to prohibit such unsafe writings about MSI in Qemu.

Yes, it is the case, the problem happens in Qemu, the algorithm looks like below:

    pt_pci_write_config(new_value)
    {
        dev_value = pci_read_block();

        value = msi_write_handler(dev_value, new_value);

        pci_write_block(value);

    }

    msi_write_handler(dev_value, new_value)
    {
        HYPERVISOR_bind_pt_irq(); // updates MSI binding

	return dev_value;   // it decides not to change it
    }

The problem lies here, when bind_pt_irq is called, the real physical data/address is updated by the hypervisor. There were no problem exposed before because at that time hypervisor uses a universal vector , the data/address of msi remains unchanged. But this isn't the case when per-CPU vector is there, the pci_write_block is undesirable in QEmu now, it writes stale value back into the register and invalidate any modifications.

Clearly, if QEmu decides to hand the management of these registers to the hypervisor, it shouldn't touch them again. Here is a patch to fix this by introducing a no_wb flag. Can you have a try?

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus >  4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22 16:33                                         ` Cinco, Dante
@ 2009-10-23  1:06                                           ` Zhang, Xiantao
  0 siblings, 0 replies; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-23  1:06 UTC (permalink / raw)
  To: Cinco, Dante, Keir Fraser, Jan Beulich; +Cc: xen-devel, He, Qing

Dante, 
   If the device doesn't support the MSI mask bit, the second patch should have no effect for it. I am working on backporting more IRQ migration logic from Linux, which should ensure the address and vector are both written to the device before new interrupts fire.   But as I mentioned before, if you want to solve the guest affinity setting issue, you still have to apply the first patch I sent out (fix-irq-affinity-msi3.patch). :-)
Xiantao

Cinco, Dante wrote:
> Xiantao,
> 
> I'm sorry I forgot to mention that I did apply your two patches but
> it didn't have any effect (interrupts still lost after changing
> smp_affinity and "No handler for irq vector" message). I added a
> dprintk in msi_set_mask_bit() and realized that MSI does not have a
> mask bit (MSIX does). My PCI device uses MSI not MSIX. I placed my
> dprintk inside the condition below and it never triggered.     
> 
>     switch (entry->msi_attrib.type) {
>     case PCI_CAP_ID_MSI:
>         if (entry->msi_attrib.maskbit) {
> 
> While debugging this problem, I thought about the potential problem
> of an interrupt firing between the writes for the MSI message address
> and MSI message data. I noticed that pci_conf_write() uses
> spin_lock_irqsave() to disable interrupts before issuing the "out"
> instruction but the writes for the address and data are two separate
> pci_conf_write() calls. To me, it would be safer to write the address
> and data in a single call and preceded by spin_lock_irqsave(). This
> way, when the interrupts are enabled, the address and data have both
> been updated.        
> 
> Dante
> 
> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Thursday, October 22, 2009 2:42 AM
> To: Zhang, Xiantao; Jan Beulich
> Cc: He, Qing; xen-devel@lists.xensource.com; Cinco, Dante
> Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus
> > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) 
> 
> On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
> 
>>> Hmm, then I don't understand which case your patch was a fix for: I
>>> understood that it addresses an issue when the affinity of an
>>> interrupt gets changed (requiring a re-write of the address/data
>>> pair). If the hypervisor can deal with it without masking, then why
>>> did you add it?
>> 
>> Hmm, sorry, seems I misunderstood your question. If the msi doesn't
>> support mask bit(clearing MSI enable bit doesn't help in this case),
>> the issue may still exist. Just checked Linux side, seems it doesn't
>> perform mask operation when program MSI, but don't know why Linux
>> hasn't such issues.  Actaully, we do see inconsisten interrupt
>> message 
>> from the device without this patch, and after applying the patch, the
>> issue is gone.  May need further investigation why Linux doesn't
>> need the mask operation. 
> 
> Linux is quite careful about when it will reprogram vector/affinity
> info isn't it? Doesn't it mark such an update pending and only flush
> it through during next interrupt delivery, or something like that? Do
> we need some of the upstream Linux patches for this?   
> 
>  -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-22  9:42                                       ` Keir Fraser
  2009-10-22 16:32                                         ` Zhang, Xiantao
  2009-10-22 16:33                                         ` Cinco, Dante
@ 2009-10-26 13:02                                         ` Zhang, Xiantao
  2009-10-26 13:34                                           ` Keir Fraser
  2 siblings, 1 reply; 55+ messages in thread
From: Zhang, Xiantao @ 2009-10-26 13:02 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Dante Cinco, xen-devel, He, Qing

[-- Attachment #1: Type: text/plain, Size: 1784 bytes --]

Keir, 
   The attached patch (irq-migration-enhancement.patch) enhances the IRQ migration logic; most of the logic is ported from Linux and tailored for Xen.  Please apply it; it should eliminate the race between writing the MSI vector and address.  In addition, to fix the guest interrupt affinity issue, we also need to apply the patch fix-irq-affinity-msi3.patch. 
Xiantao


Keir Fraser wrote:
> On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
> 
>>> Hmm, then I don't understand which case your patch was a fix for: I
>>> understood that it addresses an issue when the affinity of an
>>> interrupt gets changed (requiring a re-write of the address/data
>>> pair). If the hypervisor can deal with it without masking, then why
>>> did you add it?
>> 
>> Hmm, sorry, seems I misunderstood your question. If the msi doesn't
>> support mask bit(clearing MSI enable bit doesn't help in this case),
>> the issue may still exist. Just checked Linux side, seems it doesn't
>> perform mask operation when program MSI, but don't know why Linux
>> hasn't such issues.  Actaully, we do see inconsisten interrupt
>> message from the device without this patch, and after applying the
>> patch, the issue is gone.  May need further investigation why Linux
>> doesn't need the mask operation. 
> 
> Linux is quite careful about when it will reprogram vector/affinity
> info isn't it? Doesn't it mark such an update pending and only flush
> it through during next interrupt delivery, or something like that? Do
> we need some of the upstream Linux patches for this?
> 
>  -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


[-- Attachment #2: irq-migration-enhancement.patch --]
[-- Type: application/octet-stream, Size: 9008 bytes --]

x86: IRQ Migration logic enhancement. 

To program the MSI address/vector safely, delay the IRQ migration
operation until just before acking the next interrupt. This avoids
inconsistent interrupt generation due to non-atomic writes of the
MSI address and data registers.

The logic is ported from Linux and tailored for Xen.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r dcfa6155b692 -r 1f960f20a33f xen/arch/x86/hpet.c
--- a/xen/arch/x86/hpet.c	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/arch/x86/hpet.c	Mon Oct 26 15:54:00 2009 +0800
@@ -289,6 +289,7 @@ static void hpet_msi_ack(unsigned int ir
     struct irq_desc *desc = irq_to_desc(irq);
 
     irq_complete_move(&desc);
+    move_native_irq(irq);
     ack_APIC_irq();
 }
 
diff -r dcfa6155b692 -r 1f960f20a33f xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/arch/x86/hvm/hvm.c	Mon Oct 26 15:54:00 2009 +0800
@@ -243,7 +243,7 @@ void hvm_migrate_pirqs(struct vcpu *v)
             continue;
         irq = desc - irq_desc;
         ASSERT(MSI_IRQ(irq));
-        desc->handler->set_affinity(irq, *cpumask_of(v->processor));
+        irq_set_affinity(irq, *cpumask_of(v->processor));
         spin_unlock_irq(&desc->lock);
     }
     spin_unlock(&d->event_lock);
diff -r dcfa6155b692 -r 1f960f20a33f xen/arch/x86/io_apic.c
--- a/xen/arch/x86/io_apic.c	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/arch/x86/io_apic.c	Mon Oct 26 15:54:00 2009 +0800
@@ -1379,6 +1379,7 @@ static void ack_edge_ioapic_irq(unsigned
     struct irq_desc *desc = irq_to_desc(irq);
     
     irq_complete_move(&desc);
+    move_native_irq(irq);
 
     if ((desc->status & (IRQ_PENDING | IRQ_DISABLED))
         == (IRQ_PENDING | IRQ_DISABLED))
@@ -1418,6 +1419,38 @@ static void setup_ioapic_ack(char *s)
         printk("Unknown ioapic_ack value specified: '%s'\n", s);
 }
 custom_param("ioapic_ack", setup_ioapic_ack);
+
+static bool_t io_apic_level_ack_pending(unsigned int irq)
+{
+    struct irq_pin_list *entry;
+    unsigned long flags;
+
+    spin_lock_irqsave(&ioapic_lock, flags);
+    entry = &irq_2_pin[irq];
+    for (;;) {
+        unsigned int reg;
+        int pin;
+
+        if (!entry)
+            break;
+
+        pin = entry->pin;
+        if (pin == -1)
+            continue;
+        reg = io_apic_read(entry->apic, 0x10 + pin*2);
+        /* Is the remote IRR bit set? */
+        if (reg & IO_APIC_REDIR_REMOTE_IRR) {
+            spin_unlock_irqrestore(&ioapic_lock, flags);
+            return 1;
+        }
+        if (!entry->next)
+            break;
+        entry = irq_2_pin + entry->next;
+    }
+    spin_unlock_irqrestore(&ioapic_lock, flags);
+
+    return 0;
+}
 
 static void mask_and_ack_level_ioapic_irq (unsigned int irq)
 {
@@ -1456,6 +1489,10 @@ static void mask_and_ack_level_ioapic_ir
     v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1));
 
     ack_APIC_irq();
+    
+    if ((irq_desc[irq].status & IRQ_MOVE_PENDING) &&
+       !io_apic_level_ack_pending(irq))
+        move_native_irq(irq);
 
     if (!(v & (1 << (i & 0x1f)))) {
         atomic_inc(&irq_mis_count);
@@ -1503,6 +1540,10 @@ static void end_level_ioapic_irq (unsign
 
     ack_APIC_irq();
 
+    if ((irq_desc[irq].status & IRQ_MOVE_PENDING) &&
+            !io_apic_level_ack_pending(irq))
+        move_native_irq(irq);
+
     if (!(v & (1 << (i & 0x1f)))) {
         atomic_inc(&irq_mis_count);
         spin_lock(&ioapic_lock);
@@ -1564,6 +1605,7 @@ static void ack_msi_irq(unsigned int irq
     struct irq_desc *desc = irq_to_desc(irq);
 
     irq_complete_move(&desc);
+    move_native_irq(irq);
 
     if ( msi_maskable_irq(desc->msi_desc) )
         ack_APIC_irq(); /* ACKTYPE_NONE */
diff -r dcfa6155b692 -r 1f960f20a33f xen/arch/x86/irq.c
--- a/xen/arch/x86/irq.c	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/arch/x86/irq.c	Mon Oct 26 15:54:00 2009 +0800
@@ -450,6 +450,67 @@ void __setup_vector_irq(int cpu)
         vector = irq_to_vector(irq);
         per_cpu(vector_irq, cpu)[vector] = irq;
     }
+}
+
+void move_masked_irq(int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	if (likely(!(desc->status & IRQ_MOVE_PENDING)))
+		return;
+    
+    desc->status &= ~IRQ_MOVE_PENDING;
+
+    if (unlikely(cpus_empty(desc->pending_mask)))
+        return;
+
+    if (!desc->handler->set_affinity)
+        return;
+
+	/*
+	 * If there was a valid mask to work with, please
+	 * do the disable, re-program, enable sequence.
+	 * This is *not* particularly important for level triggered
+	 * but in a edge trigger case, we might be setting rte
+	 * when an active trigger is comming in. This could
+	 * cause some ioapics to mal-function.
+	 * Being paranoid i guess!
+	 *
+	 * For correct operation this depends on the caller
+	 * masking the irqs.
+	 */
+    if (likely(cpus_intersects(desc->pending_mask, cpu_online_map)))
+        desc->handler->set_affinity(irq, desc->pending_mask);
+
+	cpus_clear(desc->pending_mask);
+}
+
+void move_native_irq(int irq)
+{
+    struct irq_desc *desc = irq_to_desc(irq);
+
+    if (likely(!(desc->status & IRQ_MOVE_PENDING)))
+        return;
+
+    if (unlikely(desc->status & IRQ_DISABLED))
+        return;
+
+    desc->handler->disable(irq);
+    move_masked_irq(irq);
+    desc->handler->enable(irq);
+}
+
+/* For re-setting irq interrupt affinity for specific irq */
+void irq_set_affinity(int irq, cpumask_t mask)
+{
+    struct irq_desc *desc = irq_to_desc(irq);
+    
+    if (!desc->handler->set_affinity)
+        return;
+    
+    ASSERT(spin_is_locked(&desc->lock));
+    desc->status |= IRQ_MOVE_PENDING;
+    cpus_copy(desc->pending_mask, mask);
 }
 
 asmlinkage void do_IRQ(struct cpu_user_regs *regs)
diff -r dcfa6155b692 -r 1f960f20a33f xen/arch/x86/msi.c
--- a/xen/arch/x86/msi.c	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/arch/x86/msi.c	Mon Oct 26 15:54:00 2009 +0800
@@ -231,7 +231,6 @@ static void write_msi_msg(struct msi_des
         u8 slot = PCI_SLOT(dev->devfn);
         u8 func = PCI_FUNC(dev->devfn);
 
-		mask_msi_irq(entry->irq);
         pci_conf_write32(bus, slot, func, msi_lower_address_reg(pos),
                          msg->address_lo);
         if ( entry->msi_attrib.is_64 )
@@ -244,7 +243,6 @@ static void write_msi_msg(struct msi_des
         else
             pci_conf_write16(bus, slot, func, msi_data_reg(pos, 0),
                              msg->data);
-		unmask_msi_irq(entry->irq);
         break;
     }
     case PCI_CAP_ID_MSIX:
@@ -252,13 +250,11 @@ static void write_msi_msg(struct msi_des
         void __iomem *base;
         base = entry->mask_base;
 
-		mask_msi_irq(entry->irq);
         writel(msg->address_lo,
                base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
         writel(msg->address_hi,
                base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
         writel(msg->data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
-		unmask_msi_irq(entry->irq);
         break;
     }
     default:
diff -r dcfa6155b692 -r 1f960f20a33f xen/include/asm-x86/io_apic.h
--- a/xen/include/asm-x86/io_apic.h	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/include/asm-x86/io_apic.h	Mon Oct 26 15:54:00 2009 +0800
@@ -21,6 +21,15 @@
 		+ (mp_ioapics[idx].mpc_apicaddr & ~PAGE_MASK)))
 
 #define IO_APIC_ID(idx) (mp_ioapics[idx].mpc_apicid)
+
+/* I/O Unit Redirection Table */
+#define IO_APIC_REDIR_VECTOR_MASK   0x000FF
+#define IO_APIC_REDIR_DEST_LOGICAL  0x00800
+#define IO_APIC_REDIR_DEST_PHYSICAL 0x00000
+#define IO_APIC_REDIR_SEND_PENDING  (1 << 12)
+#define IO_APIC_REDIR_REMOTE_IRR    (1 << 14)
+#define IO_APIC_REDIR_LEVEL_TRIGGER (1 << 15)
+#define IO_APIC_REDIR_MASKED        (1 << 16)
 
 /*
  * The structure of the IO-APIC:
diff -r dcfa6155b692 -r 1f960f20a33f xen/include/asm-x86/irq.h
--- a/xen/include/asm-x86/irq.h	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/include/asm-x86/irq.h	Mon Oct 26 15:54:00 2009 +0800
@@ -138,6 +138,12 @@ int __assign_irq_vector(int irq, struct 
 
 int bind_irq_vector(int irq, int vector, cpumask_t domain);
 
+void move_native_irq(int irq);
+
+void move_masked_irq(int irq);
+
+void irq_set_affinity(int irq, cpumask_t mask);
+
 #define domain_pirq_to_irq(d, pirq) ((d)->arch.pirq_irq[pirq])
 #define domain_irq_to_pirq(d, irq) ((d)->arch.irq_pirq[irq])
 
diff -r dcfa6155b692 -r 1f960f20a33f xen/include/xen/irq.h
--- a/xen/include/xen/irq.h	Tue Oct 20 15:20:07 2009 +0800
+++ b/xen/include/xen/irq.h	Mon Oct 26 15:54:00 2009 +0800
@@ -24,6 +24,7 @@ struct irqaction {
 #define IRQ_REPLAY	8	/* IRQ has been replayed but not acked yet */
 #define IRQ_GUEST       16      /* IRQ is handled by guest OS(es) */
 #define IRQ_GUEST_EOI_PENDING 32 /* IRQ was disabled, pending a guest EOI */
+#define IRQ_MOVE_PENDING      64  /* IRQ is migrating to another CPUs */
 #define IRQ_PER_CPU     256     /* IRQ is per CPU */
 
 /* Special IRQ numbers. */
@@ -75,6 +76,7 @@ typedef struct irq_desc {
     int irq;
     spinlock_t lock;
     cpumask_t affinity;
+    cpumask_t pending_mask;  /* IRQ migration pending mask */
 
     /* irq ratelimit */
     s_time_t rl_quantum_start;

[-- Attachment #3: fix-irq-affinity-msi3.patch --]
[-- Type: application/octet-stream, Size: 5997 bytes --]

# HG changeset patch
# User Xiantao Zhang <xiantao.zhang@intel.com>
# Date 1255684803 -28800
# Node ID d1b3cb3fe044285093c923761d4bc40c7af4d199
# Parent  2eba302831c4534ac40283491f887263c7197b4a
x86: vMSI: Fix msi irq affinity issue for hvm guest.

There is a race between the guest setting a new vector and doing EOI on the old vector.
If the guest sets the new vector before it has EOIed the old one, then when the guest
does the EOI the hypervisor may fail to find the related pirq, miss the EOI of the real
vector, and hang the system.  We may need to add a timer for each pirq interrupt
source to avoid the host hang, but that is another topic and will be addressed later.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r 2eba302831c4 xen/arch/x86/hvm/vmsi.c
--- a/xen/arch/x86/hvm/vmsi.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/arch/x86/hvm/vmsi.c	Fri Oct 16 22:10:36 2009 +0800
@@ -92,8 +92,11 @@ int vmsi_deliver(struct domain *d, int p
     case dest_LowestPrio:
     {
         target = vlapic_lowest_prio(d, NULL, 0, dest, dest_mode);
-        if ( target != NULL )
+        if ( target != NULL ) {
             vmsi_inj_irq(d, target, vector, trig_mode, delivery_mode);
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+        }
         else
             HVM_DBG_LOG(DBG_LEVEL_IOAPIC, "null round robin: "
                         "vector=%x delivery_mode=%x\n",
@@ -106,9 +109,12 @@ int vmsi_deliver(struct domain *d, int p
     {
         for_each_vcpu ( d, v )
             if ( vlapic_match_dest(vcpu_vlapic(v), NULL,
-                                   0, dest, dest_mode) )
+                                   0, dest, dest_mode) ) {
                 vmsi_inj_irq(d, vcpu_vlapic(v),
                              vector, trig_mode, delivery_mode);
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+            }
         break;
     }
 
diff -r 2eba302831c4 xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/drivers/passthrough/io.c	Fri Oct 16 21:54:55 2009 +0800
@@ -164,7 +164,9 @@ int pt_irq_create_bind_vtd(
         {
             hvm_irq_dpci->mirq[pirq].flags = HVM_IRQ_DPCI_MACH_MSI |
                                              HVM_IRQ_DPCI_GUEST_MSI;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+            hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], pirq, 0);
@@ -178,6 +180,8 @@ int pt_irq_create_bind_vtd(
             {
                 hvm_irq_dpci->mirq[pirq].gmsi.gflags = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec = 0;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags = 0;
                 hvm_irq_dpci->mirq[pirq].flags = 0;
                 clear_bit(pirq, hvm_irq_dpci->mapping);
                 spin_unlock(&d->event_lock);
@@ -195,8 +199,14 @@ int pt_irq_create_bind_vtd(
             }
  
             /* if pirq is already mapped as vmsi, update the guest data/addr */
-            hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
-            hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec != pt_irq_bind->u.msi.gvec ) {
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gvec =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.old_gflags =
+                                    hvm_irq_dpci->mirq[pirq].gmsi.gflags;
+                hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
+                hvm_irq_dpci->mirq[pirq].gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            }
         }
         /* Caculate dest_vcpu_id for MSI-type pirq migration */
         dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
@@ -424,14 +434,21 @@ void hvm_dpci_msi_eoi(struct domain *d, 
           pirq = find_next_bit(hvm_irq_dpci->mapping, d->nr_pirqs, pirq + 1) )
     {
         if ( (!(hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_MACH_MSI)) ||
-                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector) )
+                (hvm_irq_dpci->mirq[pirq].gmsi.gvec != vector &&
+                 hvm_irq_dpci->mirq[pirq].gmsi.old_gvec != vector) )
             continue;
 
-        dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
-        dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        if ( hvm_irq_dpci->mirq[pirq].gmsi.gvec == vector ) {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.gflags & VMSI_DM_MASK);
+        } else {
+            dest = hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DEST_ID_MASK;
+            dest_mode = !!(hvm_irq_dpci->mirq[pirq].gmsi.old_gflags & VMSI_DM_MASK);
+        }
         if ( vlapic_match_dest(vcpu_vlapic(current), NULL, 0, dest, dest_mode) )
             break;
     }
+
     if ( pirq < d->nr_pirqs )
         __msi_pirq_eoi(d, pirq);
     spin_unlock(&d->event_lock);
diff -r 2eba302831c4 xen/include/xen/hvm/irq.h
--- a/xen/include/xen/hvm/irq.h	Thu Oct 15 16:49:21 2009 +0100
+++ b/xen/include/xen/hvm/irq.h	Fri Oct 16 21:48:04 2009 +0800
@@ -58,8 +58,10 @@ struct dev_intx_gsi_link {
 #define GLFAGS_SHIFT_TRG_MODE       15
 
 struct hvm_gmsi_info {
-    uint32_t gvec;
+    uint16_t gvec;
+    uint16_t old_gvec;
     uint32_t gflags;
+    uint32_t old_gflags;
     int dest_vcpu_id; /* -1 :multi-dest, non-negative: dest_vcpu_id */
 };
 

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-26 13:02                                         ` Zhang, Xiantao
@ 2009-10-26 13:34                                           ` Keir Fraser
  0 siblings, 0 replies; 55+ messages in thread
From: Keir Fraser @ 2009-10-26 13:34 UTC (permalink / raw)
  To: Zhang, Xiantao, Jan Beulich; +Cc: Dante Cinco, xen-devel, He, Qing

Thanks, applied as c/s 20370. I think fix-irq-affinity-msi3.patch is already
applied as c/s 20334.

 -- Keir

On 26/10/2009 13:02, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

> Keir, 
>    The attached patch (irq-migration-enhancement.patch) aims to enhance the irq
> migration logic; most of the logic is ported from Linux and tailored for Xen.
> Please apply it; it should eliminate the race between writing an MSI's vector
> and address. In addition, to fix the guest's interrupt affinity issue, we also need
> to apply the patch fix-irq-affinity-msi3.patch.
> Xiantao
> 
> 
> Keir Fraser wrote:
>> On 22/10/2009 09:41, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
>> 
>>>> Hmm, then I don't understand which case your patch was a fix for: I
>>>> understood that it addresses an issue when the affinity of an
>>>> interrupt gets changed (requiring a re-write of the address/data
>>>> pair). If the hypervisor can deal with it without masking, then why
>>>> did you add it?
>>> 
>>> Hmm, sorry, it seems I misunderstood your question. If the MSI doesn't
>>> support a mask bit (clearing the MSI enable bit doesn't help in this case),
>>> the issue may still exist. I just checked the Linux side; it doesn't seem
>>> to perform the mask operation when programming MSI, but I don't know why
>>> Linux doesn't have such issues.  Actually, we do see inconsistent interrupt
>>> messages from the device without this patch, and after applying the
>>> patch the issue is gone.  We may need to investigate further why Linux
>>> doesn't need the mask operation.
>> 
>> Linux is quite careful about when it will reprogram vector/affinity
>> info, isn't it? Doesn't it mark such an update pending and only flush
>> it through during the next interrupt delivery, or something like that? Do
>> we need some of the upstream Linux patches for this?
>> 
>>  -- Keir
>> 
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-16  0:09                   ` Konrad Rzeszutek Wilk
@ 2009-10-16  1:40                     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 55+ messages in thread
From: Konrad Rzeszutek Wilk @ 2009-10-16  1:40 UTC (permalink / raw)
  To: Cinco, Dante; +Cc: xen-devel, Keir Fraser, Qing He, xiantao.zhang

On Thu, Oct 15, 2009 at 08:09:42PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 14, 2009 at 01:54:33PM -0600, Cinco, Dante wrote:
> > I switched over to Xen 3.5-unstable (changeset 20303) and pv_ops dom0 2.6.31.1 hoping that this would resolve the IRQ SMP affinity problem. I had to use pci-stub to hide the PCI devices since pciback wasn't working. With vcpus=16 (APIC routing is physical flat), the interrupts were working in domU and being routed to CPU0 with the default smp_affinity (ffff), but changing it to any 16-bit one-hot value, or even setting it to the same default value, resulted in a complete loss of interrupts (even on the devices whose smp_affinity I hadn't changed). With vcpus=4 (APIC routing is logical flat), I can see the interrupts being load balanced across all CPUs, but as soon as I changed smp_affinity to any value, the interrupts stopped. This used to work reliably with the non-pv_ops kernel. I attached the logs in case anyone wants to take a look.
> > 
> > I did see the MSI message address/data change in both domU and dom0 (using "lspci -vv"):
> > 
> > vcpus=16:
> > 
> > domU MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 40a9
> > domU MSI message address/data after smp_affinity=0010:   Address: 00000000fee08000  Data: 40b1 (8 is APIC ID of CPU4)
> 
> What does Xen tell you (hit Ctrl-A three times and then 'z')? Specifically, look for vector 169 (a9) and 177 (b1).
> Do those values match what you see in domU and dom0? Mainly, that 177 has a dest_id of 8.
> Oh, and also check the guest interrupt information, to see if those values match.

N/m. I was thinking that maybe your IOAPIC has those vectors programmed in it. But
that would not make any sense.

> > 
> > dom0 MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 4094
> > dom0 MSI message address/data after smp_affinity=0010:   Address: 00000000fee00000  Data: 409c

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-14 19:54                 ` Cinco, Dante
@ 2009-10-16  0:09                   ` Konrad Rzeszutek Wilk
  2009-10-16  1:40                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 55+ messages in thread
From: Konrad Rzeszutek Wilk @ 2009-10-16  0:09 UTC (permalink / raw)
  To: Cinco, Dante; +Cc: xen-devel, Keir Fraser, Qing He, xiantao.zhang

On Wed, Oct 14, 2009 at 01:54:33PM -0600, Cinco, Dante wrote:
> I switched over to Xen 3.5-unstable (changeset 20303) and pv_ops dom0 2.6.31.1 hoping that this would resolve the IRQ SMP affinity problem. I had to use pci-stub to hide the PCI devices since pciback wasn't working. With vcpus=16 (APIC routing is physical flat), the interrupts were working in domU and being routed to CPU0 with the default smp_affinity (ffff), but changing it to any 16-bit one-hot value, or even setting it to the same default value, resulted in a complete loss of interrupts (even on the devices whose smp_affinity I hadn't changed). With vcpus=4 (APIC routing is logical flat), I can see the interrupts being load balanced across all CPUs, but as soon as I changed smp_affinity to any value, the interrupts stopped. This used to work reliably with the non-pv_ops kernel. I attached the logs in case anyone wants to take a look.
> 
> I did see the MSI message address/data change in both domU and dom0 (using "lspci -vv"):
> 
> vcpus=16:
> 
> domU MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 40a9
> domU MSI message address/data after smp_affinity=0010:   Address: 00000000fee08000  Data: 40b1 (8 is APIC ID of CPU4)

What does Xen tell you (hit Ctrl-A three times and then 'z')? Specifically, look for vector 169 (a9) and 177 (b1).
Do those values match what you see in domU and dom0? Mainly, that 177 has a dest_id of 8.
Oh, and also check the guest interrupt information, to see if those values match.
> 
> dom0 MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 4094
> dom0 MSI message address/data after smp_affinity=0010:   Address: 00000000fee00000  Data: 409c
> 
> Aside from "lspci -vv" what other means are there to track down this problem? Is there some way to print the interrupt vector table? I'm considering adding printk's to the code that Qing mentioned in his previous email (see below). Any suggestions on where in the code to add the printk's?

Hit Ctrl-A three times and you can get a wealth of information. Of interest might also
be the IO APIC area: you can see whether the vector in question is masked.
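
If it helps, a minimal sketch of checking that bit with the IO_APIC_REDIR_* constants added by the patch earlier in the thread (illustrative only; io_apic_read() is the usual accessor, and entry n's low dword sits at register 0x10 + 2*n):

/* Illustrative sketch only: report whether redirection entry 'pin' of
 * IO-APIC 'apic' has its mask bit (bit 16) set, using the
 * IO_APIC_REDIR_MASKED constant from the patch in this thread. */
static int sketch_ioapic_pin_is_masked(unsigned int apic, unsigned int pin)
{
    unsigned int lo = io_apic_read(apic, 0x10 + 2 * pin);

    return (lo & IO_APIC_REDIR_MASKED) != 0;
}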

> 
> Thanks.
> 
> Dante
> 
> -----Original Message-----
> From: Qing He [mailto:qing.he@intel.com] 
> Sent: Sunday, October 11, 2009 10:55 PM
> To: Cinco, Dante
> Cc: Keir Fraser; xen-devel@lists.xensource.com; xiantao.zhang@intel.com
> Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
> 
> On Mon, 2009-10-12 at 13:25 +0800, Cinco, Dante wrote:
> > With vcpus < 4, logical flat mode works fine (no error message). I can 
> > change smp_affinity to any value > 0 and < 16 and the interrupts go to 
> > the proper CPU(s). Could you point me to the code that handles MSI so 
> > that I can better understand the MSI implementation?
> 
> There are two parts:
>   1) init or changing the data and address of MSI:
>     1) qemu-xen: hw/passthrough.c: pt_msg.*_write, MSI accesses are
>                  trapped here first. And then pt_update_msi in
>                  hw/pt-msi.c is called to update the MSI binding.
>     2) xen:      drivers/passthrough/io.c: pt_irq_create_bind_vtd,
>                  where MSI is actually bound to the guest.
> 
>   2) on MSI reception:
>     In drivers/passthrough/io.c, hvm_do_IRQ_dpci and hvm_dirq_assist
>     are the routines responsible for handling all assigned irqs
>     (including MSI), and if an MSI is received, vmsi_deliver in
>     arch/x86/vmsi.c gets called to deliver the MSI to the corresponding
>     vlapic.
> 
> And I just learned from Xiantao Zhang that the guest Linux kernel enables per-cpu vectors when it's in physical mode, and that looks more likely to be relevant to this problem. Older Xen had a problem handling this, and changeset 20253 is supposed to fix it, although I noticed your Xen version is 20270.
> 
> Thanks,
> Qing


> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-12  5:54               ` Qing He
@ 2009-10-14 19:54                 ` Cinco, Dante
  2009-10-16  0:09                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-14 19:54 UTC (permalink / raw)
  To: Qing He; +Cc: xen-devel, Keir Fraser, xiantao.zhang

[-- Attachment #1: Type: text/plain, Size: 3404 bytes --]

I switched over to Xen 3.5-unstable (changeset 20303) and pv_ops dom0 2.6.31.1 hoping that this would resolve the IRQ SMP affinity problem. I had to use pci-stub to hide the PCI devices since pciback wasn't working. With vcpus=16 (APIC routing is physical flat), the interrupts were working in domU and being routed to CPU0 with the default smp_affinity (ffff), but changing it to any 16-bit one-hot value, or even setting it to the same default value, resulted in a complete loss of interrupts (even on the devices whose smp_affinity I hadn't changed). With vcpus=4 (APIC routing is logical flat), I can see the interrupts being load balanced across all CPUs, but as soon as I changed smp_affinity to any value, the interrupts stopped. This used to work reliably with the non-pv_ops kernel. I attached the logs in case anyone wants to take a look.

I did see the MSI message address/data change in both domU and dom0 (using "lspci -vv"):

vcpus=16:

domU MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 40a9
domU MSI message address/data after smp_affinity=0010:   Address: 00000000fee08000  Data: 40b1 (8 is APIC ID of CPU4)

dom0 MSI message address/data with default smp_affinity: Address: 00000000fee00000  Data: 4094
dom0 MSI message address/data after smp_affinity=0010:   Address: 00000000fee00000  Data: 409c
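
For cross-checking such pairs by hand, a small stand-alone decoder (illustrative only) for the standard x86 MSI layout used in the annotations throughout this thread, with dest ID in address bits 19:12, RH in bit 3, DM in bit 2, delivery mode in data bits 10:8 and the vector in data bits 7:0:

/* Stand-alone helper (illustration only) to decode the lspci
 * Address/Data pairs quoted in this thread into dest ID, RH, DM,
 * delivery mode and vector, per the standard x86 MSI layout. */
#include <stdint.h>
#include <stdio.h>

static void decode_msi(uint64_t addr, uint16_t data)
{
    unsigned int dest_id  = (unsigned int)(addr >> 12) & 0xff; /* APIC ID */
    unsigned int rh       = (unsigned int)(addr >> 3) & 1;     /* redirection hint */
    unsigned int dm       = (unsigned int)(addr >> 2) & 1;     /* 0=physical, 1=logical */
    unsigned int delivery = (data >> 8) & 0x7;                 /* 0=fixed, 1=lowest prio */
    unsigned int vector   = data & 0xff;

    printf("dest ID=%#x RH=%u DM=%u delivery=%u vector=%#x (%u)\n",
           dest_id, rh, dm, delivery, vector, vector);
}

int main(void)
{
    decode_msi(0x00000000fee08000ULL, 0x40b1); /* domU, smp_affinity=0010 */
    decode_msi(0x00000000fee00000ULL, 0x409c); /* dom0, smp_affinity=0010 */
    return 0;
}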

Aside from "lspci -vv" what other means are there to track down this problem? Is there some way to print the interrupt vector table? I'm considering adding printk's to the code that Qing mentioned in his previous email (see below). Any suggestions on where in the code to add the printk's?

Thanks.

Dante

-----Original Message-----
From: Qing He [mailto:qing.he@intel.com] 
Sent: Sunday, October 11, 2009 10:55 PM
To: Cinco, Dante
Cc: Keir Fraser; xen-devel@lists.xensource.com; xiantao.zhang@intel.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On Mon, 2009-10-12 at 13:25 +0800, Cinco, Dante wrote:
> With vcpus < 4, logical flat mode works fine (no error message). I can 
> change smp_affinity to any value > 0 and < 16 and the interrupts go to 
> the proper CPU(s). Could you point me to the code that handles MSI so 
> that I can better understand the MSI implementation?

There are two parts:
  1) init or changing the data and address of MSI:
    1) qemu-xen: hw/passthrough.c: pt_msg.*_write, MSI accesses are
                 trapped here first. And then pt_update_msi in
                 hw/pt-msi.c is called to update the MSI binding.
    2) xen:      drivers/passthrough/io.c: pt_irq_create_bind_vtd,
                 where MSI is actually bound to the guest.

  2) on MSI reception:
    In drivers/passthrough/io.c, hvm_do_IRQ_dpci and hvm_dirq_assist
    are the routines responsible for handling all assigned irqs
    (including MSI), and if an MSI is received, vmsi_deliver in
    arch/x86/vmsi.c gets called to deliver the MSI to the corresponding
    vlapic.

And I just learned from Xiantao Zhang that the guest Linux kernel enables per-cpu vectors when it's in physical mode, and that looks more likely to be relevant to this problem. Older Xen had a problem handling this, and changeset 20253 is supposed to fix it, although I noticed your Xen version is 20270.

Thanks,
Qing

[-- Attachment #2: irq_smp_affinity_problem_pv_ops_dom0_2.6.31.1.tar.gz --]
[-- Type: application/x-gzip, Size: 69208 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-12  5:25             ` Cinco, Dante
@ 2009-10-12  5:54               ` Qing He
  2009-10-14 19:54                 ` Cinco, Dante
  0 siblings, 1 reply; 55+ messages in thread
From: Qing He @ 2009-10-12  5:54 UTC (permalink / raw)
  To: Cinco, Dante; +Cc: xen-devel, Keir Fraser, xiantao.zhang

On Mon, 2009-10-12 at 13:25 +0800, Cinco, Dante wrote:
> With vcpus < 4, logical flat mode works fine (no error message). I can
> change smp_affinity to any value > 0 and < 16 and the interrupts go to
> the proper CPU(s). Could you point me to the code that handles MSI so
> that I can better understand the MSI implementation?

There are two parts:
  1) init or changing the data and address of MSI:
    1) qemu-xen: hw/passthrough.c: pt_msg.*_write, MSI accesses are
                 trapped here first. And then pt_update_msi in
                 hw/pt-msi.c is called to update the MSI binding.
    2) xen:      drivers/passthrough/io.c: pt_irq_create_bind_vtd,
                 where MSI is actually bound to the guest.

  2) on MSI reception:
    In drivers/passthrough/io.c, hvm_do_IRQ_dpci and hvm_dirq_assist
    are the routines responsible for handling all assigned irqs
    (including MSI), and if an MSI is received, vmsi_deliver in
    arch/x86/vmsi.c gets called to deliver the MSI to the corresponding
    vlapic (a condensed sketch of this step follows below).
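
A condensed paraphrase of that reception step, based on the vmsi_deliver() code visible in the patches earlier in this thread (locking, error paths and the old_gvec bookkeeping are omitted; treat it as a sketch, not the complete function):

/* Condensed paraphrase of the reception path described in 2) above,
 * based on the vmsi_deliver() code shown in the patches in this thread.
 * Sketch only, not the complete function. */
static void vmsi_deliver_sketch(struct domain *d, uint8_t vector,
                                uint8_t dest, uint8_t dest_mode,
                                uint8_t delivery_mode, uint8_t trig_mode)
{
    struct vlapic *target;
    struct vcpu *v;

    switch ( delivery_mode )
    {
    case dest_LowestPrio:
        /* Pick a single vLAPIC by lowest-priority arbitration. */
        target = vlapic_lowest_prio(d, NULL, 0, dest, dest_mode);
        if ( target != NULL )
            vmsi_inj_irq(d, target, vector, trig_mode, delivery_mode);
        break;

    default:
        /* Fixed (and other) delivery modes in the real code; simplified
         * here: inject into every vCPU whose vLAPIC matches the
         * destination ID / destination mode pair. */
        for_each_vcpu ( d, v )
            if ( vlapic_match_dest(vcpu_vlapic(v), NULL, 0, dest, dest_mode) )
                vmsi_inj_irq(d, vcpu_vlapic(v), vector, trig_mode,
                             delivery_mode);
        break;
    }
}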

And I just learned from Xiantao Zhang that the guest Linux kernel
enables per-cpu vectors when it's in physical mode, and that looks more
likely to be relevant to this problem. Older Xen had a problem handling
this, and changeset 20253 is supposed to fix it, although I noticed
your Xen version is 20270.

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-10  9:43           ` Qing He
  2009-10-10 10:10             ` Keir Fraser
@ 2009-10-12  5:25             ` Cinco, Dante
  2009-10-12  5:54               ` Qing He
  1 sibling, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-12  5:25 UTC (permalink / raw)
  To: Qing He; +Cc: xen-devel, Keir Fraser

With vcpus < 4, logical flat mode works fine (no error message). I can change smp_affinity to any value > 0 and < 16 and the interrupts go to the proper CPU(s). Could you point me to the code that handles MSI so that I can better understand the MSI implementation?

Thanks.

Dante
________________________________________
From: Qing He [qing.he@intel.com]
Sent: Saturday, October 10, 2009 2:43 AM
To: Cinco, Dante
Cc: Keir Fraser; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4       on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On Sat, 2009-10-10 at 07:39 +0800, Cinco, Dante wrote:
> When I tried adding "hvm_debug=0x200" in the Xen command line, the domU
> became inaccessible on boot up, with the Xen console constantly printing
> this message: "(XEN) [HVM:1.0] <vioapic_irq_positive_edge> irq 2."

So this is useless; maybe one-time setups should be split out from
those that fire every time, or a separate debug level should be used for MSI operations.

> Change /proc/irq/48/smp_affinity from 1 to 2
> - Xen console: (XEN) do_IRQ: 8.211 No irq handler for vector (irq -1)

This is weird. Although there is no other confirmation, I guess this
vector 211 (0xd3) is the MSI vector. That would explain why the MSI
doesn't fire any more.

However, this error message is not expected. A physical MSI at the Xen
level always goes to vcpu 0 when it is first bound, and the affinity
doesn't change after that. Furthermore, logical flat mode works fine;
do you observe this error message when vcpus=4?

I'll continue to investigate and try to reproduce the problem on
my side.

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-10  9:43           ` Qing He
@ 2009-10-10 10:10             ` Keir Fraser
  2009-10-12  5:25             ` Cinco, Dante
  1 sibling, 0 replies; 55+ messages in thread
From: Keir Fraser @ 2009-10-10 10:10 UTC (permalink / raw)
  To: Qing He, Cinco, Dante; +Cc: xen-devel

On 10/10/2009 10:43, "Qing He" <qing.he@intel.com> wrote:

> On Sat, 2009-10-10 at 07:39 +0800, Cinco, Dante wrote:
>> When I tried adding "hvm_debug=0x200" in the Xen command line, the domU
>> became inaccessible on boot up, with the Xen console constantly printing
>> this message: "(XEN) [HVM:1.0] <vioapic_irq_positive_edge> irq 2."
> 
> So this is useless; maybe one-time setups should be split out from
> those that fire every time, or a separate debug level should be used for MSI operations.

Well, indeed. Messages that print on every interrupt are typically useless!
I tend to kill them when I find them, but they keep creeping in.

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-09 23:39         ` Cinco, Dante
@ 2009-10-10  9:43           ` Qing He
  2009-10-10 10:10             ` Keir Fraser
  2009-10-12  5:25             ` Cinco, Dante
  0 siblings, 2 replies; 55+ messages in thread
From: Qing He @ 2009-10-10  9:43 UTC (permalink / raw)
  To: Cinco, Dante; +Cc: xen-devel, Keir Fraser

On Sat, 2009-10-10 at 07:39 +0800, Cinco, Dante wrote:
> When I tried adding "hvm_debug=0x200" in the Xen command line, the domU
> became inaccessible on boot up, with the Xen console constantly printing
> this message: "(XEN) [HVM:1.0] <vioapic_irq_positive_edge> irq 2."

So this is useless; maybe one-time setups should be split out from
those that fire every time, or a separate debug level should be used for MSI operations.

> Change /proc/irq/48/smp_affinity from 1 to 2
> - Xen console: (XEN) do_IRQ: 8.211 No irq handler for vector (irq -1)

This is weird. Although there is no other confirmation, I guess this
vector 211 (0xd3) is the MSI vector. That would explain why the MSI
doesn't fire any more.

However, this error message is not expected. A physical MSI at the Xen
level always goes to vcpu 0 when it is first bound, and the affinity
doesn't change after that. Furthermore, logical flat mode works fine;
do you observe this error message when vcpus=4?

I'll continue to investigate and try to reproduce the problem on
my side.

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-09  9:07       ` Qing He
  2009-10-09 15:59         ` Cinco, Dante
@ 2009-10-09 23:39         ` Cinco, Dante
  2009-10-10  9:43           ` Qing He
  1 sibling, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-09 23:39 UTC (permalink / raw)
  To: Qing He, Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3622 bytes --]

Qing,

I'm attaching a tar'd directory that contains the various log files I gathered from my system. When I tried adding "hvm_debug=0x200" in the Xen command line, the domU became inaccessible on boot up, with the Xen console constantly printing this message: "(XEN) [HVM:1.0] <vioapic_irq_positive_edge> irq 2." So I backed out the hvm_debug, but hopefully there is still enough logging to provide some clues. Here's a summary of the events leading to the lost interrupts:

Boot Xen 3.5-unstable with 2.6.30.3
- command line: /xen-3.5-unstable.gz com1=115200,8n1 console=com1 acpi=force apic=on iommu=1,no-intremap,passthrough loglvl=all loglvl_guest=all
- command line: module /vmlinuz-2.6.31.1 root=UUID=xxx ro pciback.hide=(07:00.0)(07:00.1)(07:00.2)(07:00.3) acpi=force console=ttyS0
- dom0: lspci -vv shows device at IRQ 32 with MSI message address, data = 0x0, 0x0

Bringup domU with vcpus=5, hap=0, pci=['07:00.0@8','07:00.1@9','07:00.2@a','07:00.3@b'] (device driver not yet loaded)
- dom0: lspci -vv shows device at IRQ 32 (07:00.0) with MSI message address, data = 0xfee01000, 0x407b
- config: 

Load kernel module that contains device driver
- dom0: no change in lspci -vv
- domU: lspci -vv shows device at IRQ 48 (00:08.0) with MSI message address, data = 0xfee00000, 0x4059
- domU: /proc/interrupts show interrupts for IRQ 48 going to CPU0

Change /proc/irq/48/smp_affinity from 1f to 1
- dom0: no change to lspci -vv
- domU: no change to lspci -vv
- domU: /proc/interrupts show interrupts for IRQ 48 going to CPU0

Change /proc/irq/48/smp_affinity from 1 to 2
- dom0: lspci -vv shows MSI message data changed from 0x407b to 0x40d3, address the same
- domU: lspci -vv shows new MSI message address, data = 0xfee02000, 0x4079
- domU: no more interrupts from IRQ 48
- Xen console: (XEN) do_IRQ: 8.211 No irq handler for vector (irq -1)

Dante

-----Original Message-----
From: Qing He [mailto:qing.he@intel.com] 
Sent: Friday, October 09, 2009 2:08 AM
To: Keir Fraser
Cc: Cinco, Dante; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On Fri, 2009-10-09 at 05:35 +0800, Keir Fraser wrote:
> On 08/10/2009 19:11, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:
> 
> > The IRQ SMP affinity problem happens on just the passthrough one using MSI.
> > 
> > I've only used Xen 3.4.1. Are you aware of recent code changes that 
> > may address this issue?
> 
> No, but it might be worth a try. Unfortunately I'm not so familiar 
> with the MSI passthru code as I am with the rest of the irq emulation 
> layer. Qing He
> (cc'ed) may be able to assist, as I think he did much of the 
> development of MSI support for passthru devices.
> 

MSI passthru uses emulation; there is no direct connection between the guest affinity and the physical affinity. When an MSI is received, the vmsi logic calculates the destination and sets the virtual local APIC of that VCPU.

But after checking the code, the part handling DM=0 is there and I haven't found big problems at first glance; maybe there is some glitch that causes the MSI failure in physical mode.

Some debug logs could help track down the problem. Can you add 'hvm_debug=0x200' to the xen command line and post the xm dmesg result?
This will print HVM debug level DBG_LEVEL_IOAPIC, which includes the vmsi delivery logic.

There are two patches between 3.4.1 and unstable (20084 and 20140). These are mainly cleanup patches, but the related code does change; I don't know if they fix this issue.

Thanks,
Qing

[-- Attachment #2: irq_smp_affinity_problem.tar.gz --]
[-- Type: application/x-gzip, Size: 70540 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-09  9:07       ` Qing He
@ 2009-10-09 15:59         ` Cinco, Dante
  2009-10-09 23:39         ` Cinco, Dante
  1 sibling, 0 replies; 55+ messages in thread
From: Cinco, Dante @ 2009-10-09 15:59 UTC (permalink / raw)
  To: Qing He, Keir Fraser; +Cc: xen-devel

Thanks for the suggestions, Qing. I will send you the log with "hvm_debug=0x200" and try Xen 3.5-unstable.

Dante 

-----Original Message-----
From: Qing He [mailto:qing.he@intel.com] 
Sent: Friday, October 09, 2009 2:08 AM
To: Keir Fraser
Cc: Cinco, Dante; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On Fri, 2009-10-09 at 05:35 +0800, Keir Fraser wrote:
> On 08/10/2009 19:11, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:
> 
> > The IRQ SMP affinity problem happens on just the passthrough one using MSI.
> > 
> > I've only used Xen 3.4.1. Are you aware of recent code changes that 
> > may address this issue?
> 
> No, but it might be worth a try. Unfortunately I'm not so familiar 
> with the MSI passthru code as I am with the rest of the irq emulation 
> layer. Qing He
> (cc'ed) may be able to assist, as I think he did much of the 
> development of MSI support for passthru devices.
> 

MSI passthru uses emulation; there is no direct connection between the guest affinity and the physical affinity. When an MSI is received, the vmsi logic calculates the destination and sets the virtual local APIC of that VCPU.

But after checking the code, the part handling DM=0 is there and I haven't found big problems at first glance; maybe there is some glitch that causes the MSI failure in physical mode.

Some debug logs could help track down the problem. Can you add 'hvm_debug=0x200' to the xen command line and post the xm dmesg result?
This will print HVM debug level DBG_LEVEL_IOAPIC, which includes the vmsi delivery logic.

There are two patches between 3.4.1 and unstable (20084 and 20140). These are mainly cleanup patches, but the related code does change; I don't know if they fix this issue.

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-08 21:35     ` Keir Fraser
@ 2009-10-09  9:07       ` Qing He
  2009-10-09 15:59         ` Cinco, Dante
  2009-10-09 23:39         ` Cinco, Dante
  0 siblings, 2 replies; 55+ messages in thread
From: Qing He @ 2009-10-09  9:07 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Cinco, Dante, xen-devel

On Fri, 2009-10-09 at 05:35 +0800, Keir Fraser wrote:
> On 08/10/2009 19:11, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:
> 
> > The IRQ SMP affinity problem happens on just the passthrough one using MSI.
> > 
> > I've only used Xen 3.4.1. Are you aware of recent code changes that may
> > address this issue?
> 
> No, but it might be worth a try. Unfortunately I'm not so familiar with the
> MSI passthru code as I am with the rest of the irq emulation layer. Qing He
> (cc'ed) may be able to assist, as I think he did much of the development of
> MSI support for passthru devices.
> 

MSI passthru uses emulation; there is no direct connection between the guest
affinity and the physical affinity. When an MSI is received, the vmsi logic
calculates the destination and sets the virtual local APIC of that VCPU.

But after checking the code, the part handling DM=0 is there and I haven't
found big problems at first glance; maybe there is some glitch that
causes the MSI failure in physical mode.

Some debug logs could help track down the problem. Can you add
'hvm_debug=0x200' to the xen command line and post the xm dmesg result?
This will print HVM debug level DBG_LEVEL_IOAPIC, which includes the vmsi
delivery logic.

There are two patches between 3.4.1 and unstable (20084 and 20140). These
are mainly cleanup patches, but the related code does change; I don't know
if they fix this issue.

Thanks,
Qing

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-08 18:11   ` Cinco, Dante
@ 2009-10-08 21:35     ` Keir Fraser
  2009-10-09  9:07       ` Qing He
  0 siblings, 1 reply; 55+ messages in thread
From: Keir Fraser @ 2009-10-08 21:35 UTC (permalink / raw)
  To: Cinco, Dante, xen-devel; +Cc: Qing He

On 08/10/2009 19:11, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:

> The IRQ SMP affinity problem happens on just the passthrough one using MSI.
> 
> I've only used Xen 3.4.1. Are you aware of recent code changes that may
> address this issue?

No, but it might be worth a try. Unfortunately I'm not so familiar with the
MSI passthru code as I am with the rest of the irq emulation layer. Qing He
(cc'ed) may be able to assist, as I think he did much of the development of
MSI support for passthru devices.

 -- Keir

> Dante
> 
> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Thursday, October 08, 2009 11:06 AM
> To: Cinco, Dante; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on
> HP ProLiant G6 with dual Xeon 5540 (Nehalem)
> 
> On 08/10/2009 01:08, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:
> 
>> One of my questions is "Why does domU use only even numbered APIC
>> IDs?" If it used odd numbers, then physical flat APIC routing will
>> only trigger when vcpus
>>> 7.
> 
> It's just the mapping we use. Local APICs get even numbers, IOAPIC gets id 1.
> 
>> I welcome any suggestions on how to pursue this problem or hopefully,
>> someone will say that a patch for this already exists.
> 
> Is this true for all interrupts, or just the passthrough one using MSI?
> 
> What Xen version are you using? You say '3.4 unstable' - do you mean tip of
> xen-3.4-testing.hg? Have you tried xen-unstable.hg (current development tree)?
> 
>  -- Keir
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-08 18:05 ` Keir Fraser
@ 2009-10-08 18:11   ` Cinco, Dante
  2009-10-08 21:35     ` Keir Fraser
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-08 18:11 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

The IRQ SMP affinity problem happens on just the passthrough one using MSI.

I've only used Xen 3.4.1. Are you aware of recent code changes that may address this issue?

Dante

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
Sent: Thursday, October 08, 2009 11:06 AM
To: Cinco, Dante; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)

On 08/10/2009 01:08, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:

> One of my questions is "Why does domU use only even numbered APIC 
> IDs?" If it used odd numbers, then physical flat APIC routing will 
> only trigger when vcpus
> > 7.

It's just the mapping we use. Local APICs get even numbers, IOAPIC gets id 1.

> I welcome any suggestions on how to pursue this problem or hopefully, 
> someone will say that a patch for this already exists.

Is this true for all interrupts, or just the passthrough one using MSI?

> What Xen version are you using? You say '3.4 unstable' - do you mean tip of xen-3.4-testing.hg? Have you tried xen-unstable.hg (current development tree)?

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-08  0:08 Cinco, Dante
  2009-10-08 16:07 ` Bruce Edge
@ 2009-10-08 18:05 ` Keir Fraser
  2009-10-08 18:11   ` Cinco, Dante
  1 sibling, 1 reply; 55+ messages in thread
From: Keir Fraser @ 2009-10-08 18:05 UTC (permalink / raw)
  To: Cinco, Dante, xen-devel

On 08/10/2009 01:08, "Cinco, Dante" <Dante.Cinco@lsi.com> wrote:

> One of my questions is "Why does domU use only even numbered APIC IDs?" If it
> used odd numbers, then physical flat APIC routing will only trigger when vcpus
> > 7.

It's just the mapping we use. Local APICs get even numbers, IOAPIC gets id
1.

> I welcome any suggestions on how to pursue this problem or hopefully, someone
> will say that a patch for this already exists.

Is this true for all interrupts, or just the passthrough one using MSI?

What Xen version are you using? You say '3.4 unstable' - do you mean tip of
xen-3.4-testing.hg? Have you tried xen-unstable.hg (current development
tree)?

 -- Keir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
  2009-10-08  0:08 Cinco, Dante
@ 2009-10-08 16:07 ` Bruce Edge
  2009-10-08 18:05 ` Keir Fraser
  1 sibling, 0 replies; 55+ messages in thread
From: Bruce Edge @ 2009-10-08 16:07 UTC (permalink / raw)
  Cc: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 7041 bytes --]

More info on the version...

It's actually the 3.4.1 release.
Also the dom0 is 2.6.30.3 with Andrew Lyon's patch set.

Built from Boris's HOWTO:
http://bderzhavets.wordpress.com/2009/08/14/attempt-of-prevu-xen-3-4-1-hypervisor-on-ubuntu-jaunty-server-64-bit/

-Bruce

On Wed, Oct 7, 2009 at 5:08 PM, Cinco, Dante <Dante.Cinco@lsi.com> wrote:

>  I need help tracking down an IRQ SMP affinity problem.
>
> Xen version: 3.4 unstable
> dom0: Linux 2.6.30.3 (Debian)
> domU: Linux 2.6.30.1 (Debian)
> Hardware platform: HP ProLiant G6, dual-socket Xeon 5540, hyperthreading
> enabled in BIOS and kernel (total of 16 CPUs: 2 sockets * 4 cores per socket
> * 2 threads per core)
>
> With vcpus < 5, I can change /proc/irq/<irq#>/smp_affinity and see the
> interrupts get routed to the proper CPU(s) by checking /proc/interrupts.
> With vcpus > 4, any change to /proc/irq/<irq#>/smp_affinity results in a
> complete loss of interrupts for <irq#>.
>
> I noticed in the domU /var/log/kern.log that APIC routing changes from
> "flat" for vcpus=4 to "physical flat" for vcpus=5. Looking at the source
> code for linux-2.6.30.1/arch/x86/kernel/apic/probe_64.c, this switch occurs
> when "max_physical_apicid >= 8." In the domU /var/log/kern.log and
> /proc/cpuinfo, only even numbered APIC IDs (starting from 0) are used so
> when it gets to the 5th CPU, it is already at APIC ID 8 which triggers the
> physical flat APIC routing.
>
> dom0 has all 16 CPUs available to it. The mapping between CPU numbers and
> APIC ID is 1-to-1 (CPU0:APIC ID0 ... CPU15:APIC ID15). domU is configured
> with either vcpus=4 or vcpus=5. In both cases, the mapping uses even number
> only for the APIC IDs (CPU0:APIC ID0 ... CPU5:APIC ID8).
>
> I'm using an ATTO/PMC Tachyon-based Fibre Channel PCIe card on this
> platform. It uses PCI-MSI-edge for its interrupt. I use pciback.hide in my
> dom0 Xen 3.5 kernel stanza to pass the device directly to domU. I'm also
> using "iommu=1,no-intremap,passthrough" in the stanza. I'm able to see the
> device in dom0 via "lspci -vv" and see the MSI message address and data that
> have been programmed into the Tachyon registers and using IRQ 32. Regardless
> of changes to IRQ 32's SMP affinity in domU, the MSI message address and
> data as seen from dom0 does not change. I can only conclude that domU is
> running some sort of IRQ emulation.
>
> # lspci -vv in dom0
> 07:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
>         Subsystem: Atto Technology Device 003c
>         Interrupt: pin A routed to IRQ 32
>         Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/1 Enable+
>                 Address: 00000000fee00000  Data: 40ba (dest ID=0, RH=DM=0,
> fixed interrupt, vector=0xba)
>         Kernel driver in use: pciback
>
> In domU, the device has been remapped (intentionally in the dom0 config
> file) to bus 0, device 8 and can also be seen via "lspci -vv" with the same
> MSI message address but different data and using IRQ 48.
>
> # lspci -vv in domU with vcpus=5
> 00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
>         Subsystem: Atto Technology Device 003c
>         Interrupt: pin A routed to IRQ 48
>         Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/0 Enable+
>                 Address: 00000000fee00000  Data: 4059 (dest ID=0, RH=DM=0,
> fixed interrupt, vector=0x59)
>         Kernel driver in use: hwdrv
>         Kernel modules: hbas-hw
>
> At this point, the kernel driver for the device has been loaded and the
> number of interrupts can be seen in /proc/interrupts. The default IRQ SMP
> affinity has not been changed, and yet the interrupts are all being routed to CPU0.
> This is for vcpus=5 (physical flat APIC routing). Changing IRQ 48's SMP
> affinity to any value will result in a complete loss of all interrupts. domU
> and dom0 need to be rebooted to restore normal operation.
> # cat /proc/irq/48/smp_affinity
> 1f
> # cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3       CPU4
>   48:      60920          0          0          0          0
> PCI-MSI-edge      HW_TACHYON
>
> With vcpus=4 (flat APIC routing), IRQ 48's SMP affinity behaves as expected
> (each of the 4 bits in /proc/irq/48/smp_affinity correspond to a CPU or CPUs
> where the interrupts will be routed). The MSI message address and data have
> different attributes compared to vcpus=5. The address has dest ID=f (matches
> default /proc/irq/48/smp_affinity), RH=DM=1 and uses lowest priority instead
> of fixed interrupt.
>
> # lspci -vv in domU with vcpus=4
> 00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
>         Subsystem: Atto Technology Device 003c
>         Interrupt: pin A routed to IRQ 48
>         Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/0 Enable+
>                 Address: 00000000fee0f00c  Data: 4159 (dest ID=f, RH=DM=1,
> lowest priority interrupt, vector=0x59)
>         Kernel driver in use: hwdrv
>         Kernel modules: hbas-hw
>
> # cat /proc/irq/48/smp_affinity
> f
> # cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3
>   48:      14082      19052      15337      14645   PCI-MSI-edge
> HW_TACHYON
>
> Changing IRQ 48's SMP affinity to 8 shows that all the interrupts are being
> routed to CPU3 as expected and the MSI message address has changed to
> reflect the new dest ID while the vector stays the same.
>
> # echo 8 > /proc/irq/48/smp_affinity
> # cat /proc/interrupts
>   48:      14082      19052      15338     351361   PCI-MSI-edge
> HW_TACHYON
>
> # lspci -vv in domU with vcpus=4
> 00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
>         Subsystem: Atto Technology Device 003c
>         Interrupt: pin A routed to IRQ 48
>         Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/0 Enable+
>                 Address: 00000000fee0800c  Data: 4159 (dest ID=8, RH=DM=1,
> lowest priority interrupt, vector=0x59)
>         Kernel driver in use: hwdrv
>         Kernel modules: hbas-hw
>
> My hunch is there is something wrong with physical flat APIC routing in
> domU. If I boot this same platform to straight Linux 2.6.30.1 (no Xen),
> /var/log/kern.log shows that it too is using physical flat APIC routing
> which is expected since it has a total of 16 CPUs. Unlike domU though,
> changing the IRQ SMP affinity to any one-hot value (only one bit out of 16
> is set to 1) behaves as expected. A non-one hot value results in all
> interrupts being routed to CPU0 but at least the interrupts are not lost.
>
> One of my questions is "Why does domU use only even numbered APIC IDs?" If
> it used odd numbers, then physical flat APIC routing will only trigger when
> vcpus > 7.
>
> I welcome any suggestions on how to pursue this problem or hopefully,
> someone will say that a patch for this already exists.
>
> Thanks.
>
> Dante Cinco
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>

[-- Attachment #1.2: Type: text/html, Size: 8603 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
@ 2009-10-08  0:08 Cinco, Dante
  2009-10-08 16:07 ` Bruce Edge
  2009-10-08 18:05 ` Keir Fraser
  0 siblings, 2 replies; 55+ messages in thread
From: Cinco, Dante @ 2009-10-08  0:08 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 6383 bytes --]

I need help tracking down an IRQ SMP affinity problem.

Xen version: 3.4 unstable
dom0: Linux 2.6.30.3 (Debian)
domU: Linux 2.6.30.1 (Debian)
Hardware platform: HP ProLiant G6, dual-socket Xeon 5540, hyperthreading enabled in BIOS and kernel (total of 16 CPUs: 2 sockets * 4 cores per socket * 2 threads per core)

With vcpus < 5, I can change /proc/irq/<irq#>/smp_affinity and see the interrupts get routed to the proper CPU(s) by checking /proc/interrupts. With vcpus > 4, any change to /proc/irq/<irq#>/smp_affinity results in a complete loss of interrupts for <irq#>.

I noticed in the domU /var/log/kern.log that APIC routing changes from "flat" for vcpus=4 to "physical flat" for vcpus=5. Looking at the source code for linux-2.6.30.1/arch/x86/kernel/apic/probe_64.c, this switch occurs when "max_physical_apicid >= 8." In the domU /var/log/kern.log and /proc/cpuinfo, only even numbered APIC IDs (starting from 0) are used, so when it gets to the 5th CPU, it is already at APIC ID 8, which triggers the physical flat APIC routing.
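
For reference, the check being referred to looks roughly like this in the 2.6.30-era probe_64.c (paraphrased and simplified; a sketch rather than the exact kernel source):

/* Paraphrase of the 2.6.30-era logic in arch/x86/kernel/apic/probe_64.c:
 * once any APIC ID of 8 or more has been seen, the 64-bit kernel drops
 * logical flat routing and switches to physical flat.  Sketch only. */
void setup_apic_routing_sketch(void)
{
    if (apic == &apic_flat && max_physical_apicid >= 8)
        apic = &apic_physflat;

    printk(KERN_INFO "Setting APIC routing to %s\n", apic->name);
}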

dom0 has all 16 CPUs available to it. The mapping between CPU numbers and APIC ID is 1-to-1 (CPU0:APIC ID0 ... CPU15:APIC ID15). domU is configured with either vcpus=4 or vcpus=5. In both cases, the mapping uses even number only for the APIC IDs (CPU0:APIC ID0 ... CPU5:APIC ID8).

I'm using an ATTO/PMC Tachyon-based Fibre Channel PCIe card on this platform. It uses PCI-MSI-edge for its interrupt. I use pciback.hide in my dom0 Xen 3.5 kernel stanza to pass the device directly to domU. I'm also using "iommu=1,no-intremap,passthrough" in the stanza. I'm able to see the device in dom0 via "lspci -vv" and see the MSI message address and data that have been programmed into the Tachyon registers and using IRQ 32. Regardless of changes to IRQ 32's SMP affinity in domU, the MSI message address and data as seen from dom0 does not change. I can only conclude that domU is running some sort of IRQ emulation.

# lspci -vv in dom0
07:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 32
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
                Address: 00000000fee00000  Data: 40ba (dest ID=0, RH=DM=0, fixed interrupt, vector=0xba)
        Kernel driver in use: pciback

In domU, the device has been remapped (intentionally in the dom0 config file) to bus 0, device 8 and can also be seen via "lspci -vv" with the same MSI message address but different data and using IRQ 48.

# lspci -vv in domU with vcpus=5
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee00000  Data: 4059 (dest ID=0, RH=DM=0, fixed interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw

At this point, the kernel driver for the device has been loaded and the number of interrupts can be seen in /proc/interrupts. The default IRQ SMP affinity has not been changed, and yet the interrupts are all being routed to CPU0. This is for vcpus=5 (physical flat APIC routing). Changing IRQ 48's SMP affinity to any value will result in a complete loss of all interrupts. domU and dom0 need to be rebooted to restore normal operation.
# cat /proc/irq/48/smp_affinity
1f
# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4
  48:      60920          0          0          0          0   PCI-MSI-edge      HW_TACHYON

With vcpus=4 (flat APIC routing), IRQ 48's SMP affinity behaves as expected (each of the 4 bits in /proc/irq/48/smp_affinity correspond to a CPU or CPUs where the interrupts will be routed). The MSI message address and data have different attributes compared to vcpus=5. The address has dest ID=f (matches default /proc/irq/48/smp_affinity), RH=DM=1 and uses lowest priority instead of fixed interrupt.

# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0f00c  Data: 4159 (dest ID=f, RH=DM=1, lowest priority interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw

# cat /proc/irq/48/smp_affinity
f
# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  48:      14082      19052      15337      14645   PCI-MSI-edge      HW_TACHYON

Changing IRQ 48's SMP affinity to 8 shows that all the interrupts are being routed to CPU3 as expected and the MSI message address has changed to reflect the new dest ID while the vector stays the same.

# echo 8 > /proc/irq/48/smp_affinity
# cat /proc/interrupts
  48:      14082      19052      15338     351361   PCI-MSI-edge      HW_TACHYON

# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0800c  Data: 4159 (dest ID=8, RH=DM=1, lowest priority interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw

My hunch is there is something wrong with physical flat APIC routing in domU. If I boot this same platform to straight Linux 2.6.30.1 (no Xen), /var/log/kern.log shows that it too is using physical flat APIC routing which is expected since it has a total of 16 CPUs. Unlike domU though, changing the IRQ SMP affinity to any one-hot value (only one bit out of 16 is set to 1) behaves as expected. A non-one hot value results in all interrupts being routed to CPU0 but at least the interrupts are not lost.

One of my questions is "Why does domU use only even numbered APIC IDs?" If it used odd numbers, then physical flat APIC routing will only trigger when vcpus > 7.

I welcome any suggestions on how to pursue this problem or hopefully, someone will say that a patch for this already exists.

Thanks.

Dante Cinco


[-- Attachment #1.2: Type: text/html, Size: 9768 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2009-10-26 13:34 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-16  1:38 IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) Cinco, Dante
2009-10-16  2:34 ` Qing He
2009-10-16  6:37   ` Keir Fraser
2009-10-16  7:32     ` Zhang, Xiantao
2009-10-16  8:24       ` Qing He
2009-10-16  8:22         ` Zhang, Xiantao
2009-10-16  8:34           ` Qing He
2009-10-16  8:35             ` Zhang, Xiantao
2009-10-16  9:01               ` Qing He
2009-10-16  9:42                 ` Qing He
2009-10-16  9:49                 ` Zhang, Xiantao
2009-10-16 14:54                   ` Zhang, Xiantao
2009-10-16 18:24                     ` Cinco, Dante
2009-10-17  0:59                       ` Zhang, Xiantao
2009-10-20  0:19                         ` Cinco, Dante
2009-10-20  5:46                           ` Zhang, Xiantao
2009-10-20  7:51                             ` Zhang, Xiantao
2009-10-20 17:26                               ` Cinco, Dante
2009-10-21  1:10                                 ` Zhang, Xiantao
2009-10-22  1:00                                   ` Cinco, Dante
2009-10-22  1:58                                     ` Zhang, Xiantao
2009-10-22  2:42                                       ` Zhang, Xiantao
2009-10-22  6:25                                         ` Keir Fraser
2009-10-22 21:11                                           ` Jeremy Fitzhardinge
2009-10-22  5:10                                       ` Qing He
2009-10-23  0:10                                         ` Cinco, Dante
2009-10-22  6:46                               ` Jan Beulich
2009-10-22  7:11                                 ` Zhang, Xiantao
2009-10-22  7:31                                   ` Jan Beulich
2009-10-22  8:41                                     ` Zhang, Xiantao
2009-10-22  9:42                                       ` Keir Fraser
2009-10-22 16:32                                         ` Zhang, Xiantao
2009-10-22 16:33                                         ` Cinco, Dante
2009-10-23  1:06                                           ` Zhang, Xiantao
2009-10-26 13:02                                         ` Zhang, Xiantao
2009-10-26 13:34                                           ` Keir Fraser
2009-10-16  9:41               ` Keir Fraser
2009-10-16  9:57                 ` Qing He
2009-10-16  9:58                 ` Zhang, Xiantao
2009-10-16 10:21                   ` Jan Beulich
  -- strict thread matches above, loose matches on Subject: below --
2009-10-08  0:08 Cinco, Dante
2009-10-08 16:07 ` Bruce Edge
2009-10-08 18:05 ` Keir Fraser
2009-10-08 18:11   ` Cinco, Dante
2009-10-08 21:35     ` Keir Fraser
2009-10-09  9:07       ` Qing He
2009-10-09 15:59         ` Cinco, Dante
2009-10-09 23:39         ` Cinco, Dante
2009-10-10  9:43           ` Qing He
2009-10-10 10:10             ` Keir Fraser
2009-10-12  5:25             ` Cinco, Dante
2009-10-12  5:54               ` Qing He
2009-10-14 19:54                 ` Cinco, Dante
2009-10-16  0:09                   ` Konrad Rzeszutek Wilk
2009-10-16  1:40                     ` Konrad Rzeszutek Wilk
