From: "Thimo E."
Subject: Re: cpuidle and un-eoid interrupts at the local apic
Date: Mon, 19 Aug 2013 17:14:38 +0200
Message-ID: <5212365E.7010803@digithi.de>
In-Reply-To: <5208CF6B.7030505@citrix.com>
To: Andrew Cooper
Cc: Keir Fraser, Jan Beulich, "Dong, Eddie", Xen-devel List, "Nakajima, Jun", "Zhang, Yang Z", "Zhang, Xiantao"
List-Id: xen-devel@lists.xenproject.org
Hello,

after one week of testing an intermediate result:

Since I've set iommu=no-intremap, no crash has occurred so far. The server has never run this long without a crash. So a careful "it's working", but, because only 7 days have passed so far, not a final hooray.

Even if this option really avoids the problem, I classify it as nothing more than a workaround... obviously a good one, because it works, but still a workaround.

Where could the source of the problem be? A bug in hardware? A bug in software?

And what does interrupt remapping really do? Does disabling remapping have a performance impact?

Best regards
  Thimo

Am 12.08.2013 14:04, schrieb Andrew Cooper:
On 12/08/13 12:52, Thimo E wrote:
Hello Yang,

attached you'll find the kernel dmesg, xen dmesg, lspci and output of /proc/interrupts. If you want to see further logfiles, please let me know.

The processor is a Core i5-4670. The board is an Intel DH87MC mainboard. I am really not sure if it supports APICv, but VT-d is supported and enabled.


4. The status of IRQ 29 is 10, which means the guest has already issued the EOI, because the bit IRQ_GUEST_EOI_PENDING is cleared, so there should be no pending EOI on the EOI stack. If possible, can you add some debug messages in the guest EOI code path (like _irq_guest_eoi()) to track the EOI?

I don't see IRQ 29 in /proc/interrupts; what I see is:
cat xen-dmesg.txt | grep "29": (XEN) allocated vector 29 for irq 20
cat dmesg.txt | grep "eth0": [   23.152355] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
                             [   23.330408] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection

So is the Ethernet IRQ the bad one? That is an onboard Intel network adapter.

That would be consistent with the crash seen with our hardware in XenServer.


6. I guess interrupt remapping is enabled on your machine. Can you try to disable IR to see whether it is still reproducible?

Just to be sure, your proposal is to try the parameter "no-intremap"?

specifically, iommu=no-intremap
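For reference, on a GRUB 2 system the option belongs on the Xen (hypervisor) command line, not the dom0 kernel line. A minimal sketch, assuming the Debian-style /etc/default/grub variables are in use (the exact variable name may differ per distribution):

```shell
# /etc/default/grub -- append the option to the Xen hypervisor command line.
# GRUB_CMDLINE_XEN_DEFAULT is read by the common grub-mkconfig Xen helper
# scripts; check your distribution's GRUB documentation if it differs.
GRUB_CMDLINE_XEN_DEFAULT="iommu=no-intremap"

# Then regenerate the GRUB configuration and reboot, e.g.:
#   update-grub                                (Debian/Ubuntu)
#   grub2-mkconfig -o /boot/grub2/grub.cfg     (Fedora/SUSE-style)
```

After rebooting, the active Xen command line can be checked with `xl dmesg | grep "Command line"` to confirm the option took effect.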


Best regards
  Thimo

~Andrew


Am 12.08.2013 10:49, schrieb Zhang, Yang Z:

Hi Thimo,

From your previous experience and log, it shows:

1. The interrupt that triggers the issue is an MSI.

2. MSIs are normally treated as edge-triggered interrupts, except when there is no way to mask the device. In this case, your previous log indicates the device is unmaskable (what special device are you using? Modern PCI devices should be maskable).

3. IRQ 29 belongs to dom0; it seems this is not an HVM-related issue.

4. The status of IRQ 29 is 10, which means the guest has already issued the EOI, because the bit IRQ_GUEST_EOI_PENDING is cleared, so there should be no pending EOI on the EOI stack. If possible, can you add some debug messages in the guest EOI code path (like _irq_guest_eoi()) to track the EOI?

5. Both of the logs show that when the issue occurred, most of the other interrupts owned by dom0 were in IRQ_MOVE_PENDING status. Is that a coincidence? Or does it happen only under special conditions like heavy IRQ migration? Perhaps you can disable IRQ balancing in dom0 and pin the IRQs manually.
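The suggestion above (stop automatic IRQ balancing in dom0 and pin the interrupt by hand) can be sketched as follows. This assumes the network interrupt is IRQ 20, as in the dmesg lines quoted earlier, and that dom0 runs the common irqbalance daemon; both are assumptions about your setup, and the commands need root in dom0:

```shell
# Stop the irqbalance daemon so it no longer migrates IRQs
# (the service name and init system may vary per distribution).
/etc/init.d/irqbalance stop

# Pin IRQ 20 to CPU 0 only. smp_affinity is a hexadecimal CPU bitmask:
# 1 = CPU0, 2 = CPU1, 4 = CPU2, 3 = CPU0+CPU1, and so on.
echo 1 > /proc/irq/20/smp_affinity

# Verify the new affinity mask.
cat /proc/irq/20/smp_affinity
```

With the IRQ pinned, the IRQ_MOVE_PENDING state should no longer occur for that interrupt, which would help confirm whether IRQ migration is involved in the crash.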

6. I guess interrupt remapping is enabled on your machine. Can you try to disable IR to see whether it is still reproducible?

Also, please provide the whole Xen log.

Best regards,

Yang





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
