From: Ben Guthro
Subject: Re: S3 crash with VTD Queue Invalidation enabled
Date: Wed, 5 Jun 2013 09:54:19 -0400
To: Jan Beulich
Cc: Andrew Cooper, xiantao.zhang@intel.com, xen-devel

On Wed, Jun 5, 2013 at 4:24 AM, Jan Beulich wrote:
>>>> On 04.06.13 at 23:09, Ben Guthro wrote:
>> On Tue, Jun 4, 2013 at 3:49 PM, Ben Guthro wrote:
>>> On Tue, Jun 4, 2013 at 3:20 PM, Ben Guthro wrote:
>>>> On Tue, Jun 4, 2013 at 10:01 AM, Jan Beulich wrote:
>>>>>>>> On 04.06.13 at 14:25, Ben Guthro wrote:
>>>>>> On Tue, Jun 4, 2013 at 4:54 AM, Jan Beulich wrote:
>>>>>>> Is this perhaps having some similarity with
>>>>>>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg00343.html?
>>>>>>> We're clearly running single-CPU only here and there...
>>>>>>
>>>>>> We certainly should be, as we have gone through the
>>>>>> disable_nonboot_cpus() by this point - and I can verify that from the
>>>>>> logs.
>>>>>
>>>>> I'm much more tending towards the connection here, noting that
>>>>> Andrew's original thread didn't really lead anywhere (i.e. we still
>>>>> don't know what the panic he saw was actually caused by).
>>>>>
>>>>
>>>> I'm starting to think you're on to something here.
>>>
>>> hmm - maybe not.
>>> I get the same crash with "maxcpus=1"
>>>
>>>
>>>> I've put a bunch of trace throughout the functions in qinval.c
>>>>
>>>> It seems that everything is functioning properly, up until we go
>>>> through the disable_nonboot_cpus() path.
>>>> Prior to this, I see the qinval.c functions being executed on all
>>>> cpus, and both drhd units
>>>> Afterward, it gets stuck in queue_invalidate_wait on the first drhd
>>>> unit.. and eventually panics.
>>>>
>>>> I'm not exactly sure what to make of this yet.
>>
>> querying status of the hardware all seems to be working correctly...
>> it just doesn't work with querying the QINVAL_STAT_DONE state, as far
>> as I can tell.
>>
>> Other register state is:
>>
>> (XEN) VER = 10
>> (XEN) CAP = c0000020e60262
>> (XEN) n_fault_reg = 1
>> (XEN) fault_recording_offset = 200
>> (XEN) fault_recording_reg_l = 0
>> (XEN) fault_recording_reg_h = 0
>> (XEN) ECAP = f0101a
>> (XEN) GCMD = 0
>> (XEN) GSTS = c7000000
>> (XEN) RTADDR = 137a31000
>> (XEN) CCMD = 800000000000000
>> (XEN) FSTS = 0
>> (XEN) FECTL = 0
>> (XEN) FEDATA = 4128
>> (XEN) FEADDR = fee0000c
>> (XEN) FEUADDR = 0
>>
>> (with code lifted from print_iommu_regs() )
>>
>>
>> None of this looks suspicious to my untrained eye - but I'm including
>> it here in case someone else sees something I don't.
>
> Xiantao, you certainly will want to give some advice here. I won't
> be able to look into this more deeply right away.

Thanks Jan.

Xiantao - I'd appreciate any insight you may have.

One curious thing I have found, that seems buggy to me, is that
{dis,en}able_qinval() is being called prior to the platform quirks
being executed.

It appears they are being called through iommu_{en,dis}able_x2apic_IR()

However, when I try to put a BUG(), or dump_execution_state in that
code, it would not dump a stack.

I was going to put a platform quirk in, to detect, and disable qinval
on this platform, but it seems that may be too late in the process.
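
For anyone reading the register dump quoted above, the GSTS value can be
decoded bit by bit. What follows is only a minimal sketch: it assumes the
bit positions of the VT-d Global Status Register as I understand them from
the spec, and the table, the struct and gsts_decode() are hypothetical
helper names made up for illustration, not code from Xen's tree.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/*
 * Assumed layout of the VT-d Global Status Register (GSTS).
 * The names are illustrative and not copied from Xen's headers.
 */
struct gsts_bit {
    unsigned int bit;
    const char *name;
};

static const struct gsts_bit gsts_bits[] = {
    { 31, "TES"   },  /* translation enable status */
    { 30, "RTPS"  },  /* root table pointer status */
    { 29, "FLS"   },  /* fault log status */
    { 28, "AFLS"  },  /* advanced fault logging status */
    { 27, "WBFS"  },  /* write buffer flush status */
    { 26, "QIES"  },  /* queued invalidation enable status */
    { 25, "IRES"  },  /* interrupt remapping enable status */
    { 24, "IRTPS" },  /* interrupt remap table pointer status */
    { 23, "CFIS"  },  /* compatibility format interrupt status */
};

/*
 * Write the names of all set, known bits into buf, separated by spaces.
 * Returns the number of names written. Stops early if buf is too small,
 * leaving buf terminated after the last complete name.
 */
static int gsts_decode(uint32_t gsts, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(gsts_bits) / sizeof(gsts_bits[0]); i++) {
        int n;

        if (!(gsts & (UINT32_C(1) << gsts_bits[i].bit)))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", gsts_bits[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }

    return count;
}
```

Calling gsts_decode(0xc7000000, buf, sizeof(buf)) with a 64-byte buffer
yields "TES RTPS QIES IRES IRTPS". If the assumed layout is right, the
hardware reports translation, queued invalidation and interrupt remapping
as enabled at the time of the dump, which fits the observation that the
status registers respond and only the wait for QINVAL_STAT_DONE fails to
complete.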