From: Ben Guthro <ben@guthro.net>
To: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	xiantao.zhang@intel.com, xen-devel <xen-devel@lists.xen.org>
Subject: Re: S3 crash with VTD Queue Invalidation enabled
Date: Fri, 14 Jun 2013 14:27:02 -0400
Message-ID: <CAOvdn6XKvC9phnB-8n-qqYpv47OMysr_vv9E+eERxeKU+RKqvw@mail.gmail.com>
In-Reply-To: <CAOvdn6XaLiOcTd0KAWVHAz+vCm3866oNFWRPgbw5yUbiC7zOrQ@mail.gmail.com>

On Fri, Jun 14, 2013 at 1:01 PM, Ben Guthro <ben@guthro.net> wrote:
> On Fri, Jun 14, 2013 at 4:38 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 06.06.13 at 01:53, Ben Guthro <ben@guthro.net> wrote:
>>>> Early in the boot process, I see queue_invalidate_wait() called for
>>>> DRHD unit 0, and 1
>>>> (unit 0 is wired up to the IGD, unit 1 is everything else)
>>>>
>>>> Up until i915 does the following, I see that unit being flushed with
>>>> queue_invalidate_wait() :
>>>>
>>>> [    0.704537] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
>>>> [    0.704537] ENERGY_PERF_BIAS: View and update with x86_energy_p
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> [    1.983028] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to
>>>> bit banging on pin 5
>>>> [    2.253551] fbcon: inteldrmfb (fb0) is primary device
>>>> [    3.111838] Console: switching to colour frame buffer device 170x48
>>>> [    3.171631] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
>>>> [    3.171634] i915 0000:00:02.0: registered panic notifier
>>>> [    3.173339] acpi device:00: registered as cooling_device1
>>>> [    3.173401] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
>>>> [    3.173962] input: Video Bus as
>>>> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input4
>>>> [    3.174232] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
>>>> [    3.174258] ahci 0000:00:1f.2: version 3.0
>>>> [    3.174270] xen: registering gsi 19 triggering 0 polarity 1
>>>> [    3.174274] Already setup the GSI :19
>>>>
>>>>
>>>> After that - the unit never seems to be flushed.
>>
>> With queue_invalidate_wait() having a single caller -
>> invalidate_sync() -, and with invalidate_sync() being called from
>> all interrupt setup (IO-APIC as well as MSI), that's quite odd to be
>> the case. At least upon network driver load or interface-up, this
>> should be getting called.
>>
>>>> ...until we enter into the S3 hypercall, which loops over all DRHD
>>>> units, and explicitly flushes all of them via iommu_flush_all()
>>>>
>>>> It is at that point that it hangs up when talking to the device that
>>>> the IGD is plumbed up to.
>>>>
>>>>
>>>> Does this point to something in the i915 driver doing something that
>>>> is incompatible with Xen?
>>>
>>> I actually separated it from the S3 hypercall, adding a new debug key
>>> 'F' - to just call iommu_flush_all()
>>> I can crash it on demand with this.
>>>
>>> Booting with "i915.modeset=0 single" (to prevent both KMS, and Xorg) -
>>> it does not occur.
>>> So, that pretty much narrows it down to the IGD, in my mind.
>>
>> Which reminds me of a change I did several weeks back to our kernel,
>> but which isn't as easily done with pv-ops: There are a number of
>> cases in the AGP and DRM code that qualify upon CONFIG_INTEL_IOMMU
>> and use intel_iommu_gfx_mapped. As you certainly know, Linux when
>> running on Xen doesn't see any IOMMU, and hence the config option
>> being enabled or disabled is completely unrelated to whether the
>> driver actually runs on top of an enabled IOMMU. Similarly the setting
>> of intel_iommu_gfx_mapped cannot possibly happen when running on
>> top of Xen, as it sits in code that never gets used in this case.
>>
>> A possibly simple, but rather hacky solution might be to always set
>> that variable when running on Xen. But that wouldn't cover the case
>> of a kernel being built without CONFIG_INTEL_IOMMU, yet in that
>> case the driver might still run with an IOMMU enabled underneath.
>> (In our case I can simply always #define intel_iommu_gfx_mapped
>> to 1, with the INTEL_IOMMU option getting forcibly disabled for the
>> Xen kernel flavors anyway. Whether that's entirely correct when
>> not running on an enabled IOMMU I can't tell yet, and don't know
>> whom to ask.)
>>
>> And that wouldn't cover the IGD getting passed through to a DomU
>> at all - obviously Xen's ability to properly drive all IOMMU operations
>> (including qinval) must not depend on the owning guest's driver code.
>>
>> I have to admit though that it entirely escapes me why a graphics
>> driver needs to peek into IOMMU code/state in the first place. This
>> very much smells of bad design.
>
>
> This all makes sense, and I agree with your assessment.
>
> Unfortunately, I went and got the machine back from our QA department,
> to do some tests on this, and now I am unable to reproduce the issue,
> to prove your analysis is correct.
> It was 100% reproducible a week ago, and now I can't make it happen,
> using the same code base & build.
>
> It is all very strange, and smells of a race condition, or
> uninitialized variable.
> I blame Alpha particles.

I did a little more bisecting of our builds, and it turns out I was not
actually testing the version I thought I was. The bisection shows the
issue was inadvertently fixed by a change someone else committed to
solve an unrelated problem.

That change consists of the following:

Revert:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c79c49826270b8b0061b2fca840fc3f013c8a78a

Apply:
https://lkml.org/lkml/2012/2/10/229

I don't have a good explanation as to why re-enabling PAT would change
the behavior of this IOMMU feature, but I have a very reproducible
test case showing that it in fact does.

Konrad - do you have any theories that would explain this one?
Or, would we like to leave this one as "Here be Dragons" and look the other way?
