Linux-PCI Archive on lore.kernel.org
From: Alex Williamson <alex.williamson@redhat.com>
To: Andreas Hartmann <andihartmann@freenet.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	linux-pci <linux-pci@vger.kernel.org>
Subject: Re: Hard and silent lock up since linux 3.14 with PCIe pass through (vfio)
Date: Thu, 30 Oct 2014 13:45:49 -0600
Message-ID: <1414698349.27420.302.camel@ul30vt.home>
In-Reply-To: <20141030200922.15126d7a@dualc.maya.org>

On Thu, 2014-10-30 at 20:09 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Thu, 2014-10-30 at 17:35 +0100, Andreas Hartmann wrote:
> >> Alex Williamson wrote:
> >>> On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
> >> [...]
> >>>> Therefore, I never should need pci_save_vc_state and
> >>>> pci_restore_vc_state. Thus, it should be ok to add "return" at the
> >>>> beginning of each of these function, true? Then it should work.
> >>>>
> >>>> I tested it. It worked.
> >>>>
> >>>> But if I'm removing only one of these returns either in
> >>>> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
> >>>>
> >>>> Therefore, there must be something odd going on in the for loops. Isn't
> >>>> it possible to add some useful debug code to these loops to see what's
> >>>> really going on? But the output *must* go to the actual console,
> >>>> otherwise I can't see it!
> >>>>
> >>>>
> >>>> int pci_save_vc_state(struct pci_dev *dev)
> >>>> {
> >>>>         return 0; // must be set
> >>>>         int i;
> >>>>
> >>>>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >> 		   // continue; -> works
> >>>>                 int pos, ret;
> >>>>                 struct pci_cap_saved_state *save_state;
> >> 		   // continue does not work!
> >>
> >> --> Most probably the
> >>
> >>             struct pci_cap_saved_state *save_state;
> >>
> >>     makes the system hang!
> > 
> > We've done nothing more than declare variables there, there's no actual
> > code.  What happens if you increase the delay after bus reset, edit
> > drivers/pci/pci.c, find the call to ssleep(1) and change the 1 to a 2,
> > doubling the delay after reset.
> 
> Same behaviour.
> 
> >  It seems like VC save/restore is just a
> > scapegoat for the platform already being broken by the bus reset.  Also,
> > if you have any other card to test in this slot, it would be useful
> > comparison data to know if we're dealing with an endpoint issue or a bus
> > issue.
> 
> I organized an Intel pcie card:
> 
> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
>         Subsystem: Intel Corporation Gigabit CT Desktop Adapter
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 17
>         Region 0: Memory at fdbc0000 (32-bit, non-prefetchable) [disabled] [size=128K]
>         Region 1: Memory at fdb00000 (32-bit, non-prefetchable) [disabled] [size=512K]
>         Region 2: I/O ports at cf00 [disabled] [size=32]
>         Region 3: Memory at fdbfc000 (32-bit, non-prefetchable) [disabled] [size=16K]
>         [virtual] Expansion ROM at fdb80000 [disabled] [size=256K]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable+ DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>                 LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>         Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Capabilities: [100 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>         Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-cf-8f-57
>         Kernel driver in use: vfio-pci
> 
> 
> and tested with the same kernel, which hangs w/ atheros card. It just
> worked. Not just once, but each of the tests I did. I retested w/
> atheros -> hang. Tested again with intel-card -> works. Back to
> atheros -> hang.

Thanks for the test.

> Seems to be really a problem w/ the atheros card, which is triggered by
> new vc save/restore.

It seems more like the bus reset, not the VC save/restore.  As far as
interacting with hardware is concerned, there's no difference between
the two cases where you found one continue works and the other doesn't.

> Well, but what to do now? I know how to "fix" it. But this means I have
> to compile my kernels again on my own if it is >= 3.14.

Let's not give up hope just yet.  I'd like to try another bus reset
mechanism with setpci.  Install the Atheros card and bind it to
pci-stub, then do:

setpci -s 00:05.0 68.w=0010:0010
sleep 0.1
setpci -s 00:05.0 68.w=0000:0010
sleep 1
lspci -xxx -s 3:00.0

This uses the link disable control rather than the secondary bus reset.
Typically the results between the two are the same, but maybe we'll get
lucky.  The BIOS manages to reset the bus with this device installed
somehow, so there must be a mechanism to do it.  Thanks,

Alex
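
For readers following the setpci commands above: setpci's "value:mask"
form changes only the bits set in the mask and leaves the rest of the
register as it was.  The sketch below models that on plain integers; it
does not touch any hardware.  It assumes, from the description in the
message, that offset 0x68 on the 00:05.0 port is its PCIe Link Control
register, where bit 4 (0x0010) is Link Disable.  apply_mask is a
hypothetical helper name, and the starting value 0x0040 is purely
illustrative.

```shell
#!/bin/sh
# Model of setpci's "value:mask" write: only bits set in the mask change,
# every other bit keeps its current value.
# apply_mask is a hypothetical helper, not part of pciutils.
apply_mask() {
    old=$1
    value=$2
    mask=$3
    # Keep the unmasked bits of old, take the masked bits from value,
    # and print the result as a 16-bit word (matching the ".w" width).
    printf '0x%04x\n' $(( ($old & ~$mask & 0xffff) | ($value & $mask) ))
}

# Assumption: the register currently reads 0x0040 (illustrative only).
apply_mask 0x0040 0x0010 0x0010   # set bit 4   -> prints 0x0050
apply_mask 0x0050 0x0000 0x0010   # clear bit 4 -> prints 0x0040
```

Writing 0010:0010 sets only the Link Disable bit, and 0000:0010 clears
only that bit, so the other Link Control settings on the port are left
untouched across the disable/enable cycle.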


