All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
@ 2018-01-12 10:49 Lukas Wunner
  2018-01-12 11:03 ` Lukas Wunner
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Lukas Wunner @ 2018-01-12 10:49 UTC (permalink / raw)
  To: Sinan Kaya, Bjorn Helgaas
  Cc: Mika Westerberg, Yehezkel Bernat, Michael Jamet, linux-pci

Hi Sinan,

I updated to 4.15 this week and noticed a regression (if one may call it
that, it's more like a very noticeable annoyance) affecting machines
with Thunderbolt controllers wherein a reboot or shutdown is delayed by
8 seconds, caused by

    commit cc27b735ad3a75574a6ab1a66ed6b09385e77e5e
    Author: Sinan Kaya <okaya@codeaurora.org>
    Date:   Wed Oct 25 15:01:02 2017 -0400

    PCI/portdrv: Turn off PCIe services during shutdown

    Some of the PCIe services such as AER are being left enabled during
    shutdown. This might cause spurious AER errors while SOC is being
    powered down.

    Clean up the PCIe services gracefully during shutdown to clear these
    false positives.

The reason is that Thunderbolt controllers claim to support Command
Complete interrupts (value of 0b in the No Command Completed Support
field of the Slot Capabilities register), but in reality never bother
to send the interrupt.  Your commit causes pcie_shutdown_notification()
to be called on ->shutdown, which calls pcie_write_cmd(), which calls
pcie_wait_cmd() twice, each invocation causing a delay of 1 second.
Because Thunderbolt controllers typically have 4 hotplug bridges,
this adds up to 8 seconds.

The 1 second delay comes from PCIe base spec r2.1 sec. 6.7.3.2:

    If command completed events are supported, then software must wait
    for a command to complete before issuing the next command. However,
    if the status field is not set after the 1 second limit on command
    execution, software is permitted to repeat the command or to issue
    the next command.

I'm not sure which Thunderbolt controllers fake the ability to send
CC interrupts but my old Light Ridge is definitely affected and I recall
seeing logs of newer Cactus Ridge and Falcon Ridge (Thunderbolt 2)
controllers which also had this issue.  I'm cc'ing Intel folks in the
hope that they may know the details.  I believe Mika has a MacBookPro13,3
with Alpine Ridge (Thunderbolt 3) and could verify if that also has
this issue. These MacBook Pros have two Thunderbolt controllers,
so if affected, the commit would delay reboot by 16 seconds.

On the one hand, your commit message sounds as though the change merely
addresses a *potential* issue, so one could argue that if it causes
*real* issues such as delayed reboots, it's probably not a good idea.
On the other hand I do see your point since a user might surprise-remove
a device from a hotplug slot during shutdown, and we clearly wouldn't want
to act on that.

The Thunderbolt hotplug slots aren't "real" slots that you insert cards
into, but rather "virtual" hotplug slots implemented in silicon.  Devices
appear beyond those slots once a PCI tunnel is established on top of
the Thunderbolt switching infrastructure.  So I'm wondering if we have
to wait for command completion at all in this case.  We could add a bit
to struct pci_dev_flags indicating that command completion need not be
awaited, and bail out of pcie_wait_cmd() if the bit is set.  The bit
could be set in set_pcie_hotplug_bridge() or in a PCI quirk.

However I'm not sure if it would be safe.  Maybe one of the Intel devs
can comment on whether any delay must be observed after issuing a command
to the Slot Control Register on Thunderbolt controllers?  Surely 1 second
seems excessive.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 10:49 Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown Lukas Wunner
@ 2018-01-12 11:03 ` Lukas Wunner
  2018-01-12 14:26 ` Sinan Kaya
  2018-01-12 16:34 ` Mika Westerberg
  2 siblings, 0 replies; 11+ messages in thread
From: Lukas Wunner @ 2018-01-12 11:03 UTC (permalink / raw)
  To: Sinan Kaya, Bjorn Helgaas
  Cc: Mika Westerberg, Yehezkel Bernat, Michael Jamet, linux-pci

On Fri, Jan 12, 2018 at 11:49:29AM +0100, Lukas Wunner wrote:
> I'm not sure which Thunderbolt controllers fake the ability to send
> CC interrupts but my old Light Ridge is definitely affected and I recall
> seeing logs of newer Cactus Ridge and Falcon Ridge (Thunderbolt 2)
> controllers which also had this issue.  I'm cc'ing Intel folks in the
> hope that they may know the details.  I believe Mika has a MacBookPro13,3
> with Alpine Ridge (Thunderbolt 3) and could verify if that also has
> this issue. These MacBook Pros have two Thunderbolt controllers,
> so if affected, the commit would delay reboot by 16 seconds.

I managed to dig up lspci output for a MacBookPro13,3 and Alpine Ridge
has 1b in the No Command Completed Support field of the Slot Capabilities
register, so no delay is observed at all and these newer machines are not
affected.

Question is, do the older controllers likewise not need a delay?

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 10:49 Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown Lukas Wunner
  2018-01-12 11:03 ` Lukas Wunner
@ 2018-01-12 14:26 ` Sinan Kaya
  2018-01-12 15:12   ` Lukas Wunner
  2018-01-12 16:34 ` Mika Westerberg
  2 siblings, 1 reply; 11+ messages in thread
From: Sinan Kaya @ 2018-01-12 14:26 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas
  Cc: Mika Westerberg, Yehezkel Bernat, Michael Jamet, linux-pci

Hi Lukas,

On 1/12/2018 5:49 AM, Lukas Wunner wrote:
> Hi Sinan,
> 
> I updated to 4.15 this week and noticed a regression (if one may call it
> that, it's more like a very noticeable annoyance) affecting machines
> with Thunderbolt controllers wherein a reboot or shutdown is delayed by
> 8 seconds, caused by

Sorry for that.

I wonder if we can separate remove from shutdown and just disable the IRQs
in shutdown case rather than turning off the slot power etc.

Let me see if I can come up with a quick patch.

> On the one hand, your commit message sounds as though the change merely
> addresses a *potential* issue, so one could argue that if it causes
> *real* issues such as delayed reboots, it's probably not a good idea.
> On the other hand I do see your point since a user might surprise-remove
> a device from a hotplug slot during shutdown, and we clearly wouldn't want
> to act on that.

We are seeing the problem on our platform (QDF2400). It is just sporadic. 
That's why, the commit message was loosely worded.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 14:26 ` Sinan Kaya
@ 2018-01-12 15:12   ` Lukas Wunner
  2018-01-12 15:31     ` Sinan Kaya
  0 siblings, 1 reply; 11+ messages in thread
From: Lukas Wunner @ 2018-01-12 15:12 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On Fri, Jan 12, 2018 at 09:26:48AM -0500, Sinan Kaya wrote:
> On 1/12/2018 5:49 AM, Lukas Wunner wrote:
> I wonder if we can separate remove from shutdown and just disable the IRQs
> in shutdown case rather than turning off the slot power etc.

But don't we risk "IRQ xx: nobody cared" splats if we do that?

Not sure if just disabling the IRQ without telling the hotplug port
not to send interrupts is safe.

The issue that Command Complete events are falsely claimed to be
supported needs to be adressed regardless because whenever a
Thunderbolt device is unplugged, the same 2 second delay occurs.
So if you unplug a daisy chain of, say 6 devices, it takes 12
seconds until they're all removed from the system.

By now I've found out that Falcon Ridge (Thunderbolt 2) is not affected,
I'm not yet sure if Redwood Ridge is, but it seems the issue is contrained
to Thunderbolt 1 controllers only, probably an erratum.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 15:12   ` Lukas Wunner
@ 2018-01-12 15:31     ` Sinan Kaya
  2018-01-13  7:32       ` Lukas Wunner
  0 siblings, 1 reply; 11+ messages in thread
From: Sinan Kaya @ 2018-01-12 15:31 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On 1/12/2018 10:12 AM, Lukas Wunner wrote:
> On Fri, Jan 12, 2018 at 09:26:48AM -0500, Sinan Kaya wrote:
>> On 1/12/2018 5:49 AM, Lukas Wunner wrote:
>> I wonder if we can separate remove from shutdown and just disable the IRQs
>> in shutdown case rather than turning off the slot power etc.
> 
> But don't we risk "IRQ xx: nobody cared" splats if we do that?

I assumed code was turning off the slot power etc. aggressively.

After looking at the code some more time, it seems to be doing the
right thing and telling pcie controller not to generate interrupts for
hotplug.

I think this is what is failing for you probably because by the time you are
shutting down there is nobody to issue the command completion. This would
repeat for each hotplug capable pcie slot.

static void pcie_disable_notification(struct controller *ctrl)
{
	u16 mask;

	mask = (PCI_EXP_SLTCTL_PDCE | PCI_EXP_SLTCTL_ABPE |
		PCI_EXP_SLTCTL_MRLSCE | PCI_EXP_SLTCTL_PFDE |
		PCI_EXP_SLTCTL_HPIE | PCI_EXP_SLTCTL_CCIE |
		PCI_EXP_SLTCTL_DLLSCE);
	pcie_write_cmd(ctrl, 0, mask);
	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, 0);
} 

Can you confirm this?

> 
> Not sure if just disabling the IRQ without telling the hotplug port
> not to send interrupts is safe.
> 
> The issue that Command Complete events are falsely claimed to be
> supported needs to be adressed regardless because whenever a
> Thunderbolt device is unplugged, the same 2 second delay occurs.
> So if you unplug a daisy chain of, say 6 devices, it takes 12
> seconds until they're all removed from the system.
> 
> By now I've found out that Falcon Ridge (Thunderbolt 2) is not affected,
> I'm not yet sure if Redwood Ridge is, but it seems the issue is contrained
> to Thunderbolt 1 controllers only, probably an erratum
> 
> Thanks,
> 
> Lukas
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 10:49 Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown Lukas Wunner
  2018-01-12 11:03 ` Lukas Wunner
  2018-01-12 14:26 ` Sinan Kaya
@ 2018-01-12 16:34 ` Mika Westerberg
  2018-01-13  7:14   ` Lukas Wunner
  2 siblings, 1 reply; 11+ messages in thread
From: Mika Westerberg @ 2018-01-12 16:34 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Sinan Kaya, Bjorn Helgaas, Yehezkel Bernat, Michael Jamet, linux-pci

On Fri, Jan 12, 2018 at 11:49:29AM +0100, Lukas Wunner wrote:
> I believe Mika has a MacBookPro13,3 with Alpine Ridge (Thunderbolt 3)
> and could verify if that also has this issue. These MacBook Pros have
> two Thunderbolt controllers, so if affected, the commit would delay
> reboot by 16 seconds.

Checked the MacBook with 2 x AR and it does have NoCompl+ for the
hotplug downstream ports. Also tried reboot the system and it does not
take 16s, more like 1-2s.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 16:34 ` Mika Westerberg
@ 2018-01-13  7:14   ` Lukas Wunner
  0 siblings, 0 replies; 11+ messages in thread
From: Lukas Wunner @ 2018-01-13  7:14 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Sinan Kaya, Bjorn Helgaas, Yehezkel Bernat, Michael Jamet, linux-pci

On Fri, Jan 12, 2018 at 06:34:20PM +0200, Mika Westerberg wrote:
> On Fri, Jan 12, 2018 at 11:49:29AM +0100, Lukas Wunner wrote:
> > I believe Mika has a MacBookPro13,3 with Alpine Ridge (Thunderbolt 3)
> > and could verify if that also has this issue. These MacBook Pros have
> > two Thunderbolt controllers, so if affected, the commit would delay
> > reboot by 16 seconds.
> 
> Checked the MacBook with 2 x AR and it does have NoCompl+ for the
> hotplug downstream ports. Also tried reboot the system and it does not
> take 16s, more like 1-2s.

Yes, as said Alpine Ridge isn't affected, only older Thunderbolt 1 chips.
You mentioned last year that you also have a mac mini with Light Ridge
and a MacBook with Cactus Ridge:
http://lkml.iu.edu/hypermail/linux/kernel/1705.3/00443.html

Install a 4.15 kernel on one of those and reboot to see the 8 second
delay.

I've just sent out a patch which fixes the issue for me, an ack would
be appreciated, barring any objections of course.  Do you have access
to specs for older Thunderbolt 1 chips and if so, could you verify if
they require a delay to be observed after writing to the Slot Control
register?  Is there an errata sheet confirming that the advertised
NoCompl- is bogus?

There are certain Xeons from the same era as the affected Thunderbolt 1
chips which advertise NoCompl- but erroneously only acknowledge Command
Completed for a few specific commands:
- Power Controller Control
- Power Indicator Control
- Attention Indicator Control
- Electromechanical Interlock Control

See erratum CF118 in:
https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html

I figured that maybe Intel recycled the faulty VHDL files of these Xeon
hotplug ports in Thunderbolt 1 controllers, so I set all of the above
bits on every access to the Slot Control register.  Alas it didn't help,
the behavior stayed the same, no Command Completed event and no Command
Completed indication in the Slot Status register.  (Perhaps unsurprising
since Light Ridge doesn't advertise support for any of the above four
features.)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-12 15:31     ` Sinan Kaya
@ 2018-01-13  7:32       ` Lukas Wunner
  2018-01-13 17:58         ` okaya
  0 siblings, 1 reply; 11+ messages in thread
From: Lukas Wunner @ 2018-01-13  7:32 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On Fri, Jan 12, 2018 at 10:31:12AM -0500, Sinan Kaya wrote:
> On 1/12/2018 10:12 AM, Lukas Wunner wrote:
> > On Fri, Jan 12, 2018 at 09:26:48AM -0500, Sinan Kaya wrote:
> >> On 1/12/2018 5:49 AM, Lukas Wunner wrote:
> >> I wonder if we can separate remove from shutdown and just disable the IRQs
> >> in shutdown case rather than turning off the slot power etc.
> > 
> > But don't we risk "IRQ xx: nobody cared" splats if we do that?
> 
> I assumed code was turning off the slot power etc. aggressively.
> 
> After looking at the code some more time, it seems to be doing the
> right thing and telling pcie controller not to generate interrupts for
> hotplug.
> 
> I think this is what is failing for you probably because by the time you are
> shutting down there is nobody to issue the command completion. This would
> repeat for each hotplug capable pcie slot.

You mean we disable Command Completed interrupts and thus the port
can't notify that the command was completed?  It seems the code
accommodates to that by polling the PCI_EXP_SLTSTA_CC bit:

	if (ctrl->slot_ctrl & PCI_EXP_SLTCTL_HPIE &&
	    ctrl->slot_ctrl & PCI_EXP_SLTCTL_CCIE)
		rc = wait_event_timeout(ctrl->queue, !ctrl->cmd_busy, timeout);
	else
		rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));

The problem is that these Thunderbolt controllers never seem to set
the PCI_EXP_SLTSTA_CC bit, resulting in "Timeout on hotplug command"
messages.

The pciehp code is okay, we just need a workaround for the broken
Thunderbolt 1 chips.  This has been a pain point all along, but
your patch made the brokenness visible enough that investigating
and fixing it became unavoidable.  So don't worry about your patch,
it's all fine. ;-)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-13  7:32       ` Lukas Wunner
@ 2018-01-13 17:58         ` okaya
  2018-01-13 19:39           ` Lukas Wunner
  0 siblings, 1 reply; 11+ messages in thread
From: okaya @ 2018-01-13 17:58 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On 2018-01-13 02:32, Lukas Wunner wrote:
> On Fri, Jan 12, 2018 at 10:31:12AM -0500, Sinan Kaya wrote:
>> On 1/12/2018 10:12 AM, Lukas Wunner wrote:
>> > On Fri, Jan 12, 2018 at 09:26:48AM -0500, Sinan Kaya wrote:
>> >> On 1/12/2018 5:49 AM, Lukas Wunner wrote:
>> >> I wonder if we can separate remove from shutdown and just disable the IRQs
>> >> in shutdown case rather than turning off the slot power etc.
>> >
>> > But don't we risk "IRQ xx: nobody cared" splats if we do that?
>> 
>> I assumed code was turning off the slot power etc. aggressively.
>> 
>> After looking at the code some more time, it seems to be doing the
>> right thing and telling pcie controller not to generate interrupts for
>> hotplug.
>> 
>> I think this is what is failing for you probably because by the time 
>> you are
>> shutting down there is nobody to issue the command completion. This 
>> would
>> repeat for each hotplug capable pcie slot.
> 
> You mean we disable Command Completed interrupts and thus the port
> can't notify that the command was completed?  It seems the code
> accommodates to that by polling the PCI_EXP_SLTSTA_CC bit:
> 
> 	if (ctrl->slot_ctrl & PCI_EXP_SLTCTL_HPIE &&
> 	    ctrl->slot_ctrl & PCI_EXP_SLTCTL_CCIE)
> 		rc = wait_event_timeout(ctrl->queue, !ctrl->cmd_busy, timeout);
> 	else
> 		rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
> 
> The problem is that these Thunderbolt controllers never seem to set
> the PCI_EXP_SLTSTA_CC bit, resulting in "Timeout on hotplug command"
> messages.


I waa thinking of using nowait variant of write function in notification 
disable function in order to not introduce new behavior for existing 
silicon.


> 
> The pciehp code is okay, we just need a workaround for the broken
> Thunderbolt 1 chips.  This has been a pain point all along, but
> your patch made the brokenness visible enough that investigating
> and fixing it became unavoidable.  So don't worry about your patch,
> it's all fine. ;-)
> 
> Thanks,
> 
> Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-13 17:58         ` okaya
@ 2018-01-13 19:39           ` Lukas Wunner
  2018-01-13 20:49             ` okaya
  0 siblings, 1 reply; 11+ messages in thread
From: Lukas Wunner @ 2018-01-13 19:39 UTC (permalink / raw)
  To: okaya
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On Sat, Jan 13, 2018 at 12:58:53PM -0500, okaya@codeaurora.org wrote:
> I waa thinking of using nowait variant of write function in notification
> disable function in order to not introduce new behavior for existing
> silicon.

After writing to the Slot Control register to turn off IRQ notification,
we free the IRQ with pciehp_free_irq().  If we don't wait for the command
to complete, we risk getting an interrupt after the IRQ was freed.

Kind regards,

Lukas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown
  2018-01-13 19:39           ` Lukas Wunner
@ 2018-01-13 20:49             ` okaya
  0 siblings, 0 replies; 11+ messages in thread
From: okaya @ 2018-01-13 20:49 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Mika Westerberg, Yehezkel Bernat, Michael Jamet,
	linux-pci

On 2018-01-13 14:39, Lukas Wunner wrote:
> On Sat, Jan 13, 2018 at 12:58:53PM -0500, okaya@codeaurora.org wrote:
>> I waa thinking of using nowait variant of write function in 
>> notification
>> disable function in order to not introduce new behavior for existing
>> silicon.
> 
> After writing to the Slot Control register to turn off IRQ 
> notification,
> we free the IRQ with pciehp_free_irq().  If we don't wait for the 
> command
> to complete, we risk getting an interrupt after the IRQ was freed.
> 

Yeah, that is true.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-01-13 20:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-12 10:49 Regression (sort of): PCI/portdrv: Turn off PCIe services during shutdown Lukas Wunner
2018-01-12 11:03 ` Lukas Wunner
2018-01-12 14:26 ` Sinan Kaya
2018-01-12 15:12   ` Lukas Wunner
2018-01-12 15:31     ` Sinan Kaya
2018-01-13  7:32       ` Lukas Wunner
2018-01-13 17:58         ` okaya
2018-01-13 19:39           ` Lukas Wunner
2018-01-13 20:49             ` okaya
2018-01-12 16:34 ` Mika Westerberg
2018-01-13  7:14   ` Lukas Wunner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.