From: Ethan Zhao <haifeng.zhao@intel.com>
To: bhelgaas@google.com, oohall@gmail.com, ruscur@russell.cc,
lukas@wunner.de, andriy.shevchenko@linux.intel.com,
stuart.w.hayes@gmail.com, mr.nuke.me@gmail.com,
mika.westerberg@linux.intel.com
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
ashok.raj@linux.intel.com, sathyanarayanan.kuppuswamy@intel.com,
xerces.zhao@gmail.com, Ethan Zhao <haifeng.zhao@intel.com>
Subject: [PATCH v8 3/6] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
Date: Wed, 7 Oct 2020 07:31:55 -0400
Message-ID: <20201007113158.48933-4-haifeng.zhao@intel.com>
In-Reply-To: <20201007113158.48933-1-haifeng.zhao@intel.com>
When a root port has the DPC capability and it is enabled, an error that
triggers DPC causes the DPC, DLLSC, PDC and other interrupts to be
delivered to the DPC driver and the pciehp driver at almost the same time.
This results in the following interleaved and conflicting error handling,
recovery, removal and plug-in procedures:
1. While the port and device are undergoing an error-recovery reset
initiated by the DPC hardware, the pciehp driver treats them as if the
device were being hot-removed or hot-plugged at the same time.
2. While the DPC handler is calling the device driver's err_handler
callbacks (error_detected(), resume(), etc.), the slot may be powered
off by
pciehp
-> remove_board()
-> pciehp_power_off_slot()
3. While the DPC handler -> pcie_do_recovery() is taking different
actions to detect and recover from the error based on
device->error_state, the pciehp driver could change error_state on the
fly via:
pciehp_unconfigure_device()
-> pci_walk_bus()
-> pci_dev_set_disconnected()
4. While the DPC handler is calling the device driver's err_handler
callbacks to detect and recover from the error, the pciehp driver could
be unbinding the device and releasing its driver.
...
Whether or not a hotplug operation is in progress when NON-FATAL/FATAL
errors happen, the result is nondeterministic.
So we need some kind of synchronization between pciehp DLLSC/PDC
handling and DPC driver error recovery. We need a deterministic outcome
of DPC error containment (the link is recovered or not, the device is
still present or removed) before the pciehp hot-remove and hot-plug
procedures run, rather than mixing the two together.
In our tests on the ICS platform, DPC error containment and the software
handler take 10ms to 50ms to clear the DPC triggered status. That is
quick enough for pciehp, compared with the 1000ms it already waits after
powering off a slot so that it can ignore the resulting DLLSC/PDC events.
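The real wait is done by pci_wait_port_outdpc(), which is introduced in
patch 2/6 and not shown here. The sketch below does not reproduce that
function. It is a hypothetical user-space illustration of one plausible
approach: poll the DPC Trigger Status bit with a bounded timeout against
a mocked register. struct mock_port, mock_read_dpc_status() and
wait_port_out_of_dpc() are invented for this sketch, and the 10ms poll
step used below is an assumption, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* DPC Trigger Status bit of the DPC Status register, as named in
 * include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DPC_STATUS_TRIGGER 0x0001

/* Hypothetical mock of a root port: Trigger Status stays set until the
 * busy_polls-th status read, standing in for hardware containment plus
 * the DPC driver's recovery work. */
struct mock_port {
	uint16_t dpc_status;
	int busy_polls;
};

static uint16_t mock_read_dpc_status(struct mock_port *port)
{
	if (port->busy_polls > 0 && --port->busy_polls == 0)
		port->dpc_status &= (uint16_t)~PCI_EXP_DPC_STATUS_TRIGGER;
	return port->dpc_status;
}

/*
 * Hypothetical helper: poll until the port is out of DPC. Returns the
 * simulated time waited in ms, or -1 if Trigger Status is still set
 * after timeout_ms. A kernel version would sleep (e.g. msleep()) where
 * this mock only advances a counter.
 */
static int wait_port_out_of_dpc(struct mock_port *port, int step_ms,
				int timeout_ms)
{
	int waited = 0;

	assert(step_ms > 0 && timeout_ms >= 0);
	while (mock_read_dpc_status(port) & PCI_EXP_DPC_STATUS_TRIGGER) {
		if (waited >= timeout_ms)
			return -1;
		waited += step_ms;
	}
	return waited;
}
```

With a 10ms step, a containment that finishes within 10ms to 50ms costs
only a few polls, which stays well below pciehp's 1000ms post-power-off
wait. The timeout keeps a port that never leaves DPC from blocking the
hotplug thread indefinitely.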
With this patch, DPC containment and hotplug handling are partly ordered
and serialized: the DPC hardware performs the controller reset and other
recovery actions first, then the DPC driver invokes the device drivers'
error-handling callbacks and clears the DPC status, and finally pciehp
handles the DLLSC, PDC and other events.
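The ordering change in pciehp_ist() can be modeled in isolation. The
sketch below is a hypothetical user-space mock. Instead of calling
pciehp_handle_disable_request(), pci_wait_port_outdpc() and
pciehp_handle_presence_or_link_change(), it records which action would
run. The DISABLE_SLOT value is an assumption intended to mirror pciehp's
software-only event flag. The mock shows that only PDC/DLLSC events wait
for DPC, while a disable request keeps its priority and skips the wait.

```c
#include <assert.h>
#include <stdint.h>

/* Slot Status bits as named in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_SLTSTA_PDC   0x0008
#define PCI_EXP_SLTSTA_DLLSC 0x0100
/* Assumed to mirror pciehp's software-only DISABLE_SLOT event flag. */
#define DISABLE_SLOT (1u << 16)

enum action { ACT_DISABLE, ACT_WAIT_DPC, ACT_PRESENCE_OR_LINK };

/* Ordered log of the actions the mock dispatcher would perform. */
struct trace {
	enum action log[4];
	int n;
};

static void record(struct trace *t, enum action a)
{
	assert(t->n < 4);
	t->log[t->n++] = a;
}

/* Mock of the event dispatch in pciehp_ist() after this patch. */
static void dispatch(struct trace *t, uint32_t events)
{
	if (events & DISABLE_SLOT) {
		record(t, ACT_DISABLE);
	} else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
		record(t, ACT_WAIT_DPC);         /* let DPC recovery finish */
		record(t, ACT_PRESENCE_OR_LINK); /* then handle PDC/DLLSC */
	}
}
```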
After tens of brute-force hot-remove and hot-plug cycles of PCIe Gen4
NVMe SSDs, with arbitrary time intervals between the two actions, and
additional stress from DPC error injection, the system recovered from
NON-FATAL/FATAL errors to a normal working state as expected, and
hotplug worked without any random nondeterministic errors or
malfunctions.
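Brute-force DPC error injection script:
for i in {0..100}
do
	# Root port 64:02.0: write 0x000a to DPC Control (assumed at 0x196
	# on this platform): Trigger Enable = ERR_NONFATAL or ERR_FATAL,
	# plus DPC Interrupt Enable.
	setpci -s 64:02.0 0x196.w=000a
	# NVMe device 65:00.0: write 0x0544 to the Command register, which
	# leaves Memory Space Enable cleared.
	setpci -s 65:00.0 0x04.w=0544
	# Access the device so that the resulting errors trigger DPC.
	mount /dev/nvme0n1p1 /root/nvme
	sleep 1
done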
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Tested-by: Wen Jin <wen.jin@intel.com>
Tested-by: Shanshan Zhang <ShanshanX.Zhang@intel.com>
---
Changes:
v2: revise doc according to Andy's suggestion.
v3: no change.
v4: no change.
v5: no change.
v6: moved to [3/5] from [2/5] and rewrote the description.
v7: no change.
v8: no change.
drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	down_read(&ctrl->reset_lock);
 	if (events & DISABLE_SLOT)
 		pciehp_handle_disable_request(ctrl);
-	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+		pci_wait_port_outdpc(pdev);
 		pciehp_handle_presence_or_link_change(ctrl, events);
+	}
 	up_read(&ctrl->reset_lock);
 
 	ret = IRQ_HANDLED;
--
2.18.4
Thread overview: 12+ messages
2020-10-07 11:31 [PATCH v8 0/6] Fix DPC hotplug race and enhance error handling Ethan Zhao
2020-10-07 11:31 ` [PATCH v8 1/6] PCI/ERR: get device before call device driver to avoid NULL pointer dereference Ethan Zhao
2020-10-07 17:24 ` Kuppuswamy, Sathyanarayanan
2020-10-08 5:38 ` Ethan Zhao
2020-10-07 11:31 ` [PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling Ethan Zhao
2020-10-07 17:28 ` Kuppuswamy, Sathyanarayanan
2020-10-08 5:49 ` Ethan Zhao
2020-10-09 3:16 ` Ethan Zhao
2020-10-07 11:31 ` Ethan Zhao [this message]
2020-10-07 11:31 ` [PATCH v8 4/6] PCI/ERR: simplify function pci_dev_set_io_state() with if Ethan Zhao
2020-10-07 11:31 ` [PATCH v8 5/6] PCI/ERR: only return true when dev io state is really changed Ethan Zhao
2020-10-07 11:31 ` [PATCH v8 6/6] PCI/ERR: don't mix io state not changed and no driver together Ethan Zhao