From: Ethan Zhao <haifeng.zhao@linux.intel.com>
To: baolu.lu@linux.intel.com, bhelgaas@google.com,
	robin.murphy@arm.com, jgg@ziepe.ca
Cc: kevin.tian@intel.com, dwmw2@infradead.org, will@kernel.org,
	lukas@wunner.de, yi.l.liu@intel.com, dan.carpenter@linaro.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	Ethan Zhao <haifeng.zhao@linux.intel.com>
Subject: [PATCH v14 0/3] fix vt-d hard lockup when hotplug ATS capable device
Date: Fri,  1 Mar 2024 03:07:24 -0500
Message-ID: <20240301080727.3529832-1-haifeng.zhao@linux.intel.com>

This patchset fixes a VT-d hard lockup reported when an ATS-capable
endpoint device, connected to the system via a PCIe switch as in the
following topology, is surprise-unplugged.

     +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
     |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
     |           +-00.2  Intel Corporation Ice Lake RAS
     |           +-00.4  Intel Corporation Device 0b23
     |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
                                           NVIDIA Corporation Device 2324
     |                                           +-01.0-[19]----00.0
                          Mellanox Technologies MT2910 Family [ConnectX-7]

A user brought the link of endpoint device 19:00.0 down by flapping the
Link Control register of its hotplug-capable slot 17:01.0. In response
to the resulting DLLSC event, pciehp_ist() unloads the device driver and
powers the device off. While the driver is unloading, an IOMMU
device-TLB invalidation request (the Intel VT-d spec term; 'ATS
Invalidation' in the PCIe spec) is issued to the device whose link is
already down. The long wait for completion or timeout in interrupt
context then causes continuous hard lockup warnings and a system hang.
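
To illustrate the idea behind patch[2/3], here is a minimal userspace
sketch of skipping the invalidation request once the device is known to
be disconnected. It is a model under stated assumptions, not the kernel
code: struct model_dev, the MODEL_IO_* states and model_flush_dev_tlb()
are hypothetical names made up for this example; only the rule "do not
issue the request to a disconnected device" comes from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the PCI channel states (assumption). */
enum model_channel_state {
	MODEL_IO_NORMAL,
	MODEL_IO_FROZEN,
	MODEL_IO_PERM_FAILURE,	/* device is permanently unreachable */
};

/* Hypothetical stand-in for a PCI device; only the fields needed here. */
struct model_dev {
	enum model_channel_state error_state;
	bool ats_enabled;
	unsigned int invalidations_issued;	/* requests actually sent */
};

/* Models the intent of a "is this device disconnected?" helper. */
static bool model_dev_is_disconnected(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->error_state == MODEL_IO_PERM_FAILURE;
}

/*
 * Models a device-TLB flush. Returns true if a request was issued and
 * false if it was skipped. The ATS check comes first and the device
 * state check second, so a gone device never receives a request that
 * could only end in a long completion wait.
 */
static bool model_flush_dev_tlb(struct model_dev *dev)
{
	assert(dev != NULL);

	if (!dev->ats_enabled)
		return false;

	if (model_dev_is_disconnected(dev))
		return false;

	dev->invalidations_issued++;
	return true;
}
```

In this model a device in MODEL_IO_NORMAL state gets its request, while
the same device after a surprise unplug (MODEL_IO_PERM_FAILURE) is
skipped and its request counter stays unchanged.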

For further details, see the commit log of each patch.

Patches [1&2] were tested by yehaorong@bytedance.com on stable v6.7-rc4.
Patches [1-3] compile cleanly on stable v6.8-rc4 (Baolu's rbtree branch).

This patch set is based on Baolu's device rbtree patchset
https://lore.kernel.org/lkml/20240221153437.GB13491@ziepe.ca/t/

change log:
v14:
- made some adjustments to patch[3/3] per comments from Baolu, Dan and
  Bjorn.
- add fixes tag to patch[3/3] per Dan's suggestion.
- add ack tag from Bjorn to patch[1/3].
- add review tag from Dan.
v13:
- rebased on Baolu's rbtree patchset.
- removed refactor patches [3/5][4/5] in v12.
- amend commit description of patch[3/3].
- https://lore.kernel.org/lkml/2d1788da-521c-4531-a159-81d2fb801d6c@linux.intel.com/T/
v12:
- use base-commit tag to format patch.
- fix build issue on v6.8-rc2 reported by lkp@intel.com.
- https://lkml.org/lkml/2024/1/28/535
v11:
- update per latest comments and suggestions from Baolu and YiLiu.
- split refactoring patch into two patches, [3/5] to simplify parameters
  and [4/5] for pdev parameter passing.
- re-order patches.
- fold target device presence check into qi_check_fault().
- combine patch[2][5] in v10 into one patch[5].
- correct some commit descriptions.
- add fixes tag to patch[2/5].
- rebased on v6.8-rc1.
- https://lkml.org/lkml/2024/1/25/1314
v10:
- refactor qi_submit_sync() and its callers to get pci_dev instance, as
  Kevin pointed out that adding target_flush_dev to iommu is not right.
v9:
- unify all spellings of ATS Invalidation to adhere to the PCIe spec per
  Bjorn's suggestion.
v8:
- add a patch to break the loop for timed-out device-TLB invalidation, as
  Bjorn said it is possible that the device is just unresponsive rather
  than gone.
v7:
- reorder patches and revise commit log per Bjorn's guidance.
- other code and commit log revisions per Lukas' suggestion.
- rebased to stable v6.7-rc6.
v6:
- add two patches to break out of device-TLB invalidation if the device
  is gone.
v5:
- add a patch that tries to fix the rare case of a device being surprise
  removed during a safe removal. It does not work because surprise
  removal handling can't re-enter while another safe removal is in
  progress.
v4:
- move the PCI device state checking after ATS per Baolu's suggestion.
v3:
- fix commit description typo.
v2:
- revise commit[1] description part according to Lukas' suggestion.
- revise commit[2] description to clarify the issue's impact.
v1:
- https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@linux.intel.com/T/
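
The v8 and v11 entries above (break the retry loop for a timed-out
invalidation, and check target device presence while handling the
fault) can be sketched in userspace as follows. This is a rough model
under assumptions, not the code in patch[3/3]: struct model_target,
model_check_fault() and model_submit_sync() are hypothetical names, and
the choice of -EAGAIN for "retry" and -ETIMEDOUT for "give up" is an
assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an invalidation target (assumption). */
struct model_target {
	bool present;		/* device still reachable on the bus? */
	int timeouts_left;	/* timeouts before a request completes */
};

/*
 * Models the fault check after one invalidation attempt:
 *   0          - the request completed
 *   -EAGAIN    - timed out, device still present, so retry
 *   -ETIMEDOUT - timed out and device is absent, so give up
 * In this model an absent device always times out.
 */
static int model_check_fault(struct model_target *t)
{
	bool timed_out;

	assert(t != NULL);
	timed_out = !t->present || t->timeouts_left > 0;

	if (!timed_out)
		return 0;

	if (!t->present)
		return -ETIMEDOUT;

	t->timeouts_left--;
	return -EAGAIN;
}

/* Models the submit loop: retry only while the fault check says so. */
static int model_submit_sync(struct model_target *t, unsigned int *attempts)
{
	int rc;

	assert(t != NULL && attempts != NULL);

	do {
		(*attempts)++;
		rc = model_check_fault(t);
	} while (rc == -EAGAIN);

	return rc;
}
```

The point of the sketch is the distinction Bjorn raised: a device that is
present but slow is retried until it answers, while a device that is
absent ends the loop after a single attempt instead of spinning.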


Thanks,
Ethan


Ethan Zhao (3):
  PCI: make pci_dev_is_disconnected() helper public for other drivers
  iommu/vt-d: don't issue ATS Invalidation request when device is
    disconnected
  iommu/vt-d: improve ITE fault handling if target device isn't present

 drivers/iommu/intel/dmar.c  | 22 ++++++++++++++++++++++
 drivers/iommu/intel/pasid.c |  3 +++
 drivers/pci/pci.h           |  5 -----
 include/linux/pci.h         |  5 +++++
 4 files changed, 30 insertions(+), 5 deletions(-)


base-commit: e60bf5aa1a74c0652cd12d0cdc02f5c2b5fe5c74
-- 
2.31.1



Thread overview: 6+ messages
2024-03-01  8:07 Ethan Zhao [this message]
2024-03-01  8:07 ` [PATCH v14 1/3] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
2024-03-01  8:07 ` [PATCH v14 2/3] iommu/vt-d: don't issue ATS Invalidation request when device is disconnected Ethan Zhao
2024-03-01  8:07 ` [PATCH v14 3/3] iommu/vt-d: improve ITE fault handling if target device isn't present Ethan Zhao
2024-03-04 19:42   ` kernel test robot
2024-03-05 12:00 ` [PATCH v14 0/3] fix vt-d hard lockup when hotplug ATS capable device Baolu Lu
