From: "Williams, Dan J" <dan.j.williams@intel.com>
To: "kbusch@kernel.org" <kbusch@kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"helgaas@kernel.org" <helgaas@kernel.org>
Cc: "hinko.kocevar@ess.eu" <hinko.kocevar@ess.eu>,
	"Kelley, Sean V" <sean.v.kelley@intel.com>,
	"Kuppuswamy, Sathyanarayanan" <sathyanarayanan.kuppuswamy@intel.com>
Subject: Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
Date: Wed, 3 Mar 2021 05:34:25 +0000
Message-ID: <fe1defb66b5438f45093d67e05ef4153d0ae60eb.camel@intel.com>
In-Reply-To: <20210104230300.1277180-4-kbusch@kernel.org>
[ Add Sathya ]
On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> Overwriting the frozen detected status with the result of the link reset
> loses the NEED_RESET result that drivers are depending on for error
> handling to report the .slot_reset() callback. Retain this status so
> that subsequent error handling has the correct flow.
>
> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
Just want to report that this fix might be a candidate for -stable.
Recent DPC error recovery testing on v5.10 shows that losing this
status prevents NVMe from restarting the queue after error recovery,
with a failure signature like:
[ 424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message
[ 424.685185] nvme nvme0: frozen state error detected, reset controller
[ 425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message
[ 425.078674] pcieport 0000:97:02.0: AER: device recovery successful
...and then later:
[ 751.146277] sysrq: Show Blocked State
[ 751.146431] task:touch state:D stack: 0 pid:11969 ppid: 11413 flags:0x00000000
[ 751.146434] Call Trace:
[ 751.146443] __schedule+0x31b/0x890
[ 751.146445] schedule+0x3c/0xa0
[ 751.146449] blk_queue_enter+0x151/0x220
[ 751.146454] ? finish_wait+0x80/0x80
[ 751.146455] submit_bio_noacct+0x39a/0x410
[ 751.146457] submit_bio+0x42/0x150
[ 751.146460] ? bio_add_page+0x62/0x90
[ 751.146461] ? guard_bio_eod+0x25/0x70
[ 751.146465] submit_bh_wbc+0x16d/0x190
[ 751.146469] ext4_read_bh+0x47/0xa0
[ 751.146472] ext4_read_inode_bitmap+0x3cd/0x5f0
...because NVMe was only told to disable, never to reset and resume the
block queue.
I have not tested it, but by code inspection I eventually found this
upstream fix.
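
To make the failure mode concrete, here is a minimal user-space sketch of
the frozen-channel flow as I read it from the patch description. It is not
the kernel code: the enum, the trace struct, the driver stubs, and the
retain_status switch are all hypothetical names invented for illustration,
and the link reset is assumed to always succeed. The point it tries to show
is only that letting the link reset result overwrite the driver's
NEED_RESET vote skips the slot_reset step.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Records which (hypothetical) driver callbacks the model invoked. */
struct trace {
	bool error_detected;
	bool slot_reset;
	bool resume;
};

/* Hypothetical driver: on a frozen channel it asks for a reset. */
static enum ers_result drv_error_detected(struct trace *t)
{
	t->error_detected = true;
	return ERS_NEED_RESET;
}

/* Hypothetical driver: the reset callback is where it would restart. */
static enum ers_result drv_slot_reset(struct trace *t)
{
	t->slot_reset = true;
	return ERS_RECOVERED;
}

static void drv_resume(struct trace *t)
{
	t->resume = true;
}

/* Assumption: the link reset always succeeds in this model. */
static enum ers_result reset_link(void)
{
	return ERS_RECOVERED;
}

/*
 * Model of the frozen-channel path. With retain_status false, the link
 * reset result overwrites the driver's vote. With retain_status true,
 * the reset result is only checked for failure and the vote is kept.
 */
enum ers_result recover_frozen(bool retain_status, struct trace *t)
{
	enum ers_result status = drv_error_detected(t);

	if (retain_status) {
		if (reset_link() != ERS_RECOVERED)
			return ERS_DISCONNECT;
	} else {
		status = reset_link();
		if (status != ERS_RECOVERED)
			return ERS_DISCONNECT;
	}

	/* Only reached with NEED_RESET if the driver's vote survived. */
	if (status == ERS_NEED_RESET)
		status = drv_slot_reset(t);

	if (status != ERS_RECOVERED)
		return ERS_DISCONNECT;

	drv_resume(t);
	return ERS_RECOVERED;
}
```

In this model, recover_frozen(false, &t) reports success with t.slot_reset
still false, which lines up with the log above going straight from
error_detected to resume. recover_frozen(true, &t) calls the slot_reset
stub before resume.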