All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Williams, Dan J" <dan.j.williams@intel.com>
To: "kbusch@kernel.org" <kbusch@kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"helgaas@kernel.org" <helgaas@kernel.org>
Cc: "hinko.kocevar@ess.eu" <hinko.kocevar@ess.eu>,
	"Kelley, Sean V" <sean.v.kelley@intel.com>,
	"Kuppuswamy,
	Sathyanarayanan" <sathyanarayanan.kuppuswamy@intel.com>
Subject: Re: [PATCHv2 3/5] PCI/ERR: Retain status from error notification
Date: Wed, 3 Mar 2021 05:34:25 +0000	[thread overview]
Message-ID: <fe1defb66b5438f45093d67e05ef4153d0ae60eb.camel@intel.com> (raw)
In-Reply-To: <20210104230300.1277180-4-kbusch@kernel.org>

[ Add Sathya ]

On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
> Overwriting the frozen detected status with the result of the link reset
> loses the NEED_RESET result that drivers are depending on for error
> handling to report the .slot_reset() callback. Retain this status so
> that subsequent error handling has the correct flow.
> 
> Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
> Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Just want to report that this fix might be a candidate for -stable.
Recent DPC error recovery testing on v5.10 shows that losing this
status prevents NVME from restarting the queue after error recovery
with a failing signature like:

[  424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message
[  424.685185] nvme nvme0: frozen state error detected, reset controller
[  425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message
[  425.078674] pcieport 0000:97:02.0: AER: device recovery successful

...and then later:

[  751.146277] sysrq: Show Blocked State
[  751.146431] task:touch           state:D stack:    0 pid:11969 ppid: 11413 flags:0x00000000
[  751.146434] Call Trace:
[  751.146443]  __schedule+0x31b/0x890
[  751.146445]  schedule+0x3c/0xa0
[  751.146449]  blk_queue_enter+0x151/0x220
[  751.146454]  ? finish_wait+0x80/0x80
[  751.146455]  submit_bio_noacct+0x39a/0x410
[  751.146457]  submit_bio+0x42/0x150
[  751.146460]  ? bio_add_page+0x62/0x90
[  751.146461]  ? guard_bio_eod+0x25/0x70
[  751.146465]  submit_bh_wbc+0x16d/0x190
[  751.146469]  ext4_read_bh+0x47/0xa0
[  751.146472]  ext4_read_inode_bitmap+0x3cd/0x5f0

...because NVME was only told to disable, never to reset and resume the
block queue.

I have not tested it, but by code inspection I eventually found this
upstream fix.

  reply	other threads:[~2021-03-03 21:22 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-04 23:02 [PATCHv2 0/5] aer handling fixups Keith Busch
2021-01-04 23:02 ` [PATCHv2 1/5] PCI/ERR: Clear status of the reporting device Keith Busch
2021-01-04 23:02 ` [PATCHv2 2/5] PCI/AER: Actually get the root port Keith Busch
2021-01-04 23:02 ` [PATCHv2 3/5] PCI/ERR: Retain status from error notification Keith Busch
2021-03-03  5:34   ` Williams, Dan J [this message]
2021-03-03  5:46     ` Kuppuswamy, Sathyanarayanan
2021-03-04 20:01       ` Keith Busch
2021-03-04 22:11         ` Dan Williams
     [not found]           ` <23551edc-965c-21dc-0da8-a492c27c362d@intel.com>
2021-03-04 22:59             ` Dan Williams
2021-03-04 23:19               ` Kuppuswamy, Sathyanarayanan
2021-03-05  0:23                 ` Dan Williams
2021-03-05  0:54                   ` Keith Busch
2021-01-04 23:02 ` [PATCHv2 4/5] PCI/AER: Specify the type of port that was reset Keith Busch
2021-01-04 23:03 ` [PATCHv2 5/5] PCI/portdrv: Report reset for frozen channel Keith Busch
2021-01-05 14:21 ` [PATCHv2 0/5] aer handling fixups Hinko Kocevar
2021-01-05 15:06   ` Hinko Kocevar
2021-01-05 18:33     ` Keith Busch
2021-01-05 23:07       ` Kelley, Sean V
2021-01-07 21:42         ` Keith Busch
2021-01-08  9:38           ` Hinko Kocevar
2021-01-11 13:39             ` Hinko Kocevar
2021-01-11 16:37               ` Keith Busch
2021-01-11 20:02                 ` Hinko Kocevar
2021-01-11 22:09                   ` Keith Busch
     [not found]                     ` <ed8256dd-d70d-b8dc-fdc0-a78b9aa3bbd9@ess.eu>
2021-01-12 19:27                       ` Keith Busch
2021-01-12 22:19                         ` Hinko Kocevar
2021-01-12 23:17                           ` Keith Busch
2021-01-18  8:00                             ` Hinko Kocevar
2021-01-19 18:28                               ` Keith Busch
2021-02-03  0:03 ` Keith Busch
2021-02-04  8:35   ` Hinko Kocevar
2021-02-08 12:55 ` Hedi Berriche
2021-02-09 23:06 ` Bjorn Helgaas
2021-02-10  4:05   ` Keith Busch
2021-02-10 21:38     ` Bjorn Helgaas
2021-02-10  9:36 ` Yicong Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fe1defb66b5438f45093d67e05ef4153d0ae60eb.camel@intel.com \
    --to=dan.j.williams@intel.com \
    --cc=helgaas@kernel.org \
    --cc=hinko.kocevar@ess.eu \
    --cc=kbusch@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=sathyanarayanan.kuppuswamy@intel.com \
    --cc=sean.v.kelley@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.