From: Bjorn Helgaas <helgaas@kernel.org>
To: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Cc: kbusch@kernel.org, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
	linux-nvme@lists.infradead.org,
	"Khandelwal, Rajat" <rajat.khandelwal@intel.com>,
	Aleksander Trofimowicz <alex@n90.eu>,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [BUG] nvme-pci: NVMe probe fails with ENODEV
Date: Mon, 27 Mar 2023 17:37:50 -0500
Message-ID: <20230327223750.GA2860671@bhelgaas>
In-Reply-To: <975cc790-7dd9-4902-45c1-c69b4be9ba3a@linux.intel.com>

[+cc Aleksander, original report at
https://lore.kernel.org/r/975cc790-7dd9-4902-45c1-c69b4be9ba3a@linux.intel.com]

On Thu, Mar 09, 2023 at 07:34:18PM +0530, Rajat Khandelwal wrote:
> On 3/9/2023 7:31 PM, Rajat Khandelwal wrote:
> > Hi,
> > I am seeking some help regarding an issue I encounter sporadically
> > with Samsung Portable TBT SSD X5.
> > 
> > Right from the thunderbolt discovery to the PCIe enumeration, everything
> > is fine, until 'NVME_REG_CSTS' is tried to be read in 'nvme_reset_work'.
> > Precisely, 'readl(dev->bar + NVME_REG_CSTS)' fails.

> > I handle type-C, thunderbolt and USB4 on Chrome platforms, and currently
> > we are working on Intel Raptorlake systems.
> > This issue has been witnessed from ADL time-frame and now is seen
> > on RPL as well. I would really like to get to the bottom of the problem
> > and close the issue.
> > 
> > I have tried 5.10 and 6.1.15 kernels.

It's intermittent, but happens on both v5.10 and v6.1.15.  So we have
no reason to think this is a regression, right?

And you see it on ADL and RPL?  Do you see it on any other platforms?
Have you tried any others?

> > During the issue:
> > Contents of BAR-0: <garbage> 00000004 (dumped using setpci)
> > Contents of kernel PCI resource-0: 0x83000000 (matches with the mem allocation)
> > Issue: nvme nvme1: Removing after probe failure status: -19

How exactly did you use setpci and what was "<garbage>"?  Can you
include the entire transcript, e.g.,

  $ setpci -G -s 01:00.0 BASE_ADDRESS_0.L
  Trying method linux-sysfs......using /sys/bus/pci...OK
  Decided to use linux-sysfs
  ec000000

What does "lspci -vvxxx" show in this case?
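
If it is easier than setpci, the same dwords can be pulled out of the raw
config space that sysfs exposes at /sys/bus/pci/devices/<BDF>/config.  A
minimal sketch, assuming a type 0 header where BAR 0 sits at offset 0x10
and the next dword holds the upper half of a 64-bit BAR; the function
names are made up for illustration:

```python
import struct

BAR0_OFFSET = 0x10  # BASE_ADDRESS_0 in a type 0 config header


def read_bar0_dwords(config):
    """Return (low, high) dwords at BAR 0 / BAR 1 from raw config bytes."""
    # Config space is little-endian; two consecutive 32-bit registers.
    low, high = struct.unpack_from("<II", config, BAR0_OFFSET)
    return low, high


def read_config(bdf):
    """Read the first 64 bytes of config space for a device such as
    '0000:03:00.0' (hypothetical helper; needs a Linux host with sysfs)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(64)
```

Something like read_bar0_dwords(read_config("0000:03:00.0")) would then
give both halves of the BAR in one shot, which makes it clearer which
part of the output is the "<garbage>".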

I guess "kernel PCI resource-0: 0x83000000" means the following from
your dmesg log, right?

  pci 0000:03:00.0: BAR 0: assigned [mem 0x83000000-0x83003fff 64bit]
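
For reference, the low bits of a memory BAR encode its type, so the
working-case value 83000004 from the report lines up with that dmesg
line: address 0x83000000, 64-bit, non-prefetchable.  A small sketch of
the decoding (decode_mem_bar is a hypothetical helper, not an existing
tool):

```python
def decode_mem_bar(low):
    """Decode the low dword of a PCI memory BAR."""
    if low & 0x1:
        raise ValueError("bit 0 set: this is an I/O BAR, not a memory BAR")
    bar_type = (low >> 1) & 0x3  # 0b00 = 32-bit, 0b10 = 64-bit
    return {
        "address_low": low & 0xFFFFFFF0,  # bits 3:0 are flags, not address
        "is_64bit": bar_type == 0x2,
        "prefetchable": bool(low & 0x8),
    }
```

If the "00000004" in the failing case is the low dword, it decodes as a
64-bit memory BAR with a zero address, i.e., what an unprogrammed BAR
would look like.  That would be consistent with the BAR having been
cleared, but it doesn't prove it.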

I think the first access to the device should be here (same as what
Keith said):

  nvme_probe
    nvme_pci_enable
      pci_enable_device_mem
      pci_set_master
      readl(dev->bar + NVME_REG_CSTS)

But you mention nvme_reset_work() above.  How did you figure that out?

Maybe there's a race where we reset the device (which clears the BARs)
and do MMIO accesses before the BARs are restored.

Or maybe some PCI error happens and nvme_reset_work() is invoked as
part of recovery?  I see some *corrected* AER errors in your log, but
none look related to your NVMe device at 03:00.0.

Reading the BAR with setpci presumably happens in "slow user time", so
I assume what you see is the steady state of the BAR after nvme_probe()
fails with -19.
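
For what it's worth, -19 is -ENODEV.  As far as I can tell,
nvme_pci_enable() returns it when the CSTS read comes back as all ones,
which is what an MMIO read typically returns when no device claims the
address.  A rough Python model of that check, only to make the logic
explicit (csts_probe_status is a made-up name, not kernel code):

```python
import errno

ALL_ONES = 0xFFFFFFFF  # typical result of a 32-bit MMIO read nobody answers


def csts_probe_status(csts):
    """Return 0 if the CSTS value looks live, -ENODEV if it reads as all ones."""
    if (csts & 0xFFFFFFFF) == ALL_ONES:
        return -errno.ENODEV
    return 0
```

So a BAR that no longer decodes 0x83000000 would be enough to produce
this failure even though the kernel's resource and the ioremap are fine.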

> > During a working case:
> > Contents of BAR-0: 83000004 (dumped using setpci)
> > 
> > Seems like, the kernel PCIe resource contents don't change (which results in a
> > successful ioremap), but somehow the BAR-0 dumps garbage.
> > 
> > The logs for the scenario: (apologies if this is not the way to attach a log in
> > the mailing list as I have never done that :)).

> > ... (see original report at
> > https://lore.kernel.org/r/975cc790-7dd9-4902-45c1-c69b4be9ba3a@linux.intel.com)

Bjorn
